About the job:

Red Hat is looking for Site Reliability Engineers (SREs) to join the Infrastructure Customer Engineering team (R&D production development), part of the Red Hat Cloud Telco Team. As a Cloud Infrastructure SRE, you will join a special R&D task force dedicated to preventing and solving the most critical and strategic customer issues encountered in the field. Our sizeable team of experienced engineers from all corners of the globe has a deep understanding of the lower layers of private cloud infrastructure. Together, we work on practical solutions, always learning and adapting. We collaborate well, sharing knowledge and tackling challenges head-on. If you're an engineer keen on understanding the core of cloud, and you're looking for a team that values hands-on expertise, you'll fit right in with us.

What you will do:

- Maintain and Enhance Private Cloud Solutions: You’ll be deeply involved in private clouds based on Kubernetes and OpenStack.
- Deep-Dive Troubleshooting: We take pride in our ability to solve complex problems. Starting with high-level K8s error messages, you’ll traverse layers until pinpointing a kernel bug becomes second nature.
- Automate and Develop: We spend 30-50% of our time on development, automating the fixes for issues we have solved in the field. Solve it once, automate for the future. Use your Python skills to automate health checks and prevent recurring issues.
- Learn: In our rapidly evolving tech landscape, continuous learning is a cornerstone.
- Collaborate with Developers and Other Engineers: Work closely with our developers and product engineers to bridge the gap between infrastructure and software, ensuring seamless product delivery.
- Provide On-call Duties: Participate in regular on-call rotations, averaging around 4 days a month, to ensure business continuity for our customers in emergencies.

What you will bring:

- Linux Expertise: Familiarity with any Linux distribution is crucial.
However, we predominantly use Red Hat and CentOS in our products.
- Networking Foundations: A solid grasp of computer networking fundamentals, such as VLANs and IP routing, is a must. Familiarity with advanced tools and technologies such as Calico, Multus, and Open vSwitch is an advantage.
- Problem-Solving Mindset: A sharp troubleshooting skill set combined with an analytical mindset to dissect and address complex challenges.
- Scripting and Automation: Proficiency in scripting with Bash and Python, or the willingness to learn and adapt, as well as familiarity with Ansible.
- Containerization & Virtualization: Knowledge of technologies such as Podman, Kubernetes, Helm, OpenStack, and KVM/QEMU is a significant advantage.
- Storage Systems: Proficiency with storage solutions such as Ceph and Rook will stand you in good stead.
- Database Expertise: Understanding of relational databases such as MySQL and MariaDB, as well as experience with etcd, is beneficial.
- Monitoring and Logging: Experience with monitoring and logging tools like ELK (Elasticsearch, Logstash, Kibana), Prometheus, and Grafana will be invaluable in this role.
- Fluent English is mandatory.