EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.

We are looking for a Senior Site Reliability Engineer to join our team and play a key role in ensuring the reliability, scalability, and performance of our systems. This position involves working across the entire service lifecycle, from design and deployment to monitoring and optimization. You will collaborate with global teams, tackle complex challenges, and implement automation strategies to improve system resilience and efficiency. Your expertise will be instrumental in maintaining the stability of critical systems and driving continuous improvement.

Responsibilities
Participate in and enhance the full lifecycle of services, including design, deployment, operation, and refinement
Analyze ITSM activities for the platform and provide feedback to development teams to address operational gaps and improve resiliency
Support services pre-launch through system design consultation, capacity planning, and launch reviews
Monitor live services by tracking availability, latency, and overall system health
Scale systems sustainably through automation and advocate for changes that enhance reliability and velocity
Lead application automation efforts to validate and promote software across environments while adhering to best practices
Practice incident response with a focus on sustainable solutions and conduct blameless postmortems
Take a proactive approach to problem-solving, connecting insights across the technology stack during production events to minimize recovery time
Collaborate with global teams across multiple regions and time zones to ensure consistent support and operations
Share expertise and provide mentorship to junior team members

Requirements
Bachelor’s degree in Computer Science, or a related technical field involving coding (e.g., physics or mathematics), or equivalent practical experience
At least three years of hands-on experience as a Site Reliability Engineer
Experience with technologies such as COBOL, JCL, VSAM, DB2, CICS, and MQ
Strong knowledge of algorithms, data structures, scripting, pipeline management, and software design
A systematic approach to problem-solving combined with excellent communication skills and a strong sense of ownership and drive
Proficiency in debugging and optimizing code, as well as automating routine tasks
Experience working with diverse stakeholders and handling urgent situations while making effective decisions
Interest and expertise in designing, analyzing, and troubleshooting large-scale distributed systems
English proficiency at a B2 level or higher, with strong verbal and written communication skills

Nice to have
Familiarity with cloud-native tools and platforms for enhancing system performance and scalability
Experience implementing observability solutions to monitor and optimize distributed systems
Knowledge of containerization and orchestration tools such as Docker and Kubernetes for managing application environments

We offer
International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn

Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Engineering, Information Technology, and Business Development
Industries: Software Development, IT Services and IT Consulting, and Banking