EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.

We are seeking a highly skilled Site Reliability Engineer/Architect (SRE) to join our innovative and fast-paced team. In this role, you will be responsible for architecting and implementing state-of-the-art SRE practices to ensure the reliability and scalability of our enterprise-grade Generative AI (GenAI) integration platform. You will play a critical role in driving operational excellence by adopting cutting-edge methodologies and tools while collaborating with key stakeholders across technical and business units.

Responsibilities
- Define and implement Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to establish reliability standards and ensure systemic health
- Architect and maintain resilient production systems that leverage canary deployments, shadow traffic, and testing-in-production methodologies
- Develop strategies for incident management and automate on-call operations to reduce downtime and improve system stability
- Create and enhance observability frameworks, including logging, tracing, and monitoring, for real-time system visibility and proactive troubleshooting
- Automate scalability, performance optimization, and routine operational tasks to improve efficiency and reliability
- Lead collaboration sessions with engineering teams to embed SRE principles in system design and development
- Provide strategic leadership in implementing site reliability solutions across multi-cloud, multi-tenant environments for enterprise customers
- Act as a trusted advisor to executive stakeholders, offering insights and recommendations to align SRE strategies with business and technical objectives
- Champion a culture of innovation and operational reliability by mentoring teams and driving adoption of industry-leading best practices
- Partner with architecture and DevOps teams to ensure the platform's infrastructure supports high availability and scalability
- Advocate for continuous improvement in operational processes while identifying opportunities for innovation and optimization

Requirements
- 8+ years of professional experience in SRE, DevOps, or related fields, including direct involvement with production systems
- Expertise in SRE methodologies such as SLOs, SLIs, canary testing, and incident management
- Proficiency with cloud technologies including AWS, Google Cloud Platform, or Azure, with hands-on experience in multi-cloud environments
- Background in observability tools such as Prometheus, Grafana, or ELK Stack, coupled with monitoring practices for distributed systems
- Skills in automation platforms like Terraform, Ansible, or Kubernetes, driving infrastructure-as-code practices
- Familiarity with programming languages for automation solutions, such as Python, Go, or Bash
- Strong understanding of CI/CD pipelines, containerization technologies, and orchestration frameworks
- Competency in architecting systems for fault tolerance, redundancy, and performance optimization
- Showcase of effective collaboration with stakeholders from technical teams to executive-level managers
- Background in handling enterprise-scale systems and multi-tenant platform deployments

Nice to have
- Knowledge of Generative AI technologies and frameworks, including their integration processes
- Understanding of managed database services such as Amazon RDS, Google Spanner, or Azure SQL
- Familiarity with security best practices specific to multi-cloud environments and enterprise platforms
- Experience influencing technical roadmaps for large-scale distributed systems
- Capability to lead initiatives around Chaos Engineering or disaster recovery strategies

We offer
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn

Seniority level: Associate
Employment type: Full-time
Job function: Information Technology, Engineering, and Business Development
Industries: Software Development, IT Services and IT Consulting, and Media and Telecommunications