Engineering Manager, Site Reliability Engineering (SRE)
Athenahealth
Description
Join us as we work to create a thriving ecosystem that delivers accessible, high-quality, and sustainable healthcare for all.
Position Summary: We are seeking a hands-on technical people leader as Engineering Manager, Site Reliability Engineering (SRE), to lead the Service Operations Site Reliability Engineering team in Chennai within the Cloud Infrastructure Engineering (CIE) division. This role is responsible for driving reliability, observability, automation, and operational readiness across systems supporting Service Operations. The ideal candidate brings deep expertise in Linux infrastructure, observability platforms, infrastructure automation, incident management, and engineering leadership. This individual will partner closely with global engineering and operations teams to reduce toil, improve service reliability, and deliver scalable, resilient solutions that support athenahealth's mission.
About the Team: The Service Operations Site Reliability Engineering team is part of the Network Operations Center (NOC) organization and sits within the Cloud Infrastructure Engineering (CIE) division. The team is responsible for delivering highly available SaaS infrastructure, operational tooling, observability solutions, and automation capabilities that support Service Operations and Cloud Infrastructure teams. Working closely with R&D and Infrastructure stakeholders across India and the United States, the team focuses on improving operational excellence through automation, standardized onboarding, actionable monitoring, and continuous reduction of operational toil.
Essential Job Responsibilities:
- Lead, coach, mentor, and develop a team of Site Reliability and Infrastructure Engineers based in India.
- Remain technically hands-on by reviewing designs, guiding implementation efforts, troubleshooting complex issues, and contributing to technical solutions when required.
- Own team delivery across infrastructure management, observability, service onboarding, alerting, automation, and operational readiness initiatives.
- Drive observability strategy across metrics, logs, traces, synthetic monitoring, health checks, dashboards, and actionable alerting frameworks.
- Manage provisioning and lifecycle management of physical and virtual Linux systems using tools such as Puppet, Ansible, Terraform, and related automation platforms.
- Partner with engineering teams operating within SaaS, hybrid cloud, Kubernetes, and Amazon EKS environments to ensure complete monitoring and operational coverage.
- Identify, measure, and reduce operational toil through automation, self-service capabilities, documentation, and scalable operational processes.
- Lead Agile delivery practices including sprint planning, backlog prioritization, stakeholder communication, and continuous improvement activities.
Additional Job Responsibilities:
- Build and enhance monitoring integrations across platforms including New Relic, Prometheus, Alertmanager, OpenSearch, Grafana, Icinga, Unified Assurance, and related technologies.
- Establish Infrastructure-as-Code (IaC), Configuration-as-Code, Monitoring-as-Code, and Alerting-as-Code standards and practices.
- Improve alert quality by ensuring alerts contain actionable context, ownership, severity levels, routing information, and runbook references.
- Partner with NOC and Service Operations teams to standardize service onboarding, escalation management, operational handoffs, and response workflows.
- Manage hiring, onboarding, performance management, feedback, career development, and technical growth of direct reports.
- Participate in incident response activities, escalation reviews, post-incident analysis, and on-call planning processes.
- Develop and report operational metrics including alert quality, automation coverage, service health, onboarding throughput, toil reduction, and reliability improvements.
- Ensure operational excellence through comprehensive documentation, SOPs, runbooks, architecture diagrams, and support procedures while collaborating effectively with global teams.
Expected Education & Experience:
- Bachelor's degree in Computer Science, Information Technology, Engineering, or a related technical discipline; equivalent experience will also be considered.
- 10+ years of experience in Infrastructure Engineering, Site Reliability Engineering, Systems Engineering, Platform Engineering, or Technical Operations.
- 2+ years of experience managing or formally leading technical engineering teams.
- Strong hands-on experience administering, provisioning, and operating Linux systems in large-scale production environments.
- Proven experience with observability platforms, monitoring, logging, tracing, dashboarding, alerting, and synthetic monitoring solutions.
- Experience with Infrastructure-as-Code and configuration management tools such as Terraform, Puppet, Ansible, Chef, or similar technologies, along with scripting in Python, Go, Bash, Ruby, Java, or related languages.
- Experience supporting SaaS, hybrid cloud, Kubernetes/EKS environments, CI/CD pipelines, incident management, operational readiness, and modern engineering practices, with strong communication and stakeholder management skills.
About athenahealth
Our vision: In an industry that becomes more complex by the day, we stand for simplicity. We offer IT solutions and expert services that eliminate the daily hurdles preventing healthcare providers from focusing entirely on their patients — powered by our vision to create a thriving ecosystem that delivers accessible, high-quality, and sustainable healthcare for all.
Our company culture: Our talented employees — or athenistas, as we call ourselves — spark the innovation and passion needed to accomplish our vision. We are a diverse group of dreamers and do-ers with unique knowledge, expertise, backgrounds, and perspectives. We unite as mission-driven problem-solvers with a deep desire to achieve our vision and make our time here count. Our award-winning culture is built around shared values of inclusiveness, accountability, and support.
Our DEI commitment: Our vision of accessible, high-quality, and sustainable healthcare for all requires addressing the inequities that stand in the way. That's one reason we prioritize diversity, equity, and inclusion in every aspect of our business, from attracting and sustaining a diverse workforce to maintaining an inclusive environment for athenistas, our partners, customers, and the communities where we work and serve.
What we can do for you:
Along with health and financial benefits, athenistas enjoy perks specific to each location, including commuter support, employee assistance programs, tuition assistance, employee resource groups, and collaborative workspaces — some offices even welcome dogs.
We also encourage a better work-life balance for athenistas with our flexibility. While we know in-office collaboration is critical to our vision, we recognize that not all work needs to be done within an office environment, full-time. With consistent communication and digital collaboration tools, athenahealth enables employees to find a balance that feels fulfilling and productive for each individual situation.
In addition to our traditional benefits and perks, we sponsor events throughout the year, including book clubs, external speakers, and hackathons. We provide athenistas with a company culture based on learning, the support of an engaged team, and an inclusive environment where all employees are valued.
Learn more about our culture and benefits here: athenahealth.com/careers
About Athenahealth
Athenahealth is a healthcare technology company that provides network-enabled services for healthcare providers including medical billing, practice management, and electronic health records. It aims to improve clinical and financial outcomes for healthcare organizations through cloud-based solutions.