Senior Site Reliability Engineer

Jolera

Colombo, Western Province, Sri Lanka

Posted 1 week ago

Job Type

Full-time

Description

Job Purpose

Lead a team of site reliability engineers who deliver incident detection, triage, and runbook-based remediation for production cloud-native environments in support of our North American customers. Set the operational standard for triage and recovery, act as the senior escalation point, and serve as the primary technical liaison to the Service Delivery Manager.

Key Responsibilities

• Lead incident detection, triage, and first response across production cloud and Kubernetes environments in support of our North American customers.

• Execute and oversee approved runbooks for service restoration — workload and node restarts, scaling, rollbacks, and database stabilization — within agreed operational boundaries.

• Act as the senior escalation authority; prepare clear escalation summaries covering impact, actions taken, current state, and recommended next steps.

• Author, review, and maintain operational runbooks; continuously improve detection, alerting, and automation.

• Engage cloud-provider support (AWS, GCP) for platform-level failures and vendor escalations.

• Technically supervise and mentor the SRE team; review handoffs and ensure consistency across shifts.

• Own daily shift handoffs and contribute to monthly service reporting and reviews.

People Management

• Provides technical leadership and day-to-day supervision of the SRE team.

• Contributes to coaching, performance input, and skills development; formal line management sits with the Service Delivery Manager.

Financial Responsibility

• Accountable for protecting service levels and cost-to-serve through efficient, automation-first operations.

Key Performance Indicators (KPIs)

• Service-level (SLO/SLA) attainment

• Mean time to acknowledge / mean time to resolve

• Runbook coverage and quality

• Escalation accuracy and completeness

• Shift-handoff quality and reporting timeliness

• Repeat-incident reduction and automation adoption

Requirements

Education & Certifications

• Bachelor’s degree in Computer Science, Engineering, or equivalent experience.

• Certified Kubernetes Administrator (CKA) required; CKAD, AWS, and Google Cloud certifications strongly preferred.

Experience

• 7+ years in SRE, DevOps, or production infrastructure operations, including 3+ years operating Kubernetes in production.

• Proven track record leading incident response for production cloud workloads.

• Managed-services / MSP or 24×7 operations experience preferred.

Skills & Competencies

Technical Skills

• Kubernetes operations across AWS EKS and GCP GKE

• AWS and GCP core services (compute, storage, networking, scaling, IAM)

• Relational database operational recovery (e.g., PostgreSQL)

• Observability platforms (e.g., Datadog)

• Scripting and automation (Bash, Python, Go or equivalent); read-level Terraform/IaC

• Incident command and structured troubleshooting

Soft Skills

• Calm, decisive incident leadership under pressure

• Clear written and verbal English

• Mentoring and team collaboration

• Time management

Tools / Software

• Datadog

• Jira / ServiceNow

• Confluence / GitHub Wiki

• AWS & GCP consoles

• Slack / Microsoft Teams

Benefits

What We Offer

  • Competitive compensation package
  • Competitive benefits package
  • Company perks, Good Life gym, and various brand discounts
  • Company events, recognitions, and celebrations
  • Career development and growth opportunities

This job is found at InterviewStack.io

Skills

kubernetes, node.js, aws, gcp, automation, eks, iam, postgresql, observability, datadog, bash, python, terraform, infrastructure as code, jira, incident response, people management

About Jolera

Jolera is a Global Systems Integrator (GSI) dedicated to transforming IT operations into secure, efficient environments. With a diverse team of over 500 professionals across 24 countries, we combine global reach with localized expertise.

IT services, cybersecurity