Lead Software Engineer, Cloud Site Reliability (SRE)

Icertis

Pune, Maharashtra, India
1 month ago

Job Type

Full-time

Description

Role Responsibilities:

  • Lead 24x7 NOC operations with mandatory rotational shifts ensuring system availability and SLA adherence

  • Act as Major Incident Manager (P1/P2 incidents), driving triage, war room coordination, and stakeholder communication

  • Implement and enhance observability practices across logs, metrics, and traces

  • Work with tools like Datadog and Azure Monitor for monitoring and alerting

  • Drive proactive monitoring, alert tuning, anomaly detection, and AIOps initiatives

  • Manage Azure infrastructure and AKS clusters, including troubleshooting, scaling, and performance tuning

  • Build automation and self-healing workflows using Terraform, ARM, Helm, Power Automate, and scripting

  • Collaborate with engineering teams to improve reliability, deployment pipelines, and cloud-native architecture

  • Develop dashboards and reports using Power BI and ServiceNow

  • Handle monthly business reviews and leadership reporting

  • Mentor team members and drive process standardization and operational excellence

Required Skills:

  • 7–12 years of experience in CloudOps / SRE / NOC environments (24x7 operations)

  • Strong expertise in Azure Infrastructure (VMs, Networking, Storage)

  • Hands-on experience with Azure Kubernetes Service (AKS), Kubernetes, Docker

  • Strong experience with monitoring and observability tools (Datadog, Azure Monitor, Prometheus, Grafana)

  • Proven experience in incident management, major incident handling, and monthly reporting

  • Experience with Infrastructure as Code (Terraform, ARM templates, Helm)

  • Scripting skills in PowerShell, Python, or Bash

  • Experience with ServiceNow (Incident, Problem, Change modules and dashboards)

  • Strong reporting and analytics experience using Power BI and exposure to tools like Power Automate

  • Good understanding of distributed systems and cloud-native architecture

  • Excellent communication, leadership, and problem-solving skills

Preferred Skills:

  • Experience in multi-cloud environments (AWS/GCP)

  • Exposure to AIOps / predictive monitoring / self-healing systems

  • Azure / Kubernetes certifications

Skills

observability, datadog, azure, monitoring, automation, terraform, helm, dashboards, power bi

About Icertis

The Icertis platform delivers an enterprise-wide contract intelligence layer that understands business and industry context, connecting agreements, data, and systems to drive the future of autonomous contracting.

software, SaaS