Benefits

Remote Work

Job Type

Full-time

Description

The Calix platform enables Communication Service Providers (CSPs) of all sizes to transform and future-proof their businesses. Through real-time data, automation, and actionable insights delivered via Calix One — our cloud-first, AI-powered platform — CSPs can simplify operations, collapse cost, and accelerate innovation. Calix One brings together the automation of everything and the experience of one, empowering customers to deliver differentiated subscriber experiences while driving acquisition, loyalty, and revenue growth. This is the Calix mission: to enable CSPs of all sizes to simplify, innovate, and grow, strengthening both their businesses and the communities they serve.
We’re at the forefront of a once-in-a-generation change in the broadband industry. Join us as we innovate, help our customers reach their potential, and connect underserved communities with unrivaled digital experiences.

The Site Reliability Engineer I (SRE I) ensures our production services remain available, scalable, and efficient on Google Cloud Platform. You will bridge the gap between development and operations by applying a software engineering mindset to containerized infrastructure challenges. This entry-level role focuses on GitOps application deployments via ArgoCD, alert triage, observability with the Grafana Labs stack, and use of AIOps platforms, backed by strong OS and networking knowledge.

Key Responsibilities:

GitOps & Deployments: Deploy, roll back, and manage the lifecycle of containerized applications using ArgoCD pipeline workflows.
Alert Triage & Resolution: Act as the first line of defense to investigate, troubleshoot, and resolve infrastructure, OS-level, application, and network alerts.
Network Troubleshooting: Diagnose connectivity and latency issues across all network layers, isolating problems between cloud VPCs and Kubernetes overlays.
OS Troubleshooting: Diagnose operating-system-level bottlenecks on GKE worker nodes, including CPU throttling, memory leaks, and storage constraints.
AIOps Utilization: Use AI-driven operations tools to interpret correlated events, anomaly detections, and automated root-cause insights.
Incident Response: Participate in on-call rotations, using Grafana dashboards and AIOps suggestions to quickly mitigate production container issues.

Required Technical Skills:

CI/CD & GitOps: Practical experience deploying and managing applications using ArgoCD and Git version control systems.
Networking: Comprehensive knowledge of all layers of networking (OSI model), with practical troubleshooting skills in:
Layer 7 (Application): HTTP/S, DNS, gRPC, and SSL/TLS handshakes.
Layer 4 (Transport): TCP/UDP mechanics, three-way handshakes, and port allocation.
Layer 3 (Network): IP routing, CIDR blocks, subnetting, and ICMP.
Kubernetes Networking: Understanding of Pod-to-Pod communication, Services, Ingress controllers, and CNI plugins.
Operating Systems: Strong foundational knowledge of Linux (Ubuntu, Debian, or Container-Optimized OS) internals, process management, file systems, and kernel parameters.
System & Network Utilities: Proficiency with command-line diagnostic tools (e.g., tcpdump, curl, dig, traceroute, top, iostat).
AIOps Concepts: Familiarity with AI-driven operations, event correlation, anomaly detection, and automated noise reduction.
Observability: Foundational experience with the Grafana Labs ecosystem (Grafana, Mimir/Prometheus, Loki, Tempo).
Orchestration: Functional knowledge of deploying, scaling, and managing workloads in GKE.
ML/AIOps (nice to have): Anomaly detection concepts and log-based ML models.
Scripting: Demonstrated experience writing automation scripts in Python, Bash, or Go.

Soft Skills & Qualifications:

Problem Solving: Ability to systematically troubleshoot complex microservice and OS-level issues under pressure.
Urgency & Prioritization: Strong sense of ownership and ability to prioritize alerts based on business impact.
Analytical Mindset: Comfort working alongside data-driven, automated recommendations to solve infrastructure problems.
Communication: Clear written and verbal communication during high-stress incident responses.

Location:

India (flexible hybrid work model: work from the Bangalore office for 20 days per quarter)


Skills

automation, gitops, argocd, grafana, observability, kubernetes, dashboards, ci/cd, git, dns, grpc, ssl, tls, subnetting, linux, prometheus, python, bash, incident response

About Calix, Inc.

Calix is an AI platform company that enables service providers to transform their operations and deliver differentiated subscriber experiences. The company provides cloud and software platforms, systems, and services for communications service providers (CSPs) globally.

telecom, cloud_infrastructure, public
