Site Reliability Engineer - I
Calix, Inc.
Description
We’re at the forefront of a once-in-a-generation change in the broadband industry. Join us as we innovate, help our customers reach their potential, and connect underserved communities with unrivaled digital experiences.
The Site Reliability Engineer I (SRE I) ensures our production services remain available, scalable, and efficient on Google Cloud Platform. You will bridge the gap between development and operations by applying software engineering mindsets to containerized infrastructure challenges. This entry-level role focuses on GitOps application deployments via ArgoCD, alert triage, Grafana Labs observability, and leveraging AIOps platforms, backed by strong OS and networking knowledge.
Key Responsibilities:
- GitOps & Deployments: Deploy, roll back, and manage the lifecycle of containerized applications using ArgoCD pipeline workflows.
- Alert Triage & Resolution: Act as the first line of defense to investigate, troubleshoot, and resolve infrastructure, OS-level, application, and network alerts.
- Network Troubleshooting: Diagnose connectivity and latency issues across all network layers, isolating problems between cloud VPCs and Kubernetes overlays.
- OS Troubleshooting: Diagnose operating-system bottlenecks, including CPU throttling, memory leaks, and storage constraints, on GKE worker nodes.
- AIOps Utilization: Use AI-driven operations tools to interpret correlated events, anomaly detections, and automated root-cause insights.
- Incident Response: Participate in on-call rotations, using Grafana dashboards and AIOps suggestions to quickly mitigate production container issues.
Required Technical Skills:
- CI/CD & GitOps: Practical experience deploying and managing applications using ArgoCD and Git version control systems.
- Networking: Comprehensive knowledge of all layers of networking (OSI model), with practical troubleshooting skills in:
  - Layer 7 (Application): HTTP/S, DNS, gRPC, and SSL/TLS handshakes.
  - Layer 4 (Transport): TCP/UDP mechanics, three-way handshakes, and port allocation.
  - Layer 3 (Network): IP routing, CIDR blocks, subnetting, and ICMP.
  - Kubernetes Networking: Pod-to-Pod communication, Services, Ingress controllers, and CNI plugins.
- Operating Systems: Strong foundational knowledge of Linux (Ubuntu, Debian, or Container-Optimized OS) internals, process management, file systems, and kernel parameters.
- System & Network Utilities: Proficiency with command-line diagnostic tools (e.g., tcpdump, curl, dig, traceroute, top, iostat).
- AIOps Concepts: Familiarity with AI-driven operations, event correlation, anomaly detection, and automated noise reduction.
- Observability: Foundational experience with the Grafana Labs ecosystem (Grafana, Mimir/Prometheus, Loki, Tempo).
- Orchestration: Functional knowledge of deploying, scaling, and managing workloads in GKE.
- ML/AIOps (nice to have): Anomaly detection concepts and log-based ML models.
- Scripting: Demonstrated experience writing automation scripts in Python, Bash, or Go.
Soft Skills & Qualifications:
- Problem Solving: Ability to systematically troubleshoot complex microservice and OS-level issues under pressure.
- Urgency & Prioritization: Strong sense of ownership and ability to prioritize alerts based on business impact.
- Analytical Mindset: Comfort working alongside data-driven, automated recommendations to solve infrastructure problems.
- Communication: Clear written and verbal communication during high-stress incident responses.
Location:
- India (flexible hybrid work model: work from the Bangalore office 20 days per quarter)
About Calix, Inc.
Calix is an AI platform company that enables service providers to transform their operations and deliver differentiated subscriber experiences. The company provides cloud and software platforms, systems, and services for communications service providers (CSPs) globally.