Site Reliability Engineer (SRE)

Overview

We're looking for a passionate and hands-on Site Reliability Engineer (SRE) to join our team. This role is critical for ensuring the stability, performance, and scalability of our production services. You'll be the bridge between development and operations, with a strong focus on using code to manage infrastructure and eliminate toil.

Key Responsibilities

Monitoring and Alerting: Design, implement, and maintain robust monitoring and alerting systems (e.g., GCP Monitoring, Prometheus, Grafana, distributed tracing, and log aggregation) to provide visibility into application performance and infrastructure health.
Infrastructure Management: Build, provision, and maintain our core infrastructure, with a strong emphasis on cloud environments and Kubernetes clusters.
Automation and Tooling: Write and maintain scripts and automation workflows (e.g., in Python, Bash, or TypeScript with Pulumi) to streamline deployment, scaling, and operational tasks, embracing the philosophy of "automating everything."
Incident Response: Provide hands-on, real-time incident response and participate in an on-call rotation to quickly mitigate service disruptions and restore functionality.
Production Debugging: Debug and troubleshoot complex production problems in depth across the entire stack, from network issues to application code defects.
Process Improvement: Conduct blameless post-mortems for major incidents and implement long-term solutions that prevent recurrence and continuously improve service reliability.

Qualifications

Proven experience as an SRE or DevOps Engineer, or in a similar role.
Expertise in managing and scaling Kubernetes in a production environment.
Strong proficiency in a scripting or programming language (e.g., Python, Go, Bash).
Deep understanding of monitoring, logging, and alerting best practices.
Solid experience with at least one major cloud provider (AWS, GCP, or Azure).
Experience with Infrastructure as Code (IaC) tools like Terraform or Pulumi is a plus.

What You'll Bring

A proactive, data-driven approach to reliability and a passion for managing complex systems at scale.

Compensation

The base pay range for this role is $50,000 – $60,000 per year.

Graphite - Site Reliability Engineer (SRE)
