InterviewStack.io

Graphite - Site Reliability Engineer (SRE)

RCTS Global

Guadalajara, Mexico
$50,000 - $60,000
8 months ago


Job Type

Full-time

Description

Site Reliability Engineer (SRE)

Overview

We're looking for a passionate and hands-on Site Reliability Engineer (SRE) to join our team. This role is critical for ensuring the stability, performance, and scalability of our production services. You'll be the bridge between development and operations, with a strong focus on using code to manage infrastructure and eliminate toil.

Key Responsibilities

  • Monitoring and Alerting: Design, implement, and maintain robust monitoring and alerting systems (e.g., GCP Monitoring, Prometheus, Grafana), together with tracing and logging, to provide visibility into application performance and infrastructure health.
  • Infrastructure Management: Build, provision, and maintain our core infrastructure, with a strong emphasis on Cloud environments and Kubernetes clusters.
  • Automation and Tooling: Write and maintain scripts and automation workflows (e.g., in Python, Bash, or TypeScript with Pulumi) to streamline deployment, scaling, and operational tasks, embracing the philosophy of "automating everything."
  • Incident Response: Provide hands-on, real-time incident response and participate in an on-call rotation to quickly mitigate service disruptions and restore functionality.
  • Production Debugging: Debug and troubleshoot complex production problems in depth across the entire stack, from network issues to application code defects.
  • Process Improvement: Conduct blameless post-mortems for major incidents, implementing long-term solutions to prevent recurrence and continuously improve service reliability.

Qualifications

  • Proven experience as an SRE, DevOps Engineer, or in a similar role.
  • Expertise in managing and scaling Kubernetes in a production environment.
  • Strong proficiency in a scripting or programming language (e.g., Python, Go, Bash).
  • Deep understanding of monitoring, logging, and alerting best practices.
  • Solid experience with at least one major Cloud provider (AWS, GCP, or Azure).
  • Experience with Infrastructure as Code (IaC) tools like Terraform or Pulumi is a plus.

What You'll Bring

A proactive, data-driven approach to reliability and a passion for managing complex systems at scale.

Compensation

The base pay range for this role is $50,000 – $60,000 per year.

This job is found at InterviewStack.io

Skills

scalability, monitoring, gcp, prometheus, grafana, kubernetes, automation, python, bash, typescript, pulumi, debugging, aws, azure, infrastructure as code, terraform, process improvement, infrastructure management, incident response