
Lead Support Analyst (Observability/SRE)

Io Tech Solutions Limited

Hong Kong, Hong Kong SAR, Hong Kong
Posted 1 week ago

Job Type

Full-time

Description

Lead Support Analyst (Observability / SRE)

We are seeking a senior Lead Support Analyst to join a Shared Services team responsible for monitoring, observability, and site reliability engineering (SRE) operations across critical systems.

This is a hands-on role focused on ensuring platform reliability, performance, and availability while also providing guidance to junior team members.

Key Responsibilities

  • Support monitoring, observability, and SRE operations for critical production systems
  • Build and maintain dashboards, alerts, and monitoring solutions using Grafana, Prometheus, Elasticsearch, and related tools
  • Troubleshoot Linux environments and investigate system performance issues
  • Collaborate with engineering, infrastructure, and application teams to improve system stability and resilience
  • Participate in incident management, on-call support, and continuous improvement initiatives
  • Mentor and support junior team members

Requirements

  • 8+ years of experience in SRE, Production Support, Monitoring Engineering, or related areas
  • Strong hands-on experience with Grafana and observability/monitoring platforms
  • Experience with Prometheus, Elasticsearch/Kibana, or similar technologies
  • Solid Linux administration and troubleshooting skills
  • Proficiency in Python or Golang, plus shell scripting
  • Strong understanding of system reliability, incident management, and monitoring best practices
  • Excellent communication skills in English

Preferred

  • Experience in banking, financial services, or large enterprise environments
  • Exposure to ITRS Geneos, VictoriaMetrics, Ansible, or CI/CD tools
  • Experience mentoring or guiding junior engineers

This job is found at InterviewStack.io

Skills

Go, observability, monitoring, dashboards, Grafana, Prometheus, Elasticsearch, Linux, Kibana, Python, Ansible, CI/CD, incident management, site reliability engineering