Site Reliability Engineer (AD Level)
Io Tech Solutions Limited
Hong Kong, Hong Kong SAR, Hong Kong
1 week ago
Job Type
Full-time
Description
Position Overview
We are seeking an experienced Support Analyst to take operational ownership of build and shared services, including monitoring, Site Reliability Engineering (SRE), and the stability and performance of critical systems.
Key Responsibilities
- Monitor and support SRE operations to ensure reliability, availability, and performance of production systems.
- Build, enhance, and maintain monitoring solutions using:
  - ITRS Geneos
  - Prometheus
  - VictoriaMetrics
  - Elasticsearch
  - Grafana
- Design and maintain alerting rules, dashboards, and observability pipelines.
- Troubleshoot Linux servers (RHEL 7/8/9), including:
  - upgrades, configuration changes, patching, and maintenance
  - assessing monitoring needs for system changes
- Perform log analysis and fault finding to identify and resolve performance exceptions.
- Collaborate with engineering, application, and infrastructure teams to improve resilience, stability, security, efficiency, and scalability.
- Participate in on-call rotations, including off-hours and weekend support.
- Support Disaster Recovery (DR) and Business Continuity Planning (BCP) drills.
- Stay current with modern monitoring/SRE tools and practices, and continuously drive improvements.
Requirements
- Bachelor's degree in Computer Science / Engineering.
- 8–10 years of IT experience, preferably within an investment bank or similar environment.
- Strong hands-on experience with monitoring and observability platforms, including:
  - ITRS Geneos
  - Prometheus
  - VictoriaMetrics
  - Elasticsearch
  - Grafana
  - Kibana
- Hands-on experience building and operating Prometheus pipelines, including:
  - exporters
  - scraping configurations
  - relabeling / metric routing
  - integrations with long-term storage (e.g., VictoriaMetrics)
- Experience building and maintaining Logstash pipelines, including:
  - ingestion, parsing, filtering, enrichment, and routing
  - log delivery into Elasticsearch
- Ability to design, build, and maintain Grafana and Kibana dashboards for metrics, logs, and performance analytics across distributed systems.
- Strong understanding of metrics, logs, alerting, dashboards, and observability pipelines.
- Strong Linux administration skills (RHEL 7/8/9), including troubleshooting, upgrades, patching, configuration, and performance optimization.
- Good understanding of SRE principles, including:
  - high availability and scalability
  - incident management
  - DR / BCP activities
- Automation experience is an advantage, e.g. Bash, Python, Ansible, and CI/CD tooling.
- Understanding of networking fundamentals, performance tuning, and troubleshooting distributed systems.
- Prior experience in Production Support / SRE / Monitoring Engineering / Shared Services Operations, including participation in on-call rotations (after-hours and weekends).
- Self-motivated, adaptable, able to prioritize, learn continuously, and manage multiple responsibilities.
- Fluent in English and Chinese.
Skills
monitoring, prometheus, elasticsearch, grafana, dashboards, observability, linux, scalability, kibana, logstash, analytics, automation, bash, python, ansible, ci/cd, incident management, performance optimization, site reliability engineering, disaster recovery, high availability, log analysis