Site Reliability Engineer (SRE)
Cbtalents
Penang, Malaysia
1 month ago
Job Type
Full-time
Description
Overview
We are seeking an experienced Site Reliability Engineer (SRE) to join a dynamic technology team supporting large-scale infrastructure and AML systems. This role combines software engineering, systems engineering, automation, and operational excellence to ensure high availability, scalability, and reliability across critical platforms.
The ideal candidate is passionate about infrastructure automation, system performance, cloud-native technologies, and operational reliability in fast-paced environments.
Key Responsibilities
- Design, build, and maintain highly available, scalable, and fault-tolerant systems
- Collaborate closely with software engineering teams to improve system reliability and performance
- Develop and maintain automation tools and operational procedures to improve efficiency and reduce manual intervention
- Monitor infrastructure and application performance to proactively identify and resolve issues
- Implement and maintain monitoring, alerting, and observability solutions, and define and track SLIs, SLOs, and SLAs
- Participate in 24/7 on-call rotations, incident management, root-cause analysis, and blameless post-mortems
- Ensure infrastructure security, compliance, and operational best practices
- Support large-scale web traffic and machine learning data processing environments
Requirements
Technical Skills
- Proficiency in at least one programming language such as Python, Go, Java, or C++
- Strong scripting and automation skills
- Good understanding of Linux operating systems and network architecture
- Experience with Docker and Kubernetes
- Hands-on experience with monitoring tools such as Prometheus and Grafana
- Knowledge of relational databases and database modeling
Preferred Skills
- Exposure to machine learning frameworks such as TensorFlow, PyTorch, MXNet, or PaddlePaddle
- Strong analytical and problem-solving abilities
- Excellent communication and collaboration skills
- Ability to work effectively in a fast-paced and cross-functional environment
Qualifications
- Bachelor's or Master's degree in Computer Science, Information Technology, Computer Engineering, or a related field
- Minimum of 3 years of experience in Site Reliability Engineering, Systems Engineering, or Software Engineering
Why Join Us
- Opportunity to work on large-scale distributed systems and modern infrastructure technologies
- Exposure to cloud-native environments and advanced automation practices
- Collaborative and technology-driven working environment
- Career growth and continuous learning opportunities
- Competitive salary and benefits package
Skills
automation, scalability, monitoring, observability, machine learning, python, java, c++, linux, docker, kubernetes, prometheus, grafana, tensorflow, pytorch, distributed systems, incident management, root cause analysis, relational databases, systems engineering, site reliability engineering, high availability
About Cbtalents
Your international recruitment partner for hard-to-find candidates and jobs all over the globe.
recruitment, staffing