Technical Operations Engineer

Medlytix LLC

Roswell, GA, USA, 30076 | Remote | 3 days ago

Benefits

Remote Work

Job Type

Full-time

Description

The Technical Operations Engineer is responsible for supporting the performance, reliability, and visibility of the Medlytix Production System. This role serves as a hybrid technical function combining production monitoring, telemetry analysis, workflow orchestration support, and automation engineering.

Working under the direction of the Director, this individual plays a critical role in maintaining operational health across distributed systems, data pipelines, and workflow orchestration environments. The position requires strong hands-on expertise with monitoring tools, telemetry platforms, cloud technologies, and data processing systems.

The Technical Operations Engineer evaluates system behavior, investigates production issues, supports and maintains monitoring of production systems, and drives automation and reliability improvements. The ideal candidate is highly proficient in relevant technical tools and platforms, with the ability to effectively monitor, analyze, and improve production systems.

This role also requires strong critical thinking and problem-solving skills to evaluate complex system behaviors, identify root causes, and implement effective solutions across interconnected systems.

Responsibilities:

  • Monitor systems, workflows, and data pipelines to ensure optimal performance, high data quality, and system reliability
  • Build and maintain monitoring dashboards, alerts, and observability frameworks using telemetry tools
  • Analyze workflow performance metrics (latency, failures) and identify trends or anomalies
  • Support workflow orchestration platforms (e.g., Airflow) to ensure successful job execution and dependency management
  • Troubleshoot workflow failures, data pipeline issues, and system disruptions across distributed environments
  • Perform root cause analysis using logs, telemetry data, and execution history, and provide actionable recommendations
  • Manage and respond to production incidents, including triage, escalation, and coordination with cross-functional teams
  • Ensure data quality and integrity by implementing validation checks and identifying anomalies early
  • Develop automation scripts and tools to reduce manual operational effort and improve efficiency
  • Identify opportunities to improve system reliability, fault tolerance, and operational scalability
  • Collaborate with Engineering, Product, and Data teams to resolve issues and enhance system performance
  • Communicate technical findings clearly and contribute to operational reporting and dashboards

Requirements:

  • Bachelor's degree in Computer Science, Information Systems, Engineering, Data Science, or related field, with 3+ years of experience in technical operations, data engineering, or business intelligence
  • Strong proficiency in SQL and experience with Python or scripting for troubleshooting, analysis, and automation
  • Hands-on experience with workflow orchestration tools (e.g., Airflow) and data pipelines
  • Familiarity with cloud platforms (AWS preferred) and monitoring/observability tools (e.g., Datadog, CloudWatch)
  • Proven ability to perform root cause analysis and troubleshoot complex issues across distributed systems
  • Strong critical thinking and problem-solving skills with the ability to quickly learn and apply new tools and technologies
  • Effective communication skills with the ability to translate technical findings into actionable insights
  • Exposure to ML/AI concepts, tools, or operational use cases is a plus

This job is found at InterviewStack.io

Skills

monitoring, automation, distributed systems, data pipelines, dashboards, observability, Airflow, scalability, SQL, Python, AWS, Datadog, CloudWatch, root cause analysis, data science, data quality, business intelligence