Testing, Quality & Reliability Topics

Quality assurance, testing methodologies, test automation, and reliability engineering. Includes QA frameworks, accessibility testing, quality metrics, and incident response from a reliability/engineering perspective. Covers testing strategies, risk-based testing, test case development, UAT, and quality transformations. Excludes operational incident management at scale (see 'Enterprise Operations & Incident Management').

Reliability, Observability, and Trade-offs

Focuses on designing for failure, identifying and mitigating single points of failure, defining monitoring and alerting strategies, and owning incident response and postmortem practices. Also covers observability and the metrics that enable operational visibility, and design trade-offs such as consistency versus availability and simplicity versus robustness. Interviewers will probe reasoning about operational practices and trade-off decision-making.
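
As a rough illustration of why mitigating a single point of failure matters, the sketch below compares the availability of components chained in series with that of redundant replicas. It assumes failures are independent, which real systems often violate. The function names and the 99% figures are hypothetical, chosen only for the example.

```python
import math


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return math.prod(availabilities)


def redundant_availability(availabilities):
    """Availability of replicas where any single one being up is enough.

    Assumes independent failures, which is an idealization.
    """
    return 1.0 - math.prod(1.0 - a for a in availabilities)


# Two hypothetical components, each available 99% of the time.
chained = serial_availability([0.99, 0.99])        # about 0.9801
replicated = redundant_availability([0.99, 0.99])  # about 0.9999
```

Chaining dependencies lowers availability, while replication raises it. That is one concrete way to frame the simplicity versus robustness trade-off, since each replica also adds cost and operational complexity.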

Monitoring, Logging, and Operational Visibility

Running systems need continuous visibility. Know the basic monitoring concepts: metrics (numerical measurements such as CPU usage, memory usage, and request count), logs (detailed event records), and alerts (notifications sent when issues occur). Know the main cloud monitoring tools: Amazon CloudWatch (AWS), Azure Monitor (Azure), and Google Cloud's operations suite, formerly Stackdriver (GCP). Understand what should be monitored: application health (uptime, error rates), infrastructure health (CPU, memory, disk), and security events (access logs, permission denials). Proper monitoring enables quick issue detection and troubleshooting. Be familiar with dashboard creation (visualizing metrics) and alert configuration (notifying on problems). Understand log aggregation: collecting logs from multiple sources for centralized analysis.
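
To make the link between metrics and alerts concrete, here is a minimal sketch of a threshold alert on an application error rate. It does not use any vendor API. The `WindowStats` shape, the function names, and the 5% threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts observed in one monitoring window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def error_rate(stats: WindowStats) -> float:
    """Return the fraction of failed requests; 0.0 when there was no traffic."""
    if stats.total_requests == 0:
        return 0.0
    return stats.failed_requests / stats.total_requests


def should_alert(stats: WindowStats, threshold: float = 0.05) -> bool:
    """Fire when the error rate exceeds the threshold (5% is an assumed value)."""
    return error_rate(stats) > threshold
```

For example, `should_alert(WindowStats(total_requests=1000, failed_requests=80))` fires because 8% exceeds the assumed 5% threshold, while a window with 10 failures out of 1000 does not. Managed monitoring tools typically express the same idea as an alarm or alert rule on a metric.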

Operational Mindset and Reliability

Evaluates a candidate's operational ownership of production systems and their approach to designing and operating for reliability. Topics include incident response and on-call practices; creating and using runbooks and playbooks; blameless postmortems and root cause analysis; monitoring and observability strategies, including metrics, logging, and distributed tracing; alerting and escalation policies; service level objectives (SLOs), service level agreements (SLAs), and error budgets; capacity planning and load testing; fault tolerance and graceful degradation patterns such as redundancy, replication, failover, retries, and backpressure; automation to reduce operational toil, including runbook automation and infrastructure as code; and continuous improvement driven by postmortem action items and testing. Candidates should be prepared to describe concrete examples of incident handling and improving service reliability, how they balance reliability against cost and time to market, and how they collaborate with site reliability engineering, operations, platform, and product teams to set and meet reliability targets.
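
The relationship between a service level objective and its error budget can be sketched with simple arithmetic. The example below assumes a request-based SLO. The function name and the numbers are hypothetical, and real teams may define budgets over time windows or other units instead.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget.

    slo_target is the required success ratio, for example 0.999.
    The result is 1.0 when nothing is spent and negative once overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic yet, so no budget has been consumed
    allowed_failures = (1.0 - slo_target) * total
    return 1.0 - failed / allowed_failures


# Hypothetical month: 1,000,000 requests, 250 failures, 99.9% SLO.
remaining = error_budget_remaining(0.999, total=1_000_000, failed=250)
```

With these assumed numbers the SLO tolerates roughly 1,000 failed requests, so about three quarters of the budget remains. A team might use that headroom to justify shipping faster, or slow down releases once the value approaches zero.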
