Comprehensive planning and execution for migrating applications, data, and infrastructure from on-premises environments to cloud platforms. Candidates should be able to assess existing application architecture, infrastructure, data flows, dependencies, performance, and operational practices; prioritize workloads based on technical characteristics and business value; and select an appropriate migration approach: rehost (lift and shift), replatform, refactor or rearchitect for cloud native, repurchase (move to software as a service), retire, or retain. Evaluation should include the trade-offs of each approach with respect to total cost of ownership, time to migrate, implementation effort, operational complexity, and long-term optimization.
Candidates should also plan phased migration execution, including discovery and dependency mapping, migration waves, cutover and rollback strategies, and data migration and synchronization techniques. Interviewers may probe planning for Domain Name System (DNS) updates, testing and validation, monitoring and operationalization after migration, security and compliance controls, and hybrid or coexistence patterns during the transition. Candidates should be familiar with assessment tools and migration services, methods to estimate effort and risk, strategies for automation and continuous integration and continuous delivery (CI/CD) pipelines, and the training and organizational change management needed for a successful migration.
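Dependency mapping and migration waves can be illustrated with a small sketch. It assumes a dependencies-first policy (a workload moves only after everything it depends on has moved), which is one possible ordering rather than a rule; the workload names in the inventory are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def plan_waves(dependencies):
    """Group workloads into waves so each one moves after its dependencies.

    `dependencies` maps a workload name to the set of workloads it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies already planned.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical inventory produced by a discovery exercise.
deps = {
    "web": {"app"},
    "app": {"mysql", "file-share"},
    "reporting": {"mysql"},
    "mysql": set(),
    "file-share": set(),
}
print(plan_waves(deps))
# [['file-share', 'mysql'], ['app', 'reporting'], ['web']]
```

A circular dependency surfaces as an exception at planning time, which is usually a signal that the workloads involved need to move together in one wave.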
Medium | System Design
21 practiced
You must migrate a 3-tier web application (web tier behind a load balancer, application servers with sticky sessions, and a MySQL database; file uploads stored on a network share) that currently handles ~1,000 RPS and 500GB of DB data. As the Cloud Architect, provide a migration plan: choose the migration approaches per tier (rehost/replatform/refactor), outline sequence of steps, session/state management strategy, database migration technique, file storage strategy, test and rollback plan, and an estimate of minimal downtime or zero-downtime approach.
Sample Answer
**Migration approach (per tier)**
- Web/load balancer: Rehost to a cloud load balancer (ELB/ALB) with autoscaling; minimal change.
- App servers: Replatform. Remove sticky sessions, then containerize (ECS/EKS) or run on managed Auto Scaling groups.
- DB: Replatform/refactor. Migrate MySQL to managed RDS/Aurora with minimal schema changes.
- File share: Refactor. Move to object storage (S3) and serve via CDN.

**Sequence of steps**
1. Prepare cloud infrastructure: VPC, subnets, security groups, IAM, LB, ASG/ECS cluster, RDS subnet group.
2. Implement the centralized session/state store (see Session/state strategy) and deploy the modified app to staging.
3. Set up S3 buckets and a CDN, and adapt the app to read/write both the file share and S3.
4. Provision the target RDS instance and seed it with an initial full load (logical dump or snapshot).
5. Test end-to-end in staging; load-test and tune autoscaling.
6. Cutover: let replication catch up on the binlog stream, pause writes on the source, promote the target to primary, update DNS/LB, and finalize the file sync.
7. Decommission the on-prem environment on success.

**Session/state strategy**
- Replace sticky sessions with a stateless app plus Redis (ElastiCache) as the session store, or JWTs for small state.
- Store long-lived or session-related files in S3.

**DB migration technique**
- Use continuous replication: seed the RDS target, then replicate ongoing changes from on-prem with AWS DMS (CDC) or native binlog replication; validate schema and data; promote the target to primary at cutover.

**File strategy**
- Initial bulk copy to S3 (`aws s3 sync`), enable dual-write during the cutover window, then finalize and switch reads to S3 + CloudFront.

**Testing & rollback**
- Staged functional, load, failover, and DR tests.
- Backout plan: keep the on-prem app and DB as the writable system of record until final cutover, keep DNS TTLs low, and retain the ability to re-point LB/DNS to the original environment.

**Downtime estimate**
- Near-zero downtime is achievable with DMS CDC, dual-write, and low-TTL DNS; expect a brief planned write pause (seconds to minutes) to promote the target and flip the LB.
- Trade-off: less downtime costs more migration complexity.
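The dual-write file strategy can be sketched as a thin wrapper around two storage backends. This is a minimal illustration, not a production client: the class name `DualWriteStore` is hypothetical, and plain dictionaries stand in for the network share and the S3 bucket.

```python
class DualWriteStore:
    """Writes go to both backends; reads prefer the target and fall back to legacy."""

    def __init__(self, legacy, target):
        self.legacy = legacy  # stand-in for the on-prem network share
        self.target = target  # stand-in for the object store

    def put(self, key, data):
        # Write legacy first: it remains the system of record until cutover.
        self.legacy[key] = data
        self.target[key] = data

    def get(self, key):
        if key in self.target:
            return self.target[key]
        data = self.legacy[key]  # raises KeyError if the file exists nowhere
        self.target[key] = data  # lazy backfill of files the bulk copy missed
        return data

    def missing_in_target(self):
        """Keys still to be copied before reads can switch to the target only."""
        return sorted(set(self.legacy) - set(self.target))


share = {"old.png": b"legacy-bytes"}  # file uploaded before dual-write started
bucket = {}
store = DualWriteStore(share, bucket)
store.put("new.png", b"fresh-bytes")
print(store.missing_in_target())  # ['old.png']
store.get("old.png")
print(store.missing_in_target())  # []
```

An empty `missing_in_target()` result is one possible gate for switching reads to the object store, alongside the bulk-copy verification.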
Medium | Technical
27 practiced
A database migration to cloud failed during cutover and you need to perform a rollback while minimizing data loss and user impact. Describe a rollback strategy that accounts for: delta between cutover attempt and rollback, ongoing writes during rollback, necessary restoration steps, communications to users, and tests to validate that rollback succeeded and systems are consistent.
Sample Answer
**Situation & goal**
I would perform a controlled rollback to the source environment that minimizes data loss and user impact, using CDC (change data capture) and clear communications.

**1) Assess delta since cutover**
- Immediately capture the timeline: cutover start, failed step, and current time.
- Enable or confirm CDC logs on both the source and the attempted target to identify transactions committed on the source after cutover start and any writes that reached the target.
- Compute the delta: the set of source transactions not applied to the target, plus any target-only writes.

**2) Freeze or quiesce writes (if possible)**
- Use a short maintenance window: put the application into read-only mode or queue writes upstream (API gateway, feature flag) to prevent further divergence.
- If a full freeze is not possible, route writes to an append-only queue for replay.

**3) Ongoing writes handling**
- Continue CDC, capturing all new source writes into a staging replay stream.
- If the target accepted any writes, export them separately to reconcile or reapply to the source after validation.

**4) Restoration steps**
- Restore the source database to a consistent pre-cutover snapshot if it was modified; alternatively, keep the source live and apply the missing deltas from CDC back into the source if the target erroneously became canonical.
- Use transactional replays with idempotency checks and ordering guarantees; validate constraints in a staging environment first.
- Run integrity checks and verify FK/unique constraints during the apply.

**5) Communication plan**
- Send an immediate internal alert to stakeholders and SRE/DB teams with the expected maintenance window.
- Notify users with a short, clear status (read-only mode or degraded service), an ETA for full service, and follow-ups.
- Publish a post-rollback report with root cause and remediation plan.

**6) Validation tests**
- Automated checks: row counts, checksums (per-table hashes), high-value record spot checks, referential integrity, and application smoke tests (login, read, critical write path in dry-run).
- Compare pre-cutover snapshot hashes against post-rollback hashes.
- Run end-to-end functional tests and monitor metrics (error rates, latency) for an hour after reopening writes.

**7) Post-action**
- Preserve logs and CDC streams for forensics.
- Run a post-mortem, harden the cutover runbook (canary, traffic split, shorter windows), and consider blue/green or hot-standby for the next attempt.

This approach balances speed (minimizing the outage) with correctness (CDC-driven delta replay, idempotent restores) and clear user communication.
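The idempotent, ordered replay in the restoration step can be sketched as follows. This assumes each captured change carries a commit sequence number and a unique change ID; the `Change` type, its field names, and the in-memory `table` are illustrative stand-ins, not the format of any particular CDC tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    seq: int        # commit order taken from the change log
    change_id: str  # unique per change event, used for de-duplication
    key: str
    value: object   # None is treated as a delete


def replay(changes, table, applied_ids):
    """Apply changes in commit order, skipping any change already applied."""
    applied = 0
    for change in sorted(changes, key=lambda c: c.seq):
        if change.change_id in applied_ids:
            continue  # already applied: replaying must be a no-op
        if change.value is None:
            table.pop(change.key, None)
        else:
            table[change.key] = change.value
        applied_ids.add(change.change_id)
        applied += 1
    return applied


source = {"order:1": "paid"}
seen = set()
delta = [
    Change(2, "c2", "order:2", "new"),
    Change(1, "c1", "order:1", "refunded"),
    Change(3, "c3", "order:2", None),
]
print(replay(delta, source, seen))  # 3
print(replay(delta, source, seen))  # 0 (second run changes nothing)
print(source)                       # {'order:1': 'refunded'}
```

In a real database the applied-ID set would need to be persisted in the same transaction as the data change, otherwise a crash between the two steps breaks the idempotency guarantee.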
Medium | Technical
23 practiced
Detail data synchronization techniques to achieve near-zero downtime for a relational database migration: discuss logical replication, Change Data Capture (CDC), dual-write patterns, out-of-band reconciliation, cutover validation, and how to handle schema changes or incompatible features during synchronization.
Sample Answer
**Approach summary**
As a cloud architect, I design migrations that use continuous change capture and a staged cutover to get near-zero downtime while preserving consistency and allowing safe rollback.

**Key techniques**
- Logical replication / CDC
  - Use DB-native logical replication (Postgres logical replication, MySQL binlog replication) or CDC tools (Debezium, GoldenGate) to stream DML into the target in near real time.
  - Example: Debezium → Kafka → cloud data service, with idempotent consumers so that redelivered events are applied effectively once.
- Dual-write pattern
  - Temporarily write to both source and target from the application path. Prefer an idempotent API layer or service mesh sidecar to avoid divergence.
  - Limit the duration to a short window, because no single transaction spans both systems.
- Out-of-band reconciliation
  - Run periodic checksums (row counts, hashes of partition ranges) and targeted replays for gaps.
  - Use parallelized tools to compare partitions and rectify drift.
- Cutover validation
  - Shadow reads: route a percentage of reads to the target and compare responses.
  - Canary promotion: promote the target after it passes synthetic transaction suites and consistency checks, with the rollback path ready.

**Schema changes & incompatible features**
- Backward- and forward-compatible migrations: add columns as nullable, avoid renames, and use feature flags.
- Two-phase schema migration: deploy a compatible schema on both sides, migrate data, then remove legacy fields.
- For incompatible features (e.g., proprietary functions), provide a translation layer or run a compatibility service during the transition.

**Operational considerations**
- Monitor lag, throughput, and error rates; implement dead-letter queues and automated replay.
- Plan for network limits, throttling, and transactional ordering; ensure idempotency and use globally unique IDs.
- Document the rollback procedure, cutover checklist, and post-cutover reconciliation windows.
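Out-of-band reconciliation with partition checksums can be sketched as below. It assumes integer primary keys and modulo partitioning, and it takes rows as in-memory `(primary_key, value)` pairs; a real comparison would compute the hashes inside each database. The function names are hypothetical.

```python
import hashlib


def partition_checksums(rows, partitions=4):
    """Return {partition: (row_count, sha256_hex)} for (primary_key, value) rows."""
    buckets = {p: [] for p in range(partitions)}
    for pk, value in rows:
        buckets[pk % partitions].append((pk, value))
    result = {}
    for p, items in buckets.items():
        digest = hashlib.sha256()
        # Sort by primary key so both sides hash rows in the same order.
        for pk, value in sorted(items, key=lambda item: item[0]):
            digest.update(f"{pk}|{value}\n".encode("utf-8"))
        result[p] = (len(items), digest.hexdigest())
    return result


def drifted_partitions(source_rows, target_rows, partitions=4):
    """List partitions whose row count or hash differs between source and target."""
    src = partition_checksums(source_rows, partitions)
    tgt = partition_checksums(target_rows, partitions)
    return [p for p in range(partitions) if src[p] != tgt[p]]


source = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
target = [(1, "a"), (2, "b"), (3, "X"), (4, "d")]  # row 3 differs, row 5 is missing
print(drifted_partitions(source, target))  # [1, 3]
```

Only the drifted partitions need a row-level comparison and targeted replay, which keeps reconciliation cheap on large tables.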
Medium | System Design
23 practiced
Design a CI/CD pipeline for migration automation and infrastructure-as-code that supports: automated provisioning of landing zones, environment promotion (dev->stage->prod), integration tests for migrated services, controlled rollouts (canary/blue-green), and automated rollback on failure. Describe pipeline stages, tests, required artifacts, gating mechanisms, and how to keep infrastructure state and secrets secure.
Sample Answer
**Clarify scope & goals**
- Automated landing-zone provisioning, promotion dev→stage→prod, integration tests for migrated services, controlled rollouts (canary/blue-green), automated rollback, secure state and secrets.

**High-level pipeline stages**
1. Commit & Validate
   - Lint Terraform, run static checks (tflint, checkov), and run unit tests for IaC modules.
   - Artifacts: versioned Terraform modules, container images, migration scripts.
2. Build & Package
   - Build images, scan container images, and store artifacts in a registry/artifact repo with immutable tags.
3. Deploy Landing Zone (one-off / infra repo)
   - Apply Terraform to create landing zone resources in the target account/tenant.
   - Gating: manual approval for prod landing zone creation.
4. Deploy Environment (Dev)
   - Run Terraform apply with one workspace per environment; run smoke tests.
5. Integration Tests
   - Run automated integration suites (API contract, end-to-end migrations) against dev. Artifacts: test reports, migration provenance.
   - Gating: tests must pass.
6. Promote → Stage
   - Promote automatically if tests pass; run more extensive performance and chaos tests.
   - Gating: automated, plus optional stakeholder approval.
7. Canary / Blue-Green Production Release
   - Deploy a canary subset (k% of traffic), or spin up a green environment and shift traffic incrementally.
   - Use observability and SLO-based gates (error rate, latency, business metrics).
8. Automated Rollback
   - If metrics breach thresholds, roll back automatically to the previous image/version or shift traffic back to blue.
   - Preserve migration rollback scripts and runbook automation.

**Tests & gating**
- Unit (IaC), smoke, integration, contract, performance, chaos.
- Gates: the pipeline enforces pass/fail, SLO-based automated gating, and RBAC-controlled manual approvals for prod changes.

**Artifacts**
- Immutable Terraform module packages, cloud-init/migration scripts, container images, test reports, and provenance metadata stored in an artifact store.

**State & secrets security**
- Keep remote Terraform state in an encrypted backend (e.g., S3 with KMS + DynamoDB locking) or Terraform Cloud.
- Control state access via IAM roles and least privilege; enable state locking.
- Keep secrets in a dedicated secret manager (AWS Secrets Manager/HashiCorp Vault/Azure Key Vault) with auto-rotation, and issue short-lived credentials via OIDC federation to GitHub Actions/GitLab runners.
- Run pipeline runners in private subnets; use ephemeral agents and assume-role patterns for least privilege.
- Keep audit logs (CloudTrail) and signed manifests for change traceability.

**Observability & recovery**
- Centralized metrics/logs, automated alerting, runbooks, and post-mortems. Version everything; allow fast reprovisioning of prior infrastructure via tagged commits.
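The SLO-based gate that drives canary promotion or automated rollback can be sketched as a pure decision function. The threshold values, the `SloThresholds` type, and the three-sample minimum are assumptions chosen for illustration; a real pipeline would read samples from its monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloThresholds:
    max_error_rate: float = 0.01        # hypothetical: 1% of requests
    max_p95_latency_ms: float = 500.0   # hypothetical latency budget


def canary_decision(samples, thresholds, min_samples=3):
    """Return 'rollback', 'hold', or 'promote'.

    `samples` is a list of (error_rate, p95_latency_ms) observations
    collected while the canary serves its share of traffic.
    """
    for error_rate, p95_latency_ms in samples:
        breached = (
            error_rate > thresholds.max_error_rate
            or p95_latency_ms > thresholds.max_p95_latency_ms
        )
        if breached:
            return "rollback"  # any breach shifts traffic back immediately
    if len(samples) < min_samples:
        return "hold"          # not enough evidence to promote yet
    return "promote"


slo = SloThresholds()
print(canary_decision([(0.001, 200.0)], slo))                                  # hold
print(canary_decision([(0.001, 200.0), (0.002, 210.0), (0.001, 190.0)], slo))  # promote
print(canary_decision([(0.001, 200.0), (0.05, 220.0)], slo))                   # rollback
```

Keeping the gate as a pure function makes it easy to unit-test in the Commit & Validate stage before it ever controls production traffic.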
Medium | Technical
26 practiced
Estimate effort and risk for refactoring a 10,000-line monolithic Java application into microservices using the strangler pattern. Describe the decomposition strategy, how you'd identify service boundaries, approach to data ownership and transactions, an effort estimation methodology (person-weeks, POCs), and how you'd mitigate operational risk during incremental refactor.
Sample Answer
**Situation & high-level approach**
I'd use the strangler pattern to incrementally extract functionality, keep the monolith running, and migrate traffic to new microservices behind an API gateway. Prioritize business-critical, loosely coupled domains first.

**Decomposition & identifying service boundaries**
- Domain-driven design: map bounded contexts from the domain and its events, using UML/use-case walkthroughs with SMEs.
- Static analysis + runtime tracing: call graph, module dependency matrix, and high-traffic REST/RPC paths.
- Criteria: high cohesion, low coupling, independent deployability, separate scaling needs, clear ownership.

**Data ownership & transactions**
- Move to service-owned data stores gradually. Start with read replicas or materialized views for consumers.
- Use sagas (with compensating actions) for distributed business transactions, and eventual consistency for non-critical paths.
- For strong consistency needs, keep a façade in the monolith until the entire transaction boundary has been migrated.

**Effort estimation methodology**
- Discovery: 3–4 weeks (architecture, telemetry, DDD workshops).
- PoCs: 2–3 small PoCs (API gateway + one service, DB migration pattern, saga orchestrator), each 2–3 person-weeks.
- Per service: 3–8 person-weeks on average, depending on complexity; for ~8–12 candidate services, that is roughly 24–96 person-weeks.
- Add cross-cutting work (CI/CD, monitoring, infrastructure as code) of ~8–12 person-weeks, plus a 25% planning buffer.
- Express estimates as person-weeks and milestones (Discovery, PoC, Pilot, Iterative rollout).

**Mitigating operational risk**
- Dark launches, traffic splitting, and canary deployments behind the gateway.
- Feature flags, circuit breakers, and centralized observability (distributed tracing, metrics, SLOs).
- Automated rollback, runbooks, and staged cutover per domain.
- Strong governance: API contract tests, backward compatibility, and a migration playbook.

This plan balances incremental risk reduction with measurable proofs (PoCs) and cloud-native operational controls appropriate for an enterprise-scale migration.
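The estimation arithmetic can be worked through with a short sketch. It multiplies the low ends together and the high ends together to bound the range, which is a simplifying assumption, and it leaves out discovery because that figure is given in calendar weeks without a team size. The function and parameter names are hypothetical.

```python
def estimate_person_weeks(services, per_service, pocs, per_poc,
                          cross_cutting, buffer=0.25):
    """Each range argument is a (low, high) pair; returns (low, high) person-weeks."""
    def total(i):
        base = (
            services[i] * per_service[i]  # service extraction
            + pocs[i] * per_poc[i]        # proofs of concept
            + cross_cutting[i]            # CI/CD, monitoring, IaC
        )
        return base * (1 + buffer)        # planning and contingency buffer
    return total(0), total(1)


# Service extraction alone: 8 * 3 = 24 to 12 * 8 = 96 person-weeks.
low, high = estimate_person_weeks(
    services=(8, 12), per_service=(3, 8),
    pocs=(2, 3), per_poc=(2, 3),
    cross_cutting=(8, 12),
)
print(low, high)  # 45.0 146.25
```

The width of the resulting range is itself a finding: it suggests running the PoCs and a pilot extraction before committing to a delivery date.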