Cloud Migration Strategy and Planning Questions

This topic covers comprehensive planning and execution for migrating applications, data, and infrastructure from on-premises environments to cloud platforms. Candidates should be able to assess existing application architecture, infrastructure, data flows, dependencies, performance, and operational practices; prioritize workloads based on technical characteristics and business value; and select an appropriate migration approach for each workload: rehost (lift and shift), replatform, refactor or rearchitect for cloud native, repurchase (move to software as a service), retire, or retain. Evaluation should include the trade-offs of each approach with respect to total cost of ownership, time to migrate, implementation effort, operational complexity, and long-term optimization. Candidates should also plan phased migration execution, including discovery and dependency mapping, migration waves, cutover and rollback strategies, and data migration and synchronization techniques. Interviewers may probe planning for domain name system updates, testing and validation, monitoring and operationalization after migration, security and compliance controls, and hybrid or coexistence patterns during transition. Candidates should be familiar with assessment tools and migration services, methods to estimate effort and risk, strategies for automation and continuous integration and continuous delivery pipelines, and the training and organizational change management needed for a successful migration.

Medium · System Design

21 practiced

You must migrate a 3-tier web application (web tier behind a load balancer, application servers with sticky sessions, and a MySQL database; file uploads stored on a network share) that currently handles ~1,000 RPS and holds 500 GB of database data. As the Cloud Architect, provide a migration plan: choose the migration approach per tier (rehost/replatform/refactor), outline the sequence of steps, the session/state management strategy, the database migration technique, the file storage strategy, the test and rollback plan, and an estimate of the expected downtime for a minimal-downtime or zero-downtime approach.

Medium · Technical

27 practiced

A database migration to cloud failed during cutover and you need to perform a rollback while minimizing data loss and user impact. Describe a rollback strategy that accounts for: delta between cutover attempt and rollback, ongoing writes during rollback, necessary restoration steps, communications to users, and tests to validate that rollback succeeded and systems are consistent.

Sample Answer

**Situation & goal**

I would perform a controlled rollback to the source environment, minimizing data loss and user impact, using CDC (change data capture) and clear communications.

**1) Assess delta since cutover**

- Immediately capture the timeline: cutover start, failed step, and current time.
- Enable/confirm CDC logs on both source and attempted target to identify committed transactions on source after cutover start and any writes that reached target.
- Compute the delta: the set of source transactions not applied to target, and target-only writes (if any).

**2) Freeze or quiesce writes (if possible)**

- Short maintenance window: put the application into read-only mode or queue writes upstream (API gateway, feature flag) to prevent further divergence.
- If a full freeze is not possible, route writes to an append-only queue for replay.

**3) Ongoing writes handling**

- Continue CDC, capturing all new source writes into a staging replay stream.
- If the target accepted any writes, export them separately to reconcile or reapply to source after validation.

**4) Restoration steps**

- Restore the source database to a consistent pre-cutover snapshot if it was modified; alternatively, keep source live and apply missing deltas back from CDC into source if target became canonical erroneously.
- Use transactional replays with idempotency checks and ordering guarantees; validate constraints in a staging environment first.
- Run integrity checks and check FK/unique constraints during apply.
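The "idempotency checks and ordering guarantees" in the restoration steps can be illustrated with a minimal sketch. It assumes each captured change carries a monotonically increasing log position; the `ChangeEvent` type, its `lsn` field, and the in-memory `table` dict are hypothetical stand-ins for a real CDC stream and database, not any particular tool's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    lsn: int                      # log position; assumed strictly increasing per change
    key: str                      # primary key of the affected row
    op: str                       # "upsert" or "delete"
    value: Optional[dict] = None  # new row contents for an upsert


def replay(events, table, applied_lsn):
    """Apply events in log order, skipping any at or below the high-water mark.

    Returns the new high-water mark, so replaying the same batch twice
    leaves the table unchanged.
    """
    for ev in sorted(events, key=lambda e: e.lsn):
        if ev.lsn <= applied_lsn:
            continue  # already applied in an earlier run
        if ev.op == "upsert":
            table[ev.key] = ev.value
        elif ev.op == "delete":
            table.pop(ev.key, None)  # deleting a missing row is a no-op
        else:
            raise ValueError(f"unknown operation: {ev.op}")
        applied_lsn = ev.lsn
    return applied_lsn
```

In a real rollback the high-water mark would be stored in the same transaction as the applied changes, so that a crash mid-replay cannot apply an event without recording it.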

**5) Communication plan**

- Immediate internal alert to stakeholders and SRE/DB teams with the expected maintenance window.
- Notify users: short, clear status (read-only mode or degraded service), ETA for full service, and follow-ups.
- Post-rollback report with root cause and remediation plan.

**6) Validation tests**

- Automated checks: row counts, checksums (per-table hashes), high-value record spot checks, referential integrity, and application smoke tests (login, read, critical write path in dry-run).
- Compare pre-cutover snapshot hashes vs post-rollback.
- Run end-to-end functional tests and monitor metrics (error rates, latency) for an hour after reopening writes.
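The row-count and per-table hash checks above might look like the following sketch. It uses in-process SQLite connections purely so the example is self-contained; the function names are hypothetical, and a production MySQL check would more likely hash in chunks on the server side rather than stream every row to the client.

```python
import hashlib
import sqlite3


def table_checksum(conn, table, key_column):
    """Return (row_count, sha256 hex digest) over all rows, ordered by key.

    Table and column names are interpolated directly, so they must come
    from a trusted list, never from user input.
    """
    digest = hashlib.sha256()
    count = 0
    query = f'SELECT * FROM "{table}" ORDER BY "{key_column}"'
    for row in conn.execute(query):
        digest.update(repr(row).encode("utf-8"))
        count += 1
    return count, digest.hexdigest()


def mismatched_tables(source, target, tables):
    """tables maps table name -> key column; returns names whose checksums differ."""
    return [
        name
        for name, key in tables.items()
        if table_checksum(source, name, key) != table_checksum(target, name, key)
    ]
```

An empty result from `mismatched_tables` means counts and hashes agree for every listed table; any name returned is a candidate for row-level reconciliation.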

**7) Post-action**

- Preserve logs and CDC streams for forensics.
- Run a post-mortem, harden the cutover runbook (canary, traffic split, shorter windows), and consider blue/green or hot-standby for the next attempt.

This approach balances speed (minimize outage) with correctness (CDC-driven delta replay, idempotent restores) and clear user communication.

Medium · Technical

23 practiced

Detail data synchronization techniques to achieve near-zero downtime for a relational database migration: discuss logical replication, Change Data Capture (CDC), dual-write patterns, out-of-band reconciliation, cutover validation, and how to handle schema changes or incompatible features during synchronization.

Medium · System Design

23 practiced

Design a CI/CD pipeline for migration automation and infrastructure-as-code that supports: automated provisioning of landing zones, environment promotion (dev->stage->prod), integration tests for migrated services, controlled rollouts (canary/blue-green), and automated rollback on failure. Describe pipeline stages, tests, required artifacts, gating mechanisms, and how to keep infrastructure state and secrets secure.

Sample Answer

**Clarify scope & goals**

- Automated landing-zone provisioning, promotion dev→stage→prod, integration tests for migrated services, controlled rollouts (canary/blue-green), automated rollback, secure state & secrets.

**High-level pipeline stages**

1. Commit & Validate
   - Lint Terraform, static checks (tflint, checkov), unit tests for IaC modules.
   - Artifacts: versioned Terraform modules, container images, migration scripts.
2. Build & Package
   - Build images, run container image scans, store artifacts in a registry/artifact repo with immutable tags.
3. Deploy Landing Zone (one-off / infra repo)
   - Apply Terraform to create landing zone resources in the target account/tenant.
   - Gating: manual approval for prod landing zone creation.
4. Deploy Environment (Dev)
   - Terraform apply using a workspace per env; run smoke tests.
5. Integration Tests
   - Run automated integration suites (API contract, end-to-end migrations) against dev.
   - Artifacts: test reports, migration provenance.
   - Gating: tests must pass.
6. Promote → Stage
   - Automated promotion if tests pass; run more extensive performance / chaos tests.
   - Gating: automated + optional stakeholder approval.
7. Canary / Blue-Green Production Release
   - Deploy to a canary subset (k% traffic) or spin up a green environment and shift traffic incrementally.
   - Observability & SLO-based gates (error rate, latency, business metrics).
8. Automated Rollback
   - If metrics breach thresholds, roll back automatically to the previous image/version or shift traffic back to blue.
   - Preserve migration rollback scripts and runbook automation.
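The SLO-based gate in stages 7 and 8 can be sketched as a small decision function. The threshold values, the traffic steps, and every name below are illustrative assumptions; in practice the metrics would come from the monitoring system and the decision would drive the load balancer or service mesh.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloThresholds:
    # Illustrative defaults only; real values come from the service's SLOs.
    max_error_rate: float = 0.01
    max_p99_latency_ms: float = 500.0


def evaluate_canary(error_rate, p99_latency_ms, slo=SloThresholds()):
    """Return ("promote" | "rollback", list of breached metric names)."""
    breaches = []
    if error_rate > slo.max_error_rate:
        breaches.append("error_rate")
    if p99_latency_ms > slo.max_p99_latency_ms:
        breaches.append("p99_latency_ms")
    return ("rollback" if breaches else "promote"), breaches


def next_traffic_percent(current, decision, steps=(5, 25, 50, 100)):
    """Advance to the next traffic step on promote; drop to 0 on rollback."""
    if decision == "rollback":
        return 0
    for step in steps:
        if step > current:
            return step
    return 100  # already at full traffic
```

A pipeline would call `evaluate_canary` after each observation window and feed the decision into `next_traffic_percent`, so a single breached metric sends all traffic back to the previous version.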

**Tests & gating**

- Unit (IaC), smoke, integration, contract, performance, chaos.
- Gates: pipeline enforces pass/fail, SLO-based automated gating, RBAC manual approvals for prod changes.

**Artifacts**

- Immutable Terraform module packages, cloud-init/migration scripts, container images, test reports, provenance metadata stored in an artifact store.

**State & secrets security**

- Remote Terraform state in an encrypted backend (e.g., S3 with KMS + DynamoDB locking) or Terraform Cloud.
- State access controlled via IAM roles and least privilege; enable state locking.
- Secrets in a dedicated secret manager (AWS Secrets Manager/HashiCorp Vault/Azure Key Vault) with auto-rotation; short-lived credentials issued via OIDC to CI runners (GitHub Actions, GitLab).
- Pipeline runners in private subnets; use ephemeral agents and assume-role patterns for least privilege.
- Audit logs (CloudTrail) and signed manifests for change traceability.

**Observability & recovery**

- Centralized metrics/logs, automated alerting, runbooks, and post-mortems.
- Version everything; allow fast reprovisioning of prior infra via tagged commits.

Medium · Technical

26 practiced

Estimate effort and risk for refactoring a 10,000-line monolithic Java application into microservices using the strangler pattern. Describe the decomposition strategy, how you'd identify service boundaries, approach to data ownership and transactions, an effort estimation methodology (person-weeks, POCs), and how you'd mitigate operational risk during incremental refactor.

Sample Answer

**Situation & high-level approach**

I’d use the Strangler Pattern to incrementally extract functionality, keep the monolith running, and migrate traffic to new microservices behind an API gateway. Prioritize business-critical, loosely-coupled domains first.

**Decomposition & identifying service boundaries**

- Domain-driven design: map bounded contexts from domain/events, use UML/use-case walkthroughs with SMEs.
- Static analysis + runtime tracing: call graph, module dependency matrix, and high-traffic REST/RPC paths.
- Criteria: high cohesion, low coupling, independent deployability, separate scaling needs, clear ownership.

**Data ownership & transactions**

- Move to service-owned data stores gradually. Start with read replicas or materialized views for consumers.
- Use Sagas for distributed business transactions (or compensating actions) and eventual consistency for non-critical paths.
- For strong consistency needs, use a façade in the monolith until the entire transaction boundary is migrated.
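The Saga idea, with compensating actions undoing completed steps, can be shown in a minimal in-process sketch. The `run_saga` helper is hypothetical and deliberately simplified: a real orchestrator would persist saga state, retry transient failures, and handle the case where a compensation itself fails.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action raises, the compensations of all previously completed
    steps run in reverse order and the function returns False.
    Returns True when every action succeeds.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()  # compensations are assumed not to fail in this sketch
            return False
        completed.append(compensate)
    return True
```

For an order flow, the steps might be reserve inventory, charge payment, and schedule shipment; if shipment scheduling fails, the payment is refunded and then the inventory released, in that order.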

**Effort estimation methodology**

- Discovery: 3–4 weeks (architecture, telemetry, DDD workshops).
- PoCs: 2–3 small PoCs (API gateway + one service, DB migration pattern, Saga orchestrator), each 2–3 person-weeks.
- Per service: average 3–8 person-weeks depending on complexity; for ~8–12 candidate services, estimate 24–96 person-weeks.
- Add cross-cutting work (CI/CD, monitoring, infra as code) of ~8–12 person-weeks, plus a 25% planning buffer.
- Express estimates as person-weeks and milestones (Discovery, PoC, Pilot, Iterative rollout).
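Multiplying the quoted ranges out makes the total explicit: 8 to 12 services at 3 to 8 person-weeks each gives 24 to 96 person-weeks for extraction alone. The sketch below rolls the ranges up into a low/high total. The function and parameter names are hypothetical, and Discovery is left out because it is stated in calendar weeks and the team size is not given.

```python
def estimate_person_weeks(
    services=(8, 12),        # candidate service count (low, high)
    per_service=(3, 8),      # person-weeks per service
    pocs=(2, 3),             # number of PoCs
    per_poc=(2, 3),          # person-weeks per PoC
    cross_cutting=(8, 12),   # CI/CD, monitoring, infra as code
    buffer=0.25,             # planning buffer applied to the subtotal
):
    """Return (low, high) total person-weeks including the buffer."""
    low = services[0] * per_service[0] + pocs[0] * per_poc[0] + cross_cutting[0]
    high = services[1] * per_service[1] + pocs[1] * per_poc[1] + cross_cutting[1]
    return low * (1 + buffer), high * (1 + buffer)


low, high = estimate_person_weeks()
print(f"{low:.1f} to {high:.2f} person-weeks")  # 45.0 to 146.25 person-weeks
```

The wide spread between the low and high totals is itself useful in an interview answer: it shows that the PoCs and the first pilot service exist to narrow the per-service range before committing to a schedule.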

**Mitigating operational risk**

- Dark-launch/traffic-splitting and canary deployments behind the gateway.
- Feature flags, circuit breakers, centralized observability (distributed tracing, metrics, SLOs).
- Automated rollback, runbooks, and staged cutover per domain.
- Strong governance: API contract tests, backward compatibility, and a migration playbook.

This plan balances incremental risk reduction with measurable proofs (PoCs) and cloud-native operational controls appropriate for enterprise-scale migration.
