InterviewStack.io

Data Engineering & Analytics Infrastructure Topics

Data pipeline design, ETL/ELT processes, streaming architectures, data warehousing infrastructure, analytics platform design, and real-time data processing. Covers event-driven systems, batch and streaming trade-offs, data quality and governance at scale, schema design for analytics, and infrastructure for big data processing. Distinct from Data Science & Analytics (which focuses on statistical analysis and insights) and from Cloud & Infrastructure (which is platform-focused rather than data-flow-focused).

Hadoop Ecosystem & Related Tools

Overview of the Hadoop ecosystem components (e.g., HDFS, MapReduce, YARN) and related tools (Hive, Pig, HBase, Sqoop, Flume, Oozie, Hue, etc.). Covers batch and streaming data processing, data ingestion and ETL pipelines, data warehousing in Hadoop, and operational considerations for deploying and managing Hadoop-based data pipelines in modern data architectures.
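
For illustration, here is the classic word count expressed as a Hadoop Streaming job, which runs arbitrary executables as the map and reduce phases; the script names and input data are assumptions for the sketch, not part of any specific deployment.

```python
#!/usr/bin/env python3
# mapper.py -- Hadoop Streaming mapper: emit one "word<TAB>1" line per token.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop Streaming reducer: sum the counts for each word.
# Streaming sorts mapper output by key, so equal words arrive contiguously.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Such a job is launched by passing both scripts to the hadoop-streaming JAR as the -mapper and -reducer executables, with HDFS input and output paths.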

50 questions

Data Quality and Edge Case Handling

Practical skills and best practices for recognizing, preventing, and resolving real-world data quality problems and edge cases in queries, analyses, and production data pipelines. Core areas include handling missing and null values, empty and single-row result sets, duplicate records and deduplication strategies, outliers and distributional assumptions, data type mismatches and inconsistent formatting, canonicalization and normalization of identifiers and addresses, time zone and daylight saving time handling, null propagation in joins, and guarding against division by zero and other runtime anomalies. It also covers merging partial or inconsistent records from multiple sources, attribution and aggregation edge cases, GROUP BY and window-function corner cases, performance and correctness trade-offs at scale, designing robust queries and pipeline validations, implementing sanity checks and test datasets, and documenting data limitations and assumptions. At senior levels this expands to proactively designing automated data quality checks, monitoring and alerting for anomalies, defining remediation workflows, communicating trade-offs to stakeholders, and balancing engineering effort against business risk.
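
For a flavor of these edge cases in practice, here is a minimal pandas sketch (the frame and column names are invented for illustration) that deduplicates, drops rows with missing identifiers, and guards a ratio against division by zero:

```python
import numpy as np
import pandas as pd

# Toy frame with common hazards: an exact duplicate row, a missing
# identifier, a zero denominator, and a missing denominator.
df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", None],
    "clicks":  [10, 10, 0, 5, 3],
    "views":   [100.0, 100.0, 0.0, None, 50.0],
})

df = df.drop_duplicates()                      # collapse exact repeats

missing_ids = int(df["user_id"].isna().sum())  # count before dropping, for monitoring
df = df.dropna(subset=["user_id"])             # rows without an identifier are unusable

# Guarded division: zero or missing views yield NaN rather than inf or an error.
df["ctr"] = np.where(df["views"] > 0, df["clicks"] / df["views"], np.nan)

assert df["ctr"].dropna().between(0, 1).all()  # cheap sanity check on the result
print(f"dropped {missing_ids} row(s) with a missing user_id")
print(df)
```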

55 questions

Data Cleaning and Business Logic Edge Cases

Covers handling data-centric edge cases and complex business rule interactions in queries and data pipelines. Topics include cleaning and normalizing data, handling nulls and type mismatches, deduplication strategies, treating inconsistent or malformed records, validating results and detecting anomalies, using conditional logic for data transformation, understanding null semantics in SQL, and designing queries that correctly implement date boundaries and domain-specific business rules. Emphasis is on producing robust results in the presence of imperfect data and complex requirements.
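
Null semantics in particular trip up otherwise correct queries. A self-contained sqlite3 sketch (table and values invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, coupon TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "SAVE10"), (2, None), (3, "SAVE10")])

# Comparisons with NULL yield NULL (treated as false), so '= NULL' matches nothing.
print(conn.execute(
    "SELECT COUNT(*) FROM orders WHERE coupon = NULL").fetchone()[0])   # 0

# IS NULL is the correct predicate for missing values...
print(conn.execute(
    "SELECT COUNT(*) FROM orders WHERE coupon IS NULL").fetchone()[0])  # 1

# ...and COALESCE substitutes a sentinel so business logic sees one code path.
print(conn.execute(
    "SELECT id, COALESCE(coupon, 'NONE') FROM orders").fetchall())
```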

0 questions

Real Time and Batch Ingestion

Focuses on choosing between batch ingestion and real-time streaming for moving data from sources to storage and downstream systems. Topics include latency and throughput requirements, cost and operational complexity, consistency and delivery semantics such as at-least-once and exactly-once, idempotency and deduplication strategies, schema evolution, connector and source considerations, backpressure and buffering, checkpointing and state management, and tooling choices for streaming and batch. Candidates should be able to design hybrid architectures that combine streaming for low-latency needs with batch pipelines for large backfills or heavy aggregations, and explain operational trade-offs such as monitoring, scaling, failure recovery, and debugging.
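
A hedged sketch of the core idempotency idea under at-least-once delivery; the event_id field and in-memory set are stand-ins for a real payload key and a durable dedup store:

```python
processed: set[str] = set()  # a durable keyed store in production

def apply_side_effects(event: dict) -> None:
    print("writing", event)           # stands in for a DB write, API call, etc.

def handle(event: dict) -> None:
    if event["event_id"] in processed:
        return                        # redelivered duplicate: safe to skip
    apply_side_effects(event)
    processed.add(event["event_id"])  # record completion only after success

for e in [{"event_id": "a1", "v": 1}, {"event_id": "a1", "v": 1}]:
    handle(e)                         # the duplicate is processed exactly once
```

Note the remaining gap: a crash between the side effect and the completion record causes a re-apply, so in practice the write and the dedup record are committed together (for example in one transaction) to approximate exactly-once behavior.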

46 questions

Data Quality & Troubleshooting Missing/Incorrect Data

Understand how to identify and troubleshoot data quality issues. Common issues: (1) Duplicate records: the same person appears multiple times in the database. (2) Missing data: required fields are blank. (3) Incorrect data: email addresses are formatted inconsistently. (4) Out-of-sync data: the CRM and analytics show different numbers. (5) Tracking failures: events are not being recorded. When investigating a data quality issue, ask: (1) What specifically is wrong? (2) How much data is affected? (3) When did it start? (4) What changed around that time? (5) What is the impact? (6) How do we fix it going forward? Example: 'Our lead count from website forms dropped 30% overnight. I checked: Was the form code broken? (no) Were people still submitting? (yes) Were submissions being captured? (no; they were tracked in analytics but not reaching the CRM.) Root cause: the API integration had failed. We manually synced the overnight data and fixed the API.' At the junior level, show that you think systematically about investigating issues and involve technical teams when needed.
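
A first-pass automated check for the kind of drop described above can be as simple as comparing each day's count to the previous day's; the numbers here are made up for illustration:

```python
def flag_drops(counts: list[int], threshold: float = 0.3) -> list[int]:
    """Return indices where a day's count fell more than `threshold`
    versus the previous day -- a cheap first-pass anomaly check."""
    flagged = []
    for i in range(1, len(counts)):
        prev = counts[i - 1]
        if prev > 0 and (prev - counts[i]) / prev > threshold:
            flagged.append(i)
    return flagged

daily_leads = [120, 118, 125, 84, 86]  # hypothetical daily lead counts
print(flag_drops(daily_leads))         # [3]: the ~33% overnight drop
```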

0 questions

Big Data Technologies Stack

Overview of big data tooling and platforms used for data ingestion, processing, and analytics at scale. Includes frameworks and platforms such as Apache Spark, Hadoop ecosystem components (HDFS, MapReduce, YARN), data lake architectures, streaming and batch processing, and cloud-based data platforms. Covers data processing paradigms, distributed storage and compute, data quality, and best practices for building robust data pipelines and analytics infrastructure.
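
As one representative piece of this stack, a minimal PySpark batch job that reads raw events, aggregates them by day, and writes partitioned Parquet; the file paths and the ts and event_type columns are assumptions for the sketch:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-agg").getOrCreate()

events = spark.read.json("events.json")       # distributed read of raw events
daily = (events
         .withColumn("day", F.to_date("ts"))  # derive a partition-friendly date
         .groupBy("day", "event_type")
         .count())

# Partitioned columnar output keeps downstream scans cheap.
daily.write.mode("overwrite").partitionBy("day").parquet("out/daily_counts")
spark.stop()
```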

50 questions

Apache Spark Architecture

Covers core Apache Spark architecture and programming model, including the roles of the driver and executors, cluster manager options, resource allocation, executor memory and cores, partitions, tasks, stages, and the directed acyclic graph (DAG) used for job execution. Explains lazy evaluation and the distinction between transformations and actions, fault tolerance mechanisms, caching and persistence strategies, partitioning and shuffle behavior, broadcast variables and accumulators, and techniques for performance tuning and handling data skew. Compares Resilient Distributed Datasets (RDDs), DataFrames, and Datasets, describing when to use each API, the benefits of the DataFrame and Spark SQL APIs driven by the Catalyst optimizer and Tungsten execution engine, and considerations for user-defined functions, serialization, checkpointing, and common data sources and formats.
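
The lazy-evaluation model is easiest to see in a tiny example: transformations only extend the DAG, and nothing executes until an action runs (a sketch against a local Spark session):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)

# Transformations are lazy: these lines only record lineage in the DAG.
evens = rdd.filter(lambda x: x % 2 == 0)
squared = evens.map(lambda x: x * x)
squared.cache()              # persist once materialized; reused below

# Actions trigger execution of the whole lineage.
total = squared.count()      # first action: computes and populates the cache
sample = squared.take(5)     # second action: served from the cache
print(total, sample)
spark.stop()
```

The second action is served from the cache rather than recomputing the lineage, which is also the heart of Spark's fault tolerance: a lost partition can always be rebuilt from the recorded transformations.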

40 questions

AWS Data Services

Specialized knowledge of Amazon Web Services targeted at data storage, processing, analytics, and streaming. This covers object storage and data lake design with Simple Storage Service (S3), including storage classes, lifecycle and partitioning strategies; analytics and warehousing with Redshift, including columnar storage, distribution styles, compression, query optimization, and concurrency considerations; big data processing with Elastic MapReduce (EMR) for managed Spark and Hadoop clusters and associated tuning; serverless extract-transform-load (ETL) using Glue, including data catalog concepts, schema management, and job orchestration; and real-time data ingestion and processing with Kinesis, including producers, shards, retention, consumers, and stream processing patterns. Candidates should understand when to choose batch versus streaming architectures, how to integrate services into end-to-end data pipelines, trade-offs around scalability, latency, consistency, security, data governance, and cost optimization, and monitoring and debugging techniques for data workloads.
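
As one concrete integration point, a minimal boto3 producer writing a record to a Kinesis stream; the stream name, region, and payload fields are invented for the sketch, and AWS credentials are assumed to be configured:

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"user_id": "u42", "action": "page_view"}  # hypothetical payload
resp = kinesis.put_record(
    StreamName="clickstream",       # hypothetical stream name
    Data=json.dumps(event).encode(),
    PartitionKey=event["user_id"],  # same key -> same shard, ordered per user
)
print(resp["ShardId"], resp["SequenceNumber"])
```

Choosing the partition key is the main design decision here: records sharing a key land on the same shard, which preserves per-key ordering but can create hot shards when keys are skewed.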

35 questions

Data Pipeline Scalability and Performance

Design data pipelines that meet throughput and latency targets at large scale. Topics include capacity planning, partitioning and sharding strategies, parallelism and concurrency, batching and windowing trade-offs, network and I/O bottlenecks, replication and load balancing, resource isolation, autoscaling patterns, and techniques for maintaining performance as data volume grows by orders of magnitude. Also covers approaches to benchmarking, backpressure management, cost-versus-performance trade-offs, and strategies to avoid hot spots.
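
Hot-spot avoidance often comes down to how keys map to shards. A small sketch of stable hash partitioning plus key salting for a known hot key (the shard count and key format are illustrative choices):

```python
import hashlib
import random

NUM_SHARDS = 16

def shard_for(key: str) -> int:
    """Stable hash partitioning: spreads keys uniformly across shards."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

def salted_key(hot_key: str, salt_buckets: int = 8) -> str:
    """Salting a known hot key fans its traffic out over several shards;
    readers must merge the salted variants back together."""
    return f"{hot_key}#{random.randrange(salt_buckets)}"

print(shard_for("user-123"))  # every call maps this key to the same shard
print({shard_for(salted_key("celebrity-user")) for _ in range(100)})  # several shards
```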

40 questions