Data engineering interview questions · medium
Describe strategies for optimizing a slow-running query on a massive dataset.
Discuss strategies for handling schema evolution in data warehouses.
Duplicate characters in a string (e.g., '123a!' to '112233aa!!').
ER Modeling vs. Dimensional Modeling?
Explain CTE vs Temp Table. What are the differences and use cases?
Explain Data Modeling SCD Types (Type 1, 2, 3).
Explain Dynamic Partition Pruning error and how to fix it.
Explain Fact Table and Star Schema.
Explain Redshift Data Distribution (EVEN, KEY, ALL).
Explain UNION vs. UNION ALL in SQL.
Implement a recursive query for hierarchy (employee-manager). Explain the termination guarantees, depth limits, and when a recursive CTE becomes a scalability bottleneck. What alternatives exist for graph-scale hierarchies in Spark or a data lake?
Compare Glue partition discovery with Hive MSCK/ADD PARTITION. Explain the operational and cost implications of crawler-based vs. partition-projection approaches. When does partition projection become necessary, and what are its limitations?
Explain how partitioning and bucketing in Hive/Spark optimize queries. What are the trade-offs among bucket count, partition cardinality, and the small-file problem? When does over-partitioning or over-bucketing become counterproductive?
Explain how to implement cumulative sum in SQL.
Explain how you would implement partitioning and bucketing for data stored in S3 to improve query performance.
Explain how you would optimize Redshift query performance for a reporting system with large fact tables.
Explain how you would use repartition or coalesce effectively to optimize processing when analyzing data only for a specific region.
Explain indexing and its impact on database performance.
Explain normalization and its disadvantages.
Explain offset management, sync vs. async commits, partition assignment strategies and consumer groups, and backpressure handling in Kafka streams.
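As an illustration of the string question in the list above ('123a!' to '112233aa!!'), here is a minimal sketch in Python. The function name `duplicate_chars` is a hypothetical choice for this example, and it assumes "duplicate" means repeating every character exactly twice, in order, as the sample input and output suggest.

```python
def duplicate_chars(text: str) -> str:
    """Return text with every character repeated twice, preserving order."""
    # Build the pieces in a generator and join once, which avoids
    # repeated string concatenation inside a loop.
    return "".join(ch * 2 for ch in text)


print(duplicate_chars("123a!"))  # 112233aa!!
```

The single `join` keeps the work linear in the length of the input, which is usually the follow-up an interviewer asks about.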
SQL is the most tested topic in data engineering interviews. Most companies dedicate an entire round to SQL, typically asking 3-5 questions covering window functions, CTEs, joins, optimization, and platform-specific features.
Focus on: window functions (RANK, ROW_NUMBER, LAG/LEAD), CTEs and recursive queries, query optimization and execution plans, indexing strategies, and platform-specific features for BigQuery, Redshift, or Snowflake depending on the company.
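To make the focus areas above concrete, the sketch below runs a window-function query (running total, LAG, RANK) and a recursive employee-manager CTE against an in-memory SQLite database through Python's standard `sqlite3` module. The tables `sales` and `employees`, their columns, and the sample rows are invented for illustration, and the depth cap of 10 is an arbitrary guard chosen for this example. It assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added; BigQuery, Redshift, and Snowflake use slightly different dialects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER);
INSERT INTO sales VALUES
  ('east', 1, 10), ('east', 2, 30), ('east', 3, 20),
  ('west', 1, 50), ('west', 2, 50);

CREATE TABLE employees (id INTEGER, name TEXT, manager_id INTEGER);
INSERT INTO employees VALUES
  (1, 'Ada', NULL), (2, 'Ben', 1), (3, 'Cy', 2), (4, 'Di', 1);
""")

# Window functions: per-region running total, previous day's amount,
# and rank by amount (ties share a rank).
window_rows = conn.execute("""
SELECT region, day, amount,
       SUM(amount) OVER (PARTITION BY region ORDER BY day
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS running_total,
       LAG(amount) OVER (PARTITION BY region ORDER BY day) AS prev_amount,
       RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank
FROM sales
ORDER BY region, day
""").fetchall()

# Recursive CTE: walk the hierarchy from the root (no manager) downward.
# The depth check is a guard so a cycle in the data cannot recurse forever.
hierarchy_rows = conn.execute("""
WITH RECURSIVE chain(id, name, depth) AS (
    SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id, e.name, c.depth + 1
    FROM employees AS e
    JOIN chain AS c ON e.manager_id = c.id
    WHERE c.depth < 10
)
SELECT id, name, depth FROM chain ORDER BY depth, id
""").fetchall()

for row in window_rows:
    print(row)
for row in hierarchy_rows:
    print(row)
```

Spelling out the `ROWS BETWEEN` frame makes the running total independent of the default frame, which otherwise groups rows that tie on the `ORDER BY` key.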
Data engineering SQL rounds differ from software engineering ones. They emphasize analytical queries (window functions, aggregations), large-scale optimization (partitioning, indexing), and data warehouse concepts (star schema, slowly changing dimensions). Software engineering SQL tends to focus on CRUD operations and basic joins.
For a mid-level data engineering role, plan 2-4 weeks of focused SQL practice. Cover window functions, CTEs, optimization, and practice writing queries under time pressure. Use real interview questions from companies you're targeting.