Data engineering interview questions · medium
What is the difference between repartition and coalesce in Apache Spark?
What is the difference between cache() and persist() in Spark? When would you use each?
What is the difference between groupByKey and reduceByKey in Spark?
What is the difference between narrow and wide transformations in Apache Spark? Explain with examples.
What strategies can you use to handle skewed data in Spark?
Explain the difference between Spark's map() and flatMap() transformations.
Explain the concept of Broadcast Join in Spark. When should it be used?
Convert complex SQL (CTEs, window functions, subqueries) to production-grade PySpark. Discuss when to use spark.sql() vs. DataFrame API, and the implications for testability, partitioning, and execution predictability.
Explain how Adaptive Query Execution changes the economics of Spark tuning. What problems does it solve at runtime, and when might you still need manual intervention (e.g., salting, broadcast hints)?
Architect incremental load in ADF + Databricks with idempotency, late-arrival handling, and cost/scalability implications of watermark vs. change data capture.
Explain strategies for managing schema changes in PySpark over time.
How do you drop columns with null values in PySpark?
How do you handle data skewness in Spark?
How would you read data from a web API using PySpark?
What is Adaptive Query Execution (AQE) in Spark 3.x, and how does it improve performance?
What is the difference between repartition and coalesce in Spark?
When and how do you use Broadcast Join in Spark?
What is broadcasting in Spark, and why is it used? Can you give an example of its use?
What is the difference between map and flatMap in Spark, and when would you use each?
What is the purpose of the Bronze, Silver, and Gold layers in a data pipeline?
The most common Spark interview topics are the difference between RDDs and DataFrames, transformations vs. actions, data skew and how to handle it, partitioning strategies, shuffle optimization, and the Catalyst optimizer. Delta Lake and Structured Streaming are increasingly tested.
If you're targeting mid-to-senior roles at companies processing large datasets, yes. Spark/Big Data questions appear in most data engineering interviews at scale-up and enterprise companies. Even companies using other tools test Spark as a proxy for distributed systems knowledge.
Use Databricks Community Edition (free), Google Colab with PySpark, or local Docker setups. Focus on understanding concepts like partitioning, broadcast joins, and lazy evaluation. Most interview questions test conceptual understanding, not syntax.
Data skew handling and performance tuning are the most challenging areas. Interviewers ask how to diagnose skew in a Spark job, which strategies fix it (salting, repartitioning, broadcast joins), and how to read the Spark UI to find performance bottlenecks.
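Salting is the strategy candidates most often struggle to explain, so a small sketch may help. The example below is a pure-Python simulation of the idea, not Spark code: `zlib.crc32` stands in for a hash partitioner, and `NUM_PARTITIONS`, `NUM_SALTS`, and the sample records are all hypothetical. It shows the two-stage pattern: append a salt to each key so a hot key is split into several sub-keys, aggregate per salted key, then strip the salt and combine the partial results.

```python
import zlib
from collections import Counter, defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count
NUM_SALTS = 4       # hypothetical number of salt buckets per key


def partition_of(key: str) -> int:
    """Deterministic stand-in for a hash partitioner (an assumption, not Spark's)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def partition_sizes(keys):
    """Count how many records each simulated partition would receive."""
    sizes = Counter(partition_of(k) for k in keys)
    return [sizes.get(p, 0) for p in range(NUM_PARTITIONS)]


# Skewed input: one hot key dominates the dataset.
records = [("hot", 1)] * 1000 + [("a", 1)] * 10 + [("b", 1)] * 10

# Without salting, every "hot" record hashes to the same partition.
unsalted = partition_sizes(k for k, _ in records)

# Stage 1: append a salt so each key is split into NUM_SALTS sub-keys,
# then aggregate per salted key (the partial aggregation).
salted_records = [(f"{k}#{i % NUM_SALTS}", v) for i, (k, v) in enumerate(records)]
salted = partition_sizes(k for k, _ in salted_records)

partial = defaultdict(int)
for salted_key, value in salted_records:
    partial[salted_key] += value

# Stage 2: strip the salt and combine the partial results per original key.
totals = defaultdict(int)
for salted_key, value in partial.items():
    original_key = salted_key.rsplit("#", 1)[0]
    totals[original_key] += value

print("partition sizes without salt:", unsalted)
print("partition sizes with salt:   ", salted)
print("final totals:", dict(totals))
```

The final totals match an unsalted aggregation, but the heavy first stage now works on several sub-keys per hot key, which a hash partitioner will usually place on different partitions. In a real Spark job the same shape appears as a salted `groupBy` followed by a second `groupBy` on the original key; for joins, the smaller side has to be replicated once per salt value.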