Real questions from top companies in Spark/Big Data · medium
What is the difference between repartition and coalesce in Apache Spark?
What is the difference between cache() and persist() in Spark? When would you use each?
What is the difference between groupByKey and reduceByKey in Spark?
What is the difference between narrow and wide transformations in Apache Spark? Explain with examples.
What strategies can you use to handle skewed data in Spark?
Explain the difference between Spark's map() and flatMap() transformations.
Explain the concept of Broadcast Join in Spark. When should it be used?
Convert complex SQL (CTEs, window functions, subqueries) to production-grade PySpark. Discuss when to use spark.sql() vs. DataFrame API, and the implications for testability, partitioning, and execution predictability.
Explain how Adaptive Query Execution changes the economics of Spark tuning. What problems does it solve at runtime, and when might you still need manual intervention (e.g., salting, broadcast hints)?
Architect an incremental load in ADF + Databricks with idempotency and late-arrival handling, and discuss the cost/scalability implications of watermark vs. change data capture.
Explain strategies for managing schema changes in PySpark over time.
How do you drop columns with null values in PySpark?
How do you handle data skewness in Spark?
How would you read data from a web API using PySpark?
What is Adaptive Query Execution (AQE) in Spark 3.x, and how does it improve performance?
What is the difference between repartition and coalesce in Spark?
When and how do you use Broadcast Join in Spark?
What is broadcasting in Spark, and why is it used? Can you give an example of its use?
What is the difference between map and flatMap in Spark, and when would you use each?
What is the purpose of the Bronze, Silver, and Gold layers in a data pipeline?