Medium-level Spark & big data questions from real data engineering interviews.
These medium-difficulty Spark and big data questions are drawn from real interviews at top companies. Each question includes a detailed expert answer and a pro tip to help you nail your interview.
What is the difference between repartition and coalesce in Apache Spark?
What is the difference between cache() and persist() in Spark? When would you use each?
What is the difference between groupByKey and reduceByKey in Spark?
What is the difference between narrow and wide transformations in Apache Spark? Explain with examples.
What strategies can you use to handle skewed data in Spark?
Explain the difference between Spark's map() and flatMap() transformations.
Explain the concept of Broadcast Join in Spark. When should it be used?
Convert complex SQL (CTEs, window functions, subqueries) to production-grade PySpark. Discuss when to use spark.sql() vs. DataFrame API, and the implications for testability, partitioning, and execution predictability.
Explain how Adaptive Query Execution changes the economics of Spark tuning. What problems does it solve at runtime, and when might you still need manual intervention (e.g., salting, broadcast hints)?
Architect incremental load in ADF + Databricks with idempotency, late-arrival handling, and cost/scalability implications of watermark vs. change data capture.
Explain strategies for managing schema changes in PySpark over time.
How do you drop columns with null values in PySpark?
How would you read data from a web API using PySpark?
What is Adaptive Query Execution (AQE) in Spark 3.x, and how does it improve performance?
When and how do you use Broadcast Join in Spark?
What is broadcasting in Spark, and why is it used? Can you give an example of its use?
What is the difference between map and flatMap in Spark, and when would you use each?
What is the purpose of the Bronze, Silver, and Gold layers in a data pipeline?
What work is done by the executor memory in Spark?
Write a Python script to find the count of each word in a text file using Spark.
Write the PySpark code to find the second highest salary in each department.
What are accumulators in Spark, and why are they used as write-only shared variables?
What is the difference between a Broadcast Hash Join and a Shuffle Sort-Merge Join in Spark?
How does a broadcast join optimize join performance in Spark?
Can you explain the concept of mappers in Spark, and how are they used in data transformations?
Code a simple PySpark job to read a JSON file, filter records, and write output in Parquet format.
Compare Spark's lineage recovery with Hadoop's block replication mechanism.
What are the daily tasks of a Data Engineer?
What data-related issues have you encountered, and how did you handle skewed data?
Describe how you would use PySpark to aggregate and summarize large transaction datasets.
Discuss performance tuning concepts such as shuffle, skew, and caching.
Discuss techniques such as partitioning, broadcast joins, and caching to enhance Spark job performance.
How do you handle out-of-memory errors in Spark jobs?
How do you handle very large datasets in Spark to ensure scalability and efficiency?
Provide specific examples of challenges faced with PySpark and SQL and solutions implemented.
Split a DataFrame such that even numbers appear in one column and odd numbers in another
What are the steps to mount external storage in Databricks?
What is the difference between a transformation and an action in PySpark?
What Hadoop command would you use to merge multiple files into one?
What are Spark Submit properties?
What are the key differences between Map and Reduce in Spark?
What are the key performance tuning techniques you apply in Spark jobs to improve performance?
What are the limitations of the REORG command with respect to large datasets?
What are the performance trade-offs of using salting to mitigate data skewness?
What are the steps to efficiently process 1 TB of data in Spark?
What causes Out of Memory (OOM) issues in Databricks, and how do you resolve them?
What causes data skewness in Spark, and how can it be resolved?
What configuration parameters are critical for enabling AQE effectively?
What determines the maximum parallelism achievable in Databricks?
What do you understand by data shuffling in Spark? Why is it important?
What is a Broadcast Join, and why is it required?
What is a shuffle in Spark, and how do you handle it?
What is offset management in Kafka?
What is the advantage of caching in PySpark? When and why would you use it?
What is the command to import data from HDFS to Hive?
What is the difference between partitioning and repartitioning in Spark, and when do you use each?
What is the most common performance bottleneck in Spark jobs, and how would you resolve it?