Spark & Big Data questions from Fragma Data Systems data engineering interviews.
These Spark and big data questions are sourced from Fragma Data Systems data engineering interviews, and each includes an expert-level answer. The set leans toward senior-level depth: 18 of the 29 questions are tagged hard. Recurring themes are Spark, partitioning, and optimization; these patterns appear most often in real interviews and reward the deepest preparation. Many of the questions also surface at Dunnhumby and Delivery Hero, so the preparation transfers across companies. The average answer takes about a minute to read, so plan roughly an hour to work through the full set thoughtfully.
This collection contains 29 curated questions: 4 easy, 7 medium, and 18 hard. The distribution skews toward harder problems, reflecting the depth expected in senior-level interviews.
The most frequently tested areas in this set are Spark (23 questions), partitioning (20), optimization (17), joins (13), SQL (9), and Python (7). Focusing on these topics will give you the highest return on your preparation time.
Start with the easy questions to warm up and solidify fundamentals. Medium-difficulty questions form the bulk of real interviews — spend the most time here and practice explaining your reasoning out loud. Hard questions often appear in senior and staff-level rounds; attempt them after you're comfortable with the basics. For each question, try answering before revealing the solution. Use our AI Mock Interview to simulate real interview conditions and get instant feedback on your responses.
What is the difference between repartition and coalesce in Apache Spark?
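A minimal sketch of the behavioral difference, assuming a local SparkSession (the app name and counts are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-vs-coalesce").getOrCreate()
df = spark.range(1_000_000)  # toy dataset

# repartition(n) triggers a full shuffle and can increase or decrease the
# partition count; data is redistributed roughly evenly across partitions.
evenly_spread = df.repartition(200)

# coalesce(n) only merges existing partitions (no full shuffle), so it can
# only reduce the count; it is cheaper but may leave partitions uneven.
fewer_files = df.coalesce(10)

print(evenly_spread.rdd.getNumPartitions())  # 200
print(fewer_files.rdd.getNumPartitions())    # 10
```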
What is the difference between narrow and wide transformations in Apache Spark? Explain with examples.
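A quick RDD sketch of the contrast, assuming a local SparkSession (names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("narrow-vs-wide").getOrCreate()
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Narrow: each output partition depends on exactly one input partition,
# so no shuffle is needed.
mapped = rdd.mapValues(lambda v: v * 10)

# Wide: an output partition may need rows from every input partition,
# which forces a shuffle and a stage boundary.
reduced = rdd.reduceByKey(lambda a, b: a + b)
print(reduced.collect())  # e.g. [('b', 2), ('a', 4)]
```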
Describe the difference between Spark RDDs, DataFrames, and Datasets.
Explain the difference between Spark's map() and flatMap() transformations.
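A minimal sketch of the contrast, again assuming a local SparkSession:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("map-vs-flatmap").getOrCreate()
rdd = spark.sparkContext.parallelize(["hello world", "spark rocks"])

# map: exactly one output element per input element (here, a list per line)
print(rdd.map(lambda line: line.split(" ")).collect())
# -> [['hello', 'world'], ['spark', 'rocks']]

# flatMap: each input may emit zero or more elements, flattened into one RDD
print(rdd.flatMap(lambda line: line.split(" ")).collect())
# -> ['hello', 'world', 'spark', 'rocks']
```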
How does Spark's Catalyst Optimizer work? Explain its stages.
What is the difference between Managed and External tables in Hive/Spark?
Explain the concept of Broadcast Join in Spark. When should it be used?
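A sketch of the explicit broadcast hint, assuming two Parquet tables (the paths and the join key are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

facts = spark.read.parquet("s3://bucket/facts/")       # large table
dims = spark.read.parquet("s3://bucket/dim_country/")  # small lookup table

# broadcast() hints Spark to ship the small table to every executor,
# replacing a shuffle-based join with a map-side hash join.
joined = facts.join(broadcast(dims), on="country_code", how="left")
joined.explain()  # the plan should show BroadcastHashJoin
```

Spark also broadcasts automatically when the smaller side falls under spark.sql.autoBroadcastJoinThreshold (10 MB by default); the explicit hint helps when table statistics mislead the optimizer.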
How do you optimize Spark jobs for better performance? Mention at least 5 techniques.
Design an anti-skew strategy for a join on a high-cardinality key with a long-tail distribution (e.g., a few keys hold 80% of rows). Cover salting, split-skew, AQE, and cost/operational trade-offs.
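One possible salting sketch, assuming a large table skewed on user_id joined to a smaller table (the paths, column names, and bucket count are all illustrative):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

SALT_BUCKETS = 16  # tune to the skew factor

spark = SparkSession.builder.appName("salted-join").getOrCreate()
big = spark.read.parquet("s3://bucket/events/")  # skewed on user_id
small = spark.read.parquet("s3://bucket/users/")

# Add a random salt to the skewed side so each hot key spreads across
# SALT_BUCKETS partitions instead of landing in one.
big_salted = big.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Replicate each row of the other side once per salt value so every
# salted key still finds its match.
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
small_salted = small.crossJoin(salts)

joined = big_salted.join(small_salted, on=["user_id", "salt"]).drop("salt")
```

On Spark 3.x, enabling spark.sql.adaptive.enabled and spark.sql.adaptive.skewJoin.enabled lets AQE split oversized shuffle partitions automatically, which often avoids manual salting at the cost of less predictable plans; the trade-off is the replication overhead of salting versus trusting runtime statistics.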
Explain the benefits of using DataFrames over RDDs.
How do you optimize Spark jobs for performance?
How would you implement a sliding window aggregation in Spark Structured Streaming?
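A sketch of a 10-minute window sliding every 5 minutes, assuming a Kafka source with the spark-sql-kafka connector on the classpath (the broker and topic names are illustrative):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("sliding-window").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "clicks")
    .load()
    .select(
        F.col("timestamp").alias("event_time"),
        F.col("value").cast("string").alias("payload"),
    )
)

# 10-minute windows sliding every 5 minutes; the watermark bounds state
# size by dropping events more than 15 minutes late.
counts = (
    events.withWatermark("event_time", "15 minutes")
    .groupBy(F.window("event_time", "10 minutes", "5 minutes"))
    .count()
)

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```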
What is Spark's Catalyst Optimizer? Explain its stages.
What is the difference between Spark RDDs, DataFrames, and Datasets?
When and how do you use Broadcast Join in Spark?
Describe your approach to managing offsets in Kafka.
Explain how you would design a partition strategy for a large dataset in HDFS.
Explain the architecture of Kafka and its core components.
Explain your choice of streaming framework (Kafka, Spark Streaming, etc.).
How do you handle out-of-memory errors in Spark jobs?
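A configuration-side sketch; all values below are illustrative and depend on cluster size and data volume (skew fixes and avoiding collect() on large results matter just as much):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("memory-tuned")
    .config("spark.executor.memory", "8g")          # heap per executor
    .config("spark.executor.memoryOverhead", "2g")  # off-heap headroom (Python workers, shuffle buffers)
    .config("spark.sql.shuffle.partitions", "400")  # more, smaller shuffle blocks per task
    .getOrCreate()
)
```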
How do you reduce shuffle operations in Spark?
How does Kafka ensure message durability and reliability?
How does Spark execute a job? Explain the DAG and stages.
How does lazy evaluation work in Spark?
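A tiny illustration, assuming a local SparkSession: the transformations below only build a logical plan, and nothing executes until the action at the end.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("lazy-eval").getOrCreate()
df = spark.range(10_000_000)

# Nothing has run yet: filter/withColumn just extend the logical plan.
transformed = df.filter(F.col("id") % 2 == 0).withColumn("doubled", F.col("id") * 2)

# The action triggers planning, optimization, and execution of the whole chain.
print(transformed.count())
```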
Implement a Kafka consumer that writes streaming data into a database.
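A minimal at-least-once sketch using the kafka-python client with SQLite as a stand-in database (the topic, broker, and table schema are illustrative):

```python
import json
import sqlite3

from kafka import KafkaConsumer  # kafka-python

db = sqlite3.connect("events.db")
db.execute("CREATE TABLE IF NOT EXISTS events (id TEXT, payload TEXT)")

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="broker:9092",
    group_id="orders-writer",
    enable_auto_commit=False,  # commit only after a successful DB write
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for record in consumer:
    event = record.value
    db.execute(
        "INSERT INTO events (id, payload) VALUES (?, ?)",
        (event.get("id"), json.dumps(event)),
    )
    db.commit()
    consumer.commit()  # at-least-once: the offset advances only after the write lands
```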
Implement a PySpark job to read CSV data, perform joins, and store output as partitioned Parquet.
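One possible shape of such a job; the paths, column names, and join key are all illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

orders = spark.read.csv("s3://bucket/orders.csv", header=True, inferSchema=True)
customers = spark.read.csv("s3://bucket/customers.csv", header=True, inferSchema=True)

enriched = orders.join(customers, on="customer_id", how="inner")

# Partition the output by the columns downstream queries filter on,
# so readers can prune whole directories.
(
    enriched.write.mode("overwrite")
    .partitionBy("country", "order_date")
    .parquet("s3://bucket/enriched_orders/")
)
```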
What are the different delivery semantics in Kafka (at-least-once, at-most-once, exactly-once)?
What is the role of Zookeeper in Kafka?
Write a PySpark code snippet to filter rows with a specific condition.
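A minimal example, assuming a Parquet source with status and amount columns (the path and condition are illustrative):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("filter-demo").getOrCreate()
df = spark.read.parquet("s3://bucket/transactions/")

# Keep completed transactions above a threshold; filter() and where() are equivalent.
high_value = df.filter((F.col("status") == "COMPLETED") & (F.col("amount") > 1000))
high_value.show()
```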
Get full access to 1,800+ expert answers, AI mock interviews, and personalized progress tracking.