Spark and big data questions sourced from Fragma Data Systems data engineering interviews. Each includes an expert-level answer.
What is the difference between repartition and coalesce in Apache Spark?
What is the difference between narrow and wide transformations in Apache Spark? Explain with examples.
Describe the difference between Spark RDDs, DataFrames, and Datasets.
Explain the difference between Spark's map() and flatMap() transformations.
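The distinction is purely about output arity, so it can be modeled in plain Python without a Spark runtime; in Spark the equivalent calls would be `rdd.map(f)` and `rdd.flatMap(f)`. A minimal sketch:

```python
# Plain-Python model of the two transformations. In Spark these would be
# rdd.map(f) and rdd.flatMap(f) over a distributed collection.
lines = ["hello world", "spark"]

# map: exactly one output element per input element
# (here each line becomes one list of words, so the result is nested)
mapped = [line.split() for line in lines]
# → [["hello", "world"], ["spark"]]

# flatMap: apply the function, then flatten one level, so each input
# element can yield zero, one, or many output elements
flat_mapped = [word for line in lines for word in line.split()]
# → ["hello", "world", "spark"]
```

This is why `flatMap` is the natural fit for tokenization in word count: it turns each line into a variable number of word records.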
How does Spark's Catalyst Optimizer work? Explain its stages.
What is the difference between Managed and External tables in Hive/Spark?
Explain the concept of Broadcast Join in Spark. When should it be used?
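A broadcast join ships the small table to every executor so the large side never shuffles. The mechanics can be sketched in plain Python (the table names are hypothetical); in Spark this corresponds to `large_df.join(broadcast(small_df), "key")`:

```python
# Plain-Python model of a broadcast (map-side) hash join: the small table
# is distributed to every worker as an in-memory dict; the large side is
# scanned locally, so no shuffle of the large table is needed.
small = {"US": "United States", "DE": "Germany"}    # fits in executor memory
large = [("US", 10), ("DE", 20), ("US", 30), ("FR", 40)]

# Inner join: keep only rows whose key exists in the broadcast dict.
joined = [(k, v, small[k]) for k, v in large if k in small]
```

Use it when one side comfortably fits in memory on every executor; broadcasting a table that is too large causes driver/executor memory pressure instead of saving a shuffle.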
How do you optimize Spark jobs for better performance? Mention at least 5 techniques.
Design an anti-skew strategy for a join on a high-cardinality key with a long-tail distribution (e.g., a few keys hold 80% of rows). Cover salting, split-skew, AQE, and cost/operational trade-offs.
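The core salting trick can be shown without Spark: append a random salt to hot keys on the fact side so their rows spread across several reducers, and replicate the matching dimension rows once per salt so every salted partition still finds its match. A plain-Python sketch with hypothetical data (in Spark the salt would become part of the join key):

```python
import random

# Hypothetical skewed fact rows: one "hot" key dominates the distribution.
facts = [("hot", i) for i in range(8)] + [("cold", 99)]
dims = {"hot": "H-dim", "cold": "C-dim"}   # small dimension table

SALTS = 4                 # number of salt buckets for skewed keys
HOT_KEYS = {"hot"}        # keys identified as skewed (e.g. via sampling)

# Salt the fact side: hot keys get a random suffix so their rows spread
# across SALTS partitions instead of all landing on one reducer.
salted_facts = [
    ((k, random.randrange(SALTS)) if k in HOT_KEYS else (k, 0), v)
    for k, v in facts
]

# Replicate the dimension side: hot keys are duplicated once per salt so
# every salted fact partition can still find its match.
salted_dims = {}
for k, v in dims.items():
    for s in range(SALTS if k in HOT_KEYS else 1):
        salted_dims[(k, s)] = v

# Join on the salted key, then drop the salt from the output.
joined = [(k, fv, salted_dims[(k, s)]) for (k, s), fv in salted_facts]
```

The trade-off is visible here: the dimension side is replicated `SALTS` times for hot keys, and the hot-key list must be maintained. Spark 3's AQE (`spark.sql.adaptive.skewJoin.enabled`) automates a similar split at runtime, which is usually the lower-maintenance option.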
Explain the benefits of using DataFrames over RDDs.
How would you implement a sliding window aggregation in Spark Structured Streaming?
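In Structured Streaming this is expressed declaratively, e.g. `df.groupBy(window(col("event_time"), "10 seconds", "5 seconds")).count()`. The bucketing arithmetic behind it can be sketched in plain Python (watermarking and late-data handling omitted); each event falls into every overlapping window whose span covers its timestamp:

```python
from collections import defaultdict

WINDOW = 10  # window length (seconds)
SLIDE = 5    # slide interval; SLIDE < WINDOW means windows overlap

def windows_for(t):
    """Start times of every window [s, s + WINDOW) containing timestamp t."""
    s = (t // SLIDE) * SLIDE          # latest window start at or before t
    starts = []
    while s + WINDOW > t:             # walk back through overlapping windows
        starts.append(s)
        s -= SLIDE
    return starts

events = [(7, 1), (12, 1), (13, 1)]   # hypothetical (event_time_sec, count)
counts = defaultdict(int)
for t, v in events:
    for s in windows_for(t):          # each event updates every window it spans
        counts[s] += v
```

With a 10-second window sliding every 5 seconds, each event lands in two windows, which is exactly why sliding-window counts exceed the raw event count.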
Describe your approach to managing offsets in Kafka.
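A common answer is commit-after-process: commit an offset only once its batch is durably handled, so a crash between processing and committing causes reprocessing (at-least-once) rather than data loss. A plain-Python sketch with hypothetical names (with kafka-python this maps to `consumer.poll()` followed by `consumer.commit()` with auto-commit disabled):

```python
# Simulated single partition: messages at offsets 0..3.
partition_log = ["m0", "m1", "m2", "m3"]
committed = 0       # externally stored committed offset
output = []

def poll(offset, max_records=2):
    """Return (offset, message) pairs starting at the given offset."""
    batch = partition_log[offset:offset + max_records]
    return list(enumerate(batch, start=offset))

while committed < len(partition_log):
    batch = poll(committed)           # resume from last committed offset
    for off, msg in batch:
        output.append(msg)            # process (e.g. write to a sink)
    committed = batch[-1][0] + 1      # commit AFTER processing succeeds
```

Committing before processing instead would flip the guarantee to at-most-once: a crash mid-batch skips messages rather than replaying them.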
Explain how you would design a partition strategy for a large dataset in HDFS.
Explain the architecture of Kafka and its core components.
Explain your choice of streaming framework (Kafka, Spark Streaming, etc.).
How do you handle out-of-memory errors in Spark jobs?
How do you reduce shuffle operations in Spark?
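The classic technique is map-side combining, i.e. preferring `reduceByKey` over `groupByKey`: aggregate within each partition first so only one record per distinct key crosses the network. A plain-Python model with two simulated partitions:

```python
from collections import Counter

# Two simulated partitions of (word, 1) pairs (6 raw records total).
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

# Map-side combine (what reduceByKey does): aggregate inside each
# partition first, so only partial sums are shuffled.
partials = []
for part in partitions:
    c = Counter()
    for k, v in part:
        c[k] += v
    partials.append(c)      # one record per distinct key per partition

# "Shuffle" only the partial sums, then merge them on the reduce side.
totals = Counter()
for c in partials:
    totals.update(c)
```

Here 4 partial records cross the simulated shuffle instead of 6 raw ones; with high-cardinality values per key the savings are far larger.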
How does Kafka ensure message durability and reliability?
How does Spark execute a job? Explain the DAG and stages.
How does lazy evaluation work in Spark?
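A useful analogy (plain Python, not Spark): transformations build a plan without touching data, and only an action pulls records through it. Python generators behave the same way:

```python
# Generators, like Spark transformations, build a recipe without reading
# any data; only an "action" (here, sum) executes the pipeline.
log = []

def source():
    for i in range(5):
        log.append(f"read {i}")   # side effect proves when work happens
        yield i

pipeline = (x * x for x in source() if x % 2 == 0)   # nothing runs yet
assert log == []          # no data read: evaluation is deferred

result = sum(pipeline)    # the action pulls data through the whole plan
```

In Spark, this deferral is what lets Catalyst see the full lineage and optimize it (e.g. pushing filters down) before any task runs.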
Implement a Kafka consumer that writes streaming data into a database.
Implement a PySpark job to read CSV data, perform joins, and store output as partitioned Parquet.
What are the different delivery semantics in Kafka (at least-once, at-most-once, exactly-once)?
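One point worth demonstrating: an at-least-once channel may redeliver, and a consumer can still achieve effectively-once results by deduplicating on a message id (idempotent processing), a common alternative to full Kafka transactions. A plain-Python sketch with hypothetical message ids:

```python
# At-least-once delivery: message id 2 arrives twice.
deliveries = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]

seen = set()        # ids already applied (in practice: a durable store)
processed = []
for msg_id, payload in deliveries:
    if msg_id in seen:
        continue    # duplicate redelivery: already applied, skip it
    seen.add(msg_id)
    processed.append(payload)
```

At-most-once would instead drop the dedup set and risk losing messages on failure; true exactly-once in Kafka combines idempotent producers with transactional reads and writes.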
What is the role of Zookeeper in Kafka?
Write a PySpark code snippet to filter rows with a specific condition.