Real questions from top companies in Spark/Big Data
What are the key components of the Spark execution model (Job, Stage, Task)?
What is Adaptive Query Execution (AQE) in Spark 3.x, and how does it improve performance?
What is Spark's Catalyst Optimizer? Explain its stages.
What is the difference between Spark RDDs, DataFrames, and Datasets?
What is the difference between repartition and coalesce in Spark?
What is the small-file problem in Spark, and how do you solve it?
When and how do you use Broadcast Join in Spark?
What is broadcasting in Spark, and why is it used? Can you give an example of its use?
What is the difference between Managed and External Tables in Databricks?
What is the difference between map and flatMap in Spark, and when would you use each?
What is the purpose of the Bronze, Silver, and Gold layers in a data pipeline?
What is executor memory used for in Spark?
Why is SparkSession used in Spark 2.0 and later versions?
Write a Python script to find the count of each word in a text file using Spark.
Write the PySpark code to find the second highest salary in each department.
A JSON file with an evolving schema needs to be ingested into a DataFrame. How would you handle new fields dynamically in PySpark without breaking the job for files that use earlier structures?
A data pipeline processes files for different clients stored in separate directories. Explain how you would use dynamic DAG creation to handle client-specific workflows in Airflow.
A task intermittently fails due to external API limitations. How would you configure Airflow retries and alerts to manage this situation efficiently?
What are accumulators and broadcast variables in Spark? Explain each.