DataEngPrep.tech

Interview Questions

Real questions from top companies in Spark/Big Data · medium

1

What is the difference between repartition and coalesce in Apache Spark?

Spark/Big Data · medium · partition · python · spark · 1 min read
BCG, Citi, Dunnhumby, Fragma Data Systems, +3
2

What is the difference between cache() and persist() in Spark? When would you use each?

Spark/Big Data · medium · partition · spark · 0.7 min read
Accenture, Coforge, Freecharge, Impetus, +1
3

What is the difference between groupByKey and reduceByKey in Spark?

Spark/Big Data · medium · partition · spark · 0.8 min read
Accenture, Capco, Coforge, Nagarro, +1
4

What is the difference between narrow and wide transformations in Apache Spark? Explain with examples.

Spark/Big Data · medium · join · partition · python · 0.9 min read
Coforge, Delivery Hero, Dunnhumby, Fragma Data Systems, +1
5

What strategies can you use to handle skewed data in Spark?

Spark/Big Data · medium · join · partition · spark · 0.5 min read
BCG, Bitwise, Citi, HashedIn
6

Explain the difference between Spark's map() and flatMap() transformations.

Spark/Big Data · medium · partition · spark · 0.4 min read
Delivery Hero, Dunnhumby, Fragma Data Systems
7

Explain the concept of Broadcast Join in Spark. When should it be used?

Spark/Big Data · medium · join · spark · sql · 0.4 min read
Delivery Hero, Dunnhumby, Fragma Data Systems
8

Convert complex SQL (CTEs, window functions, subqueries) to production-grade PySpark. Discuss when to use spark.sql() vs. DataFrame API, and the implications for testability, partitioning, and execution predictability.

Spark/Big Data · medium
9

Explain how Adaptive Query Execution changes the economics of Spark tuning. What problems does it solve at runtime, and when might you still need manual intervention (e.g., salting, broadcast hints)?

Spark/Big Data · medium
10

Architect an incremental load in ADF + Databricks. Cover idempotency, late-arrival handling, and the cost/scalability implications of watermark-based loading vs. change data capture.

Spark/Big Data · medium
11

Explain strategies for managing schema changes in PySpark over time.

Spark/Big Data · medium
12

How do you drop columns with null values in PySpark?

Spark/Big Data · medium
13

How do you handle data skewness in Spark?

Spark/Big Data · medium
14

How would you read data from a web API using PySpark?

Spark/Big Data · medium
15

What is Adaptive Query Execution (AQE) in Spark 3.x, and how does it improve performance?

Spark/Big Data · medium
