DataEngPrep.tech

Interview Questions

Real questions from top companies in Spark/Big Data · easy

1

What is the difference between Managed and External tables in Hive/Spark?

Spark/Big Data · easy · spark · 0.4 min read
Citi · Dunnhumby · Fragma Data Systems
2

When would you architecturally choose Dataset[T] over DataFrame in a Scala Spark pipeline, and what are the scalability and portability trade-offs? Include type-safety benefits vs. operational constraints.

Spark/Big Data · easy
3

What is the difference between Managed and External Tables in Databricks?

Spark/Big Data · easy
4

A JSON file with evolving schema needs to be ingested into a DataFrame. How would you handle new fields dynamically in PySpark without breaking the job for previous structures?

Spark/Big Data · easy
5

A task intermittently fails due to external API limitations. How would you configure Airflow retries and alerts to manage this situation efficiently?

Spark/Big Data · easy
6

Explain accumulators and broadcast variables in Spark.

Spark/Big Data · easy
7

What approaches do you use to handle multiple tasks within a sprint?

Spark/Big Data · easy
8

cache() vs persist(): Explain the difference and the use cases for caching and persisting data in Spark, including storage levels.

Spark/Big Data · easy
9

Can you explain dynamic resource allocation in Spark? How does it help optimize job performance?

Spark/Big Data · easy
