Question 1

Can you explain the architecture of Apache Spark and its components?

Accepted Answer

**Section 1 — The Context (The 'Why')**
Apache Spark's distributed execution model faces the core challenge of coordinating hundreds of executors while avoiding driver bottlenecks and shuffle storms. At scale, the driver's single-threaded scheduling and result aggregation become failure points....

Question 2

Accumulators - use as shared variable for write-only operations

Accepted Answer

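Accumulators: tasks on the workers add to them; only the driver reads the value. Use them for record counts, error counts, and validation metrics. **Why**: They aggregate counters across partitions without a shuffle. **Not for control flow**: Executors cannot read the value, and updates made inside transformations can be applied more than once when a task is retried or a stage is recomputed; only updates made inside actions are guaranteed to be applied once per task. **Scalability**: Use accumulators for observability; use broadcast variables for read-only sharing....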
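
A minimal sketch of the counting pattern described above, assuming a live `SparkContext` is passed in as `sc`. The helper names `is_valid` and `count_bad_records` and the two-field CSV rule are hypothetical, chosen only for illustration; `sc.accumulator`, `Accumulator.add`, and `RDD.foreach` are standard PySpark calls.

```python
def is_valid(line, expected_fields=2):
    """Return True when a CSV line has the expected number of non-empty fields."""
    fields = line.split(",")
    return len(fields) == expected_fields and all(f.strip() for f in fields)


def count_bad_records(sc, lines):
    """Count invalid lines with an accumulator; `sc` is a live SparkContext."""
    bad = sc.accumulator(0)  # created on the driver, starts at 0

    def check(line):
        # Runs on the executors: tasks may only add to the accumulator.
        if not is_valid(line):
            bad.add(1)

    # foreach is an action, so each task's updates are applied once
    # even if the task is retried.
    sc.parallelize(lines).foreach(check)
    return bad.value  # read on the driver after the action has finished
```

With a running session, `count_bad_records(spark.sparkContext, ["1,a", "bad", "2,b"])` would be the call to make. Doing the update inside `foreach` rather than inside a `map` is the design choice that avoids double counting on retries.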

Question 3

Broadcast join - how it optimizes joins

Accepted Answer

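**Why broadcast optimizes**: It eliminates the shuffle of the large table, because each executor holds a full copy of the small table locally. **Mechanism**: The driver collects the small table, serializes it, and sends it to all executors. Each partition of the large table then joins locally, so the large table never crosses the network. **Scalability trade-offs**: The small table must fit in memory on the driver and on every executor; one copy is held per executor, so the memory cost multiplies with executor count. Spark chooses this plan automatically when the small side's estimated size is below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default)....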
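
The mechanism can be modeled in plain Python. This is a simplified sketch of what a broadcast hash join does, not Spark's actual implementation: the function names and sample rows are hypothetical. In PySpark itself the same plan is requested by wrapping the small DataFrame in `broadcast()` from `pyspark.sql.functions` before joining.

```python
from collections import defaultdict


def build_lookup(small_rows, key):
    """Driver side: index the small table by join key (this is what gets broadcast)."""
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)
    return dict(lookup)


def join_partition(partition, lookup, key):
    """Executor side: inner-join one partition of the large table against the local copy."""
    joined = []
    for row in partition:
        for match in lookup.get(row[key], []):
            joined.append({**row, **match})
    return joined


# Hypothetical sample data: the large table is split into two partitions.
countries = [{"code": "IN", "name": "India"}, {"code": "US", "name": "United States"}]
order_partitions = [
    [{"order_id": 1, "code": "IN"}, {"order_id": 2, "code": "US"}],
    [{"order_id": 3, "code": "IN"}, {"order_id": 4, "code": "FR"}],  # FR has no match
]

lookup = build_lookup(countries, "code")
# Each partition is joined independently; no large-table rows move between partitions.
result = [row for part in order_partitions for row in join_partition(part, lookup, "code")]
```

Every partition needs the whole `lookup` to do its work, which is why the small table's size, multiplied by the number of executors, sets the memory cost.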

Question 4

Databricks notebooks vs. Fabric notebooks - differences

Accepted Answer

**Why comparison matters**: Ecosystem lock-in and integration differ. **Databricks**: Spark-centric; Delta Lake; ML Runtime; AWS/Azure/GCP. **Fabric**: Microsoft ecosystem; OneLake; Power BI integration; Azure-native. Both support notebooks, Spark, lakehouse. **Scalability trade-offs**: Databricks = multi-cloud; Fabric = Azure-first. **Cost implications**: Different billing models; Fabric bundled with Microsoft stack....

Question 5

Schema evolution - techniques for handling schema changes in PySpark

Accepted Answer

**Why it matters**: Upstream sources add, drop, and retype columns over time. A pipeline that assumes a fixed schema either fails outright or silently loses data, and the damage compounds across downstream jobs and teams.

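Schema evolution in PySpark: (1) The `mergeSchema` option on reads (Parquet) and writes (Delta). (2) `schema_of_json` to infer a schema from a sample JSON string. (3) `StructType` whose `StructField`s are declared with `nullable=True`, so older data that lacks a field still loads. (4) `withColumn` to add new columns. (5) Validate against the expected schema with `df.schema`....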
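
One way to handle newly added columns is to project every input onto a target schema before writing. The sketch below uses `selectExpr` in place of repeated `withColumn` calls; the helper names and the `TARGET` schema are hypothetical, and the column-name matching is case-sensitive, which is an assumption that may not suit every dataset.

```python
def build_select_exprs(existing_columns, target_schema):
    """Build selectExpr strings that project a DataFrame onto target_schema.

    target_schema is a list of (column_name, spark_sql_type) pairs.
    Columns missing from the input become typed NULLs; columns not listed
    in target_schema are dropped by the projection.
    """
    existing = set(existing_columns)
    exprs = []
    for name, sql_type in target_schema:
        if name in existing:
            exprs.append(f"CAST(`{name}` AS {sql_type}) AS `{name}`")
        else:
            exprs.append(f"CAST(NULL AS {sql_type}) AS `{name}`")
    return exprs


def align_to_schema(df, target_schema):
    """Apply the projection to a PySpark DataFrame via df.columns and df.selectExpr."""
    return df.selectExpr(*build_select_exprs(df.columns, target_schema))


# Hypothetical target schema: `discount` was added after older files were written.
TARGET = [("order_id", "BIGINT"), ("amount", "DOUBLE"), ("discount", "DOUBLE")]
exprs = build_select_exprs(["order_id", "amount"], TARGET)
```

Calling `align_to_schema(old_df, TARGET)` on a DataFrame read from older files would give it the same columns and types as newer data, so the two can be unioned or appended to the same table.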

Question 6

Writing Excel sheets to Delta tables in Databricks

Accepted Answer

**Why It Matters (Architectural Logic)**: Excel is common among business users, so pipelines must handle it reliably. Define the schema explicitly; inference is slow and inconsistent.

Excel to Delta in Databricks: For small files, read the workbook with `pd.read_excel()` (pandas uses `openpyxl` for `.xlsx`; `xlrd` 2.x reads only legacy `.xls`), then convert: `spark_df = spark.createDataFrame(pd_df)`. Or use the `com.crealytics.spark.excel` package, which must be installed on the cluster: `spark.read.format("com.crealytics.spark.excel").option("header", "true").load("dbfs:/path/file.xlsx")`....

Nihilent Spark & Big Data Interview Questions
