Spark & Big Data questions from Incedo data engineering interviews.
These Spark and big data questions are sourced from Incedo data engineering interviews, and each includes an expert-level answer. The set leans toward senior-level depth: 8 of the 10 questions are tagged hard. Recurring themes are Spark, partitioning, and optimization; these patterns appear most often in real interviews and reward the deepest preparation. Many of these questions also surface at Altimetrik and Swiggy, so the preparation transfers across companies. The average answer takes about a minute to read, so plan roughly an hour to work through the full set thoughtfully.
This collection contains 10 curated questions: 1 easy, 1 medium, and 8 hard. The distribution skews toward harder problems, reflecting the depth expected in senior-level interviews.
The most frequently tested areas in this set are Spark (7), partitioning (6), optimization (5), joins (3), SQL (2), and window functions (2). Focusing on these topics will give you the highest return on your preparation time.
Start with the easy questions to warm up and solidify fundamentals. Medium-difficulty questions form the bulk of real interviews — spend the most time here and practice explaining your reasoning out loud. Hard questions often appear in senior and staff-level rounds; attempt them after you're comfortable with the basics. For each question, try answering before revealing the solution. Use our AI Mock Interview to simulate real interview conditions and get instant feedback on your responses.
What is the difference between SparkSession and SparkContext in Spark?
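Before reading the full answer, here is a minimal PySpark sketch of how the two relate (the app name is illustrative):

```python
from pyspark.sql import SparkSession

# SparkSession is the unified entry point for DataFrames, SQL, and the catalog;
# it wraps the lower-level SparkContext, which manages the cluster connection
# and the RDD API.
spark = SparkSession.builder.appName("session-vs-context").getOrCreate()
sc = spark.sparkContext                # the SparkContext lives inside the session

df = spark.range(5)                    # DataFrame API via SparkSession
rdd = sc.parallelize([1, 2, 3])        # RDD API via SparkContext
print(df.count(), rdd.sum())
```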
How do you handle late-arriving data in Spark Structured Streaming?
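One common approach is event-time watermarking; a sketch assuming a JSON event stream (the path and schema are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("late-data").getOrCreate()

# Hypothetical stream of events keyed by event_time.
events = (
    spark.readStream
    .schema("user_id STRING, event_time TIMESTAMP, value DOUBLE")
    .json("/data/events")
)

# The watermark tells Spark how long to keep window state open for stragglers:
# events arriving more than 10 minutes behind the max seen event_time are dropped.
windowed = (
    events.withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "user_id")
    .agg(F.sum("value").alias("total"))
)

query = windowed.writeStream.outputMode("update").format("console").start()
```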
What is the small-file problem in Spark, and how do you solve it?
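The usual fixes are compaction at write time or in place; a sketch with illustrative paths (the OPTIMIZE command assumes Delta Lake / Databricks):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compaction").getOrCreate()

df = spark.read.parquet("/data/raw_events")      # folder with thousands of tiny files

# Fix 1: control the number of output files at write time.
(df.repartition(16)                              # ~16 larger files instead of thousands
   .write.mode("overwrite")
   .parquet("/data/compacted_events"))

# Fix 2: on Delta Lake, compact existing files in place.
spark.sql("OPTIMIZE delta.`/data/delta_events`")
```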
Design a Delta table layout for mixed workload: point lookups by user_id, range scans by date, and full partition scans. Compare partitioning vs. Z-ordering—when to use each, and the rewrite cost trade-off.
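A sketch of one possible layout, assuming Databricks/Delta Lake and illustrative table and column names: partition on the coarse date key for range and partition scans, Z-order on user_id for point lookups.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-layout").getOrCreate()

# Partition on the low-cardinality range-scan key so date filters prune whole directories.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        user_id    STRING,
        event_date DATE,
        payload    STRING
    )
    USING DELTA
    PARTITIONED BY (event_date)
""")

# Z-order on the high-cardinality lookup key so point queries on user_id can
# skip most files within each partition. ZORDER rewrites data files, so it is
# run periodically rather than on every write.
spark.sql("OPTIMIZE events ZORDER BY (user_id)")
```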
Architect incremental load in ADF + Databricks with idempotency, late-arrival handling, and cost/scalability implications of watermark vs. change data capture.
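A hedged sketch of the Databricks side using a high-watermark filter plus MERGE for idempotency; the table names, keys, and watermark value are hypothetical, and in practice the watermark would come from a control table maintained by the ADF pipeline.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

# Hypothetical high watermark: the max modified_at already loaded, normally
# read from a control table and passed in as a pipeline parameter by ADF.
last_watermark = "2024-01-01 00:00:00"

incoming = (
    spark.read.table("source_db.orders")
    .where(F.col("modified_at") > F.lit(last_watermark))
)

# MERGE keeps the load idempotent: re-running the same slice upserts instead
# of duplicating rows, which also absorbs late-arriving updates to old keys.
target = DeltaTable.forName(spark, "lake.orders")
(target.alias("t")
    .merge(incoming.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```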
What is the difference between Managed and External Tables in Databricks?
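A quick illustration in Spark SQL (the table names and storage path are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("managed-vs-external").getOrCreate()

# Managed table: Databricks controls both metadata and data files,
# and DROP TABLE deletes the underlying data.
spark.sql("CREATE TABLE IF NOT EXISTS sales_managed (id INT, amount DOUBLE) USING DELTA")

# External table: metadata points at a path you manage; DROP TABLE removes
# only the metadata and leaves the files in place.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_external (id INT, amount DOUBLE)
    USING DELTA
    LOCATION '/mnt/raw/sales_external'
""")
```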
Explain PySpark's Catalyst Optimizer.
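The easiest way to see Catalyst at work is to compare the plans it produces; a small sketch:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)

# Catalyst turns the query into a logical plan, applies rule-based rewrites
# (predicate pushdown, column pruning, constant folding), and then chooses a
# physical plan. explain() prints each stage.
query = df.select("id", "bucket").filter(F.col("bucket") == 3)
query.explain(mode="extended")   # parsed, analyzed, optimized logical, and physical plans
```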
Explain caching techniques in Databricks.
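A short PySpark sketch of DataFrame caching (the table name is illustrative); note that Databricks also has a separate disk cache for Parquet/Delta reads that works independently of this API.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-demo").getOrCreate()

orders = spark.read.table("lake.orders")          # illustrative table name

# cache() = persist(MEMORY_AND_DISK): the data is materialized on the first
# action and reused by later queries on the same DataFrame.
hot = orders.filter("status = 'OPEN'").cache()
hot.count()                                       # materializes the cache
hot.groupBy("region").count().show()              # served from cache

hot.unpersist()                                   # free the memory when done

# persist() lets you choose an explicit storage level for colder data.
archive = orders.persist(StorageLevel.DISK_ONLY)
```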
What is the difference between Lazy Evaluation and Eager Execution in PySpark?
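A minimal sketch of the distinction: transformations only build up a plan, and nothing runs until an action is called.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lazy-eval").getOrCreate()

df = spark.range(1_000_000)

# Transformations are lazy: these lines only record steps in the query plan.
doubled = df.withColumn("twice", F.col("id") * 2)
filtered = doubled.filter(F.col("twice") > 10)

# The action triggers execution of the whole plan at once, which is what lets
# Catalyst optimize across the entire chain of transformations.
print(filtered.count())
```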
Get full access to 1,800+ expert answers, AI mock interviews, and personalized progress tracking.