Spark & Big Data questions from BCG data engineering interviews.
These Spark & big data questions are sourced from BCG data engineering interviews, each with an expert-level answer. The set leans toward senior-level depth: 8 of 10 are tagged hard. Recurring themes are partitioning, Spark, and optimization; these patterns appear most often in real interviews and reward the deepest preparation. Many of these questions also surface at Citi and Dunnhumby, so the preparation transfers across companies. The average answer takes about a minute to read; plan roughly an hour to work through the full set thoughtfully.
This collection contains 10 curated questions: 0 easy, 2 medium, and 8 hard. The distribution skews toward harder problems, reflecting the depth expected in senior-level interviews.
The most frequently tested areas in this set are partition (10), spark (6), optimization (5), join (4), python (2), and airflow (1). Focusing on these topics will give you the highest return on your preparation time.
Medium-difficulty questions form the bulk of real interviews, so spend the most time there and practice explaining your reasoning out loud. Hard questions often appear in senior- and staff-level rounds; attempt them once you're comfortable with the basics. For each question, try answering before revealing the solution. Use our AI Mock Interview to simulate real interview conditions and get instant feedback on your responses.
What is the difference between repartition and coalesce in Apache Spark?
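The key distinction to articulate: repartition performs a full shuffle and can increase or decrease the partition count, while coalesce only merges existing partitions downward and avoids a shuffle. A minimal pure-Python sketch of the data movement, where lists stand in for partitions (the function names are illustrative, not Spark's API):

```python
def repartition(partitions, n):
    """Full shuffle: every record is rehashed into n new partitions."""
    new = [[] for _ in range(n)]
    for part in partitions:
        for record in part:
            new[record % n].append(record)  # every record moves
    return new

def coalesce(partitions, n):
    """No shuffle: existing partitions are merged locally down to n."""
    n = min(n, len(partitions))  # coalesce cannot increase the count
    new = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        new[i % n].extend(part)  # whole partitions are concatenated
    return new

parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(len(repartition(parts, 8)))  # 8 -- repartition can grow the count
print(len(coalesce(parts, 8)))     # 4 -- coalesce cannot exceed the original
print(len(coalesce(parts, 2)))     # 2 -- merging down works without a shuffle
```

This is also why coalesce is the cheap choice when shrinking partitions before a write, while repartition is needed to rebalance or grow parallelism.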
What strategies can you use to handle skewed data in Spark?
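One standard answer is key salting: append a random suffix to the hot key so its rows spread across several partitions (replicating the matching side of a join accordingly). A pure-Python sketch of how salting rebalances partition sizes, using a simple deterministic hash for illustration (not Spark code):

```python
import random

def bucket(key, n):
    """Toy deterministic hash: sum of bytes mod partition count."""
    return sum(key.encode()) % n

def partition_sizes(keys, n_parts):
    """Count how many rows land in each hash partition."""
    sizes = [0] * n_parts
    for k in keys:
        sizes[bucket(k, n_parts)] += 1
    return sizes

# A heavily skewed distribution: one hot key dominates.
keys = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]

# Without salting, all 9000 "hot" rows land in a single partition.
before = partition_sizes(keys, 8)

# With salting, each "hot" row gets a suffix in [0, 8), spreading
# the hot key across up to 8 distinct hash buckets.
random.seed(0)
salted = [f"{k}_{random.randrange(8)}" if k == "hot" else k for k in keys]
after = partition_sizes(salted, 8)

print(max(before), max(after))  # the largest partition shrinks dramatically
```

In an interview, pair this with the alternatives: broadcast the small side, enable adaptive query execution's skew-join handling, or isolate the hot key into its own job.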
Design a Delta table layout for a mixed workload: point lookups by user_id, range scans by date, and full partition scans. Compare partitioning vs. Z-ordering: when to use each, and the rewrite-cost trade-off.
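The intuition behind Z-ordering: interleave the bits of several columns so that rows close in either dimension tend to land in the same files, letting min/max file statistics prune on both user_id and date without physically partitioning on either. A hypothetical sketch of the bit interleaving behind a Z-value (illustrative only, not Delta's internal implementation):

```python
def z_value(x, y, bits=16):
    """Interleave the bits of two column values into one Morton code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # even bit positions: x
        z |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions: y
    return z

# Rows sorted by Z-value stay roughly clustered in BOTH dimensions,
# so per-file min/max statistics can prune on either column.
rows = [(u, d) for u in range(4) for d in range(4)]
rows.sort(key=lambda r: z_value(*r))
print(rows[:4])  # the first "file" holds the 2x2 square near the origin
```

Partitioning gives hard directory-level pruning on one column but multiplies small files; Z-ordering gives softer, statistics-based pruning on several columns at the cost of periodic OPTIMIZE rewrites.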
Compare Hadoop and Spark. Which one would you choose for a real-time application, and why?
Explain how HDFS (Hadoop Distributed File System) stores data across nodes.
Explain how to schedule an automated task using Apache Airflow.
How do Spark transformations differ from actions? Provide examples of each.
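The core distinction: transformations (map, filter, select) are lazy and only build up an execution plan, while actions (count, collect, show) force the plan to run. A pure-Python sketch of the lazy-pipeline idea (class and method names are illustrative, not Spark's API):

```python
class LazyDataset:
    """Transformations record a plan; actions execute it."""
    def __init__(self, data, plan=None):
        self.data = data
        self.plan = plan or []            # recorded, not-yet-run steps

    def map(self, fn):                    # transformation: nothing runs yet
        return LazyDataset(self.data, self.plan + [("map", fn)])

    def filter(self, pred):               # transformation: nothing runs yet
        return LazyDataset(self.data, self.plan + [("filter", pred)])

    def collect(self):                    # action: replay the whole plan now
        out = self.data
        for op, fn in self.plan:
            out = [fn(x) for x in out] if op == "map" else [x for x in out if fn(x)]
        return out

ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(len(ds.plan))  # 2 deferred steps recorded, no data touched yet
print(ds.collect())  # action runs the plan: [20, 30, 40]
```

Laziness is what lets Spark's optimizer fuse and reorder steps before any data moves, which is worth saying explicitly in the answer.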
How would you optimize Spark jobs for better performance?
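One of the highest-leverage optimizations to mention is broadcasting a small table so the join avoids shuffling the large side. A pure-Python sketch of the broadcast hash join idea (illustrative, not Spark internals; the sample tables are made up):

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Ship the small side everywhere as a hash map; stream the large side."""
    lookup = {row[key]: row for row in small_rows}   # the "broadcast" copy
    joined = []
    for row in large_rows:                           # no shuffle of the large side
        match = lookup.get(row[key])
        if match is not None:
            joined.append({**row, **match})          # inner-join semantics
    return joined

users = [{"user_id": 1, "country": "DE"}, {"user_id": 2, "country": "FR"}]
events = [{"user_id": 1, "event": "click"}, {"user_id": 3, "event": "view"}]
print(broadcast_hash_join(events, users, "user_id"))
# only user_id 1 matches; user_id 3 is dropped
```

Round out the answer with caching reused datasets, tuning shuffle partition counts, enabling adaptive query execution, and fixing skew before touching cluster sizing.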
What role does Kafka play in real-time data streaming pipelines?
What strategies would you use to reduce latency in a streaming data pipeline?
Get full access to 1,800+ expert answers, AI mock interviews, and personalized progress tracking.