Question 1

What is the difference between repartition and coalesce in Apache Spark?

Accepted Answer

**Repartition(n)**: Performs a full shuffle to redistribute data across exactly `n` partitions. Can increase or decrease partition count. Uses hash partitioning by default—all rows are exchanged across the network.

**Coalesce(n)**: Merges existing partitions into fewer partitions without a full shuffle. Only decreases partition count....

Question 2

What is the difference between SparkSession and SparkContext in Spark?

Accepted Answer

**SparkContext** (Spark 1.x): Low-level entry point for RDD operations. Manages cluster connections, configuration, and RDD creation. One active SparkContext per JVM. RDD-only.

**SparkSession** (Spark 2.0+): Unified entry point subsuming SparkContext, SQLContext, HiveContext, StreamingContext. Provides DataFrame, Dataset, SQL, and Structured Streaming APIs....

Question 3

What strategies can you use to handle skewed data in Spark?

Accepted Answer

**1. Salting**: Add random suffix to skewed keys to spread load; requires two-phase aggregation. **2. Two-phase aggregation**: Aggregate with salted key, then aggregate again without salt. **3. Broadcast**: For small dimension tables, broadcast to avoid shuffle. **4. Custom partitioning**: Pre-partition by known skewed keys. **5. Increase partitions**: Spreads work but doesn't fix root cause. **6. AQE Skew Join (Spark 3.0+)**: Automatically splits skewed partitions....

Question 4

What is the difference between Managed and External tables in Hive/Spark?

Accepted Answer

Managed: Spark/Hive owns metadata and data. DROP TABLE deletes both. External: Metadata only; data lives in specified location. DROP TABLE drops metadata; data remains. Why External: Shared data across tools (Athena, Glue, Spark); production datasets where accidental DROP would be catastrophic; data lifecycle independent of table. Why Managed: Ephemeral tables, temp outputs; simpler—no orphan paths. Cost: Managed DROP can trigger expensive recursive deletes on object store....

Question 5

Explain the concept of checkpointing in Spark and why it is important.

Accepted Answer

Checkpointing breaks RDD lineage by materializing to storage, trading storage I/O for DAG truncation. **Why it exists**: Long lineage (e.g., 1000+ stages in iterative algorithms like PageRank or ML) causes stack overflow in the driver and makes fault recovery expensive—Spark would replay the entire DAG. **Scalability trade-off**: Checkpoint is a full write; at scale, this is costly (network, disk). For batch, prefer persist/cache when lineage is < ~50 stages....

Question 6

Describe how to pass data between tasks in Airflow using XComs.

Accepted Answer

**Why XCom matters**: Cross-task communication for DAG orchestration; misuse = performance and size issues. **Mechanism**: Push via return value or `xcom_push`; pull via `xcom_pull`. Stored in Airflow metadata DB by default. **Scalability trade-offs**: Default backend has ~1MB limit; large payloads = DB bloat. **Cost implications**: XCom in DB = latency for large data....

Question 7

Explain the concept of RDD, DataFrame, and Dataset in PySpark.

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

RDD: low-level, immutable, distributed collection (Spark 1.x). DataFrame: structured, Catalyst-optimized, SQL-capable (Spark 2.x). Dataset: typed DataFrame (Scala/Java). In PySpark, DataFrame is primary (no typed Dataset). RDD: sc.parallelize(); DataFrame: spark.read.parquet(); Dataset: Scala only....

Question 8

Explain the concept of consumer groups in Kafka. How do they affect message processing?

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Kafka consumer groups: consumers in a group share topic partitions; each partition assigned to one consumer. Adding consumers increases parallelism (up to partition count); more consumers than partitions equals idle. Offsets tracked per group. Example: 6 partitions, 3 consumers yields 2 partitions each....

Question 9

Explain the difference between TriggerDagRunOperator and ExternalTaskSensor in Airflow.

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

TriggerDagRunOperator: triggers another DAG as a task; creates a new run of that DAG. ExternalTaskSensor: waits for a task in another DAG to complete (polling). Use TriggerDagRun when you need to explicitly start a child DAG. Use ExternalTaskSensor when your DAG depends on completion of another DAG's task (e.g., wait for upstream pipeline)....

Question 10

How do you ensure data quality and consistency across different stages of a data pipeline?

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Data quality across pipeline: (1) Schema validation; (2) Expectations (Great Expectations, DLT); (3) Deduplication; (4) Row/column checks; (5) Monitoring and alerting. Bronze: schema; Silver: business rules; Gold: aggregated consistency....

Citi Spark & Big Data Interview Questions

Difficulty Breakdown

Key Topics Covered

How to Use This Guide

Companies asking these questions

All 16 Questions

More Interview Prep Guides

Unlock All Expert Answers

Citi Spark & Big Data Interview Questions

Difficulty Breakdown

Key Topics Covered

How to Use This Guide

Companies asking these questions

All 16 Questions

More Interview Prep Guides

Unlock All Expert Answers