Question 1

What is the difference between repartition and coalesce in Apache Spark?

Accepted Answer

**Repartition(n)**: Performs a full shuffle to redistribute data across exactly `n` partitions. Can increase or decrease partition count. Uses hash partitioning by default—all rows are exchanged across the network.

**Coalesce(n)**: Merges existing partitions into fewer partitions without a full shuffle. Only decreases partition count....

Question 2

What is the difference between SparkSession and SparkContext in Spark?

Accepted Answer

**SparkContext** (Spark 1.x): Low-level entry point for RDD operations. Manages cluster connections, configuration, and RDD creation. One active SparkContext per JVM. RDD-only.

**SparkSession** (Spark 2.0+): Unified entry point subsuming SparkContext, SQLContext, HiveContext, StreamingContext. Provides DataFrame, Dataset, SQL, and Structured Streaming APIs....

Question 3

What is the difference between partitioning and bucketing in Spark, and when would you use bucketing?

Accepted Answer

**Partitioning**: Physically divides data by column values (e.g., date, region); enables partition pruning; one directory per partition value. **Bucketing**: Divides data within a partition into a fixed number of files via hash of bucketing column(s); co-locates same-key rows. **When to bucket**: Frequent joins or group-bys on a column (e.g., user_id). Same bucket count on both sides enables sort-merge join without shuffle....

Question 4

What strategies can you use to handle skewed data in Spark?

Accepted Answer

**1. Salting**: Add random suffix to skewed keys to spread load; requires two-phase aggregation. **2. Two-phase aggregation**: Aggregate with salted key, then aggregate again without salt. **3. Broadcast**: For small dimension tables, broadcast to avoid shuffle. **4. Custom partitioning**: Pre-partition by known skewed keys. **5. Increase partitions**: Spreads work but doesn't fix root cause. **6. AQE Skew Join (Spark 3.0+)**: Automatically splits skewed partitions....

Question 5

What is the difference between Managed and External tables in Hive/Spark?

Accepted Answer

Managed: Spark/Hive owns metadata and data. DROP TABLE deletes both. External: Metadata only; data lives in specified location. DROP TABLE drops metadata; data remains. Why External: Shared data across tools (Athena, Glue, Spark); production datasets where accidental DROP would be catastrophic; data lifecycle independent of table. Why Managed: Ephemeral tables, temp outputs; simpler—no orphan paths. Cost: Managed DROP can trigger expensive recursive deletes on object store....

Question 6

What is a window function? Explain with an example.

Accepted Answer

**Architectural Logic:** Window functions compute values over a set of rows related to the current row (defined by OVER) without collapsing rows like GROUP BY. **Why windows over self-joins:** Single scan, no explosion; optimal for rankings, running totals, period-over-period. **Scalability:** PARTITION BY and ORDER BY affect sort/spill; large windows with RANGE BETWEEN UNBOUNDED PRECEDING can spill to disk. **Cost:** ROWS vs....

Question 7

Explain the concept of checkpointing in Spark and why it is important.

Accepted Answer

Checkpointing breaks RDD lineage by materializing to storage, trading storage I/O for DAG truncation. **Why it exists**: Long lineage (e.g., 1000+ stages in iterative algorithms like PageRank or ML) causes stack overflow in the driver and makes fault recovery expensive—Spark would replay the entire DAG. **Scalability trade-off**: Checkpoint is a full write; at scale, this is costly (network, disk). For batch, prefer persist/cache when lineage is < ~50 stages....

Question 8

Agile methodologies used?

Accepted Answer

**Scrum**: 2-week sprints, standups, planning, retros, refinement. Jira for epics, stories, bugs. **Kanban**: For support and incidents—continuous flow. **Data**: Sprint on pipeline features, schema changes, quality; incidents through Kanban....

Question 9

An existing job running longer suddenly: how to analyze the issue?

Accepted Answer

**STAR approach**: **Situation**: Job ran 2 hr, now 6 hr. **Task**: Find root cause. **Analysis**: (1) Data volume growth—compare input size. (2) Skew—Spark UI shows uneven partitions. (3) Resource contention—other jobs. (4) Source throttling—DB or API limits. (5) Partition pruning—missing partitions? **Actions**: Add partitions; salting for skew; increase resources; fix source limits. **Result**: Document in runbook; add monitoring....

Question 10

How is Oozie called?

Accepted Answer

**Invocation**: (1) CLI—oozie job -run -config job.properties; (2) REST—POST /oozie/v1/jobs?action=start; (3) Hue UI; (4) Cron; (5) Sub-workflow/shell from other jobs. Example: oozie job -oozie http://host:11000/oozie -run -config workflow.properties. Oozie coordinates Hadoop jobs....

Citi Data Engineer Interview Questions

Difficulty Breakdown

Key Topics Covered

How to Use This Guide

Companies asking these questions

All 39 Questions

More Interview Prep Guides

Unlock All Expert Answers