Question 1

Explain the differences between Repartition and Coalesce. When would you use each?

Accepted Answer

**Repartition(n)**: Full shuffle; creates exactly n partitions. Can increase or decrease. **Coalesce(n)**: Merges partitions without full shuffle; only decreases. **Why it matters**: Shuffle is expensive—network and disk I/O. Coalesce avoids shuffle when reducing partitions by merging within existing partitions. **When Repartition**: Increasing partitions, fixing skew (repartition by key), or before a join to align partition counts....

Question 2

Explain Fact and Dimension Tables with examples.

Accepted Answer

Architecture: Star schema centralizes measurable events in fact tables; dimensions provide semantic context. Why this design: Facts are append-heavy and grow unbounded; dimensions are smaller and change slowly. Separating them optimizes for different access patterns. Fact grain defines the entire schema—get it wrong and joins become wrong. Example: sales_fact (quantity, revenue, date_key, product_key, customer_key) at grain one row per transaction....

Question 3

Convert complex SQL (CTEs, window functions, subqueries) to production-grade PySpark. Discuss when to use spark.sql() vs. DataFrame API, and the implications for testability, partitioning, and execution predictability.

Accepted Answer

Two approaches: spark.sql() for direct translation and DataFrame API for programmatic logic. SQL approach: createOrReplaceTempView, run ANSI-like SQL—fast parity, but string-based, harder to unit test, and execution plan less explicit. DataFrame API: composable, testable (pass mock DataFrames), explicit transformations....

Question 4

How do you drop columns with null values in PySpark?

Accepted Answer

Two distinct operations: (1) **Drop columns that are entirely null** (no non-null values): null_cols = [c for c in df.columns if df.filter(col(c).isNotNull()).count() == 0]; df = df.drop(*null_cols). **Caveat**: count() triggers a full scan—expensive on large tables. (2) **Drop rows with null in specified columns**: df.dropna(subset=["col1", "col2"]). **Scalability**: The column-null check is O(partitions × columns) and can be costly; consider sampling or inferring from schema/sample....

Question 5

Discuss Primary, Foreign, and Composite Keys.

Accepted Answer

**PK**: Uniquely identifies row; NOT NULL + unique. Surrogate (auto-increment) vs. natural. Surrogate preferred for stability.

**FK**: References PK in another table; referential integrity. Index FKs for join performance.

**Composite**: Multiple columns for uniqueness—e.g., (order_id, line_item_id)....

Question 6

How to optimize join of large and small tables in Spark?

Accepted Answer

**Broadcast**: df_large.join(broadcast(df_small), 'key'). Small table sent to all executors; no shuffle on large. **Threshold**: spark.sql.autoBroadcastJoinThreshold (default 10MB). **When**: Small table < threshold. **Hint**: broadcast() forces even if over threshold (use carefully). **Moderate size**: Repartition large on join key....

Question 7

Discuss common transformations used in Spark code.

Accepted Answer

**Why transformations matter**: Choice affects Catalyst optimization and performance. **Common**: map, filter, select, withColumn, drop, groupBy, agg, join, union, distinct. Window: row_number, rank, sum over. **Scalability trade-offs**: Narrow (map, filter) = no shuffle; wide (groupBy, join) = shuffle. Prefer narrow; minimize wide. **Cost implications**: UDF breaks optimization; built-ins preferred....

Question 8

Explain Delta Table features – Z-ordering and Time Travel.

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Z-ordering and Time Travel are core Delta Lake optimizations. Z-ordering physically colocates related data in the same files by ordering rows based on specified columns (e.g., date, customer_id), dramatically improving predicate pushdown and reducing I/O for range queries. Use OPTIMIZE table_name ZORDER BY (date, region)....

Question 9

Explain Spark Architecture – Driver, Executors, and Tasks.

Accepted Answer

**Section 1 — The Context (The 'Why')**
Spark's driver-executor architecture creates a single point of coordination: the driver builds the DAG and schedules tasks, while executors perform the actual work. Driver OOM from collect() or executor OOM from data skew are common production failures. A naive fix—increasing driver memory for an executor skew problem—wastes cost and does not solve the root cause....

Question 10

Explain Spark's execution process – Job/Stage/Task creation.

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Spark execution: (1) Action triggers job; (2) DAG Scheduler splits the DAG into stages at wide transformation boundaries (shuffle dependency); (3) Each stage has tasks (one per partition); (4) Task Scheduler launches tasks on executors. Stage boundary occurs when a shuffle is required—e.g., before a reduceByKey or join....

Datametica Data Engineer Interview Questions

Difficulty Breakdown

Key Topics Covered

How to Use This Guide

Companies asking these questions

All 13 Questions

More Interview Prep Guides

Unlock All Expert Answers