Real interview questions asked at Snowflake. Practice the most frequently asked questions and land your next role.
Snowflake data engineering interviews test your ability across multiple domains. These questions are sourced from real Snowflake interview experiences and sorted by frequency, so practice the ones that matter most. The set leans toward senior-level depth: 10 of the 25 questions are tagged hard. Recurring themes are partitioning, joins, and Spark; these patterns appear most often in real interviews and reward the deepest preparation. Many of these questions also surface at HashedIn and BCG, so the preparation transfers across companies. The average answer takes about a minute to read, so plan roughly an hour to work through the full set thoughtfully.
This collection contains 25 curated questions: 9 easy, 6 medium, and 10 hard. The distribution skews toward harder problems, reflecting the depth expected in senior-level interviews.
The most frequently tested areas in this set are partitioning (11), joins (11), Spark (10), optimization (8), SQL (8), and Snowflake (8). Focusing on these topics will give you the highest return on your preparation time.
Start with the easy questions to warm up and solidify fundamentals. Medium-difficulty questions form the bulk of real interviews — spend the most time here and practice explaining your reasoning out loud. Hard questions often appear in senior and staff-level rounds; attempt them after you're comfortable with the basics. For each question, try answering before revealing the solution. Use our AI Mock Interview to simulate real interview conditions and get instant feedback on your responses.
What is the difference between repartition and coalesce in Apache Spark?
CDC during migration: explain approaches for real-time Change Data Capture.
Prioritize Spark optimizations by impact and effort. Discuss partitioning strategy, caching policy, join selection, shuffle reduction, and when each becomes a scalability or cost bottleneck.
Walk through the three AQE features in Spark 3.x (partition coalescing, join-strategy switching, skew-join handling): how they operate at shuffle boundaries, which configs enable them, and what happens when AQE cannot help.
What is Adaptive Query Execution (AQE) in Spark 3.x, and how does it improve performance?
Challenges faced in translating requirements into technical solutions?
Calling external APIs from Airflow tasks?
Airflow operators, hooks, and scheduler functionality?
Grouping and aggregation functions?
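To practice this one hands-on, here is a minimal sketch using Python's built-in sqlite3 module; the `sales` table and its columns are illustrative, not from the source. It shows how GROUP BY collapses rows into one row per group while aggregate functions summarize each group:

```python
import sqlite3

# Hypothetical sales table used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 50.0), ("west", 200.0)],
)

# GROUP BY produces one output row per region; COUNT, SUM, and AVG
# each summarize the rows belonging to that region.
rows = conn.execute(
    "SELECT region, COUNT(*), SUM(amount), AVG(amount) "
    "FROM sales GROUP BY region ORDER BY region"
).fetchall()
```

In an interview, be ready to explain which columns may appear in the SELECT list (grouped columns and aggregates) and how HAVING filters groups after aggregation, whereas WHERE filters rows before it.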
Building ETL pipelines to capture changes when new records are inserted into source tables?
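One common answer here is an incremental extract driven by a high-watermark column. The sketch below models that pattern in plain Python with sqlite3; the `orders` table, `extract_new` helper, and watermark handling are all hypothetical names for illustration:

```python
import sqlite3

# Illustrative source table with a monotonically increasing id.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
src.executemany("INSERT INTO orders (item) VALUES (?)", [("a",), ("b",), ("c",)])

def extract_new(conn, last_seen_id):
    """Pull only rows inserted after the stored high-watermark."""
    rows = conn.execute(
        "SELECT id, item FROM orders WHERE id > ? ORDER BY id", (last_seen_id,)
    ).fetchall()
    new_watermark = rows[-1][0] if rows else last_seen_id
    return rows, new_watermark

# First run: everything after watermark 0 is new.
batch1, wm = extract_new(src, 0)

# A new row arrives between pipeline runs.
src.execute("INSERT INTO orders (item) VALUES ('d')")

# Second run: only the row past the stored watermark is picked up.
batch2, wm = extract_new(src, wm)
```

The watermark would normally be persisted in a state store between runs; also mention that this pattern only catches inserts, which is a natural segue into CDC for updates and deletes.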
Designing backend architecture for SQL Warehouse?
Integration of Snowflake with external data sources such as S3, GCS, and Blob Storage?
Motivation for joining Snowflake?
Self-joins to compare employee salaries?
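A classic variant of this question asks for employees who earn more than their manager. A minimal sketch with sqlite3, using an illustrative `employees` table (names and salaries are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER, name TEXT, salary INTEGER, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [(1, "Ana", 90000, None), (2, "Ben", 95000, 1), (3, "Cal", 70000, 1)],
)

# Self-join: alias the same table twice, once as the employee (e)
# and once as the manager (m), then compare salaries across the join.
rows = conn.execute(
    "SELECT e.name FROM employees e "
    "JOIN employees m ON e.manager_id = m.id "
    "WHERE e.salary > m.salary"
).fetchall()
```

The key point to articulate is that a self-join is just an ordinary join where both sides are the same table under different aliases.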
Snowflake Tech Stack: Deployment on Azure, cluster sizing considerations, and overall data warehouse design?
Strategies for working with busy team leads?
Use cases for internal staging in Snowflake?
Using Airflow to trigger and manage ETL jobs?
Approaches to handling multiple tasks within a sprint?
Broadcast hash joins vs. shuffle sort-merge joins?
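Conceptually, a broadcast hash join ships the small table to every worker, builds a hash map once, and streams the large table against it with no shuffle of the large side. The sketch below models that idea in-process with plain Python; the tuples and names are illustrative, not Spark APIs:

```python
# Small dimension table: (region_id, region_name). In a real broadcast join,
# this side is copied to every executor.
small = [(1, "US"), (2, "EU")]

# Large fact table: (order_id, region_id). In a real broadcast join, each
# executor probes only its own partition of this side.
large = [(101, 1), (102, 2), (103, 1)]

broadcast = dict(small)  # hash map built once from the broadcast side
joined = [
    (order_id, broadcast[region_id])
    for order_id, region_id in large
    if region_id in broadcast  # inner-join semantics
]
```

Contrast this with a sort-merge join, where both sides are shuffled by the join key, sorted, and merged; Spark picks broadcast when one side fits under the broadcast threshold, which avoids that shuffle entirely.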
Caching vs. persisting with storage levels in Spark?
Logical Plan workflow when submitting Spark queries?
High-level ETL Pipeline Design using tools like Kafka or Flink for new use cases?
How to capture data lineage for Spark code, using a DataHub-based example?
How to set up ETL pipelines using Apache Airflow?