Question 1

Explain Common Table Expressions (CTEs) and their benefits.

Accepted Answer

**Architectural Logic**: CTEs are named subqueries defined in a WITH clause and referenced like tables in the statement that follows. Whether a CTE is inlined into the outer query or materialized once depends on the engine. They provide logical decomposition without forcing physical materialization. **Why**: Readability and reuse: complex pipelines split into stages (raw → cleansed → aggregated). Recursion (WITH RECURSIVE) handles hierarchies (org charts, bill-of-materials). Some engines inline CTEs; others can materialize them for reuse (e.g., PostgreSQL 12+ with AS MATERIALIZED)....
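A minimal sketch of both uses, run against an in-memory SQLite database through Python's `sqlite3` module. The `orders` and `employees` tables and their rows are hypothetical, and SQLite is only a convenient stand-in; other engines may plan the same CTEs differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, 'US', 100.0), (2, 'US', NULL), (3, 'EU', 40.0), (4, 'EU', 60.0);
    CREATE TABLE employees (id INTEGER, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES (1, 'Ada', NULL), (2, 'Ben', 1), (3, 'Cy', 2);
""")

# Staged pipeline: raw -> cleansed -> aggregated, each stage a named CTE.
staged = conn.execute("""
    WITH cleansed AS (
        SELECT region, amount FROM orders WHERE amount IS NOT NULL
    ),
    aggregated AS (
        SELECT region, SUM(amount) AS total FROM cleansed GROUP BY region
    )
    SELECT region, total FROM aggregated ORDER BY region
""").fetchall()

# Recursive CTE: start at the root (no manager) and walk down the hierarchy.
org_chart = conn.execute("""
    WITH RECURSIVE chain(id, name, depth) AS (
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.id, e.name, c.depth + 1
        FROM employees e JOIN chain c ON e.manager_id = c.id
    )
    SELECT name, depth FROM chain ORDER BY depth
""").fetchall()

print(staged)     # one total per region
print(org_chart)  # each employee with their depth below the root
```

Each stage can be queried on its own while debugging, which is most of the readability benefit.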

Question 2

Explain SQL Window Functions with examples.

Accepted Answer

**Architectural Logic**: Window functions compute over a "frame" of rows related to the current row without collapsing rows. Syntax: func() OVER (PARTITION BY ... ORDER BY ... [frame]). Categories: Ranking (ROW_NUMBER, RANK, DENSE_RANK), Aggregate (SUM, AVG over partitions), Value (LAG, LEAD, FIRST_VALUE). **Why**: Enable row-level analytics (running totals, moving averages, prior/next comparisons) without self-joins. Equivalent self-joins read the table more than once and are usually slower and harder to maintain....
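As an illustration, the sketch below computes a running total, a prior-row comparison, and a per-region ranking in one pass. It assumes an SQLite build with window-function support (3.25 or later); the `sales` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER);
    INSERT INTO sales VALUES
        ('US', 1, 10), ('US', 2, 30), ('US', 3, 20),
        ('EU', 1, 5),  ('EU', 2, 15);
""")

rows = conn.execute("""
    SELECT region, day, amount,
           -- Aggregate window: running total within each region.
           SUM(amount) OVER (PARTITION BY region ORDER BY day
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
           -- Value window: previous day's amount (NULL for the first day).
           LAG(amount) OVER (PARTITION BY region ORDER BY day) AS prev_amount,
           -- Ranking window: position by amount within the region.
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS rank_in_region
    FROM sales
    ORDER BY region, day
""").fetchall()

for row in rows:
    print(row)
```

Every input row survives in the output; only the extra columns are computed over the partition.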

Question 3

Explain the use of the MERGE statement in SQL.

Accepted Answer

**Architectural Logic**: MERGE (commonly used for upserts) performs INSERT, UPDATE, and DELETE in one atomic statement based on a join condition. Syntax: MERGE INTO target USING source ON (key) WHEN MATCHED THEN UPDATE ... WHEN NOT MATCHED THEN INSERT ... [WHEN NOT MATCHED BY SOURCE THEN DELETE]. The last clause is dialect-specific (e.g., SQL Server, BigQuery). **Why**: Single pass over target and source; avoids application-level read-modify-write races, although concurrent MERGEs may still need suitable locking or isolation in some engines; efficient for SCD Type 1/2, incremental loads, CDC sync. **Scalability**: Join key should be indexed; large source scans can lock target....
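SQLite has no MERGE statement, so the sketch below swaps in its `INSERT ... ON CONFLICT DO UPDATE` upsert (SQLite 3.24 or later) to show the matched/not-matched behaviour. It is not MERGE syntax, and the `target` and `source` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE source (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO target VALUES (1, 'old-a'), (2, 'old-b');
    INSERT INTO source VALUES (2, 'new-b'), (3, 'new-c');
""")

# One statement, one transaction: matching keys are updated, new keys inserted.
# "WHERE true" is needed so SQLite does not parse ON CONFLICT as a join constraint.
with conn:
    conn.execute("""
        INSERT INTO target (id, name)
        SELECT id, name FROM source WHERE true
        ON CONFLICT(id) DO UPDATE SET name = excluded.name
    """)

merged = conn.execute("SELECT id, name FROM target ORDER BY id").fetchall()
print(merged)
```

Row 1 is untouched, row 2 is updated, and row 3 is inserted, which corresponds to the WHEN MATCHED and WHEN NOT MATCHED branches of a MERGE.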

Question 4

How do you handle NULL values in SQL? Mention functions like COALESCE and ISNULL.

Accepted Answer

**Architectural Logic**: NULL represents unknown/missing; it propagates through expressions (NULL + 1 = NULL). Handling: IS NULL / IS NOT NULL for predicates; COALESCE(val1, val2, ...) for the first non-NULL argument (portable); ISNULL (SQL Server) / IFNULL (e.g., MySQL, SQLite) for a dialect-specific two-argument default; NULLIF(val1, val2) returns NULL when val1 = val2, which normalizes sentinel values such as '' to NULL. **Why**: An equality JOIN never matches NULL keys, because NULL = NULL evaluates to UNKNOWN rather than TRUE. Aggregates skip NULLs; COUNT(*) is the exception because it counts rows. Explicit handling prevents silent exclusions....
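The behaviours above can be checked with a small sketch on an in-memory SQLite database. The `customers` and `regions` tables are hypothetical, and the portable functions (COALESCE, NULLIF) are used rather than any dialect-specific ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, nickname TEXT, region TEXT);
    INSERT INTO customers VALUES (1, 'ace', 'US'), (2, NULL, NULL), (3, '', 'EU');
    CREATE TABLE regions (code TEXT);
    INSERT INTO regions VALUES ('US'), ('EU'), (NULL);
""")

# NULLIF turns the empty string into NULL; COALESCE then supplies a default.
labels = conn.execute("""
    SELECT id, COALESCE(NULLIF(nickname, ''), 'unknown')
    FROM customers ORDER BY id
""").fetchall()

# COUNT(*) counts rows; COUNT(region) skips the NULL region.
counts = conn.execute("SELECT COUNT(*), COUNT(region) FROM customers").fetchone()

# The equality join never pairs a NULL region with the NULL code.
joined = conn.execute("""
    SELECT c.id FROM customers c JOIN regions r ON c.region = r.code
    ORDER BY c.id
""").fetchall()

# NULL propagates through arithmetic.
propagated = conn.execute("SELECT NULL + 1").fetchone()[0]

print(labels, counts, joined, propagated)
```

Customer 2 silently disappears from the join result, which is the kind of exclusion explicit NULL handling is meant to catch.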

Question 5

How do you optimize a long-running SQL query?

Accepted Answer

**Architectural Logic**: Optimization is diagnostic-first. 1. Profile: EXPLAIN/EXPLAIN ANALYZE to find the bottleneck (scan, join, sort, spill). 2. Reduce input: Filter early (WHERE, partition pruning); SELECT only needed columns. 3. Indexing: B-tree on filter/join columns; avoid over-indexing (every extra index slows writes). 4. Partitioning: Date/tenant partitioning for pruning. 5. Join strategy: Broadcast small dimension tables; avoid cross joins. 6. Statistics: Up-to-date stats for planner. 7....
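A minimal sketch of the profile-then-index loop from steps 1 and 3, using SQLite's `EXPLAIN QUERY PLAN` as a stand-in for a production engine's EXPLAIN. The `orders` table, the index name, and the helper `plan` are hypothetical, and the exact plan wording varies by SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

QUERY = "SELECT amount FROM orders WHERE customer_id = 42"

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable detail string.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

plan_before = plan(QUERY)  # expected to report a full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(QUERY)   # expected to report a search using the new index

print(plan_before)
print(plan_after)
```

Comparing the two plans confirms the index is actually used before paying its write cost.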

Question 6

How would you handle duplicate records in an SQL table?

Accepted Answer

**Architectural Logic**: 1. Identify: GROUP BY key HAVING COUNT(*) > 1. 2. Resolve: ROW_NUMBER() OVER (PARTITION BY key ORDER BY tiebreaker), then keep only rows with rn = 1. 3. Prevent: UNIQUE constraint, PK; MERGE/INSERT with conflict handling. **Why**: Duplicates indicate an ingestion bug, missing idempotency, or intentional multi-versioning (e.g., SCD2). Fix the root cause before ad-hoc dedup. **Scalability**: DELETE from CTE/subquery can lock; prefer INSERT INTO new_table SELECT ......
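The three steps can be sketched as follows on SQLite (3.25 or later for ROW_NUMBER). The `events` table, the choice of `updated_at` as tiebreaker, and the "latest row wins" rule are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, email TEXT, updated_at INTEGER);
    INSERT INTO events VALUES
        (1, 'a@old.example', 10), (1, 'a@new.example', 20),
        (2, 'b@x.example', 5);
""")

# 1. Identify: keys that occur more than once.
dupes = conn.execute("""
    SELECT user_id, COUNT(*) FROM events
    GROUP BY user_id HAVING COUNT(*) > 1
""").fetchall()

# 2. Resolve: write survivors (latest row per key) to a new table
#    instead of deleting in place.
# 3. Prevent: a unique index rejects future duplicates.
conn.executescript("""
    CREATE TABLE events_dedup AS
    SELECT user_id, email, updated_at
    FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY user_id ORDER BY updated_at DESC
        ) AS rn
        FROM events
    )
    WHERE rn = 1;
    CREATE UNIQUE INDEX ux_events_dedup_user ON events_dedup (user_id);
""")

deduped = conn.execute(
    "SELECT user_id, email FROM events_dedup ORDER BY user_id"
).fetchall()
print(dupes, deduped)
```

Writing to a new table keeps the original rows available until the result has been verified.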

Question 7

Explain how you would use repartition or coalesce effectively to optimize processing when analyzing data only for a specific region.

Accepted Answer

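**Architectural Logic**: Filter first; then repartition for parallelism or coalesce to reduce output. **Repartition**: Increase partitions when the filtered data is skewed or under-partitioned: `df.filter(col("region") == "US").repartition(8)`. Avoid repartitioning by `region` after filtering to a single region, because hash partitioning on a constant key sends every row to the same partition. **Coalesce**: Reduce partitions after the filter when the remaining data is small; it merges partitions without a full shuffle and avoids empty partitions and many small files. **Why Order**: Filter early to reduce data; repartition for downstream parallelism; coalesce before write....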
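Why the partitioning key matters after a single-region filter can be seen in a plain-Python simulation. This is not Spark code: `hash_partition` and `round_robin_partition` are hypothetical stand-ins for hash partitioning on a column and for an even, column-free redistribution, and the rows are made up.

```python
from collections import Counter

NUM_PARTITIONS = 8

# Rows that have already been filtered down to one region.
rows = [{"region": "US", "order_id": i} for i in range(1000)]

def hash_partition(rows, key, n):
    # Partition index is derived from the hash of the key column.
    return Counter(hash(r[key]) % n for r in rows)

def round_robin_partition(rows, n):
    # Rows are dealt out to partitions in turn, ignoring their content.
    return Counter(i % n for i, _ in enumerate(rows))

by_region = hash_partition(rows, "region", NUM_PARTITIONS)
by_order = hash_partition(rows, "order_id", NUM_PARTITIONS)
even = round_robin_partition(rows, NUM_PARTITIONS)

print("partitions used when keyed on region:  ", len(by_region))
print("partitions used when keyed on order_id:", len(by_order))
print("partitions used with round-robin:      ", len(even))
```

Keying on the filtered column leaves a single busy partition, while a high-cardinality key or an even redistribution spreads the rows across all eight.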

Question 8

How can you delete partitions from a table in Hive using a command?

Accepted Answer

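ALTER TABLE table_name DROP PARTITION (partition_col='value'); **Multiple**: List several partition specs in one statement, e.g. ALTER TABLE ... DROP PARTITION (dt='2024-01-01'), PARTITION (dt='2024-01-02'); **IF EXISTS**: ALTER TABLE ... DROP IF EXISTS PARTITION (dt='2024-01-01') PURGE; (IF EXISTS suppresses the error when the partition is missing; PURGE deletes the files immediately instead of moving them to trash). **MSCK REPAIR**: Syncs the metastore with the partition directories on the file system (HDFS, S3); by default it adds missing partitions and doesn't drop any. **Cascade**: Some systems support DROP PARTITION ......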

Question 9

If manual partitions are created in a Hive data-warehouse table directory, and you query records from those partitions, will you see the data? If not, how can this be fixed?

Accepted Answer

Manually created Hive partition directories (e.g., /table/country=US/date=2024-01-01/) are not visible to Hive until metadata is added. Hive uses the metastore to find partitions; manually created dirs are unknown. Fix: run MSCK REPAIR TABLE table_name to scan the table directory and add partition metadata. Alternative: ALTER TABLE table_name ADD PARTITION (country='US', date='2024-01-01') LOCATION '/path/to/partition'....

Question 10

What is the difference between static and dynamic partitioning in Hive?

Accepted Answer

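Static partitioning: Partition values are provided explicitly at insert. INSERT INTO t PARTITION (country='US') SELECT ... Dynamic partitioning: Values are derived from the data, with the partition column last in the SELECT list. INSERT INTO t PARTITION (country) SELECT ..., country FROM source. Dynamic inserts require hive.exec.dynamic.partition=true, and hive.exec.dynamic.partition.mode=nonstrict when every partition column is dynamic. Static writes one partition per statement; dynamic can write many. Static is faster (no discovery); dynamic is flexible. Drawback of dynamic: many small partitions (and small files) if the partition column has high cardinality....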

Dunnhumby SQL Interview Questions
