Question 1

Discuss how you handled null values or unstructured data in your previous projects.

Accepted Answer

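**Architectural Logic**: Nulls and semi-structured data call for an explicit policy, validation, and flexible schemas. **Nulls**: Decide first what a null means (value missing vs. not applicable), then apply defaults with COALESCE/IFNULL. For dimensions, use sentinel values (e.g., -1, 'Unknown') and document the policy. **Unstructured**: Apply the schema on read over files such as Parquet and JSON; extract fields with JSON_EXTRACT (SQL) or from_json (Spark). Validate each record, handle malformed ones explicitly (e.g., quarantine them rather than failing the whole job), and access optional fields defensively (optional chaining). **Data Quality**: Add null checks to pipelines, enforce them with dbt or great_expectations tests, and log anomalies....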
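
The policy above can be sketched in a few lines. This is a minimal illustration, assuming Python's built-in `json` and `sqlite3` modules stand in for the real pipeline; the `events` table, the `parse_event` helper, and the sample payloads are hypothetical.

```python
import json
import sqlite3


def parse_event(raw):
    """Return (record, error). Malformed input yields (None, message) instead of raising."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"malformed JSON: {exc.msg}"
    if not isinstance(doc, dict):
        return None, "expected a JSON object"
    # Defensive access in place of optional chaining: a missing branch yields None.
    user = doc.get("user")
    if not isinstance(user, dict):
        user = {}
    return {"user_id": user.get("id"), "country": user.get("country")}, None


# Hypothetical sample payloads.
raw_events = [
    '{"user": {"id": 1, "country": "DE"}}',
    '{"user": {"id": 2}}',  # country is missing
    '{"user": ',            # truncated payload
]

parsed, rejected = [], []
for raw in raw_events:
    record, error = parse_event(raw)
    if error:
        rejected.append((raw, error))  # quarantine instead of failing the whole batch
    else:
        parsed.append(record)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, country TEXT)")
conn.executemany("INSERT INTO events VALUES (:user_id, :country)", parsed)

# Apply the documented default at query time; the stored value stays NULL.
rows = conn.execute(
    "SELECT user_id, COALESCE(country, 'Unknown') FROM events ORDER BY user_id"
).fetchall()

# A simple data-quality check: count NULLs so anomalies can be logged.
null_count = conn.execute(
    "SELECT COUNT(*) FROM events WHERE country IS NULL"
).fetchone()[0]
```

Keeping the stored value as NULL and defaulting only in the query is one possible design choice; it preserves the distinction between "missing" and a real value of 'Unknown'.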

Question 2

How does indexing improve query performance in SQL?

Accepted Answer

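**Index**: A B-tree, hash, or bitmap structure that lets the engine locate rows without a full table scan. **Use**: Columns used in filters and join keys. **Trade-off**: Faster reads, slower writes, because every insert, update, and delete must also maintain the index. **Selectivity**: High-cardinality columns benefit more from B-tree indexes; bitmap indexes target low-cardinality columns. **Composite**: An index on (a, b) helps queries filtering on (a) or on (a, b), but generally not on (b) alone. **Over-indexing**: Indexing every column adds write and storage overhead....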
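
The composite-index rule can be checked against a query planner. The sketch below assumes SQLite (through Python's `sqlite3`) as a stand-in engine; the `orders` table and `idx_ab` index are hypothetical, and other databases may plan these queries differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (a INTEGER, b INTEGER, c TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i % 100, i % 7, f"row{i}") for i in range(1000)],
)
# Composite index with a as the leading column.
conn.execute("CREATE INDEX idx_ab ON orders (a, b)")


def plan(sql):
    """Return the planner's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the detail text


plan_a = plan("SELECT c FROM orders WHERE a = 5")             # leading column only
plan_ab = plan("SELECT c FROM orders WHERE a = 5 AND b = 3")  # both columns
plan_b = plan("SELECT c FROM orders WHERE b = 3")             # trailing column only

for label, text in [("a", plan_a), ("a, b", plan_ab), ("b", plan_b)]:
    print(f"filter on ({label}): {text}")
```

The first two plans should mention `idx_ab`, while the filter on `b` alone falls back to scanning the table.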

Question 3

How would you deal with data skewness in a join operation?

Accepted Answer

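**Identify**: Find skewed keys (e.g., NULL, a bot user_id). **Solutions**: (1) **Salting**: Append a random suffix 0..N-1 to the key on the skewed side and replicate each row of the other side once per suffix; join on the salted key, then aggregate or drop the salt. (2) **Broadcast**: If one side is small enough, broadcast it to avoid the shuffle. (3) **Skew join**: Process skewed keys separately and union the results. (4) **AQE**: Spark 3.x Adaptive Query Execution can split skewed partitions at runtime. **Trade-off**: Salting multiplies the replicated side by N and adds shuffle volume; AQE is automatic....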
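
The salting steps can be traced without a cluster. This is a plain-Python sketch of the logic only, not Spark code; `NUM_SALTS`, the `events` and `users` lists, and the "bot" key are made up for illustration.

```python
import random
from collections import Counter

NUM_SALTS = 4  # hypothetical; in practice tuned to the degree of skew

# Large, skewed side: almost every row shares the key "bot".
events = [("bot", i) for i in range(1000)] + [("alice", 1), ("bob", 2)]
# Small side: one row per key.
users = [("bot", "flagged"), ("alice", "ok"), ("bob", "ok")]

rng = random.Random(0)

# 1. Skewed side: attach a random salt to each row's key.
salted_events = [((key, rng.randrange(NUM_SALTS)), payload) for key, payload in events]

# 2. Other side: replicate each row once per salt so every salted key finds its match.
salted_users = {(key, salt): status for key, status in users for salt in range(NUM_SALTS)}

# 3. Join on the salted key, then drop the salt from the output.
joined = [
    (key, payload, salted_users[(key, salt)])
    for (key, salt), payload in salted_events
    if (key, salt) in salted_users
]

# Reference result: the same join without salting.
user_status = dict(users)
plain = [(key, payload, user_status[key]) for key, payload in events if key in user_status]

# How the hot key's rows spread across salted join keys.
bot_rows_per_salt = Counter(salt for (key, salt), _ in salted_events if key == "bot")
```

The salted join returns the same rows as the plain join, but the hot key is now split over `NUM_SALTS` distinct join keys, so no single task would receive all of its rows.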

Question 4

How would you deal with data skewness in a large dataset?

Accepted Answer

**Aggregation skew**: Aggregate in two phases: first by salted key, then by real key. **Join skew**: Salt or broadcast (see Question 3). **Filter**: Remove skewed keys (e.g., NULL) if they are not needed. **Repartition**: Repartition by a different, more evenly distributed key. **AQE**: Enable it; it handles skew at runtime....
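
Two-phase aggregation can be illustrated in plain Python. The sketch assumes a sum, which can be combined from partial results; `NUM_SALTS` and the sample rows are hypothetical.

```python
import random
from collections import defaultdict

NUM_SALTS = 8  # hypothetical number of salt buckets
rng = random.Random(42)

# One hot key dominates the dataset.
rows = [("hot", 1)] * 10_000 + [("cold", 1)] * 10

# Phase 1: partial sums by (key, salt). The hot key's work is split across salts.
partial = defaultdict(int)
for key, value in rows:
    partial[(key, rng.randrange(NUM_SALTS))] += value

# Phase 2: combine the partial sums by the real key. The input here is tiny:
# at most NUM_SALTS rows per key.
totals = defaultdict(int)
for (key, _salt), subtotal in partial.items():
    totals[key] += subtotal

largest_hot_bucket = max(v for (k, _s), v in partial.items() if k == "hot")
```

This works directly for aggregates such as sum, count, min, and max. An average needs the sum and the count carried through both phases and divided at the end.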

Question 5

Solve a problem using a window function in Spark or SQL.

Accepted Answer

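Window example, rank per group: SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn FROM employees. PySpark: w = Window.partitionBy('dept').orderBy(col('salary').desc()); df.withColumn('rn', row_number().over(w)). Use for: top-N per group, deduplication, running metrics. To keep the top 3, filter in an outer query or CTE (WHERE rn <= 3), because a window function's alias cannot be referenced in the WHERE clause of the same SELECT. **Why it matters**: The window function ranks rows within each group without a self-join or correlated subquery. **Scalability trade-offs**: PARTITION BY requires a shuffle in Spark, and a skewed partition key concentrates work on one task; profile before optimizing, and validate on a sample before the full run....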
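
A runnable version of the top-N pattern, sketched against SQLite through Python's `sqlite3`. It assumes the bundled SQLite is 3.25 or newer (the first release with window functions), and the `employees` rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ana", "eng", 150), ("Bo", "eng", 140), ("Cy", "eng", 130), ("Di", "eng", 120),
        ("Ed", "ops", 100), ("Flo", "ops", 90),
    ],
)

# The ranking is computed in a CTE so the outer query can filter on rn.
TOP_N_SQL = """
WITH ranked AS (
    SELECT name, dept, salary,
           ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
    FROM employees
)
SELECT dept, name, rn
FROM ranked
WHERE rn <= 3
ORDER BY dept, rn
"""

top3 = conn.execute(TOP_N_SQL).fetchall()
for dept, name, rn in top3:
    print(dept, rn, name)
```

With this sample data, "Di" is dropped from `eng` as the fourth-ranked row, and `ops` keeps both of its rows. Changing `rn <= 3` to `rn = 1` turns the same query into a deduplication that keeps one row per group.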

Question 6

map() vs mapPartitions(): Highlight the difference between map (row-level transformation) and mapPartitions (partition-level transformation).

Accepted Answer

**map()**: One element in, one element out; row-level. Any setup inside the function runs once per record. **mapPartitions()**: One partition (an iterator) in, an iterator out; enables setup/teardown once per partition (e.g., a DB connection) and amortizes its cost. **Use mapPartitions when**: You need DB connections, batch writes, or other per-partition setup. **Scalability**: mapPartitions reduces function-call overhead and avoids creating a connection per row; consume the iterator lazily so the whole partition is not materialized in memory....
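
The cost difference can be seen by counting connection setups. This sketch models partitions as plain Python lists rather than using Spark; `FakeConnection`, `enrich_row`, and `enrich_partition` are hypothetical names. In PySpark, functions of these two shapes would be passed to `rdd.map` and `rdd.mapPartitions` respectively.

```python
class FakeConnection:
    """Stand-in for an expensive resource such as a database connection."""

    opened = 0  # class-level counter of how many connections were created

    def __init__(self):
        FakeConnection.opened += 1

    def lookup(self, x):
        return x * 10

    def close(self):
        pass


def enrich_row(x):
    """map-style function: setup and teardown happen for every record."""
    conn = FakeConnection()
    try:
        return conn.lookup(x)
    finally:
        conn.close()


def enrich_partition(rows):
    """mapPartitions-style function: one setup per partition, results yielded lazily."""
    conn = FakeConnection()
    try:
        for x in rows:
            yield conn.lookup(x)
    finally:
        conn.close()


partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

FakeConnection.opened = 0
per_row = [[enrich_row(x) for x in part] for part in partitions]
opened_per_row = FakeConnection.opened

FakeConnection.opened = 0
per_partition = [list(enrich_partition(iter(part))) for part in partitions]
opened_per_partition = FakeConnection.opened
```

Both approaches produce the same values; the row-level version opens one connection per record, the partition-level version one per partition.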

Question 7

repartition() vs coalesce(): Explain when to use repartition() (increases partitions) vs coalesce() (reduces partitions).

Accepted Answer

**repartition(n)**: Full shuffle; can increase or decrease the partition count. Use it when you need more parallelism or want to redistribute data after skew. **coalesce(n)**: Reduces the partition count without a full shuffle by merging existing partitions; it cannot increase the count. Use it when reducing partitions before a write (e.g., to produce fewer files). **Scalability**: repartition is expensive; coalesce is cheaper for reduction but can leave partitions uneven. **Cost**: coalesce(1) creates a single-partition bottleneck on large data; avoid it....
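
A toy model can make the difference concrete. The two functions below are simplified stand-ins written for illustration, not Spark's implementation: the assumption is that coalesce merges whole partitions while repartition redistributes individual rows.

```python
def coalesce(partitions, n):
    """Toy model: merge whole partitions into at most n groups; rows stay together."""
    n = min(n, len(partitions))  # coalesce cannot increase the partition count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions, n):
    """Toy model: move every row round-robin into n new partitions (a full shuffle)."""
    out = [[] for _ in range(n)]
    rows = (row for part in partitions for row in part)
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


# Four partitions, one of them far larger than the rest.
skewed = [list(range(90)), [90, 91], [92, 93], [94, 95, 96, 97, 98, 99]]

coalesced_sizes = [len(p) for p in coalesce(skewed, 2)]
rebalanced_sizes = [len(p) for p in repartition(skewed, 2)]
grown = coalesce(skewed, 8)  # asking for more partitions has no effect in this model
```

In this model the coalesced partitions inherit the input's imbalance, while the repartitioned ones come out even at the price of moving every row.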

Capgemini SQL Interview Questions
