Question 1

Explain wide vs. narrow transformations and how they drive shuffle cost, failure domains, and pipeline design. When would you intentionally add a wide transformation, and how do you minimize its impact?

Accepted Answer

**Section 1 — The Context (The 'Why')**
Wide transformations force full data shuffles across the cluster; narrow transformations stay partition-local. The cost of shuffle dominates Spark job runtime at scale.

**Section 2 — The Diagram**
```
[Narrow: map, filter] --> [RDD] --> [Wide: join, groupBy] --> [Shuffle]
```

**Section 3 — Component Logic**
**Narrow transformations** (map, filter) do not require data movement....

Question 2

Data masking scenarios for secure data handling

Accepted Answer

**Scenarios**: (1) PII in dev—mask SSN, email, names. (2) Logs—redact tokens. (3) Exports—generalize (region vs. city). (4) Support—dynamic masking (last 4 of card). **Techniques**: Substitution, hashing, blurring....

Question 3

Normalization: Various forms and impact on query performance

Accepted Answer

Higher normalization (3NF) often increases joins, which can slow queries. Denormalization (star schema) reduces joins, improves read performance. Balance: normalize for source/ODS; denormalize for warehouse reporting. Materialized views can denormalize without base-table redundancy. **Why it matters**: Design choices compound at scale—wrong approach can cause 100× overhead. **Scalability trade-offs**: Profile before optimizing; validate on sample then full....

Question 4

Optimization: Performance tuning strategies and temporal tables

Accepted Answer

Performance: indexing, partitioning, query rewrite, statistics. Temporal tables: system-versioned tables for point-in-time queries. Use for auditing, history. Example: SELECT * FROM employee FOR SYSTEM_TIME AS OF '2024-01-01'. Combines current and historical in one query. **Why it matters**: Design choices compound at scale—wrong approach can cause 100× overhead. **Scalability trade-offs**: Profile before optimizing; validate on sample then full....

Question 5

SCDs: Types of Slowly Changing Dimensions and their use cases

Accepted Answer

SCD types: 1—overwrite (no history); 2—add row (full history); 3—add column (previous value). Type 4—separate history table. Type 5—mini-dimension. Type 6—hybrid. Use 1 for corrections; 2 for audit/history; 3 for single previous value. Type 2 most common in DW. **Why it matters**: Design choices compound at scale—wrong approach can cause 100× overhead. **Scalability trade-offs**: Profile before optimizing; validate on sample then full....

Question 6

Schema Design: Star vs. Snowflake schema differences

Accepted Answer

**Section 1 — The Context (The 'Why')**
The primary challenge for this design in SQL is balancing scale, cost, and reliability. At scale, naive approaches fail: single points of failure cause cascades, schema evolution breaks consumers, and over-provisioning explodes cost. Failure modes include silent data loss from non-idempotent writes, cascading job failures from tight coupling, and operational burden from manual intervention....

Question 7

Spark optimizations: Partitioning, caching, tuning parallelism

Accepted Answer

Spark: partition by key (date, id) for pruning and join locality. Cache: cache() for repeated use. Parallelism: spark.default.parallelism = 2–4 * cores; repartition for shuffle. Tuning: executor memory, cores, shuffle partitions. AQE for adaptive optimization. Monitor Spark UI for skew and GC. **Why it matters**: Design choices compound at scale—wrong approach can cause 100× overhead. **Scalability trade-offs**: Profile before optimizing; validate on sample then full....

Question 8

Apache Spark Architecture - RDD, DAG, cluster manager, driver node, worker node

Accepted Answer

**Section 1 — The Context (The 'Why')**
The primary challenge for this design in Spark/Big Data is balancing scale, cost, and reliability. At scale, naive approaches fail: single points of failure cause cascades, schema evolution breaks consumers, and over-provisioning explodes cost. Failure modes include silent data loss from non-idempotent writes, cascading job failures from tight coupling, and operational burden from manual intervention....

Question 9

Spark Streaming - streaming data handling and file mounting techniques

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Spark Streaming: `readStream.format('kafka')` or `format('parquet').schema(schema)` for file stream. File mounting: mount S3/ADLS to DBFS; use `spark.readStream.format('parquet').load('/mnt/path')`....

Question 10

CI/CD implementation across environments (DEV, QA, UAT, PreProd, PROD)

Accepted Answer

**Section 1 — The Context (The 'Why')**
This system faces scale and failure challenges at production. A naive approach breaks under load, loses data, or violates compliance. The primary challenge varies by domain: notifications need preference respect; banking needs ACID; pipelines need idempotency....

Zen Data Shastra Data Engineer Interview Questions

Difficulty Breakdown

Key Topics Covered

How to Use This Guide

Companies asking these questions

All 13 Questions

More Interview Prep Guides

Unlock All Expert Answers