Question 1

Explain wide vs. narrow transformations and how they drive shuffle cost, failure domains, and pipeline design. When would you intentionally add a wide transformation, and how do you minimize its impact?

Accepted Answer

**Section 1 — The Context (The 'Why')**
Wide transformations (join, groupBy) redistribute data across the cluster in a shuffle; narrow transformations (map, filter) stay partition-local. At scale, shuffle cost often dominates Spark job runtime.

**Section 2 — The Diagram**
```
[Narrow: map, filter] --> [RDD] --> [Wide: join, groupBy] --> [Shuffle]
```

**Section 3 — Component Logic**
**Narrow transformations** (map, filter) do not require data movement....
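
The distinction can be illustrated without a cluster. The following is a minimal pure-Python sketch, not Spark code: partitions are modeled as plain lists, and `narrow_map`, `wide_group_by_key`, and the CRC32-based partitioner are hypothetical names chosen for illustration.

```python
import zlib
from collections import defaultdict


def narrow_map(partitions, fn):
    # Narrow: each output partition is computed from exactly one input
    # partition, so no record leaves the partition it started in.
    return [[fn(record) for record in part] for part in partitions]


def wide_group_by_key(partitions, num_out):
    # Wide: any input partition may feed any output partition, so records
    # are routed by key (a stand-in for a shuffle).
    shuffled = [defaultdict(list) for _ in range(num_out)]
    moved = 0  # records whose destination differs from their source partition
    for src, part in enumerate(partitions):
        for key, value in part:
            # Deterministic hash partitioner (assumption for this sketch).
            dst = zlib.crc32(str(key).encode("utf-8")) % num_out
            if dst != src:
                moved += 1
            shuffled[dst][key].append(value)
    return [dict(p) for p in shuffled], moved


parts = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
mapped = narrow_map(parts, lambda kv: (kv[0], kv[1] * 10))
grouped, moved = wide_group_by_key(mapped, num_out=2)
```

After the wide step, all values for a given key sit in a single partition, which is what makes per-key aggregation possible and also why the step needs data movement.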

Question 2

Data masking scenarios for secure data handling

Accepted Answer

**Scenarios**: (1) PII in dev—mask SSN, email, names. (2) Logs—redact tokens. (3) Exports—generalize (region vs. city). (4) Support—dynamic masking (last 4 of card). **Techniques**: Substitution, hashing, blurring....
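
A few of the scenarios above can be sketched as small helpers. This is an illustrative sketch only: the function names, the `Bearer` token pattern, and the salted SHA-256 scheme are assumptions, not a vetted masking policy.

```python
import hashlib
import re


def mask_ssn(ssn):
    # Substitution: keep only the last 4 digits of the SSN.
    digits = re.sub(r"\D", "", ssn)
    return "***-**-" + digits[-4:]


def mask_card(card):
    # Dynamic-masking style output for support: last 4 digits of the card.
    digits = re.sub(r"\D", "", card)
    return "*" * (len(digits) - 4) + digits[-4:]


def hash_email(email, salt):
    # Hashing: deterministic, so the masked value can still be used as a
    # join key. The salt is a hypothetical secret kept out of the dataset.
    return hashlib.sha256((salt + email.lower()).encode("utf-8")).hexdigest()


def redact_tokens(line):
    # Log redaction: replace the token after "Bearer" (assumed log format).
    return re.sub(r"(Bearer\s+)\S+", r"\1[REDACTED]", line)
```

For example, `mask_ssn("123-45-6789")` yields `***-**-6789`, and hashing the same email with the same salt always yields the same digest.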

Question 3

Normalization: Various forms and impact on query performance

Accepted Answer

Higher normalization (such as 3NF) usually means more joins at query time, which can slow reads. Denormalization (such as a star schema) reduces joins and improves read performance at the cost of redundancy. Balance: normalize for source/ODS; denormalize for warehouse reporting. Materialized views can provide denormalized reads without adding redundancy to the base tables. **Why it matters**: Design choices compound at scale; a wrong choice can cause overhead on the order of 100×. **Scalability trade-offs**: Profile before optimizing; validate on a sample, then on the full dataset....
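
The join trade-off can be seen on a toy schema. The sketch below uses Python's built-in `sqlite3` with hypothetical `customer`, `orders`, and `orders_wide` tables; it shows only that the two designs answer the same question, not how they perform.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized: region is stored once per customer.
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    amount REAL
);
INSERT INTO customer VALUES (1, 'EU'), (2, 'US');
INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 25.0), (12, 2, 40.0);

-- Denormalized reporting copy: region is repeated on every order row.
CREATE TABLE orders_wide AS
SELECT o.order_id, o.amount, c.region
FROM orders o JOIN customer c USING (customer_id);
""")

# Normalized read path needs a join.
normalized = conn.execute("""
    SELECT c.region, SUM(o.amount)
    FROM orders o JOIN customer c USING (customer_id)
    GROUP BY c.region ORDER BY c.region
""").fetchall()

# Denormalized read path scans a single table.
denormalized = conn.execute("""
    SELECT region, SUM(amount)
    FROM orders_wide
    GROUP BY region ORDER BY region
""").fetchall()
```

The denormalized table avoids the join but must be refreshed whenever a customer's region changes, which is the redundancy cost mentioned above.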

Question 4

Optimization: Performance tuning strategies and temporal tables

Accepted Answer

Performance: indexing, partitioning, query rewrite, up-to-date statistics. Temporal tables: system-versioned tables for point-in-time queries. Use for auditing and history. Example (SQL Server-style syntax): SELECT * FROM employee FOR SYSTEM_TIME AS OF '2024-01-01'. This returns rows as they existed at that time; the engine reads current and historical rows in one query. **Why it matters**: Design choices compound at scale; a wrong choice can cause overhead on the order of 100×. **Scalability trade-offs**: Profile before optimizing; validate on a sample, then on the full dataset....
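
SQLite has no system-versioned tables, so the sketch below only emulates the point-in-time lookup with explicit validity columns. The `employee_history` table, its `valid_from`/`valid_to` columns, and the `as_of` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee_history (
    emp_id INTEGER,
    title TEXT,
    valid_from TEXT,  -- inclusive, ISO-8601 date
    valid_to TEXT     -- exclusive; '9999-12-31' marks the current row
);
INSERT INTO employee_history VALUES
    (1, 'Analyst',  '2023-01-01', '2023-09-01'),
    (1, 'Engineer', '2023-09-01', '9999-12-31'),
    (2, 'Manager',  '2024-03-01', '9999-12-31');
""")


def as_of(conn, ts):
    # Return the rows that were valid at time ts. ISO-8601 strings compare
    # correctly as text, so no date parsing is needed here.
    return conn.execute(
        "SELECT emp_id, title FROM employee_history "
        "WHERE valid_from <= ? AND ? < valid_to ORDER BY emp_id",
        (ts, ts),
    ).fetchall()
```

Calling `as_of(conn, "2024-01-01")` returns only employee 1 as an Engineer, since employee 2 did not exist yet on that date.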

Question 5

SCDs: Types of Slowly Changing Dimensions and their use cases

Accepted Answer

SCD types: Type 1 overwrites (no history); Type 2 adds a row (full history); Type 3 adds a column (previous value only). Type 4 keeps a separate history table; Type 5 uses a mini-dimension; Type 6 is a hybrid of Types 1, 2, and 3. Use Type 1 for corrections, Type 2 for audit/history, and Type 3 when only a single previous value is needed. Type 2 is the most common in data warehouses. **Why it matters**: Design choices compound at scale; a wrong choice can cause overhead on the order of 100×. **Scalability trade-offs**: Profile before optimizing; validate on a sample, then on the full dataset....
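
Type 2 is the least obvious of these to implement, so here is a minimal in-memory sketch. The row layout (`start_date`, `end_date`, `is_current`) and the `apply_scd2` helper are hypothetical; a warehouse would typically do the same thing with a MERGE-style statement.

```python
def apply_scd2(dimension, key, new_attrs, effective_date):
    # Find the current version of this business key, if any.
    current = next(
        (row for row in dimension if row["key"] == key and row["is_current"]),
        None,
    )
    if current is not None:
        if current["attrs"] == new_attrs:
            return dimension  # nothing changed, so no new version
        # Close out the old version instead of overwriting it.
        current["end_date"] = effective_date
        current["is_current"] = False
    # Append the new version as the current row.
    dimension.append({
        "key": key,
        "attrs": dict(new_attrs),
        "start_date": effective_date,
        "end_date": None,
        "is_current": True,
    })
    return dimension


dim = []
apply_scd2(dim, "C1", {"city": "Pune"}, "2024-01-01")
apply_scd2(dim, "C1", {"city": "Delhi"}, "2024-06-01")
apply_scd2(dim, "C1", {"city": "Delhi"}, "2024-07-01")  # no-op: unchanged
```

After these calls the dimension holds two rows for `C1`: the closed Pune version and the current Delhi version. Type 1 would have kept a single overwritten row.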

Question 6

Schema Design: Star vs. Snowflake schema differences

Accepted Answer

**Section 1 — The Context (The 'Why')**
The primary challenge for this design in SQL is balancing scale, cost, and reliability. At scale, naive approaches fail: single points of failure cause cascades, schema evolution breaks consumers, and over-provisioning explodes cost. Failure modes include silent data loss from non-idempotent writes, cascading job failures from tight coupling, and operational burden from manual intervention....

Question 7

Spark optimizations: Partitioning, caching, tuning parallelism

Accepted Answer

Spark: partition by key (date, id) for partition pruning and join locality. Cache: use cache() for datasets that are reused across actions. Parallelism: set spark.default.parallelism to roughly 2–4 tasks per core; use repartition (which itself shuffles) to rebalance partitions ahead of shuffle-heavy stages. Tuning: executor memory, cores, shuffle partitions (spark.sql.shuffle.partitions). AQE (Adaptive Query Execution) for runtime optimization. Monitor the Spark UI for skew and GC. **Why it matters**: Design choices compound at scale; a wrong choice can cause overhead on the order of 100×. **Scalability trade-offs**: Profile before optimizing; validate on a sample, then on the full dataset....
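
The 2–4 tasks-per-core rule of thumb can be turned into a starting configuration. The sketch below is plain Python and does not call Spark; `suggest_spark_conf` is a hypothetical helper, and the values it produces are starting points to profile, not recommendations. The three configuration keys are real Spark settings.

```python
def suggest_spark_conf(executors, cores_per_executor, tasks_per_core=3):
    # Apply the 2-4 tasks-per-core heuristic from the text above.
    if not 2 <= tasks_per_core <= 4:
        raise ValueError("tasks_per_core should be between 2 and 4")
    total_cores = executors * cores_per_executor
    parallelism = total_cores * tasks_per_core
    return {
        # Default partition count for RDD shuffles.
        "spark.default.parallelism": str(parallelism),
        # Partition count for DataFrame/SQL shuffles.
        "spark.sql.shuffle.partitions": str(parallelism),
        # Let AQE adjust shuffle partitions and handle skew at runtime.
        "spark.sql.adaptive.enabled": "true",
    }


conf = suggest_spark_conf(executors=10, cores_per_executor=4)
```

With 10 executors of 4 cores each and the default of 3 tasks per core, both partition settings come out to 120.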

Question 8

Apache Spark Architecture - RDD, DAG, cluster manager, driver node, worker node

Accepted Answer

**Section 1 — The Context (The 'Why')**
The primary challenge for this design in Spark/Big Data is balancing scale, cost, and reliability. At scale, naive approaches fail: single points of failure cause cascades, schema evolution breaks consumers, and over-provisioning explodes cost. Failure modes include silent data loss from non-idempotent writes, cascading job failures from tight coupling, and operational burden from manual intervention....

Zen Data Shastra Data Engineer Interview Questions
