Question 1

Conceptualize and design a real-time streaming data pipeline end-to-end.

Accepted Answer

**Section 1 — The Context (The 'Why')**
Real-time streaming pipelines face a fundamental tension: events arrive continuously at high velocity while downstream consumers demand low latency, yet the system must guarantee no data loss during broker failures or consumer restarts. A naive approach, such as issuing one database write per event, collapses under load; checkpointing only to local disk loses state when an executor is preempted....

Question 2

Explain Apache Spark fundamentals, OOM scenarios and their resolutions, optimization techniques, strategies for optimized joins, and handling data skewness with Key Salting techniques.

Accepted Answer

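**Fundamentals**: RDDs (resilient distributed datasets), the DAG of stages that Spark builds from transformations, and lazy evaluation (nothing executes until an action is called).

**OOM**: On the driver, the usual causes are `collect()` on a large dataset and oversized broadcast variables; on executors, skewed partitions and large shuffles. Fixes: avoid `collect()` on large results, keep broadcast tables small enough to fit in memory, and salt skewed keys.

**Optimization**: Predicate pushdown, broadcast joins, sensible partitioning, caching of reused datasets, and Adaptive Query Execution (AQE).

**Joins**: Broadcast the small side; salt the join key when one side is skewed.

**Salting**: Append a random suffix (e.g., 0–9) to the hot key so its rows spread across partitions, aggregate per salted key, then strip the suffix and merge the partial results.

**Scalability trade-offs**: Each technique has limits; profile first....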
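The salting steps can be illustrated with a minimal sketch in plain Python rather than Spark API code. Everything named here is an assumption made for illustration: `NUM_SALTS`, `NUM_PARTITIONS`, `partition_of`, `salted_sum`, and the use of CRC32 as a stand-in for a hash partitioner. It only simulates how records would land on partitions; it is not how Spark implements shuffles.

```python
import random
import zlib
from collections import Counter

NUM_SALTS = 10       # assumed salt range 0-9, matching the example in the text
NUM_PARTITIONS = 8   # hypothetical number of shuffle partitions


def partition_of(key):
    """Stand-in for a hash partitioner (CRC32 keeps the result deterministic)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def unsalted_load(records):
    """Count how many records each partition receives without salting."""
    load = Counter()
    for key, _ in records:
        load[partition_of(key)] += 1
    return load


def salted_sum(records, rng):
    """Two-stage sum: aggregate per salted key, then strip the salt and merge."""
    partial = Counter()  # stage 1: subtotal per salted key
    load = Counter()     # records received per partition after salting
    for key, value in records:
        salted_key = f"{key}_{rng.randrange(NUM_SALTS)}"
        partial[salted_key] += value
        load[partition_of(salted_key)] += 1

    merged = Counter()   # stage 2: merge subtotals under the original key
    for salted_key, subtotal in partial.items():
        original_key = salted_key.rsplit("_", 1)[0]
        merged[original_key] += subtotal
    return merged, load


# Skewed input: one hot key dominates the dataset.
records = [("hot", 1)] * 9000 + [(f"user{i}", 1) for i in range(1000)]

plain = unsalted_load(records)
totals, salted = salted_sum(records, random.Random(42))

print("largest partition without salting:", max(plain.values()))
print("largest partition with salting:   ", max(salted.values()))
print("total for hot key:", totals["hot"])
```

The totals are unchanged by salting, while the largest partition shrinks because the hot key is split across several salted keys. The cost is the second aggregation stage, which is one reason to profile before applying the technique.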

Question 3

How can Docker be used to scale streaming data applications?

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Docker for streaming scale: run Kafka, Flink, Spark workers, or custom processors in containers. Kubernetes runs multiple replicas of a consumer; within a consumer group, each replica is assigned a share of the Kafka partitions, so the partition count caps useful parallelism. Benefits: consistent environment, easy scaling, portability. Example: Kafka Connect in Docker; Flink TaskManagers as pods....
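How replicas relate to Kafka partitions can be sketched in a few lines of Python. This is a simplified round-robin stand-in, not Kafka's actual assignor logic, and the names `assign_partitions` and `idle_replicas` are hypothetical. It illustrates one point: replicas beyond the partition count receive no work.

```python
def assign_partitions(num_partitions, num_replicas):
    """Spread partitions round-robin over consumer replicas (simplified model)."""
    assignment = {replica: [] for replica in range(num_replicas)}
    for partition in range(num_partitions):
        assignment[partition % num_replicas].append(partition)
    return assignment


def idle_replicas(assignment):
    """Return the replicas that were assigned no partitions."""
    return [replica for replica, parts in assignment.items() if not parts]


# 6 partitions over 3 replicas: every replica gets two partitions.
balanced = assign_partitions(6, 3)

# 6 partitions over 8 replicas: two replicas have nothing to consume.
over_scaled = assign_partitions(6, 8)

print(balanced)
print("idle replicas:", idle_replicas(over_scaled))
```

Under this model, scaling a containerized consumer past the topic's partition count adds cost without adding throughput, so partition count and replica count are usually planned together.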

Expedia Spark & Big Data Interview Questions
