Question 1

How do you handle late-arriving data in Spark Structured Streaming?

Accepted Answer

Watermark: Defines max lateness (e.g., 10 min). Events older than (max_event_time - watermark) are dropped. State: Kept for aggregations within watermark; beyond that, state is purged to avoid unbounded growth. Output modes: Append (only finalized results), Update (changed rows), Complete (full result). Why: Trade-off between latency and correctness—tighter watermark = less state, more dropped late data. Looser = more state, more complete results....

Question 2

What is the small-file problem in Spark, and how do you solve it?

Accepted Answer

Problem: Thousands of tiny files (KB–MB) cause metadata overhead, slow S3/HDFS listing, many small tasks, I/O thrashing. Root causes: High parallelism, over-partitioning, streaming micro-batches. Why it hurts: Each file = task; 10K files = 10K tasks = scheduling overhead. S3 LIST is rate-limited; listing 100K files can take minutes. Solutions: 1) coalesce/repartition before write to target 128MB–1GB per file. 2) Delta Lake/Spark auto-compaction. 3) Larger partition sizes....

Question 3

How do you optimize Spark jobs for better performance? Mention at least 5 techniques.

Accepted Answer

1) Broadcast joins for small tables—avoid shuffle. 2) Predicate pushdown—filter at source (Parquet/ORC) to reduce scan. 3) Partition tuning—spark.sql.shuffle.partitions ~2–4× cores; match partition columns to filter/join keys. 4) Cache only when reused; unpersist when done to free memory. 5) Prefer Spark SQL over UDFs—Catalyst optimization. 6) Skew handling—salted keys, AQE skew join. 7) Kryo serialization for RDD; avoid Java default. 8) Coalesce before write to avoid small files....

Question 4

How would you implement a sliding window aggregation in Spark Structured Streaming?

Accepted Answer

Sliding window: window(timeColumn, windowDuration, slideDuration) where slideDuration < windowDuration creates overlapping windows. Example: df.withWatermark("event_time", "10 minutes").groupBy(window(col("event_time"), "1 hour", "10 minutes"), col("user_id")).count(). **Why watermark**: Late data would grow state unbounded; watermark drops events older than (max_event_time - delay) and allows state cleanup....

Question 5

Compare HDFS and cloud-based storage systems in terms of scalability and performance.

Accepted Answer

**Why comparison matters**: Choice affects cost model, elasticity, and ops. **HDFS**: Scale by adding nodes; fixed cluster size; data locality for compute. Throughput limited by cluster; replication = 3x storage. **Cloud (S3, GCS)**: Effectively unlimited scale; pay-per-use; no cluster to manage. High throughput with proper concurrency; no locality (network fetch). **Scalability trade-offs**: HDFS = predictable perf with locality; cloud = elastic but network latency....

Question 6

Describe how you would use PySpark to aggregate and summarize large transaction datasets.

Accepted Answer

**Why aggregation strategy matters**: Wrong approach = OOM or 10x runtime. **Approach**: `df.groupBy("category").agg(sum("amount"), count("*"))`. For large: partition by date; filter early; use window functions for running metrics. **Scalability trade-offs**: Shuffle on groupBy; incremental for streams. **Cost implications**: Full scan = expensive; partition pruning reduces....

Question 7

Describe the role of a workflow orchestrator like Airflow in a data pipeline.

Accepted Answer

**Why orchestrator matters**: Pipelines are multi-step; dependencies, retries, monitoring = orchestrator's job. **Role**: Schedules (cron, triggers); manages dependencies (task order); retries on failure; monitors; ensures idempotent, reliable runs. Central place for pipeline logic. **Scalability trade-offs**: Orchestrator can bottleneck; use Celery for parallelism. **Cost implications**: Failed tasks without retry = manual intervention = cost....

Question 8

Describe the stages of a Spark job and strategies to optimize Spark performance for large datasets.

Accepted Answer

**Why stages matter**: Stages = units of parallelism; boundaries = shuffles. **Stages**: DAG split at shuffle boundaries; each stage = set of tasks (1 per partition). **Optimization**: Reduce shuffles (broadcast); partition well; cache; AQE. For large data: partition pruning; avoid skew; right-size. **Scalability trade-offs**: Fewer stages = less overhead; more shuffles = more cost. **Cost implications**: Shuffle = network + disk; minimize....

Question 9

Explain how Kafka handles real-time data streaming and guarantees message delivery.

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Kafka guarantees: (1) At-least-once—acks=all, retries; (2) Exactly-once—enable idempotence and transactions; (3) Ordering per partition. Producers send to partitions (key-based or round-robin); brokers replicate to ISR; consumers commit offsets. For delivery: set acks=all, replication.factor>=2....

Question 10

Provide strategies for handling data deduplication and cleaning in Spark jobs.

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Deduplication in Spark: (1) `dropDuplicates` on business key. (2) Window + rank for 'latest' record. (3) Delta merge with key. (4) Fuzzy dedup: MinHash, LSH. Cleaning: handle nulls, standardize formats, validate ranges....

Swiggy Spark & Big Data Interview Questions

Difficulty Breakdown

Key Topics Covered

How to Use This Guide

Companies asking these questions

All 11 Questions

More Interview Prep Guides

Unlock All Expert Answers