Real interview questions asked at Swiggy. Practice the most frequently asked questions and land your next role.
Swiggy data engineering interviews test your ability across multiple domains. These questions are sourced from real Swiggy interview experiences and sorted by frequency. Practice the ones that matter most.
Describe a scenario where partitioning and bucketing would improve query performance.
How do you handle late-arriving data in Spark Structured Streaming?
What is the small-file problem in Spark, and how do you solve it?
How do you optimize Spark jobs for better performance? Mention at least 5 techniques.
What are decorators in Python, and how do they work?
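As a warm-up for this question, here is a minimal sketch of how a decorator works: a function that takes a function and returns a wrapped version of it. The `log_calls` decorator and the `add` function are illustrative names, not from any particular codebase.

```python
import functools

def log_calls(func):
    """Decorator: returns a wrapper that counts calls, then delegates to func."""
    @functools.wraps(func)            # preserve func's name and docstring on the wrapper
    def wrapper(*args, **kwargs):
        wrapper.calls += 1            # extra behavior added around the original call
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@log_calls                            # equivalent to: add = log_calls(add)
def add(a, b):
    return a + b

result = add(2, 3)
```

The `@` syntax is just sugar for reassigning the name; `functools.wraps` keeps introspection (e.g. `add.__name__`) pointing at the original function.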
Explain the difference between *args and **kwargs in Python.
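A quick illustration of the distinction this question is after: `*args` packs surplus positional arguments into a tuple, while `**kwargs` packs surplus keyword arguments into a dict.

```python
def describe(*args, **kwargs):
    # *args  -> tuple of extra positional arguments
    # **kwargs -> dict of extra keyword arguments
    return args, kwargs

positional, keywords = describe(1, 2, city="Bangalore")
```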
Explain the trade-offs between batch and real-time data processing. Provide examples of when each is appropriate.
Retrieve the most recent sale_timestamp for each product (Latest Transaction).
Difference between ROW_NUMBER(), RANK(), and DENSE_RANK() with examples.
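The difference between the three ranking functions is easiest to see on tied values. This sketch uses an in-memory SQLite database (window functions require SQLite 3.25+, bundled with modern Python); the `scores` table is invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scores (name TEXT, score INT);
INSERT INTO scores VALUES ('a', 90), ('b', 90), ('c', 80);
""")
rows = conn.execute("""
    SELECT name,
           ROW_NUMBER() OVER (ORDER BY score DESC, name) AS rn,  -- always unique: 1,2,3
           RANK()       OVER (ORDER BY score DESC) AS rnk,       -- ties share rank, gap after: 1,1,3
           DENSE_RANK() OVER (ORDER BY score DESC) AS drnk       -- ties share rank, no gap: 1,1,2
    FROM scores
    ORDER BY name
""").fetchall()
```

Note the tie-breaker (`, name`) added to `ROW_NUMBER`'s ordering: without it, the numbering of tied rows is nondeterministic.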
Difference between the WHERE and HAVING clauses, with examples.
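The core distinction: WHERE filters individual rows before grouping, HAVING filters the aggregated groups after. A runnable sketch against an in-memory SQLite database (the `orders` table is sample data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (city TEXT, amount INT);
INSERT INTO orders VALUES
  ('Pune', 100), ('Pune', 300), ('Delhi', 50), ('Delhi', 60), ('Goa', 500);
""")
rows = conn.execute("""
    SELECT city, SUM(amount) AS total
    FROM orders
    WHERE amount >= 60          -- row filter: drops the 50 Delhi row before grouping
    GROUP BY city
    HAVING SUM(amount) > 100    -- group filter: keeps only cities with a large total
    ORDER BY total DESC
""").fetchall()
```

After the WHERE filter, Delhi's total is only 60, so HAVING removes it; Goa (500) and Pune (400) survive.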
Explain the difference between UNION and UNION ALL.
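The short answer: UNION deduplicates the combined result set, UNION ALL keeps every row (and is therefore cheaper). A minimal demonstration via SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# UNION removes duplicates across the combined result; UNION ALL keeps them.
union = conn.execute(
    "SELECT 1 UNION SELECT 1 UNION SELECT 2 ORDER BY 1").fetchall()
union_all = conn.execute(
    "SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 2 ORDER BY 1").fetchall()
```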
Implement a query to find the top 5 customers by total sales amount.
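A typical shape for this answer is aggregate, sort, limit. Sketch against an in-memory SQLite database with an invented `sales` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (customer_id INT, amount INT);
INSERT INTO sales VALUES (1,100),(1,200),(2,500),(3,50),(4,400),(5,10),(6,5);
""")
top5 = conn.execute("""
    SELECT customer_id, SUM(amount) AS total_sales
    FROM sales
    GROUP BY customer_id       -- one row per customer
    ORDER BY total_sales DESC  -- biggest spenders first
    LIMIT 5                    -- keep only the top 5
""").fetchall()
```

In an interview it is worth mentioning ties: `LIMIT 5` cuts arbitrarily among equal totals, whereas a `DENSE_RANK()`-based filter would keep all tied customers.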
What are primary keys and foreign keys? Why are they important?
What is a self-join, and when would you use it?
What is normalization and denormalization? When would you use each?
What is the difference between a clustered and non-clustered index?
What is the difference between a view and a materialized view?
What is the difference between DELETE and TRUNCATE?
Write an SQL query to find duplicate emails in a users table.
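The standard answer is GROUP BY plus a HAVING count filter. A runnable sketch with a made-up `users` table in SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INT, email TEXT);
INSERT INTO users VALUES (1,'a@x.com'), (2,'b@x.com'), (3,'a@x.com'), (4,'c@x.com');
""")
dupes = conn.execute("""
    SELECT email, COUNT(*) AS n
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1   -- keep only emails that appear more than once
""").fetchall()
```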
How would you implement a sliding window aggregation in Spark Structured Streaming?
Briefly introduce yourself and walk us through your journey as a Data Engineer so far.
Can you describe a situation where you had to work with a difficult stakeholder? How did you manage the situation and what was the outcome?
Describe a time when you had to work with a team to solve a complex problem. What was your role, and how did the team approach the problem?
Describe a time when you went above and beyond for a project or a customer.
Describe a time you had to learn a new technology quickly to solve a problem.
Describe a time you had to make a difficult decision with limited information.
Do you have any questions for us?
Give an example of a time you failed and what you learned from it.
How do you handle pressure and tight deadlines?
How do you stay updated with the latest trends and technologies in data engineering?
Tell me about a time you had to deal with a conflict in your team.
Tell me about a time you made a mistake and how you handled it.
What techniques do you use to balance compute costs and performance in cloud-based data solutions?
Calculate a 7-day moving average of orders for each city in the Swiggy database.
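A windowed average with a row frame is the usual approach, assuming one row per city per day (with gaps in the calendar, a `RANGE`-based frame or a date spine would be needed instead). Sketch in SQLite (window functions require SQLite 3.25+); the `daily_orders` table is invented sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_orders (city TEXT, order_date TEXT, orders INT);
INSERT INTO daily_orders VALUES
  ('Pune','2024-01-01',10), ('Pune','2024-01-02',20), ('Pune','2024-01-03',30);
""")
rows = conn.execute("""
    SELECT city, order_date,
           AVG(orders) OVER (
               PARTITION BY city                          -- independent window per city
               ORDER BY order_date
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW   -- current day + 6 prior = 7 days
           ) AS avg_7d
    FROM daily_orders
    ORDER BY city, order_date
""").fetchall()
```

Early rows average over fewer than 7 days (the frame simply has fewer rows available), which is usually the desired behavior for a running average.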
Describe a scenario where you had to optimize a slow-running data pipeline.
How do you clean missing values in a pandas DataFrame?
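The two standard moves are imputing (`fillna`) and dropping (`dropna`). A minimal pandas sketch on an invented DataFrame, assuming pandas is available:

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, None, 30.0],
                   "city":  ["Pune", "Delhi", None]})

# Impute a numeric column with its mean...
df["price"] = df["price"].fillna(df["price"].mean())
# ...but drop rows entirely when a key column is missing.
df = df.dropna(subset=["city"])
```

Which strategy is right depends on the column: imputation preserves row counts for downstream aggregates, while dropping avoids fabricating values for identifiers or keys.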
Write a script to automate daily ingestion of data from an API into a data lake.
Compare the star schema and snowflake schema. Which one would you use for reporting at Swiggy, and why?
Describe a situation where you prioritized business needs over technical elegance. How did you manage trade-offs?
How do you handle NULL values in a SQL query to avoid incorrect results?
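Two habits cover most of this answer: substitute defaults with COALESCE inside aggregates, and test for missing values with IS NULL (never `= NULL`, which is always unknown). A SQLite sketch with an invented `payments` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payments (id INT, tip INT);
INSERT INTO payments VALUES (1, 50), (2, NULL), (3, 30);
""")
# COALESCE replaces NULL with a default so the SUM isn't silently skewed.
total = conn.execute(
    "SELECT SUM(COALESCE(tip, 0)) FROM payments").fetchone()[0]
# IS NULL is the only correct way to test for missing values.
missing = conn.execute(
    "SELECT COUNT(*) FROM payments WHERE tip IS NULL").fetchone()[0]
```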
How do you secure sensitive customer data in a data warehouse?
How would you design a data model for an e-commerce platform?
Optimize a slow SQL query for a large orders table containing billions of rows.
What are Slowly Changing Dimensions (SCD), and how would you implement them for tracking customer data changes?
Write a SQL query to find the top 5 most ordered dishes in the last 30 days.
Write a query to identify duplicate customer entries based on email and phone number.
Compare HDFS and cloud-based storage systems in terms of scalability and performance.
Describe how you would use PySpark to aggregate and summarize large transaction datasets.
Describe the role of a workflow orchestrator like Airflow in a data pipeline.
Describe the stages of a Spark job and strategies to optimize Spark performance for large datasets.
Explain how Kafka handles real-time data streaming and guarantees message delivery.
Provide strategies for handling data deduplication and cleaning in Spark jobs.
Walk through how you would debug the data ingestion process to identify slow stages.
Design a data warehouse schema to track orders, customers, delivery partners, and payments.
Design a logging and monitoring solution for a mission-critical data pipeline.
Design a system to handle 1M daily transactions with real-time analytics for Swiggy.
Discuss trade-offs between serverless and traditional cloud data architectures.
Explain how you would design a pipeline for streaming real-time order status updates.