Question 1

Can Presto work with Near Real-Time Data (Streaming Data Source)?

Accepted Answer

**Why this matters**: Presto/Trino is an MPP query engine—designed for ad-hoc queries over *stored* data. **Native streaming**: No—Presto does not consume Kafka/Kinesis directly. **Near real-time options**: (1) Query frequently updated tables (e.g., Delta with 1-min micro-batches); (2) Kafka connector for *bounded* reads ( snapshot queries); (3) Materialized views refreshed on schedule. Latency = refresh interval + query time....

Question 2

Cluster Resource Allocation in Spark

Accepted Answer

**Why allocation matters**: Wrong config = OOM, stragglers, or wasted spend. **Typical allocation**: Driver: 4–8GB, 2–4 cores (scheduling, collect). Executors: 4–8 cores, 8–16GB each. Rule of thumb: 3–5 tasks per core. **Scalability trade-offs**: Too many small executors = overhead; too few = underutilization. Avoid >1 executor per node for HDFS locality. **Cost implications**: Over-provisioning = 20–50% waste; under = no savings if jobs retry. Use dynamic allocation for variable load....

Question 3

Difference between Presto vs. Spark underlying architecture

Accepted Answer

**Section 1 — The Context (The 'Why')**
Presto and Spark address fundamentally different workloads: ad-hoc interactive queries versus batch ETL and iterative processing. Confusing them leads to poor architecture—using Spark for sub-second BI queries wastes cluster spin-up time; using Presto for multi-stage ETL lacks state and fault tolerance....

Question 4

Onboarding Delta Lake Catalog to Presto

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Delta Lake to Presto: (1) Use Presto/Trino Delta connector. (2) Configure catalog: `connector.name=delta; delta.table-directory=s3://...`. (3) For Unity Catalog: Use Delta Sharing or external location. (4) Sync: Presto reads Delta log....

Question 5

Spark Optimizations: skewed joins, broadcast joins, Catalyst Optimizer, repartition vs coalesce

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Spark optimizations: Skewed joins—salting. Broadcast—small tables. Catalyst—predicate pushdown, CBO. Repartition—increase partitions; coalesce—decrease. Best practice: Use AQE; profile first.

**Scalability trade-offs**: Partition/parallelism limits; single points of failure; horizontal vs vertical scaling....

Question 6

Spark Tungsten & Catalyst Optimizer

Accepted Answer

**Why It Matters (Architectural Logic)**: Tungsten and Catalyst are the dual engines that separate Spark from naive MapReduce. Without them, Spark would suffer the same disk-bound, unoptimized fate as legacy Hadoop.

**Catalyst Optimizer**: Built on Scala's pattern matching and immutable trees. Performs (1) Logical optimization—constant folding, predicate pushdown, projection pruning, null propagation. (2) Physical planning—join strategy (broadcast vs....

Question 7

What is Avro file format & what is its significance in delta tables?

Accepted Answer

**Avro**: Row-based binary format; schema embedded; supports schema evolution. Common in Kafka, Hive.

**Delta and Avro**: Delta primarily uses Parquet. Avro is optional for compatibility—e.g., reading from Kafka (Avro), converting to Delta. `spark.read.format("avro")` for Avro files.

**Why Avro in Pipeline**: Kafka Schema Registry + Avro for evolution....

Question 8

Write code to read data from Delta Lake in S3 and perform upsert based on primary key

Accepted Answer

**Why It Matters (Architectural Logic)**: MERGE enables CDC and incremental loads—update existing, insert new. Partition pruning on merge key is critical for performance.

Delta Lake MERGE supports upsert by primary key. Read existing: `from delta.tables import DeltaTable; delta = DeltaTable.forPath(spark, "s3://bucket/table")`. Merge: `delta.alias("t").merge(new_df.alias("s"), "t.id = s.id").whenMatchedUpdate(set={"col": "s.col"}).whenNotMatchedInsertAll().execute()`....

Walmart Spark & Big Data Interview Questions

Difficulty Breakdown

Key Topics Covered

How to Use This Guide

Companies asking these questions

All 8 Questions

More Interview Prep Guides

Unlock All Expert Answers