Question 1

What is the difference between groupByKey and reduceByKey in Spark?

Accepted Answer

**groupByKey()**: Shuffles every (key, value) pair across the network so that all values for a key land on one partition. Network transfer is O(total_values). There is no map-side aggregation; you combine the values only after the shuffle, so memory and network costs are high.

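**reduceByKey(func)**: Applies func (e.g., sum) within each partition before the shuffle, so each partition sends at most one partial result per key. Shuffle volume is bounded by O(unique_keys × partitions) rather than O(total_values). The partial results are then merged across partitions with the same func, which must be associative and commutative.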

**Architectural Logic (Why reduceByKey Wins)**: Shuffle is the bottleneck....
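The difference in shuffle volume can be illustrated without a cluster. The following is a minimal pure-Python sketch, not Spark's implementation: the helper names `shuffle_group_by_key` and `shuffle_reduce_by_key` are hypothetical, each inner list stands in for one partition, and "records shuffled" is simply the number of pairs that would leave their partition.

```python
from collections import defaultdict
from operator import add


def shuffle_group_by_key(partitions):
    """Simulate groupByKey: every (key, value) record crosses the shuffle."""
    shuffled = [pair for part in partitions for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return dict(grouped), len(shuffled)


def shuffle_reduce_by_key(partitions, func):
    """Simulate reduceByKey: combine inside each partition, then merge."""
    shuffled = []
    for part in partitions:
        local = {}  # map-side partial result, at most one entry per key
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        shuffled.extend(local.items())
    merged = {}
    for key, value in shuffled:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged, len(shuffled)


# Two hypothetical partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1), ("c", 1)],
]

grouped, grouped_records = shuffle_group_by_key(partitions)
reduced, reduced_records = shuffle_reduce_by_key(partitions, add)
group_sums = {key: sum(values) for key, values in grouped.items()}

print(grouped_records, reduced_records)  # 8 5
```

Both paths produce the same sums, but the grouped path moves all 8 records while the reduced path moves 5 (two partial results from the first partition, three from the second). The gap widens as the number of values per key grows.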

Question 2

Implement a Spark job to find the top 10 most frequent words in a large text file.

Accepted Answer

Core logic: read text → split → explode → filter empty → groupBy → count → orderBy desc → limit 10. Code: from pyspark.sql import functions as F; df = spark.read.text("path/to/file.txt"); words = df.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word")); top10 = words.filter(F.length(F.col("word")) > 0).groupBy("word").count().orderBy(F.desc("count")).limit(10). **Why \s+**: Matches any run of whitespace, so it handles multiple spaces and tabs; more robust than splitting on a single space. Lines with leading or trailing whitespace can still yield empty tokens, which is why the empty-string filter is kept....
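To sanity-check the expected output on a small sample, the same pipeline can be mirrored in plain Python. This is a sketch under assumptions: it uses Python's `re` module rather than the Java regex engine behind Spark's `split`, and the function name `top_words` and the sample lines are hypothetical.

```python
import re
from collections import Counter


def top_words(lines, n=10):
    """Split on whitespace runs, drop empty tokens, count, keep the top n."""
    counts = Counter()
    for line in lines:
        tokens = re.split(r"\s+", line)
        counts.update(token for token in tokens if token)  # filter empty
    return counts.most_common(n)


# Hypothetical sample with double spaces, a tab, and padded whitespace.
sample = ["the  quick\tbrown fox", " the lazy dog ", "the fox"]

result = top_words(sample, n=2)
padded = re.split(r"\s+", " the lazy dog ")  # empty tokens at both ends
single_space = "the  quick".split(" ")       # empty token between words

print(result)        # [('the', 3), ('fox', 2)]
print(padded)        # ['', 'the', 'lazy', 'dog', '']
print(single_space)  # ['the', '', 'quick']
```

The last two lines show both failure modes: a single-space split leaves an empty token between words, and even `\s+` leaves empty tokens at the edges of a padded line.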

Question 3

Describe a custom EMR cluster configuration for Spark-based ETL with minimal cost.

Accepted Answer

**Why config matters**: The right instance mix can cut costs by 50–70%. **Minimal-cost config**: (1) Spot instances for workers, which are interruptible; (2) On-Demand for the driver, for reliability; (3) right-sized nodes: m5.xlarge or m5.2xlarge workers, with 1 driver and 2–10 workers; (4) S3 for data storage instead of EBS; (5) autoscaling. **Scalability trade-offs**: Spot capacity can be reclaimed at any time, so use checkpointing. **Cost implications**: Spot is ~70% cheaper; appropriate instance types....
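One way to express this configuration is as the keyword arguments for boto3's EMR `run_job_flow` call. The sketch below only builds the dictionary and never contacts AWS. Several values are assumptions rather than part of the answer: the cluster name, the release label `emr-6.15.0`, the default role names, the use of managed scaling for the "autoscaling" item, terminating the cluster when its steps finish, and modeling the workers as a single CORE instance group.

```python
def build_emr_cluster_config(worker_count=4, worker_type="m5.xlarge",
                             max_workers=10):
    """Build run_job_flow-style kwargs for a low-cost Spark ETL cluster."""
    if not 2 <= worker_count <= max_workers:
        raise ValueError("worker_count must be between 2 and max_workers")
    return {
        "Name": "spark-etl-minimal-cost",   # hypothetical name
        "ReleaseLabel": "emr-6.15.0",       # assumption: pick your own release
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {   # (2) On-Demand driver node for reliability
                    "Name": "driver",
                    "InstanceRole": "MASTER",
                    "Market": "ON_DEMAND",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 1,
                },
                {   # (1) and (3) right-sized Spot workers
                    "Name": "workers",
                    "InstanceRole": "CORE",
                    "Market": "SPOT",
                    "InstanceType": worker_type,
                    "InstanceCount": worker_count,
                },
            ],
            # Assumption: shut the cluster down once the ETL steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # (5) Assumption: managed scaling between 2 and max_workers nodes.
        "ManagedScalingPolicy": {
            "ComputeLimits": {
                "UnitType": "Instances",
                "MinimumCapacityUnits": 2,
                "MaximumCapacityUnits": max_workers,
            }
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumption: default roles
        "ServiceRole": "EMR_DefaultRole",
    }


config = build_emr_cluster_config(worker_count=4)
markets = {group["Name"]: group["Market"]
           for group in config["Instances"]["InstanceGroups"]}
print(markets)  # {'driver': 'ON_DEMAND', 'workers': 'SPOT'}
```

Item (4) does not appear in the dictionary because it is a property of the job itself: the Spark steps read from and write to S3 paths instead of cluster-local storage.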

Question 4

Explain how Glue's Spark-based architecture handles data parallelism.

Accepted Answer

**Section 1 — The Context (The 'Why')**
AWS Glue runs Spark jobs on serverless DPUs (data processing units); each DPU provides 4 vCPUs and 16 GB of RAM. Because Glue abstracts away cluster management, engineers often treat it as a black box and skip parallelism tuning. Small-file problems, incorrect partition counts, and wrong worker types can make jobs run 5–10x slower than necessary....
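A back-of-the-envelope calculation shows how partition count interacts with the DPU figures above. This is a rough sketch, not Glue's scheduler: it assumes one concurrent task per vCPU, one DPU reserved for the driver, and a 128 MB target partition size, none of which come from the answer. The function name `estimate_waves` is hypothetical.

```python
import math

VCPUS_PER_DPU = 4  # from the answer: each DPU provides 4 vCPUs


def estimate_waves(total_dpus, input_gb, target_partition_mb=128,
                   driver_dpus=1):
    """Estimate task slots, partition count, and scheduling waves."""
    executor_dpus = total_dpus - driver_dpus
    if executor_dpus < 1:
        raise ValueError("need at least one DPU left for executors")
    task_slots = executor_dpus * VCPUS_PER_DPU        # assumed 1 task per vCPU
    partitions = math.ceil(input_gb * 1024 / target_partition_mb)
    waves = math.ceil(partitions / task_slots)        # rounds of parallel tasks
    return task_slots, partitions, waves


slots, partitions, waves = estimate_waves(total_dpus=10, input_gb=100)
print(slots, partitions, waves)  # 36 800 23
```

With too few partitions (fewer than the 36 slots here) some vCPUs sit idle; with far too many, as happens with many small files, per-task overhead dominates.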

Question 5

Explain the benefits of auto-scaling policies in EMR.

Accepted Answer