Question 1

What is the difference between SparkSession and SparkContext in Spark?

Accepted Answer

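**SparkContext**: The original, low-level entry point for RDD operations (the main entry point in Spark 1.x, still present in later versions). Manages the cluster connection, configuration, and RDD creation. Only one SparkContext can be active per JVM.

**SparkSession** (Spark 2.0+): Unified entry point that wraps a SparkContext (exposed as `spark.sparkContext`) and replaces SQLContext and HiveContext. Provides the DataFrame, Dataset, SQL, and Structured Streaming APIs; the legacy DStream API still requires a separate StreamingContext....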

Question 2

Code a simple PySpark job to read a JSON file, filter records, and write output in Parquet format.

Accepted Answer

**Production-grade example** (with schema, error handling):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("json_to_parquet").getOrCreate()

# Provide schema to avoid inference cost on large reads
df = spark.read.schema("id INT, status STRING, amount DOUBLE") \
    .json("s3://bucket/input/*.json")

filtered = df.filter((col("status") == "active") & (col("amount") > 0))

filtered.write.mode("overwrite") \
    .parquet("...
```

Question 3

Explain a scenario-based question on Spark optimization and how you would troubleshoot performance issues.

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Troubleshooting Spark performance: (1) Identify the bottleneck: check the Spark UI for long-running stages, skewed task times, or spill; (2) Skew: salt hot keys (adding partitions alone does not split a single hot key); (3) Spill: increase executor memory or use more, smaller partitions; (4) Slow shuffles: broadcast the small side of a join, coalesce after a selective filter; (5) Small files: use repartition/coalesce before write,...
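
Step (1) can be made concrete with a quick check on per-task durations copied from a stage's task table in the Spark UI. The sketch below is plain Python, not a Spark API; the durations and the 3x threshold are made-up assumptions for illustration.

```python
from statistics import median


def skew_ratio(task_durations_s):
    """Ratio of the slowest task to the median task in one stage."""
    return max(task_durations_s) / median(task_durations_s)


# Hypothetical task durations (seconds) for a single stage.
stage_tasks = [12.0, 11.5, 13.0, 12.5, 140.0]

ratio = skew_ratio(stage_tasks)

# Hypothetical rule of thumb: flag the stage when one task takes
# more than 3x the median task.
is_skewed = ratio > 3.0

print(f"max/median = {ratio:.1f}, skewed = {is_skewed}")
```

A high ratio points toward skew (one hot key or partition); uniformly slow tasks point toward spill or an undersized cluster instead.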

Question 4

Explain repartition vs. coalesce. Which one would you use to reduce shuffle operations?

Accepted Answer

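`repartition(n)` produces n partitions via a full shuffle and can increase or decrease the partition count; `coalesce(n)` merges existing partitions through a narrow dependency, so it avoids a full shuffle but can only decrease the count and may leave partitions unevenly sized. Use coalesce to reduce partitions (e.g., after a filter). Use repartition to increase partitions, rebalance data, or partition by a column. To reduce shuffle: coalesce. Example: `df.filter(...).coalesce(10)` to shrink after a filter....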
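
The difference can be seen in a small plain-Python model, where each partition is a list of records. This is only a sketch of the idea, not Spark's implementation: the function names mirror the DataFrame methods, but the merge and round-robin rules here are simplifying assumptions.

```python
from typing import List

Partition = List[int]


def coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Merge whole partitions into at most n groups; no record is redistributed."""
    if n >= len(partitions):
        # In this model, as in Spark, coalesce never increases the count.
        return [list(p) for p in partitions]
    merged: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Each input partition is appended wholesale to one output partition.
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Redistribute every record round-robin across n partitions (full shuffle)."""
    out: List[Partition] = [[] for _ in range(n)]
    i = 0
    for part in partitions:
        for record in part:
            out[i % n].append(record)
            i += 1
    return out


# Four unevenly sized partitions, e.g. what a selective filter leaves behind.
after_filter = [[1, 2, 3, 4, 5, 6], [7], [8], [9, 10]]

print([len(p) for p in coalesce(after_filter, 2)])     # [7, 3]
print([len(p) for p in repartition(after_filter, 2)])  # [5, 5]
```

In the model, coalesce is cheap but inherits the imbalance, while repartition moves every record and evens the sizes out. That is the trade-off to weigh before a write.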

Question 5

How did you handle data ingestion and processing for large datasets?

Accepted Answer

**Situation**: Faced competing demands—multiple pipelines, stakeholders, deadlines. **Task**: Deliver impact while maintaining quality and preventing burnout. **Action**: (1) Prioritized by business impact and SLA risk. (2) Used ROI (value/time); WIP limits; timeboxing. (3) Communicated trade-offs—'Adding X pushes Y by N days.' (4) Maintained backlog with tech-debt capacity. **Result**: Shipped on time; zero incidents; stakeholder alignment on deferrals....

Question 6

How does Spark's Catalyst Optimizer improve query performance?

Accepted Answer

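Catalyst Optimizer improves performance via: (1) Logical optimization: predicate pushdown, constant folding, column (projection) pruning. (2) Physical planning: join strategy selection (broadcast hash, sort-merge), partition pruning. (3) Code generation: whole-stage codegen, which fuses operators into a single generated function for faster execution. Example: for `df.filter('x>0').select('a')`, Catalyst pushes the filter down to the source where the source supports it (e.g., Parquet) and reads only the columns the query needs....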
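
Constant folding, one of the logical rules named above, is easy to show on a toy expression tree. The classes and the `fold_constants` function below are hypothetical names for illustration in plain Python; Catalyst's own rules are written in Scala and operate on much richer plans.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Lit, Col, Add]


def fold_constants(e: Expr) -> Expr:
    """Rewrite rule: replace Add(Lit, Lit) with a single Lit, bottom-up."""
    if isinstance(e, Add):
        left, right = fold_constants(e.left), fold_constants(e.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            return Lit(left.value + right.value)
        return Add(left, right)
    return e  # literals and column references are left unchanged


# x + (1 + 2) is rewritten once at planning time to x + 3,
# so the addition of the two literals is not repeated per row.
expr = Add(Col("x"), Add(Lit(1), Lit(2)))
folded = fold_constants(expr)
print(folded)
```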

Question 7

What is the salting technique, and when would you use it?

Accepted Answer

**Salting**: Add random suffix to skewed keys. key → key_1, key_2, ... key_N. Distributes load across partitions.

**When**: One/few keys dominate (e.g., null, default, popular tenant). Causes stragglers.

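**Steps**: Salt key → shuffle (now distributed) → aggregate/join → merge salted groups. For a join, the other side must be replicated once per salt value so every salted key still finds its match.

**Why**: Removes the straggler task so the stage can finish in reasonable time. Without salting, the task that holds the hot key can run many times longer than the others (e.g., 10x).

**Trade-offs**: Extra shuffle and merge step. Salt cardinality (N) has to be tuned....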
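
The aggregation case can be sketched in plain Python as a two-stage count. Everything here is an illustrative assumption rather than Spark code: `partition_of` is a toy stand-in for a hash partitioner, `NUM_SALTS` is an arbitrary salt cardinality, and the salt is assigned round-robin (instead of randomly) to keep the example deterministic.

```python
from collections import Counter

NUM_PARTITIONS = 4
NUM_SALTS = 4  # hypothetical salt cardinality; tune per workload


def partition_of(key: str) -> int:
    # Toy deterministic partitioner standing in for a hash partitioner.
    return sum(key.encode()) % NUM_PARTITIONS


def load_per_partition(keys):
    load = Counter(partition_of(k) for k in keys)
    return [load[p] for p in range(NUM_PARTITIONS)]


# One hot key dominates the dataset.
records = [("hot", 1)] * 1000 + [("a", 1)] * 10 + [("b", 1)] * 10

# Without salting, every "hot" record lands in the same partition.
unsalted_load = load_per_partition(k for k, _ in records)

# Step 1: salt the key (key -> key_0 ... key_{N-1}).
salted = [(f"{k}_{i % NUM_SALTS}", v) for i, (k, v) in enumerate(records)]
salted_load = load_per_partition(k for k, _ in salted)

# Step 2: partial aggregation per salted key (the distributed stage).
partial = Counter()
for salted_key, value in salted:
    partial[salted_key] += value

# Step 3: strip the salt and merge the partial results.
totals = Counter()
for salted_key, subtotal in partial.items():
    original_key = salted_key.rsplit("_", 1)[0]
    totals[original_key] += subtotal

print("records per partition, unsalted:", unsalted_load)
print("records per partition, salted:  ", salted_load)
print("final counts:", dict(totals))
```

The final counts match an unsalted aggregation; only the intermediate distribution changes, at the cost of the second merge pass.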

American Express Spark & Big Data Interview Questions