Question 1

What is the difference between SparkSession and SparkContext in Spark?

Accepted Answer

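**SparkContext** (the main entry point in Spark 1.x, still present in later versions): Low-level entry point for RDD operations. Manages the cluster connection, configuration, and RDD creation. Only one SparkContext can be active per JVM. It exposes the RDD API only; in 1.x, DataFrame and SQL work required a separate SQLContext or HiveContext.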

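**SparkSession** (Spark 2.0+): Unified entry point that wraps a SparkContext (available as `spark.sparkContext`) and replaces SQLContext and HiveContext. It does not subsume StreamingContext: DStream jobs still create one from a SparkContext. Provides DataFrame, Dataset, SQL, and Structured Streaming APIs....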
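The relationship between the two entry points can be sketched in PySpark. This is a minimal illustration, assuming PySpark is installed and a local master is acceptable; the function name, app name, and sample data are hypothetical and not part of the original answer.

```python
def describe_entry_points():
    """Build a local SparkSession and use both the RDD and DataFrame APIs."""
    # Imported lazily so this file can be loaded without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")           # assumption: run locally on one thread
        .appName("entry-point-demo")  # hypothetical application name
        .getOrCreate()
    )

    # The session owns a SparkContext; RDD work still goes through it.
    sc = spark.sparkContext
    rdd_total = sc.parallelize([1, 2, 3]).sum()

    # DataFrame and SQL work goes through the session itself.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "tag"])
    row_count = df.count()

    spark.stop()
    return rdd_total, row_count


if __name__ == "__main__":
    print(describe_entry_points())
```

In application code, the usual pattern is to create only the SparkSession and reach the SparkContext through it when an RDD-level call is needed.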

Question 2

Discuss the data size challenges in your previous projects. How did you optimize storage and processing?

Accepted Answer

**Situation:** I led data platform optimization for a 50TB+ analytical workload where queries routinely timed out (15+ min), blocking downstream reporting and analytics teams. The warehouse was partition-unaware and stored raw JSON, driving 3x storage costs and full-table scans.

**Task:** Reduce query latency to sub-minute P95 while cutting storage spend by at least 25%, without disrupting existing pipelines or introducing new vendor lock-in....

Question 3

What were the biggest infrastructure-level challenges you faced, and how did you resolve them?

Accepted Answer

Situation: Infra challenges. Task: Describe and resolve. Action: Provisioning→IaC; cost→tagging, reserved instances; security→encryption, audits; DR→multi-region; observability→logging, metrics. Partnered with platform/security; automation; runbooks....

Question 4

Why do you want to join American Express?

Accepted Answer

Situation: Company-specific. Task: Research-based. Action: Fintech leadership, payments/analytics/risk at scale. Engineering culture, data investment. Trust and reliability. Critical pipelines. Talent development. Fit with experience....

Question 5

What are your strengths, and how do they align with the Data Engineer role?

Accepted Answer

Map strengths to DE. 'Systems thinking—design for scale and failure. Ownership—projects design to prod. Collaboration—analytics, product. Learning—adopted [tools] quickly. For this role: systems→architecture; ownership→reliability; collaboration→stakeholders.' 3–4 strengths with examples and alignment.

Question 6

Create a Python program to demonstrate the use of set operations (union, intersection).

Accepted Answer

**Ops:** `a | b` union, `a & b` intersection, `a - b` difference, `a ^ b` symmetric difference. Methods: `a.union(b)`, `a.intersection(b)`, `a.difference(b)`, `a.symmetric_difference(b)`. **Why:** Dedup, overlap, record matching. **Production:** Set-based joins for large datasets....
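A short program covering the operations listed above might look like the following. It is a sketch: the function names, the `match_records` helper, and the sample IDs are illustrative assumptions, chosen to show the record-matching use case.

```python
def set_operations(a, b):
    """Return the four basic set operations for two sets."""
    return {
        "union": a | b,                 # elements in either set
        "intersection": a & b,          # elements in both sets
        "difference": a - b,            # elements in a but not in b
        "symmetric_difference": a ^ b,  # elements in exactly one set
    }


def match_records(source_ids, target_ids):
    """Hypothetical reconciliation helper: compare IDs from two systems."""
    source, target = set(source_ids), set(target_ids)  # set() also deduplicates
    return {
        "matched": source & target,
        "missing_in_target": source - target,
        "unexpected_in_target": target - source,
    }


if __name__ == "__main__":
    for name, value in set_operations({1, 2, 3, 4}, {3, 4, 5, 6}).items():
        print(f"{name}: {sorted(value)}")
    print(match_records([101, 102, 102, 103], [102, 103, 104]))
```

One design point worth mentioning in an interview: the operators require both operands to be sets, while the methods accept any iterable, so `a.union([5, 6])` works but `a | [5, 6]` raises `TypeError`.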

Question 7

Describe Spark's memory management model. How do you handle heap memory overhead issues?

Accepted Answer

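**Model:** Each executor's heap has a unified region shared by execution (shuffles, joins, sorts, aggregations) and storage (cached data), with a soft boundary between the two. Off-heap memory can be enabled for large workloads to reduce GC pressure. **OOM fixes:** More executor memory or, when the container itself is killed, a larger `spark.executor.memoryOverhead`; more (smaller) partitions so each task holds less data; avoid `collect()` on large results; broadcast the small side of joins; repartition to even out skew. **Tune:** `spark.memory.fraction`, `spark.memory.storageFraction`. **Why:** GC and spill hurt performance....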
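To make the split concrete, the arithmetic behind the memory regions can be sketched in plain Python. This is a back-of-the-envelope model, not Spark code. It assumes the defaults documented for recent Spark versions: 300 MiB of reserved heap, `spark.memory.fraction` of 0.6, `spark.memory.storageFraction` of 0.5, and a memory overhead of 10% of the heap with a 384 MiB floor. Verify these against the version you run; the function names are hypothetical.

```python
RESERVED_MIB = 300  # assumption: fixed reserved heap in the unified memory manager


def unified_memory_regions(heap_mib, memory_fraction=0.6, storage_fraction=0.5):
    """Estimate how an executor heap of heap_mib MiB is divided."""
    usable = heap_mib - RESERVED_MIB
    if usable <= 0:
        raise ValueError("heap must be larger than the reserved memory")
    unified = usable * memory_fraction            # shared by execution and storage
    storage_protected = unified * storage_fraction  # cache that execution cannot evict
    return {
        "unified": unified,
        "storage_protected": storage_protected,
        "execution_initial": unified - storage_protected,
        "user": usable - unified,                 # user data structures, metadata
    }


def container_request_mib(heap_mib, overhead_factor=0.10, min_overhead_mib=384):
    """Estimate heap plus off-heap overhead requested from the cluster manager."""
    overhead = max(min_overhead_mib, heap_mib * overhead_factor)
    return heap_mib + overhead


if __name__ == "__main__":
    print(unified_memory_regions(4096))
    print(container_request_mib(4096))
```

Running the numbers this way shows why a 4 GiB executor has far less than 4 GiB for shuffles and caching, and why the container request is larger than the configured heap.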

Question 8

Explain the difference between mutable and immutable objects in Python.

Accepted Answer

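**Mutable:** list, dict, set: can be changed in place. **Immutable:** tuple, str, int: cannot be changed after creation; operations return new objects. **Example:** for a list `a`, `a.append(3)` mutates it, while for a tuple `t`, `t.append(3)` raises `AttributeError`. **Hashable:** Immutable built-ins can serve as dict keys (a tuple only if all of its elements are hashable). **Why:** Thread safety; caching....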
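The distinction is easiest to see by running it. The following sketch uses hypothetical helper names and shows in-place mutation through an alias, the failed tuple `append`, and how hashability decides what can be a dict key.

```python
def append_via_alias():
    """Mutating a list through a second name changes the one shared object."""
    nums = [1, 2]
    alias = nums     # no copy: both names refer to the same list
    alias.append(3)  # in-place change, visible through nums as well
    return nums, alias is nums


def try_tuple_append():
    """Tuples have no append method, so the call fails."""
    t = (1, 2)
    try:
        t.append(3)
    except AttributeError as exc:
        return type(exc).__name__
    return "no error"


def is_hashable(obj):
    """Return True if obj can be used as a dict key or set member."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


if __name__ == "__main__":
    print(append_via_alias())
    print(try_tuple_append())
    for candidate in [(1, 2), "abc", 42, [1, 2], (1, [2])]:
        print(repr(candidate), is_hashable(candidate))
```

The last candidate is the useful edge case: a tuple is immutable, but it is only hashable when everything inside it is hashable too.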

American Express Data Engineer Interview Questions
