Real interview questions asked at American Express. Practice the most frequently asked questions and land your next role.
American Express data engineering interviews test your ability across multiple domains. These questions are sourced from real American Express interview experiences and sorted by frequency, so practice the ones that matter most. The set leans toward senior-level depth (13 of the 27 are tagged hard). Recurring themes are joins, partitioning, and Spark; these patterns appear most often in real interviews and reward the deepest preparation. Many of these questions also surface at Altimetrik and Citi, so the preparation transfers across companies. The average answer takes about a minute to read, so plan roughly an hour to work through the full set thoughtfully.
This collection contains 27 curated questions: 7 easy, 7 medium, and 13 hard. The distribution skews toward harder problems, reflecting the depth expected in senior-level interviews.
The most frequently tested areas in this set are joins (12), partitioning (12), Spark (10), optimization (8), Python (6), and SQL (4). Focusing on these topics gives you the highest return on your preparation time.
Start with the easy questions to warm up and solidify fundamentals. Medium-difficulty questions form the bulk of real interviews — spend the most time here and practice explaining your reasoning out loud. Hard questions often appear in senior and staff-level rounds; attempt them after you're comfortable with the basics. For each question, try answering before revealing the solution. Use our AI Mock Interview to simulate real interview conditions and get instant feedback on your responses.
What is the difference between SparkSession and SparkContext in Spark?
Discuss the data size challenges in your previous projects. How did you optimize storage and processing?
What were the biggest infrastructure-level challenges you faced, and how did you resolve them?
Why do you want to join American Express?
What are your strengths, and how do they align with the Data Engineer role?
Create a Python program to demonstrate the use of set operations (union, intersection).
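One possible answer, as a minimal sketch; the example sets are arbitrary:

```python
# Two example sets with some overlap.
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}

union = a | b          # every element from either set
intersection = a & b   # elements present in both sets
difference = a - b     # elements in a but not in b
symmetric = a ^ b      # elements in exactly one of the two sets

print(union)         # {1, 2, 3, 4, 5, 6}
print(intersection)  # {3, 4}
```

The operators have method equivalents (`a.union(b)`, `a.intersection(b)`), which also accept any iterable, not just sets.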
Describe Spark's memory management model. How do you handle heap memory overhead issues?
Explain the difference between mutable and immutable objects in Python.
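A short demonstration of the distinction, using `id()` to show object identity; the specific values are arbitrary:

```python
# Immutable: tuples (like str and int) cannot change; "modifying" one
# rebinds the name to a brand-new object.
t = (1, 2)
old_id = id(t)
t = t + (3,)              # builds a new tuple; the old one is untouched
assert id(t) != old_id

try:
    t[0] = 99             # in-place assignment is not allowed
except TypeError:
    print("tuples are immutable")

# Mutable: lists (like dict and set) change in place, keeping identity.
lst = [1, 2, 3]
lst_id = id(lst)
lst.append(4)             # mutates the existing object
assert id(lst) == lst_id
```

This distinction matters for hashability (only immutable objects can be dict keys or set members) and for the classic mutable-default-argument pitfall.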
Explain the differences between multiprocessing and multithreading.
Implement a Python function to count unique words from a file and write them to another file.
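A minimal sketch of one possible answer. The splitting rules (whitespace split, lowercasing, sorted output) are assumptions the interviewer may want you to state explicitly; the temp-file demo is just for illustration:

```python
import tempfile

def write_unique_words(src_path, dst_path):
    """Read src_path, collect unique words (case-insensitive, whitespace-split),
    and write them one per line, sorted, to dst_path. Returns the count."""
    with open(src_path) as f:
        words = set(f.read().lower().split())
    with open(dst_path, "w") as f:
        for word in sorted(words):
            f.write(word + "\n")
    return len(words)

# Demo with a temporary input file (paths are illustrative).
src = tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt")
src.write("the quick brown fox the quick")
src.close()
dst_path = src.name + ".out"
count = write_unique_words(src.name, dst_path)
print(count)  # 4
```

For files too large for memory, you would instead read line by line and, at extreme scale, mention an external-sort or distributed approach.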
Write a decorator function to log the execution time of a function.
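One idiomatic sketch, logging to stdout for simplicity (a real pipeline would use the `logging` module):

```python
import functools
import time

def log_time(func):
    """Decorator that logs how long each call to func takes."""
    @functools.wraps(func)  # preserve func's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.6f}s")
        return result
    return wrapper

@log_time
def slow_sum(n):
    return sum(range(n))

total = slow_sum(1_000)
print(total)  # 499500
```

Mentioning `functools.wraps` is worth a point in interviews: without it, the decorated function's `__name__` and docstring are replaced by the wrapper's.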
Describe a scenario where you used Databricks for real-time data processing.
Describe a cross-team data project where you had to align architectural boundaries, ownership, and SLAs. How did you handle conflicting priorities, technical debt, and the scalability of communication as the number of stakeholders grew?
Implement a recursive query for hierarchy (employee-manager). Explain the termination guarantees, depth limits, and when a recursive CTE becomes a scalability bottleneck. What alternatives exist for graph-scale hierarchies in Spark or a data lake?
Explain bloom filters in Spark: how they reduce I/O and when they introduce false positives that hurt performance. What are the scalability and cost implications of enabling dynamic partition pruning and bloom filter pushdown at petabyte scale?
Given a table of sales data, use window functions to calculate a running total.
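A sketch of the standard answer. The table and column names are hypothetical, and an in-memory SQLite database stands in for the warehouse; the `SUM() OVER` window syntax shown is the same one most engines accept:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("2024-01-01", 100.0),
    ("2024-01-02", 50.0),
    ("2024-01-03", 25.0),
])

# Running total: sum of all rows up to and including the current row,
# ordered by date.
rows = conn.execute("""
    SELECT sale_date,
           amount,
           SUM(amount) OVER (
               ORDER BY sale_date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total
    FROM sales
    ORDER BY sale_date
""").fetchall()
print(rows)
```

Be ready to add `PARTITION BY` (e.g. per region or product) and to explain why `ROWS` rather than the default `RANGE` frame avoids surprises when the ordering column has ties.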
How do you handle schema evolution in data lakes or data warehouses?
How would you optimize a query with multiple joins and subqueries?
Write a query to find the first number repeating consecutively three times in a sequence.
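One common approach uses `LAG` to compare each row with the two before it; the sketch below assumes an explicit ordering column (`id`), since "consecutive" is undefined without one. Table and column names are hypothetical, with in-memory SQLite standing in for the target engine:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nums (id INTEGER PRIMARY KEY, n INTEGER)")
conn.executemany("INSERT INTO nums (n) VALUES (?)",
                 [(1,), (1,), (2,), (2,), (2,), (3,), (3,), (3,)])

# A row where n equals both of the two preceding values marks the end of
# a run of three; take the earliest such row.
row = conn.execute("""
    WITH runs AS (
        SELECT id,
               n,
               LAG(n, 1) OVER (ORDER BY id) AS prev1,
               LAG(n, 2) OVER (ORDER BY id) AS prev2
        FROM nums
    )
    SELECT n
    FROM runs
    WHERE n = prev1 AND n = prev2
    ORDER BY id
    LIMIT 1
""").fetchone()
print(row[0])  # 2
```

A self-join on `id`, `id - 1`, and `id - 2` is an equivalent answer worth mentioning if the interviewer's engine lacks window functions.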
Code a simple PySpark job to read a JSON file, filter records, and write output in Parquet format.
Explain a scenario-based question on Spark optimization and how you would troubleshoot performance issues.
Explain repartition vs. coalesce. Which one would you use to reduce shuffle operations?
How did you handle data ingestion and processing for large datasets?
How does Spark's Catalyst Optimizer improve query performance?
What is the salting technique, and when would you use it?
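Salting mitigates data skew by appending a random suffix to a hot key so its rows spread across several partitions; you aggregate on the salted key first, then strip the salt and combine. The pure-Python sketch below shows only the two-stage idea, not actual Spark code, and the key names and salt count are illustrative:

```python
import random
from collections import Counter

# Skewed input: "hot" dominates and would land on a single partition.
records = [("hot", 1)] * 9 + [("cold", 1)]

NUM_SALTS = 3

def salt_key(key):
    """Append a random salt so one hot key fans out to NUM_SALTS sub-keys."""
    return f"{key}_{random.randrange(NUM_SALTS)}"

# Stage 1: partial aggregation on salted keys (spreads the hot key's work).
partial = Counter()
for key, value in records:
    partial[salt_key(key)] += value

# Stage 2: strip the salt and combine the partials into the final result.
final = Counter()
for salted_key, value in partial.items():
    final[salted_key.rsplit("_", 1)[0]] += value

print(dict(final))  # {'hot': 9, 'cold': 1}
```

In Spark you would express stage 1 as a `groupBy` on the salted column and stage 2 as a second `groupBy` on the original key; for skewed joins, the small side is duplicated once per salt value instead.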
Describe the architecture of an ETL pipeline you built in your previous project.
How do you ensure data quality and consistency in your pipelines?
Get full access to 1,800+ expert answers, AI mock interviews, and personalized progress tracking.