Interview questions
Preparing for a data engineering interview at American Express? This page contains 27 real interview questions sourced from verified American Express interview experiences. Questions are sorted by frequency — the ones asked most often appear first.
American Express data engineering interviews typically focus on SQL, Spark/Big Data, and Python/Coding. The interview bar skews toward harder problems (13 hard vs. 7 easy), which suggests an emphasis on depth and system-level thinking.
For each question, attempt your own answer first.
What is the difference between SparkSession and SparkContext in Spark?
Discuss the data size challenges in your previous projects. How did you optimize storage and processing?
What were the biggest infrastructure-level challenges you faced, and how did you resolve them?
Why do you want to join American Express?
What are your strengths, and how do they align with the Data Engineer role?
Create a Python program to demonstrate the use of set operations (union, intersection).
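One possible sketch of an answer, using Python's built-in `set` operators. The function name `describe_sets` and the sample values are illustrative assumptions, not part of the original question.

```python
def describe_sets(a, b):
    """Return the union and intersection of two iterables as sets."""
    set_a, set_b = set(a), set(b)
    return {
        "union": set_a | set_b,         # elements in either input
        "intersection": set_a & set_b,  # elements in both inputs
    }


# Hypothetical sample data for illustration only.
result = describe_sets([1, 2, 3, 4], [3, 4, 5])
print(sorted(result["union"]))
print(sorted(result["intersection"]))
```

The method forms `set_a.union(set_b)` and `set_a.intersection(set_b)` are equivalent, and they also accept arbitrary iterables rather than only sets.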
Describe Spark's memory management model. How do you handle heap memory overhead issues?
Explain the difference between mutable and immutable objects in Python.
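A short illustration can anchor the explanation. The sketch below contrasts a list (mutable) with a string (immutable). The helper names `append_item` and `add_suffix` are made up for this example.

```python
def append_item(items, value):
    # Lists are mutable: append changes the caller's object in place.
    items.append(value)
    return items


def add_suffix(text, suffix):
    # Strings are immutable: += builds a new string and rebinds the
    # local name, so the caller's original string is unchanged.
    text += suffix
    return text


original_list = [1, 2]
same_list = append_item(original_list, 3)

original_text = "abc"
new_text = add_suffix(original_text, "d")

print(original_list, same_list is original_list)
print(original_text, new_text)
```

A common follow-up is the mutable default argument pitfall (`def f(x, acc=[])`), which stems from the same in-place behavior.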
Explain the differences between multiprocessing and multithreading.
Implement a Python function to count unique words from a file and write them to another file.
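The question leaves some details open, so the sketch below makes its assumptions explicit: words are case-insensitive runs of letters, digits, and apostrophes, and the output file holds one `word<TAB>count` line per unique word in alphabetical order. The function name `write_unique_word_counts` is hypothetical.

```python
import re
from collections import Counter
from pathlib import Path


def write_unique_word_counts(src, dst):
    """Count each distinct word in src, write the counts to dst,
    and return the number of unique words."""
    text = Path(src).read_text(encoding="utf-8").lower()
    # Assumed tokenization: lowercase runs of letters, digits, apostrophes.
    counts = Counter(re.findall(r"[a-z0-9']+", text))
    body = "".join(f"{word}\t{count}\n" for word, count in sorted(counts.items()))
    Path(dst).write_text(body, encoding="utf-8")
    return len(counts)
```

This version reads the whole file into memory. For inputs that do not fit in memory, iterating over the file line by line and updating the `Counter` incrementally would be the usual adjustment.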
Write a decorator function to log the execution time of a function.
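A minimal sketch of one common answer, assuming the standard `logging` module is an acceptable logging target. The names `log_execution_time` and `slow_add` are illustrative.

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)


def log_execution_time(func):
    """Log how long each call to func takes, in seconds."""

    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if func raises, so failed calls are timed too.
            elapsed = time.perf_counter() - start
            logger.info("%s took %.6f s", func.__name__, elapsed)

    return wrapper


@log_execution_time
def slow_add(a, b):
    time.sleep(0.01)
    return a + b
```

To see the messages, configure logging first, for example with `logging.basicConfig(level=logging.INFO)`. `time.perf_counter()` is used rather than `time.time()` because it is monotonic and intended for measuring short durations.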
Describe a scenario where you used Databricks for real-time data processing.
Describe a cross-team data project where you had to align architectural boundaries, ownership, and SLAs. How did you handle conflicting priorities, technical debt, and the scalability of communication as the number of stakeholders grew?
Implement a recursive query for hierarchy (employee-manager). Explain the termination guarantees, depth limits, and when a recursive CTE becomes a scalability bottleneck. What alternatives exist for graph-scale hierarchies in Spark or a data lake?
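The query part can be sketched with a recursive CTE. The example below runs it through Python's built-in `sqlite3` module so it is self-contained. The `employees` table layout, the `org_chart` helper, and the `max_depth` guard are assumptions made for illustration. Recursive CTE syntax and recursion limits vary by engine.

```python
import sqlite3

HIERARCHY_SQL = """
WITH RECURSIVE chain(id, name, manager_id, depth) AS (
    -- Anchor: employees with no manager are the roots.
    SELECT id, name, manager_id, 0
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    -- Recursive step: attach direct reports of rows found so far.
    SELECT e.id, e.name, e.manager_id, c.depth + 1
    FROM employees AS e
    JOIN chain AS c ON e.manager_id = c.id
    WHERE c.depth < :max_depth  -- explicit depth limit as a safety guard
)
SELECT name, depth FROM chain ORDER BY depth, name
"""


def org_chart(rows, max_depth=100):
    """rows: iterable of (id, name, manager_id) tuples."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE employees "
            "(id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
        )
        conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)
        return conn.execute(HIERARCHY_SQL, {"max_depth": max_depth}).fetchall()
    finally:
        conn.close()


# Hypothetical sample hierarchy.
staff = [(1, "Ava", None), (2, "Ben", 1), (3, "Cy", 2), (4, "Dee", 2)]
print(org_chart(staff))
```

On acyclic data the recursion ends once a step produces no new rows. The depth guard bounds the work if the data unexpectedly contains a cycle.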
Explain bloom filters in Spark: how they reduce I/O and when they introduce false positives that hurt performance. What are the scalability and cost implications of enabling dynamic partition pruning and bloom filter pushdown at petabyte scale?
Given a table of sales data, use window functions to calculate a running total.
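A possible answer uses `SUM(...) OVER (ORDER BY ...)`. The sketch below runs the query through Python's built-in `sqlite3` module, assuming the bundled SQLite is version 3.25 or later (the first release with window functions). The `sales` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

RUNNING_TOTAL_SQL = """
SELECT sale_date,
       amount,
       SUM(amount) OVER (
           ORDER BY sale_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total
FROM sales
ORDER BY sale_date
"""


def running_totals(rows):
    """rows: iterable of (sale_date, amount) tuples."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (sale_date TEXT, amount INTEGER)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        return conn.execute(RUNNING_TOTAL_SQL).fetchall()
    finally:
        conn.close()


totals = running_totals(
    [("2024-01-01", 100), ("2024-01-02", 50), ("2024-01-03", 25)]
)
print(totals)
```

The explicit `ROWS` frame matters when the ordering column has ties: the default `RANGE` frame would give tied rows the same running total. Adding `PARTITION BY` (for example per customer or region) restarts the total for each group.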
How do you handle schema evolution in data lakes or data warehouses?
How would you optimize a query with multiple joins and subqueries?
Write a query to find the first number repeating consecutively three times in a sequence.
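One way to approach this is to compare each row with the next two rows using `LEAD`. The sketch below assumes a table `numbers(id, num)` where `id` defines the sequence order. Both names are hypothetical, as is the `first_triple` helper. It runs through Python's built-in `sqlite3` module and needs SQLite 3.25 or later for window functions.

```python
import sqlite3

FIRST_TRIPLE_SQL = """
SELECT num
FROM (
    SELECT id,
           num,
           LEAD(num, 1) OVER (ORDER BY id) AS next1,
           LEAD(num, 2) OVER (ORDER BY id) AS next2
    FROM numbers
)
WHERE num = next1 AND num = next2
ORDER BY id
LIMIT 1
"""


def first_triple(values):
    """Return the first value appearing three times in a row, or None."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE numbers (id INTEGER PRIMARY KEY, num INTEGER)")
        conn.executemany(
            "INSERT INTO numbers (id, num) VALUES (?, ?)",
            enumerate(values, start=1),
        )
        row = conn.execute(FIRST_TRIPLE_SQL).fetchone()
        return row[0] if row else None
    finally:
        conn.close()


print(first_triple([1, 1, 2, 2, 2, 3, 3, 3]))
```

Near the end of the sequence `LEAD` returns NULL, and a comparison with NULL is never true, so those rows drop out without special handling.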
Code a simple PySpark job to read a JSON file, filter records, and write output in Parquet format.