Real interview questions asked at American Express. Practice the most frequently asked questions and land your next role.
American Express data engineering interviews test your ability across multiple domains. These questions are sourced from real American Express interview experiences and sorted by frequency. Practice the ones that matter most.
What is the difference between SparkSession and SparkContext in Spark?
Discuss the data size challenges in your previous projects. How did you optimize storage and processing?
What were the biggest infrastructure-level challenges you faced, and how did you resolve them?
Why do you want to join American Express?
What are your strengths, and how do they align with the Data Engineer role?
Create a Python program to demonstrate the use of set operations (union, intersection).
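A minimal sketch answer for the set-operations question — the element values are illustrative:

```python
# Two sample sets; Python's set operators map directly to set algebra.
evens = {0, 2, 4, 6, 8}
primes = {2, 3, 5, 7}

print(evens | primes)   # union: elements in either set
print(evens & primes)   # intersection: elements in both sets
print(evens - primes)   # difference: evens that are not prime
print(evens ^ primes)   # symmetric difference: in exactly one set
```

The operator forms (`|`, `&`) have method equivalents (`union()`, `intersection()`), which also accept any iterable, not just sets.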
Describe Spark's memory management model. How do you handle heap memory overhead issues?
Explain the difference between mutable and immutable objects in Python.
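A short sketch of the distinction interviewers usually probe here: mutation through an alias is visible everywhere for mutable objects, while immutable objects can only be replaced, never changed in place.

```python
# Lists are mutable: two names can refer to the same object.
nums = [1, 2, 3]
alias = nums          # no copy is made; both names point at one list
alias.append(4)
print(nums)           # [1, 2, 3, 4] -- the mutation shows through both names

# Strings are immutable: operations return a new object.
text = "abc"
upper = text.upper()
print(text, upper)    # abc ABC -- the original string is unchanged

# Tuples reject in-place assignment entirely.
try:
    (1, 2, 3)[0] = 9
except TypeError as exc:
    print("tuple is immutable:", exc)
```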
Explain the differences between multiprocessing and multithreading.
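The usual framing: threads share one interpreter and memory space (so the GIL serializes CPU-bound bytecode, but I/O-bound waits overlap), while processes each get their own interpreter and suit CPU-bound work at the cost of pickling data across the boundary. A small thread-side sketch, with `time.sleep` standing in for a blocking I/O call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def io_task(n):
    time.sleep(0.1)   # stands in for a blocking I/O call (network, disk)
    return n * n

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(io_task, range(4)))
elapsed = time.perf_counter() - start

print(results)        # [0, 1, 4, 9]
print(elapsed)        # well under 4 x 0.1s: the sleeps overlap across threads
```

Swapping in `ProcessPoolExecutor` gives the multiprocessing version with the same API; that is the variant to reach for when `io_task` is CPU-bound.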
Implement a Python function to count unique words from a file and write them to another file.
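One possible sketch for the unique-word-count question — file paths and the output format (`word count` per line, sorted) are assumptions, since the question leaves them open:

```python
import tempfile

def write_unique_word_counts(src_path, dst_path):
    """Count whitespace-separated words in src_path; write 'word count' lines to dst_path."""
    counts = {}
    with open(src_path) as fh:
        for line in fh:
            for word in line.split():
                counts[word] = counts.get(word, 0) + 1
    with open(dst_path, "w") as out:
        for word in sorted(counts):
            out.write(f"{word} {counts[word]}\n")
    return counts

# Usage with a temporary input file.
src = tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt")
src.write("spark etl spark python etl spark")
src.close()
dst = src.name + ".out"
counts = write_unique_word_counts(src.name, dst)
print(counts)  # {'spark': 3, 'etl': 2, 'python': 1}
```

In an interview, `collections.Counter` is a natural follow-up refinement of the manual dictionary.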
Write a decorator function to log the execution time of a function.
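A sketch answer for the timing-decorator question; `functools.wraps` preserves the wrapped function's name and docstring, which interviewers often look for:

```python
import functools
import time

def log_time(func):
    """Decorator that prints how long the wrapped function took."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.6f}s")
        return result
    return wrapper

@log_time
def slow_sum(n):
    return sum(range(n))

print(slow_sum(1_000_000))
```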
Describe a scenario where you used Databricks for real-time data processing.
Describe a cross-team data project where you had to align architectural boundaries, ownership, and SLAs. How did you handle conflicting priorities, technical debt, and the scalability of communication as the number of stakeholders grew?
Implement a recursive query for hierarchy (employee-manager). Explain the termination guarantees, depth limits, and when a recursive CTE becomes a scalability bottleneck. What alternatives exist for graph-scale hierarchies in Spark or a data lake?
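A runnable sketch of the recursive CTE, using SQLite through Python for portability (the table and names are invented). The anchor member (employees with no manager) terminates the recursion; the explicit depth counter doubles as a depth limit, which also guards against cycles in dirty data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
INSERT INTO employees VALUES
  (1, 'Ava', NULL),   -- top of the hierarchy: the recursion's anchor row
  (2, 'Ben', 1),
  (3, 'Cam', 2),
  (4, 'Dee', 2);
""")

rows = conn.execute("""
WITH RECURSIVE chain(id, name, depth) AS (
    SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id, e.name, c.depth + 1
    FROM employees e JOIN chain c ON e.manager_id = c.id
    WHERE c.depth < 10   -- explicit depth limit
)
SELECT name, depth FROM chain ORDER BY depth, name
""").fetchall()
print(rows)  # [('Ava', 0), ('Ben', 1), ('Cam', 2), ('Dee', 2)]
```

For graph-scale hierarchies the usual answer is iterative self-joins in Spark (one join per level) or a graph library such as GraphFrames, since each recursive step is a full shuffle at cluster scale.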
Explain bloom filters in Spark: how they reduce I/O and when they introduce false positives that hurt performance. What are the scalability and cost implications of enabling dynamic partition pruning and bloom filter pushdown at petabyte scale?
Given a table of sales data, use window functions to calculate a running total.
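A sketch of the running-total query, again driven through SQLite (window functions need SQLite 3.25+); the `sales` schema and values are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (sale_date TEXT, amount INTEGER);
INSERT INTO sales VALUES
  ('2024-01-01', 100), ('2024-01-02', 50), ('2024-01-03', 75);
""")

rows = conn.execute("""
SELECT sale_date,
       amount,
       SUM(amount) OVER (ORDER BY sale_date
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
         AS running_total
FROM sales
ORDER BY sale_date
""").fetchall()
print(rows)
# [('2024-01-01', 100, 100), ('2024-01-02', 50, 150), ('2024-01-03', 75, 225)]
```

A common follow-up is adding `PARTITION BY` (e.g. per region or product) to restart the total within each group.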
How do you handle schema evolution in data lakes or data warehouses?
How would you optimize a query with multiple joins and subqueries?
Write a query to find the first number repeating consecutively three times in a sequence.
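The question asks for a query, but the same logic in Python makes the intent clear (in SQL it is typically done with `LAG`/`LEAD` or a self-join); a single left-to-right scan finds the first run of three:

```python
def first_triple_repeat(seq):
    """Return the first value that appears three times consecutively, else None."""
    for i in range(len(seq) - 2):
        if seq[i] == seq[i + 1] == seq[i + 2]:
            return seq[i]
    return None

print(first_triple_repeat([1, 2, 2, 3, 3, 3, 4, 4, 4]))  # 3
print(first_triple_repeat([1, 2, 3]))                    # None
```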
Code a simple PySpark job to read a JSON file, filter records, and write output in Parquet format.
Walk through a Spark optimization scenario: given a slow job, how would you troubleshoot and resolve the performance issues?
Explain repartition vs. coalesce. Which one would you use to reduce shuffle operations?
How did you handle data ingestion and processing for large datasets?
How does Spark's Catalyst Optimizer improve query performance?
What is the salting technique, and when would you use it?
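Salting spreads a skewed key across partitions by appending a random suffix, aggregating per salted key, then stripping the salt and merging. A plain-Python illustration of the two-stage idea (the dataset and salt count are invented; in Spark the salted key would feed a `groupBy` or join):

```python
import random
from collections import Counter

random.seed(0)
NUM_SALTS = 4  # illustrative; in Spark, sized to the parallelism you want

# A skewed dataset: one "hot" key dominates and would land on one partition.
records = ["hot"] * 12 + ["cold"] * 3

# Stage 1: salt each key so the hot key spreads over NUM_SALTS buckets.
salted = Counter(f"{key}#{random.randrange(NUM_SALTS)}" for key in records)

# Stage 2: strip the salt and merge the partial counts.
merged = Counter()
for salted_key, count in salted.items():
    merged[salted_key.rsplit("#", 1)[0]] += count

print(dict(merged))  # {'hot': 12, 'cold': 3}
```

The trade-off to mention: salting adds a second aggregation stage and, for joins, requires exploding the other side of the join across all salt values.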
Describe the architecture of an ETL pipeline you built in your previous project.
How do you ensure data quality and consistency in your pipelines?