Real interview questions asked at Dunnhumby. Practice the most frequently asked questions and land your next role.
Dunnhumby data engineering interviews test your ability across multiple domains. These questions are sourced from real Dunnhumby interview experiences and sorted by frequency. Practice the ones that matter most.
What is the difference between repartition and coalesce in Apache Spark?
What is the difference between narrow and wide transformations in Apache Spark? Explain with examples.
Explain the difference between Spark's map() and flatMap() transformations.
How does Spark's Catalyst Optimizer work? Explain its stages.
What is the difference between Managed and External tables in Hive/Spark?
Explain the concept of Broadcast Join in Spark. When should it be used?
Explain the difference between shallow copy and deep copy in Python.
Write a Python function to find the first non-repeating character in a string.
Have you worked on Data Warehousing projects?
What is the difference between OLTP and OLAP?
What is the difference between SQL and NoSQL databases?
Explain Common Table Expressions (CTEs) and their benefits.
Explain SQL Window Functions with examples.
Explain the use of the MERGE statement in SQL.
How do you handle NULL values in SQL? Mention functions like COALESCE and ISNULL.
How do you optimize a long-running SQL query?
How would you handle duplicate records in an SQL table?
What is Spark's Catalyst Optimizer? Explain its stages.
Write a Python function to find the first non-repeating character in a string.
Explain your projects on which you worked till now and what was your role?
Name the tools and technologies you have worked with to date.
What steps do you take to troubleshoot a slow-running Spark job?
What would you do if you were assigned a task with a technology you've never used before?
Explain how you would use repartition or coalesce effectively to optimize processing when analyzing data only for a specific region.
How can you delete partitions from a table in Hive using a command?
If manual partitions are created in a Hive data-warehouse table directory, and you query records from those partitions, will you see the data? If not, how can this be fixed?
What is the difference between static and dynamic partitioning in Hive?
Write a SQL query to find distinct IDs from a table where the count is more than 1 and greater than 200.
You need to create a workflow where Task B runs only if Task A is successful, and Task C should always run regardless of Task A or B's status. How would you define this dependency using Airflow?
You need to design a Kafka topic for a logging service. How would you decide the number of partitions and the key for partitioning to balance throughput and ordering requirements?
Your Kafka consumer shows significant lag during peak hours. What strategies would you employ to reduce lag and ensure timely data processing?
A JSON file with evolving schema needs to be ingested into a DataFrame. How would you handle new fields dynamically in PySpark without breaking the job for previous structures?
A data pipeline processes files for different clients stored in separate directories. Explain how you would use dynamic DAG creation to handle client-specific workflows in Airflow.
A task intermittently fails due to external API limitations. How would you configure Airflow retries and alerts to manage this situation efficiently?
Describe how you would optimize a join between two large tables where one is significantly smaller, using broadcast joins in PySpark.
Explain how you would use Kafka Connect to ingest data from a relational database into Kafka while ensuring minimal latency and exactly-once semantics.
Given a DataFrame with columns id and name, add a new column department: If id < 100 assign HR, if id >= 100 and id < 200 assign admin.
If a consumer fails to process a message due to data corruption, describe how you would configure Kafka to handle retries and avoid message loss.
In Spark, what is the difference between cores and executors?
Suppose you have a DAG that ingests data from multiple databases. How would you increase task parallelism in Airflow to improve performance without overloading the system?
What are the different modes in which you can submit Spark jobs? Explain each.
What is the difference between Pandas DataFrame and Spark DataFrame? When would you prefer using each?
What is the difference between external and internal tables in Hive?
When submitting Spark jobs, how does the process work in the backend? Explain.
Write a PySpark job that calculates the number of unique users who logged in per day, but exclude any logins from inactive users listed in a separate file.
Write a PySpark script to check for missing values and duplicate rows in a DataFrame. How would you ensure data quality before saving it to a storage system?
Write the Spark command to rename an existing column in a DataFrame.
Your Kafka producer schema has changed, and the new data includes additional fields. How would you ensure backward compatibility using Schema Registry while consuming data from the same topic?
Download the complete interview prep bundle with expert answers. Study offline, on your commute, anywhere.