Real questions on Extract, Transform, Load (ETL), the core of data engineering. Covers pipeline design, ingestion patterns, batch vs. streaming, and data quality.
ETL (Extract, Transform, Load) is fundamental to every data engineering role. These questions cover pipeline design, incremental vs full loads, idempotency, late-arriving data, batch vs streaming ingestion, transformation patterns, data quality checks, and SCD (slowly changing dimension) implementations. Prepare for the ETL deep-dives that interviewers consistently ask.
Tell me about yourself and your experience.
What architecture are you following in your current project, and why?
What is the difference between narrow and wide transformations in Apache Spark? Explain with examples.
Explain approaches for real-time Change Data Capture (CDC) during a migration.
Tell me about your family background
What strategies can you use to handle skewed data in Spark?
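The standard answer is salting: split a hot key into several sub-keys so its rows spread across partitions, aggregate partially, then strip the salt and combine. A minimal sketch of the two-stage idea in plain Python (the exact Spark code depends on the workload; `SALTS` and the event data are illustrative assumptions):

```python
import random
from collections import defaultdict

random.seed(0)
SALTS = 4  # number of sub-keys to spread a hot key across (assumed value)

def salted(key):
    # Stage-1 key: (key, salt) so one hot key lands in SALTS buckets
    return (key, random.randrange(SALTS))

events = [("hot", 1)] * 10 + [("cold", 1)] * 2

# Stage 1: partial aggregation on the salted key (spreads 'hot' over 4 buckets)
partial = defaultdict(int)
for key, value in events:
    partial[salted(key)] += value

# Stage 2: strip the salt and combine partials into the final aggregate
final = defaultdict(int)
for (key, _salt), value in partial.items():
    final[key] += value
# final == {'hot': 10, 'cold': 2}
```

In Spark the same shape appears as a `concat(key, rand-salt)` column before the first `groupBy`, followed by a second `groupBy` on the original key.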
Briefly introduce yourself and walk us through your journey as a Data Engineer so far.
Describe the difference between Spark RDDs, DataFrames, and Datasets.
Explain the difference between *args and **kwargs in Python.
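The key point: `*args` collects extra positional arguments into a tuple, while `**kwargs` collects extra keyword arguments into a dict. A minimal illustration:

```python
# *args -> tuple of positional arguments; **kwargs -> dict of keyword arguments
def describe(*args, **kwargs):
    return type(args).__name__, type(kwargs).__name__, len(args), sorted(kwargs)

result = describe(1, 2, 3, mode="fast", retries=2)
# args is (1, 2, 3); kwargs is {"mode": "fast", "retries": 2}
```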
Explain the difference between Spark's map() and flatMap() transformations.
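The semantics are easiest to see by analogy with plain Python (shown here instead of live Spark code so the example is self-contained): `map()` produces exactly one output element per input element, while `flatMap()` lets each input yield zero or more elements and flattens the result.

```python
from itertools import chain

lines = ["a b", "c"]

# map()-style: one output per input, so the result stays nested
mapped = [line.split() for line in lines]  # [['a', 'b'], ['c']]

# flatMap()-style: each input yields many elements, then the result is flattened
flat_mapped = list(chain.from_iterable(line.split() for line in lines))  # ['a', 'b', 'c']
```

In Spark, `rdd.map(f)` keeps the nesting and `rdd.flatMap(f)` flattens it, exactly as above.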
Explain the types of triggers in ADF, including schedule, tumbling window, and event-based triggers.
What are decorators in Python, and how do they work?
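A decorator is a callable that takes a function and returns a replacement callable, applied with the `@` syntax. A minimal example (the call-counting behavior is just an illustration):

```python
import functools

def log_calls(func):
    """Wrap func and count how many times it is invoked."""
    @functools.wraps(func)       # preserve func's name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls += 1       # record the invocation
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@log_calls                       # equivalent to: add = log_calls(add)
def add(a, b):
    return a + b

add(1, 2)
add(3, 4)
# add.calls == 2; add.__name__ is still "add" thanks to functools.wraps
```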
What is the difference between a list and a tuple in Python?
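The core distinction is mutability: lists can be modified in place, while tuples are immutable and therefore hashable (usable as dict keys or set members).

```python
nums_list = [1, 2, 3]    # mutable: supports in-place append/assignment
nums_tuple = (1, 2, 3)   # immutable: hashable, so usable as a dict key

nums_list.append(4)
lookup = {nums_tuple: "ok"}   # using a list as the key would raise TypeError

try:
    nums_tuple[0] = 99        # tuples reject item assignment
except TypeError:
    mutation_failed = True
```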
What is the small-file problem in Spark, and how do you solve it?
Architect incremental load in ADF + Databricks with idempotency, late-arrival handling, and cost/scalability implications of watermark vs. change data capture.
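The heart of any answer is the watermark loop plus an idempotent upsert: extract only rows newer than the last watermark, merge them into the target by key, then advance the watermark. A pure-Python sketch of that control flow (function and field names are hypothetical; in ADF/Databricks the target write would be a Delta `MERGE` and the watermark would live in a control table):

```python
def incremental_load(source_rows, target, watermark):
    """source_rows: dicts with 'id' and 'updated_at'; target: dict keyed by id."""
    # Extract: only rows newer than the last successful watermark
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    # Load: upsert by key, so replaying the same batch changes nothing (idempotent)
    for row in new_rows:
        target[row["id"]] = row
    # Advance the watermark only as far as the data actually seen
    return max((r["updated_at"] for r in new_rows), default=watermark)

target = {}
src = [
    {"id": 1, "updated_at": 10},
    {"id": 2, "updated_at": 20},
]
wm = incremental_load(src, target, watermark=0)    # loads both rows, wm == 20
wm = incremental_load(src, target, watermark=wm)   # re-run is a no-op: idempotent
```

Late arrivals (a row whose `updated_at` predates the watermark) are the known weakness of this scheme, which is where CDC-based extraction earns its extra cost.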
Architecturally, how do Job–Stage–Task boundaries in Spark's execution model impact cluster sizing, shuffle cost, and when would you deliberately collapse or split stages?
Architecturally, how would you justify or challenge Hadoop vs. a cloud-native data lake (S3 + EMR/Databricks) for a greenfield enterprise data platform? Discuss scalability ceilings, cost model trade-offs, and operational complexity.
Convert complex SQL (CTEs, window functions, subqueries) to production-grade PySpark. Discuss when to use spark.sql() vs. DataFrame API, and the implications for testability, partitioning, and execution predictability.
Describe the data pipeline architecture you've worked with.
Design a cost-aware resource strategy for a Databricks workload with spiky and batch jobs. Explain Dynamic Resource Allocation, when to disable it, and how min/max executors and spot instances affect cost and SLAs.
Design a Delta table layout for mixed workload: point lookups by user_id, range scans by date, and full partition scans. Compare partitioning vs. Z-ordering—when to use each, and the rewrite cost trade-off.
Explain Common Table Expressions (CTEs) and their benefits.
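A CTE names an intermediate result set with `WITH`, which makes a query readable, reusable within the statement, and easier to test in pieces. A runnable illustration via SQLite (the table and values are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount INTEGER);
    INSERT INTO orders VALUES (1, 50), (2, 150), (3, 200);
""")

# The CTE 'big_orders' isolates the filter step; the outer query aggregates it.
row = conn.execute("""
    WITH big_orders AS (
        SELECT id, amount FROM orders WHERE amount > 100
    )
    SELECT COUNT(*), SUM(amount) FROM big_orders
""").fetchone()
# row == (2, 350)
```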
Explain strategies for managing schema changes in PySpark over time.
Explain the benefits of using DataFrames over RDDs.
Explain the concept of checkpointing in Spark and why it is important.
Explain the difference between Azure Data Factory (ADF) and Databricks.
Explain the difference between batch and streaming data processing in Data Fusion.
Explain the trade-offs between batch and real-time data processing. Provide examples of when each is appropriate.
Explain the use of the MERGE statement in SQL.
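`MERGE` performs an upsert: update the row when the key matches, insert it when it does not (engines like SQL Server, Snowflake, and Delta Lake spell this `MERGE ... WHEN MATCHED / WHEN NOT MATCHED`). SQLite has no `MERGE`, so this runnable sketch shows the equivalent semantics with `INSERT ... ON CONFLICT`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_user (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO dim_user VALUES (1, 'old')")

# Upsert each source row: update on key match, insert otherwise
for uid, name in [(1, "new"), (2, "fresh")]:
    conn.execute(
        "INSERT INTO dim_user VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
        (uid, name),
    )

rows = conn.execute("SELECT id, name FROM dim_user ORDER BY id").fetchall()
# rows == [(1, 'new'), (2, 'fresh')]
```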
Explain wide vs. narrow transformations and how they drive shuffle cost, failure domains, and pipeline design. When would you intentionally add a wide transformation, and how do you minimize its impact?
Given a streaming dataset from Kafka, how would you ingest the data in real-time using Spark?
Have you worked on Data Warehousing projects?
How do you handle conflicts within a team? Provide an example.
How do you handle data skewness in Spark?
How do you handle memory management in Python?
How would you handle duplicate records in an SQL table?
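A common pattern: keep one representative per duplicate group (for example, the lowest `id`) and delete the rest; `ROW_NUMBER() OVER (PARTITION BY ...)` is the window-function variant of the same idea. A runnable SQLite sketch with illustrative data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emails (id INTEGER, email TEXT);
    INSERT INTO emails VALUES (1, 'a@x.com'), (2, 'a@x.com'), (3, 'b@x.com');
""")

# Keep the lowest id per email; delete every other copy
conn.execute("""
    DELETE FROM emails WHERE id NOT IN (
        SELECT MIN(id) FROM emails GROUP BY email
    )
""")
rows = conn.execute("SELECT id, email FROM emails ORDER BY id").fetchall()
# rows == [(1, 'a@x.com'), (3, 'b@x.com')]
```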
How would you read data from a web API? What steps would you follow after reading the data?
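A reasonable answer covers fetch, parse, validate, and flatten into load-ready records. A hedged stdlib sketch (the URL fetch uses `urllib.request`; `requests` is the common third-party alternative; the payload shape and field names are assumptions, and the demo runs on a canned payload so no network is needed):

```python
import json
from urllib.request import urlopen  # stdlib HTTP client

def fetch_json(url):
    """Read a web API response and parse it as JSON."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)

def to_records(payload):
    """Typical next steps: validate, select fields, fill defaults, flatten."""
    return [
        {"id": item["id"], "name": item.get("name", "unknown")}
        for item in payload.get("results", [])
        if "id" in item                      # drop rows missing the key
    ]

# Demonstrated on a canned payload instead of a live call
payload = {"results": [{"id": 1, "name": "a"}, {"id": 2}, {"name": "orphan"}]}
records = to_records(payload)
# records == [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'unknown'}]
```

After this, the usual steps are deduplication, schema enforcement, and landing the records in a staging table before transformation.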
Tell me about a time when you faced a challenging situation at work and how you handled it.
Explain triggers in ADF, with emphasis on tumbling window triggers.
What are primary keys and foreign keys? Why are they important?
What are the key components of AWS Glue, and how do they work together?
What are the key components of the Spark execution model (Job, Stage, Task)?
What challenges did you face, and how did you tackle them?
What is Azure Data Factory (ADF), and what are its main components?
What is normalization and denormalization? When would you use each?
What is the difference between a generator and a list in Python?
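A list materializes every element up front; a generator produces values lazily, holds only its current state in memory, and can be consumed exactly once:

```python
import sys

squares_list = [n * n for n in range(1000)]   # all 1000 values built now
squares_gen = (n * n for n in range(1000))    # values produced lazily on demand

first_three = [next(squares_gen) for _ in range(3)]   # [0, 1, 4]

# The generator object stays tiny regardless of how many values it will yield
gen_is_smaller = sys.getsizeof(squares_gen) < sys.getsizeof(squares_list)
```

Consuming the rest of `squares_gen` yields only the values not already taken, which is why generators suit one-pass streaming over large extracts.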
What is the difference between DELETE and TRUNCATE?
What is the difference between OLTP and OLAP?
What is the difference between S3 and HDFS?
What is the difference between Spark RDDs, DataFrames, and Datasets?