Real questions on pipeline design, orchestration, Airflow, DAGs, and workflow scheduling: how data flows from source to destination.
Data pipeline design is central to data engineering. These questions cover orchestration and transformation tools (Airflow, Prefect, dbt), DAG design, dependency management, scheduling, idempotency, failure handling, and pipeline architecture at scale. Interviewers probe both theoretical knowledge and real-world implementation experience.
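Two of the ideas named above, dependency management and idempotency, can be illustrated without any orchestrator. The following is a minimal sketch, not how Airflow or Prefect implement scheduling: it uses Python's standard-library `graphlib` to order tasks, and the task names, the `run_order` and `idempotent_load` helpers, and the in-memory `warehouse` dict are all hypothetical names chosen for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each key maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "clean": {"extract"},
    "load": {"clean"},
    "report": {"load"},
}

def run_order(graph):
    """Return one execution order in which every task follows its dependencies."""
    return list(TopologicalSorter(graph).static_order())

def idempotent_load(target, batch_id, rows):
    """Overwrite the batch keyed by batch_id, so a rerun does not duplicate rows."""
    target[batch_id] = list(rows)
    return target

order = run_order(dag)

# Assumption: a dict stands in for the destination table.
warehouse = {}
idempotent_load(warehouse, "2024-01-01", [1, 2, 3])
idempotent_load(warehouse, "2024-01-01", [1, 2, 3])  # simulated retry of the same batch
```

Because the load overwrites by batch key rather than appending, a retried run leaves the destination in the same state as a single successful run, which is the property interviewers usually mean by idempotency.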
Tell me about yourself and your experience.
What architecture are you following in your current project, and why?
What is the difference between narrow and wide transformations in Apache Spark? Explain with examples.
Explain the differences between Data Warehouse, Data Lake, and Delta Lake.
Tell me about your family background.
What are Airflow Operators? Give examples.
Briefly introduce yourself and walk us through your journey as a Data Engineer so far.
Can you explain the architecture of Apache Spark and its components?
Explain the difference between args and kwargs in Python.
Explain the types of triggers in ADF, including schedule, tumbling window, and event-based triggers.
What are decorators in Python, and how do they work?
What is the difference between a list and a tuple in Python?
What is the small-file problem in Spark, and how do you solve it?
Architect incremental load in ADF + Databricks with idempotency, late-arrival handling, and cost/scalability implications of watermark vs. change data capture.
Architecturally, how would you justify or challenge Hadoop vs. a cloud-native data lake (S3 + EMR/Databricks) for a greenfield enterprise data platform? Discuss scalability ceilings, cost model trade-offs, and operational complexity.
Describe the data pipeline architecture you've worked with.
Design a Delta table layout for mixed workload: point lookups by user_id, range scans by date, and full partition scans. Compare partitioning vs. Z-ordering—when to use each, and the rewrite cost trade-off.
Explain Common Table Expressions (CTEs) and their benefits.
Explain strategies for managing schema changes in PySpark over time.
Explain the concept of checkpointing in Spark and why it is important.
Explain the difference between Azure Data Factory (ADF) and Databricks.
Explain the Medallion Architecture (Bronze, Silver, Gold layers).
Explain wide vs. narrow transformations and how they drive shuffle cost, failure domains, and pipeline design. When would you intentionally add a wide transformation, and how do you minimize its impact?
Have you worked on Data Warehousing projects?
How do you handle conflicts within a team? Provide an example.
How do you handle memory management in Python?
How would you read data from a web API using PySpark?
How would you read data from a web API? What steps would you follow after reading the data?
Tell me about a time when you faced a challenging situation at work and how you handled it.
Explain triggers in ADF, especially tumbling window triggers.
What are the key components of AWS Glue, and how do they work together?
What are the key components of the Spark execution model (Job, Stage, Task)?
What challenges did you face, and how did you tackle them?
What is Azure Data Factory (ADF), and what are its main components?
What is normalization and denormalization? When would you use each?
What is the difference between OLTP and OLAP?
What is the purpose of the Bronze, Silver, and Gold layers in a data pipeline?
What is the role of AWS Lambda in a data engineering pipeline?
What would you do if a pipeline failed and you couldn't find the reason?
When would you architecturally choose Dataset[T] over DataFrame in a Scala Spark pipeline, and what are the scalability and portability trade-offs? Include type-safety benefits vs. operational constraints.
Why are you leaving your current company?
Why do you want to join this company?
A data pipeline processes files for different clients stored in separate directories. Explain how you would use dynamic DAG creation to handle client-specific workflows in Airflow.
A task intermittently fails due to external API limitations. How would you configure Airflow retries and alerts to manage this situation efficiently?
About Jira
ACID Properties
ADF Optimization Techniques?
After cleaning, how would you store the transformed data into Delta Lake?
Agile in project management?
Agile Methodologies - sprint planning, standups, retrospectives