Data engineering interview questions
Design an end-to-end data pipeline using Glue, Lambda, EC2, S3, Redshift, and Athena.
Design a migration of data from multiple sources (Hadoop, S3, Oracle DB) into a single target S3 bucket.
Difference between linked services and datasets in ADF.
Difference between pipelines and data flows in ADF.
Differentiate between global parameters and pipeline variables in ADF.
Discuss S3's advantages, including scalability and durability.
Discuss how versioning works in S3 and its use cases, such as data recovery and auditing.
Discuss the key differences between AWS Glue, Lambda, and Data Pipeline for orchestrating data workflows.
Discuss versioning in S3.
Explain the purpose of Docker and how it handles dependencies.
Explain error handling in ADF.
Explain AWS Lake Formation and its benefits.
Explain GetMetadata, ForEach, and Copy Data in Azure Data Factory.
Explain Microsoft Fabric and its use in data integration.
Explain Snowpipe as a continuous data ingestion service.
Explain Step Functions for orchestration of workflows.
Explain a linked service and how to create one.
Explain how AWS Glue interacts with on-premises SQL databases to extract data efficiently.
Explain how Access Control Lists (ACLs) can affect IAM role permissions.
Explain how Infrastructure as Code (IaC) works in AWS and its advantages.
Learn the platform used by your target companies. AWS is the most common overall (Glue, Redshift, S3, Kinesis). GCP is preferred by Google and many startups (BigQuery, Dataflow, Pub/Sub). Azure is dominant in enterprise settings (Synapse, Data Factory). Learn one deeply and understand the equivalent services on the others.
Core tools: SQL, Python, Spark, Airflow (or equivalent orchestrator), one cloud platform. Increasingly important: dbt, Kafka, Terraform, Docker/Kubernetes, Delta Lake or Apache Iceberg, a data observability tool. The specific stack varies by company.
Yes, expect orchestration questions. Apache Airflow is one of the most widely used orchestration tools, and questions about DAG design, task dependencies, XComs, operators, and failure handling are common. If the company uses a different orchestrator, expect similar questions adapted to that tool.
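
To make the ideas of task dependencies and XCom-style value passing concrete, here is a minimal sketch using only the Python standard library. It is not Airflow code and does not use the Airflow API: the task names (`extract`, `transform`, `load`), the `TASKS` table, and the `run_dag` helper are all hypothetical, chosen only to illustrate how an orchestrator might run tasks in dependency order and hand each task the results of its upstream tasks.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical tasks. Each receives a dict of upstream results,
# loosely analogous to pulling XComs from upstream tasks.
def extract(_inputs):
    return [3, 1, 2]


def transform(inputs):
    return sorted(inputs["extract"])


def load(inputs):
    return f"loaded {len(inputs['transform'])} rows"


# Task name -> (callable, list of upstream task names).
TASKS = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}


def run_dag(tasks):
    """Run every task after all of its upstream tasks have finished."""
    # TopologicalSorter expects a mapping of node -> predecessors.
    graph = {name: set(upstream) for name, (_, upstream) in tasks.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, upstream = tasks[name]
        # Pass only the declared upstream results to the task.
        results[name] = func({u: results[u] for u in upstream})
    return results


results = run_dag(TASKS)
print(results)
```

A real orchestrator adds scheduling, retries, and persistent state on top of this ordering logic, so an interview answer would normally go on to cover how failures and reruns are handled. `TopologicalSorter` also raises `CycleError` if the dependencies form a cycle, which is why a workflow has to be a directed acyclic graph.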