Data engineering interview questions
Data Factory vs. Databricks: When to use which?
Data Lakehouse architecture in Azure?
Describe AWS Glue components and their functions.
How do you copy large files from on-premises to Azure in ADF?
Could you describe a specific cost optimization strategy you implemented in the cloud and its results?
How do you load data into a Synapse table?
Describe Amazon Athena and how it interacts with S3.
Describe a real-world use case for using Step Functions with Lambda in a data workflow.
Describe a scenario where AWS Data Pipeline is preferred over Glue. Why?
Describe an AWS EC2 instance and how IAM roles/policies enhance security.
Describe how Adidas could use S3 and Athena to analyze clickstream data.
Describe how to secure sensitive data in cloud storage solutions.
Describe how to set up retries and timeout for tasks in Cloud Composer.
Describe how you deploy code to a production environment using Jenkins.
Describe how you would use AWS Glue to schedule and manage Spark jobs.
Describe step scaling policies vs. target tracking policies in AWS Auto Scaling.
Describe the process and use cases of implementing Azure Data Factory pipelines.
Describe the use of side inputs in Dataflow.
Describe using Step Functions to handle retries and error notifications.
Describe your experience with cloud platforms like AWS, Azure, or GCP.
Learn the platform used by your target companies. AWS is most common overall (Glue, Redshift, S3, Kinesis). GCP is preferred by Google and startups (BigQuery, Dataflow, Pub/Sub). Azure is dominant in enterprise (Synapse, Data Factory). Learn one deeply and understand the equivalents on others.
Core tools: SQL, Python, Spark, Airflow (or equivalent orchestrator), one cloud platform. Increasingly important: dbt, Kafka, Terraform, Docker/Kubernetes, Delta Lake or Apache Iceberg, a data observability tool. The specific stack varies by company.
Yes. Apache Airflow is the most widely used orchestration tool, and questions about DAG design, task dependencies, XComs, operators, and failure handling are common. If the company uses a different orchestrator, expect similar questions adapted to that tool.
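To make the orchestration topics above concrete, here is a minimal sketch in plain Python of three of the ideas: running tasks in dependency order, passing values between tasks, and retrying a failed task. It deliberately does not use Airflow's API. The task names, the `run_dag` and `run_with_retries` helpers, and the `results` dictionary (standing in for XCom-style value passing) are all hypothetical, chosen only for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each key runs only after the tasks in its set.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_with_retries(task, max_retries):
    """Call task() until it succeeds or max_retries extra attempts are used."""
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the failure
            attempt += 1


def run_dag(tasks, dependencies, max_retries=2):
    """Run tasks in dependency order; return a dict of task name -> result."""
    results = {}  # stands in for XCom-style value passing between tasks
    for name in TopologicalSorter(dependencies).static_order():
        results[name] = run_with_retries(lambda: tasks[name](results), max_retries)
    return results


# A flaky extract that fails on its first call, to exercise the retry path.
calls = {"extract": 0}


def extract(results):
    calls["extract"] += 1
    if calls["extract"] == 1:
        raise RuntimeError("transient source error")
    return [1, 2, 3]


def transform(results):
    return [x * 10 for x in results["extract"]]


def load(results):
    return len(results["transform"])


TASKS = {"extract": extract, "transform": transform, "load": load}
final = run_dag(TASKS, DEPENDENCIES)
```

In an interview answer, the same ideas map onto the orchestrator's own features: dependencies declared between tasks, a retry count configured per task, and small values exchanged through the tool's messaging mechanism rather than a shared dictionary.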