Data engineering interview questions
Explain using AWS Glue for ETL. What challenges might you face with large datasets?
Explain using IAM roles for secure cross-account access to an S3 bucket.
Explain when you would use Glue instead of Lambda for a data ingestion use case.
Explain your cloud-based data pipeline on AWS
Fabric dataflows vs. ADF dataflows
Fabric pipelines vs. ADF pipelines
GCP authentication with Jenkins
Glue ETL optimization: what performance improvement strategies would you use?
Handling large-scale data ingestion in AWS pipelines
How Airflow operates in a Kubernetes environment
How Airflow stores logs and the role of its backend database
How are Logic Apps used in ADF projects?
How can you increase parallelism in ADF pipelines?
How did you contribute to cost optimization initiatives while working with cloud technologies?
How do Logic Apps enhance notification workflows for monitoring pipelines?
How do you copy all files from a source path to a target path in ADF?
How do you delete files older than 30 days using ADF?
How do you ensure message ordering in Kinesis Streams?
How do you handle API rate limits in ADF?
How do you handle cost optimization in AWS EMR clusters?
Learn the platform used by your target companies. AWS is most common overall (Glue, Redshift, S3, Kinesis). GCP is preferred by Google and startups (BigQuery, Dataflow, Pub/Sub). Azure is dominant in enterprise (Synapse, Data Factory). Learn one deeply and understand the equivalents on others.
Core tools: SQL, Python, Spark, Airflow (or an equivalent orchestrator), and one cloud platform. Increasingly important: dbt, Kafka, Terraform, Docker/Kubernetes, Delta Lake or Apache Iceberg, and a data observability tool. The specific stack varies by company.
Yes. Apache Airflow is the most widely used orchestration tool, and questions about DAG design, task dependencies, XComs, operators, and failure handling are common. If the company uses a different orchestrator, expect similar questions adapted to that tool.