Data engineering interview questions
How would you implement a secure data lake on AWS?
How would you optimize an ADF pipeline for high performance?
In ADF, how do you handle a scenario where you need to process only new or changed files from a source?
How would you handle security and privacy concerns when working with sensitive data in a cloud environment?
How would you implement VPC peering between two AWS accounts?
How would you migrate 1TB of data using ADF?
How would you monitor a data pipeline in AWS to ensure SLA compliance?
How would you optimize cost when using AWS for large-scale data processing?
How would you pass data between Lambda functions in Step Functions?
How would you secure sensitive credentials in Cloud Composer workflows?
How would you use AWS Glue to merge small files?
In AWS Data Pipeline, how would you design a process to copy only recently modified files from one S3 bucket to another?
Lambda vs. Glue: Discuss use cases for both services.
How would you use ARM templates to move pipelines from development to production?
How would you use an integration runtime to connect on-premises data sources to the cloud?
How do parallel copies work in ADF?
How would you design a data pipeline on GCP?
How would you run multiple notebooks with dbutils.notebook.run()?
S3 storage classes: Describe Standard, Intelligent-Tiering, and Glacier.
How would you use secret scopes to manage credentials securely?
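Several of the questions above (processing only new or changed files, copying only recently modified files between S3 buckets) are often answered with a high-watermark pattern: store the timestamp of the last successful run and pick up only files modified after it. The following is a minimal sketch of that idea in plain Python, not an ADF or AWS API. `FileInfo`, `select_new_files`, and `advance_watermark` are hypothetical names used for illustration, and the file listing is made-up sample data standing in for a real storage listing.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FileInfo:
    # Hypothetical stand-in for one entry of a storage listing
    # (for example an S3 object or a blob), not a real SDK type.
    path: str
    last_modified: datetime


def select_new_files(files, watermark):
    """Return files modified strictly after the watermark, oldest first."""
    fresh = [f for f in files if f.last_modified > watermark]
    return sorted(fresh, key=lambda f: f.last_modified)


def advance_watermark(processed, watermark):
    """Return the newest timestamp processed, or the old watermark if none."""
    return max((f.last_modified for f in processed), default=watermark)


# Made-up sample listing; a real pipeline would list the source location.
listing = [
    FileInfo("raw/a.csv", datetime(2024, 1, 1, tzinfo=timezone.utc)),
    FileInfo("raw/b.csv", datetime(2024, 1, 3, tzinfo=timezone.utc)),
    FileInfo("raw/c.csv", datetime(2024, 1, 2, tzinfo=timezone.utc)),
]
watermark = datetime(2024, 1, 1, tzinfo=timezone.utc)

to_process = select_new_files(listing, watermark)
new_watermark = advance_watermark(to_process, watermark)

print([f.path for f in to_process])  # ['raw/c.csv', 'raw/b.csv']
print(new_watermark.isoformat())     # 2024-01-03T00:00:00+00:00
```

In an interview answer, it is worth adding that the watermark should be persisted only after the copy succeeds, so a failed run is retried from the same point.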
Learn the platform your target companies use. AWS is the most common overall (Glue, Redshift, S3, Kinesis). GCP is favored by Google and many startups (BigQuery, Dataflow, Pub/Sub). Azure is widespread in enterprises (Synapse, Data Factory). Learn one platform deeply and understand the equivalent services on the others.
Core tools: SQL, Python, Spark, Airflow (or equivalent orchestrator), one cloud platform. Increasingly important: dbt, Kafka, Terraform, Docker/Kubernetes, Delta Lake or Apache Iceberg, a data observability tool. The specific stack varies by company.
Yes. Apache Airflow is the most widely used orchestration tool, and questions about DAG design, task dependencies, XComs, operators, and failure handling are common. If the company uses a different orchestrator, expect similar questions adapted to that tool.
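To make the ideas of task dependencies and passing values between tasks concrete, here is a small sketch that uses only Python's standard library (`graphlib`, Python 3.9+). It is not Airflow's API. The task names, the `run_pipeline` helper, and the shared `results` dictionary (which plays a role loosely similar to XComs) are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def extract(results):
    # Pretend to read three records from a source.
    return [1, 2, 3]


def transform(results):
    # Read the upstream task's output from the shared results dict.
    return [x * 10 for x in results["extract"]]


def load(results):
    # Pretend to load; return the total as a simple checksum.
    return sum(results["transform"])


tasks = {"extract": extract, "transform": transform, "load": load}


def run_pipeline(dependencies, tasks):
    """Run tasks in dependency order and collect each task's return value."""
    results = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        results[name] = tasks[name](results)
    return order, results


order, results = run_pipeline(dependencies, tasks)
print(order)            # ['extract', 'transform', 'load']
print(results["load"])  # 60
```

A real orchestrator adds scheduling, retries, and failure handling on top of this ordering, which is where most interview follow-up questions tend to go.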