Data engineering interview questions
What integration challenges might you face with Glue Catalog in non-AWS environments?
What is Azure Data Lake Storage (ADLS) Gen2, and how does it differ from Blob Storage?
What is Integration Runtime?
What is Secret Scope, and how is it used in Databricks?
What is Unity Catalog, and how is it implemented in your project?
What is XCom in Airflow?
What is the difference between S3 and EFS? When would you use each?
What is the role of AWS KMS in securing sensitive data?
What is your experience with cloud technologies?
What metrics would you track in CloudWatch for a Kinesis-based pipeline?
What role does Amazon Macie play in securing sensitive data in S3?
What steps would you take to secure data stored in S3?
What techniques do you use to balance compute costs and performance in cloud-based data solutions?
What types of queries would not be efficient in Athena?
Which AWS services do you use for data ingestion and processing?
Which cloud services (AWS or others) did you leverage in your project? Why?
Why were specific cloud services (AWS Glue, EMR) chosen for scalability and cost-effectiveness?
Write a Terraform configuration to provision an EC2 instance
Write code to upload Parquet files to an S3 bucket using boto3
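For the boto3 question above, one possible answer is sketched below. It is a minimal sketch under stated assumptions, not a definitive implementation: the function name `upload_parquet_files`, its parameters, and the bucket and prefix names are hypothetical. The only boto3 calls used are `boto3.client("s3")` and the client's `upload_file(Filename, Bucket, Key)` method. Credentials are assumed to come from the environment or an attached IAM role.

```python
from pathlib import Path


def upload_parquet_files(local_dir, bucket, prefix="", client=None):
    """Upload every *.parquet file under local_dir to the given bucket.

    Returns the list of S3 keys that were uploaded. The `client` argument
    is a hypothetical convenience for passing in a pre-built or stub
    S3 client.
    """
    if client is None:
        # Imported lazily so the function can be exercised with a stub client.
        import boto3

        client = boto3.client("s3")

    root = Path(local_dir)
    clean_prefix = prefix.strip("/")
    uploaded = []

    for path in sorted(root.rglob("*.parquet")):
        # Keep the relative layout (for example partition folders) in the key.
        relative = path.relative_to(root).as_posix()
        key = f"{clean_prefix}/{relative}" if clean_prefix else relative
        client.upload_file(str(path), bucket, key)
        uploaded.append(key)

    return uploaded
```

A call such as `upload_parquet_files("output/", "my-bucket", "raw/events")` would then place `output/dt=2024-01-01/part-0.parquet` at the key `raw/events/dt=2024-01-01/part-0.parquet`. In an interview it is worth mentioning what the sketch leaves out: retries, server-side encryption via `ExtraArgs`, and multipart tuning for large files.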
Learn the platform used by your target companies. AWS is the most common overall (Glue, Redshift, S3, Kinesis). GCP is used at Google and is popular with startups (BigQuery, Dataflow, Pub/Sub). Azure is dominant in enterprise environments (Synapse, Data Factory). Learn one platform deeply and understand the equivalent services on the others.
Core tools: SQL, Python, Spark, Airflow (or equivalent orchestrator), one cloud platform. Increasingly important: dbt, Kafka, Terraform, Docker/Kubernetes, Delta Lake or Apache Iceberg, a data observability tool. The specific stack varies by company.
Yes. Apache Airflow is the most widely used orchestration tool, and questions about DAG design, task dependencies, XComs, operators, and failure handling are common. If the company uses a different orchestrator, expect similar questions adapted to that tool.