Question 1

Describe a scenario where partitioning and bucketing would improve query performance.

Accepted Answer

Situation: An events table with billions of rows serving time-range and user-level analytics. Task: Achieve sub-second query latency while controlling storage and compute costs. Why Partitioning: Partition pruning at read time skips whole partitions, so a query filtering on a date range only touches the matching partition directories. With daily partitions over one year, a single-day query scans 1 of 365 partitions instead of all of them, which cuts I/O by orders of magnitude....
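A minimal sketch of the two ideas, assuming Hive-style `event_date=YYYY-MM-DD` directories and a hash-based bucket assignment. The bucket name, the `event_date` column, the 64-bucket count, and the CRC32 hash are all illustrative assumptions; real engines such as Spark and Hive use their own hash functions.

```python
import zlib
from datetime import date, timedelta


def partition_paths(root: str, start: date, end: date) -> list[str]:
    """Return the Hive-style partition directories covering [start, end]."""
    days = (end - start).days
    return [
        f"{root}/event_date={(start + timedelta(days=d)).isoformat()}"
        for d in range(days + 1)
    ]


def bucket_for(user_id: str, num_buckets: int = 64) -> int:
    """Assign a user to a bucket; rows for one user always land in one bucket."""
    return zlib.crc32(user_id.encode("utf-8")) % num_buckets


# Hypothetical table root: a full-year scan versus a single-day query.
root = "s3://example-bucket/events"
all_parts = partition_paths(root, date(2023, 1, 1), date(2023, 12, 31))
pruned = partition_paths(root, date(2023, 6, 1), date(2023, 6, 1))

print(len(all_parts), "partitions without pruning")
print(len(pruned), "partition after pruning on event_date")
print("user-42 is stored in bucket", bucket_for("user-42"))
```

Partitioning narrows the scan to the matching directories; bucketing then lets a user-level filter or join read only the files for one bucket inside those directories.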

Question 2

What is the small-file problem in Spark, and how do you solve it?

Accepted Answer

Problem: Thousands of tiny files (KB–MB) cause metadata overhead, slow S3/HDFS listing, many small tasks, and I/O thrashing. Root causes: High parallelism, over-partitioning, streaming micro-batches. Why it hurts: Every file adds open and scheduling overhead, and in the worst case each file becomes its own task, so 10K files can mean 10K tasks. S3 LIST is paginated and rate-limited; listing 100K files can take minutes. Solutions: 1) coalesce/repartition before write to target 128MB–1GB per file. 2) Delta Lake compaction (OPTIMIZE or auto-compaction). 3) Larger partition sizes (fewer, bigger partitions)....
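As a rough illustration of solution 1, the number of output files can be derived from the dataset size and a target file size. This is a sketch only: the 256 MiB default is an assumed value inside the 128MB–1GB range above, and the helper name is hypothetical.

```python
import math


def target_file_count(total_bytes: int, target_file_bytes: int = 256 * 1024**2) -> int:
    """Number of output files needed so each file is roughly target_file_bytes."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


ten_gib = 10 * 1024**3
print(target_file_count(ten_gib))  # 10 GiB at 256 MiB per file
```

In PySpark the result would typically be passed to `df.repartition(n)` (full shuffle, even file sizes) or `df.coalesce(n)` (no shuffle, possibly uneven sizes) before the write.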

Question 3

Implement a query to find the top 5 customers by total sales amount.

Accepted Answer

**Architectural Logic**: Two approaches. 1. GROUP BY + ORDER BY + LIMIT: SELECT customer_id, SUM(sales_amount) AS total_sales FROM sales GROUP BY customer_id ORDER BY total_sales DESC LIMIT 5. 2. Window: SELECT * FROM (SELECT customer_id, SUM(sales_amount) AS total_sales, RANK() OVER (ORDER BY SUM(sales_amount) DESC) AS rk FROM sales GROUP BY customer_id) t WHERE rk <= 5. **Why**: LIMIT is simpler and lets some engines use a top-N sort instead of sorting every group....
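The two approaches can be compared on a small in-memory SQLite table. This is a sketch with made-up data in which two customers tie for fifth place. The window query is written with CTEs, an equivalent form of the subquery above, and a `customer_id` tie-breaker is added to the ORDER BY so the output is deterministic; both are choices of this example, not part of the original queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id TEXT, sales_amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("c1", 300), ("c1", 200), ("c2", 400), ("c3", 300), ("c4", 200),
        ("c5", 100), ("c6", 60), ("c6", 40), ("c7", 50),
    ],
)

# Approach 1: aggregate, sort, keep exactly five rows.
top_limit = conn.execute(
    """
    SELECT customer_id, SUM(sales_amount) AS total_sales
    FROM sales
    GROUP BY customer_id
    ORDER BY total_sales DESC, customer_id
    LIMIT 5
    """
).fetchall()

# Approach 2: rank the aggregated totals; ties share a rank.
top_rank = conn.execute(
    """
    WITH totals AS (
        SELECT customer_id, SUM(sales_amount) AS total_sales
        FROM sales
        GROUP BY customer_id
    ),
    ranked AS (
        SELECT customer_id, total_sales,
               RANK() OVER (ORDER BY total_sales DESC) AS rk
        FROM totals
    )
    SELECT customer_id, total_sales
    FROM ranked
    WHERE rk <= 5
    ORDER BY total_sales DESC, customer_id
    """
).fetchall()

print(len(top_limit), "rows from LIMIT:", [r[0] for r in top_limit])
print(len(top_rank), "rows from RANK:", [r[0] for r in top_rank])
```

With this data, LIMIT returns exactly five rows and drops one of the tied customers, while RANK returns both, so the choice depends on whether ties at the cutoff should be kept.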

Question 4

Write an SQL query to find duplicate emails in a users table.

Accepted Answer

**Architectural Logic**: Simple: SELECT email FROM users GROUP BY email HAVING COUNT(*) > 1. With context: SELECT email, COUNT(*) cnt, ARRAY_AGG(user_id) ids FROM users GROUP BY email HAVING COUNT(*) > 1. Window: SELECT DISTINCT email FROM (SELECT email, COUNT(*) OVER (PARTITION BY email) cnt FROM users) t WHERE cnt > 1. **Why**: GROUP BY + HAVING is standard; window useful if you need other columns per duplicate....
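A small runnable check of the GROUP BY + HAVING and window variants, using an in-memory SQLite database with invented rows. The `user_id` and `email` columns follow the queries above; the ARRAY_AGG variant is left out because SQLite has no ARRAY_AGG.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "a@x.com"), (2, "b@x.com"), (3, "a@x.com"), (4, "c@x.com"), (5, "b@x.com")],
)

# Standard form: one row per duplicated email, with its count.
duplicates = conn.execute(
    """
    SELECT email, COUNT(*) AS cnt
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1
    ORDER BY email
    """
).fetchall()

# Window form: every row that belongs to a duplicated email.
duplicate_rows = conn.execute(
    """
    SELECT user_id, email
    FROM (
        SELECT user_id, email, COUNT(*) OVER (PARTITION BY email) AS cnt
        FROM users
    ) AS t
    WHERE cnt > 1
    ORDER BY email, user_id
    """
).fetchall()

print(duplicates)
print(duplicate_rows)
```

The window form is the one to reach for when the next step is to inspect or delete specific rows, since it keeps `user_id` alongside each duplicated email.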

Question 5

What is the small-file problem in Spark, and how do you solve it?

Accepted Answer

Small-file problem: Too many tiny files (KB–MB) cause metadata explosion (S3/HDFS list operations), slow scans, and many small tasks. **Root causes**: High parallelism (many partitions), over-partitioning by high-cardinality key, streaming append with small batches. **Why it hurts**: S3 LIST costs about $0.005 per 1,000 requests and each request returns at most 1,000 keys, so listing 1M files takes at least 1,000 paginated requests; the dollar cost of listing is small, but the latency is not. Query engines (Athena, Presto) open each file; latency grows with file count....
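The listing arithmetic can be checked with a short sketch. It assumes the $0.005 per 1,000 requests price quoted in the answer (actual pricing varies by region and storage class) and the S3 limit of 1,000 keys per LIST response; the function names are hypothetical.

```python
import math

KEYS_PER_LIST_PAGE = 1000  # S3 returns at most 1,000 keys per LIST request


def list_requests(num_files: int) -> int:
    """Minimum number of paginated LIST requests to enumerate num_files keys."""
    return math.ceil(num_files / KEYS_PER_LIST_PAGE)


def list_cost_usd(num_files: int, price_per_1000_requests: float = 0.005) -> float:
    """Request cost of one full listing at the assumed price."""
    return list_requests(num_files) / 1000 * price_per_1000_requests


print(list_requests(1_000_000), "LIST requests for 1M files")
print(f"${list_cost_usd(1_000_000):.3f} per full listing")
```

Under these assumptions one pass over 1M files costs well under a cent in LIST requests; the pain comes from paging through the listing and from the per-file open and read requests that follow.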

Question 6

Why a batch process over real-time?

Accepted Answer

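Situation: Choosing between batch and real-time processing. Task: Trade-off analysis. Action: Choose batch when the latency requirement allows it: it is cheaper and simpler to operate, some sources only deliver in batches, and exactly-once semantics are easier to achieve. Choose real-time when sub-second results are needed (e.g. fraud detection) or the workload is event-driven. Best: start with batch; add real-time when needed. Many 'real-time' use cases can be served by frequent batch runs (e.g. every 5 min)....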

Question 7

Glue ETL optimization: Performance improvement strategies?

Accepted Answer

Strategies: (1) Right-size DPUs; enable auto scaling. (2) Partition reads and writes; use pushdown predicates so only the needed partitions are loaded. (3) Parallelism: parallel JDBC reads (hashfield/hashpartitions) and repartition. (4) Parquet/ORC; compact small files. (5) Pushdown at source. (6) Job bookmarks for incremental loads. (7) Salting for skew. Trade-off: Too much parallelism can overload or be throttled by the source....
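Salting (strategy 7) is the least self-explanatory item, so here is a plain-Python sketch of the idea rather than Glue or Spark code. The key names, the 9,000/1,000 split, and the choice of 8 salts are invented for illustration.

```python
import random
from collections import Counter


def salted_key(key: str, num_salts: int, rng: random.Random) -> str:
    """Append a random salt so one hot key is spread over num_salts sub-keys."""
    return f"{key}#{rng.randrange(num_salts)}"


rng = random.Random(0)  # seeded only to make the sketch repeatable
keys = ["hot"] * 9000 + ["cold"] * 1000

before = Counter(keys)
after = Counter(salted_key(k, 8, rng) for k in keys)

print("largest group before salting:", max(before.values()))
print("largest group after salting: ", max(after.values()))
```

The salt has to be undone afterwards: for an aggregation, aggregate per salted key first and then again per original key; for a join, the other side must be replicated once per salt value so every salted key still finds its match.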

Question 8

How to manage AWS IAM roles and policies for data security?

Accepted Answer

Principles: (1) Least privilege. (2) Roles over users for services. (3) SCPs for org; IAM for accounts. (4) Conditions (MFA, IP, tags). (5) Rotate keys; prefer roles. Best practice: Fine-grained policies; MFA for humans; permission boundaries; CloudTrail and Access Analyzer; periodic review.

Question 9

How would you implement a secure data lake on AWS?

Accepted Answer

Layers: (1) S3 encryption (SSE-KMS). (2) IAM + bucket policies; Lake Formation for fine-grained. (3) VPC endpoints for S3/Glue. (4) Glue Catalog permissions. (5) CloudTrail and S3 access logs. (6) Tagging, lifecycle....
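As one possible illustration of layers 1 and 2, the sketch below builds a bucket policy that denies non-TLS access and denies uploads that do not request SSE-KMS. The bucket name and the function name are hypothetical. Note that the second statement also rejects uploads that rely on bucket default encryption without sending the encryption header, which may or may not be what a given setup wants.

```python
import json


def data_lake_bucket_policy(bucket: str) -> dict:
    """Build a deny-by-condition bucket policy for a hypothetical data lake bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Reject any request that does not use TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                # Reject uploads that do not ask for SSE-KMS encryption.
                "Sid": "DenyNonKmsUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"{bucket_arn}/*",
                "Condition": {
                    "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
                },
            },
        ],
    }


policy = data_lake_bucket_policy("example-data-lake")
print(json.dumps(policy, indent=2))
```

Explicit Deny statements like these act as guardrails on top of IAM and Lake Formation grants, since a Deny overrides any Allow.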

Question 10

Securing AWS Lambda: IAM roles, VPC integration, and security measures?

Accepted Answer

**Why Lambda security**: Lambda executes your code; compromised function = access to everything the role allows. **IAM**: Assign a minimal role with only the permissions needed. Avoid wildcards; scope resources (e.g., specific bucket/prefix). Use resource-based policies for cross-account. **VPC**: Place in private subnets when accessing RDS, ElastiCache, or internal APIs. Use VPC endpoints for S3/DynamoDB to avoid NAT and public internet. VPC attachment can add cold-start latency (much less since AWS's 2019 networking changes) and operational complexity, so only use it when necessary....
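A minimal sketch of what "scope resources to a specific bucket/prefix" can look like for a function that only reads objects. The bucket, prefix, and function name are hypothetical, and a real execution role would also need logging permissions.

```python
import json


def read_only_prefix_policy(bucket: str, prefix: str) -> dict:
    """Identity policy allowing reads under one prefix of one bucket, nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object reads are limited to keys under the prefix.
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
            {
                # Listing is allowed on the bucket, but only for that prefix.
                "Sid": "ListPrefixOnly",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
        ],
    }


lambda_policy = read_only_prefix_policy("example-data-lake", "curated/orders/")
print(json.dumps(lambda_policy, indent=2))
```

If the function is compromised, the blast radius under this policy is read access to one prefix rather than the whole account's S3 data.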

Daniel Wellington Data Engineer Interview Questions
