Question 1

What is the difference between groupByKey and reduceByKey in Spark?

Accepted Answer

**groupByKey()**: Shuffles every (key, value) pair so that all values for a key land on the same partition. Network transfer is O(total_values), and there is no map-side aggregation: you combine the values only after the shuffle. Memory and network costs are high.

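**reduceByKey(func)**: Applies `func` (e.g., sum) within each partition before the shuffle, then again across partitions. Each partition ships at most one aggregated value per key, so the shuffle carries at most unique_keys × partitions records instead of every value. `func` must be associative and commutative for the result to be correct.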

**Architectural Logic (Why reduceByKey Wins)**: Shuffle is the bottleneck....
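The difference in shuffle volume can be illustrated without a cluster. The following is a plain-Python simulation, not Spark itself: each inner list stands in for one partition, and the helper names `group_by_key_shuffle` and `reduce_by_key_shuffle` are hypothetical, introduced only for this sketch.

```python
from operator import add


def group_by_key_shuffle(partitions):
    """Model groupByKey: every (key, value) pair crosses the network."""
    shuffled = [pair for part in partitions for pair in part]
    grouped = {}
    for key, value in shuffled:
        grouped.setdefault(key, []).append(value)
    return len(shuffled), grouped


def reduce_by_key_shuffle(partitions, func):
    """Model reduceByKey: combine inside each partition, then merge."""
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        # Only one record per key leaves each partition.
        shuffled.extend(local.items())
    merged = {}
    for key, value in shuffled:
        merged[key] = func(merged[key], value) if key in merged else value
    return len(shuffled), merged


# Two simulated partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]
sent_group, grouped = group_by_key_shuffle(partitions)
sent_reduce, totals = reduce_by_key_shuffle(partitions, add)
print(sent_group, sent_reduce, totals)  # prints: 6 4 {'a': 3, 'b': 3}
```

In this toy input, the grouping strategy moves six records and the reducing strategy moves four. The gap widens as the number of values per key grows, because the reducing strategy is bounded by keys per partition rather than by total values.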

Question 2

Demonstrate the difference between DENSE_RANK() and RANK()

Accepted Answer

**RANK()**: Tied rows share a rank, and the ranks that follow are skipped (e.g., 1, 2, 2, 4, 5).

**DENSE_RANK()**: Tied rows share a rank, with no gaps afterward (e.g., 1, 2, 2, 3, 4).

**Why it matters**: RANK preserves "position" semantics (e.g., 4th place); DENSE_RANK yields consecutive integers, which is useful when filtering on the top N distinct values (e.g., the 10 highest salaries).

**Example**: `SELECT name, salary, RANK() OVER (ORDER BY salary DESC) AS rk, DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rk FROM employee`....
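The example query can be tried locally with Python's built-in `sqlite3` module. This is a minimal sketch, assuming the bundled SQLite is version 3.25 or newer (the first release with window functions); the employee names and salaries are made-up sample data.

```python
import sqlite3


def rank_demo():
    """Run RANK() and DENSE_RANK() side by side on a tiny sample table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
    # Hypothetical sample rows; Ben and Cy tie on salary.
    conn.executemany(
        "INSERT INTO employee VALUES (?, ?)",
        [("Ana", 300), ("Ben", 200), ("Cy", 200), ("Dee", 100), ("Eli", 50)],
    )
    rows = conn.execute(
        """
        SELECT name, salary,
               RANK() OVER (ORDER BY salary DESC) AS rk,
               DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rk
        FROM employee
        ORDER BY salary DESC, name
        """
    ).fetchall()
    conn.close()
    return rows


for name, salary, rk, dense_rk in rank_demo():
    print(name, salary, rk, dense_rk)
```

With this data, the `rk` column reads 1, 2, 2, 4, 5 and the `dense_rk` column reads 1, 2, 2, 3, 4: the tie at 200 consumes position 3 under RANK but not under DENSE_RANK.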

Question 3

Write a Python function to check if a string is a palindrome.

Accepted Answer

**Architectural logic**: A palindrome reads the same forwards and backwards. We need to normalize (lowercase, drop non-alphanumeric characters) and compare.

**Approach 1 (string ops)**: `cleaned = "".join(c.lower() for c in s if c.isalnum()); return cleaned == cleaned[::-1]`. O(n) time, O(n) space.

**Approach 2 (two-pointer)**: Compare characters from both ends, moving inward; O(n) time, O(1) extra space. Normalization does not have to cost extra space: skip non-alphanumeric characters and lowercase each pair as the pointers move....
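Both approaches can be written out as follows. This is one possible sketch; the function names are illustrative, and the two-pointer version normalizes on the fly rather than building a cleaned copy.

```python
def is_palindrome_simple(s: str) -> bool:
    """Approach 1: build a normalized copy, compare with its reverse."""
    cleaned = "".join(c.lower() for c in s if c.isalnum())
    return cleaned == cleaned[::-1]


def is_palindrome_two_pointer(s: str) -> bool:
    """Approach 2: walk inward from both ends using O(1) extra space."""
    left, right = 0, len(s) - 1
    while left < right:
        if not s[left].isalnum():
            left += 1          # skip punctuation/space on the left
        elif not s[right].isalnum():
            right -= 1         # skip punctuation/space on the right
        elif s[left].lower() != s[right].lower():
            return False       # mismatch after case folding
        else:
            left += 1
            right -= 1
    return True


print(is_palindrome_simple("A man, a plan, a canal: Panama"))   # True
print(is_palindrome_two_pointer("race a car"))                  # False
```

An empty string, or one with no alphanumeric characters, counts as a palindrome under both versions.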

Question 4

Implement a Spark job to find the top 10 most frequent words in a large text file.

Accepted Answer

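Core logic: read text → split → explode → filter empty → groupBy → count → orderBy desc → limit 10.

Code: `from pyspark.sql import functions as F; df = spark.read.text("path/to/file.txt"); words = df.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word")); top10 = words.filter(F.length(F.col("word")) > 0).groupBy("word").count().orderBy(F.desc("count")).limit(10)`

**Why \s+**: Handles runs of spaces and tabs, so it is more robust than splitting on a single space. Write the pattern as a raw string (`r"\s+"`) so Python does not treat `\s` as a string escape. Leading whitespace and blank lines still produce empty tokens, which is why the filter on word length is needed....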
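To sanity-check the Spark job on a small sample, the same steps can be mirrored in plain Python. This is not the Spark implementation, only a single-machine reference under the same split and filter rules; `top_words` is a hypothetical helper name.

```python
import re
from collections import Counter


def top_words(lines, n=10):
    """Split on whitespace runs, drop empty tokens, count, take the top n."""
    counts = Counter()
    for line in lines:
        tokens = re.split(r"\s+", line)
        # Leading whitespace yields an empty first token, so filter it out.
        counts.update(token for token in tokens if len(token) > 0)
    return counts.most_common(n)


sample = ["the cat  the", "\tdog the cat"]
print(top_words(sample, 2))  # prints: [('the', 3), ('cat', 2)]
```

Like the pipeline above, this sketch is case-sensitive and treats punctuation as part of a word.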

Question 5

Describe a real-world use case for using Step Functions with Lambda in a data workflow.

Accepted Answer

Use case: ML inference and reporting pipeline. Raw events land in S3 and a Lambda validates them. Step Functions then orchestrates the run: preprocessing Lambda → call to an external ML API (wait for the result) → result Lambda writes to DynamoDB/S3 → Slack summary. Why Step Functions + Lambda: Lambda provides stateless, short-lived compute; Step Functions provides state, retries, branching, and observability. Architectural trade-off: Express Workflows suit high-volume runs of up to five minutes (cheaper); Standard Workflows suit long-running executions (up to one year) with complex branching....

Question 6

Describe using Step Functions to handle retries and error notifications.

Accepted Answer

Architectural logic: Retry handles transient failures; Catch routes terminal failures to notifications or DLQ. Config: Retry with ErrorEquals, IntervalSeconds, BackoffRate, MaxAttempts; Catch with Next (e.g., NotifyFailure). Example: Lambda.ServiceException retry 3x with exponential backoff; States.ALL → NotifyFailure (SNS). Why: Observability; no lost failures....
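The Retry and Catch configuration described above might look like the following Amazon States Language definition, built here as a Python dict and serialized to JSON. It is a sketch under stated assumptions: the state names, the interval of 2 seconds, the terminal `Fail` state, and both ARNs passed in are hypothetical placeholders, not values from a real deployment.

```python
import json


def build_retry_catch_definition(function_arn, topic_arn):
    """Return an ASL definition with Retry on the task and Catch to SNS."""
    return {
        "StartAt": "ProcessBatch",
        "States": {
            "ProcessBatch": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": function_arn, "Payload.$": "$"},
                # Transient failure: wait 2s, 4s, 8s between the 3 retries.
                "Retry": [
                    {
                        "ErrorEquals": ["Lambda.ServiceException"],
                        "IntervalSeconds": 2,
                        "BackoffRate": 2.0,
                        "MaxAttempts": 3,
                    }
                ],
                # Anything still failing after retries goes to the notifier.
                "Catch": [
                    {
                        "ErrorEquals": ["States.ALL"],
                        "ResultPath": "$.error",
                        "Next": "NotifyFailure",
                    }
                ],
                "End": True,
            },
            "NotifyFailure": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sns:publish",
                "Parameters": {
                    "TopicArn": topic_arn,
                    "Message.$": "States.JsonToString($.error)",
                },
                "Next": "MarkFailed",
            },
            # Assumption: end in Fail so the execution shows as failed.
            "MarkFailed": {"Type": "Fail", "Error": "PipelineFailed"},
        },
    }


definition = build_retry_catch_definition(
    "arn:aws:lambda:REGION:ACCOUNT_ID:function:example-fn",   # placeholder
    "arn:aws:sns:REGION:ACCOUNT_ID:example-failures",         # placeholder
)
print(json.dumps(definition, indent=2))
```

Setting `ResultPath` on the Catch keeps the original input and attaches the error details under `$.error`, so the notification step can include both.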

Question 7

Explain how Access Control Lists (ACLs) can affect IAM role permissions.

Accepted Answer

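Architectural logic: S3 access is controlled by IAM policies, bucket policies, and ACLs (legacy). ACLs can only grant access, never deny it. Within one account, a request succeeds when an applicable policy or ACL allows it and no policy explicitly denies it; cross-account access needs an allow on both the identity side and the resource side. Where ACLs bite IAM roles is object ownership: when another account uploads an object and keeps ownership of it, the bucket owner's policies do not cover that object, so a role whose IAM policy allows the read can still get Access Denied. ACLs = per-object/bucket grants. IAM = identity-based. Trade-off: ACLs add complexity; prefer bucket policies + IAM, with ACLs disabled through the bucket owner enforced Object Ownership setting....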

Question 8

Explain how Step Functions integrate with other AWS services.

Accepted Answer

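Architectural logic: Step Functions has native integrations with Lambda, Glue, ECS, SNS, SQS, DynamoDB, SageMaker, and EventBridge. Each integration is a Task state whose Resource is the service integration ARN and whose Parameters carry the payload. Flow: Lambda validate → Glue ETL → Lambda write DynamoDB. Why: with the optimized "run a job" (`.sync`) pattern, Step Functions waits for the job to finish, so you write no polling loop....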
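The validate → Glue ETL → write flow could be expressed as three Task states, sketched here as a Python dict in Amazon States Language form. The function names, the Glue job name, and the state names are hypothetical; the `Resource` strings are the service integration identifiers for Lambda invoke and for Glue `startJobRun` with the `.sync` pattern. The helper `state_order` is included only to show how the states chain.

```python
import json


def build_etl_definition(validate_fn, glue_job, write_fn):
    """Return an ASL definition chaining Lambda, Glue (.sync), and Lambda."""
    return {
        "StartAt": "Validate",
        "States": {
            "Validate": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": validate_fn, "Payload.$": "$"},
                "Next": "RunEtl",
            },
            "RunEtl": {
                "Type": "Task",
                # .sync: the state completes only when the Glue job finishes.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job},
                "Next": "WriteResult",
            },
            "WriteResult": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": write_fn, "Payload.$": "$"},
                "End": True,
            },
        },
    }


def state_order(definition):
    """Follow Next pointers from StartAt and return the state names in order."""
    order = []
    name = definition["StartAt"]
    while name is not None:
        order.append(name)
        name = definition["States"][name].get("Next")
    return order


# All three names below are placeholders for illustration.
etl = build_etl_definition("example-validate", "example-etl-job", "example-write")
print(state_order(etl))  # prints: ['Validate', 'RunEtl', 'WriteResult']
print(json.dumps(etl, indent=2))
```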

Question 9

Explain how using a staging area in S3 can help.

Accepted Answer

Architectural logic: Staging = buffer between producers and consumers. Benefits: Decouple ingestion from processing; absorb bursts; validate before load; replay without re-fetch. Flow: API → s3://staging/raw/ → Glue → s3://curated/ → Athena/Redshift. Why: Producers write async; consumers process batch. Cost: Staging storage; lifecycle to archive/delete....
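The lifecycle point can be made concrete with a rule set for the staging prefix. The sketch below only builds the configuration as a Python dict, in the shape that boto3's `put_bucket_lifecycle_configuration` accepts for its `LifecycleConfiguration` argument; the prefix, the day counts, the rule ID, and the choice of the `GLACIER` storage class are assumptions for illustration.

```python
def staging_lifecycle_rules(prefix="raw/", archive_after_days=30,
                            delete_after_days=365):
    """Build a lifecycle config: archive staged objects, then delete them."""
    if archive_after_days >= delete_after_days:
        raise ValueError("objects must be archived before they expire")
    return {
        "Rules": [
            {
                "ID": "staging-archive-then-delete",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Move to colder storage once replay is unlikely.
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
                # Remove the object entirely after the retention window.
                "Expiration": {"Days": delete_after_days},
            }
        ]
    }


rules = staging_lifecycle_rules()
print(rules["Rules"][0]["Transitions"], rules["Rules"][0]["Expiration"])
```

The retention window should be at least as long as the period in which a replay from staging might be needed.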

Question 10

Explain the role of Glue Catalog in Athena.

Accepted Answer

Architectural role: Glue Catalog = metadata (schema, location, partitions); Athena = query engine. Athena reads S3 using Catalog metadata. Why: a single catalog shared by Glue, EMR, and Athena keeps table definitions consistent. Best practice: define tables with crawlers or explicit DDL; partition the data; store it in a columnar format (e.g., Parquet); share one catalog across engines.

Capco Data Engineer Interview Questions
