Question 1

Describe a scenario where partitioning and bucketing would improve query performance.

Accepted Answer

Situation: An events table with billions of rows serving time-range and user-level analytics. Task: Achieve sub-second query latency while controlling storage and compute costs. Why Partitioning: Partition pruning at read time skips whole partitions instead of scanning them: a query filtering on a date range touches only the matching partition directories. This can cut I/O by orders of magnitude (e.g., with 365 daily partitions, a one-day query scans 1 partition instead of all 365)....
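The pruning step can be illustrated with a minimal Python sketch. It assumes a Hive-style layout with one `event_date=YYYY-MM-DD` directory per day; the directory naming and the helper names `list_partitions` and `prune_partitions` are hypothetical and chosen only for illustration.

```python
from datetime import date, timedelta


def list_partitions(start, days):
    """Build hypothetical Hive-style partition directory names, one per day."""
    return [
        f"event_date={(start + timedelta(days=i)).isoformat()}"
        for i in range(days)
    ]


def prune_partitions(partitions, lo, hi):
    """Keep only partitions whose date falls in the inclusive range [lo, hi]."""
    kept = []
    for name in partitions:
        value = date.fromisoformat(name.split("=", 1)[1])
        if lo <= value <= hi:
            kept.append(name)
    return kept


# One year of daily partitions (2023 is not a leap year, so 365 directories).
partitions = list_partitions(date(2023, 1, 1), 365)

# A one-day filter and a one-week filter only touch matching directories.
one_day = prune_partitions(partitions, date(2023, 6, 15), date(2023, 6, 15))
one_week = prune_partitions(partitions, date(2023, 6, 12), date(2023, 6, 18))

print(len(partitions), len(one_day), len(one_week))
```

A real engine makes this decision from table metadata before reading any data files, so the filter column has to be the partition column for pruning to apply.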

Question 2

When would you choose a Snowflake schema over a Star schema?

Accepted Answer

Star: One fact table with denormalized dimensions: simple, fewer joins, fast. Snowflake: Normalized dimensions (e.g., dim_product → dim_category → dim_category_group): more joins, less redundancy. Why Snowflake: When dimension tables are large and their hierarchy levels are shared, e.g., a single dim_category row serves many products, so denormalizing would repeat the category attributes on every product row. Normalizing also avoids inconsistent attributes across copies (e.g., a category name is updated in one place). Storage: Snowflake saves space when the hierarchy is deep and wide....

Question 3

Implement a query to find the top 5 customers by total sales amount.

Accepted Answer

**Architectural Logic**: Two approaches. 1. GROUP BY + ORDER BY + LIMIT: `SELECT customer_id, SUM(sales_amount) AS total_sales FROM sales GROUP BY customer_id ORDER BY total_sales DESC LIMIT 5`. 2. Window: `SELECT * FROM (SELECT customer_id, SUM(sales_amount) AS total_sales, RANK() OVER (ORDER BY SUM(sales_amount) DESC) AS rk FROM sales GROUP BY customer_id) t WHERE rk <= 5`. **Why**: LIMIT is simpler, stops early in some engines....
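The two approaches differ when totals tie at the cutoff: LIMIT returns exactly five rows, while RANK keeps every customer tied for fifth place. The sketch below shows this on made-up data using Python's built-in `sqlite3` (window functions need SQLite 3.25 or newer). The sample rows, the `customer_id` tiebreaker in ORDER BY, and the CTE form of the window query are assumptions added for a deterministic demo; they are not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id TEXT, sales_amount REAL)")
# Made-up data: customers e and f tie on total sales (60 each).
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("a", 100), ("b", 90), ("c", 80), ("d", 70),
     ("e", 60), ("f", 60), ("g", 10), ("a", 5)],
)

# Approach 1: aggregate, sort, cut off at five rows.
limit_rows = conn.execute("""
    SELECT customer_id, SUM(sales_amount) AS total_sales
    FROM sales
    GROUP BY customer_id
    ORDER BY total_sales DESC, customer_id  -- tiebreaker added for determinism
    LIMIT 5
""").fetchall()

# Approach 2: rank the per-customer totals; ties share a rank.
rank_rows = conn.execute("""
    WITH totals AS (
        SELECT customer_id, SUM(sales_amount) AS total_sales
        FROM sales
        GROUP BY customer_id
    )
    SELECT customer_id, total_sales
    FROM (
        SELECT customer_id, total_sales,
               RANK() OVER (ORDER BY total_sales DESC) AS rk
        FROM totals
    ) t
    WHERE rk <= 5
    ORDER BY total_sales DESC, customer_id
""").fetchall()

print([r[0] for r in limit_rows])  # five customers
print([r[0] for r in rank_rows])   # six customers: e and f both rank 5th
```

Which behavior is right depends on the requirement: "exactly five rows" favors LIMIT, "everyone in the top five positions" favors RANK.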

Question 4

Write an SQL query to find duplicate emails in a users table.

Accepted Answer

**Architectural Logic**: Simple: `SELECT email FROM users GROUP BY email HAVING COUNT(*) > 1`. With context: `SELECT email, COUNT(*) AS cnt, ARRAY_AGG(user_id) AS ids FROM users GROUP BY email HAVING COUNT(*) > 1`. Window: `SELECT DISTINCT email FROM (SELECT email, COUNT(*) OVER (PARTITION BY email) AS cnt FROM users) t WHERE cnt > 1`. **Why**: GROUP BY + HAVING is standard; window useful if you need other columns per duplicate....
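A small runnable sketch of the GROUP BY and window variants, using Python's built-in `sqlite3` and made-up rows. SQLite has no `ARRAY_AGG`, so that variant is left out here; the window query is extended to return `user_id` to show the "other columns per duplicate" case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT)")
# Made-up rows: ann appears three times, bob twice, cy once.
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "ann@example.com"), (2, "bob@example.com"), (3, "ann@example.com"),
     (4, "cy@example.com"), (5, "bob@example.com"), (6, "ann@example.com")],
)

# GROUP BY + HAVING: one row per duplicated email, with its count.
dupes = conn.execute("""
    SELECT email, COUNT(*) AS cnt
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1
    ORDER BY email
""").fetchall()

# Window variant: keeps every original row that belongs to a duplicate group.
dupe_rows = conn.execute("""
    SELECT user_id, email
    FROM (
        SELECT user_id, email,
               COUNT(*) OVER (PARTITION BY email) AS cnt
        FROM users
    ) t
    WHERE cnt > 1
    ORDER BY user_id
""").fetchall()

print(dupes)
print(dupe_rows)
```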

Question 5

Given a streaming dataset from Kafka, how would you ingest the data in real-time using Spark?

Accepted Answer

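Kafka ingestion with Spark Structured Streaming follows a standard pattern: readStream → parse → writeStream with checkpoint. **Architectural decisions**: (1) **startingOffsets**: 'earliest' for backfill, 'latest' for tail-only; pass a JSON map of per-partition offsets to replay from exact positions. The option applies only when a query starts for the first time; on restart the query resumes from its checkpoint. (2) **Checkpoint**: Required for fault-tolerant, exactly-once processing (together with an idempotent or transactional sink); stores offsets + write metadata; without it, a restart can cause duplicates or data loss....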
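The readStream → parse → writeStream pattern can be sketched in PySpark as follows. This is a minimal outline under stated assumptions: the broker address, topic name, and output and checkpoint paths are placeholders, the helper names are hypothetical, the Parquet sink is an arbitrary choice, and the Spark Kafka connector package is assumed to be on the classpath. The `spark` argument is an existing `SparkSession`, so nothing here touches a cluster until `start_kafka_ingest` is called.

```python
import json


def kafka_source_options(bootstrap_servers, topic, starting_offsets="latest"):
    """Options for Spark's Kafka source; the values passed in are placeholders."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        # "earliest", "latest", or a JSON string of per-partition offsets.
        "startingOffsets": starting_offsets,
    }


def start_kafka_ingest(spark, options, output_path, checkpoint_path):
    """readStream -> parse -> writeStream with a checkpoint location."""
    raw = spark.readStream.format("kafka").options(**options).load()
    # Kafka delivers key and value as binary; cast them before further parsing.
    parsed = raw.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "topic", "partition", "offset", "timestamp",
    )
    return (
        parsed.writeStream
        .format("parquet")
        .option("path", output_path)
        .option("checkpointLocation", checkpoint_path)
        .outputMode("append")
        .start()
    )


# Backfill from the beginning of the topic.
backfill_opts = kafka_source_options("localhost:9092", "events", "earliest")

# Replay from exact positions: partition 0 from offset 23, partition 1 from
# the earliest available offset (-2 means earliest, -1 means latest).
replay_opts = kafka_source_options(
    "localhost:9092", "events", json.dumps({"events": {"0": 23, "1": -2}})
)
```

With a running session, `start_kafka_ingest(spark, backfill_opts, "/tmp/events_out", "/tmp/events_chk")` would return a streaming query handle; both paths are illustrative.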

Question 6

Tell me about a time you handled a data pipeline failure during a critical operation.

Accepted Answer

Situation: A partition overwrite corrupted a Delta table during a critical operation. Task: Restore the table and prevent a recurrence. Action: Paused downstream consumers; diagnosed the bad commit via the Delta log. Used time travel: RESTORE TABLE ... TO TIMESTAMP AS OF .... Validated the restored data; re-ran from checkpoint. Implemented write-audit-commit, automated backups, staging validation....

Question 7

Compare batch processing and stream processing for financial data.

Accepted Answer

**Batch**: Reporting, reconciliation, EOD. Higher latency; simpler; better for large scans. **Stream**: Fraud, trading, alerts. Low latency; complex; exactly-once critical. **Trade-off**: Batch cheaper; stream real-time. **Best practice**: Use both—batch for EOD; stream for time-sensitive....

Question 8

Compute the moving average of daily transactions over a 7-day window.

Accepted Answer

**SQL**: `SELECT trans_date, SUM(amount) AS daily, AVG(SUM(amount)) OVER (ORDER BY trans_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS ma_7d FROM transactions GROUP BY trans_date`. **Logic**: Inner SUM for the daily total; outer AVG over the current row and the 6 preceding rows, which spans 7 calendar days only if every date has a row. **Spark**: Group by date, then window....
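The query can be tried on a small made-up dataset with Python's built-in `sqlite3` (SQLite 3.25 or newer for window functions). As a sketch, it computes the daily totals in a CTE first and then applies the window, which mirrors the "group by date, then window" structure; the sample data is an assumption chosen so the averages are easy to check by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (trans_date TEXT, amount REAL)")

# Made-up data: two transactions of 10 * day on each of 8 consecutive days,
# so the daily totals are 20, 40, ..., 160.
rows = []
for day in range(1, 9):
    trans_date = f"2024-01-{day:02d}"
    rows.append((trans_date, 10.0 * day))
    rows.append((trans_date, 10.0 * day))
conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)

result = conn.execute("""
    WITH daily_totals AS (
        SELECT trans_date, SUM(amount) AS daily
        FROM transactions
        GROUP BY trans_date
    )
    SELECT trans_date,
           daily,
           AVG(daily) OVER (
               ORDER BY trans_date
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
           ) AS ma_7d
    FROM daily_totals
    ORDER BY trans_date
""").fetchall()

for trans_date, daily, ma_7d in result:
    print(trans_date, daily, ma_7d)
```

The first six rows average over fewer than seven days because the frame is cut off at the start of the data. If some dates can be missing, joining against a calendar table before windowing keeps the frame aligned to calendar days.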

Question 9

Describe a time when you had to deal with a major data quality issue. How did you handle it?

Accepted Answer

**STAR**: **Situation**: Revenue metrics doubled overnight. **Task**: Fix and restore trust. **Action**: Traced to duplicate join in pipeline. Deployed fix; ran backfill. Added unique constraint tests. Communicated to stakeholders with timeline. **Result**: Metrics restored; no recurrence....

Question 10

Describe the concept of data sharding and when to use it.

Accepted Answer

**Why Sharding Exists**: Single-node storage and throughput limits cap scalability. Sharding horizontally partitions data by a shard key (e.g., user_id, region) across N nodes, enabling near-linear scale-out for reads/writes when the key spreads load evenly.

**Architectural Logic**: Each shard holds a subset; routing uses `hash(key) mod N` or consistent hashing....
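The difference between the two routing schemes shows up when N changes. The following Python sketch compares how many keys change shard when growing from 4 to 5 shards. The `HashRing` class, the choice of 100 virtual nodes per shard, and the MD5-based `stable_hash` are illustrative assumptions, not a reference implementation.

```python
import bisect
import hashlib


def stable_hash(key):
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def mod_shard(key, n):
    """Route with hash(key) mod N."""
    return stable_hash(key) % n


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each node owns `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (stable_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # The first ring point clockwise from the key's hash owns the key.
        idx = bisect.bisect(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"user_{i}" for i in range(10_000)]

# Modulo routing: most keys land on a different shard once N changes.
moved_mod = sum(mod_shard(k, 4) != mod_shard(k, 5) for k in keys)

# Consistent hashing: only keys claimed by the new shard move.
ring4 = HashRing([f"shard-{i}" for i in range(4)])
ring5 = HashRing([f"shard-{i}" for i in range(5)])
moved_keys = [k for k in keys if ring4.node_for(k) != ring5.node_for(k)]

print(f"mod N moved: {moved_mod}, ring moved: {len(moved_keys)}")
```

With modulo routing roughly four in five keys are expected to move here, while the ring moves about the one-fifth share taken over by the new shard, which is why consistent hashing is usually preferred when shards are added or removed often.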

Goldman Sachs Data Engineer Interview Questions
