Question 1

Convert complex SQL (CTEs, window functions, subqueries) to production-grade PySpark. Discuss when to use spark.sql() vs. DataFrame API, and the implications for testability, partitioning, and execution predictability.

Accepted Answer

Two approaches: spark.sql() for direct translation and DataFrame API for programmatic logic. SQL approach: createOrReplaceTempView, run ANSI-like SQL—fast parity, but string-based, harder to unit test, and execution plan less explicit. DataFrame API: composable, testable (pass mock DataFrames), explicit transformations....

Question 2

What is the size of the teams I've worked with and how we handled sprints during the project?

Accepted Answer

Situation: Team and process. Task: Describe scale and agility. Action: Teams: 4-15 engineers. Sprints: 2-week cycles, planning, standups, retros. Story points, velocity, backlog grooming. Cross-team sync for large projects. Balanced planned work with ops capacity. Tech debt time. Adapted ceremony to need....

Question 3

Why are you considering leaving your current company?

Accepted Answer

Situation: Leaving rationale. Task: Positive framing. Action: Seeking growth, new challenges, better fit. Grateful for experience. Specific reason (limited growth, scale, domain, culture). Leaving on good terms. Focus on what I'm moving toward....

Question 4

Given the input string "AAABBBCCCDDDAAA," compress it to output "A3B3C3D3A3."

Accepted Answer

**RLE:** i=0; while i<len(s): j=i; while j<len(s) and s[j]==s[i]: j+=1; res.append(f'{s[i]}{j-i}'); i=j. **Decode:** Expand char+count. **Why:** Compression. **Production:** Handle single chars (A1 vs A).

Question 5

Explain the differences between Redshift and Snowflake, and how I've used them in previous projects.

Accepted Answer

**Redshift**: AWS-native, tightly coupled storage/compute, cluster-based; Spectrum for S3 queries; Concurrency Scaling for read bursts. **Snowflake**: Cloud-agnostic, decoupled storage/compute, per-second billing; zero-copy clone, Time Travel, Data Sharing. **Why Redshift**: Lower TCO for 24/7 steady workloads; deep AWS integration (Glue, Lambda); reserved instances cut cost 40%. **Why Snowflake**: Variable/spiky workloads; multi-cloud; DevProd parity via clones; sharing for data marketplace....

Question 6

Explain the scalability, performance, and cost-efficiency of both Redshift and Snowflake in different use cases.

Accepted Answer

**Scalability**: Redshift—vertical (add nodes) + Concurrency Scaling for query bursts; Snowflake—auto-scale warehouses, storage scales independently. **Performance**: Both columnar; Snowflake caches query results (persistent); Redshift relies on local SSD. Snowflake multi-cluster warehouses handle concurrency; Redshift WLM queues....

Question 7

Write a query to find the 5th highest salary in an employee table and calculate the number of employees whose salary is greater than that of their manager.

Accepted Answer

(1) 5th highest: SELECT salary FROM (SELECT DISTINCT salary, DENSE_RANK() OVER (ORDER BY salary DESC) r FROM employees) WHERE r = 5. (2) Employees > manager: SELECT COUNT(*) FROM employees e JOIN employees m ON e.manager_id = m.employee_id WHERE e.salary > m.salary. **Why**: Combine in CTEs if single query needed....

Question 8

Explain how I handle performance optimizations, scheduling tasks, and monitoring DAGs in Airflow.

Accepted Answer

**Situation**: Faced competing demands—multiple pipelines, stakeholders, deadlines. **Task**: Deliver impact while maintaining quality and preventing burnout. **Action**: (1) Prioritized by business impact and SLA risk. (2) Used ROI (value/time); WIP limits; timeboxing. (3) Communicated trade-offs—'Adding X pushes Y by N days.' (4) Maintained backlog with tech-debt capacity. **Result**: Shipped on time; zero incidents; stakeholder alignment on deferrals....

Question 9

Provide specific examples of challenges faced with PySpark and SQL and solutions implemented.

Accepted Answer

**Situation**: Production join job on 10B-row fact table with 100M-row dimension timed out; one task ran 2h vs 5min for others. **Task**: Identify root cause and fix without schema changes. **Action**: (1) Profiled via Spark UI—identified skew on region key (80% data in 3 partitions); (2) Implemented salting: replicated dimension with salt range, joined on (key, salt), aggregated; (3) Enabled AQE skew join as fallback. **Result**: P99 dropped 70%; job met SLA; pattern documented for team....

Question 10

How does data flow through the system? From ingestion to processing and storage?

Accepted Answer

**Section 1 — The Context (The 'Why')**
The primary challenge in 'How does data flow through the system? From ingestion to processing and storage?' centers on designing for production scale, correctness guarantees, and operational resilience. A naive or underspecified design fails under load: single points of failure cascade, non-idempotent operations cause duplicates on retry, and lack of observability blocks root-cause analysis....

S&P Global Data Engineer Interview Questions

Difficulty Breakdown

Key Topics Covered

How to Use This Guide

Companies asking these questions

All 10 Questions

More Interview Prep Guides

Unlock All Expert Answers