Question 1

Convert complex SQL (CTEs, window functions, subqueries) to production-grade PySpark. Discuss when to use spark.sql() vs. DataFrame API, and the implications for testability, partitioning, and execution predictability.

Accepted Answer

Two approaches: spark.sql() for direct translation and the DataFrame API for programmatic logic. SQL approach: register inputs with createOrReplaceTempView and run ANSI-like SQL. This gives fast parity with the original query, but the logic lives in strings, is harder to unit test, and leaves the execution plan less explicit in the code. DataFrame API: composable and testable (pass in mock DataFrames), with each transformation explicit....

Question 2

What is the size of the teams I've worked with, and how did we handle sprints during the project?

Accepted Answer

Situation: Team and process. Task: Describe scale and agility. Action: Teams: 4-15 engineers. Sprints: 2-week cycles, planning, standups, retros. Story points, velocity, backlog grooming. Cross-team sync for large projects. Balanced planned work with ops capacity. Tech debt time. Adapted ceremony to need....

Question 3

Why are you considering leaving your current company?

Accepted Answer

Situation: Leaving rationale. Task: Positive framing. Action: Seeking growth, new challenges, better fit. Grateful for experience. Specific reason (limited growth, scale, domain, culture). Leaving on good terms. Focus on what I'm moving toward....

Question 4

Given the input string "AAABBBCCCDDDAAA", compress it to the output "A3B3C3D3A3".

Accepted Answer

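**RLE:** res=[]; i=0; while i<len(s): j=i; while j<len(s) and s[j]==s[i]: j+=1; res.append(f'{s[i]}{j-i}'); i=j; return ''.join(res). **Decode:** Expand each char+count pair back into count copies of the char. **Why:** Run-length encoding compresses well only when the input has long runs. **Production:** Decide how single chars are encoded (A1 vs A); always writing the count keeps decoding unambiguous.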
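The two-pointer loop above can be written out as a runnable sketch. The function names are hypothetical, and the decoder assumes the input symbols are never digits and that every symbol is followed by its count (the "A1" convention).

```python
def rle_encode(s: str) -> str:
    """Run-length encode s, always writing the count (so "A" becomes "A1")."""
    parts = []
    i = 0
    while i < len(s):
        j = i
        # Advance j to the end of the current run of s[i].
        while j < len(s) and s[j] == s[i]:
            j += 1
        parts.append(f"{s[i]}{j - i}")
        i = j
    return "".join(parts)


def rle_decode(encoded: str) -> str:
    """Inverse of rle_encode. Assumes symbols are non-digit characters and
    each symbol is followed by a decimal count of one or more digits."""
    out = []
    i = 0
    while i < len(encoded):
        symbol = encoded[i]
        i += 1
        j = i
        # Counts may have several digits, e.g. "A12".
        while j < len(encoded) and encoded[j].isdigit():
            j += 1
        out.append(symbol * int(encoded[i:j]))
        i = j
    return "".join(out)


print(rle_encode("AAABBBCCCDDDAAA"))  # A3B3C3D3A3
```

Each character is visited once, so encoding is O(n). If digits can appear in the input, the format needs a delimiter or an escape, because "A12" would otherwise be ambiguous.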

Question 5

Explain the differences between Redshift and Snowflake, and how I've used them in previous projects.

Accepted Answer

**Redshift**: AWS-native, cluster-based, with storage and compute tightly coupled on the older node types (RA3 nodes separate managed storage from compute); Spectrum for S3 queries; Concurrency Scaling for read bursts. **Snowflake**: Cloud-agnostic, decoupled storage/compute, per-second billing; zero-copy clone, Time Travel, Data Sharing. **Why Redshift**: Lower TCO for 24/7 steady workloads; deep AWS integration (Glue, Lambda); reserved nodes can cut cost by around 40%. **Why Snowflake**: Variable/spiky workloads; multi-cloud; dev/prod parity via clones; sharing for data marketplace....

Question 6

Explain the scalability, performance, and cost-efficiency of both Redshift and Snowflake in different use cases.

Accepted Answer

**Scalability**: Redshift: resize the cluster (add nodes or move to a larger node type), plus Concurrency Scaling for query bursts; Snowflake: auto-scaling warehouses, with storage scaling independently. **Performance**: Both are columnar; Snowflake keeps a persistent query result cache; Redshift leans on node-local SSD (it also has a result cache). Snowflake multi-cluster warehouses handle concurrency; Redshift uses WLM queues....

Question 7

Write a query to find the 5th highest salary in an employee table and calculate the number of employees whose salary is greater than that of their manager.

Accepted Answer

(1) 5th highest: SELECT salary FROM (SELECT DISTINCT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS r FROM employees) ranked WHERE r = 5. DENSE_RANK gives tied salaries the same rank with no gaps, so r = 5 is the 5th distinct salary; the derived table needs an alias in most dialects. (2) Employees > manager: SELECT COUNT(*) FROM employees e JOIN employees m ON e.manager_id = m.employee_id WHERE e.salary > m.salary. **Why**: Combine in CTEs if single query needed....
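As a quick check of both queries, the sketch below runs them against an in-memory SQLite database, first separately and then combined through CTEs. The table layout and the eight sample rows are hypothetical, and it assumes the SQLite bundled with Python is version 3.25 or later, which is when window functions were added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    manager_id  INTEGER,
    salary      INTEGER
);
-- Hypothetical sample data; employee 1 has no manager.
INSERT INTO employees VALUES
    (1, NULL, 300), (2, 1, 250), (3, 1, 320), (4, 2, 250),
    (5, 2, 200),    (6, 3, 150), (7, 3, 100), (8, 4, 260);
""")

# (1) 5th highest distinct salary.
fifth_highest = conn.execute("""
    SELECT salary FROM (
        SELECT DISTINCT salary,
               DENSE_RANK() OVER (ORDER BY salary DESC) AS r
        FROM employees
    ) AS ranked
    WHERE r = 5
""").fetchone()[0]

# (2) Employees paid more than their manager (self-join on manager_id).
above_manager = conn.execute("""
    SELECT COUNT(*)
    FROM employees e
    JOIN employees m ON e.manager_id = m.employee_id
    WHERE e.salary > m.salary
""").fetchone()[0]

# Both answers in one statement, using CTEs.
combined = conn.execute("""
    WITH ranked AS (
        SELECT DISTINCT salary,
               DENSE_RANK() OVER (ORDER BY salary DESC) AS r
        FROM employees
    ),
    above AS (
        SELECT COUNT(*) AS n
        FROM employees e
        JOIN employees m ON e.manager_id = m.employee_id
        WHERE e.salary > m.salary
    )
    SELECT (SELECT salary FROM ranked WHERE r = 5) AS fifth_highest_salary,
           (SELECT n FROM above)                  AS paid_above_manager
""").fetchone()

print(fifth_highest, above_manager, combined)
```

In the combined form, the scalar subquery returns NULL rather than zero rows when the table has fewer than five distinct salaries, which is usually easier for downstream code to handle.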

Question 8

Explain how I handle performance optimizations, scheduling tasks, and monitoring DAGs in Airflow.

Accepted Answer

**Situation**: Faced competing demands—multiple pipelines, stakeholders, deadlines. **Task**: Deliver impact while maintaining quality and preventing burnout. **Action**: (1) Prioritized by business impact and SLA risk. (2) Used ROI (value/time); WIP limits; timeboxing. (3) Communicated trade-offs—'Adding X pushes Y by N days.' (4) Maintained backlog with tech-debt capacity. **Result**: Shipped on time; zero incidents; stakeholder alignment on deferrals....

S&P Global Data Engineer Interview Questions
