The #1 SQL topic in data engineering interviews. ROW_NUMBER, RANK, DENSE_RANK, LEAD, LAG, PARTITION BY, and OVER(), with real questions and answers.
Window functions are the most frequently tested SQL topic in data engineering interviews. These questions cover ROW_NUMBER, RANK, DENSE_RANK, LEAD, LAG, running totals, moving averages, and complex PARTITION BY / ORDER BY patterns. Master this topic to ace the SQL round at any company.
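The three ranking functions named above differ only in how they treat ties, which is easiest to see on a small table. The following is a minimal sketch, not a reference answer: it uses Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer), and the `employees` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical sample data: two employees tie on salary in "eng".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ana", "eng", 300),
        ("Bo", "eng", 300),
        ("Cy", "eng", 200),
        ("Di", "ops", 150),
        ("Ed", "ops", 100),
    ],
)

# ROW_NUMBER always counts 1, 2, 3 (name breaks the tie deterministically),
# RANK leaves a gap after a tie, DENSE_RANK does not.
ranked = conn.execute(
    """
    SELECT name, dept, salary,
           ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC, name) AS row_num,
           RANK()       OVER (PARTITION BY dept ORDER BY salary DESC)       AS rnk,
           DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC)       AS dense_rnk
    FROM employees
    ORDER BY dept, row_num
    """
).fetchall()

# Second-highest salary per department: DENSE_RANK = 2 skips past ties at the top.
second_highest = conn.execute(
    """
    SELECT dept, name, salary
    FROM (
        SELECT dept, name, salary,
               DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS dense_rnk
        FROM employees
    )
    WHERE dense_rnk = 2
    ORDER BY dept
    """
).fetchall()

for row in ranked:
    print(row)
print(second_highest)  # [('eng', 'Cy', 200), ('ops', 'Ed', 100)]
```

For the tied "eng" rows, Ana and Bo both get RANK 1 and DENSE_RANK 1, while Cy gets RANK 3 but DENSE_RANK 2. That gap is why DENSE_RANK is the usual choice for "Nth highest" questions, and ROW_NUMBER is the usual choice for deduplication, where exactly one row per group must survive.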
Write an SQL query to find the second-highest salary from an employee table.
Demonstrate the difference between DENSE_RANK() and RANK()
Discuss differences between ROW_NUMBER(), RANK(), and DENSE_RANK(), and provide examples from your projects.
Explain the differences between Repartition and Coalesce. When would you use each?
What strategies can you use to handle skewed data in Spark?
How do you remove duplicate rows in BigQuery?
Joins and window functions - INNER, LEFT, RIGHT, FULL OUTER, ROW_NUMBER(), RANK(), DENSE_RANK()
Convert complex SQL (CTEs, window functions, subqueries) to production-grade PySpark. Discuss when to use spark.sql() vs. DataFrame API, and the implications for testability, partitioning, and execution predictability.
Difference between ROW_NUMBER(), RANK(), and DENSE_RANK() with examples.
Explain SQL Window Functions with examples.
Explain the concept of checkpointing in Spark and why it is important.
How would you handle duplicate records in an SQL table?
Implement a query to find the top 5 customers by total sales amount.
Prioritize Spark optimizations by impact and effort. Discuss partitioning strategy, caching policy, join selection, shuffle reduction, and when each becomes a scalability or cost bottleneck.
Retrieve the most recent sale_timestamp for each product (Latest Transaction).
SQL query to find the second highest salary from each department.
What is a self-join, and when would you use it?
What is a window function? Explain with an example.
Write a query to find the top three highest-paid employees in each department using window functions.
Write a SQL query to find top 3 earners in each department.
Write an SQL query to find duplicate emails in a users table.
Write complex SQL queries involving multiple joins, subqueries, and data aggregation logic.
Write the PySpark code to find the second highest salary in each department.
Add a column to the Employees table that shows the name of the employee with the next higher employee_id.
Add a new column with the average salary by department.
Add Row Numbers using window function in PySpark
Advanced SQL with CTEs and Conditional Joins
Aggregate surface areas and calculate cumulative surface area using the LAG function.
Calculate a 7-day moving average of clicks for each user_id
Calculate a 7-day moving average of orders for each city in the Swiggy database.
Calculate cumulative sales for each product in each store, ordered by sale_date
Calculate the cumulative transaction amount for each month using a transaction table.
Can Schema Evolution lead to data inconsistencies? If so, how do you manage them?
Can you provide an example of a time when you went above and beyond for a project?
Can you share a time when you had to shift focus due to urgent tasks?
Can you share a time you faced a significant challenge and how you overcame it?
Can you share an experience where you resolved a conflict within your team?
Check for duplicates in a table.
Create a function to detect anomalies in sales trends using Pandas and NumPy.
Create data models for storing users, artists, and related data for a music platform.
Create partitioned table
Describe a recent project where you used AWS services extensively. What was your role, and what challenges did you face?
Describe a time when you had to work with a team to solve a complex problem.
Describe a time when you went above and beyond for a project or a customer.
Describe how Adidas could use S3 and Athena to analyze clickstream data.
Describe how you would implement Slowly Changing Dimensions (SCD) in an ETL workflow.
Describe how you would use PySpark to aggregate and summarize large transaction datasets.
Describe your approach to managing data deduplication.
Describe your role in a team project.
Difference between the underlying architectures of Presto and Spark
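Several of the questions above (cumulative sales, moving averages, comparing a row with the previous one) rely on the same pattern: an aggregate or offset function over an ordered partition with an explicit frame. The sketch below illustrates it with Python's built-in sqlite3 module (SQLite 3.25 or newer assumed). The `sales` table and its rows are hypothetical, and a 2-row moving average stands in for the 7-day version to keep the data small; a 7-row window would use `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW`.

```python
import sqlite3

# Hypothetical daily sales per product.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, sale_date TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("A", "2024-01-01", 10),
        ("A", "2024-01-02", 20),
        ("A", "2024-01-03", 60),
        ("B", "2024-01-01", 5),
        ("B", "2024-01-02", 15),
    ],
)

rows = conn.execute(
    """
    SELECT product, sale_date, amount,
           -- previous row's amount within the same product (NULL for the first row)
           LAG(amount) OVER (PARTITION BY product ORDER BY sale_date) AS prev_amount,
           -- running total from the first row of the partition up to the current row
           SUM(amount) OVER (
               PARTITION BY product ORDER BY sale_date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total,
           -- moving average over the current row and the one before it
           AVG(amount) OVER (
               PARTITION BY product ORDER BY sale_date
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
           ) AS moving_avg_2
    FROM sales
    ORDER BY product, sale_date
    """
).fetchall()

for row in rows:
    print(row)
```

Note that a `ROWS` frame counts rows, not calendar days, so it only equals a true 7-day average when every day has exactly one row per partition; with missing days, the data is usually joined to a date spine first.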