The most frequently asked sql questions in data engineering interviews.
Master sql for your next data engineering interview. These questions cover core concepts, advanced patterns, and real-world scenarios that interviewers test. This set leans toward the medium-difficulty band most real interviews actually live in (25 of 60). Recurring themes are sql, spark, and partition — these patterns appear most often in real interviews and reward the deepest preparation. These questions have been reported across 48 companies including Incedo and Accenture. Average answer is around 1 minute of reading — plan roughly 1 hour to work through the full set thoughtfully.
This collection contains 60 curated questions: 14 easy, 25 medium, and 21 hard. The distribution skews toward harder problems, reflecting the depth expected in senior-level interviews.
The most frequently tested areas in this set are sql (60), spark (30), partition (26), join (22), optimization (10), and window (10). Focusing on these topics will give you the highest return on your preparation time.
Start with the easy questions to warm up and solidify fundamentals. Medium-difficulty questions form the bulk of real interviews — spend the most time here and practice explaining your reasoning out loud. Hard questions often appear in senior and staff-level rounds; attempt them after you're comfortable with the basics. For each question, try answering before revealing the solution. Use our AI Mock Interview to simulate real interview conditions and get instant feedback on your responses.
What is the difference between SparkSession and SparkContext in Spark?
Write an SQL query to find the second-highest salary from an employee table.
What are Airflow Operators? Give examples.
Explain the differences between Data Warehouse, Data Lake, and Delta Lake
Explain the differences between a Data Lake and a Data Warehouse.
Can you explain the difference between OLTP and OLAP?
Describe a time when you had to optimize a slow SQL query. What steps did you take?
How do you handle NULL values in SQL? Mention functions like COALESCE and NULLIF.
What is a Common Table Expression (CTE), and when would you use it?
What is the difference between a primary key and a unique key?
What is the difference between WHERE and HAVING clauses in SQL?
How does Spark's Catalyst Optimizer work? Explain its stages.
Explain the concept of Broadcast Join in Spark. When should it be used?
How do you optimize Spark jobs for better performance? Mention at least 5 techniques.
What is the most difficult task you've ever worked on?
What is Snowflake's architecture, and why is it unique?
Retrieve the most recent sale_timestamp for each product (Latest Transaction).
What is the difference between OLTP and OLAP?
What is the difference between SQL and NoSQL databases?
Explain SQL Window Functions with examples.
Explain the use of the MERGE statement in SQL.
How do you handle NULL values in SQL? Mention functions like COALESCE and ISNULL.
How do you optimize a long-running SQL query?
How would you handle duplicate records in an SQL table?
SQL query to find the second highest salary from each department.
What is the difference between a clustered and non-clustered index?
Write an SQL query to find duplicate emails in a users table.
Write a SQL query to find top 3 earners in each department.
Write complex SQL queries involving multiple joins, subqueries, and data aggregation logic.
When would you architecturally choose Dataset[T] over DataFrame in a Scala Spark pipeline, and what are the scalability and portability trade-offs? Include type-safety benefits vs. operational constraints.
Convert complex SQL (CTEs, window functions, subqueries) to production-grade PySpark. Discuss when to use spark.sql() vs. DataFrame API, and the implications for testability, partitioning, and execution predictability.
Design an anti-skew strategy for a join on a high-cardinality key with a long-tail distribution (e.g., a few keys hold 80% of rows). Cover salting, split-skew, AQE, and cost/operational trade-offs.
Walk through the three AQE features in Spark 3.x (coalesce, join switch, skew join)—how they operate at shuffle boundaries, which configs enable them, and what happens when AQE cannot help.
How do you handle data skewness in Spark?
How do you optimize Spark jobs for performance?
Implement a Spark job to find the top 10 most frequent words in a large text file.
What is Adaptive Query Execution (AQE) in Spark 3.x, and how does it improve performance?
What is Spark's Catalyst Optimizer? Explain its stages.
When and how do you use Broadcast Join in Spark?
What is broadcasting in Spark, and why is it used? Can you give an example of its use?
When and how do you use Broadcast Join?
Why is SparkSession used in Spark 2.0 and later versions?
Write a Python script to find the count of each word in a text file using Spark.
Write the PySpark code to find the second highest salary in each department.
What are your strengths and weaknesses?
Introduce yourself, highlighting key projects and tech stacks
Tell me about yourself and your professional background.
What database would you choose for handling transactional and non-transactional data? Why?
Core services of AWS used in data engineering?
Describe the process and use cases of implementing Azure Data Factory pipelines.
Explain how AWS Glue interacts with on-premises SQL databases to extract data efficiently.
Explain the differences between Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse.
How Airflow stores logs and the role of its backend database
How does ADF help streamline data movement in your project?
Moving pipelines from development to production: ARM templates for deployment.
On-Premises to Cloud Integration Runtime
Provide Data Pipeline for GCP Data Engineering
Synapse Analytics Features and Use Cases?
Types of Integration Runtimes (IR) - self-hosted, Azure, SSIS
What are Managed Identities in Azure, and how are they used in securing resources?
Get full access to 1,800+ expert answers, AI mock interviews, and personalized progress tracking.