The most frequently asked sql questions in data engineering interviews.
Master sql for your next data engineering interview. These questions cover core concepts, advanced patterns, and real-world scenarios that interviewers test.
What is the difference between SparkSession and SparkContext in Spark?
Write an SQL query to find the second-highest salary from an employee table.
What are Airflow Operators? Give examples.
Explain the differences between Data Warehouse, Data Lake, and Delta Lake
Explain the differences between a Data Lake and a Data Warehouse.
Can you explain the difference between OLTP and OLAP?
Describe a time when you had to optimize a slow SQL query. What steps did you take?
How do you handle NULL values in SQL? Mention functions like COALESCE and NULLIF.
What is a Common Table Expression (CTE), and when would you use it?
What is the difference between a primary key and a unique key?
What is the difference between WHERE and HAVING clauses in SQL?
How does Spark's Catalyst Optimizer work? Explain its stages.
Explain the concept of Broadcast Join in Spark. When should it be used?
How do you optimize Spark jobs for better performance? Mention at least 5 techniques.
What is the most difficult task you've ever worked on?
What is Snowflake's architecture, and why is it unique?
Retrieve the most recent sale_timestamp for each product (Latest Transaction).
What is the difference between OLTP and OLAP?
What is the difference between SQL and NoSQL databases?
Explain SQL Window Functions with examples.
Explain the use of the MERGE statement in SQL.
How do you handle NULL values in SQL? Mention functions like COALESCE and ISNULL.
How do you optimize a long-running SQL query?
How would you handle duplicate records in an SQL table?
SQL query to find the second highest salary from each department.
What is the difference between a clustered and non-clustered index?
Write an SQL query to find duplicate emails in a users table.
Write a SQL query to find top 3 earners in each department.
Write complex SQL queries involving multiple joins, subqueries, and data aggregation logic.
When would you architecturally choose Dataset[T] over DataFrame in a Scala Spark pipeline, and what are the scalability and portability trade-offs? Include type-safety benefits vs. operational constraints.
Convert complex SQL (CTEs, window functions, subqueries) to production-grade PySpark. Discuss when to use spark.sql() vs. DataFrame API, and the implications for testability, partitioning, and execution predictability.
Design an anti-skew strategy for a join on a high-cardinality key with a long-tail distribution (e.g., a few keys hold 80% of rows). Cover salting, split-skew, AQE, and cost/operational trade-offs.
Walk through the three AQE features in Spark 3.x (coalesce, join switch, skew join)—how they operate at shuffle boundaries, which configs enable them, and what happens when AQE cannot help.
How do you handle data skewness in Spark?
How do you optimize Spark jobs for performance?
Implement a Spark job to find the top 10 most frequent words in a large text file.
What is Adaptive Query Execution (AQE) in Spark 3.x, and how does it improve performance?
What is Spark's Catalyst Optimizer? Explain its stages.
When and how do you use Broadcast Join in Spark?
What is broadcasting in Spark, and why is it used? Can you give an example of its use?
When and how do you use Broadcast Join?
Why is SparkSession used in Spark 2.0 and later versions?
Write a Python script to find the count of each word in a text file using Spark.
Write the PySpark code to find the second highest salary in each department.
What are your strengths and weaknesses?
Introduce yourself, highlighting key projects and tech stacks
Tell me about yourself and your professional background.
What database would you choose for handling transactional and non-transactional data? Why?
Core services of AWS used in data engineering?
Describe the process and use cases of implementing Azure Data Factory pipelines.
Explain how AWS Glue interacts with on-premises SQL databases to extract data efficiently.
Explain the differences between Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse.
How Airflow stores logs and the role of its backend database
How does ADF help streamline data movement in your project?
Moving pipelines from development to production: ARM templates for deployment.
On-Premises to Cloud Integration Runtime
Provide Data Pipeline for GCP Data Engineering
Synapse Analytics Features and Use Cases?
Types of Integration Runtimes (IR) - self-hosted, Azure, SSIS
What are Managed Identities in Azure, and how are they used in securing resources?
Download the complete interview prep bundle with expert answers. Study offline, on your commute, anywhere.