Data engineering interview questions
Finding nth Highest Salary within each department
Given a CSV file with raw customer transactions, design an ETL pipeline that cleans data, aggregates total sales by region and product, and loads into target table
Given a complex nested query, how would you refactor it for better readability and efficiency?
Given a dataset, perform transformations: Filter rows where sales > 1000, Add a new column calculating a 10% discount on sales, Group data by region and calculate total revenue.
Given a table of sales data, use window functions to calculate a running total.
Given a table with 10 records and another with 4 records, how many records result from a cross join?
Given a table with sales data, write a query to find consecutive days with decreasing revenue.
Given an unoptimized query execution plan, how would you diagnose and improve performance?
Given the data below, explain the results of different types of joins: Inner Join, Left Join, Right Join. Will a schema be created?
Given two dataframes (df1: id, name and df2: id, country, address, city, count), join them, filter for rows where country = 'Singapore', and pivot the output. Sort cities in descending order of population count
Given two tables, calculate the row count for different types of joins (inner, left, right, and full outer)
HAVING vs WHERE - explain
Handle nulls, duplicates, and inconsistent timestamp formats in data.
Hierarchical Employee-Manager Query - navigate hierarchy with head of company as employee_id = 1
How can you automate data insertion into BigQuery using Python?
How can you delete partitions from a table in Hive using a command?
How did you manage a situation where you lacked knowledge for a task?
How do partitioning strategies differ between source and sink?
How do partitions improve query performance in fact tables?
How do quarantine tables ensure data quality in downstream pipelines?
Type or paste your answer to any of these questions and our AI Coach scores it, highlights gaps, and rewrites it at FAANG quality. Free to try.
SQL is the most tested topic in data engineering interviews. Most companies dedicate an entire round to SQL, typically asking 3-5 questions covering window functions, CTEs, joins, optimization, and platform-specific features.
Focus on: window functions (RANK, ROW_NUMBER, LAG/LEAD), CTEs and recursive queries, query optimization and execution plans, indexing strategies, and platform-specific features for BigQuery, Redshift, or Snowflake depending on the company.
Yes. Data engineering SQL rounds emphasize analytical queries (window functions, aggregations), large-scale optimization (partitioning, indexing), and data warehouse concepts (star schema, slowly changing dimensions). Software engineering SQL tends to focus on CRUD operations and basic joins.
For a mid-level data engineering role, plan 2-4 weeks of focused SQL practice. Cover window functions, CTEs, optimization, and practice writing queries under time pressure. Use real interview questions from companies you're targeting.