SQL·13 min read·Published May 1, 2026

SQL Joins for Data Engineer Interviews: Every Type Explained with Real Examples (2026)

SQL joins are the most tested topic in data engineering interviews. Most candidates know INNER and LEFT JOIN but stumble on CROSS APPLY, anti-joins, and self-joins. This guide covers them all with real interview patterns.

Key Takeaways

✓SQL Joins: The #1 Topic That Decides Data Engineering Interviews
✓The 4 Standard Joins (And What Interviewers Actually Test)
✓Anti-Join: Find Rows That DON'T Match
✓Self-Join: Comparing Rows Within the Same Table

SQL Joins: The #1 Topic That Decides Data Engineering Interviews

Every data engineering interview includes at least 2-3 SQL join questions. The topic seems basic, but interviews test much deeper than INNER vs LEFT JOIN:

Can you write a self-join to compare rows within the same table?
Do you know the performance difference between EXISTS and LEFT JOIN ... IS NULL?
Can you handle many-to-many joins without creating duplicates?
Do you understand when a CROSS JOIN is actually useful?

This guide covers every join type with the interview pattern where it appears, including the anti-join and semi-join patterns that most candidates don't know by name.

The 4 Standard Joins (And What Interviewers Actually Test)

INNER JOIN: Returns only matching rows from both tables.

sql

SELECT o.order_id, c.name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id

Interview twist: "What happens if customer_id has NULLs?" Answer: NULL never equals NULL, so rows with NULL customer_id are silently dropped. Use COALESCE or IS NOT DISTINCT FROM if you need NULL-safe joins.

LEFT JOIN: Returns all rows from the left table, NULLs from right where no match.

sql

SELECT c.name, COUNT(o.order_id) as order_count
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.name

Interview pattern: "Find customers who have never placed an order." This is the anti-join pattern — see below.

RIGHT JOIN: Same as LEFT JOIN with tables reversed. Rarely used in practice — prefer LEFT JOIN for readability.

FULL OUTER JOIN: Returns all rows from both tables, NULLs where no match.

Interview pattern: Data reconciliation — compare two data sources and find records that exist in one but not the other.

Anti-Join: Find Rows That DON'T Match

The anti-join is one of the most commonly tested patterns. Three ways to write it:

Method 1 — LEFT JOIN + IS NULL (most common):

sql

SELECT c.name
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
WHERE o.customer_id IS NULL

Method 2 — NOT EXISTS (often fastest):

sql

SELECT c.name
FROM customers c
WHERE NOT EXISTS (
  SELECT 1 FROM orders o WHERE o.customer_id = c.id
)

Method 3 — NOT IN (dangerous!):

sql

SELECT c.name
FROM customers c
WHERE c.id NOT IN (SELECT customer_id FROM orders)

Critical interview knowledge: NOT IN returns no rows if the subquery contains ANY NULL value. This is a trap that catches most candidates. Always prefer NOT EXISTS or LEFT JOIN + IS NULL.

Performance comparison:

NOT EXISTS: Usually fastest (short-circuits on first match)
LEFT JOIN + IS NULL: Close second, sometimes faster on large datasets
NOT IN: Slowest (materializes full subquery result), plus the NULL trap

Self-Join: Comparing Rows Within the Same Table

Self-joins appear in nearly every SQL interview round. Common patterns:

Find employees who earn more than their manager:

sql

SELECT e.name as employee, e.salary, m.name as manager, m.salary as mgr_salary
FROM employees e
INNER JOIN employees m ON e.manager_id = m.id
WHERE e.salary > m.salary

Find consecutive day logins:

sql

SELECT DISTINCT a.user_id
FROM logins a
INNER JOIN logins b ON a.user_id = b.user_id
  AND a.login_date = b.login_date + INTERVAL '1 day'

Find duplicate records:

sql

SELECT a.*
FROM transactions a
INNER JOIN transactions b
  ON a.amount = b.amount
  AND a.user_id = b.user_id
  AND a.timestamp = b.timestamp
  AND a.id < b.id  -- avoid matching with itself

Pro tip: When writing self-joins, always use meaningful aliases (e for employee, m for manager) and add a comment explaining the relationship. Interviewers evaluate code readability.

CROSS JOIN: The Underestimated Workhorse

Most candidates avoid CROSS JOIN because "it creates a Cartesian product." But there are real use cases:

Generate a calendar/date spine:

sql

WITH dates AS (
  SELECT generate_series('2026-01-01'::date, '2026-12-31'::date, '1 day') AS dt
),
users AS (
  SELECT DISTINCT user_id FROM events
)
SELECT u.user_id, d.dt, COALESCE(e.event_count, 0) as events
FROM users u
CROSS JOIN dates d
LEFT JOIN (
  SELECT user_id, event_date, COUNT(*) as event_count
  FROM events GROUP BY 1, 2
) e ON u.user_id = e.user_id AND d.dt = e.event_date

This pattern (CROSS JOIN to generate all combinations, then LEFT JOIN to fill in actual data) is essential for:

Filling gaps in time-series data
Calculating metrics for ALL user-date combinations (including days with zero activity)
Building feature matrices for ML models

Interview signal: Mentioning the date spine + CROSS JOIN pattern shows you've dealt with real analytics problems.

Join Performance: What Senior Engineers Know

Hash Join vs Sort-Merge Join vs Nested Loop:

Hash Join: Builds a hash table from the smaller table, probes with the larger. Best for equi-joins when one table fits in memory.
Sort-Merge Join: Sorts both tables on the join key, then merges. Best when both tables are already sorted or have indexes.
Nested Loop: For each row in table A, scans table B. Terrible for large tables, but optimal for very small lookups.

How to control which join is used:

In Spark: broadcast(small_df) forces a broadcast hash join
In PostgreSQL: SET enable_hashjoin = off to force sort-merge (for debugging)
In BigQuery: Automatic, but broadcast joins happen when one side <10MB

Many-to-many join explosion:

sql

-- If orders has 1000 rows and payments has 3 rows per order
-- This returns 3000 rows, not 1000
SELECT * FROM orders o JOIN payments p ON o.id = p.order_id

Fix: Aggregate one side first, then join:

sql

SELECT o.*, p.total_paid
FROM orders o
JOIN (
  SELECT order_id, SUM(amount) as total_paid
  FROM payments GROUP BY order_id
) p ON o.id = p.order_id

Practice your SQL join answers with real-time AI feedback at DataEngPrep's Answer Analyzer.

Reviewed by Aditya Kumar · Founder and editor

Drafted with research assistance and reviewed by Aditya Kumar, founder and lead editor at DataEngPrep. Questions are sourced from real interviews, initial answers are drafted with AI assistance, and every section is human-edited for technical accuracy, relevance to current FAANG hiring rubrics, and clarity. Articles are reviewed periodically as interview patterns evolve.

Editorial standards →How we use AI →Suggest a correction →

Level up your prep

Recommended

Educative

Educative Unlimited

800+ hands-on courses — Grokking System Design, Coding Patterns, and AI mock interviews for your DE loop.

Start learning →

Fenzo

Fenzo AI

Turn any topic or your own notes into an interactive, personalized course in 60 seconds.

Try it free →

Book · Martin Kleppmann

Designing Data-Intensive Applications

The book that gets data engineers through system-design rounds. Essential reading.

Get the book →

Some links below are affiliate links. If you buy through them we may earn a small commission at no extra cost to you — it helps keep DataEngPrep free.

🪟

SQL Window Functions & CTEs: The Complete Interview Guide for Data Engineers (2026)

SQL · 18 min read

→⚡

Spark Performance Tuning: 15 Interview Questions That Separate Senior Engineers from Juniors (2026)

Spark/Big Data · 20 min read

→🏠

Delta Lake Interview Questions: The Complete Lakehouse Guide for Data Engineers (2026)

Spark/Big Data · 14 min read

→

Practice These Questions

mediumWrite an SQL query to find the second-highest salary from an employee table.→mediumDemonstrate the difference between DENSE_RANK() and RANK()→

Think you can answer these questions? Find out in 30 seconds

Paste your answer and get instant AI feedback — see exactly where your answer is weak and how a interview-ready candidate would respond.

Analyze My Answer — Free See Plans

← Back to all articles

SQL Joins: The #1 Topic That Decides Data Engineering Interviews

Every data engineering interview includes at least 2-3 SQL join questions. The topic seems basic, but interviews test much deeper than INNER vs LEFT JOIN:

Can you write a self-join to compare rows within the same table?
Do you know the performance difference between EXISTS and LEFT JOIN ... IS NULL?
Can you handle many-to-many joins without creating duplicates?
Do you understand when a CROSS JOIN is actually useful?

This guide covers every join type with the interview pattern where it appears, including the anti-join and semi-join patterns that most candidates don't know by name.

The 4 Standard Joins (And What Interviewers Actually Test)

INNER JOIN: Returns only matching rows from both tables.

sql

SELECT o.order_id, c.name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id

LEFT JOIN: Returns all rows from the left table, NULLs from right where no match.

sql

SELECT c.name, COUNT(o.order_id) as order_count
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.name

Interview pattern: "Find customers who have never placed an order." This is the anti-join pattern — see below.

RIGHT JOIN: Same as LEFT JOIN with tables reversed. Rarely used in practice — prefer LEFT JOIN for readability.

FULL OUTER JOIN: Returns all rows from both tables, NULLs where no match.

Interview pattern: Data reconciliation — compare two data sources and find records that exist in one but not the other.

Anti-Join: Find Rows That DON'T Match

The anti-join is one of the most commonly tested patterns. Three ways to write it:

Method 1 — LEFT JOIN + IS NULL (most common):

sql

SELECT c.name
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
WHERE o.customer_id IS NULL

Method 2 — NOT EXISTS (often fastest):

sql

SELECT c.name
FROM customers c
WHERE NOT EXISTS (
  SELECT 1 FROM orders o WHERE o.customer_id = c.id
)

Method 3 — NOT IN (dangerous!):

sql

SELECT c.name
FROM customers c
WHERE c.id NOT IN (SELECT customer_id FROM orders)

Critical interview knowledge: NOT IN returns no rows if the subquery contains ANY NULL value. This is a trap that catches most candidates. Always prefer NOT EXISTS or LEFT JOIN + IS NULL.

Performance comparison:

NOT EXISTS: Usually fastest (short-circuits on first match)
LEFT JOIN + IS NULL: Close second, sometimes faster on large datasets
NOT IN: Slowest (materializes full subquery result), plus the NULL trap

Self-Join: Comparing Rows Within the Same Table

Self-joins appear in nearly every SQL interview round. Common patterns:

Find employees who earn more than their manager:

sql

SELECT e.name as employee, e.salary, m.name as manager, m.salary as mgr_salary
FROM employees e
INNER JOIN employees m ON e.manager_id = m.id
WHERE e.salary > m.salary

Find consecutive day logins:

sql

SELECT DISTINCT a.user_id
FROM logins a
INNER JOIN logins b ON a.user_id = b.user_id
  AND a.login_date = b.login_date + INTERVAL '1 day'

Find duplicate records:

sql

SELECT a.*
FROM transactions a
INNER JOIN transactions b
  ON a.amount = b.amount
  AND a.user_id = b.user_id
  AND a.timestamp = b.timestamp
  AND a.id < b.id  -- avoid matching with itself

Pro tip: When writing self-joins, always use meaningful aliases (e for employee, m for manager) and add a comment explaining the relationship. Interviewers evaluate code readability.

CROSS JOIN: The Underestimated Workhorse

Most candidates avoid CROSS JOIN because "it creates a Cartesian product." But there are real use cases:

Generate a calendar/date spine:

sql

WITH dates AS (
  SELECT generate_series('2026-01-01'::date, '2026-12-31'::date, '1 day') AS dt
),
users AS (
  SELECT DISTINCT user_id FROM events
)
SELECT u.user_id, d.dt, COALESCE(e.event_count, 0) as events
FROM users u
CROSS JOIN dates d
LEFT JOIN (
  SELECT user_id, event_date, COUNT(*) as event_count
  FROM events GROUP BY 1, 2
) e ON u.user_id = e.user_id AND d.dt = e.event_date

This pattern (CROSS JOIN to generate all combinations, then LEFT JOIN to fill in actual data) is essential for:

Filling gaps in time-series data
Calculating metrics for ALL user-date combinations (including days with zero activity)
Building feature matrices for ML models

Interview signal: Mentioning the date spine + CROSS JOIN pattern shows you've dealt with real analytics problems.

Join Performance: What Senior Engineers Know

Hash Join vs Sort-Merge Join vs Nested Loop:

Hash Join: Builds a hash table from the smaller table, probes with the larger. Best for equi-joins when one table fits in memory.
Sort-Merge Join: Sorts both tables on the join key, then merges. Best when both tables are already sorted or have indexes.
Nested Loop: For each row in table A, scans table B. Terrible for large tables, but optimal for very small lookups.

How to control which join is used:

In Spark: broadcast(small_df) forces a broadcast hash join
In PostgreSQL: SET enable_hashjoin = off to force sort-merge (for debugging)
In BigQuery: Automatic, but broadcast joins happen when one side <10MB

Many-to-many join explosion:

sql

-- If orders has 1000 rows and payments has 3 rows per order
-- This returns 3000 rows, not 1000
SELECT * FROM orders o JOIN payments p ON o.id = p.order_id

Fix: Aggregate one side first, then join:

sql

SELECT o.*, p.total_paid
FROM orders o
JOIN (
  SELECT order_id, SUM(amount) as total_paid
  FROM payments GROUP BY order_id
) p ON o.id = p.order_id

Practice your SQL join answers with real-time AI feedback at DataEngPrep's Answer Analyzer.

SQL Joins for Data Engineer Interviews: Every Type Explained with Real Examples (2026)

Key Takeaways

SQL Joins: The #1 Topic That Decides Data Engineering Interviews

The 4 Standard Joins (And What Interviewers Actually Test)

Anti-Join: Find Rows That DON'T Match

Self-Join: Comparing Rows Within the Same Table

CROSS JOIN: The Underestimated Workhorse

Join Performance: What Senior Engineers Know

Reviewed by Aditya Kumar · Founder and editor

Level up your prep

Related Articles

Practice These Questions

Think you can answer these questions? Find out in 30 seconds

SQL Joins for Data Engineer Interviews: Every Type Explained with Real Examples (2026)

Key Takeaways

SQL Joins: The #1 Topic That Decides Data Engineering Interviews

The 4 Standard Joins (And What Interviewers Actually Test)

Anti-Join: Find Rows That DON'T Match

Self-Join: Comparing Rows Within the Same Table

CROSS JOIN: The Underestimated Workhorse

Join Performance: What Senior Engineers Know

Reviewed by Aditya Kumar · Founder and editor

Level up your prep

Related Articles

Practice These Questions

Think you can answer these questions? Find out in 30 seconds