SQL Joins for Data Engineer Interviews: Every Type Explained with Real Examples (2026)
SQL joins are the most tested topic in data engineering interviews. Most candidates know INNER and LEFT JOIN but stumble on CROSS APPLY, anti-joins, and self-joins. This guide covers them all with real interview patterns.
Key Takeaways
- βSQL Joins: The #1 Topic That Decides Data Engineering Interviews
- βThe 4 Standard Joins (And What Interviewers Actually Test)
- βAnti-Join: Find Rows That DON'T Match
- βSelf-Join: Comparing Rows Within the Same Table
SQL Joins: The #1 Topic That Decides Data Engineering Interviews
Every data engineering interview includes at least 2-3 SQL join questions. The topic seems basic, but interviews test much deeper than INNER vs LEFT JOIN:
- Can you write a self-join to compare rows within the same table?
- Do you know the performance difference between EXISTS and LEFT JOIN ... IS NULL?
- Can you handle many-to-many joins without creating duplicates?
- Do you understand when a CROSS JOIN is actually useful?
This guide covers every join type with the interview pattern where it appears, including the anti-join and semi-join patterns that most candidates don't know by name.
The 4 Standard Joins (And What Interviewers Actually Test)
INNER JOIN: Returns only matching rows from both tables.
SELECT o.order_id, c.name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.idInterview twist: "What happens if customer_id has NULLs?" Answer: NULL never equals NULL, so rows with NULL customer_id are silently dropped. Use COALESCE or IS NOT DISTINCT FROM if you need NULL-safe joins.
LEFT JOIN: Returns all rows from the left table, NULLs from right where no match.
SELECT c.name, COUNT(o.order_id) as order_count
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.nameInterview pattern: "Find customers who have never placed an order." This is the anti-join pattern β see below.
RIGHT JOIN: Same as LEFT JOIN with tables reversed. Rarely used in practice β prefer LEFT JOIN for readability.
FULL OUTER JOIN: Returns all rows from both tables, NULLs where no match.
Interview pattern: Data reconciliation β compare two data sources and find records that exist in one but not the other.
Anti-Join: Find Rows That DON'T Match
The anti-join is one of the most commonly tested patterns. Three ways to write it:
Method 1 β LEFT JOIN + IS NULL (most common):
SELECT c.name
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
WHERE o.customer_id IS NULLMethod 2 β NOT EXISTS (often fastest):
SELECT c.name
FROM customers c
WHERE NOT EXISTS (
SELECT 1 FROM orders o WHERE o.customer_id = c.id
)Method 3 β NOT IN (dangerous!):
SELECT c.name
FROM customers c
WHERE c.id NOT IN (SELECT customer_id FROM orders)Critical interview knowledge: NOT IN returns no rows if the subquery contains ANY NULL value. This is a trap that catches most candidates. Always prefer NOT EXISTS or LEFT JOIN + IS NULL.
Performance comparison:
- NOT EXISTS: Usually fastest (short-circuits on first match)
- LEFT JOIN + IS NULL: Close second, sometimes faster on large datasets
- NOT IN: Slowest (materializes full subquery result), plus the NULL trap
Self-Join: Comparing Rows Within the Same Table
Self-joins appear in nearly every SQL interview round. Common patterns:
Find employees who earn more than their manager:
SELECT e.name as employee, e.salary, m.name as manager, m.salary as mgr_salary
FROM employees e
INNER JOIN employees m ON e.manager_id = m.id
WHERE e.salary > m.salaryFind consecutive day logins:
SELECT DISTINCT a.user_id
FROM logins a
INNER JOIN logins b ON a.user_id = b.user_id
AND a.login_date = b.login_date + INTERVAL '1 day'Find duplicate records:
SELECT a.*
FROM transactions a
INNER JOIN transactions b
ON a.amount = b.amount
AND a.user_id = b.user_id
AND a.timestamp = b.timestamp
AND a.id < b.id -- avoid matching with itselfPro tip: When writing self-joins, always use meaningful aliases (e for employee, m for manager) and add a comment explaining the relationship. Interviewers evaluate code readability.
CROSS JOIN: The Underestimated Workhorse
Most candidates avoid CROSS JOIN because "it creates a Cartesian product." But there are real use cases:
Generate a calendar/date spine:
WITH dates AS (
SELECT generate_series('2026-01-01'::date, '2026-12-31'::date, '1 day') AS dt
),
users AS (
SELECT DISTINCT user_id FROM events
)
SELECT u.user_id, d.dt, COALESCE(e.event_count, 0) as events
FROM users u
CROSS JOIN dates d
LEFT JOIN (
SELECT user_id, event_date, COUNT(*) as event_count
FROM events GROUP BY 1, 2
) e ON u.user_id = e.user_id AND d.dt = e.event_dateThis pattern (CROSS JOIN to generate all combinations, then LEFT JOIN to fill in actual data) is essential for:
- Filling gaps in time-series data
- Calculating metrics for ALL user-date combinations (including days with zero activity)
- Building feature matrices for ML models
Interview signal: Mentioning the date spine + CROSS JOIN pattern shows you've dealt with real analytics problems.
Join Performance: What Senior Engineers Know
Hash Join vs Sort-Merge Join vs Nested Loop:
- Hash Join: Builds a hash table from the smaller table, probes with the larger. Best for equi-joins when one table fits in memory.
- Sort-Merge Join: Sorts both tables on the join key, then merges. Best when both tables are already sorted or have indexes.
- Nested Loop: For each row in table A, scans table B. Terrible for large tables, but optimal for very small lookups.
How to control which join is used:
- In Spark:
broadcast(small_df)forces a broadcast hash join - In PostgreSQL:
SET enable_hashjoin = offto force sort-merge (for debugging) - In BigQuery: Automatic, but broadcast joins happen when one side <10MB
Many-to-many join explosion:
-- If orders has 1000 rows and payments has 3 rows per order
-- This returns 3000 rows, not 1000
SELECT * FROM orders o JOIN payments p ON o.id = p.order_idFix: Aggregate one side first, then join:
SELECT o.*, p.total_paid
FROM orders o
JOIN (
SELECT order_id, SUM(amount) as total_paid
FROM payments GROUP BY order_id
) p ON o.id = p.order_idPractice your SQL join answers with real-time AI feedback at DataEngPrep's Answer Analyzer.
Written by the DataEngPrep Team
Our editorial team consists of experienced data engineers who have worked at top tech companies and gone through hundreds of real interviews. Every article is reviewed for technical accuracy and practical relevance to help you prepare effectively.
Learn more about our team βRelated Articles
Practice These Questions
Think you can answer these questions? Find out in 30 seconds
Paste your answer and get instant AI feedback β see exactly where your answer is weak and how a FAANG-level candidate would respond.