Question 1

What is a Common Table Expression (CTE), and when would you use it?

Accepted Answer

**CTE**: A named temporary result set defined in a WITH clause and referenced in the main query. **Use cases**: Readability: break complex queries into steps. Reusability: reference the same CTE multiple times. Recursion: hierarchies (org chart, bills of materials). **Why it matters**: CTEs improve maintainability; deep subqueries are hard to debug. **Scalability**: In PostgreSQL before version 12, every CTE was an optimization fence and was materialized once. From version 12 on, a non-recursive, side-effect-free CTE that is referenced only once is inlined unless it is marked `MATERIALIZED`. In other engines (Snowflake, BigQuery), they're generally inlined....
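The recursion use case can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not a production pattern: the `employees` table, its rows, and the CTE name `chain` are made up for the example, and SQLite stands in for whichever engine you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical org chart: manager_id points at another employee's id.
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Ada', NULL), (2, 'Ben', 1), (3, 'Cy', 1), (4, 'Dee', 2);
""")

# Recursive CTE: the anchor member selects the root (no manager); the
# recursive member joins employees to rows already in the CTE, one level
# deeper on each pass, until no new rows are produced.
org_chart = conn.execute("""
    WITH RECURSIVE chain(id, name, depth) AS (
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.id, e.name, c.depth + 1
        FROM employees AS e
        JOIN chain AS c ON e.manager_id = c.id
    )
    SELECT name, depth FROM chain ORDER BY depth, name
""").fetchall()

print(org_chart)
```

The same shape applies to a bill of materials: replace the manager link with a parent-component link.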

Question 2

What is the difference between a primary key and a unique key?

Accepted Answer

**Primary Key**: Unique identifier; NOT NULL; one per table; often clustered. **Unique Key**: Enforces uniqueness; can contain NULL (standard SQL and most engines allow multiple NULLs because NULLs are not considered equal to each other; SQL Server allows only one); multiple per table. **Why it matters**: The PK defines row identity and is the usual target of foreign keys; unique constraints protect alternate keys (e.g., email). **Scalability**: The PK is often the clustering key, so the choice affects physical layout. Unique indexes enable fast lookups. **Cost**: Each constraint adds index overhead....
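A small sketch of the two constraints in action, using Python's built-in `sqlite3` module. The `users` table and the helper `try_insert` are hypothetical names for this example. NULL handling in unique columns varies by engine: SQLite, used here, accepts several NULLs, while SQL Server accepts only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")

# SQLite accepts both NULL emails: NULLs are not treated as equal to each other.
conn.execute("INSERT INTO users (id, email) VALUES (1, NULL)")
conn.execute("INSERT INTO users (id, email) VALUES (2, NULL)")
conn.execute("INSERT INTO users (id, email) VALUES (3, 'a@example.com')")

def try_insert(sql):
    """Run an INSERT and report whether a constraint rejected it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"

# Duplicate non-NULL value in the unique column.
dup_email = try_insert("INSERT INTO users (id, email) VALUES (4, 'a@example.com')")
# Duplicate primary key value.
dup_pk = try_insert("INSERT INTO users (id, email) VALUES (3, 'b@example.com')")

null_count = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email IS NULL"
).fetchone()[0]
print(dup_email, dup_pk, null_count, sep="\n")
```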

Question 3

Explain Fact and Dimension Tables with examples.

Accepted Answer

Architecture: Star schema centralizes measurable events in fact tables; dimensions provide semantic context. Why this design: Facts are append-heavy and grow unbounded; dimensions are smaller and change slowly. Separating them optimizes for different access patterns. Fact grain defines the entire schema—get it wrong and joins become wrong. Example: sales_fact (quantity, revenue, date_key, product_key, customer_key) at grain one row per transaction....
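As a rough sketch of the star-schema idea, the snippet below builds a tiny `sales_fact` table with one dimension and rolls revenue up by a dimension attribute. It uses Python's built-in `sqlite3` module; the `product_dim` table, its columns, and all rows are invented for illustration, and only one of the three dimensions from the example above is modeled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension: small, descriptive, slowly changing (hypothetical columns).
    CREATE TABLE product_dim (
        product_key INTEGER PRIMARY KEY,
        name        TEXT,
        category    TEXT
    );
    -- Fact: one row per transaction, holding measures plus dimension keys.
    CREATE TABLE sales_fact (
        date_key     INTEGER,
        product_key  INTEGER REFERENCES product_dim(product_key),
        customer_key INTEGER,
        quantity     INTEGER,
        revenue      REAL
    );
    INSERT INTO product_dim VALUES
        (1, 'Espresso', 'Coffee'), (2, 'Latte', 'Coffee'), (3, 'Scone', 'Bakery');
    INSERT INTO sales_fact VALUES
        (20240101, 1, 10, 2, 6.0),
        (20240101, 3, 11, 1, 3.5),
        (20240102, 2, 10, 1, 4.5);
""")

# Typical star-schema query: aggregate fact measures, grouped by a
# descriptive attribute that lives in the dimension.
revenue_by_category = conn.execute("""
    SELECT d.category, SUM(f.quantity), SUM(f.revenue)
    FROM sales_fact AS f
    JOIN product_dim AS d ON d.product_key = f.product_key
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()

print(revenue_by_category)
```

Because the grain is one row per transaction, each fact row joins to exactly one product row, so the sums are not inflated by the join.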

Question 4

Joins and window functions - INNER, LEFT, RIGHT, FULL OUTER, ROW_NUMBER(), RANK(), DENSE_RANK()

Accepted Answer

Joins: INNER = only rows that match on both sides; LEFT = all left rows + matching right rows (NULL fill where there is no match); RIGHT = mirror of LEFT; FULL OUTER = all rows from both sides, NULL-filled where there is no match. Why it matters: Join choice affects result cardinality and NULL handling, so the wrong join produces the wrong business logic (e.g., use LEFT to preserve all customers even without orders). Window functions: ROW_NUMBER() = unique sequential number 1, 2, 3, even for ties; RANK() = ties share a rank, gaps after; DENSE_RANK() = ties share a rank, no gaps....
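The differences are easiest to see on a tiny data set. The sketch below uses Python's built-in `sqlite3` module with made-up `customers` and `orders` tables. It assumes the bundled SQLite is 3.25 or newer (needed for window functions) and leaves out FULL OUTER, which older SQLite versions do not support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cat');
    -- Cat has no orders; Ann has two orders with the same amount (a tie).
    INSERT INTO orders VALUES (1, 1, 50), (2, 1, 50), (3, 2, 30);
""")

# INNER JOIN: customers without orders disappear.
inner_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.name, o.id
""").fetchall()

# LEFT JOIN: every customer is kept; missing order columns become NULL (None).
left_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.name, o.id
""").fetchall()

# The three ranking functions over the same ordering by amount.
# ROW_NUMBER gets id as a tie-breaker so its output is deterministic.
ranked = conn.execute("""
    SELECT amount,
           ROW_NUMBER() OVER (ORDER BY amount DESC, id) AS row_num,
           RANK()       OVER (ORDER BY amount DESC)     AS rnk,
           DENSE_RANK() OVER (ORDER BY amount DESC)     AS dense_rnk
    FROM orders
    ORDER BY row_num
""").fetchall()

print(inner_rows, left_rows, ranked, sep="\n")
```

In `ranked`, the two tied amounts share rank 1; RANK() then jumps to 3 for the next row while DENSE_RANK() continues with 2.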

Question 5

Difference Between Internal and External Tables in BigQuery

Accepted Answer

**Architectural Logic**: Internal (native) tables: BigQuery owns metadata and storage (Colossus). Columnar layout, partitioning, clustering, automatic optimization. Dropping = data deleted. External tables: BigQuery owns metadata only; data lives in GCS, Drive, Bigtable. Dropping = metadata deleted, data persists. **Why**: Internal enables DML, streaming inserts, partitioning strategies, cost-effective storage classes....

Question 6

How do you optimize a long-running SQL query?

Accepted Answer

**Architectural Logic**: Optimization is diagnostic-first. 1. Profile: EXPLAIN/EXPLAIN ANALYZE to find bottleneck (scan, join, sort, spill). 2. Reduce input: Filter early (WHERE, partition pruning); SELECT only needed columns. 3. Indexing: B-tree on filter/join columns; avoid over-indexing (writes slow). 4. Partitioning: Date/tenant partitioning for pruning. 5. Join strategy: Broadcast small dims; avoid cross joins. 6. Statistics: Up-to-date stats for planner. 7....
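The profile-then-index steps can be sketched with SQLite's `EXPLAIN QUERY PLAN`, run through Python's built-in `sqlite3` module. This is an illustration under stated assumptions: the `orders` table, the index name `idx_orders_customer`, and the helper `plan` are hypothetical, and the plan text differs between engines and SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

query = "SELECT SUM(amount) FROM orders WHERE customer_id = 42"

def plan(sql):
    """Return the planner's description of how it would run `sql`."""
    # The last column of each EXPLAIN QUERY PLAN row is a readable detail string.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Step 1: profile. Without an index the filter forces a full table scan.
plan_before = plan(query)

# Step 3: index the filter column, then profile again.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(query)

print("before:", plan_before)
print("after: ", plan_after)
```

Comparing the two plans before and after a change is the point: the index is only worth its write overhead if the planner actually uses it for the queries that matter.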

Question 7

Cloud Architecture - explain

Accepted Answer

**Section 1 — The Context (The 'Why')**
The primary challenge for this design in SQL is balancing scale, cost, and reliability. At scale, naive approaches fail: single points of failure cause cascades, schema evolution breaks consumers, and over-provisioning explodes cost. Failure modes include silent data loss from non-idempotent writes, cascading job failures from tight coupling, and operational burden from manual intervention....

Question 8

Consolidate hotel reviews and create a dashboard. Design a data model for the reviews.

Accepted Answer

**Section 1 — The Context (The 'Why')**
The primary challenge for this design in SQL is balancing scale, cost, and reliability. At scale, naive approaches fail: single points of failure cause cascades, schema evolution breaks consumers, and over-provisioning explodes cost. Failure modes include silent data loss from non-idempotent writes, cascading job failures from tight coupling, and operational burden from manual intervention....

Question 9

Create Spark Session, read CSV, join, and write as table. Provide example code.

Accepted Answer

**Architectural Logic**: Production Spark patterns require config, partitioning, and join optimization.

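```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast  # needed for the broadcast join hint

# Build (or reuse) a session with adaptive query execution enabled.
spark = SparkSession.builder.appName("ETL")\
  .config("spark.sql.adaptive.enabled", "true")\
  .getOrCreate()

# Read both CSV files, treating the first row as column names.
df1 = spark.read.option("header", True).csv("s3://bucket/orders.csv")
df2 = spark.read.option("header", True).csv("s3://bucket/customers.csv")

# Left join keeps every order; broadcasting the smaller customers table avoids a shuffle.
joined = df1.join(broadcast(df2), df1.customer_id == df2.id, "left")
joined.write.partitionBy(...
```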

Question 10

Data Warehouse Design from scratch

Accepted Answer

**Section 1 — The Context (The 'Why')**
The primary challenge for this design in SQL is balancing scale, cost, and reliability. At scale, naive approaches fail: single points of failure cause cascades, schema evolution breaks consumers, and over-provisioning explodes cost. Failure modes include silent data loss from non-idempotent writes, cascading job failures from tight coupling, and operational burden from manual intervention....

Hard SQL Interview Questions

Difficulty Breakdown

Key Topics Covered

How to Use This Guide

Companies asking these questions

All 60 Questions

More Interview Prep Guides

Practice with AI — Not Just Reading
