Question 1

Explain the differences between Data Warehouse, Data Lake, and Delta Lake

Accepted Answer

**Data Warehouse**: Structured, schema-on-write; optimized for SQL analytics (Snowflake, BigQuery). Higher compute cost, fast queries. **Data Lake**: Raw or semi-structured data in object storage (S3, ADLS); schema-on-read; low cost, flexible. **Delta Lake**: Open-source storage layer on top of a data lake that adds ACID transactions, schema enforcement, time travel, and upserts. **Why the distinction**: Traditional warehouses scale compute and storage together (cloud warehouses such as Snowflake and BigQuery separate them); lakes decouple them....
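The schema-on-write versus schema-on-read contrast can be sketched in a few lines of plain Python. This is a toy illustration, not how any warehouse or lake is implemented; the `SCHEMA` dictionary, the function names, and the sample records are all hypothetical.

```python
import json

# Hypothetical schema for illustration: column name -> required Python type.
SCHEMA = {"order_id": int, "amount": float}


def write_to_warehouse(table: list, record: dict) -> None:
    """Schema-on-write: validate before storing, reject bad records."""
    for column, expected_type in SCHEMA.items():
        if not isinstance(record.get(column), expected_type):
            raise ValueError(f"{column!r} must be {expected_type.__name__}")
    table.append(record)


def write_to_lake(files: list, raw_line: str) -> None:
    """Schema-on-read: store the raw text as-is, no validation."""
    files.append(raw_line)


def read_from_lake(files: list) -> list:
    """Apply the schema only when reading; skip lines that do not fit."""
    rows = []
    for line in files:
        try:
            data = json.loads(line)
            rows.append({"order_id": int(data["order_id"]),
                         "amount": float(data["amount"])})
        except (ValueError, KeyError, TypeError):
            continue  # malformed data surfaces at read time, not write time
    return rows


warehouse, lake = [], []
write_to_warehouse(warehouse, {"order_id": 1, "amount": 9.5})
write_to_lake(lake, '{"order_id": "2", "amount": "3.25"}')  # strings are accepted
write_to_lake(lake, "not json at all")                      # so is garbage
parsed = read_from_lake(lake)
```

The lake accepts both writes without complaint and only the reader discovers that one line is unusable, which is the flexibility-versus-safety trade-off described above.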

Question 2

Can you explain the difference between OLTP and OLAP?

Accepted Answer

**OLTP**: Optimized for many small transactions (inserts, updates, deletes). Row-oriented, normalized, high concurrency. Examples: MySQL, PostgreSQL. **OLAP**: Optimized for complex analytical queries and aggregations on large datasets. Column-oriented, denormalized (star/snowflake). Examples: Snowflake, BigQuery, Redshift. **Why the split**: Different access patterns; mixing them degrades both. OLTP needs low latency and ACID; OLAP needs scan throughput....
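To make the row-oriented versus column-oriented distinction concrete, here is a minimal in-memory sketch. The table contents and column names are invented for illustration, and real engines add pages, compression, and indexes on top of this basic idea.

```python
# Toy data: three orders, hypothetical column names.
rows = [
    {"order_id": 1, "customer": "ada", "amount": 10.0},
    {"order_id": 2, "customer": "bob", "amount": 25.0},
    {"order_id": 3, "customer": "ada", "amount": 5.0},
]

# Row-oriented (OLTP-style): each record is stored together, so fetching or
# updating one order touches a single record.
row_store = {r["order_id"]: r for r in rows}

# Column-oriented (OLAP-style): each column is stored together, so an
# aggregate reads only the columns it needs.
column_store = {name: [r[name] for r in rows] for name in rows[0]}

point_lookup = row_store[2]                  # OLTP access pattern
total_amount = sum(column_store["amount"])   # OLAP access pattern
```

The point lookup never looks at other orders, while the aggregate never looks at the `customer` column; each layout serves one of the two access patterns well.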

Question 3

What is a Common Table Expression (CTE), and when would you use it?

Accepted Answer

**CTE**: A named temporary result set defined in a WITH clause and referenced in the main query. **Use cases**: Readability: break complex queries into steps. Reusability: reference the same CTE multiple times. Recursion: hierarchies (org charts, bills of materials). **Why it matters**: CTEs improve maintainability; deeply nested subqueries are hard to debug. **Scalability**: In some engines CTEs act as optimization fences and are materialized once (PostgreSQL before version 12 always did this; since version 12 it inlines non-recursive CTEs that are referenced only once, unless they are marked MATERIALIZED). In others (Snowflake, BigQuery), they're inlined....
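The recursion use case is the easiest one to show. The sketch below runs a recursive CTE against an in-memory SQLite database through Python's standard `sqlite3` module; the `employees` table and its rows are made up for the example, and other engines may need small syntax changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Ceo', NULL), (2, 'Vp', 1), (3, 'Manager', 2), (4, 'Engineer', 3);
""")

ORG_CHART_SQL = """
WITH RECURSIVE org_chart(id, name, depth) AS (
    SELECT id, name, 0 FROM employees WHERE manager_id IS NULL   -- anchor: the root
    UNION ALL
    SELECT e.id, e.name, o.depth + 1                             -- step: direct reports
    FROM employees AS e
    JOIN org_chart AS o ON e.manager_id = o.id
)
SELECT name, depth FROM org_chart ORDER BY depth
"""
org_rows = conn.execute(ORG_CHART_SQL).fetchall()
```

The anchor member selects the root of the hierarchy, and the recursive member repeatedly joins employees to the rows found so far until no new reports appear.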

Question 4

How do you remove duplicate rows in BigQuery?

Accepted Answer

**Approach**: Use `ROW_NUMBER() OVER (PARTITION BY dedup_keys ORDER BY tie_breaker)` to define which row to keep, then filter on `rn = 1`. **Preferred pattern**: `CREATE OR REPLACE TABLE ... AS SELECT * EXCEPT(rn) FROM (SELECT *, ROW_NUMBER() OVER (...) AS rn ...) WHERE rn = 1`. **Why `CREATE OR REPLACE` over `DELETE`**: BigQuery storage is columnar, so `DELETE` is a rewrite under the hood. For large tables, `CREATE OR REPLACE` is a single scan and write, versus the read-modify-write of `DELETE`....
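The same `ROW_NUMBER()` pattern can be tried locally. The sketch below uses SQLite (via Python's `sqlite3`, which needs SQLite 3.25 or later for window functions) as a stand-in for BigQuery; the `events` table, its columns, and the choice of `updated_at` as tie-breaker are assumptions for the example. SQLite has no `SELECT * EXCEPT(...)`, so the kept columns are listed explicitly, and the result is written to a new table rather than replacing the original.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, email TEXT, updated_at TEXT);
    INSERT INTO events VALUES
        (1, 'a@old.example', '2024-01-01'),
        (1, 'a@new.example', '2024-02-01'),
        (2, 'b@example.com', '2024-01-15');
""")

# Keep the newest row per user_id (the dedup key); rn = 1 marks the survivor.
conn.executescript("""
    CREATE TABLE events_dedup AS
    SELECT user_id, email, updated_at
    FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY user_id ORDER BY updated_at DESC
        ) AS rn
        FROM events
    )
    WHERE rn = 1;
""")
deduped = conn.execute(
    "SELECT user_id, email FROM events_dedup ORDER BY user_id"
).fetchall()
```

If two rows tie on the ordering column, which one receives `rn = 1` is not deterministic, so in practice the `ORDER BY` should include enough columns to break ties.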

Question 5

What is Snowflake's architecture, and why is it unique?

Accepted Answer

**Section 1 — The Context (The 'Why')**
Traditional data warehouses collapse under elastic concurrency: fixed clusters either over-provision (cost) or under-provision (queuing). Storage-compute coupling means scaling queries requires scaling storage nodes....

Question 6

Have you worked on Data Warehousing projects?

Accepted Answer

**Architectural context**: A data warehouse is the semantic layer between raw data and business decisions. Design choices—star vs snowflake, SCD strategy, partitioning—directly impact query latency, storage cost, and maintenance burden. **Key responsibilities**: (1) **Schema design**: Star for BI simplicity, snowflake for normalized flexibility. SCD Type 2 for slowly changing dimensions (audit trail, point-in-time correctness)....
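As a rough illustration of why SCD Type 2 gives an audit trail and point-in-time correctness, here is a small in-memory sketch. The `CustomerVersion` record, the `valid_from`/`valid_to` column names, and the helper functions are hypothetical; a real warehouse would typically do this with a `MERGE` statement against a dimension table.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CustomerVersion:
    customer_id: int
    city: str
    valid_from: date
    valid_to: Optional[date] = None   # None marks the current version


def apply_scd2(history: List[CustomerVersion], customer_id: int,
               new_city: str, change_date: date) -> None:
    """Close the current version (if the city changed) and append a new one."""
    current = next((v for v in history
                    if v.customer_id == customer_id and v.valid_to is None), None)
    if current is not None:
        if current.city == new_city:
            return                      # no change, nothing to record
        current.valid_to = change_date  # expire the old version
    history.append(CustomerVersion(customer_id, new_city, change_date))


def city_as_of(history: List[CustomerVersion], customer_id: int,
               as_of: date) -> Optional[str]:
    """Point-in-time lookup: the version whose validity interval covers as_of."""
    for v in history:
        if (v.customer_id == customer_id and v.valid_from <= as_of
                and (v.valid_to is None or as_of < v.valid_to)):
            return v.city
    return None


history: List[CustomerVersion] = []
apply_scd2(history, 42, "Berlin", date(2023, 1, 1))
apply_scd2(history, 42, "Munich", date(2024, 6, 1))
```

Because the old row is expired rather than overwritten, `city_as_of(history, 42, date(2023, 12, 31))` still resolves to the Berlin version, which is what lets historical facts join to the dimension values that were true at the time.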

Question 7

Retrieve the most recent sale_timestamp for each product (Latest Transaction).

Accepted Answer

**SQL (full row)**: `SELECT * FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY sale_timestamp DESC) AS rn FROM sales) t WHERE rn = 1` — Returns the full record. **SQL (timestamp only)**: `SELECT product_id, MAX(sale_timestamp) FROM sales GROUP BY product_id` — Use when you only need the timestamp. **PySpark**: `df.withColumn('rn', F.row_number().over(Window.partitionBy('product_id').orderBy(F.desc('sale_timestamp')))).filter(F.col('rn') == 1).drop('rn')` — Full row....
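Both SQL variants can be checked side by side on a tiny dataset. This sketch uses an in-memory SQLite database through Python's `sqlite3` (SQLite 3.25 or later for window functions); the sample rows, the `amount` column, and the `latest_sale_timestamp` alias are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (product_id INTEGER, sale_timestamp TEXT, amount REAL);
    INSERT INTO sales VALUES
        (1, '2024-01-01 09:00:00', 10.0),
        (1, '2024-03-01 12:00:00', 12.0),
        (2, '2024-02-10 08:30:00', 7.5);
""")

# Full row: rank sales per product, newest first, and keep rank 1.
full_rows = conn.execute("""
    SELECT product_id, sale_timestamp, amount
    FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY product_id ORDER BY sale_timestamp DESC
        ) AS rn
        FROM sales
    ) AS t
    WHERE rn = 1
    ORDER BY product_id
""").fetchall()

# Timestamp only: a plain aggregate is enough.
latest_only = conn.execute("""
    SELECT product_id, MAX(sale_timestamp) AS latest_sale_timestamp
    FROM sales
    GROUP BY product_id
    ORDER BY product_id
""").fetchall()
```

The window-function version carries the other columns of the winning row along, while the `GROUP BY` version returns only the key and the timestamp.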

Question 8

What is the difference between OLTP and OLAP?

Accepted Answer

**Why the distinction exists**: They serve different access patterns. OLTP = many small, random writes and point reads. OLAP = few, large sequential scans and aggregations. Optimizing for one degrades the other. **OLTP**: Row-oriented storage (fast single-row access). Normalized schema (3NF) to avoid update anomalies. Indexes for lookup (B-tree). ACID for consistency. High concurrency via locking or MVCC. Examples: PostgreSQL, MySQL, Oracle....

Question 9

Difference Between Internal and External Tables in BigQuery

Accepted Answer

**Architectural Logic**: Internal (native) tables: BigQuery owns metadata and storage (Colossus). Columnar layout, partitioning, clustering, automatic optimization. Dropping = data deleted. External tables: BigQuery owns metadata only; data lives in GCS, Drive, Bigtable. Dropping = metadata deleted, data persists. **Why**: Internal enables DML, streaming inserts, partitioning strategies, cost-effective storage classes....

Question 10

Explain Common Table Expressions (CTEs) and their benefits.

Accepted Answer

**Architectural Logic**: CTEs are named subqueries in a WITH clause. Depending on the engine, they are either inlined into the referencing query or materialized once. They provide logical decomposition without necessarily forcing physical materialization. **Why**: Readability and reuse: complex pipelines split into stages (raw → cleansed → aggregated). Recursion for hierarchies (org charts, bills of materials). Some engines inline CTEs; others can materialize them for reuse (e.g., PostgreSQL when a CTE is referenced more than once)....
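A staged pipeline of the raw → cleansed → aggregated kind might look like the following. It is a sketch run on SQLite through Python's `sqlite3`; the `raw_orders` table, its messy sample values, and the cleaning rules (trim, lowercase, keep amounts that start with a digit) are assumptions chosen to keep the example small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (customer TEXT, amount TEXT);
    INSERT INTO raw_orders VALUES
        (' Ada ', '10.5'), ('ada', '4.5'), ('Bob', 'n/a'), ('bob', '20');
""")

STAGED_SQL = """
WITH cleansed AS (                       -- stage 1: normalize and filter
    SELECT LOWER(TRIM(customer)) AS customer,
           CAST(amount AS REAL)  AS amount
    FROM raw_orders
    WHERE amount GLOB '[0-9]*'           -- SQLite-specific pattern match
),
aggregated AS (                          -- stage 2: one row per customer
    SELECT customer, SUM(amount) AS total
    FROM cleansed
    GROUP BY customer
)
SELECT customer, total FROM aggregated ORDER BY customer
"""
totals = conn.execute(STAGED_SQL).fetchall()
```

Each stage has a name and a single job, so a wrong total can be debugged by selecting from `cleansed` alone instead of unpicking nested subqueries.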
