Question 1

What is the difference between SparkSession and SparkContext in Spark?

Accepted Answer

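**SparkContext** (the main entry point in Spark 1.x): Low-level entry point for RDD operations. Manages the cluster connection, configuration, and RDD creation. Only one SparkContext can be active per JVM. It exposes the RDD API only, and it remains available in later versions as `spark.sparkContext`.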

**SparkSession** (Spark 2.0+): Unified entry point that wraps a SparkContext and replaces SQLContext and HiveContext. Provides the DataFrame, Dataset, SQL, and Structured Streaming APIs; the older DStream API still uses a separate StreamingContext....

Question 2

Write an SQL query to find the second-highest salary from an employee table.

Accepted Answer

**Using subquery with MAX**:
```sql
SELECT MAX(salary) AS second_highest
FROM employee
WHERE salary < (SELECT MAX(salary) FROM employee);
```

**Using LIMIT/OFFSET** (MySQL, PostgreSQL):
```sql
SELECT DISTINCT salary
FROM employee
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
```

**Using DENSE_RANK** (ANSI SQL, most robust):
```sql
SELECT DISTINCT salary  -- DISTINCT collapses ties at rank 2 into one row
FROM (
  SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk
  FROM employee
) t
WHERE rk = 2;
```
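The three approaches behave differently when salaries tie, which is easy to check on a small table. The following is a minimal sketch using Python's built-in `sqlite3` module; the sample rows are made up for illustration, and the window-function query assumes SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical sample data: two employees tie for the top salary, two for second.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("a", 300), ("b", 300), ("c", 200), ("d", 200), ("e", 100)],
)

# Subquery with MAX: always returns exactly one row.
max_subquery = conn.execute(
    "SELECT MAX(salary) FROM employee "
    "WHERE salary < (SELECT MAX(salary) FROM employee)"
).fetchall()

# LIMIT/OFFSET: DISTINCT is what makes OFFSET 1 skip the whole top tier.
limit_offset = conn.execute(
    "SELECT DISTINCT salary FROM employee ORDER BY salary DESC LIMIT 1 OFFSET 1"
).fetchall()

# DENSE_RANK without DISTINCT: one row per employee at rank 2.
dense_rank_rows = conn.execute(
    "SELECT salary FROM ("
    "  SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk FROM employee"
    ") t WHERE rk = 2"
).fetchall()

# DENSE_RANK with DISTINCT: a single value even when employees tie.
dense_rank_distinct = conn.execute(
    "SELECT DISTINCT salary FROM ("
    "  SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk FROM employee"
    ") t WHERE rk = 2"
).fetchall()

print(max_subquery, limit_offset, dense_rank_rows, dense_rank_distinct)
```

On this data every variant agrees on the value 200; only the undeduplicated DENSE_RANK query returns it twice.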

**Architectural Logic & Trade-offs**:
- **Subque...

Question 3

What are Airflow Operators? Give examples.

Accepted Answer

Airflow Operators define a single unit of work in a DAG: each operator instance becomes one task, which should ideally be atomic and idempotent. **Why they matter**: They encapsulate work so DAGs remain declarative and schedulable; the scheduler doesn't need to understand task logic. **Examples**: BashOperator, PythonOperator, SQLExecuteQueryOperator, HttpOperator, DockerOperator, KubernetesPodOperator, and sensors such as FileSensor (several of these ship in provider packages). **Scalability**: Heavy logic should live in external scripts or services; operators should only orchestrate....

Question 4

Explain the differences between a Data Warehouse, a Data Lake, and Delta Lake.

Accepted Answer

**Data Warehouse**: Structured, schema-on-write; optimized for SQL analytics (Snowflake, BigQuery). Higher cost per query, fast queries. **Data Lake**: Raw/semi-structured object storage (S3, ADLS); schema-on-read; low cost, flexible. **Delta Lake**: Open-source storage layer on a data lake adding ACID transactions, schema enforcement, time travel, upserts. **Why the distinction**: Traditional warehouses scale compute and storage together, while lakes decouple them; cloud warehouses such as Snowflake and BigQuery also separate the two....

Question 5

Explain the differences between a Data Lake and a Data Warehouse.

Accepted Answer

**Data Lake**: Low-cost object storage (S3, ADLS) for raw structured, semi-structured, and unstructured data. Schema-on-read; used for exploratory analytics, ML, archival. **Data Warehouse**: Structured, curated storage optimized for SQL; schema-on-write; used for BI and reporting. **Why both exist**: Lakes offer flexibility and low cost at scale; warehouses offer query performance and concurrency....
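The schema-on-write versus schema-on-read distinction can be sketched in a few lines. This is an illustration only: an in-memory SQLite table stands in for a warehouse, a list of raw JSON strings stands in for files in a lake, and the `sales` table, its columns, and `read_amounts` are hypothetical names.

```python
import json
import sqlite3

# Schema-on-write: the table validates each row when it is written.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "  order_id INTEGER NOT NULL,"
    "  amount INTEGER NOT NULL CHECK (typeof(amount) = 'integer')"
    ")"
)
conn.execute("INSERT INTO sales VALUES (1, 250)")
try:
    conn.execute("INSERT INTO sales VALUES (2, 'n/a')")  # wrong type
    write_rejected = False
except sqlite3.IntegrityError:
    write_rejected = True  # the bad row never lands in the table

# Schema-on-read: raw records are stored as-is; structure is applied by the reader.
raw_records = [
    '{"order_id": 1, "amount": 250}',
    '{"order_id": 2, "amount": "n/a"}',
    '{"order_id": 3}',
]


def read_amounts(lines):
    """Parse raw JSON lines and keep only amounts that fit the reader's schema."""
    amounts = []
    for line in lines:
        record = json.loads(line)
        amount = record.get("amount")
        if isinstance(amount, int):
            amounts.append(amount)
    return amounts


clean_amounts = read_amounts(raw_records)
print(write_rejected, clean_amounts)
```

The lake side accepts everything and pushes validation to each reader, which is where the flexibility and the data-quality risk both come from.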

Question 6

Can you explain the difference between OLTP and OLAP?

Accepted Answer

**OLTP**: Optimized for many small transactions (inserts, updates, deletes). Row-oriented, normalized, high concurrency. Examples: MySQL, PostgreSQL. **OLAP**: Optimized for complex analytical queries and aggregations on large datasets. Typically column-oriented and modeled dimensionally (star or snowflake schemas). Examples: Snowflake, BigQuery, Redshift. **Why the split**: The access patterns differ, and mixing them on one system degrades both. OLTP needs low latency and ACID; OLAP needs scan throughput....

Question 7

Describe a time when you had to optimize a slow SQL query. What steps did you take?

Accepted Answer

**Situation**: A critical executive report was timing out after 30+ minutes; the SLA was 5 minutes. **Task**: Diagnose and fix it without changing business logic. **Action**: I ran EXPLAIN (ANALYZE) and found: (1) a missing index on a join key, causing a full table scan, (2) a cross join followed by a filter that could be rewritten as an inner join, (3) a non-aggregate filter in HAVING that could move to WHERE to reduce rows early, (4) an unnecessary ORDER BY in a subquery....
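The "read the plan, then add the missing index" step can be reproduced on a toy table. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` as a stand-in for PostgreSQL's `EXPLAIN (ANALYZE)`; the `orders` table, its columns, and the index name are hypothetical, and the exact plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

QUERY = "SELECT total FROM orders WHERE customer_id = 42"


def query_plan(connection, sql):
    """Return the planner's description of how it will execute `sql`."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the detail text


# Before: no index on customer_id, so the planner scans the whole table.
plan_before = query_plan(conn, QUERY)

# After: the same query can seek through the new index.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, QUERY)

print(plan_before)
print(plan_after)
```

Comparing the two plan strings is the same habit as diffing EXPLAIN output before and after a change on a production engine.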

Question 8

How do you handle NULL values in SQL? Mention functions like COALESCE and NULLIF.

Accepted Answer

**Approaches**: IS NULL / IS NOT NULL for filtering. **COALESCE(val1, val2, ...)**: Returns the first non-NULL value; useful for defaults. **NULLIF(val1, val2)**: Returns NULL if the two values are equal; e.g., NULLIF(divisor, 0) to avoid divide-by-zero. **Why it matters**: NULL propagates in expressions; aggregate functions ignore NULL (except COUNT(*)). A JOIN on NULL keys yields no match because NULL = NULL evaluates to UNKNOWN, not TRUE. **Scalability**: COALESCE in SELECT is cheap; wrapping an indexed column in COALESCE inside WHERE or JOIN can prevent index use....
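These rules are easy to verify interactively. A small sketch with Python's built-in `sqlite3`, using a made-up `metrics` table; note that SQLite itself returns NULL on division by zero, so the NULLIF guard matters most on engines such as PostgreSQL that raise an error instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (revenue REAL, units INTEGER)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?)",
    [(100.0, 4), (50.0, 0), (None, 5)],
)

# COALESCE supplies a default; NULLIF turns a zero divisor into NULL.
per_row = conn.execute(
    "SELECT COALESCE(revenue, 0), revenue / NULLIF(units, 0) "
    "FROM metrics ORDER BY rowid"
).fetchall()

# COUNT(*) counts rows; COUNT(col), SUM and AVG skip NULLs.
aggregates = conn.execute(
    "SELECT COUNT(*), COUNT(revenue), SUM(revenue), AVG(revenue) FROM metrics"
).fetchone()

# NULL = NULL is UNKNOWN (surfaced as None); IS NULL is the correct test.
null_checks = conn.execute("SELECT NULL = NULL, NULL IS NULL").fetchone()

print(per_row, aggregates, null_checks)
```

Note that AVG divides by the two non-NULL rows, not by all three, which is a common source of surprising averages.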

Question 9

What is a Common Table Expression (CTE), and when would you use it?

Accepted Answer

**CTE**: A named temporary result set defined in a WITH clause and referenced in the main query. **Use cases**: Readability (break complex queries into steps), reuse (reference the same CTE multiple times), and recursion (hierarchies such as org charts and bills of materials). **Why it matters**: CTEs improve maintainability; deeply nested subqueries are hard to debug. **Scalability**: Engine behavior differs. PostgreSQL before version 12 always treated CTEs as optimization fences and materialized them once; since version 12 it inlines non-recursive, side-effect-free CTEs that are referenced only once unless MATERIALIZED is specified. Other engines (Snowflake, BigQuery) generally inline them....
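The recursion use case is the one that plain subqueries cannot express. Below is a minimal sketch of a recursive CTE walking an org chart, run through Python's `sqlite3`; the `staff` table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [(1, "Ada", None), (2, "Ben", 1), (3, "Cy", 1), (4, "Di", 2)],
)

ORG_CHART_SQL = """
WITH RECURSIVE chain(id, name, depth) AS (
    -- Anchor member: the root of the hierarchy has no manager.
    SELECT id, name, 0 FROM staff WHERE manager_id IS NULL
    UNION ALL
    -- Recursive member: attach each employee one level below their manager.
    SELECT s.id, s.name, c.depth + 1
    FROM staff AS s
    JOIN chain AS c ON s.manager_id = c.id
)
SELECT name, depth FROM chain ORDER BY depth, id
"""

org_chart = conn.execute(ORG_CHART_SQL).fetchall()
for name, depth in org_chart:
    print("  " * depth + name)
```

The anchor/recursive split is the same in PostgreSQL, SQL Server, and most other engines, although SQL Server omits the RECURSIVE keyword.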

Question 10

What is the difference between a primary key and a unique key?

Accepted Answer

**Primary Key**: Unique identifier; NOT NULL; one per table; often the clustered index (e.g., by default in SQL Server and in MySQL InnoDB). **Unique Key**: Enforces uniqueness; can contain NULLs (standard SQL and most engines allow multiple NULLs in a unique column, while SQL Server allows only one); multiple per table. **Why it matters**: The PK defines identity and is the usual target of foreign keys; unique constraints protect alternate keys (e.g., email). **Scalability**: The PK is often the clustering key, so its choice affects physical layout. Unique indexes enable fast lookups. **Cost**: Each constraint adds index overhead....
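How NULLs interact with a unique constraint depends on the engine, so it is worth testing on the one in use. The sketch below shows SQLite's behavior (multiple NULLs accepted, duplicates rejected) with a hypothetical `users` table; SQL Server would reject the second NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")


def try_insert(connection, row):
    """Insert one row and report whether a key constraint rejected it."""
    try:
        connection.execute("INSERT INTO users VALUES (?, ?)", row)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


outcomes = [
    try_insert(conn, (1, None)),             # NULL email is allowed
    try_insert(conn, (2, None)),             # a second NULL is also allowed here
    try_insert(conn, (3, "a@example.com")),
    try_insert(conn, (4, "a@example.com")),  # duplicate unique value
    try_insert(conn, (1, "b@example.com")),  # duplicate primary key
]
print(outcomes)
```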
