Question 1

What is the difference between SparkSession and SparkContext in Spark?

Accepted Answer

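**SparkContext** (the Spark 1.x entry point, still available in later versions as `spark.sparkContext`): Low-level entry point for RDD operations. Manages the cluster connection, configuration, and RDD creation. Only one SparkContext can be active per JVM. It exposes the RDD API only.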

**SparkSession** (Spark 2.0+): Unified entry point that wraps a SparkContext and replaces SQLContext and HiveContext. The DStream-based StreamingContext is not subsumed and remains a separate entry point. Provides DataFrame, Dataset, SQL, and Structured Streaming APIs....

Question 2

Write an SQL query to find the second-highest salary from an employee table.

Accepted Answer

**Using subquery with MAX**:
```sql
SELECT MAX(salary) AS second_highest
FROM employee
WHERE salary < (SELECT MAX(salary) FROM employee);
```

**Using LIMIT/OFFSET** (MySQL, PostgreSQL):
```sql
SELECT DISTINCT salary
FROM employee
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
```

**Using DENSE_RANK** (ANSI SQL, most robust):
```sql
SELECT DISTINCT salary  -- DISTINCT: several employees can share rank 2
FROM (
  SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk
  FROM employee
) t
WHERE rk = 2;
```
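
The three queries differ mainly in how they treat ties and a missing second value. The sketch below runs them against an in-memory SQLite database as a stand-in for MySQL or PostgreSQL; the `run_all` helper and the sample salaries are illustrative assumptions, and the DENSE_RANK variant here uses `SELECT DISTINCT` so tied salaries collapse to one row.

```python
import sqlite3

# The three approaches, keyed by a short label (labels are illustrative).
QUERIES = {
    "max_subquery": """
        SELECT MAX(salary) FROM employee
        WHERE salary < (SELECT MAX(salary) FROM employee)
    """,
    "limit_offset": """
        SELECT DISTINCT salary FROM employee
        ORDER BY salary DESC LIMIT 1 OFFSET 1
    """,
    "dense_rank": """
        SELECT DISTINCT salary FROM (
            SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk
            FROM employee
        ) t
        WHERE rk = 2
    """,
}


def run_all(salaries):
    """Load the salaries into a throwaway table and run every query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, salary INTEGER)")
    conn.executemany(
        "INSERT INTO employee (salary) VALUES (?)", [(s,) for s in salaries]
    )
    results = {
        name: [row[0] for row in conn.execute(sql)] for name, sql in QUERIES.items()
    }
    conn.close()
    return results


with_ties = run_all([300, 300, 200, 200, 100])
single_value = run_all([300])
```

With ties at the top, all three agree. With only one distinct salary, the MAX subquery returns a single NULL row while the other two return no rows, which matters if downstream code expects exactly one row.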

**Architectural Logic & Trade-offs**:
- **Subque...

Question 3

Explain Fact and Dimension Tables with examples.

Accepted Answer

Architecture: A star schema centralizes measurable events in fact tables; dimension tables provide the descriptive context. Why this design: Facts are append-heavy and grow without bound; dimensions are smaller and change slowly. Separating them optimizes for different access patterns. The fact grain defines the entire schema: get it wrong and joins fan out, so aggregates double-count. Example: sales_fact (quantity, revenue, date_key, product_key, customer_key) at a grain of one row per transaction....
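
A minimal sketch of the fact/dimension split, using an in-memory SQLite database. The `sales_fact` columns come from the example above; the `product_dim` table, its columns, and the sample rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension: small, descriptive, slowly changing (hypothetical columns).
    CREATE TABLE product_dim (
        product_key INTEGER PRIMARY KEY,
        name        TEXT,
        category    TEXT
    );

    -- Fact: one row per transaction, holding measures plus dimension keys.
    CREATE TABLE sales_fact (
        date_key     INTEGER,
        product_key  INTEGER REFERENCES product_dim (product_key),
        customer_key INTEGER,
        quantity     INTEGER,
        revenue      REAL
    );

    INSERT INTO product_dim VALUES
        (1, 'Laptop', 'Electronics'),
        (2, 'Desk',   'Furniture');

    INSERT INTO sales_fact VALUES
        (20240101, 1, 10, 1, 1200.0),
        (20240101, 2, 11, 2,  600.0),
        (20240102, 1, 12, 1, 1200.0);
    """
)


def revenue_by_category(connection):
    """Aggregate fact measures, grouped by a dimension attribute."""
    sql = """
        SELECT d.category, SUM(f.quantity), SUM(f.revenue)
        FROM sales_fact AS f
        JOIN product_dim AS d ON d.product_key = f.product_key
        GROUP BY d.category
        ORDER BY d.category
    """
    return connection.execute(sql).fetchall()


totals = revenue_by_category(conn)
```

Because each fact row joins to exactly one dimension row, the sums stay correct; a dimension with duplicate keys would multiply fact rows and inflate the totals.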

Question 4

How do you remove duplicate rows in BigQuery?

Accepted Answer

Approach: Use ROW_NUMBER() OVER (PARTITION BY dedup_keys ORDER BY tie_breaker) to define which row to keep; filter rn = 1. Preferred pattern: CREATE OR REPLACE TABLE ... AS SELECT * EXCEPT(rn) FROM (SELECT *, ROW_NUMBER() OVER (...) AS rn ...) WHERE rn = 1. Why CREATE OR REPLACE over DELETE: BigQuery is columnar; DELETE is a rewrite under the hood. For large tables, CREATE OR REPLACE is a single scan+write vs DELETE's read-modify-write....
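
The preferred pattern can be written out in full. The helper below is a sketch that only assembles the BigQuery statement as a string; `build_dedup_sql`, the table name, and the column names are hypothetical, and the identifiers are assumed to come from trusted configuration rather than user input.

```python
def build_dedup_sql(table, dedup_keys, tie_breaker):
    """Return a BigQuery statement that keeps one row per dedup key.

    table       -- fully qualified table name (hypothetical example below)
    dedup_keys  -- columns that define a duplicate
    tie_breaker -- ORDER BY expression that decides which row survives
    """
    if not dedup_keys:
        raise ValueError("dedup_keys must not be empty")
    partition_by = ", ".join(dedup_keys)
    return (
        f"CREATE OR REPLACE TABLE `{table}` AS\n"
        f"SELECT * EXCEPT(rn)\n"
        f"FROM (\n"
        f"  SELECT *, ROW_NUMBER() OVER ("
        f"PARTITION BY {partition_by} ORDER BY {tie_breaker}) AS rn\n"
        f"  FROM `{table}`\n"
        f")\n"
        f"WHERE rn = 1"
    )


# Keep the most recently ingested row for each event_id (assumed columns).
sql = build_dedup_sql(
    "my_project.my_dataset.events", ["event_id"], "ingested_at DESC"
)
```

The tie-breaker should be deterministic; if two duplicates share the same `ingested_at`, which one gets `rn = 1` is arbitrary.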

Question 5

How do you handle late-arriving data in Spark Structured Streaming?

Accepted Answer

Watermark: Defines the maximum tolerated lateness (e.g., 10 min). Events whose event time is earlier than (max event time seen - watermark delay) are treated as too late and may be dropped. State: Kept for aggregations within the watermark; beyond that, state is purged to avoid unbounded growth. Output modes: Append (only finalized results), Update (changed rows), Complete (full result). Why: It is a trade-off between latency and correctness. A tighter watermark means less state but more dropped late data; a looser one means more state and more complete results....
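
The drop rule can be illustrated with a simplified model in plain Python. This is not Spark's implementation: it assumes the watermark advances only between micro-batches and ignores windows and state; `apply_watermark` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def apply_watermark(batches, delay):
    """Split event times into accepted and dropped under a watermark.

    batches -- list of micro-batches, each a list of event-time datetimes
    delay   -- maximum tolerated lateness as a timedelta
    """
    max_event_time = None
    accepted, dropped = [], []
    for batch in batches:
        # The watermark for this batch is based on data seen in earlier batches.
        watermark = None if max_event_time is None else max_event_time - delay
        for event_time in batch:
            if watermark is not None and event_time < watermark:
                dropped.append(event_time)
            else:
                accepted.append(event_time)
        # Advance the maximum event time after the batch has been processed.
        batch_max = max(batch, default=None)
        if batch_max is not None and (
            max_event_time is None or batch_max > max_event_time
        ):
            max_event_time = batch_max
    return accepted, dropped


t0 = datetime(2024, 1, 1, 12, 0)
minutes = lambda m: t0 + timedelta(minutes=m)

# Batch 1 pushes the max event time to 12:20, so batch 2 sees a 12:10 watermark.
accepted, dropped = apply_watermark(
    [[minutes(0), minutes(20)], [minutes(5), minutes(15)]],
    delay=timedelta(minutes=10),
)
```

In this run the 12:05 event arrives after the watermark has moved to 12:10 and is dropped, while the 12:15 event is late but still within tolerance.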

Question 6

What is the small-file problem in Spark, and how do you solve it?

Accepted Answer

Problem: Thousands of tiny files (KB–MB) cause metadata overhead, slow S3/HDFS listing, many small tasks, and I/O thrashing. Root causes: High parallelism, over-partitioning, streaming micro-batches. Why it hurts: Every file costs an open and a metadata lookup, and with Hadoop input formats each file becomes at least one task, so 10K files can mean 10K tasks of scheduling overhead. S3 LIST is rate-limited; listing 100K files can take minutes. Solutions: 1) coalesce/repartition before write to target 128MB–1GB per file. 2) Table-format compaction (e.g., Delta Lake OPTIMIZE or auto-compaction). 3) Larger partition sizes....
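
For the first solution, the number of output files follows from simple arithmetic. The sketch below assumes the dataset size is known in bytes and that files should land near a target size; `target_file_count` is a hypothetical helper, and real output sizes will differ from the estimate because of compression and encoding.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def target_file_count(total_bytes, target_file_bytes=128 * MIB):
    """Estimate how many output files give roughly target_file_bytes each."""
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    # Always write at least one file, even for an empty or tiny dataset.
    return max(1, math.ceil(total_bytes / target_file_bytes))


files_at_128mib = target_file_count(10 * GIB)        # 10 GiB at 128 MiB per file
files_at_1gib = target_file_count(10 * GIB, 1 * GIB)  # 10 GiB at 1 GiB per file

# In a Spark job the result would be passed to the writer, for example:
#   df.repartition(files_at_128mib).write.parquet(output_path)
# where df and output_path are placeholders for your own DataFrame and path.
```

`repartition` shuffles the data to balance file sizes; `coalesce` avoids the shuffle but can leave files uneven when the input partitions are skewed.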

Question 7

What is the most difficult task you've ever worked on?

Accepted Answer

Situation: Migrating a multi-petabyte legacy data warehouse to a cloud-native lakehouse with zero downtime for 500+ daily users. Task: Achieve data consistency, performance parity, and seamless user transition without blocking business. Action: I designed a dual-write and dual-read strategy with automated reconciliation (row counts, checksums, sample validation). I built a query translation layer to map legacy SQL to Spark SQL, enabling gradual cutover....

Question 8

Why are you leaving your current company?

Accepted Answer

I've grown significantly at [Current Company]—I led [concrete achievement, e.g., migration of X TB, reduction of pipeline cost by Y%]. I'm proud of what we built. However, I'm at a point where I want to deepen my impact: I'm looking for a role where I can own architecture for systems at a larger scale, work with [specific tech: e.g., real-time streaming at petabyte scale], and mentor other engineers. [Target Company]'s work in [specific area—cite a blog, product, or news] aligns with that....

Question 9

Why should we hire you for this role?

Accepted Answer

Three reasons: (1) Technical depth: I've built and operated data platforms processing [X TB/day or Y events/sec] with [specific stack—Spark, Kafka, Snowflake, etc.]. I've led migrations, optimized costs by [Z%], and reduced incident MTTR through observability. (2) Business impact: I connect technical decisions to outcomes—e.g., schema validation reduced production incidents by [%]; incremental processing cut compute cost by [%]....

Question 10

Explain the difference between Azure Data Factory (ADF) and Databricks.

Accepted Answer

ADF is an orchestration and data-movement service; Databricks is a compute platform for analytics and ML. Why it matters: ADF excels at scheduling, branching, retries, and connectors—it's the 'conductor.' Databricks excels at heavy transforms (Spark), Delta Lake, and ML—it's the 'orchestra.' Scalability: ADF scales by parallelism (activities, self-hosted IR nodes); Databricks scales via cluster sizing and auto-scaling....

Incedo Data Engineer Interview Questions
