Question 1

What architecture are you following in your current project, and why?

Accepted Answer

**Section 1 — The Context (The 'Why')**
The primary challenge in choosing and justifying an architecture is aligning technical complexity, operational cost, and business latency requirements. A naive monolithic pipeline fails when schemas evolve, when a single component becomes a bottleneck, or when teams grow, which adds coordination overhead and deployment risk....

Question 2

Write complex SQL queries involving multiple joins, subqueries, and data aggregation logic.

Accepted Answer

**Architectural Logic:** Complex SQL = joins + CTEs + window functions + aggregation. Structure for readability and optimizer-friendliness. **Example:** Revenue by segment and category with YoY growth, excluding returns....
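
The answer above names the building blocks but its example is cut off, so here is a minimal sketch of what such a query could look like, run against an in-memory SQLite database from Python. The schema (`customers`, `products`, `orders`, `returns`), the sample rows, and the function names are all assumptions made for illustration; they are not from the original answer. It assumes a SQLite build with window-function support (3.25 or later).

```python
import sqlite3

# CTE 1 filters out returned orders with a correlated subquery.
# CTE 2 joins to the dimension tables and aggregates revenue per year.
# CTE 3 uses the LAG window function to fetch the previous year's revenue.
QUERY = """
WITH valid_orders AS (
    SELECT o.customer_id, o.product_id, o.amount,
           CAST(strftime('%Y', o.order_date) AS INTEGER) AS yr
    FROM orders o
    WHERE NOT EXISTS (SELECT 1 FROM returns r WHERE r.order_id = o.order_id)
),
yearly AS (
    SELECT c.segment, p.category, v.yr, SUM(v.amount) AS revenue
    FROM valid_orders v
    JOIN customers c ON c.customer_id = v.customer_id
    JOIN products  p ON p.product_id  = v.product_id
    GROUP BY c.segment, p.category, v.yr
),
with_prev AS (
    SELECT segment, category, yr, revenue,
           LAG(revenue) OVER (PARTITION BY segment, category ORDER BY yr) AS prev_revenue
    FROM yearly
)
SELECT segment, category, yr, revenue,
       ROUND(100.0 * (revenue - prev_revenue) / prev_revenue, 1) AS yoy_pct
FROM with_prev
ORDER BY segment, category, yr
"""


def build_demo_db():
    """Create a tiny hypothetical schema with a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, segment TEXT);
        CREATE TABLE products  (product_id  INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                             product_id INTEGER, order_date TEXT, amount REAL);
        CREATE TABLE returns (order_id INTEGER);
        INSERT INTO customers VALUES (1, 'Enterprise'), (2, 'SMB');
        INSERT INTO products  VALUES (10, 'Hardware'), (11, 'Software');
        INSERT INTO orders VALUES
            (100, 1, 10, '2022-03-01', 100.0),
            (101, 1, 10, '2023-04-01', 150.0),
            (102, 1, 10, '2023-05-01',  50.0),
            (103, 2, 11, '2022-06-01', 200.0),
            (104, 2, 11, '2023-07-01', 100.0);
        INSERT INTO returns VALUES (102);
    """)
    return conn


def revenue_yoy(conn):
    """Return (segment, category, year, revenue, yoy_pct) rows."""
    return conn.execute(QUERY).fetchall()


rows = revenue_yoy(build_demo_db())
for row in rows:
    print(row)
```

Splitting the logic into named CTEs keeps each step readable on its own, and the first year of every segment/category pair has no previous row, so its growth comes back as NULL (`None` in Python).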

Question 3

Concatenating lists within a range using list comprehensions

Accepted Answer

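**Flatten:** `[item for sublist in lists for item in sublist]` or `list(chain.from_iterable(lists))` (from `itertools`); `chain(*lists)` also works but unpacks every sublist as a separate argument first. **Range:** `[x for start, end in ranges for x in range(start, end + 1)]`, where the `+ 1` makes each `end` inclusive. **Nested:** The leftmost `for` is the outer loop. **Why:** Comprehensions for readability; `chain` when there are many lists....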
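
The one-liners above can be tried out as small functions. This is a minimal sketch; the function names and sample data are chosen here for illustration.

```python
from itertools import chain


def flatten_once(lists):
    """Concatenate sublists with a nested comprehension (outer loop first)."""
    return [item for sublist in lists for item in sublist]


def flatten_with_chain(lists):
    """Same result using itertools; handy when there are many sublists."""
    return list(chain.from_iterable(lists))


def expand_ranges(ranges):
    """Expand (start, end) pairs into one list, treating end as inclusive."""
    return [x for start, end in ranges for x in range(start, end + 1)]


lists = [[1, 2], [3], [4, 5]]
print(flatten_once(lists))
print(flatten_with_chain(lists))
print(expand_ranges([(1, 3), (7, 8)]))
```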

Question 4

Count occurrences of elements in a list of tuples using Spark RDDs

Accepted Answer

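**RDD:** `rdd.flatMap(lambda t: t).map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)` counts individual elements. Use `rdd.countByValue()` to count whole tuples; it returns a dict to the driver, so it suits a small number of distinct values. **Why:** `reduceByKey` combines counts within each partition before the shuffle, so the count stays distributed. **Scalability:** Data is partitioned by key, so watch for skew from very frequent keys....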
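
Running the RDD version needs a Spark installation, so the sketch below models the same steps in plain Python to show what each stage computes. It illustrates the semantics only and is not Spark code; the function names and sample data are assumptions.

```python
from collections import Counter
from itertools import chain


def count_elements(tuples):
    """Model of flatMap -> map -> reduceByKey: counts individual elements."""
    flat = chain.from_iterable(tuples)       # flatMap(lambda t: t)
    pairs = ((x, 1) for x in flat)           # map(lambda x: (x, 1))
    counts = {}
    for key, value in pairs:                 # reduceByKey(lambda a, b: a + b)
        counts[key] = counts[key] + value if key in counts else value
    return counts


def count_tuples(tuples):
    """Model of countByValue(): counts each whole tuple as one value."""
    return dict(Counter(tuples))


data = [("a", "b"), ("b", "c"), ("a", "b")]
print(count_elements(data))
print(count_tuples(data))
```

The two functions answer different questions: the first counts how often each element appears across all tuples, the second how often each complete tuple appears.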

Question 5

Flatten nested lists recursively using Python

Accepted Answer

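**Recursive:** `def flatten(lst): return [y for x in lst for y in (flatten(x) if isinstance(x, list) else [x])]`. Handles arbitrary nesting depth, up to Python's recursion limit. **Iterative:** Use an explicit stack instead of recursion. **Why:** Nested structures of unknown depth. **Production:** Prefer the stack-based version for deeply nested input to avoid a `RecursionError`.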
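
The stack-based variant is only named above, so here is one possible sketch of it. The function name `flatten_iterative` is an assumption; like the recursive version, it treats only `list` instances as nested.

```python
def flatten_iterative(nested):
    """Flatten arbitrarily nested lists without recursion.

    The stack holds one iterator per nesting level, so depth is limited
    by memory rather than by Python's recursion limit.
    """
    flat = []
    stack = [iter(nested)]
    while stack:
        try:
            item = next(stack[-1])
        except StopIteration:
            stack.pop()                  # this level is exhausted
            continue
        if isinstance(item, list):
            stack.append(iter(item))     # descend into the sublist
        else:
            flat.append(item)
    return flat


print(flatten_iterative([1, [2, [3, [4]], 5], [], 6]))
```

Keeping iterators on the stack preserves the original left-to-right order without reversing sublists.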

Question 6

Why I chose specific technologies (e.g., Spark over traditional ETL tools)

Accepted Answer

**Situation**: Legacy ETL tools (Informatica, SSIS) couldn't scale to terabyte-sized data, and per-row licensing was expensive.

**Task**: Build scalable, cost-effective data platform.

**Action**: Chose Spark for (1) **Scale**—handles TB. (2) **Cost**—open source. (3) **Flexibility**—code-based, custom logic. (4) **Unified**—batch + streaming. (5) **Ecosystem**—Delta, MLlib, connectors.

**Result**: Handled 10x the data volume at 60% lower cost. ETL tools remained in use for simple, operational pipelines.

**Trade-offs**: Spark = more engineering....

Question 7

How we manage dependencies and retries in data pipelines

Accepted Answer

**Section 1 — The Context (The 'Why')**
The primary challenge in managing dependencies and retries in data pipelines is designing for production scale, correctness guarantees, and operational resilience. A naive or underspecified design fails under load: single points of failure cascade, non-idempotent operations cause duplicates on retry, and lack of observability blocks root-cause analysis....
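
The failure modes above can be made concrete with a small sketch: tasks run in dependency order, each task is retried with exponential backoff, and the write step is an upsert keyed by run date so a retry overwrites rather than duplicates. This is an illustration under stated assumptions, not the original answer's design; `run_with_retries`, `run_pipeline`, the task names, and the in-memory `sink` are all hypothetical, and `graphlib` requires Python 3.9 or later.

```python
import time
from graphlib import TopologicalSorter


def run_with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call task(); on failure wait base_delay * 2**(attempt-1) and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise                      # retries exhausted: surface the error
            sleep(base_delay * 2 ** (attempt - 1))


def run_pipeline(tasks, deps, **retry_kwargs):
    """tasks: name -> zero-arg callable; deps: name -> set of upstream names."""
    order = list(TopologicalSorter(deps).static_order())  # upstream tasks first
    results = {}
    for name in order:
        results[name] = run_with_retries(tasks[name], **retry_kwargs)
    return order, results


# Hypothetical demo: the load step fails once AFTER writing, then succeeds.
sink = {}                 # stands in for a target table
calls = {"load": 0}


def extract():
    return [("a", 1), ("b", 2)]


def load():
    calls["load"] += 1
    for key, value in extract():
        sink[("2024-01-01", key)] = value  # upsert keyed by (run date, key)
    if calls["load"] == 1:
        raise RuntimeError("transient failure after write")
    return len(sink)


order, results = run_pipeline(
    {"extract": extract, "load": load},
    {"extract": set(), "load": {"extract"}},
    base_delay=0.0,       # no real waiting in the demo
)
print(order, results["load"], calls["load"])
```

Because the write is keyed, the second attempt of `load` leaves two rows in `sink` instead of four; an append-only write would have duplicated the data on retry.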

Tiger Analytics Data Engineer Interview Questions
