7 Data Pipeline Design Patterns Every Senior Data Engineer Must Know (2026)
Interviewers don't ask 'build a pipeline.' They ask 'how would you handle late data, schema changes, and exactly-once processing?' Master the 7 patterns that answer these questions.
Key Takeaways
- Why 'I'd Use Airflow + Spark' Isn't a Good Enough Answer
- Pattern 1: Idempotent Writes (The Foundation)
- Pattern 2: Watermark-Based Late Data Handling
- Pattern 3: Schema Evolution Without Breaking Downstream
- Pattern 4: Dead-Letter Queue for Poison Messages
- Pattern 5: Backfill-Safe Pipeline Design
- Pattern 6: Circuit Breaker for External Dependencies
- Pattern 7: Event-Driven vs Schedule-Driven Orchestration
Why 'I'd Use Airflow + Spark' Isn't a Good Enough Answer
When an interviewer asks you to design a data pipeline, they're not asking which tools to use. They're testing whether you understand the fundamental patterns that make pipelines reliable, scalable, and maintainable.
The difference between a mid-level and senior answer isn't Airflow vs Prefect or Spark vs Flink. It's whether you can articulate:
- How do you handle late-arriving data?
- What happens when the pipeline fails halfway?
- How do you guarantee exactly-once processing?
- How do you evolve the schema without breaking downstream?
These 7 patterns form the building blocks that answer every pipeline design question.
Pattern 1: Idempotent Writes (The Foundation)
The problem: Pipelines fail and get retried. If your write isn't idempotent, retries create duplicates.
The pattern: Every write operation must produce the same result whether executed once or many times.
Implementation:
```sql
-- Bad: INSERT INTO (creates duplicates on retry)
INSERT INTO target SELECT * FROM source WHERE date = '2026-01-01';

-- Good: INSERT OVERWRITE (idempotent)
INSERT OVERWRITE TABLE target PARTITION (date = '2026-01-01')
SELECT * FROM source WHERE date = '2026-01-01';

-- Better: MERGE (handles updates + inserts idempotently)
MERGE INTO target USING source ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
Interview signal: Mention idempotency proactively in ANY pipeline design question. It shows you've dealt with production failures.
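The same idempotent overwrite in PySpark, as a minimal sketch assuming a date-partitioned Delta table (the DataFrame and target path are placeholders):
```python
# Minimal sketch: idempotent partition overwrite with Delta Lake's replaceWhere.
# source_df and the target path are placeholders for your own pipeline.
def write_partition(source_df, run_date: str):
    (source_df
        .filter(f"date = '{run_date}'")
        .write
        .format("delta")
        .mode("overwrite")
        # rewrite only this date partition, so retries produce the same result
        .option("replaceWhere", f"date = '{run_date}'")
        .save("s3://my-bucket/target"))  # hypothetical location
```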
Pattern 2: Watermark-Based Late Data Handling
The problem: Events arrive late. An event from 11:55 PM might arrive at 12:05 AM, after the hourly batch already ran.
The pattern: Use watermarks to define how late you'll accept data, then reprocess affected windows.
For batch pipelines:
```python
# Process the current hour + reprocess the previous 2 hours
for hour in [current_hour, current_hour - 1, current_hour - 2]:
    process_partition(hour)  # OVERWRITE ensures idempotency
```
For streaming (Spark Structured Streaming):
```python
df.withWatermark('event_time', '2 hours') \
  .groupBy(window('event_time', '1 hour')) \
  .agg(count('*'))
```
Trade-off: Longer watermarks catch more late data but increase latency and resource usage. Production rule: set the watermark to 2-3x your observed p99 late-arrival time.
Advanced: For pipelines where late data can arrive days late (IoT, mobile), use a reconciliation job that runs daily and reprocesses the last 7 days of data.
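A minimal sketch of such a reconciliation job, reusing the idempotent process_partition from Pattern 1 (the helper itself is assumed, not shown):
```python
from datetime import date, timedelta

# Daily reconciliation: reprocess the trailing 7 days of partitions.
# Safe to rerun because process_partition overwrites each partition.
def reconcile(days_back: int = 7):
    today = date.today()
    for offset in range(days_back, 0, -1):
        day = today - timedelta(days=offset)
        process_partition(day.isoformat())  # e.g. '2026-01-01'
```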
Pattern 3: Schema Evolution Without Breaking Downstream
The problem: Upstream adds a column. Your pipeline crashes because the schema changed.
The pattern: Decouple schema changes from pipeline code using schema registries and evolution rules.
Three strategies:
- Additive-only (safest): New columns are added with defaults. Existing columns never change type or get removed. All consumers continue working.
- Schema Registry enforcement: Producers register schemas. The registry rejects breaking changes before they reach Kafka/S3.
  - Backward compatible: new consumers can read old data ✓
  - Forward compatible: old consumers can read new data ✓
- Bronze layer absorption: The Bronze layer accepts ANY schema. The Bronze-to-Silver transformation maps source fields to a stable internal schema. Schema changes only require updating one mapping.
Production checklist:
- Never rename columns in-place (add new, deprecate old)
- Store raw data in Bronze BEFORE applying schema transforms
- Use Delta Lake's mergeSchema option for additive changes
- Alert on schema drift (compare the current schema against the registered schema; see the sketch after this checklist)
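A hedged PySpark sketch of those last two items (mergeSchema is a real Delta Lake option; the table name, new_df, and alert() are illustrative):
```python
# Drift check: compare incoming columns against the table's current schema.
expected = {f.name for f in spark.table("silver.events").schema.fields}
incoming = {f.name for f in new_df.schema.fields}

new_cols = incoming - expected
if new_cols:
    alert(f"Schema drift: new columns {sorted(new_cols)}")  # alert() is illustrative

# Additive evolution: mergeSchema lets new columns through on write.
(new_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("silver.events"))
```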
Pattern 4: Dead-Letter Queue for Poison Messages
The problem: One malformed record crashes your entire pipeline, blocking all subsequent records.
The pattern: Route failed records to a dead-letter queue (DLQ) instead of failing the pipeline.
```python
from datetime import datetime

def process_batch(records):
    good, bad = [], []
    for record in records:
        try:
            result = transform(record)
            good.append(result)
        except Exception as e:
            bad.append({**record, 'error': str(e), 'failed_at': datetime.now()})
    write_to_target(good)
    write_to_dlq(bad)  # Separate table/topic for investigation
```
DLQ management:
- Set up alerts when DLQ exceeds a threshold (e.g., >0.1% failure rate)
- Include the original record + error message + timestamp in the DLQ
- Build a reprocessing job that reads from the DLQ after fixes are deployed (a sketch follows this list)
- Set TTL on DLQ records (e.g., 30 days) to prevent unbounded growth
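A minimal reprocessing sketch, reusing the names from the code above (read_from_dlq is assumed to exist in your stack):
```python
# Replay DLQ records once a fix is deployed; anything still failing stays queued.
def reprocess_dlq():
    records = read_from_dlq()               # assumed reader over the DLQ table/topic
    good, still_bad = [], []
    for record in records:
        try:
            good.append(transform(record))  # same transform, now patched
        except Exception as e:
            still_bad.append({**record, 'error': str(e)})
    write_to_target(good)                   # idempotent write (Pattern 1)
    write_to_dlq(still_bad)
```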
Interview tip: Mention DLQ patterns whenever discussing error handling. It demonstrates production maturity.
Pattern 5: Backfill-Safe Pipeline Design
The problem: You need to reprocess 6 months of historical data after a bug fix, but the pipeline was designed for incremental daily runs.
The pattern: Design every pipeline to accept a date range parameter, not just 'process today.'
```python
from datetime import datetime

# Bad: hardcoded to the current date
def run_pipeline():
    date = datetime.today().strftime('%Y-%m-%d')
    process(date)

# Good: parameterized + idempotent
def run_pipeline(start_date: str, end_date: str):
    for date in date_range(start_date, end_date):  # yields each day in the range
        process_partition(date)  # OVERWRITE, not APPEND
```
Backfill checklist:
- Every pipeline accepts start_date and end_date parameters
- Writes are partitioned by date and use OVERWRITE
- Dependencies are also backfill-safe (no circular wait conditions)
- Resource limits: Backfills run with lower priority (Airflow pools, Spark resource queues)
- Downstream SLAs are met: Don't backfill production tables during business hours
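One way to get most of this from the orchestrator, as a hedged sketch assuming Airflow 2.x (the DAG name and process_partition are illustrative):
```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2026, 1, 1), catchup=True)
def daily_events_pipeline():  # hypothetical DAG
    @task
    def process(ds=None):
        # 'ds' is Airflow's logical date string, e.g. '2026-01-01'
        process_partition(ds)  # the idempotent OVERWRITE from above
    process()

daily_events_pipeline()
```
With catchup=True and per-partition OVERWRITE, a range rerun via airflow dags backfill -s 2025-07-01 -e 2025-12-31 is safe, and an Airflow pool can keep it from starving production runs.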
Pattern 6: Circuit Breaker for External Dependencies
The problem: Your pipeline calls an external API (geocoding, enrichment) that goes down. The pipeline retries endlessly, wasting resources and creating partial data.
The pattern: Implement a circuit breaker that stops calling a failing dependency after N failures (a minimal sketch follows the lists below).
Three states:
- Closed (normal): Calls pass through. Count failures.
- Open (tripped): After N failures in M seconds, stop calling. Return cached/default values.
- Half-open (recovery): After a cooldown period, try one call. If it succeeds, close the circuit.
In practice:
- Use cached values during open state (last known good enrichment)
- Write records that couldn't be enriched to a DLQ for later reprocessing
- Alert the team when the circuit opens
- Set different thresholds per dependency (critical API: 3 failures/60s, nice-to-have enrichment: 10 failures/300s)
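Putting the three states together, a minimal sketch (thresholds, cooldown, and fallback behavior are illustrative, not a production library):
```python
import time

class CircuitBreaker:
    def __init__(self, max_failures=3, window_s=60, cooldown_s=120):
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.failures = []        # timestamps of recent failures (closed state)
        self.opened_at = None     # set when the circuit trips

    def call(self, fn, fallback):
        now = time.time()
        half_open = False
        if self.opened_at is not None:
            if now - self.opened_at < self.cooldown_s:
                return fallback()             # open: don't touch the dependency
            half_open = True                  # cooldown elapsed: allow one trial call
        try:
            result = fn()
            self.failures.clear()             # success: close the circuit
            self.opened_at = None
            return result
        except Exception:
            if half_open:
                self.opened_at = now          # trial failed: re-open immediately
            else:
                self.failures = [t for t in self.failures if now - t < self.window_s]
                self.failures.append(now)
                if len(self.failures) >= self.max_failures:
                    self.opened_at = now      # tripped: open the circuit
            return fallback()
```
Usage would look like breaker.call(lambda: geocode(addr), fallback=lambda: cached_geocode(addr)), with records that fell back to defaults routed to the DLQ from Pattern 4 for later enrichment.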
Pattern 7: Event-Driven vs Schedule-Driven Orchestration
Schedule-driven (Airflow cron): Pipeline runs at a fixed time regardless of whether new data exists.
- Pros: Predictable, easy to monitor, simple SLA management
- Cons: Wastes resources when no new data; misses data that arrives between runs
Event-driven: Pipeline triggers when new data arrives (S3 event notification, Kafka message).
- Pros: Processes data immediately, no wasted runs
- Cons: Harder to monitor, thundering herd problem, no guaranteed ordering
Hybrid (production best practice):
```
S3 event → Lambda → Triggers Airflow DAG (if not already running)
                  → Rate-limited to max 1 trigger per 15 minutes
```
This gives you the responsiveness of event-driven with the monitoring and retry capabilities of scheduled orchestration.
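A hedged sketch of the Lambda side against Airflow's stable REST API (the DAG id, env vars, and basic-auth scheme are assumptions; the 15-minute rate limit could live in SQS batching or a timestamp check, omitted here):
```python
import os
import requests  # bundled into the Lambda deployment package

AIRFLOW = os.environ["AIRFLOW_URL"]   # e.g. https://airflow.internal/api/v1 (hypothetical)
AUTH = (os.environ["AIRFLOW_USER"], os.environ["AIRFLOW_PASSWORD"])
DAG_ID = "events_pipeline"            # hypothetical DAG

def handler(event, context):
    # Skip if a run is already in flight (the "if not already running" check)
    running = requests.get(
        f"{AIRFLOW}/dags/{DAG_ID}/dagRuns",
        params={"state": "running"}, auth=AUTH, timeout=10,
    ).json()
    if running.get("total_entries", 0) > 0:
        return {"triggered": False}

    # Trigger a new DAG run, passing the S3 object key as run config
    requests.post(
        f"{AIRFLOW}/dags/{DAG_ID}/dagRuns",
        json={"conf": {"s3_key": event["Records"][0]["s3"]["object"]["key"]}},
        auth=AUTH, timeout=10,
    )
    return {"triggered": True}
```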
Interview tip: Don't say 'I'd use Airflow with a daily cron.' Say 'I'd use event-driven triggering with Airflow as the orchestrator, with a daily catchup job for any missed events.' This shows you understand both approaches and their trade-offs.
Practice articulating these patterns with DataEngPrep's Answer Analyzer and get instant feedback on your pipeline design answers.
Written by the DataEngPrep Team
Our editorial team consists of experienced data engineers who have worked at top tech companies and gone through hundreds of real interviews. Every article is reviewed for technical accuracy and practical relevance to help you prepare effectively.