7 Data Pipeline Design Patterns Every Senior Data Engineer Must Know (2026)
Interviewers don't ask 'build a pipeline.' They ask 'how would you handle late data, schema changes, and exactly-once processing?' Master the 7 patterns that answer these questions.
Key Takeaways
- Why 'I'd Use Airflow + Spark' Isn't a Good Enough Answer
- Pattern 1: Idempotent Writes (The Foundation)
- Pattern 2: Watermark-Based Late Data Handling
- Pattern 3: Schema Evolution Without Breaking Downstream
- Pattern 4: Dead-Letter Queue for Poison Messages
- Pattern 5: Backfill-Safe Pipeline Design
- Pattern 6: Circuit Breaker for External Dependencies
- Pattern 7: Event-Driven vs Schedule-Driven Orchestration
Why 'I'd Use Airflow + Spark' Isn't a Good Enough Answer
When an interviewer asks you to design a data pipeline, they're not asking which tools to use. They're testing whether you understand the fundamental patterns that make pipelines reliable, scalable, and maintainable.
The difference between a mid-level and senior answer isn't Airflow vs Prefect or Spark vs Flink. It's whether you can articulate:
- How do you handle late-arriving data?
- What happens when the pipeline fails halfway?
- How do you guarantee exactly-once processing?
- How do you evolve the schema without breaking downstream?
These 7 patterns form the building blocks that answer every pipeline design question.
Pattern 1: Idempotent Writes (The Foundation)
The problem: Pipelines fail and get retried. If your write isn't idempotent, retries create duplicates.
The pattern: Every write operation must produce the same result whether executed once or many times.
Implementation:
```sql
-- Bad: INSERT INTO (creates duplicates on retry)
INSERT INTO target SELECT * FROM source WHERE date = '2026-01-01';

-- Good: INSERT OVERWRITE (idempotent)
INSERT OVERWRITE TABLE target PARTITION (date = '2026-01-01')
SELECT * FROM source WHERE date = '2026-01-01';

-- Better: MERGE (handles updates + inserts idempotently)
MERGE INTO target USING source ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
Interview signal: Mention idempotency proactively in ANY pipeline design question. It shows you've dealt with production failures.
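The same idempotent overwrite in PySpark, as a minimal sketch assuming a date-partitioned Delta table (the DataFrame and target path are placeholders):
```python
# Minimal sketch: idempotent partition overwrite with Delta Lake's replaceWhere.
# source_df and the target path are placeholders for your own pipeline.
def write_partition(source_df, run_date: str):
    (source_df
        .filter(f"date = '{run_date}'")
        .write
        .format("delta")
        .mode("overwrite")
        # rewrite only this date partition, so retries produce the same result
        .option("replaceWhere", f"date = '{run_date}'")
        .save("s3://my-bucket/target"))  # hypothetical location
```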
Pattern 2: Watermark-Based Late Data Handling
The problem: Events arrive late. An event from 11:55 PM might arrive at 12:05 AM, after the hourly batch already ran.
The pattern: Use watermarks to define how late you'll accept data, then reprocess affected windows.
For batch pipelines:
```python
# Process the current hour + reprocess the previous 2 hours
for hour in [current_hour, current_hour - 1, current_hour - 2]:
    process_partition(hour)  # OVERWRITE ensures idempotency
```
For streaming (Spark Structured Streaming):
```python
df.withWatermark('event_time', '2 hours') \
  .groupBy(window('event_time', '1 hour')) \
  .agg(count('*'))
```
Trade-off: Longer watermarks catch more late data but increase latency and resource usage. Production rule: set the watermark to 2-3x your observed p99 late-arrival time.
Advanced: For pipelines where late data can arrive days late (IoT, mobile), use a reconciliation job that runs daily and reprocesses the last 7 days of data.
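A minimal sketch of such a reconciliation job, reusing the idempotent process_partition from Pattern 1 (the helper itself is assumed, not shown):
```python
from datetime import date, timedelta

# Daily reconciliation: reprocess the trailing 7 days of partitions.
# Safe to rerun because process_partition overwrites each partition.
def reconcile(days_back: int = 7):
    today = date.today()
    for offset in range(days_back, 0, -1):
        day = today - timedelta(days=offset)
        process_partition(day.isoformat())  # e.g. '2026-01-01'
```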
Pattern 3: Schema Evolution Without Breaking Downstream
The problem: Upstream adds a column. Your pipeline crashes because the schema changed.
The pattern: Decouple schema changes from pipeline code using schema registries and evolution rules.
Three strategies:
- Additive-only (safest): New columns are added with defaults. Existing columns never change type or get removed. All consumers continue working.
- Schema Registry enforcement: Producers register schemas. The registry rejects breaking changes before they reach Kafka/S3.
  - Backward compatible: new consumers can read old data ✓
  - Forward compatible: old consumers can read new data ✓
- Bronze layer absorption: The Bronze layer accepts ANY schema. The Bronze-to-Silver transformation maps source fields to a stable internal schema. Schema changes only require updating one mapping.
Production checklist:
- Never rename columns in-place (add new, deprecate old)
- Store raw data in Bronze BEFORE applying schema transforms
- Use Delta Lake's mergeSchema option for additive changes
- Alert on schema drift (compare the current schema against the registered schema; see the sketch after this checklist)
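A hedged PySpark sketch of those last two items (mergeSchema is a real Delta Lake option; the table name, new_df, and alert() are illustrative):
```python
# Drift check: compare incoming columns against the table's current schema.
expected = {f.name for f in spark.table("silver.events").schema.fields}
incoming = {f.name for f in new_df.schema.fields}

new_cols = incoming - expected
if new_cols:
    alert(f"Schema drift: new columns {sorted(new_cols)}")  # alert() is illustrative

# Additive evolution: mergeSchema lets new columns through on write.
(new_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("silver.events"))
```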
Pattern 4: Dead-Letter Queue for Poison Messages
The problem: One malformed record crashes your entire pipeline, blocking all subsequent records.
The pattern: Route failed records to a dead-letter queue (DLQ) instead of failing the pipeline.
```python
from datetime import datetime

def process_batch(records):
    good, bad = [], []
    for record in records:
        try:
            result = transform(record)
            good.append(result)
        except Exception as e:
            bad.append({**record, 'error': str(e), 'failed_at': datetime.now()})
    write_to_target(good)
    write_to_dlq(bad)  # Separate table/topic for investigation
```
DLQ management:
- Set up alerts when DLQ exceeds a threshold (e.g., >0.1% failure rate)
- Include the original record + error message + timestamp in the DLQ
- Build a reprocessing job that reads from the DLQ after fixes are deployed (a sketch follows this list)
- Set TTL on DLQ records (e.g., 30 days) to prevent unbounded growth
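A minimal reprocessing sketch, reusing the names from the code above (read_from_dlq is assumed to exist in your stack):
```python
# Replay DLQ records once a fix is deployed; anything still failing stays queued.
def reprocess_dlq():
    records = read_from_dlq()               # assumed reader over the DLQ table/topic
    good, still_bad = [], []
    for record in records:
        try:
            good.append(transform(record))  # same transform, now patched
        except Exception as e:
            still_bad.append({**record, 'error': str(e)})
    write_to_target(good)                   # idempotent write (Pattern 1)
    write_to_dlq(still_bad)
```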
Interview tip: Mention DLQ patterns whenever discussing error handling. It demonstrates production maturity.
Pattern 5: Backfill-Safe Pipeline Design
The problem: You need to reprocess 6 months of historical data after a bug fix, but the pipeline was designed for incremental daily runs.
The pattern: Design every pipeline to accept a date range parameter, not just 'process today.'
```python
from datetime import datetime

# Bad: hardcoded to the current date
def run_pipeline():
    date = datetime.today().strftime('%Y-%m-%d')
    process(date)

# Good: parameterized + idempotent
def run_pipeline(start_date: str, end_date: str):
    for date in date_range(start_date, end_date):  # yields each day in the range
        process_partition(date)  # OVERWRITE, not APPEND
```
Backfill checklist:
- Every pipeline accepts start_date and end_date parameters
- Writes are partitioned by date and use OVERWRITE
- Dependencies are also backfill-safe (no circular wait conditions)
- Resource limits: Backfills run with lower priority (Airflow pools, Spark resource queues)
- Downstream SLAs are met: Don't backfill production tables during business hours
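One way to get most of this from the orchestrator, as a hedged sketch assuming Airflow 2.x (the DAG name and process_partition are illustrative):
```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2026, 1, 1), catchup=True)
def daily_events_pipeline():  # hypothetical DAG
    @task
    def process(ds=None):
        # 'ds' is Airflow's logical date string, e.g. '2026-01-01'
        process_partition(ds)  # the idempotent OVERWRITE from above
    process()

daily_events_pipeline()
```
With catchup=True and per-partition OVERWRITE, a range rerun via airflow dags backfill -s 2025-07-01 -e 2025-12-31 is safe, and an Airflow pool can keep it from starving production runs.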
Pattern 6: Circuit Breaker for External Dependencies
The problem: Your pipeline calls an external API (geocoding, enrichment) that goes down. The pipeline retries endlessly, wasting resources and creating partial data.
The pattern: Implement a circuit breaker that stops calling a failing dependency after N failures (a minimal sketch follows the lists below).
Three states:
- Closed (normal): Calls pass through. Count failures.
- Open (tripped): After N failures in M seconds, stop calling. Return cached/default values.
- Half-open (recovery): After a cooldown period, try one call. If it succeeds, close the circuit.
In practice:
- Use cached values during open state (last known good enrichment)
- Write records that couldn't be enriched to a DLQ for later reprocessing
- Alert the team when the circuit opens
- Set different thresholds per dependency (critical API: 3 failures/60s, nice-to-have enrichment: 10 failures/300s)
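Putting the three states together, a minimal sketch (thresholds, cooldown, and fallback behavior are illustrative, not a production library):
```python
import time

class CircuitBreaker:
    def __init__(self, max_failures=3, window_s=60, cooldown_s=120):
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.failures = []        # timestamps of recent failures (closed state)
        self.opened_at = None     # set when the circuit trips

    def call(self, fn, fallback):
        now = time.time()
        half_open = False
        if self.opened_at is not None:
            if now - self.opened_at < self.cooldown_s:
                return fallback()             # open: don't touch the dependency
            half_open = True                  # cooldown elapsed: allow one trial call
        try:
            result = fn()
            self.failures.clear()             # success: close the circuit
            self.opened_at = None
            return result
        except Exception:
            if half_open:
                self.opened_at = now          # trial failed: re-open immediately
            else:
                self.failures = [t for t in self.failures if now - t < self.window_s]
                self.failures.append(now)
                if len(self.failures) >= self.max_failures:
                    self.opened_at = now      # tripped: open the circuit
            return fallback()
```
Usage would look like breaker.call(lambda: geocode(addr), fallback=lambda: cached_geocode(addr)), with records that fell back to defaults routed to the DLQ from Pattern 4 for later enrichment.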
Pattern 7: Event-Driven vs Schedule-Driven Orchestration
Schedule-driven (Airflow cron): Pipeline runs at a fixed time regardless of whether new data exists.
- Pros: Predictable, easy to monitor, simple SLA management
- Cons: Wastes resources when no new data; misses data that arrives between runs
Event-driven: Pipeline triggers when new data arrives (S3 event notification, Kafka message).
- Pros: Processes data immediately, no wasted runs
- Cons: Harder to monitor, thundering herd problem, no guaranteed ordering
Hybrid (production best practice):
```
S3 event → Lambda → Triggers Airflow DAG (if not already running)
                  → Rate-limited to max 1 trigger per 15 minutes
```
This gives you the responsiveness of event-driven with the monitoring and retry capabilities of scheduled orchestration.
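A hedged sketch of the Lambda side against Airflow's stable REST API (the DAG id, env vars, and basic-auth scheme are assumptions; the 15-minute rate limit could live in SQS batching or a timestamp check, omitted here):
```python
import os
import requests  # bundled into the Lambda deployment package

AIRFLOW = os.environ["AIRFLOW_URL"]   # e.g. https://airflow.internal/api/v1 (hypothetical)
AUTH = (os.environ["AIRFLOW_USER"], os.environ["AIRFLOW_PASSWORD"])
DAG_ID = "events_pipeline"            # hypothetical DAG

def handler(event, context):
    # Skip if a run is already in flight (the "if not already running" check)
    running = requests.get(
        f"{AIRFLOW}/dags/{DAG_ID}/dagRuns",
        params={"state": "running"}, auth=AUTH, timeout=10,
    ).json()
    if running.get("total_entries", 0) > 0:
        return {"triggered": False}

    # Trigger a new DAG run, passing the S3 object key as run config
    requests.post(
        f"{AIRFLOW}/dags/{DAG_ID}/dagRuns",
        json={"conf": {"s3_key": event["Records"][0]["s3"]["object"]["key"]}},
        auth=AUTH, timeout=10,
    )
    return {"triggered": True}
```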
Interview tip: Don't say 'I'd use Airflow with a daily cron.' Say 'I'd use event-driven triggering with Airflow as the orchestrator, with a daily catchup job for any missed events.' This shows you understand both approaches and their trade-offs.
Practice articulating these patterns with DataEngPrep's Answer Analyzer and get instant feedback on your pipeline design answers.
Written by the DataEngPrep Team
Our editorial team consists of experienced data engineers who have worked at top tech companies and gone through hundreds of real interviews. Every article is reviewed for technical accuracy and practical relevance to help you prepare effectively.