How do you handle pipeline failures or delays?

Question

DataEngPrep.tech · Accepted Answer

**Section 1 — The Context (The 'Why')**
Pipeline failures and delays have cascading business impact: stale dashboards, missed SLAs, and downstream jobs blocked on unavailable data. A naive design (no retries, opaque dependencies, manual runbooks) forces firefighting instead of self-healing. At Moonfare scale, fund data pipelines have strict regulatory windows, so a delay can become a compliance breach. Failure modes include task-level OOMs, upstream source delays, and orchestrator outages that block new runs even though in-flight jobs may still complete. The key challenge is to distinguish transient failures (retry) from permanent ones (alert, then fix).
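The transient-versus-permanent split can be sketched in plain Python. This is a minimal illustration, not any orchestrator's API: `TransientError`, `PermanentError`, and `run_with_retries` are hypothetical names, and the 3 attempts and 1-second base delay are assumed defaults.

```python
import random
import time


class TransientError(Exception):
    """Failure that may succeed on retry (timeout, throttling, brief outage)."""


class PermanentError(Exception):
    """Failure that will not fix itself (bad schema, missing credentials)."""


def run_with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run task(); retry transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except PermanentError:
            # Retrying cannot help: surface immediately so an alert can fire.
            raise
        except TransientError:
            if attempt == max_attempts:
                # Retries exhausted: treat as a final failure.
                raise
            # Delay doubles each attempt: base, 2x base, 4x base, ...
            delay = base_delay * 2 ** (attempt - 1)
            # Up to 10% jitter keeps many retrying tasks from waking in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter is injectable only so the backoff can be exercised in tests without waiting. How a real task decides which exception to raise depends on the source system and is left open here.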

**Section 2 — The Diagram**
```
[DAG]---->[Task A]---->[Task B]---->[Task C]
  |           |            |            |
  v           v            v            v
[Retry]   [Sensor]   [Checkpoint]   [Alert]
  |           |            |            |
  +-----------+------------+----->[Runbook|Slack]
```

**Section 3 — Component Logic**
The **DAG** defines task dependencies; **Task A/B/C** execute in order. A **Retry** policy (e.g., 3 attempts with exponential backoff) handles transient failures; idempotent tasks make those retries safe. **Sensors** wait for upstream data (file arrival, DB watermark) before proceeding, which prevents wasted runs on missing inputs. **Checkpointing** before commit supports effectively-once results: a task commits only after it succeeds, and on failure it replays from the last checkpoint, which is safe only if the replayed writes are idempotent. **Alerts** fire on final failure or SLA breach and link to **runbooks** for automated or guided recovery. Dependency-aware scheduling allows optional downstream tasks to be skipped when the critical path fails. If the orchestrator goes down, in-flight runs can still complete, but no new runs are triggered until it recovers. Use task pools to limit concurrency and avoid resource exhaustion, and set execution timeouts to fail fast on stuck tasks. Implement SLA monitoring with alert-before-breach to enable proactive intervention.
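The checkpoint-and-replay behavior described above can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: the checkpoint is a small JSON file, the sink is a dict keyed by batch index (standing in for an idempotent keyed write), and all function names are hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next batch to process (0 if no checkpoint yet)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_batch"]


def save_checkpoint(path, next_batch):
    """Write the checkpoint atomically: temp file first, then rename over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_batch": next_batch}, f)
    os.replace(tmp, path)


def process_batches(batches, sink, checkpoint_path, handler):
    """Process batches in order, resuming from the last checkpointed position."""
    start = load_checkpoint(checkpoint_path)
    for i in range(start, len(batches)):
        # Keyed write: replaying batch i overwrites the same key, so it is idempotent.
        sink[i] = handler(batches[i])
        # Advance the checkpoint only after the write succeeded.
        save_checkpoint(checkpoint_path, i + 1)
```

If the process dies between the write and the checkpoint update, the batch is replayed on the next run; the keyed overwrite is what makes that replay harmless.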

**Section 4 — The Trade-offs (The 'Senior' part)**

- **CAP Theorem**: Favor **consistency** during recovery: checkpoint before commit, and have retries preserve task order. Favor **availability** through dependency-aware scheduling: skip optional downstream tasks to unblock the critical path.

- **Cost vs. Performance**: Airflow on Cloud Composer: ~$300+/mo baseline. PagerDuty: ~$20/user. To cut idle pipeline cost, right-size workers and use spot instances for batch work. For SLA breaches, quantify the business impact so fixes can be prioritized.

- **Blast Radius**: If Task B fails, Task C and everything downstream of it is blocked; auto-retry 3x, then alert. If the orchestrator is down, in-flight runs complete but no new DAG runs are triggered. If a source is delayed, sensors wait and an SLA alert fires once the data is overdue. Use task-level timeouts to avoid hung runs, and set pool limits to prevent resource exhaustion.
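Pool limits and task-level timeouts can be illustrated with the standard library. This is a rough sketch, not how an orchestrator enforces them: `run_pool` is a hypothetical helper, the timeout here bounds how long the caller waits for each result rather than the task's own runtime, and Python threads cannot be force-killed, so a real system would terminate the worker process instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def run_pool(tasks, pool_size=2, timeout_s=1.0):
    """Run named callables with bounded concurrency and a per-task wait limit.

    Returns {name: ("ok", result) | ("timeout", None) | ("failed", repr(exc))}.
    """
    results = {}
    # max_workers is the pool limit: at most pool_size tasks run at once.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = {name: pool.submit(fn) for name, fn in tasks.items()}
        for name, future in futures.items():
            try:
                results[name] = ("ok", future.result(timeout=timeout_s))
            except FutureTimeout:
                # Stop waiting and report; the thread itself keeps running.
                results[name] = ("timeout", None)
            except Exception as exc:
                results[name] = ("failed", repr(exc))
    return results
```

One failing or hung task is reported on its own line in the result map, which keeps the blast radius visible per task instead of failing the whole batch.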

**Design principles**: Make every task idempotent; design for rerun from any step. Use deterministic scheduling (e.g., pass the run date as a parameter) so reruns produce the same outputs. Link each alert to a runbook; automate the most common recovery steps. Test failure scenarios in staging. Define clear ownership for each pipeline; establish escalation paths for SLA breaches.
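A minimal sketch of an idempotent, run-date-parameterized task, assuming a file-based output with one partition directory per date. The `dt=YYYY-MM-DD` layout, the `data.json` file name, and the row shape are illustrative assumptions.

```python
import json
import os
import tempfile
from datetime import date


def run_partition(run_date, source_rows, output_dir):
    """Rebuild the output partition for run_date from scratch.

    The partition path depends only on run_date, and the file is replaced
    atomically, so a rerun overwrites the previous output instead of appending.
    """
    rows = [r for r in source_rows if r["date"] == run_date.isoformat()]
    part_dir = os.path.join(output_dir, f"dt={run_date.isoformat()}")
    os.makedirs(part_dir, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=part_dir)
    with os.fdopen(fd, "w") as f:
        # Sort so the same input always serializes to the same bytes.
        json.dump(sorted(rows, key=lambda r: r["id"]), f)
    final = os.path.join(part_dir, "data.json")
    os.replace(tmp, final)
    return final
```

Because the task never reads the wall clock, rerunning it for a past date is the same operation as the original run, which is what makes backfills and retries safe.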

**Section 5 — Pro-Tip**
- **Pro-Move**: Runbooks linked to alerts; automate recovery where possible; version config.
- **Red Flag**: No dependency graph—cascading failures with no visibility.
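Linking alerts to runbooks, combined with alert-before-breach, might look like the sketch below. Everything named here is hypothetical: the `wiki.example.com` URLs are placeholders, and `sla_status` assumes an estimate of the remaining runtime is available for a run that has not yet finished.

```python
from datetime import datetime, timedelta

# Placeholder runbook links; a real mapping would live in versioned config.
RUNBOOKS = {
    "sla_at_risk": "https://wiki.example.com/runbooks/sla-at-risk",
    "sla_breached": "https://wiki.example.com/runbooks/sla-breached",
}
DEFAULT_RUNBOOK = "https://wiki.example.com/runbooks/default"


def sla_status(now, deadline, expected_remaining):
    """Classify an unfinished run against its SLA deadline."""
    if now > deadline:
        return "sla_breached"
    if now + expected_remaining > deadline:
        # Projected finish is past the deadline: warn before the breach happens.
        return "sla_at_risk"
    return "ok"


def build_alert(pipeline, status):
    """Return an alert payload with its runbook link, or None if nothing to report."""
    if status == "ok":
        return None
    return {
        "pipeline": pipeline,
        "status": status,
        "runbook": RUNBOOKS.get(status, DEFAULT_RUNBOOK),
    }
```

The payload would then go to whatever channel the team uses (Slack or a pager); that delivery step is left out because it depends on the tool.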
