How do you handle exceptions in data ingestion?

Question

Aditya Kumar · Accepted Answer

**Section 1 — The Context (The 'Why')**
Exception handling in data ingestion determines whether the system fails safely or corrupts data silently. Unbounded retries can effectively DDoS a source and trigger rate limits or IP bans; silent drops lose data permanently; and missing circuit breakers allow cascading failures. At Gartner-scale advisory, heterogeneous sources (APIs, DBs, files) have very different failure modes, ranging from transient network blips to permanent schema breaks. A naive approach (retry forever, or drop on error) either overwhelms upstream or creates undetectable data loss. Key failure modes: connector OOM under backpressure, schema drift causing validation storms, and DLQ overflow when bad-data volume spikes.

**Section 2 — The Diagram**
```
[Source]---->[Connector]---->[Validate]
   |              |               |
   v              v               v
[Retry+Circuit]  [OK]        [Fail->DLQ]
   |              |               |
   +--------------+----------->[Alert]
```

**Section 3 — Component Logic**
The **connector** pulls data with **retry logic** (exponential backoff with capped attempts, e.g., 5 retries with the delay doubling each time) and a **circuit breaker** that opens after N consecutive failures, failing fast to protect the source. **Validation** (schema, format, business rules) separates good records from bad: good records flow to the sink, while bad records follow a **fan-out pattern** to the **DLQ** with full error context (record, error message, timestamp) and an error-type tag for triage. The DLQ enables replay once the parser is fixed; a periodic batch job drains and reprocesses it. **Idempotency** in the sink makes retries and replays safe, without duplicates. For backpressure, when the consumer lags the connector slows or blocks rather than buffering unboundedly. Alerts fire on DLQ growth, circuit open, or spikes in the validation failure rate. Retries are rate-limited per source to avoid hammering failing APIs.
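To make the validate, DLQ, and replay flow concrete, here is a minimal in-memory sketch. Everything in it is an assumption for illustration: the `id`/`amount` rules, the `DeadLetter` shape, and the `IdempotentSink`, `ingest`, and `replay` names are hypothetical, and a real pipeline would use a durable queue (for example SQS or an S3 quarantine prefix) instead of a Python list.

```python
import time
from dataclasses import dataclass


class ValidationError(Exception):
    """Raised when a record fails schema or business-rule checks."""


@dataclass
class DeadLetter:
    record: dict        # the original record, untouched
    error_type: str     # tag used for triage
    error_message: str
    timestamp: float


def validate(record: dict) -> dict:
    # Hypothetical rules: a record needs an "id" and a numeric "amount".
    if "id" not in record:
        raise ValidationError("missing field: id")
    if not isinstance(record.get("amount"), (int, float)):
        raise ValidationError("amount must be numeric")
    return record


class IdempotentSink:
    """Upserts by record id, so retries and replays do not create duplicates."""

    def __init__(self) -> None:
        self.rows: dict = {}

    def write(self, record: dict) -> None:
        self.rows[record["id"]] = record


def ingest(records, sink: IdempotentSink, dlq: list) -> None:
    """Write valid records to the sink; send invalid ones to the DLQ."""
    for record in records:
        try:
            sink.write(validate(record))
        except ValidationError as exc:
            # Never drop: keep the record together with its error context.
            dlq.append(DeadLetter(record, type(exc).__name__, str(exc), time.time()))


def replay(dlq: list, sink: IdempotentSink, fix) -> list:
    """Reprocess dead letters through `fix`; return the ones that still fail."""
    still_bad: list = []
    ingest((fix(dl.record) for dl in dlq), sink, still_bad)
    return still_bad


sink, dlq = IdempotentSink(), []
batch = [
    {"id": 1, "amount": 10},
    {"id": 1, "amount": 10},      # duplicate delivery, absorbed by the upsert
    {"id": 2, "amount": "7.5"},   # wrong type, goes to the DLQ
    {"amount": 3},                # missing id, goes to the DLQ
]
ingest(batch, sink, dlq)
```

After a parser fix, something like `replay(dlq, sink, lambda r: {**r, "amount": float(r["amount"])})` would recover the record with the string amount, while the record with no `id` stays quarantined for manual triage.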

**Section 4 — The Trade-offs (The 'Senior' part)**

- **CAP Theorem**: We choose **AP**. The circuit breaker fails fast to protect upstream; the DLQ prevents data loss; quarantined records are processed eventually. We favor availability of the ingestion path over blocking until every record is perfect.

- **Cost vs. Performance**: SQS DLQ ~$0.40 per 1M requests. S3 quarantine ~$0.023/GB-month. Retry storms cause Lambda/Glue cost spikes, so cap retries. Dead-letter processing: batch job on EMR Spot.

- **Blast Radius**: Connector OOM: circuit opens; backpressure to source. DLQ full: alert; manual intervention. Schema change: validation failures spike, so roll back or fix consumers, and coordinate with source owners. Isolate connector processes to limit resource contention.

**Design principles**: Cap retries at 5–10; use exponential backoff (1s, 2s, 4s, 8s), ideally with jitter so that clients do not retry in lockstep, to avoid thundering herd. Open the circuit after 5 consecutive failures; go half-open after 30s to test recovery. Always include record ID, timestamp, and error message in the DLQ payload for debugging. Distinguish transient errors (retry) from permanent errors (DLQ, manual fix); use error classification to route appropriately.
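The principles above can be sketched as a capped retry loop wrapped around a small circuit breaker. This is an illustration under stated assumptions, not a production implementation: `TransientError`, `PermanentError`, `CircuitBreaker`, and `fetch_with_retry` are hypothetical names, the 10% jitter is an arbitrary choice, and `clock` and `sleep` are injectable only so the behaviour can be exercised without real waiting.

```python
import random
import time


class TransientError(Exception):
    """Worth retrying: timeouts, throttling, brief network failures."""


class PermanentError(Exception):
    """Not worth retrying: schema breaks, bad credentials. Route to the DLQ."""


class CircuitOpenError(Exception):
    """Raised instead of calling the source while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: once reset_timeout has elapsed, let a trial call through.
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed trial


def fetch_with_retry(fetch, breaker, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); retry transient errors with backoff, fail fast if open."""
    for attempt in range(max_retries + 1):
        if not breaker.allow():
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = fetch()
        except TransientError:
            breaker.record_failure()
            if attempt == max_retries:
                raise  # retries exhausted; the caller decides what happens next
            delay = base_delay * 2 ** attempt  # 1s, 2s, 4s, 8s, ...
            sleep(delay + random.uniform(0, delay * 0.1))  # small jitter
        else:
            breaker.record_success()
            return result
    # PermanentError is deliberately not caught: it propagates immediately.


calls = {"n": 0}


def flaky_fetch():
    # Hypothetical source that fails twice and then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("timeout")
    return ["row"]


delays = []  # passing delays.append as `sleep` records the waits without sleeping
result = fetch_with_retry(flaky_fetch, CircuitBreaker(), sleep=delays.append)
```

In this sketch only transient failures count toward the breaker. Whether permanent errors should also trip it depends on the source, so treat that as a design choice rather than a rule.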

**Section 5 — Pro-Tip**
- **Pro-Move**: Never drop—persist to DLQ with error context. Implement backoff and circuit breakers.
- **Red Flag**: Silent drops or unbounded retries—both are production anti-patterns.
