**Section 1 — The Context (The 'Why')** Reliable and correct data is the hardest guarantee in distributed systems: silent corruption, duplicate records from retries, and schema drift can invalidate entire analytics foundations. A naive approach—trusting sources, writing without...
This hard-level System Design/Architecture question appears frequently in data engineering interviews at companies like Netflix. While less common, it tests deeper understanding that distinguishes strong candidates. Mastering the underlying concepts (partition, snowflake, spark) will help you answer variations of this question confidently.
This is a senior-level question that tests architectural thinking. Lead with the high-level design, then drill into specifics. Discuss trade-offs explicitly - there is rarely one correct answer. Show awareness of scale, fault tolerance, and operational complexity. The expert answer includes a code example that demonstrates the implementation pattern.
Section 1 — The Context (The 'Why')
Reliable and correct data is the hardest guarantee in distributed systems: silent corruption, duplicate records from retries, and schema drift can invalidate entire analytics foundations. A naive approach—trusting sources, writing without validation, or lacking lineage—means downstream consumers make decisions on garbage. At Netflix scale, a single bad record multiplied across millions of recommendations creates revenue and trust damage. Failure modes include non-idempotent sinks (duplicates on retry), schema-validation bypasses (malformed data propagates), and orphaned data with no traceability for impact assessment.
Section 2 — The Diagram
[Sources]---->[Schema Validate]---->[Curated Sink]
| | |
v v v
[DLQ] [Rules Engine] [Delta/Snowflake]
| | |
+------------------+------------>[Recon|Lineage]
Section 3 — Component Logic
Schema validation gates all incoming data against a contract (Great Expectations, JSON Schema, Avro); records failing validation are routed to a dead-letter queue (DLQ) rather than dropped—preserving evidence for debugging and replay. The rules engine applies domain rules (null checks, referential integrity, range validation); critical failures block the pipeline while non-critical go to quarantine. The idempotent sink uses deterministic keys (e.g., hash(source_id, timestamp)) so retries produce exactly-once semantics; Kafka plus transactional writes (e.g., Delta MERGE) achieve this. Reconciliation runs periodically comparing source and sink counts; lineage (OpenLineage, DataHub) traces data flow for impact analysis when bugs occur. Idempotency is essential for safe replay from checkpoints. Implement data contracts as code; version them and run compatibility tests in CI. Track quality metrics (completeness, freshness, correctness) and alert on degradation.
Section 4 — The Trade-offs (The 'Senior' part)
Design principles: Define schema contracts as the source of truth; version them and enforce in CI. Use idempotent keys derived from business identifiers; avoid sequence numbers or timestamps that can collide on retry. Reconcile batch totals between source and sink; investigate gaps immediately. Implement column-level lineage to trace each field from source to consumption; this accelerates impact analysis during incidents.
Section 5 — Pro-Tip
Want feedback on your answer?
Paste your answer to this question and our AI Coach scores it, finds gaps, and shows you the FAANG-level version.
Get the most asked SQL questions with expert answers. Instant download.
No spam. Unsubscribe anytime.
Paste your answer and get instant AI feedback with a FAANG-level improved version.
Analyze My Answer — FreeAccording to DataEngPrep.tech, this is one of the most frequently asked System Design/Architecture interview questions, reported at 1 company. DataEngPrep.tech maintains a curated database of 1,863+ real data engineering interview questions across 7 categories, verified by industry professionals.