This hard-level Cloud/Tools question has been reported in data engineering interviews at companies like Capco. Although less common, it tests the deeper understanding that distinguishes strong candidates.
This is a senior-level question that tests architectural thinking. Lead with the high-level design, then drill into specifics. Discuss trade-offs explicitly - there is rarely one correct answer. Show awareness of scale, fault tolerance, and operational complexity.
**Why SLA monitoring**: SLAs are contractual; breaches trigger credits and erode trust. Define SLOs first (e.g., 99.5% success rate, data latency < 30 min).

**Architecture**: Use CloudWatch for metrics: track Lambda/Glue duration, error rates, and data freshness (custom metric: last_successful_run_timestamp). Set CloudWatch Alarms on threshold breaches (e.g., job duration > 2× baseline, failure count > 0). EventBridge routes alarm state changes to runbooks, PagerDuty, or Step Functions for auto-remediation.

**Scalability**: At 100+ pipelines, the number of custom metrics explodes, so use a consistent namespace + dimension strategy and consider centralized dashboards (Grafana) with templating.

**Cost**: Custom metrics cost $0.30 per metric per month; 50 pipelines × 10 metrics = 500 metrics, or $150/month, so budget for it. For Glue, monitor DPU hours (the cost driver) and job bookmarks.

Integrate data quality checks (Great Expectations, dbt tests) that emit pass/fail metrics; quality is part of the SLA.

**Observability**: Use X-Ray for distributed tracing across Lambda and other natively integrated services; Glue jobs are not traced natively, so correlate them by job-run ID in logs and metrics. Store SLAs as code, and automate incident response with Step Functions or Lambda so humans handle exceptions, not routine failures.
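As a rough illustration of the freshness metric, the namespace + dimension strategy, and the cost arithmetic above, here is a minimal Python sketch. The namespace `DataPlatform/Pipelines`, the metric name `DataFreshnessSeconds`, the `Pipeline` dimension, and all function names are assumptions made for this example. It also publishes the age of the last successful run in seconds rather than the raw timestamp, on the assumption that a static alarm threshold is easier to express against an age. The only AWS call used is boto3's CloudWatch `put_metric_data`.

```python
from datetime import datetime, timezone

# Hypothetical namespace: one namespace for all pipelines, with the pipeline
# name carried as a dimension instead of being baked into the metric name.
NAMESPACE = "DataPlatform/Pipelines"
COST_PER_CUSTOM_METRIC_USD = 0.30  # per metric per month, as quoted above


def freshness_datum(pipeline: str, last_success: datetime, now: datetime) -> dict:
    """Build one CloudWatch MetricDatum: seconds since the last successful run."""
    age_seconds = (now - last_success).total_seconds()
    return {
        "MetricName": "DataFreshnessSeconds",  # assumed metric name
        "Dimensions": [{"Name": "Pipeline", "Value": pipeline}],
        "Timestamp": now,
        "Value": age_seconds,
        "Unit": "Seconds",
    }


def breaches_latency_slo(datum: dict, max_age_minutes: float = 30) -> bool:
    """True when the data is older than the latency SLO (default: 30 min)."""
    return datum["Value"] > max_age_minutes * 60


def duration_alarm_threshold(baseline_seconds: float, multiplier: float = 2.0) -> float:
    """Alarm threshold for job duration, e.g. 2x the observed baseline."""
    return baseline_seconds * multiplier


def monthly_metric_cost(pipelines: int, metrics_per_pipeline: int,
                        unit_cost: float = COST_PER_CUSTOM_METRIC_USD) -> float:
    """Estimated monthly custom-metric bill in USD."""
    return round(pipelines * metrics_per_pipeline * unit_cost, 2)


def publish(cloudwatch_client, datum: dict) -> None:
    """Send the datum; cloudwatch_client is expected to be boto3.client('cloudwatch')."""
    cloudwatch_client.put_metric_data(Namespace=NAMESPACE, MetricData=[datum])


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    last_ok = datetime(2024, 1, 1, tzinfo=timezone.utc)  # illustrative value only
    datum = freshness_datum("orders_daily", last_ok, now)
    print("SLO breached:", breaches_latency_slo(datum))
    print("Monthly metric cost (USD):", monthly_metric_cost(50, 10))
```

Keeping the pipeline name in a dimension means one alarm definition and one dashboard template can be reused across pipelines. Each unique namespace, metric name, and dimension combination is still billed as a separate custom metric, which is what the cost estimate multiplies out.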
According to DataEngPrep.tech, this is one of the most frequently asked Cloud/Tools interview questions, reported at 1 company. DataEngPrep.tech maintains a curated database of 1,863+ real data engineering interview questions across 7 categories, verified by industry professionals.