Question 1

Discuss a project where you balanced business goals with technical constraints.

Accepted Answer

**Situation**: Business wanted real-time dashboards; infrastructure was batch-only. **Task**: Deliver value within constraints. **Action**: Proposed hybrid: (1) Batch for historical (nightly). (2) Near-real-time (15-min) for recent data via incremental streaming. Delivered MVP with batch first; phased streaming in Q2. Communicated trade-offs; aligned on phased approach. **Result**: Met 80% of requirements with existing stack; streaming added later....

Question 2

Walk through a production incident where data freshness or correctness was at risk. How did you balance immediate mitigation vs. root-cause remediation? What architectural changes would prevent recurrence, and what are the cost vs. reliability trade-offs?

Accepted Answer

Situation: Pipeline failed at 2 AM; source schema change (new required column) broke ingestion. Mitigation vs. Remediation: Quick fix (default column, redeploy) restores service; proper fix (schema validation, evolution policy) prevents recurrence. Architectural Logic: Schema-on-write pipelines fail on evolution; schema validation (e.g., Glue Schema Registry, Avro) catches drift early. Resiliency vs....
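
The validation step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a registry integration: `EXPECTED_SCHEMA`, `detect_drift`, `ingest`, and the sample records are all hypothetical names and data, and a real pipeline would typically delegate the check to a schema registry or an Avro schema.

```python
# Hypothetical expected schema for an ingested record: column -> Python type.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "country": str}


def detect_drift(record: dict, expected: dict = EXPECTED_SCHEMA) -> dict:
    """Compare one record against the expected schema and report drift."""
    missing = sorted(set(expected) - set(record))
    unexpected = sorted(set(record) - set(expected))
    wrong_type = sorted(
        col for col, typ in expected.items()
        if col in record and not isinstance(record[col], typ)
    )
    return {"missing": missing, "unexpected": unexpected, "wrong_type": wrong_type}


def ingest(records, expected=EXPECTED_SCHEMA):
    """Pass clean records onward; quarantine anything that drifted."""
    accepted, quarantined = [], []
    for rec in records:
        drift = detect_drift(rec, expected)
        if any(drift.values()):
            quarantined.append((rec, drift))
        else:
            accepted.append(rec)
    return accepted, quarantined


batch = [
    {"order_id": 1, "amount": 19.99, "country": "DE"},
    # The source added a column the pipeline does not know about yet.
    {"order_id": 2, "amount": 5.00, "country": "US", "channel": "app"},
    # Corrupted value: amount is not numeric.
    {"order_id": 3, "amount": "n/a", "country": "UK"},
]
accepted, quarantined = ingest(batch)
```

Quarantining drifted records instead of failing the whole run keeps the pipeline serving clean data while the drift report feeds an alert, which is one way to separate the immediate mitigation from the longer-term schema evolution policy.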

Question 3

Design a star schema for retail analytics (e.g., Adidas). Explain the dimensional modeling choices, SCD strategy, and how you would scale this schema for global multi-currency, multi-region deployments. What are the refresh and storage cost implications?

Accepted Answer

**Section 1 — The Context (The 'Why')**
The primary challenge for this design in SQL is balancing scale, cost, and reliability. At scale, naive approaches fail: single points of failure cause cascades, schema evolution breaks consumers, and over-provisioning explodes cost. Failure modes include silent data loss from non-idempotent writes, cascading job failures from tight coupling, and operational burden from manual intervention....

Question 4

Explain how partitioning and bucketing in Hive/Spark optimize queries. What are the trade-offs in bucket count, partition cardinality, and small-file problem? When does over-partitioning or over-bucketing become counterproductive?

Accepted Answer

Partitioning: Splits data into separate directories by column value (e.g., dt, region); pruning skips non-matching partitions. Bucketing: Hashes rows into N buckets (files) by key; enables co-located joins when both tables are bucketed on the same key. Why combined: PARTITIONED BY (dt) CLUSTERED BY (user_id) INTO 32 BUCKETS lets the engine prune by date and join efficiently on user_id. Trade-offs: Partition cardinality too high (e.g., by hour for years) → small-file problem, metadata overload; too low → coarse pruning....
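
To make the bucketing and small-file reasoning concrete, here is a small sketch. It assumes `zlib.crc32` as a stand-in for the engine's hash function (Hive and Spark use their own), and the table contents and helper names (`bucket_for`, `file_count`) are hypothetical. The file counts are an upper bound that assumes one file per bucket per partition.

```python
import zlib


def bucket_for(key: str, num_buckets: int) -> int:
    """Stable bucket number for a key (crc32 stands in for the engine's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def file_count(num_partitions: int, num_buckets: int) -> int:
    """Upper bound on data files: one file per bucket per partition."""
    return num_partitions * num_buckets


NUM_BUCKETS = 32

# Two hypothetical tables bucketed on the same key with the same bucket count:
# a given user_id always lands in the same bucket number in both tables, so a
# join only has to pair bucket i of one table with bucket i of the other.
orders_users = ["u1", "u2", "u3", "u42"]
clicks_users = ["u42", "u2", "u7"]
orders_buckets = {u: bucket_for(u, NUM_BUCKETS) for u in orders_users}
clicks_buckets = {u: bucket_for(u, NUM_BUCKETS) for u in clicks_users}
shared = set(orders_users) & set(clicks_users)
co_located = all(orders_buckets[u] == clicks_buckets[u] for u in shared)

# Small-file arithmetic: daily vs. hourly partitions over three years.
daily_files = file_count(365 * 3, NUM_BUCKETS)        # 1,095 partitions
hourly_files = file_count(24 * 365 * 3, NUM_BUCKETS)  # 26,280 partitions
```

With 32 buckets, daily partitioning over three years yields at most 35,040 files, while hourly partitioning yields up to 840,960. The same data spread over 24 times as many files is where metadata overhead and tiny-file scans start to outweigh the finer pruning.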

Question 5

Explain the differences between OLTP and OLAP databases and their relevance in Adidas's operations.

Accepted Answer

**OLTP** (SAP, PostgreSQL): Optimized for high-volume, low-latency transactional writes; ACID; point lookups; indexes on primary/foreign keys. **OLAP** (Snowflake, Redshift): Optimized for bulk reads, aggregations, scans; columnar storage; few indexes. **Why separate**: OLTP can't scale for analytics without degrading transaction latency; OLAP schemas (Star) are write-inefficient. **Adidas relevance**: OLTP powers e-commerce checkout, POS, inventory—sub-100ms response....

Question 6

How would you create a materialized view for frequently accessed aggregated sales data?

Accepted Answer

CREATE MATERIALIZED VIEW mv_sales_agg AS SELECT region, product, SUM(amount) AS total_amount FROM sales GROUP BY region, product; **Refresh**: REFRESH MATERIALIZED VIEW mv_sales_agg; (PostgreSQL and Redshift syntax). **Incremental**: Some systems (Snowflake, Delta) support incremental maintenance, updating only the affected aggregates instead of recomputing the whole view. **Schedule**: Cron or a scheduled dbt job. **Trade-off**: Refresh cost vs. query savings....
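
The refresh trade-off can be demonstrated with a runnable sketch. SQLite has no `CREATE MATERIALIZED VIEW`, so this example emulates one with an ordinary summary table that is rebuilt on demand; that substitution, the `refresh_mv_sales_agg` helper, the `total_amount` column name, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, amount REAL);
    -- Summary table standing in for the materialized view.
    CREATE TABLE mv_sales_agg (
        region TEXT, product TEXT, total_amount REAL,
        PRIMARY KEY (region, product)
    );
""")


def refresh_mv_sales_agg(conn: sqlite3.Connection) -> None:
    """Full refresh: rebuild the summary from the base table in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM mv_sales_agg")
        conn.execute("""
            INSERT INTO mv_sales_agg (region, product, total_amount)
            SELECT region, product, SUM(amount)
            FROM sales
            GROUP BY region, product
        """)


def eu_shoes_total(conn: sqlite3.Connection) -> float:
    """Read the pre-aggregated value instead of scanning the base table."""
    return conn.execute(
        "SELECT total_amount FROM mv_sales_agg "
        "WHERE region = 'EU' AND product = 'shoes'"
    ).fetchone()[0]


conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EU", "shoes", 100.0), ("EU", "shoes", 50.0), ("US", "shirts", 30.0)],
)
conn.commit()
refresh_mv_sales_agg(conn)

# New base rows are invisible to readers of the summary until the next refresh.
conn.execute("INSERT INTO sales VALUES ('EU', 'shoes', 25.0)")
conn.commit()
stale = eu_shoes_total(conn)
refresh_mv_sales_agg(conn)
fresh = eu_shoes_total(conn)
```

The gap between `stale` and `fresh` is the staleness window that the refresh schedule controls: refreshing more often narrows it at the price of more recomputation.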

Question 7

How would you handle duplicate or corrupted data in a batch ETL job?

Accepted Answer

Handling duplicate or corrupted data in batch ETL involves multiple layers. First, implement idempotency—use MERGE/UPSERT or truncate-and-reload with deterministic keys. For duplicates, apply business rules: keep most recent (MAX(updated_at)), first occurrence, or aggregate. Use ROW_NUMBER() OVER (PARTITION BY composite_key ORDER BY timestamp DESC) and filter rn=1. For corrupted data, validate schemas at ingestion, reject bad records to a quarantine table, and alert....
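
A minimal sketch of the deduplicate-and-quarantine pattern follows, run through SQLite so it is self-contained. The table names, the `(order_id, line_no)` composite key, and the "amount must be non-negative" validation rule are hypothetical. Window functions require SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging (order_id INTEGER, line_no INTEGER,
                          amount REAL, updated_at TEXT);
    CREATE TABLE quarantine (order_id INTEGER, line_no INTEGER,
                             amount REAL, updated_at TEXT, reason TEXT,
                             PRIMARY KEY (order_id, line_no, updated_at));
    CREATE TABLE orders_clean (order_id INTEGER, line_no INTEGER,
                               amount REAL, updated_at TEXT,
                               PRIMARY KEY (order_id, line_no));
""")
conn.executemany("INSERT INTO staging VALUES (?, ?, ?, ?)", [
    (1, 1, 10.0, "2024-01-01"),
    (1, 1, 12.0, "2024-01-02"),  # newer duplicate of the same key
    (2, 1, None, "2024-01-01"),  # corrupted: missing amount
    (3, 1, 7.5, "2024-01-03"),
])
conn.commit()


def load_batch(conn: sqlite3.Connection) -> None:
    """Quarantine invalid rows, then upsert the latest valid row per key."""
    with conn:
        conn.execute("""
            INSERT OR REPLACE INTO quarantine
            SELECT order_id, line_no, amount, updated_at, 'invalid amount'
            FROM staging
            WHERE amount IS NULL OR amount < 0
        """)
        conn.execute("""
            INSERT OR REPLACE INTO orders_clean
            SELECT order_id, line_no, amount, updated_at
            FROM (
                SELECT *, ROW_NUMBER() OVER (
                    PARTITION BY order_id, line_no
                    ORDER BY updated_at DESC
                ) AS rn
                FROM staging
                WHERE amount IS NOT NULL AND amount >= 0
            )
            WHERE rn = 1
        """)


load_batch(conn)
load_batch(conn)  # re-running the same batch leaves the targets unchanged

clean_rows = conn.execute(
    "SELECT order_id, line_no, amount, updated_at "
    "FROM orders_clean ORDER BY order_id"
).fetchall()
quarantined = conn.execute("SELECT COUNT(*) FROM quarantine").fetchone()[0]
```

Keying both target tables and writing with `INSERT OR REPLACE` is what makes the second `load_batch` call harmless; on a warehouse the same role would usually be played by MERGE.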

Question 8

How would you optimize a query fetching sales data across multiple countries with billions of rows?

Accepted Answer

For billions of rows across countries: (1) Partition by country and date, so that a filter such as SELECT country, sale_date, amount FROM sales WHERE country IN ('US','UK') AND sale_date BETWEEN ... enables partition pruning. (2) Use columnar storage (Redshift, BigQuery, Snowflake) and select only the columns you need instead of SELECT *, so only those columns are scanned. (3) Aggregate at source: pre-aggregate by country/date in a summary table. (4) Use approximate queries (HyperLogLog, APPROX_COUNT_DISTINCT) when exact counts aren't needed. (5) Implement incremental processing: only process new/changed data....
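
Points (3) and (5) combine naturally: keep a per-country, per-day summary table and recompute only the days that changed. The sketch below runs on SQLite for portability; the `sales_daily` table, the `refresh_since` helper, the date watermark, and the sample rows are hypothetical, and a warehouse would apply the same idea per partition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (country TEXT, sale_date TEXT, amount REAL);
    CREATE TABLE sales_daily (
        country TEXT, sale_date TEXT, total_amount REAL, order_count INTEGER,
        PRIMARY KEY (country, sale_date)
    );
""")


def refresh_since(conn: sqlite3.Connection, watermark: str) -> None:
    """Recompute summary rows only for sale dates on or after the watermark."""
    with conn:
        conn.execute("DELETE FROM sales_daily WHERE sale_date >= ?", (watermark,))
        conn.execute("""
            INSERT INTO sales_daily
            SELECT country, sale_date, SUM(amount), COUNT(*)
            FROM sales
            WHERE sale_date >= ?
            GROUP BY country, sale_date
        """, (watermark,))


# Day 1 load, then summarize everything from day 1 onward.
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("US", "2024-01-01", 10.0), ("US", "2024-01-01", 5.0),
    ("UK", "2024-01-01", 8.0), ("DE", "2024-01-01", 3.0),
])
conn.commit()
refresh_since(conn, "2024-01-01")

# Day 2 load: only day 2 is recomputed; day 1 summary rows are left alone.
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("US", "2024-01-02", 20.0), ("UK", "2024-01-02", 1.0),
])
conn.commit()
refresh_since(conn, "2024-01-02")

# Reports read the small summary table, not the raw fact table.
report = conn.execute("""
    SELECT country, SUM(total_amount)
    FROM sales_daily
    WHERE country IN ('US', 'UK')
      AND sale_date BETWEEN '2024-01-01' AND '2024-01-02'
    GROUP BY country
    ORDER BY country
""").fetchall()
```

Deleting and reinserting the affected date range keeps the refresh idempotent, so a retried job produces the same summary rows.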

Question 9

Tell us about a project where you optimized an existing process or pipeline. What was the impact?

Accepted Answer

**Situation**: Context of the challenge. **Task**: Your responsibility. **Action**: Specific steps, tools, collaboration. **Result**: Quantified outcome.

I optimized a legacy daily ETL pipeline that took 8+ hours. The pipeline loaded 50+ tables sequentially with full refreshes....

Question 10

What are the benefits of using a cloud data warehouse (e.g., Redshift, Snowflake) for analytics?

Accepted Answer

Cloud data warehouse benefits (Redshift, Snowflake): (1) Elasticity—scale compute and storage independently; pay for what you use. (2) Managed operations—fewer DBA tasks; auto-tuning, backups. (3) Performance—columnar storage, query optimization, caching. (4) Integration—native connectors to cloud storage, streaming, BI tools. (5) Security—encryption, access controls, compliance certifications. (6) Global availability—multi-region options. Trade-off: Vendor lock-in; egress costs....

Adidas SQL Interview Questions
