Question 1

Describe a scenario where you used Databricks for real-time data processing.

Accepted Answer

**Situation**: Near-real-time fraud detection needed sub-30-second end-to-end latency. **Task**: Build a streaming pipeline with exactly-once semantics. **Action**: Architecture: Kafka → Databricks Structured Streaming → Delta Lake, with ML model scoring inside the stream and alerts sent to operations. Checkpointing (offsets and state) combined with the transactional Delta Lake sink provided exactly-once writes; checkpointing alone only guarantees replay after failure, not deduplicated output. Windowed aggregations captured fraud patterns. Challenges: late data (handled with watermarking) and backpressure (handled with auto-scaling and rate limiting). **Result**: Detection latency under 30 seconds; 99.9% uptime....
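
The late-data handling mentioned above can be illustrated without a cluster. The following is a simplified, per-event sketch of watermarking over tumbling event-time windows in plain Python; it is not Spark's micro-batch implementation, and the function name, window size, and lateness threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_s=10, allowed_lateness_s=5):
    """Count events per (key, window_start) in event time.

    `events` is an iterable of (key, event_time_seconds) in ARRIVAL order.
    The watermark trails the largest event time seen so far by
    `allowed_lateness_s`; an event is dropped when its window has already
    ended at or before the watermark. (Hypothetical helper, not a Spark API.)
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = float("-inf")
    for key, event_time in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - allowed_lateness_s
        window_start = (event_time // window_s) * window_s
        window_end = window_start + window_s
        if window_end <= watermark:
            # Too late: this window is considered finalized.
            dropped.append((key, event_time))
            continue
        counts[(key, window_start)] += 1
    return dict(counts), dropped


if __name__ == "__main__":
    arrivals = [("card1", 1), ("card1", 4), ("card1", 12),
                ("card1", 9), ("card1", 17), ("card1", 3)]
    print(tumbling_window_counts(arrivals))
```

In this run the event at time 9 arrives out of order but is still counted, while the event at time 3 arrives after the watermark has passed its window and is dropped. A larger lateness allowance accepts more stragglers at the cost of holding window state longer.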

Question 2

Describe a cross-team data project where you had to align architectural boundaries, ownership, and SLAs. How did you handle conflicting priorities, technical debt, and the scalability of communication as the number of stakeholders grew?

Accepted Answer

Architectural Logic: Cross-team projects need clear data contracts (schema, freshness, quality), defined ownership (who owns ingestion, transformation, consumption), and SLAs that map to business impact. Why boundaries matter: Prevents circular dependencies; enables parallel work; makes failure domains explicit. Scalability of coordination: Weekly syncs don't scale past ~5 teams; move to async updates, shared docs, and contract-first interfaces....

Question 3

Implement a recursive query for hierarchy (employee-manager). Explain the termination guarantees, depth limits, and when a recursive CTE becomes a scalability bottleneck. What alternatives exist for graph-scale hierarchies in Spark or a data lake?

Accepted Answer

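Implementation: WITH RECURSIVE emp_tree AS (SELECT id, name, manager_id, 1 AS level FROM employees WHERE manager_id IS NULL UNION ALL SELECT e.id, e.name, e.manager_id, t.level + 1 FROM employees e JOIN emp_tree t ON e.manager_id = t.id) SELECT * FROM emp_tree. Why it works: Base case = roots (employees with no manager); recursive case = join each employee to the rows produced at the prior level. Termination: Guaranteed only when the hierarchy is acyclic; each iteration then descends one level and recursion stops when a level produces no new rows. If cycles are possible, add cycle detection (for example, track the visited path) or cap the depth with a predicate such as t.level < N; a LIMIT on the outer query is engine-dependent and should not be the only guard....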
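
As a runnable sketch of the query above, the following uses Python's built-in `sqlite3` module with an in-memory table. The `employee_levels` helper, the sample rows, and the `max_depth` guard are assumptions added for illustration; the guard is a defensive bound on recursion depth, not something the original query includes.

```python
import sqlite3


def employee_levels(rows, max_depth=100):
    """Return (name, level) pairs for an employee-manager hierarchy.

    `rows` holds (id, name, manager_id) tuples; roots have manager_id None.
    `max_depth` is a hypothetical safety cap on recursion depth.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
    )
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)
    query = """
        WITH RECURSIVE emp_tree AS (
            -- Base case: employees with no manager.
            SELECT id, name, manager_id, 1 AS level
            FROM employees
            WHERE manager_id IS NULL
            UNION ALL
            -- Recursive case: direct reports of the previous level.
            SELECT e.id, e.name, e.manager_id, t.level + 1
            FROM employees e
            JOIN emp_tree t ON e.manager_id = t.id
            WHERE t.level < ?          -- depth guard
        )
        SELECT name, level FROM emp_tree ORDER BY level, id
    """
    result = conn.execute(query, (max_depth,)).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    staff = [(1, "Ava", None), (2, "Ben", 1), (3, "Cy", 1), (4, "Dee", 2)]
    print(employee_levels(staff))
```

Lowering `max_depth` truncates the tree at that level, which is one way to bound work when the data cannot be trusted to be acyclic.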

Question 4

Explain bloom filters in Spark: how they reduce I/O and when they introduce false positives that hurt performance. What are the scalability and cost implications of enabling dynamic partition pruning and bloom filter pushdown at petabyte scale?

Accepted Answer

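Bloom filter: Probabilistic set-membership structure; no false negatives; tunable false positive rate. Why in Spark: Runtime join filtering. Spark filters one side of a join, builds a bloom filter over the surviving join keys, and pushes it to the other side so non-matching rows are discarded early; dynamic partition pruning is the related optimization that uses the filtered side's keys to skip whole partitions. Both reduce I/O and shuffle volume. Architectural Logic: Effective when the filter is highly selective, for example dimension keys applied to fact table partitions. False positive impact: A false positive lets a non-matching row through, so extra rows are scanned and only discarded later by the join; results stay correct, but the saving shrinks. The rate is tunable via bits per element....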
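
To make the bits-per-element trade-off concrete, here is a minimal bloom filter sketch in plain Python. It is not Spark's implementation; the class, the SHA-256-based hashing scheme, and the helper names are assumptions for illustration only.

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter: k hash positions per item over a bit array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k positions by salting the item with a seed (illustrative scheme).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def build_filter(keys, bits_per_element, num_hashes):
    """Build a filter sized at `bits_per_element` bits per key."""
    bf = BloomFilter(max(1, len(keys) * bits_per_element), num_hashes)
    for key in keys:
        bf.add(key)
    return bf


def false_positive_rate(bf, non_members):
    """Fraction of known non-members the filter wrongly lets through."""
    hits = sum(1 for item in non_members if bf.might_contain(item))
    return hits / len(non_members)


if __name__ == "__main__":
    dim_keys = [f"key-{i}" for i in range(1000)]
    fact_keys = [f"other-{i}" for i in range(5000)]
    for bpe, k in [(2, 1), (10, 7)]:
        bf = build_filter(dim_keys, bpe, k)
        print(bpe, k, false_positive_rate(bf, fact_keys))
```

Every inserted key always passes `might_contain`, while the share of non-matching keys that slip through falls as bits per element rise. That extra memory is the cost side of the trade-off at large scale.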

Question 5

Given a table of sales data, use window functions to calculate a running total.

Accepted Answer

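SELECT date, amount, SUM(amount) OVER (ORDER BY date) AS running_total FROM sales. **Default frame**: With ORDER BY and no explicit frame, the frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which yields a cumulative sum. **Partitioned**: SUM(amount) OVER (PARTITION BY region ORDER BY date) restarts the total for each region. **Determinism**: If date has duplicates, RANGE treats peer rows together, so all rows sharing a date show the same total; for a row-by-row total, use ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW and add a unique tiebreaker to ORDER BY....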
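
The RANGE versus ROWS difference is easiest to see on data with duplicate dates. The sketch below runs both frames side by side using Python's built-in `sqlite3` (window functions require SQLite 3.25 or newer); the `id` tiebreaker column, the helper name, and the sample rows are assumptions added for illustration.

```python
import sqlite3


def running_totals(sales):
    """Return (date, amount, range_total, rows_total) for each sale.

    `sales` is a list of (date, amount) tuples; an autoincrementing id
    (hypothetical, added here) serves as the unique tiebreaker.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (id INTEGER PRIMARY KEY, date TEXT, amount INTEGER)"
    )
    conn.executemany("INSERT INTO sales (date, amount) VALUES (?, ?)", sales)
    rows = conn.execute("""
        SELECT date, amount,
               -- Default frame (RANGE): peers with the same date share a total.
               SUM(amount) OVER (ORDER BY date) AS range_total,
               -- Explicit ROWS frame with a tiebreaker: strictly row by row.
               SUM(amount) OVER (
                   ORDER BY date, id
                   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
               ) AS rows_total
        FROM sales
        ORDER BY date, id
    """).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    data = [("2024-01-01", 10), ("2024-01-02", 20),
            ("2024-01-02", 30), ("2024-01-03", 40)]
    for row in running_totals(data):
        print(row)
```

With two sales on 2024-01-02, the RANGE total shows 60 on both rows, while the ROWS total shows 30 and then 60.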

Question 6

How do you handle schema evolution in data lakes or data warehouses?

Accepted Answer

**Strategies**: (1) Schema-on-read (Parquet, JSON): flexible; readers project the columns they need and tolerate columns added later. (2) Additive evolution: add columns with a default; readers ignore deprecated columns. (3) Schema registry (Avro): compatibility checks when a new schema version is registered. (4) Versioned datasets (v1, v2): backfill or dual-read during migration. **Best practice**: Prefer additive changes; avoid breaking changes; enforce a compatibility mode (backward/forward). **Backfill**: For new columns, either backfill historical data or leave it NULL....
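
A backward-compatibility check of the kind a schema registry performs can be sketched in a few lines. The version below is a deliberate simplification and an assumption, not any registry's actual algorithm: schemas are plain dicts of `name -> (type, default)`, any type change counts as incompatible (real systems such as Avro allow some promotions), and `NO_DEFAULT` is a hypothetical sentinel.

```python
NO_DEFAULT = object()  # sentinel: field has no default (None is a valid default)


def is_backward_compatible(old_schema, new_schema):
    """True if a reader using `new_schema` can read data written with `old_schema`."""
    for name, (new_type, default) in new_schema.items():
        if name in old_schema:
            if old_schema[name][0] != new_type:
                return False      # type changed (simplified: no promotions)
        elif default is NO_DEFAULT:
            return False          # new field without a default breaks old data
    return True                   # fields removed from the reader are ignored


def read_record(record, reader_schema):
    """Project an old record onto the reader schema, filling defaults."""
    out = {}
    for name, (_type, default) in reader_schema.items():
        if name in record:
            out[name] = record[name]
        elif default is NO_DEFAULT:
            raise KeyError(f"missing required field: {name}")
        else:
            out[name] = default
    return out


if __name__ == "__main__":
    v1 = {"id": ("int", NO_DEFAULT), "amount": ("double", NO_DEFAULT)}
    v2 = {**v1, "currency": ("string", "USD")}   # additive, with default
    v3 = {**v1, "region": ("string", NO_DEFAULT)}  # additive, no default
    print(is_backward_compatible(v1, v2), is_backward_compatible(v1, v3))
    print(read_record({"id": 1, "amount": 2.5}, v2))
```

The additive change with a default passes and old records read cleanly; the same change without a default is rejected, which is why additive evolution with defaults (or NULL) is usually the preferred path.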

Question 7

How would you optimize a query with multiple joins and subqueries?

Accepted Answer

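Optimizing multi-join/subquery queries: (1) Reduce join cardinality early by filtering before joining. (2) Rewrite subqueries as joins where the optimizer does not already do so: e.g., SELECT * FROM a WHERE a.id IN (SELECT id FROM b WHERE ...) becomes SELECT a.* FROM a JOIN (SELECT DISTINCT id FROM b WHERE ...) b ON a.id = b.id. The DISTINCT (or an equivalent EXISTS semi-join) prevents duplicate rows when b.id is not unique. Correlated subqueries that execute once per outer row benefit most from this rewrite. (3) Use CTEs for readability; whether the optimizer materializes or inlines them depends on the engine. (4) Ensure join columns are indexed. (5) Most cost-based optimizers pick the join order themselves; where they do not, join the smallest or most heavily filtered tables first. (6) Avoid SELECT * in subqueries. (7) Use EXPLAIN to detect full scans....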
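
One pitfall when turning an IN subquery into a join is row duplication when the subquery's key is not unique. The sketch below compares four formulations on a tiny in-memory SQLite database; the tables, the `status` column, and the helper name are assumptions made up for the demonstration.

```python
import sqlite3


def compare_rewrites():
    """Run four equivalent-looking queries and return the ids each yields."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE a (id INTEGER, label TEXT);
        CREATE TABLE b (id INTEGER, status TEXT);
        INSERT INTO a VALUES (1, 'x'), (2, 'y'), (3, 'z');
        -- b.id = 1 appears twice on purpose.
        INSERT INTO b VALUES (1, 'ok'), (1, 'ok'), (2, 'bad'), (3, 'ok');
    """)
    queries = {
        "in": """SELECT a.id FROM a
                 WHERE a.id IN (SELECT id FROM b WHERE status = 'ok')
                 ORDER BY a.id""",
        "naive_join": """SELECT a.id FROM a
                         JOIN (SELECT id FROM b WHERE status = 'ok') b
                           ON a.id = b.id
                         ORDER BY a.id""",
        "distinct_join": """SELECT a.id FROM a
                            JOIN (SELECT DISTINCT id FROM b WHERE status = 'ok') b
                              ON a.id = b.id
                            ORDER BY a.id""",
        "exists": """SELECT a.id FROM a
                     WHERE EXISTS (SELECT 1 FROM b
                                   WHERE b.id = a.id AND b.status = 'ok')
                     ORDER BY a.id""",
    }
    results = {name: [row[0] for row in conn.execute(sql)]
               for name, sql in queries.items()}
    conn.close()
    return results


if __name__ == "__main__":
    for name, ids in compare_rewrites().items():
        print(name, ids)
```

Only the naive join returns id 1 twice; the DISTINCT join and the EXISTS form preserve the semantics of the original IN query.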

Question 8

Write a query to find the first number repeating consecutively three times in a sequence.

Accepted Answer

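WITH consec AS (SELECT num, id, ROW_NUMBER() OVER (ORDER BY id) - ROW_NUMBER() OVER (PARTITION BY num ORDER BY id) AS grp FROM t) SELECT num FROM consec GROUP BY num, grp HAVING COUNT(*) >= 3 ORDER BY MIN(id) LIMIT 1. **Why**: Within a run of identical values, the overall row number and the per-value row number both advance by 1, so their difference stays constant; rows sharing (num, grp) therefore form one consecutive run, and the qualifying run with the smallest id is the first occurrence. Subtracting the row number from num itself does not work, because it yields a different grp for every row of the same value....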
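
For a runnable check of this kind of query, the sketch below uses the gaps-and-islands formulation that groups on the difference between two row numbers (overall minus per-value), executed with Python's built-in `sqlite3` (SQLite 3.25 or newer for window functions). The helper name and sample sequences are assumptions for illustration.

```python
import sqlite3


def first_triple(nums):
    """Return the first value that appears 3+ times in a row, or None."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, num INTEGER)")
    # ids are assigned in insertion order: 1, 2, 3, ...
    conn.executemany("INSERT INTO t (num) VALUES (?)", [(n,) for n in nums])
    row = conn.execute("""
        WITH consec AS (
            SELECT id, num,
                   ROW_NUMBER() OVER (ORDER BY id)
                 - ROW_NUMBER() OVER (PARTITION BY num ORDER BY id) AS grp
            FROM t
        )
        SELECT num
        FROM consec
        GROUP BY num, grp          -- one group per consecutive run
        HAVING COUNT(*) >= 3
        ORDER BY MIN(id)           -- earliest run first
        LIMIT 1
    """).fetchone()
    conn.close()
    return row[0] if row else None


if __name__ == "__main__":
    print(first_triple([1, 2, 2, 1, 1, 1, 3, 3, 3]))
    print(first_triple([1, 2, 1, 2]))
```

Because the grouping relies on row numbers rather than on `id` arithmetic, it keeps working when `id` has gaps, as long as `id` still defines the sequence order.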

American Express SQL Interview Questions
