Real questions from top companies
Implement a recursive query for hierarchy (employee-manager). Explain the termination guarantees, depth limits, and when a recursive CTE becomes a scalability bottleneck. What alternatives exist for graph-scale hierarchies in Spark or a data lake?
Explain bloom filters in Spark: how they reduce I/O and when they introduce false positives that hurt performance. What are the scalability and cost implications of enabling dynamic partition pruning and bloom filter pushdown at petabyte scale?
Design a star schema for retail analytics (e.g., Adidas). Explain the dimensional modeling choices, SCD strategy, and how you would scale this schema for global multi-currency, multi-region deployments. What are the refresh and storage cost implications?
Compare Glue partition discovery with Hive MSCK/ADD PARTITION. Explain the operational and cost implications of crawler-based vs. partition-projection approaches. When does partition projection become necessary, and what are its limitations?
Explain how partitioning and bucketing in Hive/Spark optimize queries. What are the trade-offs among bucket count, partition cardinality, and the small-file problem? When does over-partitioning or over-bucketing become counterproductive?
Explain how to flatten a multi-level nested JSON file while loading it into BigQuery.
Explain how to implement cumulative sum in SQL.
Explain how you would implement partitioning and bucketing for data stored in S3 to improve query performance.
Explain how you would optimize Redshift query performance for a reporting system with large fact tables.
Explain how you would use repartition or coalesce effectively to optimize processing when analyzing data only for a specific region.
Explain indexing and its impact on database performance.
Explain normalization and its disadvantages.
Explain normalization in databases and its importance. Write a SQL query to handle SCD-1 or SCD-3.
Explain offset management, sync vs. async commits, partition assignment strategies, consumer groups, and backpressure handling in Kafka Streams.
Explain peer code review and team lead review.
Explain row_number, rank, and dense_rank with examples.
Explain the Medallion Architecture (Bronze, Silver, Gold).
Explain the difference between a clustered and non-clustered index.
Explain the difference between a fact table and a dimension table.
Explain the difference between a primary key and a unique key.
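As a starting point for the employee-manager hierarchy question in this list, here is a minimal sketch of a recursive CTE with an explicit depth guard. It runs the SQL through Python's built-in `sqlite3` module for convenience. The table layout, sample rows, the `reporting_levels` helper, and the `max_depth` parameter are all assumptions made for illustration, not part of any question above.

```python
import sqlite3

# Hypothetical sample data: (employee id, name, manager id); None marks the root.
EMPLOYEES = [
    (1, "Ava", None),
    (2, "Ben", 1),
    (3, "Cara", 1),
    (4, "Dev", 2),
    (5, "Eli", 4),
]

# Anchor member: employees with no manager (depth 0).
# Recursive member: direct reports of rows already in the result.
# The depth predicate stops expansion below :max_depth levels.
HIERARCHY_SQL = """
WITH RECURSIVE chain(id, name, depth) AS (
    SELECT id, name, 0
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id, e.name, c.depth + 1
    FROM employees AS e
    JOIN chain AS c ON e.manager_id = c.id
    WHERE c.depth < :max_depth
)
SELECT id, name, depth FROM chain ORDER BY depth, id
"""


def reporting_levels(rows, max_depth=10):
    """Return (id, name, depth) for every employee within max_depth of a root."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE employees ("
            "id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
        )
        conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)
        return conn.execute(HIERARCHY_SQL, {"max_depth": max_depth}).fetchall()
    finally:
        conn.close()


levels = reporting_levels(EMPLOYEES)
for emp_id, name, depth in levels:
    # Indent each name by its distance from the root.
    print("  " * depth + name)
```

On an acyclic table the recursion ends on its own once a level produces no new rows. The depth predicate is an extra bound that keeps the query finite if the anchor or the data ever admits a cycle. Each level is one more join against the base table, which is a large part of why very deep or very wide hierarchies tend to be handled with iterative joins or graph tooling instead.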