Question 1

Demonstrate the difference between DENSE_RANK() and RANK()

Accepted Answer

**RANK()**: Same rank for ties; skips subsequent ranks (e.g., 1, 2, 2, 4, 5). **DENSE_RANK()**: Same rank for ties; no gaps (e.g., 1, 2, 2, 3, 4). **Why it matters**: RANK preserves "position" semantics (e.g., 4th place follows a two-way tie for 2nd); DENSE_RANK gives consecutive integers, useful for filtering on distinct values (e.g., `dense_rk <= 10` keeps every row within the top 10 distinct salaries). **Example**: `SELECT name, salary, RANK() OVER (ORDER BY salary DESC) AS rk, DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rk FROM employee`....
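
To see the gap behavior concretely, here is a minimal sketch that runs the same query against an in-memory SQLite database (window functions need SQLite 3.25 or newer). The sample names and salaries are made up for illustration; only the `employee` table and its columns come from the answer above.

```python
import sqlite3

# Hypothetical sample data: two employees tie on the second-highest salary.
rows = [("Ana", 300), ("Ben", 200), ("Cho", 200), ("Dev", 100)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employee VALUES (?, ?)", rows)

query = """
    SELECT name,
           salary,
           RANK()       OVER (ORDER BY salary DESC) AS rk,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rk
    FROM employee
    ORDER BY salary DESC, name
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The last row is where the two functions diverge: after the tie at rank 2, `RANK()` jumps to 4 while `DENSE_RANK()` continues with 3.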

Question 2

Compare Glue partition discovery with Hive MSCK/ADD PARTITION. Explain the operational and cost implications of crawler-based vs. partition-projection approaches. When does partition projection become necessary, and what are its limitations?

Accepted Answer

Glue: Crawler infers partitions from the S3 path (e.g., s3://bucket/table/year=2024/month=01/) and registers them in the Glue Data Catalog. Hive: MSCK REPAIR TABLE scans for Hive-style (key=value) paths; ALTER TABLE ADD PARTITION registers a partition explicitly and can point at any location. Why Glue: AWS-native; integrates with Athena and Redshift Spectrum; no cluster to run. Partition projection: Define each partition column's type and range in table properties; no crawler needed; Athena computes partition values and locations at query time instead of looking them up in the catalog. Operational: Crawler runs on a schedule and is billed for DPU time; projection has no crawler cost....
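
The idea behind projection can be sketched as a toy model: partition locations are computed from a template and a range, with no catalog lookup. This is an illustration only, not Athena's implementation. The template string loosely mirrors the `storage.location.template` table property, and `projected_locations` is a hypothetical helper.

```python
from datetime import date

# Hypothetical template, loosely mirroring Athena's
# "storage.location.template" table property.
LOCATION_TEMPLATE = "s3://bucket/table/year=${year}/month=${month}/"


def projected_locations(start: date, end: date) -> list[str]:
    """Enumerate monthly partition locations in [start, end] from the template alone."""
    locations = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        locations.append(
            LOCATION_TEMPLATE.replace("${year}", f"{year:04d}")
                             .replace("${month}", f"{month:02d}")
        )
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return locations


locations = projected_locations(date(2023, 12, 1), date(2024, 2, 1))
for loc in locations:
    print(loc)
```

Because the locations are derived rather than stored, a new month needs no crawler run or ADD PARTITION call. The sketch only works when the paths follow the template exactly.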

Question 3

Explain how you would optimize Redshift query performance for a reporting system with large fact tables.

Accepted Answer

**Architectural Logic**: Distribution style, sort keys, vacuum, and materialized views are the foundation of Redshift tuning. **Strategies**: (1) DISTKEY on the most common join column so matching rows land on the same slice. (2) SORTKEY on filter columns (date, region) so zone maps can skip blocks. (3) VACUUM and ANALYZE regularly. (4) Let the result cache serve repeated identical queries. (5) Materialized views for common aggregates. (6) Column compression. (7) Avoid SELECT *. **Scalability**: Design distribution and sort keys from the start; monitor query plans. **Cost**: Right-size the cluster; use concurrency scaling for spikes....

Question 4

Explain the differences between table re-creation and ALTER TABLE operations.

Accepted Answer

**ALTER TABLE**: In-place metadata or incremental change; minimal downtime; supported for add column, drop column (some systems), rename. **Re-creation** (CTAS, DROP+CREATE): Full copy; changes storage format, clustering, partitioning; reclaims space from deletes. **Why ALTER**: Zero-downtime schema evolution; additive changes (new column with default) are instant in most systems....
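
The two paths can be contrasted in a small sketch, using an in-memory SQLite database as a stand-in for a warehouse. The `events` table and its columns are hypothetical, and the swap-by-rename pattern is one common way to do re-creation, not the only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, region TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, "US"), (2, "EU"), (3, "US")])

# ALTER TABLE: additive change; SQLite does not rewrite existing rows here.
conn.execute("ALTER TABLE events ADD COLUMN source TEXT DEFAULT 'unknown'")

# Re-creation: copy everything into a new table (CTAS), then swap names.
conn.execute(
    "CREATE TABLE events_new AS "
    "SELECT id, region, source FROM events ORDER BY region, id"
)
conn.execute("ALTER TABLE events RENAME TO events_old")
conn.execute("ALTER TABLE events_new RENAME TO events")
conn.execute("DROP TABLE events_old")

rows = conn.execute(
    "SELECT id, region, source FROM events ORDER BY region, id"
).fetchall()
print(rows)
```

The ALTER step touches only metadata, while the CTAS step reads and writes every row. That cost difference is why re-creation is usually reserved for changes ALTER cannot express.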

Question 5

Explain the use of Amazon Athena for serverless querying.

Accepted Answer

Athena is a serverless SQL engine (Presto/Trino-based) over S3; pay per TB scanned. **Why serverless**: No cluster to manage; auto-scaling; no startup wait. **Architecture**: Glue Catalog provides schema; queries scan S3 directly. **Scalability**: Concurrency limited by account quotas; partition pruning limits scan. **Cost**: $5/TB scanned (list price varies by region); unpartitioned full scans are expensive. **Optimization**: Partition by date/region; use Parquet/ORC (columnar, so only referenced columns are read, often cutting scanned bytes by roughly 10×); compress; avoid SELECT *....
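
The cost effect of pruning is simple arithmetic, sketched below. The $5/TB rate is the figure quoted above. The 10 TB table size, the 365 evenly sized daily partitions, and the binary definition of a terabyte are assumptions made for illustration.

```python
PRICE_PER_TB_USD = 5.0      # rate quoted in the answer above
BYTES_PER_TB = 1024 ** 4    # assumption: binary terabytes


def scan_cost_usd(bytes_scanned: int) -> float:
    """Cost of a query that scans the given number of bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


full_scan = 10 * BYTES_PER_TB   # hypothetical 10 TB unpartitioned table
one_day = full_scan // 365      # hypothetical: one of 365 evenly sized daily partitions

print(f"full scan:   ${scan_cost_usd(full_scan):.2f}")
print(f"one day:     ${scan_cost_usd(one_day):.2f}")
```

Under these assumptions a query filtered to one day costs cents instead of tens of dollars. A columnar format reduces the scanned bytes further on top of that.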

Question 6

Explain the use of Elastic Resize vs. Classic Resize in Redshift.

Accepted Answer

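**Elastic Resize**: Add or remove nodes, and on current Redshift versions also change node type; when only the node count changes, existing data slices are redistributed across nodes rather than fully copied; typically completes in minutes, with queries briefly paused. **Classic Resize**: Change node type, node count, or both by copying all data to a newly provisioned cluster; hours for large clusters. **Why Elastic**: Quick scale for concurrency (e.g., holiday). **Why Classic**: Migrate dc2 → ra3 (managed storage) or make a major capacity change that elastic resize cannot reach. **Scalability**: Elastic is limited to certain node counts (commonly up to 2× or down to half the current count, depending on node type); Classic supports any valid configuration....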

Question 7

How does partitioning in S3 affect Athena query performance?

Accepted Answer

**S3 layout**: s3://bucket/dt=2024-01-01/region=US/ (Hive-style key=value prefixes). **Athena**: WHERE dt='2024-01-01' skips irrelevant partitions; only objects under matching prefixes are scanned. **Cost**: $5/TB scanned; partition pruning = less scan = lower cost. **Catalog**: Partitions must be registered in the Glue Catalog, via MSCK REPAIR TABLE or ALTER TABLE ADD PARTITION run from Athena, or via a Glue crawler. **Best**: Partition by columns that appear in most filters; columnar format (Parquet)....
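
Pruning can be pictured as filtering an object listing by the partition values encoded in each key. The sketch below is a toy model, not how Athena is implemented. The object keys, sizes, and helper names are hypothetical.

```python
# Hypothetical object listing: (key, size in bytes), Hive-style layout.
objects = [
    ("dt=2024-01-01/region=US/part-0.parquet", 400),
    ("dt=2024-01-01/region=EU/part-0.parquet", 300),
    ("dt=2024-01-02/region=US/part-0.parquet", 500),
    ("dt=2024-01-02/region=EU/part-0.parquet", 200),
]


def partition_values(key: str) -> dict[str, str]:
    """Parse 'col=value' path segments into a dict of partition values."""
    return dict(seg.split("=", 1) for seg in key.split("/") if "=" in seg)


def bytes_scanned(filters: dict[str, str]) -> int:
    """Sum the sizes of objects whose partition values match every filter."""
    return sum(
        size
        for key, size in objects
        if all(partition_values(key).get(col) == val for col, val in filters.items())
    )


print(bytes_scanned({}))                                   # no filter: everything
print(bytes_scanned({"dt": "2024-01-01"}))                 # one day
print(bytes_scanned({"dt": "2024-01-01", "region": "US"})) # one day, one region
```

Each additional partition filter shrinks the set of objects read, and with it the bytes billed.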

Question 8

How does the MAXERROR parameter affect data loading in Redshift?

Accepted Answer

**MAXERROR**: COPY command parameter; the number of rejected rows at which the load aborts. MAXERROR 0 (default) = fail on first error. MAXERROR 100 = bad rows are skipped and the load continues, but the whole load fails once the error count reaches 100. **Failed rows**: Skipped, with details logged in the STL_LOAD_ERRORS system table. **Use case**: Dirty source with known error rate; don't want one bad row to block load....
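
The threshold semantics can be made concrete with a toy model. This is not Redshift code: `simulate_copy` and the validity check are hypothetical, and a real parallel COPY may report slightly more errors than the threshold before it stops.

```python
def simulate_copy(rows, is_valid, maxerror=0):
    """Toy model of COPY ... MAXERROR: returns (loaded_rows, rejected_rows).

    Raises RuntimeError once the error count reaches the threshold, in which
    case nothing is loaded. MAXERROR 0 and 1 both fail on the first bad row.
    """
    threshold = max(maxerror, 1)
    loaded, rejected = [], []
    for row in rows:
        if is_valid(row):
            loaded.append(row)
        else:
            rejected.append(row)
            if len(rejected) >= threshold:
                raise RuntimeError(f"load failed after {len(rejected)} error(s)")
    return loaded, rejected


source = ["1", "2", "x", "4"]  # hypothetical input with one malformed row

loaded, rejected = simulate_copy(source, str.isdigit, maxerror=2)
print(loaded, rejected)

try:
    simulate_copy(source, str.isdigit)  # default MAXERROR 0
except RuntimeError as exc:
    print(exc)
```

With a threshold of 2, the single bad row is skipped and the rest loads. With the default, the same input aborts the load.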

Question 9

How would you add columns to a table without impacting queries?

Accepted Answer

**Add with default**: `ALTER TABLE t ADD COLUMN new_col INT DEFAULT 0`; existing rows read the default; many systems treat this as a metadata-only change with no table rewrite. **Add at end**: Appending the column avoids rewriting existing data. **NOT NULL**: May require backfill; add as nullable first, backfill, then alter to NOT NULL. **Backfill**: Run in batches when the value must be computed per row rather than taken from a default....
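
The nullable-then-backfill sequence might look like the sketch below, with an in-memory SQLite database standing in for the real system. The `amount` column, the `amount * 2` backfill rule, and the batch size are hypothetical. The final ALTER to NOT NULL is engine-specific and is left as a comment because SQLite does not support it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO t (amount) VALUES (?)", [(n,) for n in range(10)])
conn.commit()

# Step 1: add the column as nullable; existing rows are not rewritten.
conn.execute("ALTER TABLE t ADD COLUMN new_col INTEGER")

# Step 2: backfill in small batches so each transaction stays short.
BATCH_SIZE = 4  # hypothetical; tune to the system's lock and log limits
batches = 0
while True:
    cur = conn.execute(
        """
        UPDATE t SET new_col = amount * 2
        WHERE id IN (SELECT id FROM t WHERE new_col IS NULL LIMIT ?)
        """,
        (BATCH_SIZE,),
    )
    conn.commit()
    if cur.rowcount == 0:
        break
    batches += 1

# Step 3 (engine-specific, not shown): alter new_col to NOT NULL once no NULLs remain.
remaining = conn.execute("SELECT COUNT(*) FROM t WHERE new_col IS NULL").fetchone()[0]
print(batches, remaining)
```

Readers can query the table throughout, since each batch holds its write lock only briefly.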

Question 10

How would you automate Redshift cluster scaling for peak loads?

Accepted Answer

**Concurrency Scaling**: Adds transient capacity automatically for read bursts once it is enabled on a WLM queue (concurrency scaling mode set to auto). **Compute scaling**: (1) Elastic Resize: EventBridge + Lambda calls the resize API before a known peak. (2) Schedule: Cron for recurring (e.g., month-end). (3) Spectrum: Overflow to S3 for ad-hoc. **Automation**: EventBridge rule → Lambda → resize-cluster (ResizeCluster API). Scale-down rule after peak....
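
A Lambda handler for the EventBridge-triggered resize could be sketched as follows. It assumes `boto3` is available in the Lambda runtime. The event keys `cluster_id` and `target_nodes` are a hypothetical payload shape, and the `client` parameter exists only so the function can be exercised without AWS credentials.

```python
def handler(event, context=None, client=None):
    """Resize a cluster to the node count carried in the event payload."""
    if client is None:
        import boto3  # imported lazily so the sketch can run without AWS access
        client = boto3.client("redshift")
    return client.resize_cluster(
        ClusterIdentifier=event["cluster_id"],       # hypothetical event key
        NumberOfNodes=int(event["target_nodes"]),    # hypothetical event key
        Classic=False,  # request an elastic resize rather than a classic one
    )
```

Two scheduled EventBridge rules, one before the peak with the larger node count and one after it with the smaller, would reuse the same handler with different payloads.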

Capco SQL Interview Questions
