Question 1

What is the difference between repartition and coalesce in Apache Spark?

Accepted Answer

**Repartition(n)**: Performs a full shuffle to redistribute data across exactly `n` partitions. Can increase or decrease partition count. Called with only a number, it distributes rows round-robin; called with column arguments, it hash-partitions on those columns. Either way, rows are exchanged across the network.

**Coalesce(n)**: Merges existing partitions into fewer partitions without a full shuffle. Only decreases partition count....
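The contrast can be illustrated with a toy model in plain Python. This is only a sketch, not Spark's implementation: partitions are modeled as lists, the `coalesce` and `repartition` functions below are hypothetical stand-ins, and the shuffle is assumed to deal rows round-robin.

```python
from itertools import chain

def coalesce(partitions, n):
    """Toy model: merge whole neighbouring partitions; no row leaves its group."""
    if n >= len(partitions):
        # Modeled on DataFrame coalesce, which cannot increase the partition count.
        return [list(p) for p in partitions]
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged

def repartition(partitions, n):
    """Toy model: full shuffle that deals every row round-robin into n partitions."""
    out = [[] for _ in range(n)]
    for i, row in enumerate(chain.from_iterable(partitions)):
        out[i % n].append(row)
    return out

# Four input partitions with very uneven sizes: 90, 5, 3 and 2 rows.
skewed = [list(range(90)), list(range(90, 95)), list(range(95, 98)), list(range(98, 100))]

coalesced_sizes = [len(p) for p in coalesce(skewed, 2)]
repartitioned_sizes = [len(p) for p in repartition(skewed, 2)]
print(coalesced_sizes, repartitioned_sizes)
```

In this model the coalesced partitions keep the input imbalance (95 and 5 rows), while the shuffled ones come out even (50 and 50). That is the usual trade-off: coalescing is cheaper but can leave uneven partitions.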

Question 2

Write an SQL query to find the second-highest salary from an employee table.

Accepted Answer

**Using subquery with MAX**:
```sql
SELECT MAX(salary) AS second_highest
FROM employee
WHERE salary < (SELECT MAX(salary) FROM employee);
```

**Using LIMIT/OFFSET** (MySQL, PostgreSQL):
```sql
SELECT DISTINCT salary
FROM employee
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
```

**Using DENSE_RANK** (ANSI SQL, most robust):
```sql
SELECT DISTINCT salary
FROM (
  SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk
  FROM employee
) t
WHERE rk = 2;
```
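To see how the three queries behave on edge cases, the sketch below runs them against an in-memory SQLite database from Python. It assumes a SQLite build with window-function support (3.25 or later) and a hypothetical two-column `employee` table. The DENSE_RANK variant uses `SELECT DISTINCT` so that tied salaries collapse into one row.

```python
import sqlite3

QUERIES = {
    "max_subquery": """
        SELECT MAX(salary) FROM employee
        WHERE salary < (SELECT MAX(salary) FROM employee)
    """,
    "limit_offset": """
        SELECT DISTINCT salary FROM employee
        ORDER BY salary DESC LIMIT 1 OFFSET 1
    """,
    "dense_rank": """
        SELECT DISTINCT salary FROM (
            SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk
            FROM employee
        ) t WHERE rk = 2
    """,
}

def second_highest(salaries):
    """Load the salaries into a throwaway table and run every query variant."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, salary INTEGER)")
    conn.executemany("INSERT INTO employee (salary) VALUES (?)", [(s,) for s in salaries])
    results = {name: conn.execute(sql).fetchall() for name, sql in QUERIES.items()}
    conn.close()
    return results

with_ties = second_highest([300, 300, 200, 200, 100])
single_value = second_highest([300, 300])
print(with_ties)
print(single_value)
```

With ties, all three variants return 200. When only one distinct salary exists, the MAX subquery returns a single NULL row, while the other two return no rows at all. Callers need to handle that difference.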

**Architectural Logic & Trade-offs**:
- **Subque...

Question 3

What strategies can you use to handle skewed data in Spark?

Accepted Answer

1. **Salting**: Add random suffix to skewed keys to spread load; requires two-phase aggregation.
2. **Two-phase aggregation**: Aggregate with salted key, then aggregate again without salt.
3. **Broadcast**: For small dimension tables, broadcast to avoid shuffle.
4. **Custom partitioning**: Pre-partition by known skewed keys.
5. **Increase partitions**: Spreads work but doesn't fix root cause.
6. **AQE Skew Join (Spark 3.0+)**: Automatically splits skewed partitions....
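Salting with two-phase aggregation can be sketched in plain Python, without Spark. The function name `salted_count`, the fan-out `NUM_SALTS`, and the sample keys are all illustrative assumptions. In a real job, each `(key, salt)` group would be processed by a separate task.

```python
import random
from collections import Counter

NUM_SALTS = 4  # hypothetical fan-out; tune to how hot the key is

def salted_count(keys, num_salts=NUM_SALTS, seed=0):
    rng = random.Random(seed)
    # Phase 1: count per (key, salt), so a hot key is split over up to num_salts groups.
    partial = Counter((key, rng.randrange(num_salts)) for key in keys)
    # Phase 2: drop the salt and combine the partial counts per original key.
    final = Counter()
    for (key, _salt), count in partial.items():
        final[key] += count
    return partial, final

keys = ["hot"] * 1000 + ["a"] * 10 + ["b"] * 5
partial, final = salted_count(keys)
largest_group = max(partial.values())
print(final, largest_group)
```

The final counts match an unsalted count, but the largest group handled in phase 1 is much smaller than the 1000 rows of the hot key. The extra second aggregation is the price of that.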

Question 4

Design a Delta table layout for mixed workload: point lookups by user_id, range scans by date, and full partition scans. Compare partitioning vs. Z-ordering—when to use each, and the rewrite cost trade-off.

Accepted Answer

**Section 1 — The Context (The 'Why')**
A Delta table serving point lookups (by user_id), range scans (by date), and full partition scans (analytics) faces conflicting optimization goals. Point lookups want data pruned by user_id; range and analytical scans want it partitioned by date.

**Section 2 — The Diagram**
```
[Delta] Z-Order:user_id | Partition:date
  |
  v
[Lookup][Scan][Stream]
```

**Section 3 — Component Logic**
**Partitioning strategies** use date for incremental and range pruning....

Question 5

Describe how to secure sensitive data in cloud storage solutions.

Accepted Answer

Architectural layers: (1) At rest: SSE-S3, SSE-KMS, or customer-managed keys; Azure Blob encryption. (2) In transit: TLS 1.2+; VPC endpoints to avoid the public internet. (3) Access: IAM least privilege; bucket policies; Block Public Access; Lake Formation/Unity Catalog for column-level access control. (4) Masking: tokenize PII in ETL; views with column-level security. Why layered: Defense in depth; compliance (GDPR, HIPAA). Cost: AWS KMS charges about $1 per customer-managed key per month plus roughly $0.03 per 10,000 requests; VPC endpoints reduce data transfer cost....
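One way to tokenize PII in an ETL step is deterministic keyed hashing. The sketch below is an assumption about how that could look. It uses HMAC-SHA256 from the standard library, which is pseudonymization rather than vault-based tokenization. The helper names and the demo key are hypothetical, and a real key would come from a secrets manager or KMS.

```python
import hashlib
import hmac

def tokenize(value: str, key: bytes) -> str:
    """Keyed, deterministic token: the same input and key give the same token, so joins still work."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def mask_record(record: dict, pii_fields: set, key: bytes) -> dict:
    """Return a copy of the record with the named PII fields replaced by tokens."""
    return {
        field: tokenize(str(value), key) if field in pii_fields and value is not None else value
        for field, value in record.items()
    }

DEMO_KEY = b"demo-key-load-from-a-secrets-manager"  # placeholder, never hard-code a real key
row = {"user_id": 42, "email": "ada@example.com", "amount": 19.99}
masked = mask_record(row, {"email"}, DEMO_KEY)
print(masked)
```

Because the token is deterministic, downstream tables can still be joined on the masked column. The original value cannot be recovered from the token without the key and a guess at the input.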

Question 6

What are the pros and cons of using a data lake on AWS, GCP, or Azure?

Accepted Answer

**AWS**: Mature ecosystem (S3, Glue, Athena, EMR). Strong for hybrid setups (DataSync, Snowball). Service sprawl: many options and a steeper learning curve. **GCP**: Strong BigQuery (serverless SQL); Dataflow (Beam); good integration. Smaller market share; fewer third-party integrations. **Azure**: ADLS Gen2 (hierarchical namespace); Synapse combines lake + warehouse. Strong enterprise presence; complexity in multi-service setup....

Question 7

Explain how you gather and define requirements for a complex data platform project.

Accepted Answer

**Situation**: Complex platforms have multiple stakeholders and ambiguous scope.

**Action**: (1) Stakeholder workshops—personas, use cases, success metrics; (2) Data discovery—sources, volume, quality, latency; (3) Gap analysis—current vs. desired; (4) Prioritization—MoSCoW or RICE; (5) PRD—scope, assumptions, acceptance criteria.

**BCG Style**: Align with business strategy; quantify impact. Iterative refinement—prototype early, validate....

Question 8

How would you model customer transaction data for both analytical and operational use cases?

Accepted Answer

Hybrid model with clear separation of concerns. WHY: OLTP and analytics have opposing requirements—low-latency writes vs. analytical scans. OLTP: Normalized schema (customers, accounts, transactions) with indexes; event sourcing for auditability. Analytics: Denormalized star/snowflake—fact table (transaction_id, customer_id, product_id, amount, ts) + dimensions. Bridge via CDC....

Question 9

Create a script to parse and transform a JSON file into a structured CSV.

Accepted Answer

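**Approach:** Load the file with `json.load`, flatten nested objects, and write rows with `csv.DictWriter`, or let `pd.json_normalize` do the flattening. **Nested:** Flatten nested objects into dotted column names, or expand arrays into one row per element. **Why:** Analytics tools need tabular data. **Production:** Validate the schema and handle missing keys. **Cost:** For large JSON files, stream or chunk instead of loading everything into memory.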
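A minimal standard-library sketch of the approach follows. It assumes the input is a JSON array of objects (or a single object) that fits in memory. The helper names `flatten` and `json_to_csv` and the dotted column naming are illustrative choices, and lists are kept as JSON text in one cell rather than expanded into rows.

```python
import csv
import json

def flatten(obj, parent_key="", sep="."):
    """Flatten nested dicts into one level: {"a": {"b": 1}} -> {"a.b": 1}."""
    flat = {}
    for key, value in obj.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        elif isinstance(value, list):
            flat[full_key] = json.dumps(value)  # keep lists as JSON text in a single cell
        else:
            flat[full_key] = value
    return flat

def json_to_csv(json_path, csv_path):
    """Read a JSON file, flatten each record, and write a CSV; returns the column names."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    if isinstance(records, dict):
        records = [records]  # allow a single top-level object
    rows = [flatten(record) for record in records]
    # Union of all keys in first-seen order, so records with missing keys still fit.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return fieldnames
```

Calling `json_to_csv("input.json", "output.csv")` (both paths are placeholders) writes one row per record and leaves cells empty where a key is missing. For files too large for memory, a streaming parser or line-delimited JSON would be needed instead of `json.load`.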

Question 10

Compare Redshift, BigQuery, and Snowflake in terms of cost, performance, and scalability.

Accepted Answer

**Architectural Logic**: Each warehouse optimizes for different workload and cost profiles. **Redshift**: Cluster-based; provisioned nodes; predictable cost for stable workloads. Manual scaling; strong for high-volume, consistent batch. Requires tuning (sort keys, distribution keys). **BigQuery**: Serverless; on-demand pricing billed by data scanned per query; auto-scales to zero. Best for variable, ad-hoc analytics; no provisioning. **Snowflake**: Hybrid; compute and storage separate; multi-cloud....

BCG Data Engineer Interview Questions
