Question 1

What is the difference between repartition and coalesce in Apache Spark?

Accepted Answer

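**Repartition(n)**: Performs a full shuffle to redistribute data across exactly `n` partitions. Can increase or decrease the partition count. Without partitioning columns, rows are distributed round-robin; when columns are passed (e.g. `repartition(n, col)`), rows are hash-partitioned on those columns. Either way, all rows are exchanged across the network.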

**Coalesce(n)**: Merges existing partitions into fewer partitions without a full shuffle. Only decreases partition count....
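The difference in data movement can be illustrated with a toy model in plain Python. This is a sketch, not Spark's implementation: the functions `coalesce` and `repartition` below are hypothetical stand-ins, and the contiguous grouping rule is an assumption (Spark picks groups with data locality in mind).

```python
from typing import List

Partition = List[int]


def coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Merge whole partitions into n groups; no partition is ever split."""
    if n >= len(partitions):
        return partitions  # coalesce cannot increase the partition count
    merged: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Assumed rule: partition i is appended wholesale to one target group.
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Redistribute every row round-robin across n new partitions (full shuffle)."""
    out: List[Partition] = [[] for _ in range(n)]
    position = 0
    for part in partitions:
        for row in part:
            out[position % n].append(row)
            position += 1
    return out


# Four uneven input partitions, 12 rows in total.
parts = [[1, 2, 3, 4, 5, 6, 7], [8], [9, 10, 11], [12]]

coalesced = coalesce(parts, 2)
shuffled = repartition(parts, 2)

print([len(p) for p in coalesced])  # sizes inherit the skew of the inputs
print([len(p) for p in shuffled])   # sizes are balanced
```

In this model the merged partitions keep whatever imbalance the inputs had, while the round-robin shuffle evens the sizes out at the cost of moving every row.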

Question 2

Write an SQL query to find the second-highest salary from an employee table.

Accepted Answer

**Using subquery with MAX**:
```sql
SELECT MAX(salary) AS second_highest
FROM employee
WHERE salary < (SELECT MAX(salary) FROM employee);
```

**Using LIMIT/OFFSET** (MySQL, PostgreSQL):
```sql
SELECT DISTINCT salary
FROM employee
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
```

**Using DENSE_RANK** (ANSI SQL, most robust):
```sql
-- DISTINCT collapses ties: several employees may share the second-highest salary
SELECT DISTINCT salary
FROM (
  SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk
  FROM employee
) t
WHERE rk = 2;
```
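To compare the three approaches on the same data, here is a minimal sketch using Python's built-in `sqlite3` module. The sample rows are made up, and the DENSE_RANK variant uses `SELECT DISTINCT` so that ties at rank 2 yield a single row. Window functions assume SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, salary INTEGER)")
# Made-up sample data with ties at both the highest and second-highest salary.
conn.executemany(
    "INSERT INTO employee (salary) VALUES (?)",
    [(120,), (120,), (100,), (100,), (80,)],
)

queries = {
    "max_subquery": """
        SELECT MAX(salary) FROM employee
        WHERE salary < (SELECT MAX(salary) FROM employee)
    """,
    "limit_offset": """
        SELECT DISTINCT salary FROM employee
        ORDER BY salary DESC LIMIT 1 OFFSET 1
    """,
    "dense_rank": """
        SELECT DISTINCT salary FROM (
            SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk
            FROM employee
        ) t WHERE rk = 2
    """,
}

# Run every variant against the same table and collect the raw rows.
results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
for name, rows in results.items():
    print(name, rows)
```

The variants agree when a second-highest salary exists. They differ when it does not: the MAX subquery returns a single row containing NULL, while the other two return no rows.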

**Architectural Logic & Trade-offs**:
- **Subque...

Question 3

What strategies can you use to handle skewed data in Spark?

Accepted Answer

**1. Salting**: Add a random suffix to skewed keys to spread the load; requires two-phase aggregation.

**2. Two-phase aggregation**: Aggregate on the salted key, then aggregate again with the salt removed.

**3. Broadcast**: For small dimension tables, broadcast the table to avoid a shuffle.

**4. Custom partitioning**: Pre-partition by known skewed keys.

**5. Increase partitions**: Spreads the work but doesn't fix the root cause.

**6. AQE Skew Join (Spark 3.0+)**: Automatically splits skewed partitions....
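Salting combined with two-phase aggregation can be sketched in plain Python, without a Spark cluster. This is a minimal illustration of the idea, not Spark code: `NUM_SALTS` and the sample records are hypothetical, and each `(key, salt)` group stands in for the work one task would receive.

```python
import random
from collections import Counter, defaultdict

random.seed(0)  # fixed seed so the sketch is reproducible

NUM_SALTS = 8  # hypothetical salt range; tune to the degree of skew

# Skewed input: one hot key dominates.
records = [("hot", 1)] * 10_000 + [("a", 1)] * 100 + [("b", 1)] * 100

# Phase 1: aggregate on (key, salt) so the hot key is split across several groups.
partial = defaultdict(int)
for key, value in records:
    salt = random.randrange(NUM_SALTS)
    partial[(key, salt)] += value

# Phase 2: drop the salt and aggregate the partial results.
final = defaultdict(int)
for (key, _salt), subtotal in partial.items():
    final[key] += subtotal

# Every value is 1, so each partial sum equals the number of rows in that group.
largest_unsalted_group = max(Counter(k for k, _ in records).values())
largest_salted_group = max(partial.values())
print(dict(final))
print(largest_unsalted_group, largest_salted_group)
```

The final totals match a direct aggregation, but the largest group any single worker would process in phase 1 is much smaller than the unsalted hot key.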

Question 4

Design a Delta table layout for mixed workload: point lookups by user_id, range scans by date, and full partition scans. Compare partitioning vs. Z-ordering—when to use each, and the rewrite cost trade-off.

Accepted Answer

**Section 1 — The Context (The 'Why')**
A Delta table serving point lookups (by user_id) and full scans (analytics) faces conflicting optimization. Point lookups want partition pruning by user; analytics want date partitioning.

**Section 2 — The Diagram**
```
[Delta] Z-Order:user_id | Partition:date
  |
  v
[Lookup][Scan][Stream]
```

**Section 3 — Component Logic**
**Partitioning strategies** use date for incremental and range pruning....
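How the two mechanisms combine can be shown with a toy model: partition pruning removes whole date directories, and per-file min/max statistics on `user_id` (which clustering such as Z-ordering makes tight) let a lookup skip files. This is a sketch under stated assumptions, not Delta Lake's implementation: `DataFile`, `files_to_read`, and the file layout are hypothetical, and the date filter is an equality check for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataFile:
    date: str        # partition value (directory level)
    min_user: int    # per-file min/max statistics on user_id
    max_user: int


def files_to_read(files: List[DataFile], date: Optional[str] = None,
                  user_id: Optional[int] = None) -> List[DataFile]:
    """Return the files a query must open after pruning and skipping."""
    selected = []
    for f in files:
        if date is not None and f.date != date:
            continue  # partition pruning: the whole date directory is ignored
        if user_id is not None and not (f.min_user <= user_id <= f.max_user):
            continue  # data skipping: user_id cannot be in this file
        selected.append(f)
    return selected


# Hypothetical layout: 3 date partitions, each with 4 files clustered by user_id.
files = [
    DataFile(date=d, min_user=lo, max_user=lo + 249)
    for d in ("2024-01-01", "2024-01-02", "2024-01-03")
    for lo in (0, 250, 500, 750)
]

print(len(files_to_read(files)))                                  # full scan
print(len(files_to_read(files, date="2024-01-02")))               # partition scan
print(len(files_to_read(files, user_id=600)))                     # point lookup, all dates
print(len(files_to_read(files, date="2024-01-02", user_id=600)))  # lookup within one date
```

The skipping only works while files hold narrow, non-overlapping `user_id` ranges. Keeping them that way as new data arrives is what drives the rewrite cost of re-clustering.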

Question 5

Describe how to secure sensitive data in cloud storage solutions.

Accepted Answer

Architectural layers: (1) At rest: SSE-S3, SSE-KMS, or customer-managed keys on AWS; Azure Blob Storage encryption on Azure. (2) In transit: TLS 1.2+; VPC endpoints to keep traffic off the public internet. (3) Access: IAM least privilege; bucket policies; Block Public Access; Lake Formation or Unity Catalog for column-level permissions. (4) Masking: tokenize PII in ETL; views with column-level security. Why layered: defense in depth; compliance (GDPR, HIPAA). Cost: AWS KMS charges roughly $1 per customer-managed key per month plus a small per-request fee; gateway VPC endpoints for S3 can reduce data transfer cost by avoiding NAT gateway charges....
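The masking layer can be illustrated with a small sketch of deterministic tokenization in an ETL step, using only the Python standard library. The field names, the `tokenize` and `mask_record` helpers, and the hard-coded key are all hypothetical; a real key would be fetched from a secrets manager or KMS.

```python
import hashlib
import hmac

# Hypothetical secret for illustration only; never keep a real key in source code.
TOKEN_KEY = b"example-only-key"

PII_FIELDS = {"email", "ssn"}  # hypothetical column names


def tokenize(value: str, key: bytes = TOKEN_KEY) -> str:
    """Keyed, deterministic token: equal inputs give equal tokens, so joins still work."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record: dict) -> dict:
    """Replace PII fields with tokens and pass other fields through unchanged."""
    return {
        field: tokenize(str(value)) if field in PII_FIELDS else value
        for field, value in record.items()
    }


row = {"user_id": 42, "email": "ada@example.com", "ssn": "123-45-6789", "country": "UK"}
masked = mask_record(row)
print(masked)
```

An HMAC token is one-way, so this is pseudonymization rather than reversible tokenization. If the original value must be recoverable, a token vault or format-preserving encryption would be needed instead.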

Question 6

What are the pros and cons of using a data lake on AWS, GCP, or Azure?

Accepted Answer

**AWS**: Mature ecosystem (S3, Glue, Athena, EMR). Strong for hybrid (DataSync, Snowball). Service sprawl—many options, steeper learning. **GCP**: Strong BigQuery (serverless SQL); Dataflow (Beam); good integration. Smaller market share; fewer third-party integrations. **Azure**: ADLS Gen2 (hierarchical namespace); Synapse combines lake + warehouse. Strong enterprise presence; complexity in multi-service setup....

Question 7

Explain how you gather and define requirements for a complex data platform project.

Accepted Answer

**Situation**: Complex platforms have multiple stakeholders, ambiguous scope.

**Action**: (1) Stakeholder workshops—personas, use cases, success metrics; (2) Data discovery—sources, volume, quality, latency; (3) Gap analysis—current vs. desired; (4) Prioritization—MoSCoW or RICE; (5) PRD—scope, assumptions, acceptance criteria.
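For the prioritization step, RICE is commonly computed as reach × impact × confidence ÷ effort. A minimal sketch, with hypothetical backlog items and made-up estimates:

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    name: str
    reach: float       # e.g. users or teams affected per quarter
    impact: float      # relative scale, e.g. 0.25 (minimal) to 3 (massive)
    confidence: float  # 0.0 to 1.0
    effort: float      # person-months

    @property
    def rice(self) -> float:
        # RICE score: higher means more value per unit of effort.
        return self.reach * self.impact * self.confidence / self.effort


# Hypothetical backlog items; every number here is invented for illustration.
backlog = [
    Requirement("Self-serve sales dashboard", reach=200, impact=2, confidence=0.8, effort=4),
    Requirement("Real-time fraud feed", reach=50, impact=3, confidence=0.5, effort=6),
    Requirement("Data quality alerts", reach=120, impact=1, confidence=0.9, effort=2),
]

ranked = sorted(backlog, key=lambda r: r.rice, reverse=True)
for r in ranked:
    print(f"{r.name}: {r.rice:.1f}")
```

The score is only as good as the estimates behind it, so it works best as a starting point for the stakeholder discussion rather than a final ranking.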

**BCG Style**: Align with business strategy; quantify impact. Iterative refinement—prototype early, validate....

Question 8

How would you model customer transaction data for both analytical and operational use cases?

Accepted Answer

Hybrid model with clear separation of concerns. WHY: OLTP and analytics have opposing requirements—low-latency writes vs. analytical scans. OLTP: Normalized schema (customers, accounts, transactions) with indexes; event sourcing for auditability. Analytics: Denormalized star/snowflake—fact table (transaction_id, customer_id, product_id, amount, ts) + dimensions. Bridge via CDC....
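The analytical side can be sketched as a tiny star schema in an in-memory SQLite database. The fact table columns follow the answer above; `dim_customer.segment` and all sample rows are hypothetical, and the CDC feed that would populate these tables is not modelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension: one row per customer (segment is a hypothetical attribute).
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, segment TEXT);
    -- Fact: one row per transaction.
    CREATE TABLE fact_transaction (
        transaction_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        product_id INTEGER,
        amount REAL,
        ts TEXT
    );
""")
conn.executemany("INSERT INTO dim_customer VALUES (?, ?)",
                 [(1, "retail"), (2, "retail"), (3, "enterprise")])
conn.executemany("INSERT INTO fact_transaction VALUES (?, ?, ?, ?, ?)", [
    (10, 1, 100, 20.0, "2024-01-01T09:00:00"),
    (11, 2, 101, 30.0, "2024-01-01T10:00:00"),
    (12, 3, 100, 500.0, "2024-01-02T11:00:00"),
    (13, 1, 102, 50.0, "2024-01-02T12:00:00"),
])

# Typical analytical query: aggregate the fact table, slice by a dimension attribute.
revenue_by_segment = conn.execute("""
    SELECT d.segment, SUM(f.amount) AS revenue, COUNT(*) AS txns
    FROM fact_transaction f
    JOIN dim_customer d USING (customer_id)
    GROUP BY d.segment
    ORDER BY revenue DESC
""").fetchall()
print(revenue_by_segment)
```

The operational store would keep the normalized tables and indexes for low-latency writes, while a shape like this one serves scans and aggregations.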

BCG Data Engineer Interview Questions
