Spark & Big Data questions from Capgemini data engineering interviews.
These Spark and big data questions are sourced from Capgemini data engineering interviews, each with an expert-level answer. The set leans toward senior-level depth: 8 of the 13 questions are tagged hard. Recurring themes are Spark, optimization, and partitioning; these patterns appear most often in real interviews and reward the deepest preparation. Many of the questions also surface at Infosys, so the preparation transfers across companies. The average answer takes about a minute to read, so plan roughly an hour to work through the full set thoughtfully.
This collection contains 13 curated questions: 4 easy, 1 medium, and 8 hard. The distribution skews toward harder problems, reflecting the depth expected in senior-level interviews.
The most frequently tested areas in this set are Spark (12 questions), optimization (7), partitioning (6), SQL (4), joins (2), and Python (2). Focusing on these topics will give you the highest return on your preparation time.
Start with the easy questions to warm up and solidify fundamentals. Medium-difficulty questions form the bulk of real interviews; spend the most time there and practice explaining your reasoning out loud. Hard questions often appear in senior- and staff-level rounds; attempt them once you're comfortable with the basics. For each question, try answering before revealing the solution. Use our AI Mock Interview to simulate real interview conditions and get instant feedback on your responses.
What is the purpose of the Bronze, Silver, and Gold layers in a data pipeline?
Adaptive Query Execution (AQE): Discuss how AQE optimizes query execution in Spark dynamically based on runtime stats.
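As a quick reference while preparing this question: AQE (Spark 3.x) is controlled through session configuration. A minimal sketch of the relevant settings, assuming an existing SparkSession named `spark`:

```python
# Sketch: enabling Adaptive Query Execution on an existing SparkSession
# (assumes a SparkSession called `spark` is already running; Spark 3.x).

# Master switch for AQE (on by default since Spark 3.2).
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Coalesce small shuffle partitions at runtime based on actual data sizes.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Split skewed partitions in sort-merge joins at runtime.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```

With these enabled, Spark re-plans stages using runtime shuffle statistics, e.g. converting a sort-merge join to a broadcast join when one side turns out to be small.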
cache() vs. persist(): Explain the difference and the use cases for caching and persisting data in Spark, including the available storage levels.
Define what a User-Defined Function (UDF) is and how to register it in PySpark.
Describe the cluster configuration used in your project, including memory allocation, number of nodes, and executor/driver settings.
Discuss how you integrated Azure services into your Spark application.
Discuss the process of moving files in Databricks File System (DBFS).
Explain the architecture of Spark, including its components such as driver, executor, and cluster manager.
List all the technologies you have worked on in your project (e.g., Spark, Hadoop, Hive, Databricks).
Solve the dataset transformation using PySpark.
Solve the grade assignment problem using a UDF in PySpark.
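One way to approach the grade-assignment question is to write the logic as a plain Python function and wrap it as a PySpark UDF. A minimal sketch, where the grade thresholds and column names are illustrative assumptions, not the interview's exact specification:

```python
# Grade-assignment logic as a plain Python function.
# Thresholds (90/80/70) are illustrative assumptions.
def assign_grade(score):
    if score is None:
        return None  # propagate nulls instead of failing
    if score >= 90:
        return "A"
    elif score >= 80:
        return "B"
    elif score >= 70:
        return "C"
    else:
        return "F"

# Wrapping it as a UDF (sketch; requires a running SparkSession and a
# DataFrame `df` with a numeric "score" column):
# from pyspark.sql import functions as F
# from pyspark.sql.types import StringType
# grade_udf = F.udf(assign_grade, StringType())
# df = df.withColumn("grade", grade_udf(F.col("score")))
```

Keeping the logic in a standalone function makes it easy to unit test before registering it as a UDF; in an interview it also lets you discuss why a built-in expression (e.g. `F.when` chains) would usually outperform a Python UDF.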
What performance optimization techniques have you applied in Spark, Sqoop, or Databricks?
Which Spark version are you using in your project, and why did you choose it?