Spark & Big Data questions from TCS data engineering interviews.
These Spark and big data questions are sourced from TCS data engineering interviews, and each comes with an expert-level answer. The set leans toward senior-level depth: 12 of the 19 questions are tagged hard. Recurring themes are partitioning, optimization, and Spark, the patterns that appear most often in real interviews and reward the deepest preparation. Many of these questions also surface at Meesho, so the preparation transfers across companies. The average answer takes about a minute to read, so plan roughly an hour to work through the full set thoughtfully.
This collection contains 19 curated questions: 5 easy, 2 medium, and 12 hard. The distribution skews toward harder problems, reflecting the depth expected in senior-level interviews.
The most frequently tested areas in this set are partitioning (12 questions), optimization (10), Spark (6), SQL (3), joins (2), and Python (2). Focusing on these topics will give you the highest return on your preparation time.
Start with the easy questions to warm up and solidify fundamentals. Medium-difficulty questions form the bulk of real interviews, so spend the most time there and practice explaining your reasoning out loud. Hard questions often appear in senior and staff-level rounds; attempt them once you're comfortable with the basics. For each question, try answering before revealing the solution. Use our AI Mock Interview to simulate real interview conditions and get instant feedback on your responses.
Design a fault-tolerant Spark Streaming checkpoint strategy: what to persist, recovery semantics, and cost/scalability trade-offs with checkpoint frequency.
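One way to frame an answer: Structured Streaming persists an offset log (written before each micro-batch runs) and a commit log (written after the sink succeeds), and recovery replays the first batch that has offsets but no commit. The following is a toy file-based model of those semantics for reasoning about the question, not the actual Spark implementation:

```python
import json
import os
import tempfile

def write_offset(checkpoint_dir: str, batch_id: int, offsets: dict) -> None:
    """Toy offset log: record the source offsets intended for this batch
    *before* processing starts (Spark writes offsets/<batchId> similarly)."""
    os.makedirs(os.path.join(checkpoint_dir, "offsets"), exist_ok=True)
    with open(os.path.join(checkpoint_dir, "offsets", str(batch_id)), "w") as f:
        json.dump(offsets, f)

def commit(checkpoint_dir: str, batch_id: int) -> None:
    """Toy commit log: mark the batch as fully written to the sink."""
    os.makedirs(os.path.join(checkpoint_dir, "commits"), exist_ok=True)
    open(os.path.join(checkpoint_dir, "commits", str(batch_id)), "w").close()

def next_batch_to_run(checkpoint_dir: str) -> int:
    """On restart, rerun the first batch whose offsets exist but whose commit
    marker does not: that batch may have half-finished before the crash."""
    offsets = os.listdir(os.path.join(checkpoint_dir, "offsets"))
    commits = set(os.listdir(os.path.join(checkpoint_dir, "commits")))
    pending = sorted(int(b) for b in offsets if b not in commits)
    return pending[0] if pending else len(offsets)

ckpt = tempfile.mkdtemp()
write_offset(ckpt, 0, {"topic-partition-0": 100})
commit(ckpt, 0)                                   # batch 0 finished cleanly
write_offset(ckpt, 1, {"topic-partition-0": 200}) # crash before commit
print(next_batch_to_run(ckpt))                    # 1: batch 1 is replayed
```

This write-ahead pattern is also why checkpointing more frequently (smaller micro-batches) tightens recovery time at the cost of more small-file writes to the checkpoint store.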
Can you give a use case where Delta Live Tables would be ideal?
Explain Delta Live Tables and their features, such as declarative pipeline definition and automatic data validation.
Explain data encryption in Databricks, both at rest and in transit.
Explain the architecture of Databricks, including the control plane and data plane.
How do Delta Live Tables ensure data quality during transformations?
How do you implement row and column-level security in Databricks?
How do you move a Databricks notebook to higher environments?
How does Auto Loader avoid reloading files with the same name?
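The short answer is that Auto Loader records each ingested file's path in a RocksDB store under the stream's checkpoint location and skips paths it has already seen (unless `cloudFiles.allowOverwrites` is enabled). A toy Python illustration of that idempotent-ingest idea, not Databricks code:

```python
class FileTracker:
    """Toy model of Auto Loader's ingest log: remember processed paths and
    skip any path seen before. The real implementation persists this state
    in RocksDB under the checkpoint location, so it survives restarts."""

    def __init__(self):
        self._seen = set()

    def new_files(self, discovered):
        """Return only the paths not processed in any earlier listing."""
        fresh = [p for p in discovered if p not in self._seen]
        self._seen.update(fresh)
        return fresh

tracker = FileTracker()
print(tracker.new_files(["s3://bucket/a.json", "s3://bucket/b.json"]))
print(tracker.new_files(["s3://bucket/a.json", "s3://bucket/c.json"]))  # a.json skipped
```

A good follow-up point in an interview: because dedup is keyed on the path, re-uploading changed content under the same name is ignored by default, which is exactly what the `allowOverwrites` option exists to change.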
How does Databricks integrate with external storage systems?
How would you read a large file (e.g., 15GB) efficiently in Spark by increasing parallelism?
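A rough mental model for the 15GB question: for splittable formats, Spark plans about one input partition per `spark.sql.files.maxPartitionBytes` chunk (128 MB by default), so lowering that setting increases read parallelism. A minimal sketch of the arithmetic, ignoring `spark.sql.files.openCostInBytes` and file-boundary effects:

```python
import math

def estimate_input_partitions(file_size_bytes: int,
                              max_partition_bytes: int = 128 * 1024 * 1024) -> int:
    """Rough estimate of how many input splits Spark creates for one large
    splittable file, mirroring spark.sql.files.maxPartitionBytes (128 MB
    default). The real planner also folds in open cost and file boundaries."""
    return math.ceil(file_size_bytes / max_partition_bytes)

fifteen_gb = 15 * 1024 ** 3
print(estimate_input_partitions(fifteen_gb))                  # 120 partitions
print(estimate_input_partitions(fifteen_gb, 64 * 1024 ** 2))  # 240 at 64 MB
```

The practical takeaway: 120 partitions keeps a 40-core cluster busy for three waves of tasks; if tasks are CPU-bound and short, shrinking the partition target (or repartitioning after the read) raises parallelism without changing the data.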
What are the differences between %pip and %conda commands in Databricks?
What are the performance considerations when using Auto Loader?
What are the steps to debug a failed workflow in Databricks?
What determines the maximum parallelism achievable in Databricks?
What happens if the checkpoint location is accidentally deleted?
What is Databricks Auto Loader, and how does it handle new files?
What is the importance of the checkpoint location in Databricks?
What role does executor memory and CPU configuration play in maximizing parallelism?
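For the two parallelism questions above, the ceiling is set by cluster shape: Spark runs at most one task per executor core (with the default `spark.task.cpus = 1`), so concurrent tasks = executors × cores per executor. A minimal sketch of that bound:

```python
def max_concurrent_tasks(num_executors: int, cores_per_executor: int) -> int:
    """Upper bound on simultaneously running tasks: one task per executor
    core, assuming the default spark.task.cpus = 1. More partitions than
    this just queue up in waves; executor memory then decides whether each
    task's partition fits without spilling."""
    return num_executors * cores_per_executor

# e.g., 10 executors x 4 cores each = 40 tasks in flight at once; a common
# rule of thumb sizes partition counts at 2-4x this number so stragglers
# don't leave cores idle at the end of a stage.
print(max_concurrent_tasks(10, 4))
```

Memory enters the same answer indirectly: oversized executors with many cores share one heap, so too many tasks per JVM can cause GC pressure and shuffle spill even when the core math looks fine.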
Get full access to 1,800+ expert answers, AI mock interviews, and personalized progress tracking.