Question 1

How can lifecycle management policies complement ADF for this task?

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Azure lifecycle management (blob storage): auto-tier (hot to cool to archive), delete old versions. Complements ADF by: (1) Reducing storage cost for old data; (2) ADF focuses on movement/transformation; lifecycle handles retention; (3) Tier before ADF reads to optimize cost....

Question 2

How does Data Flow optimize data transformations for large datasets?

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Azure Data Flow (Mapping Data Flows) optimizes by: (1) Spark-based execution—runs on Spark clusters. (2) Partitioning—supports hash, round-robin, key. (3) Optimize tab—set partition count. (4) Cache—sink to cache for reuse. (5) Schema drift—handles dynamic columns....

Question 3

What configurations are needed to pass parameters to a Databricks notebook?

Accepted Answer

**Widgets**: `dbutils.widgets.text("param1", "default")`; `value = dbutils.widgets.get("param1")`. In Run Now: Base parameters. For Jobs: task parameters map to widgets.

**JSON params**: `dbutils.widgets.text("params", "{}")`; `json.loads(dbutils.widgets.get("params"))` for complex config.

**Why Parameterize**: Same notebook for dev/prod; different dates, paths, env. Reproducibility and reuse.

**Scalability Trade-offs**: Too many widgets = clutter....

Question 4

What techniques ensure deduplication in large datasets?

Accepted Answer

**Techniques**: (1) **dropDuplicates** on keys. (2) **Window**: row_number() over (partition by key order by ts desc) — keep latest. (3) **Delta MERGE** with dedup in merge key. (4) **Hash** (e.g., md5 of cols) for idempotency. (5) **Bloom filter** for approximate.

**When**: Event streams (at-least-once), CDC, replays.

**Why**: Duplicates corrupt analytics, billing, inventory.

**Scalability Trade-offs**: Window on 1B rows = shuffle. Partition by key....

Virtusa Spark & Big Data Interview Questions

Difficulty Breakdown

Key Topics Covered

How to Use This Guide

Companies asking these questions

All 4 Questions

More Interview Prep Guides

Unlock All Expert Answers