Real interview questions asked at Nihilent. Practice the most frequently asked questions and land your next role.
Nihilent data engineering interviews test your ability across multiple domains. These questions are sourced from real Nihilent interview experiences and sorted by frequency, so practice the ones that matter most. The set leans toward fundamentals: 13 easy, 6 medium, and 11 hard questions. Recurring themes are Spark, partitioning, and joins; these patterns appear most often in real interviews and reward the deepest preparation. Many of these questions also surface at FedEx Dataworks and Datametica, so the preparation transfers across companies. The average answer takes around 1 minute to read; plan roughly 1 hour to work through the full set thoughtfully.
This collection contains 30 curated questions: 13 easy, 6 medium, and 11 hard. The strong base of fundamentals-focused questions makes it ideal for building confidence before tackling advanced topics.
The most frequently tested areas in this set are Spark (13), partitioning (11), joins (7), optimization (5), SQL (4), and ETL (3). Focusing on these topics will give you the highest return on your preparation time.
Start with the easy questions to warm up and solidify fundamentals. Medium-difficulty questions form the bulk of real interviews — spend the most time here and practice explaining your reasoning out loud. Hard questions often appear in senior and staff-level rounds; attempt them after you're comfortable with the basics. For each question, try answering before revealing the solution. Use our AI Mock Interview to simulate real interview conditions and get instant feedback on your responses.
Explain the differences between Repartition and Coalesce. When would you use each?
Explain the types of triggers in ADF, including schedule, tumbling window, and event-based triggers.
Joins and window functions - INNER, LEFT, RIGHT, FULL OUTER, ROW_NUMBER(), RANK(), DENSE_RANK()
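As a warm-up for this question, here is a minimal runnable sketch using SQLite (3.25+, which supports window functions); the `employees`/`departments` tables and their rows are invented for illustration:

```python
import sqlite3

# In-memory database with illustrative data (not from any real interview answer).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER, salary INTEGER);
INSERT INTO departments VALUES (1, 'Eng'), (2, 'Sales'), (3, 'HR');
INSERT INTO employees VALUES
  (1, 'Asha', 1, 90), (2, 'Ben', 1, 90), (3, 'Chen', 2, 70), (4, 'Dev', NULL, 60);
""")

# LEFT JOIN keeps the employee with no department (Dev, dept_id NULL);
# an INNER JOIN would drop that row.
left_rows = conn.execute("""
SELECT e.name, d.name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.id
""").fetchall()

# RANK() leaves a gap after ties, DENSE_RANK() does not,
# and ROW_NUMBER() breaks ties arbitrarily.
ranked = conn.execute("""
SELECT name, salary,
       RANK()       OVER (ORDER BY salary DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY salary DESC) AS drnk,
       ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn
FROM employees
""").fetchall()
```

With the tied salaries of 90, the third row (Chen, 70) gets RANK 3 but DENSE_RANK 2, which is the distinction interviewers usually probe.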
Can you explain the architecture of Apache Spark and its components?
Provide a detailed walkthrough of your career journey
Share examples of successful stakeholder communication
Difference between pipelines and data flows in ADF
Fabric dataflows vs. ADF dataflows
Fabric pipelines vs. ADF pipelines
Running multiple notebooks - dbutils.notebook.run()
Types of Integration Runtimes (IR) - self-hosted, Azure, SSIS
Unity Catalog - role in managing and securing data
Agile Methodologies - sprint planning, standups, retrospectives
Explain your roles and responsibilities in your current project
Highlight the tools and technologies you've used in your current project
Lakehouse vs. Warehouse
Share your journey as a Data Engineer
What role does data lineage play in your current project?
Explain techniques for ensuring data quality in cross-functional team scenarios
Python libraries - Pandas, NumPy, Matplotlib for data processing
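A small sketch of how pandas and NumPy cooperate in data processing; the column names and values are made up, and plotting with Matplotlib is omitted since it only visualizes the same frame:

```python
import numpy as np
import pandas as pd

# Illustrative sales data (invented for this sketch).
df = pd.DataFrame({
    "region": ["N", "N", "S", "S"],
    "units":  [10, 15, 7, 8],
    "price":  [2.0, 2.0, 3.0, 3.0],
})

# NumPy does the element-wise arithmetic underlying pandas columns.
df["revenue"] = np.multiply(df["units"], df["price"])

# pandas groups and aggregates for reporting.
summary = df.groupby("region")["revenue"].sum()
```

The point worth making in an interview is that pandas columns are NumPy arrays underneath, so vectorized NumPy operations apply directly to them.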
Optimization techniques - partitioning, caching, broadcast joins, bucketing
Removing duplicates - ROW_NUMBER() or DISTINCT
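The contrast this question targets can be sketched with SQLite; the `events` table is invented for illustration. DISTINCT removes only exact duplicate rows, while ROW_NUMBER() lets you keep one chosen row per key (here, the latest per user and event):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INTEGER, event TEXT, ts INTEGER);
INSERT INTO events VALUES
  (1, 'login', 100), (1, 'login', 105), (2, 'login', 101), (2, 'click', 102);
""")

# Keep only the latest row per (user_id, event): ROW_NUMBER() = 1 within each group.
deduped = conn.execute("""
SELECT user_id, event, ts FROM (
  SELECT *, ROW_NUMBER() OVER (
              PARTITION BY user_id, event ORDER BY ts DESC) AS rn
  FROM events
) WHERE rn = 1
ORDER BY user_id, event
""").fetchall()

# DISTINCT removes exact duplicate rows only; it cannot pick "latest per key".
distinct_users = conn.execute("SELECT DISTINCT user_id FROM events").fetchall()
```

Note that the two login rows for user 1 differ in `ts`, so DISTINCT would keep both; only the window-function approach collapses them.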
Serverless vs. Dedicated SQL pools
Write a query for second-highest salary using LIMIT, OFFSET, or ROW_NUMBER()
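Two of the approaches the question names, sketched against an invented `employees` table in SQLite. The question mentions ROW_NUMBER(), but with tied salaries DENSE_RANK() is the safer window choice, so this sketch uses it and notes the difference:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, salary INTEGER);
INSERT INTO employees VALUES ('A', 100), ('B', 200), ('C', 200), ('D', 150);
""")

# Approach 1: LIMIT/OFFSET on distinct salaries (skip the highest, take the next).
second_offset = conn.execute("""
SELECT DISTINCT salary FROM employees
ORDER BY salary DESC LIMIT 1 OFFSET 1
""").fetchone()[0]

# Approach 2: DENSE_RANK() over salaries; rank 2 is the second-highest.
# ROW_NUMBER() would return 200 again here because of the duplicate top salary.
second_rank = conn.execute("""
SELECT salary FROM (
  SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
  FROM employees
) WHERE rnk = 2 LIMIT 1
""").fetchone()[0]
```

Both approaches return 150 despite the tied 200s, which is the edge case interviewers usually check.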
Accumulators - shared variables that tasks can only write to (add to), with the result read on the driver
Broadcast join - how it optimizes joins
Databricks notebooks vs. Fabric notebooks - differences
Schema evolution - techniques for handling schema changes in PySpark
Writing Excel sheets to Delta tables in Databricks
Discuss designing a data pipeline for a specific use case
Get full access to 1,800+ expert answers, AI mock interviews, and personalized progress tracking.