Question 1

What is the difference between SparkSession and SparkContext in Spark?

Accepted Answer

**SparkContext** (Spark 1.x): Low-level entry point for RDD operations. Manages cluster connections, configuration, and RDD creation. One active SparkContext per JVM. RDD-only.

**SparkSession** (Spark 2.0+): Unified entry point subsuming SparkContext, SQLContext, HiveContext, StreamingContext. Provides DataFrame, Dataset, SQL, and Structured Streaming APIs....

Question 2

How do you copy all files from one source path to target in ADF?

Accepted Answer

Options: (1) GetMetadata on folder → ForEach childItems → Copy Data with @item().Name. (2) Copy Data with wildcard (folder/*) for bulk. When: Wildcard for simple; GetMetadata+ForEach for per-file logic....

Question 3

Explain Job vs. Interactive Clusters.

Accepted Answer

**Job Clusters**: Run single workload; terminate when done. Cost-effective for batch ETL, scheduled jobs. Spin up, run, shut down.

**Interactive**: Stay alive for ad-hoc queries, notebooks. Used by analysts. Use auto-termination (e.g., 30 min idle) to control cost.

**Cost**: Job = pay per run; interactive = pay for idle. Pool clusters for faster startup with many short jobs....

Question 4

How do you increase job performance? What techniques and optimizations?

Accepted Answer

**Techniques**: (1) Partitioning—by filter columns; coalesce/repartition; (2) Predicate pushdown; partition pruning; (3) Broadcast small tables; (4) Cache reused DataFrames; (5) Right-size executors; parallelism; (6) Avoid UDFs; native functions; (7) Parquet, columnar; compaction....

Question 5

How do you run one notebook in another notebook?

Accepted Answer

**Databricks**: %run /path/to/notebook or %run /path $param=value. Variables/DataFrames from callee available in caller. Can return values. Multiple notebooks in sequence. Keep focused; document dependencies....

Question 6

How do you see files before update (history records/versioning)?

Accepted Answer

**Git**: git log, show, diff. **S3**: Versioning—list versions, get specific. **Delta Lake**: time travel—VERSION AS OF n or TIMESTAMP AS OF 'date'; DESCRIBE HISTORY. Enable versioning; define retention; document restore....

Question 7

Explain how to implement cumulative sum in SQL.

Accepted Answer

**Architectural Logic**: Window function with unbounded preceding for running total. **Implementation**: `SELECT date, amount, SUM(amount) OVER (ORDER BY date) AS cumulative_sum FROM sales`. Per group: `SUM(amount) OVER (PARTITION BY product_id ORDER BY date)`. **Frame**: Default ROWS UNBOUNDED PRECEDING. **Use**: Running balance; YTD. **Scalability**: Single pass; efficient. **Cost**: No extra scan. **Best Practice**: Deterministic ORDER BY (tie-breaker if needed); understand RANGE vs ROWS....

Question 8

Discuss Delta Logs file format and its significance.

Accepted Answer

**Why Delta Log matters**: Enables ACID, time travel, schema evolution. **Format**: JSON files in `_delta_log/`; each transaction = new file. Records add/remove file actions, metadata. **Significance**: Single source of truth for table state; concurrent readers see consistent snapshot. **Scalability trade-offs**: Log grows; VACUUM removes old files beyond retention. **Cost implications**: Log = small storage; many small transactions = many small log files....

Question 9

Explain SCD1 and SCD2 in Databricks PySpark with examples.

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

SCD1 (Slowly Changing Dimension Type 1) overwrites existing records with new data—no history. SCD2 maintains full history by adding new version rows. SCD1 in PySpark: df_target.join(df_source, 'id', 'left').select(coalesce(df_source.col, df_target.col))....

Question 10

Explain aggregation functions in PySpark with examples and use cases.

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

PySpark aggregation functions: sum(), count(), avg(), min(), max(), collect_list(), collect_set(). Use with groupBy: df.groupBy('dept').agg(sum('salary'), avg('age')). Window functions: row_number(), rank(), dense_rank(), sum() over partition. Example: df.withColumn('rank', rank().over(Window.partitionBy('dept').orderBy(desc('salary'))))....

Hexaware Data Engineer Interview Questions

Difficulty Breakdown

Key Topics Covered

How to Use This Guide

Companies asking these questions

All 13 Questions

More Interview Prep Guides

Unlock All Expert Answers