Question 1

What is the difference between SparkSession and SparkContext in Spark?

Accepted Answer

**SparkContext** (Spark 1.x): Low-level entry point for RDD operations. Manages cluster connections, configuration, and RDD creation. One active SparkContext per JVM. RDD-only.

**SparkSession** (Spark 2.0+): Unified entry point subsuming SparkContext, SQLContext, HiveContext, StreamingContext. Provides DataFrame, Dataset, SQL, and Structured Streaming APIs....

Question 2

When would you architecturally choose Dataset[T] over DataFrame in a Scala Spark pipeline, and what are the scalability and portability trade-offs? Include type-safety benefits vs. operational constraints.

Accepted Answer

DataFrame is an untyped collection of Row objects with schema at runtime; Dataset[T] is typed with compile-time safety. In Scala, DataFrame = Dataset[Row]. Architectural why: Dataset enables domain modeling (e.g., Dataset[Order])—catch errors at compile time, better IDE support, and Catalyst can optimize typed encoders. Scalability: both use Tungsten and Catalyst; Dataset adds encoder overhead but marginal for most workloads. Portability trade-off: PySpark has only DataFrame—no typed Dataset....

Question 3

Design a cost-aware resource strategy for a Databricks workload with spiky and batch jobs. Explain Dynamic Resource Allocation, when to disable it, and how min/max executors and spot instances affect cost and SLAs.

Accepted Answer

**Section 1 — The Context (The 'Why')**
Databricks workload cost explodes when clusters run idle, jobs are over-provisioned, or spot preemption causes thrashing. The challenge is aligning DPU allocation to actual parallelism while maintaining SLA....

Question 4

Accumulator and Broadcast Variables - explain

Accepted Answer

**Accumulator**: Write-only shared variable; workers add; driver reads. For counters. Fault-tolerant via lineage. **Broadcast**: Read-only; sent to all workers; avoids shuffling small lookup. Cached on executors. **Why**: Accumulators for diagnostics (not control flow—retries double-count). Broadcast for lookups <100MB. **Scalability**: Large broadcast = executor OOM....

Question 5

Describe building custom JARs for Spark jobs

Accepted Answer

**Why custom JARs**: Encapsulate dependencies; ensure version consistency. **Process**: Use sbt or Maven; package Spark + app code. Use `--packages` for Maven deps to avoid fat JAR. Submit via `spark-submit`. **Scalability trade-offs**: Shade conflicting deps; large JAR = slower distribute. **Cost implications**: Build time; deployment size....

Question 6

Describe the projects emphasizing Spark, Hadoop, or Azure for large-scale data processing

Accepted Answer

**Situation**: Led migration of 200+ pipelines from legacy Hadoop to Spark on Azure. **Task**: Reduce latency, improve reliability, cut cost. **Action**: (1) Assessed pipelines—batch vs streaming, data volume. (2) Architected medallion on Delta Lake; Spark for ETL; Synapse for warehouse. (3) Migrated in phases; CDC for incremental. (4) Established monitoring, SLAs. **Result**: 60% latency improvement; 30% cost reduction; 99.5% SLA....

Question 7

Load CSV from HDFS

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Load CSV from HDFS: `df = spark.read.option('header','true').csv('hdfs://namenode:8020/path/to/file.csv')`. Or `spark.read.format('csv').load('hdfs://...')`. For schema: `schema = StructType([...]); spark.read.schema(schema).csv(...)`....

Question 8

Memory Tuning in Spark

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Memory tuning: (1) Increase `spark.executor.memory`. (2) `spark.memory.fraction`—raise if cache-heavy. (3) Reduce `spark.memory.storageFraction` if little caching. (4) Increase partitions to reduce memory per task. (5) Use `MEMORY_AND_DISK` for overflow. (6) Off-heap for very large....

Question 9

Performance Tuning Techniques for Spark

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Spark performance tuning: Partitioning, broadcast, coalesce, AQE, caching, predicate pushdown, avoid UDFs, Z-ordering (Delta), file compaction. See answers 1546, 1596. Best practice: Measure first; apply incrementally; document.

**Scalability trade-offs**: Partition/parallelism limits; single points of failure; horizontal vs vertical scaling....

Question 10

Production Experience - deploying and monitoring Spark jobs

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Production Spark: Deploy via spark-submit, Airflow, or Databricks Jobs. Monitor: Spark UI, logs, metrics. Alert on failure. Use retries, backoff. Best practice: Idempotent jobs; version control; staging first; runbooks.

**Scalability trade-offs**: Partition/parallelism limits; single points of failure; horizontal vs vertical scaling....

LTIMindtree Spark & Big Data Interview Questions

Difficulty Breakdown

Key Topics Covered

How to Use This Guide

Companies asking these questions

All 13 Questions

More Interview Prep Guides

Unlock All Expert Answers

LTIMindtree Spark & Big Data Interview Questions

Difficulty Breakdown

Key Topics Covered

How to Use This Guide

Companies asking these questions

All 13 Questions

More Interview Prep Guides

Unlock All Expert Answers