Question 1

What is the difference between SparkSession and SparkContext in Spark?

Accepted Answer

**SparkContext** (Spark 1.x): Low-level entry point for RDD operations. Manages cluster connections, configuration, and RDD creation. One active SparkContext per JVM. RDD-only.

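**SparkSession** (Spark 2.0+): Unified entry point that wraps SparkContext and replaces SQLContext and HiveContext; the underlying context stays reachable as `spark.sparkContext`. Provides DataFrame, Dataset, SQL, and Structured Streaming APIs. The legacy DStream API still uses its own StreamingContext....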

Question 2

What is the difference between partitioning and bucketing in Spark, and when would you use bucketing?

Accepted Answer

**Partitioning**: Physically divides data by column values (e.g., date, region), with one directory per partition value; enables partition pruning. **Bucketing**: Divides data within a partition into a fixed number of files by hashing the bucketing column(s), so rows with the same key land in the same bucket. **When to bucket**: Frequent joins or group-bys on a column (e.g., user_id). When both tables are bucketed on the join key with the same bucket count, a sort-merge join can skip the shuffle....
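The co-location idea can be illustrated without a cluster. The following is a toy sketch in plain Python, not Spark code: `bucket_id`, `bucketize`, and `bucketed_join` are hypothetical helper names, and `zlib.crc32` is only a deterministic stand-in for whatever hash the engine uses internally. It shows that when both sides use the same bucket count, bucket `b` on one side only ever needs to meet bucket `b` on the other.

```python
import zlib
from collections import defaultdict


def bucket_id(key, num_buckets):
    # Stand-in hash (assumption): crc32 is used only because it is
    # deterministic across runs; it is not the hash Spark uses.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucketize(rows, num_buckets):
    # Group rows by the bucket of their first field (the join key).
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_id(row[0], num_buckets)].append(row)
    return buckets


def bucketed_join(left, right, num_buckets):
    # Same bucket count on both sides: matching keys always share a
    # bucket number, so each bucket pair is joined independently.
    left_buckets = bucketize(left, num_buckets)
    right_buckets = bucketize(right, num_buckets)
    joined = []
    for b in range(num_buckets):
        for lrow in left_buckets.get(b, []):
            for rrow in right_buckets.get(b, []):
                if lrow[0] == rrow[0]:
                    joined.append((lrow[0], lrow[1], rrow[1]))
    return joined


orders = [(1, "order-a"), (2, "order-b"), (1, "order-c")]
users = [(1, "alice"), (2, "bob"), (3, "carol")]
result = sorted(bucketed_join(orders, users, num_buckets=4))
```

If the two sides used different bucket counts, a key could fall into different bucket numbers on each side, and the per-bucket pairing above would miss matches; that is the situation in which a shuffle is needed again.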

Question 3

Write a Python function to check if a string is a palindrome.

Accepted Answer

**Architectural logic**: A palindrome reads the same forwards and backwards. We need to normalize (lowercase, drop non-alphanumeric characters) and compare. **Approach 1 (string ops)**: `cleaned = "".join(c.lower() for c in s if c.isalnum()); return cleaned == cleaned[::-1]`: O(n) time, O(n) space. **Approach 2 (two-pointer)**: Compare from both ends; O(n) time, O(1) extra space when non-alphanumeric characters are skipped in place instead of building a cleaned copy....
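Both approaches can be written out as complete functions. This is a minimal sketch; the function names are illustrative, and the two-pointer version assumes that non-alphanumeric characters should be ignored, matching the normalization in Approach 1.

```python
def is_palindrome_simple(s: str) -> bool:
    # Approach 1: build a normalized copy, then compare with its reverse.
    cleaned = "".join(c.lower() for c in s if c.isalnum())
    return cleaned == cleaned[::-1]


def is_palindrome_two_pointer(s: str) -> bool:
    # Approach 2: walk inward from both ends, skipping characters that
    # are not alphanumeric, so no cleaned copy is allocated.
    left, right = 0, len(s) - 1
    while left < right:
        if not s[left].isalnum():
            left += 1
        elif not s[right].isalnum():
            right -= 1
        else:
            if s[left].lower() != s[right].lower():
                return False
            left += 1
            right -= 1
    return True
```

For example, both functions accept `"A man, a plan, a canal: Panama"` and reject `"race a car"`. An empty string counts as a palindrome in both.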

Question 4

When would you architecturally choose Dataset[T] over DataFrame in a Scala Spark pipeline, and what are the scalability and portability trade-offs? Include type-safety benefits vs. operational constraints.

Accepted Answer

DataFrame is an untyped collection of Row objects whose schema is checked at runtime; Dataset[T] is typed and checked at compile time. In Scala, DataFrame is a type alias for Dataset[Row]. Architectural why: Dataset enables domain modeling (e.g., Dataset[Order]), so errors are caught at compile time and IDE support is better. Scalability: both use Tungsten and Catalyst, but typed lambdas (map, filter) are opaque to Catalyst and require deserializing rows into JVM objects; this encoder overhead is marginal for most workloads. Portability trade-off: PySpark has only DataFrame, with no typed Dataset....

Question 5

Design a cost-aware resource strategy for a Databricks workload with spiky and batch jobs. Explain Dynamic Resource Allocation, when to disable it, and how min/max executors and spot instances affect cost and SLAs.

Accepted Answer

**Section 1 — The Context (The 'Why')**
Databricks workload cost explodes when clusters run idle, jobs are over-provisioned, or spot preemption causes thrashing. The challenge is aligning DBU consumption with actual parallelism while maintaining the SLA....
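As a rough illustration of the min/max executor bounds the question asks about, the sketch below builds Spark properties for two profiles. The helper `executor_conf` and all numeric values are hypothetical; the property names are the open-source Spark ones, and on Databricks the equivalent bounds are usually expressed through the cluster's autoscaling settings instead.

```python
def executor_conf(min_executors, max_executors, dynamic=True):
    """Return Spark properties for an elastic or a fixed-size executor pool."""
    if min_executors > max_executors:
        raise ValueError("min_executors must not exceed max_executors")
    if dynamic:
        # Elastic pool: executors are requested and released between bounds.
        return {
            "spark.dynamicAllocation.enabled": "true",
            "spark.dynamicAllocation.minExecutors": str(min_executors),
            "spark.dynamicAllocation.maxExecutors": str(max_executors),
        }
    # Fixed pool: dynamic allocation off, executor count pinned.
    return {
        "spark.dynamicAllocation.enabled": "false",
        "spark.executor.instances": str(max_executors),
    }


# Hypothetical values: a spiky workload with wide bounds, and a batch
# job pinned to a fixed size.
spiky = executor_conf(min_executors=2, max_executors=20)
batch = executor_conf(min_executors=8, max_executors=8, dynamic=False)
```

A low minimum limits idle cost, while the maximum caps spend during a spike; where those bounds sit relative to the SLA is the trade-off the question is probing.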

Question 6

Command to Read JSON Data and Options

Accepted Answer

**Spark**: `spark.read.json('path')` or `spark.read.format('json').load('path')`. **Options**: schema (an explicit StructType avoids the schema-inference pass over the data), multiLine (for records that span several lines, such as pretty-printed JSON), dateFormat, primitivesAsString. **Nested**: Provide a schema, or parse JSON strings held in a column with from_json. **Streaming**: `readStream.schema(schema).json(path)`....
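The batch read with an explicit schema can be sketched as a small helper. `read_json` is a hypothetical name; it assumes `spark` is an already created SparkSession and passes the schema as a DDL string, which `DataFrameReader.schema` accepts alongside a StructType.

```python
def read_json(spark, path, schema_ddl, multiline=False):
    """Read JSON files with an explicit schema.

    spark:      an existing SparkSession (assumed to be created elsewhere)
    schema_ddl: a DDL string such as "id LONG, name STRING"
    multiline:  set True when one record spans several lines
    """
    reader = (
        spark.read
        .schema(schema_ddl)              # explicit schema, no inference pass
        .option("multiLine", multiline)  # needed for pretty-printed JSON
    )
    return reader.json(path)
```

A call might look like `read_json(spark, "/data/events/", "id LONG, ts TIMESTAMP", multiline=True)`, where the path and column names are placeholders.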

Question 7

Daily Data Volume - quantify

Accepted Answer

**Quantify**: Example—500GB raw, 200GB curated daily. Peak 2TB month-end. Include records and bytes. **Source**: Pipeline metrics (Glue, Dataflow). **Why**: Sizes infrastructure; capacity planning; cost....
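The example figures can be turned into a monthly estimate with simple arithmetic. The sketch below assumes a 30-day month and reads "Peak 2TB month-end" as a single 2,000 GB raw day; both are assumptions, and `monthly_volume_gb` is a hypothetical helper.

```python
def monthly_volume_gb(daily_gb, days=30, peak_gb=None, peak_days=0):
    """Estimate monthly volume from a typical day plus optional peak days."""
    normal_days = days - peak_days
    total = daily_gb * normal_days
    if peak_gb is not None:
        total += peak_gb * peak_days
    return total


raw_month = monthly_volume_gb(500)                              # 30 typical days
raw_month_with_peak = monthly_volume_gb(500, peak_gb=2000, peak_days=1)
curated_month = monthly_volume_gb(200)
```

Under these assumptions the raw layer comes to 15,000 GB for a flat month and 16,500 GB with one peak day, and the curated layer to 6,000 GB.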

Question 8

Describe a project you worked on, focusing on the data pipeline and your role.

Accepted Answer

**STAR**: **Situation**: Needed clickstream analytics. **Task**: Build pipeline. **Action**: Designed Kafka ingest, Flink processing, ClickHouse storage. Led design; implemented core components. **Result**: Sub-minute latency; 10M events/day....

LTIMindtree Data Engineer Interview Questions
