Question 1

What is the difference between SparkSession and SparkContext in Spark?

Accepted Answer

**SparkContext** (Spark 1.x): Low-level entry point for RDD operations. Manages cluster connections, configuration, and RDD creation. One active SparkContext per JVM. RDD-only.

**SparkSession** (Spark 2.0+): Unified entry point subsuming SparkContext, SQLContext, HiveContext, StreamingContext. Provides DataFrame, Dataset, SQL, and Structured Streaming APIs....

Question 2

Architecturally, how would you justify or challenge Hadoop vs. a cloud-native data lake (S3 + EMR/Databricks) for a greenfield enterprise data platform? Discuss scalability ceilings, cost model trade-offs, and operational complexity.

Accepted Answer

Hadoop's architecture centers on HDFS (storage) + YARN (resource scheduling) + MapReduce (compute model). Components: Namenode (single point for metadata—scalability ceiling for namespace); Datanodes (block storage, 128MB default blocks, 3x replication); ResourceManager/NodeManagers for CPU/memory allocation. Why it matters architecturally: HDFS scales linearly with nodes but metadata scale is bounded by Namenode heap; YARN allows multi-tenant workloads but adds operational overhead....

Question 3

How would you read data from a web API using PySpark?

Accepted Answer

PySpark has no native API source; the pattern is driver fetch or executor fetch. **Approaches**: (1) **Driver + parallelize**: data = requests.get(url).json(); df = spark.createDataFrame(data). Scales to API response size (typically MBs); driver is bottleneck. (2) **mapPartitions on executor**: Pass partition of IDs to each task; each calls API. Scales to many IDs but risks rate limiting and API abuse. (3) **Orchestrator + landing**: Airflow/Prefect fetches API → lands to S3/GCS → Spark reads....

Question 4

What is broadcasting in Spark, and why is it used? Can you give an example of its use?

Accepted Answer

Broadcasting sends a small table replica to every executor so joins execute locally without shuffle. **Why it matters**: A shuffle join moves both sides across the network; broadcast join moves only the small side once from driver to executors. **Scalability trade-off**: The small table must fit in executor memory; exceeding this causes OOM. As cluster size grows, more copies exist (N executors × table size), so large broadcasts waste memory....

Question 5

What is the difference between Managed and External Tables in Databricks?

Accepted Answer

**Managed tables**: Databricks/Spark owns both metadata and data; `DROP TABLE` deletes metadata and underlying data. **External tables**: Metadata is in the catalog; data lives in an external location (S3, ADLS, GCS); `DROP TABLE` removes only metadata; data persists. **Why it matters architecturally**: Managed tables enforce a single lifecycle for schema and data. External tables decouple storage from compute, enabling multi-engine access (Snowflake, Athena, Redshift) and shared data lakes....

Question 6

What is the difference between map and flatMap in Spark, and when would you use each?

Accepted Answer

**map**: 1 input element → 1 output element; output collection has same cardinality. **flatMap**: 1 input element → 0 or more output elements; results are flattened into a single collection. **Why it matters**: flatMap controls cardinality explosion (e.g., tokenizing a line into N words) and avoids nested structures that would require a separate explode step....

Question 7

What work is done by the executor memory in Spark?

Accepted Answer

Executor memory holds: **(1)** Cached RDD/DataFrame partitions (storage fraction), **(2)** Shuffle output written by map tasks for reduce tasks, **(3)** Working memory for task execution (joins, aggregations, sorting), **(4)** Off-heap (e.g., for native operations, Tungsten). **Why it matters**: Execution memory competes with storage; undersized executors cause spills and OOM....

Question 8

When and how do you use Broadcast Join?

Accepted Answer

**When**: One join side fits in memory across executors (typically < ~100MB, configurable via `spark.sql.autoBroadcastJoinThreshold`). **How**: Use `broadcast(df)` hint or rely on Spark auto-broadcast. Driver sends the small table to all executors; join runs locally without shuffle of the large table. **Why**: Shuffle of a large fact table is expensive (network, disk, serialization); broadcasting the dimension avoids it....

Question 9

Why is SparkSession used in Spark 2.0 and later versions?

Accepted Answer

**SparkSession** (2.0+) is the unified entry point for DataFrames, Datasets, SQL, and Structured Streaming. It subsumes SparkContext, SQLContext, HiveContext, and StreamingContext. **Why it matters**: Single API for DataFrame/SQL/Streaming reduces boilerplate and simplifies configuration. One session manages config, catalog, and context. **Architectural benefit**: Consistent behavior across Scala, Python, R; easier migration from RDD to DataFrame; Hive support via `.enableHiveSupport()`....

Question 10

Write a Python script to find the count of each word in a text file using Spark.

Accepted Answer

**Logic**: Read text → split into words → explode array to rows → group by word → count. **Code**: `from pyspark.sql import SparkSession; from pyspark.sql import functions as F; spark = SparkSession.builder.appName("WordCount").getOrCreate(); df = spark.read.text("path/to/file.txt"); words = df.select(F.explode(F.split(F.col("value"), "\s+")).alias("word")); word_counts = words.groupBy("word").count(); word_counts.show()`....

Altimetrik Spark & Big Data Interview Questions

Difficulty Breakdown

Key Topics Covered

How to Use This Guide

Companies asking these questions

All 17 Questions

More Interview Prep Guides

Practice with AI — Not Just Reading