Easy-level Spark & big data questions from real data engineering interviews.
These easy Spark & big data questions are selected from real interviews at top companies. Each question includes a detailed expert answer and a pro tip to help you nail your interview.
What is the difference between Managed and External tables in Hive/Spark?
When would you architecturally choose Dataset[T] over DataFrame in a Scala Spark pipeline, and what are the scalability and portability trade-offs? Include type-safety benefits vs. operational constraints.
What is the difference between Managed and External Tables in Databricks?
A JSON file with evolving schema needs to be ingested into a DataFrame. How would you handle new fields dynamically in PySpark without breaking the job for previous structures?
A task intermittently fails due to external API limitations. How would you configure Airflow retries and alerts to manage this situation efficiently?
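For the retry side of this question, a minimal DAG sketch (assumes Airflow 2.x; the DAG id, placeholder callable, and alert callback are hypothetical). Exponential backoff spaces retries out so a rate-limited API has time to recover, and the failure callback fires only once retries are exhausted:

```python
from datetime import timedelta
import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator

def notify_on_failure(context):
    # Hypothetical alerting hook; swap in Slack/email/pager integration.
    print(f"Task {context['task_instance'].task_id} failed after all retries")

with DAG(
    dag_id="flaky_api_ingest",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule="@daily",
    catchup=False,
) as dag:
    call_api = PythonOperator(
        task_id="call_api",
        python_callable=lambda: None,  # placeholder for the real API call
        retries=5,
        retry_delay=timedelta(minutes=1),
        retry_exponential_backoff=True,        # 1m, 2m, 4m, ... between attempts
        max_retry_delay=timedelta(minutes=30),  # cap on the backoff
        on_failure_callback=notify_on_failure,
    )
```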
Explain Accumulator and Broadcast variables in Spark.
Approaches to handling multiple tasks within a sprint?
Cache() vs Persist(): Explain the difference and use cases for caching and persisting data in Spark with memory levels.
Can you explain dynamic resource allocation in Spark? How does it help optimize job performance?
Can you explain the concept of incremental loading in Sqoop and how to use it for job processing?
Can you give a use case where Delta Live Tables would be ideal?
Can you share a time when you had to shift focus due to urgent tasks?
Cluster Resource Allocation in Spark
Compare HDFS and cloud-based storage systems in terms of scalability and performance.
Compare ORC and Parquet
Compare Spark SQL and Hive performance.
Compare Spark and MapReduce for iterative workloads
Concatenate Columns in PySpark
Controlling mappers in MapReduce
Create a DataFrame with default column types
Explain data locality in Hadoop.
Databricks Cluster Management - standalone vs YARN mode
Databricks Job Cluster and SQL Endpoint - discuss Photon
Databricks notebooks vs. Fabric notebooks - differences
Databricks vs. PySpark?
Define Airflow and explain it as a workflow orchestration tool.
Defining Tasks in DAG
Explain the differences between Delta and Parquet.
Deploying DAGs
Describe a custom EMR cluster configuration for Spark-based ETL with minimal cost.
Describe building custom JARs for Spark jobs
Describe how to pass data between tasks in Airflow using XComs.
Describe the cluster configuration used in your project, including memory allocation, number of nodes, and executor/driver settings.
Describe the role of a workflow orchestrator like Airflow in a data pipeline.
Describe your approach to managing offsets in Kafka.
Discuss Delta Logs file format and its significance.
Discuss the process of moving files in Databricks File System (DBFS).
Executor vs Driver in Spark
Explain Bronze/Silver/Gold Layers.
Explain your approach to monitoring and logging Spark jobs in AWS. What tools would you use to identify performance bottlenecks?
How do you compare the time investment and value of a task?
How do you handle bad data in Databricks?
How do you handle failures in Airflow tasks, and what retry strategies can you use?
How do you handle schema evolution in Spark, especially when reading data from sources like Parquet or Avro?
How do you prioritize your tasks in a multi-project environment?
What is Sqoop incremental import?
Sqoop command for importing multiple tables
Suppose you have a DAG that ingests data from multiple databases. How would you increase task parallelism in Airflow to improve performance without overloading the system?
Suppose you need to import 5 tables from an external RDBMS (like MySQL) into Hadoop HDFS. Write the Sqoop command.
Task Dependencies in DAG
What are Hadoop commands for Get and Merge?
What are the advantages of using Dataproc over a traditional Hadoop setup?
What are the advantages of using Delta Lake over Parquet?
What are the differences between %pip and %conda commands in Databricks?
What are the different delivery semantics in Kafka (at least-once, at-most-once, exactly-once)?
What are the different modes in which you can submit Spark jobs? Explain each.
What are the performance considerations when using Auto Loader?
What are the steps to connect to Salesforce?
What are the steps to debug a failed workflow in Databricks?
What are the steps to execute a Python file with PySpark code on an EC2 environment?
Download the complete interview prep bundle with expert answers. Study offline, on your commute, anywhere.