Question 1

How do you optimize Spark jobs for better performance? Mention at least 5 techniques.

Accepted Answer

1) Broadcast joins for small tables: ship the small side to every executor so the large side is not shuffled.
2) Predicate pushdown: filter at the source (Parquet/ORC) to reduce the data scanned.
3) Partition tuning: set `spark.sql.shuffle.partitions` to roughly 2–4× the total executor cores; align partition columns with common filter and join keys.
4) Cache only when a result is reused; unpersist when done to free memory.
5) Prefer Spark SQL and built-in functions over UDFs so Catalyst can optimize the plan.
6) Skew handling: salted keys, or the AQE skew-join optimization.
7) Kryo serialization for RDD workloads instead of the default Java serializer.
8) Coalesce before writing to avoid many small files....
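
The salted-key technique from item 6 can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function names, the `salt` column, and the default of 8 buckets are assumptions made for the example, and `large` and `small` stand for any two PySpark DataFrames that share the join key. SQL expression strings are used so the sketch needs no imports.

```python
def large_side_salt_expr(buckets: int) -> str:
    """SQL expression giving each row a random bucket id in [0, buckets)."""
    return f"cast(floor(rand() * {buckets}) AS int) AS salt"


def small_side_salt_expr(buckets: int) -> str:
    """SQL expression replicating each row once per bucket id."""
    return f"explode(sequence(0, {buckets - 1})) AS salt"


def salted_join(large, small, key: str, buckets: int = 8):
    """Join a skewed `large` DataFrame to `small` on `key` using salting.

    `salt` is a hypothetical helper column; it is dropped from the result.
    """
    if buckets < 1:
        raise ValueError("buckets must be at least 1")
    # Large side: a hot key is spread across `buckets` distinct (key, salt) pairs.
    large_salted = large.selectExpr("*", large_side_salt_expr(buckets))
    # Small side: every row is copied once per bucket so each pair finds a match.
    small_salted = small.selectExpr("*", small_side_salt_expr(buckets))
    return large_salted.join(small_salted, on=[key, "salt"], how="inner").drop("salt")
```

Salting multiplies the small side by `buckets`, so it only pays off when one key dominates. With AQE enabled, the built-in skew-join handling may make manual salting unnecessary.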

Question 2

How do you optimize Spark jobs for performance?

Accepted Answer

Optimization is a hierarchy: (1) **Reduce data scanned**: partition pruning, predicate pushdown, column pruning; this is the biggest lever. (2) **Reduce shuffle**: broadcast small tables, avoid unnecessary repartitions, co-locate joins. (3) **Right-size parallelism**: set `spark.sql.shuffle.partitions` to roughly 2–4× the total cores; too many partitions add task overhead, too few leave cores idle. (4) **Avoid serialization hot paths**: use built-in functions over UDFs; a UDF is a black box to Catalyst and forces row-by-row evaluation, either in Python (with JVM-to-Python serialization) or in slower JVM code....
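
The four levels above might look like this in a single job. It is a sketch under stated assumptions: `orders` (large) and `countries` (small) are hypothetical DataFrames, the column names are invented for illustration, and the multiplier of 3 is just the midpoint of the 2–4× rule of thumb.

```python
def suggested_shuffle_partitions(total_cores: int, factor: int = 3) -> int:
    """Rule-of-thumb partition count: `factor` times the total executor cores."""
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    return total_cores * factor


def revenue_by_country(spark, orders, countries, total_cores: int):
    """Hypothetical job: total order amount per country since 2024."""
    # (3) Right-size parallelism for the shuffle behind the aggregation.
    spark.conf.set(
        "spark.sql.shuffle.partitions",
        str(suggested_shuffle_partitions(total_cores)),
    )

    # (1) Reduce data scanned: filter and select early so the reader can
    # push the predicate down and skip unused columns.
    recent = orders.filter("order_date >= '2024-01-01'").select("country_code", "amount")

    # (2) Reduce shuffle: hint that the small table should be broadcast.
    joined = recent.join(countries.hint("broadcast"), on="country_code", how="left")

    # (4) Built-in aggregate instead of a UDF, so Catalyst can optimize it.
    return joined.groupBy("country_name").sum("amount")
```

The partition count is a starting point only; with AQE enabled, Spark may coalesce shuffle partitions at runtime.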

Question 3

Concatenate Columns in PySpark

Accepted Answer

**Methods**: `concat(col1, col2)` or `concat_ws(sep, col1, col2)`. `concat_ws` skips nulls; `concat` returns null if any input is null. For mixed types, cast to string first. **Example**: `df.withColumn("full_name", concat_ws(" ", col("first_name"), col("last_name")))`. **Why it matters**: String operations can be expensive at scale; the built-ins are optimized. **Scalability trade-offs**: A row-wise UDF for concatenation is slow; use the built-ins. **Cost implications**: Trivial for small tables; for large tables, avoid UDFs....
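
To make the null handling and the cast-first advice concrete, here is a small sketch that uses SQL expression strings so it needs no imports. The column names (`first_name`, `last_name`, `customer_id`) and the function name are assumptions for the example.

```python
def add_labels(df):
    """Add two derived string columns to a hypothetical customer DataFrame."""
    return df.selectExpr(
        "*",
        # concat_ws skips null inputs, so a missing last_name still
        # yields the first name on its own.
        "concat_ws(' ', first_name, last_name) AS full_name",
        # concat returns NULL if any input is NULL; the numeric id is
        # cast to string explicitly before concatenation.
        "concat(cast(customer_id AS string), '-', last_name) AS customer_key",
    )
```

If `customer_key` must never be null, wrapping the nullable input in `coalesce(last_name, '')` is one option.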

Question 4

Executor vs Driver in Spark

Accepted Answer

**Why the distinction matters**: Driver OOM and executor OOM have different causes and fixes. **Driver**: A single process that runs main(), builds the DAG, schedules tasks, and collects results. It is the bottleneck for collect() and large take() calls. **Executor**: Worker processes that run tasks and store cached data; they do the heavy lifting. **Scalability trade-offs**: The driver is a single point; executors scale out. collect() on a large dataset causes driver OOM. **Cost implications**: An oversized driver wastes resources; an undersized one runs out of memory....
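
The usual ways to keep a large result off the driver can be sketched as three small helpers. The function names and the default row limit are illustrative assumptions; `df` is any PySpark DataFrame.

```python
def preview_rows(df, max_rows: int = 20):
    """Bring at most `max_rows` rows to the driver instead of the full dataset."""
    return df.limit(max_rows).collect()


def iter_rows(df):
    """Iterate rows on the driver one partition at a time.

    The driver then needs memory for the largest partition, not the whole result.
    """
    return df.toLocalIterator()


def export_rows(df, path: str):
    """Keep the data on the executors: write it out instead of collecting it."""
    df.write.mode("overwrite").parquet(path)
```

Of the three, `export_rows` is the only one whose cost does not grow on the driver as the data grows.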

Question 5

How to Connect to Salesforce Without Typing Credentials Manually

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

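Connect to Salesforce without manual credentials: (1) OAuth 2.0: use a Connected App and store the refresh token in a secrets store (Databricks Secrets, AWS Secrets Manager). (2) Named Credentials: a Salesforce feature that stores an endpoint and its authentication for callouts; it applies when Salesforce itself makes the call. (3) Server-to-server (JWT bearer flow): for server integrations with no interactive login....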
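
Option (1) can be sketched with the standard library alone. This is a minimal sketch of the OAuth 2.0 refresh-token grant against Salesforce's token endpoint, not production code. The environment variable names are assumptions standing in for whatever secrets store is used, and an org with a custom domain would pass its own `token_url`.

```python
import json
import os
import urllib.parse
import urllib.request

TOKEN_URL = "https://login.salesforce.com/services/oauth2/token"


def credentials_from_env():
    """Read secrets injected by a secrets manager (variable names are hypothetical)."""
    return (
        os.environ["SF_CLIENT_ID"],
        os.environ["SF_CLIENT_SECRET"],
        os.environ["SF_REFRESH_TOKEN"],
    )


def build_refresh_request(client_id, client_secret, refresh_token, token_url=TOKEN_URL):
    """Build the form-encoded POST that exchanges a refresh token for an access token."""
    body = urllib.parse.urlencode({
        "grant_type": "refresh_token",
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
    }).encode("ascii")
    return urllib.request.Request(token_url, data=body, method="POST")


def fetch_access_token(request):
    """Send the request and return (access_token, instance_url) from the JSON reply."""
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return payload["access_token"], payload["instance_url"]
```

A job would call `fetch_access_token(build_refresh_request(*credentials_from_env()))` at startup, so no credential is typed or hard-coded.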

Question 6

Partition and Save as Parquet in PySpark

Accepted Answer

Partition and save Parquet: `df.write.mode('overwrite').partitionBy('year','month').parquet('path')`. For a single partition column: `partitionBy('date')`. To limit the number of output files, call `coalesce(4)` on the DataFrame before `.write`....
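
Wrapped in a function, the write above might look like the sketch below. The function and parameter names are assumptions for illustration; the chained calls are the same ones shown in the answer.

```python
def write_partitioned(df, path: str, partition_cols=("year", "month"), num_tasks=None):
    """Write `df` as Parquet, one directory per distinct partition value.

    `num_tasks` optionally coalesces first. Each task writes at most one
    file per partition directory, so this caps files per directory.
    """
    out = df if num_tasks is None else df.coalesce(num_tasks)
    out.write.mode("overwrite").partitionBy(*partition_cols).parquet(path)
```

For example, `write_partitioned(df, "path", num_tasks=4)` produces at most four files under each `year=.../month=...` directory. Note that `overwrite` replaces everything under `path` unless `spark.sql.sources.partitionOverwriteMode` is set to `dynamic`.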

Question 7

Spark Architecture - Components include Driver, Executors, Cluster Manager, and Tasks

Accepted Answer

**Section 1 — The Context (The 'Why')**
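Spark has four main components: Driver, Executors, Cluster Manager, and Tasks. The driver builds the DAG; executors process the data. Sizing the driver the same as the executors is a mistake, because the two do different work. AQE helps with skew.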

**Section 2 — The Diagram**
```
[Driver] --> [DAG]
   |
   v
[Cluster Mgr] YARN | K8s
   |
   v
[Executors] [Tasks]
```

**Section 3 — Component Logic**
**Driver** builds the DAG; it does not process data. **Cluster Manager** allocates resources. **Executors**: e.g. 4 cores and 8GB each. Use AQE for skew....
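
As a rough illustration of the sizing above, the helpers below assemble the corresponding Spark configuration keys and render them as `spark-submit` arguments. The function names, the executor count, and the 4GB driver default are assumptions for the example; the 4 cores and 8GB per executor come from the text.

```python
def executor_conf(num_executors: int, cores: int = 4, memory_gb: int = 8,
                  driver_memory_gb: int = 4) -> dict:
    """Spark settings for the example sizing, with AQE skew handling enabled."""
    return {
        "spark.executor.instances": str(num_executors),
        "spark.executor.cores": str(cores),
        "spark.executor.memory": f"{memory_gb}g",
        # The driver is sized separately; it schedules rather than processes data.
        "spark.driver.memory": f"{driver_memory_gb}g",
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.skewJoin.enabled": "true",
    }


def total_task_slots(num_executors: int, cores: int = 4) -> int:
    """Number of tasks the cluster can run concurrently."""
    return num_executors * cores


def to_submit_args(conf: dict) -> list:
    """Render a config dict as `--conf key=value` arguments for spark-submit."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args
```

With 10 executors of 4 cores each, `total_task_slots(10)` gives 40 concurrent tasks, which is the number the shuffle-partition rule of thumb multiplies.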

Presidio Spark & Big Data Interview Questions
