Question 1

What is the difference between cache() and persist() in Spark? When would you use each?

Accepted Answer

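**cache()**: Shorthand for `persist()` with the default storage level. For DataFrames and Datasets the default is `MEMORY_AND_DISK`: partitions are stored in memory and spill to disk if memory is insufficient. For RDDs the default is `MEMORY_ONLY`, so partitions that do not fit are recomputed from lineage instead of being spilled.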

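**persist(storage_level)**: Explicit control over storage: `MEMORY_ONLY`, `MEMORY_AND_DISK`, `MEMORY_ONLY_SER`, `MEMORY_AND_DISK_SER`, `DISK_ONLY`. The `_SER` levels keep partitions in serialized form, which uses less memory at the price of extra CPU on every read.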

**Architectural Logic (Why It Matters)**: Caching trades memory/disk for recomputation cost....
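The trade-off can be illustrated with a toy model. The sketch below is plain Python, not the Spark API: `LazyDataset`, `expensive_transform`, and `compute_calls` are hypothetical names invented for this example. It only shows the idea that, without caching, every action re-runs the lineage, while a cached dataset pays the computation once and holds the result in memory.

```python
from typing import Callable, List, Optional


class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark)."""

    def __init__(self, compute: Callable[[], List[int]]):
        self._compute = compute            # the "lineage": how to rebuild the data
        self._cached: Optional[List[int]] = None
        self._cache_enabled = False
        self.compute_calls = 0             # how many times the lineage was re-run

    def cache(self) -> "LazyDataset":
        # Marking is lazy: nothing is stored until the first action runs.
        self._cache_enabled = True
        return self

    def unpersist(self) -> "LazyDataset":
        # Release the stored copy so the memory can be reused.
        self._cache_enabled = False
        self._cached = None
        return self

    def _materialize(self) -> List[int]:
        if self._cached is not None:
            return self._cached
        self.compute_calls += 1
        data = self._compute()
        if self._cache_enabled:
            self._cached = data
        return data

    # Two "actions" that each need the full data.
    def count(self) -> int:
        return len(self._materialize())

    def total(self) -> int:
        return sum(self._materialize())


def expensive_transform() -> List[int]:
    return [x * x for x in range(1000)]


uncached = LazyDataset(expensive_transform)
uncached.count()
uncached.total()

cached = LazyDataset(expensive_transform).cache()
cached.count()
cached.total()

# The uncached dataset recomputed for each action; the cached one did not.
print(uncached.compute_calls, cached.compute_calls)
```

In PySpark the corresponding calls are `df.cache()`, `df.persist(StorageLevel.DISK_ONLY)` (or another level), and `df.unpersist()` once the reuse is over.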

Question 2

Can you explain the architecture of Apache Spark and its components?

Accepted Answer

**Section 1 — The Context (The 'Why')**
Apache Spark's distributed execution model faces the core challenge of coordinating hundreds of executors while avoiding driver bottlenecks and shuffle storms. At scale, the driver's single-threaded scheduling and result aggregation become failure points....

Question 3

Tell me about a time when you faced a challenging situation at work and how you handled it.

Accepted Answer

Situation: During a major product launch, a critical data pipeline failed. Downstream dashboards showed incorrect KPIs; execs were escalating; the root cause was unclear. Task: Restore data integrity and stakeholder confidence while preventing recurrence. Action: I triaged using a runbook: checked Spark/Airflow logs, source schema, and recent deployments. I identified a breaking schema change in the upstream API that wasn't in our contract....

Question 4

What is a window function? Explain with an example.

Accepted Answer

**Architectural Logic:** Window functions compute values over a set of rows related to the current row (defined by OVER) without collapsing rows like GROUP BY. **Why windows over self-joins:** Single scan, no explosion; optimal for rankings, running totals, period-over-period. **Scalability:** PARTITION BY and ORDER BY affect sort/spill; large windows with RANGE BETWEEN UNBOUNDED PRECEDING can spill to disk. **Cost:** ROWS vs....
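Since the question asks for an example, here is a minimal sketch using Python's built-in `sqlite3` module (assuming the bundled SQLite is version 3.25 or later, which added window functions). The `sales` table and its rows are made up for illustration. The query computes a per-region running total and a per-region rank while keeping every input row, which a `GROUP BY` would collapse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, day TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("north", "2024-01-01", 100),
        ("north", "2024-01-02", 50),
        ("north", "2024-01-03", 70),
        ("south", "2024-01-01", 30),
        ("south", "2024-01-02", 90),
    ],
)

query = """
SELECT region, day, amount,
       -- running total within each region, ordered by day
       SUM(amount) OVER (
           PARTITION BY region ORDER BY day
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total,
       -- rank of each day's amount within its region
       RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank
FROM sales
ORDER BY region, day
"""
rows = conn.execute(query).fetchall()

for row in rows:
    print(row)
```

All five input rows come back, each carrying its own running total and rank. The same `OVER (PARTITION BY ... ORDER BY ...)` clause works in Spark SQL and most warehouse engines.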

Question 5

Prioritize Spark optimizations by impact and effort. Discuss partitioning strategy, caching policy, join selection, shuffle reduction, and when each becomes a scalability or cost bottleneck.

Accepted Answer

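Optimization hierarchy: (1) Partitioning: partition stored data by the columns most often used in filters (date, region) so that queries prune partitions instead of scanning everything; use coalesce/repartition to match downstream parallelism. Impact: high, because it avoids full scans; cost: storage and metadata overhead when there are many small partitions. (2) Caching: cache() for datasets reused across multiple passes; the cost is memory, so unpersist when done. (3) Broadcast joins: chosen automatically for tables smaller than `spark.sql.autoBroadcastJoinThreshold` (10 MB by default); eliminates the shuffle for small dimension tables....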
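To see why a broadcast join removes the shuffle, consider the following sketch. It is plain Python rather than Spark, and `broadcast_hash_join`, `fact_partitions`, and `dim_rows` are hypothetical names for this illustration. The small table is turned into a hash map that every partition of the large table can probe locally, so no rows of the large table need to move between partitions.

```python
from typing import Dict, List, Tuple

FactRow = Tuple[int, int]          # (region_id, amount)
DimRow = Tuple[int, str]           # (region_id, region_name)
JoinedRow = Tuple[int, int, str]   # (region_id, amount, region_name)


def broadcast_hash_join(
    fact_partitions: List[List[FactRow]],
    dim_rows: List[DimRow],
) -> List[List[JoinedRow]]:
    # Build the lookup once; in a cluster this map would be shipped
    # ("broadcast") to every executor.
    lookup: Dict[int, str] = dict(dim_rows)

    joined: List[List[JoinedRow]] = []
    for partition in fact_partitions:
        # Each partition is joined in place (inner join): rows stay where
        # they are, so the large table is never shuffled.
        joined.append(
            [(key, amount, lookup[key]) for key, amount in partition if key in lookup]
        )
    return joined


fact_partitions = [[(1, 10), (2, 5)], [(1, 7), (3, 2)]]
dim_rows = [(1, "north"), (2, "south")]

result = broadcast_hash_join(fact_partitions, dim_rows)
print(result)
```

The partition layout of the large table is unchanged in the output. The approach stops paying off when the small side no longer fits comfortably in each executor's memory, which is the reason for the size threshold.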

Question 6

Explain the difference between batch and streaming data processing in Data Fusion.

Accepted Answer

Batch processes bounded datasets in discrete runs; streaming processes unbounded data with continuous execution. **Why the distinction matters**: Batch has predictable cost (run N times, pay N × job cost); streaming has always-on cost and state management. **Data Fusion context**: Batch templates (e.g., JDBC to BigQuery) are scheduled; streaming uses Pub/Sub or Kafka sources....

Question 7

Why are you leaving your current role?

Accepted Answer

Situation: Role departure. Task: Concise, positive. Action: Growth, challenges, culture, compensation, balance. Keep brief. Example: 'Great experience, looking for [goal]. This role aligns.' Redirect to enthusiasm for new role....

Question 8

Explain job bookmarking in AWS Glue. How does it help in incremental data processing?

Accepted Answer

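Architectural logic: Job bookmarks persist state about what a job has already processed (file paths, partition values, key values). The next run skips data that was already processed. Why: Avoid full scans; reduce cost and run time. For JDBC: Use an ordered column (e.g., updated_at) as the bookmark key, so new rows can be identified by comparing against the last value seen. Scalability: With many partitions, tracking bookmark state adds overhead of its own and can slow the job....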
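The bookmark idea can be sketched as a high-water mark stored between runs. The code below is not the AWS Glue API: `load_bookmark`, `save_bookmark`, `run_incremental`, and the JSON state file are hypothetical stand-ins that only illustrate the mechanism, assuming `updated_at` values are ISO-8601 strings that sort chronologically.

```python
import json
import os
import tempfile
from typing import Dict, List

Row = Dict[str, str]


def load_bookmark(path: str) -> str:
    # No state file yet means this is the first run: process everything.
    if not os.path.exists(path):
        return ""
    with open(path) as f:
        return json.load(f)["last_updated_at"]


def save_bookmark(path: str, value: str) -> None:
    with open(path, "w") as f:
        json.dump({"last_updated_at": value}, f)


def run_incremental(source: List[Row], bookmark_path: str) -> List[Row]:
    last_seen = load_bookmark(bookmark_path)
    # Only rows newer than the stored high-water mark are processed.
    new_rows = [r for r in source if r["updated_at"] > last_seen]
    if new_rows:
        # Advance the bookmark only after the new rows have been selected.
        save_bookmark(bookmark_path, max(r["updated_at"] for r in new_rows))
    return new_rows


bookmark_file = os.path.join(tempfile.mkdtemp(), "bookmark.json")
source = [
    {"id": "1", "updated_at": "2024-01-01T00:00:00"},
    {"id": "2", "updated_at": "2024-01-02T00:00:00"},
]

first_run = run_incremental(source, bookmark_file)    # processes both rows
source.append({"id": "3", "updated_at": "2024-01-03T00:00:00"})
second_run = run_incremental(source, bookmark_file)   # processes only the new row

print(len(first_run), [r["id"] for r in second_run])
```

A strict `>` comparison would miss rows that arrive later with the same timestamp as the stored mark, which is one reason a strictly increasing key makes a safer bookmark column.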

Freecharge Data Engineer Interview Questions
