Question 1

What is the difference between cache() and persist() in Spark? When would you use each?

Accepted Answer

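**cache()**: Shorthand for `persist()` with the default storage level. For DataFrames and Datasets the default is `MEMORY_AND_DISK`: partitions are kept in memory and spill to disk if memory is insufficient. For RDDs the default is `MEMORY_ONLY`: partitions that do not fit are recomputed when needed.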

**persist(storage_level)**: Explicit control over the storage level: `MEMORY_ONLY`, `MEMORY_AND_DISK`, `MEMORY_ONLY_SER`, `MEMORY_AND_DISK_SER`, `DISK_ONLY`.

**Architectural Logic (Why It Matters)**: Caching trades memory/disk for recomputation cost....
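The trade-off can be illustrated with a toy model in plain Python. This is not Spark code: `LazyDataset`, `expensive_lineage`, and the `compute_calls` counter are hypothetical names invented for this sketch. It only mimics the idea that each action on an uncached lazy dataset re-runs the lineage, while a cached one pays for the computation once and then holds the result in memory.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark API)."""

    def __init__(self, compute):
        self._compute = compute    # function that rebuilds the rows from the lineage
        self._cached = None        # materialized rows, once caching has kicked in
        self._use_cache = False
        self.compute_calls = 0     # how many times the lineage was executed

    def cache(self):
        # Like Spark, only marks the dataset; nothing is computed yet.
        self._use_cache = True
        return self

    def unpersist(self):
        # Release the stored rows so the memory can be reused.
        self._use_cache = False
        self._cached = None
        return self

    def _rows(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.compute_calls += 1
        rows = self._compute()
        if self._use_cache:
            self._cached = rows
        return rows

    # Two "actions" that each need the full data.
    def count(self):
        return len(self._rows())

    def total(self):
        return sum(self._rows())


def expensive_lineage():
    # Placeholder for a costly chain of transformations.
    return [x * x for x in range(1000)]


uncached = LazyDataset(expensive_lineage)
uncached.count()
uncached.total()

cached = LazyDataset(expensive_lineage).cache()
cached.count()
cached.total()

print(uncached.compute_calls, cached.compute_calls)
```

In this model the uncached dataset executes its lineage once per action, while the cached one executes it only on the first action. The price is that the rows stay resident until `unpersist()` is called.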

Question 2

Can you explain the architecture of Apache Spark and its components?

Accepted Answer

**Section 1 — The Context (The 'Why')**
Apache Spark's distributed execution model faces the core challenge of coordinating hundreds of executors while avoiding driver bottlenecks and shuffle storms. At scale, the driver's single-threaded scheduling and result aggregation become failure points....

Question 3

Tell me about a time when you faced a challenging situation at work and how you handled it.

Accepted Answer

Situation: During a major product launch, a critical data pipeline failed. Downstream dashboards showed incorrect KPIs; execs were escalating; the root cause was unclear. Task: Restore data integrity and stakeholder confidence while preventing recurrence. Action: I triaged using a runbook: checked Spark/Airflow logs, source schema, and recent deployments. I identified a breaking schema change in the upstream API that wasn't in our contract....

Question 4

What is a window function? Explain with an example.

Accepted Answer

**Architectural Logic:** Window functions compute values over a set of rows related to the current row (defined by OVER) without collapsing rows like GROUP BY. **Why windows over self-joins:** Single scan, no explosion; optimal for rankings, running totals, period-over-period. **Scalability:** PARTITION BY and ORDER BY affect sort/spill; large windows with RANGE BETWEEN UNBOUNDED PRECEDING can spill to disk. **Cost:** ROWS vs....
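As a minimal example, the sketch below runs a window query against an in-memory SQLite database from Python. The `sales` table and its rows are made up for illustration, and the query assumes the bundled SQLite is version 3.25 or newer (the first release with window functions). It computes a per-region running total and a per-region rank while keeping every input row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", 1, 10), ("east", 2, 20), ("east", 3, 5),
        ("west", 1, 7), ("west", 2, 3),
    ],
)

query = """
SELECT region, day, amount,
       -- running total within each region, ordered by day
       SUM(amount) OVER (
           PARTITION BY region ORDER BY day
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total,
       -- rank of each sale within its region, largest amount first
       RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank
FROM sales
ORDER BY region, day
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

A `GROUP BY region` would return two rows here; the window version returns all five, each annotated with its running total and rank.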

Question 5

Prioritize Spark optimizations by impact and effort. Discuss partitioning strategy, caching policy, join selection, shuffle reduction, and when each becomes a scalability or cost bottleneck.

Accepted Answer

Optimization hierarchy: (1) Partitioning: partition by filter columns (date, region) so that queries prune partitions instead of scanning everything; use coalesce/repartition to match downstream parallelism. Impact: high, because it avoids full scans; cost: storage and metadata overhead when there are many small partitions. (2) Caching: cache() for multi-pass reuse; it costs memory, so unpersist when done. (3) Broadcast joins: chosen when one side is smaller than `spark.sql.autoBroadcastJoinThreshold` (10 MB by default); eliminates the shuffle for small dimension tables....
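To see why a broadcast join avoids the shuffle, here is a plain-Python sketch of the idea, not Spark's implementation. The function name `broadcast_hash_join` and the sample `orders` and `regions` rows are hypothetical. The small table is turned into an in-memory hash table that every task could hold a copy of, and the large table is streamed past it without being repartitioned by key.

```python
from collections import defaultdict


def broadcast_hash_join(large_rows, small_rows, key):
    """Inner join that hashes only the small side (illustrative, not a Spark API)."""
    # Build phase: the small table becomes a lookup that fits in memory.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    # Probe phase: stream the large side; no sorting or repartitioning of it.
    joined = []
    for row in large_rows:
        for match in lookup.get(row[key], []):
            joined.append({**row, **match})
    return joined


orders = [
    {"order_id": 1, "region_id": "E", "amount": 10},
    {"order_id": 2, "region_id": "W", "amount": 7},
    {"order_id": 3, "region_id": "X", "amount": 4},   # no matching region
]
regions = [
    {"region_id": "E", "name": "East"},
    {"region_id": "W", "name": "West"},
]

result = broadcast_hash_join(orders, regions, "region_id")
for row in result:
    print(row)
```

The approach only pays off while the small side fits comfortably in each executor's memory, which is why a size threshold gates it.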

Question 6

Explain the difference between batch and streaming data processing in Data Fusion.

Accepted Answer

Batch processes bounded datasets in discrete runs; streaming processes unbounded data with continuous execution. **Why the distinction matters**: Batch has predictable cost (run N times, pay N × job cost); streaming has always-on cost and state management. **Data Fusion context**: Batch templates (e.g., JDBC to BigQuery) are scheduled; streaming uses Pub/Sub or Kafka sources....

Question 7

Why are you leaving your current role?

Accepted Answer

Situation: Role departure. Task: Concise, positive. Action: Growth, challenges, culture, compensation, balance. Keep brief. Example: 'Great experience, looking for [goal]. This role aligns.' Redirect to enthusiasm for new role....

Question 8

Explain job bookmarking in AWS Glue. How does it help in incremental data processing?

Accepted Answer

Architectural logic: Job bookmarks persist state about what earlier runs processed (for example, paths or partition values), so the next run skips already processed data. Why: avoids full scans and reduces cost and time. For JDBC sources: use a column (e.g., updated_at) as the bookmark key. Scalability: with many partitions, bookmark tracking itself adds overhead....
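The incremental pattern behind bookmarks can be sketched in plain Python as a high-water mark on `updated_at`. This is a conceptual illustration only and does not use the Glue API: `process_incrementally` and the sample rows are hypothetical, and a real bookmark would be stored durably between runs rather than kept in a variable.

```python
from datetime import datetime


def process_incrementally(rows, bookmark):
    """Return the rows newer than the bookmark, plus the advanced bookmark."""
    new_rows = [r for r in rows if bookmark is None or r["updated_at"] > bookmark]
    if new_rows:
        # Advance the high-water mark to the newest row just processed.
        bookmark = max(r["updated_at"] for r in new_rows)
    return new_rows, bookmark


source = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 2)},
]

# First run: no bookmark yet, so everything is processed.
first_batch, bookmark = process_incrementally(source, None)

# A new row arrives before the second run.
source.append({"id": 3, "updated_at": datetime(2024, 1, 3)})

# Second run: only the row past the bookmark is picked up.
second_batch, bookmark = process_incrementally(source, bookmark)

print([r["id"] for r in first_batch], [r["id"] for r in second_batch])
```

One caveat of the strict `>` comparison is that a late-arriving row whose timestamp equals the bookmark would be skipped, so the bookmark column should increase monotonically.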

Question 9

How do you monitor and log data pipelines in AWS?

Accepted Answer

Architectural layers: CloudWatch for metrics and logs, X-Ray for tracing. Glue, Lambda, and Step Functions emit metrics to CloudWatch. Use structured (JSON) logging so that logs can be filtered and searched. Set alarms on failure and latency, build dashboards, and route alerts to PagerDuty/Slack....
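Structured logging needs nothing beyond the standard library. The sketch below is one possible setup, not a prescribed one: the `JsonFormatter` class, the logger name `pipeline_demo`, and the context fields `pipeline`, `run_id`, and `rows_processed` are hypothetical choices. It writes to an in-memory stream as a stand-in for stdout, which is where a Lambda or Glue job would normally log.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Hypothetical context fields a pipeline might attach via `extra=`.
    CONTEXT_FIELDS = ("pipeline", "run_id", "rows_processed")

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


stream = io.StringIO()  # stand-in for stdout in this sketch
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("pipeline_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

logger.info(
    "load finished",
    extra={"pipeline": "orders_daily", "run_id": "r-001", "rows_processed": 1200},
)

event = json.loads(stream.getvalue())
print(event)
```

With one JSON object per line, fields such as `run_id` can be queried directly instead of being parsed out of free text.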

Question 10

What are the limitations of AWS Glue and Lambda?

Accepted Answer

**Glue**: Billing has a per-run minimum (1 minute on Glue 2.0 and later, 10 minutes on Glue 0.9/1.0), so very short jobs still pay the minimum. Cold start can take 2–10 min on older Glue versions; newer versions start faster. Max 100K DPU-hours per account (request increase). Not for sub-minute jobs. Job bookmark has limitations (only Append mode for some sources). **Lambda**: 15 min max duration; 10 GB memory; 6 MB payload (synchronous); 256 KB (Step Functions). Concurrency limits (1000 default, request more). Not for multi-GB data; use Glue/EMR. Cold start for VPC Lambdas....

Freecharge Data Engineer Interview Questions
