Question 1

Explain the difference between batch and streaming data processing in Data Fusion.

Accepted Answer

Batch processes bounded datasets in discrete runs; streaming processes unbounded data with continuous execution. **Why the distinction matters**: Batch has predictable cost (run N times, pay N × job cost); streaming has always-on cost and state management. **Data Fusion context**: Batch templates (e.g., JDBC to BigQuery) are scheduled; streaming uses Pub/Sub or Kafka sources....

Question 2

Explain the concept of preemptible VMs in Dataproc and their cost implications.

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Preemptible VMs (Dataproc): low-cost, can be reclaimed by GCP (max 24h). Cost approximately 80% less than standard. Use for: task nodes, batch workloads, fault-tolerant jobs. Risk: preemption can cause task failure. Mitigation: use for task nodes only; configure max preemptible; use fault-tolerant frameworks (Spark, Hadoop); restart failed nodes....

Question 3

How do you configure autoscaling for a Dataproc cluster?

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Dataproc autoscaling: create cluster with autoscaling policy—set min/max workers, scale based on YARN memory/CPU. gcloud dataproc clusters create --autoscaling-policy=policy. Policy defines scale-in/out rules....

Question 4

How do you manage dependencies between tasks in a Cloud Composer DAG?

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams. Cloud Composer task dependencies: use >> and <<: task1 >> task2 >> task3; or task1.set_downstream(task2). Use ExternalTaskSensor for cross-DAG....

Question 5

How would you debug a failing Spark job running on Dataproc?

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Debug failing Spark job on Dataproc: (1) Check YARN/Spark logs in Cloud Console (Dataproc > Job > View Logs). (2) Spark driver logs—look for stack traces, OOM, serialization errors. (3) Executor logs—task failures, shuffle errors. (4) Enable `spark.eventLog.enabled` and review in History Server. (5) Reproduce locally with small data....

Question 6

How would you handle a large-scale data shuffle in a Dataflow pipeline?

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Large-scale shuffle in Dataflow: (1) Increase worker count and disk. (2) Use `GroupByKey` sparingly; prefer `CombinePerKey` or `CombineGlobally` to reduce data. (3) Use side inputs for small data. (4) Co-locate data—use consistent keys. (5) Increase `maxNumWorkers`; use `dataflowFlexTemplates` for custom config. (6) Consider shuffle service....

Question 7

What are the advantages of using Dataproc over a traditional Hadoop setup?

Accepted Answer

**Why Managed Over Self-Managed**: Ops burden (patching, scaling, monitoring) consumes 30–50% of platform team time. Managed shifts that to vendor.

**Dataproc Advantages**: (1) **Fast startup**—minutes vs. hours for on-prem. (2) **Preemptible/Spot**—60–90% cheaper for batch. (3) **Auto-delete**—transient clusters; no idle cost. (4) **GCS integration**—no HDFS; object storage native. (5) **Init scripts**—reproducible config. (6) **Optional components**—Spark, Hive, etc....

Aarete Spark & Big Data Interview Questions

Difficulty Breakdown

Key Topics Covered

How to Use This Guide

Companies asking these questions

All 7 Questions

More Interview Prep Guides

Unlock All Expert Answers

Aarete Spark & Big Data Interview Questions

Difficulty Breakdown

Key Topics Covered

How to Use This Guide

Companies asking these questions

All 7 Questions

More Interview Prep Guides

Unlock All Expert Answers