Question 1

What are traits in Scala, and how are they different from classes?

Accepted Answer

**Traits**: Interface-like constructs that can define both abstract and concrete methods and fields. A class can mix in several traits via `extends` and `with`, which gives it multiple inheritance of type.

**Classes**: Define objects with state and behavior. Single inheritance; one superclass.

**Key Differences**: Traits enable composition; classes define core logic. Traits can be partially implemented; classes hold primary behavior....
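A minimal sketch of the distinction, using hypothetical names (`Greeter`, `Logger`, `Base`, `Service`) that are not from the original answer. It shows a trait with one abstract and one concrete member, and a class that has a single superclass but mixes in two traits.

```scala
// All names here are hypothetical; only the language features are real.
trait Greeter {
  def name: String                      // abstract member: the implementer must supply it
  def greet: String = s"Hello, $name"   // concrete member: inherited as-is
}

trait Logger {
  def log(msg: String): String = s"[log] $msg"
}

// A class has exactly one superclass...
abstract class Base(val id: Int)

// ...but can mix in several traits on top of it.
class Service(serviceId: Int, val name: String)
  extends Base(serviceId) with Greeter with Logger

object TraitDemo {
  def main(args: Array[String]): Unit = {
    val s = new Service(1, "orders")
    println(s.greet)
    println(s.log(s"service ${s.id} started"))
  }
}
```

Here `Service` gets `id` from its one superclass and picks up `greet` and `log` by composition, supplying only the abstract `name`.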

Question 2

What is the difference between cache() and persist() in Spark? When would you use each?

Accepted Answer

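**cache()**: Shorthand for `persist()` with the default storage level. For DataFrames and Datasets the default is `MEMORY_AND_DISK`: partitions are kept in memory and spill to disk if memory is insufficient. For RDDs the default is `MEMORY_ONLY`: partitions that do not fit are recomputed when needed.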

**persist(storage_level)**: Explicit control over storage: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY.

**Architectural Logic (Why It Matters)**: Caching trades memory/disk for recomputation cost....
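The two calls can be compared side by side. This is a sketch only, assuming Spark 3.x is on the classpath and a local master is acceptable; the object name, app name, and filter expressions are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheVsPersist {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")               // assumption: local run for illustration
      .appName("cache-vs-persist")
      .getOrCreate()

    val df = spark.range(0, 1000000).toDF("id")

    // cache() takes no arguments and uses the API's default storage level.
    val cached = df.filter("id % 2 = 0").cache()
    println(cached.storageLevel)

    // persist(level) states the storage level explicitly.
    val onDisk = df.filter("id % 3 = 0").persist(StorageLevel.DISK_ONLY)
    println(onDisk.storageLevel)

    // Both calls are lazy: nothing is stored until an action runs.
    cached.count()
    onDisk.count()

    // Release the storage once the data is no longer reused.
    cached.unpersist()
    onDisk.unpersist()
    spark.stop()
  }
}
```

Persisting only pays off when the same dataset feeds more than one action; for a single pass it just adds storage work.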

Question 3

What is the difference between groupByKey and reduceByKey in Spark?

Accepted Answer

**groupByKey()**: Shuffles all (key, value) pairs to group values per key. Transfers O(total_values) over the network. No local aggregation—you combine values afterward. High memory and network cost.

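**reduceByKey(func)**: Applies `func` (e.g., sum) within each partition before the shuffle, so at most one aggregated value per key leaves each partition. Shuffle volume is bounded by O(unique_keys × partitions) instead of O(total_values). Values are combined locally first, then across partitions, so `func` must be associative and commutative.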

**Architectural Logic (Why reduceByKey Wins)**: Shuffle is the bottleneck....
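The difference in shuffle volume can be illustrated without a cluster. The sketch below is a plain-Scala model, not Spark code: partitions are ordinary sequences, and all names (`ShuffleSketch`, `sample`, and the helper functions) are hypothetical.

```scala
object ShuffleSketch {
  type Partition = Seq[(String, Int)]

  // Two input partitions, seven records in total.
  val sample: Seq[Partition] = Seq(
    Seq("a" -> 1, "b" -> 1, "a" -> 1, "a" -> 1),
    Seq("a" -> 1, "b" -> 1, "b" -> 1)
  )

  // groupByKey model: every record crosses the shuffle unchanged.
  def groupByKeyShuffled(parts: Seq[Partition]): Seq[(String, Int)] =
    parts.flatten

  // reduceByKey model: combine inside each partition first,
  // so at most one record per key leaves each partition.
  def reduceByKeyShuffled(parts: Seq[Partition])(f: (Int, Int) => Int): Seq[(String, Int)] =
    parts.flatMap { p =>
      p.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }.toSeq
    }

  // Reduce side: merge whatever arrived, per key.
  def finalReduce(shuffled: Seq[(String, Int)])(f: (Int, Int) => Int): Map[String, Int] =
    shuffled.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  def main(args: Array[String]): Unit = {
    val grouped = groupByKeyShuffled(sample)
    val reduced = reduceByKeyShuffled(sample)(_ + _)
    println(s"groupByKey model shuffles ${grouped.size} records")
    println(s"reduceByKey model shuffles ${reduced.size} records")
    println(finalReduce(grouped)(_ + _) == finalReduce(reduced)(_ + _))
  }
}
```

Both paths produce the same per-key totals; only the number of records that cross the shuffle differs, and the gap widens as values per key grow.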

Question 4

What is the difference between narrow and wide transformations in Apache Spark? Explain with examples.

Accepted Answer

**Narrow transformations**: Each input partition maps to at most one output partition. No shuffle. Examples: map, filter, flatMap, mapPartitions.

**Wide transformations**: One output partition may need data from many input partitions, so Spark must shuffle. Examples: groupByKey, reduceByKey, join (when the inputs are not already co-partitioned), distinct, repartition.

**Architectural Logic (Why This Matters)**: Spark pipelines narrow transformations and executes them in a single stage....
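A plain-Scala model of the two dependency shapes, as a sketch only: partitions are vectors, and the names `StageSketch`, `narrow`, and `wide` are hypothetical. The modulo rule in `wide` stands in for whatever partitioner a real shuffle would use.

```scala
object StageSketch {
  type Partition = Vector[Int]

  val sample: Vector[Partition] = Vector(Vector(1, 2, 3), Vector(4, 5, 6))

  // Narrow: each output partition is computed from exactly one input partition,
  // so map and filter run back to back in one pass with no data movement.
  def narrow(parts: Vector[Partition]): Vector[Partition] =
    parts.map(p => p.map(_ * 2).filter(_ % 3 == 0))

  // Wide: an output partition may need rows from every input partition,
  // so rows are redistributed by key (here: value modulo the partition count).
  def wide(parts: Vector[Partition], numOut: Int): Vector[Partition] = {
    val all = parts.flatten
    Vector.tabulate(numOut)(i => all.filter(v => Math.floorMod(v, numOut) == i))
  }

  def main(args: Array[String]): Unit = {
    println(narrow(sample))
    println(wide(sample, 2))
  }
}
```

In `narrow`, partition boundaries are untouched; in `wide`, every output partition reads from the flattened input, which is the step that forces a stage boundary in Spark.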

Question 5

What is the difference between partitioning and bucketing in Spark, and when would you use bucketing?

Accepted Answer

**Partitioning**: Physically divides data by column values (e.g., date, region); enables partition pruning; one directory per partition value.

**Bucketing**: Divides data within a partition into a fixed number of files via a hash of the bucketing column(s); co-locates rows with the same key.

**When to bucket**: Frequent joins or group-bys on a column (e.g., user_id). When both tables are bucketed on the join key with the same bucket count, Spark can run a sort-merge join without a shuffle....
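Why matching bucket counts remove the shuffle can be sketched in plain Scala. This is a model, not Spark's implementation: `hashCode` is a stand-in for Spark's own hash function, and `BucketSketch`, `bucketOf`, `bucketize`, and `bucketJoin` are hypothetical names.

```scala
object BucketSketch {
  // Stand-in hash for illustration only; Spark uses its own hash function.
  def bucketOf(key: String, numBuckets: Int): Int =
    Math.floorMod(key.hashCode, numBuckets)

  // Assign every row to a bucket based on its key.
  def bucketize[V](rows: Seq[(String, V)], numBuckets: Int): Map[Int, Seq[(String, V)]] =
    rows.groupBy { case (k, _) => bucketOf(k, numBuckets) }

  // With the same bucket count on both sides, equal keys share a bucket index,
  // so bucket i of the left side only ever needs bucket i of the right side.
  def bucketJoin[A, B](l: Seq[(String, A)], r: Seq[(String, B)], n: Int): Seq[(String, A, B)] = {
    val lb = bucketize(l, n)
    val rb = bucketize(r, n)
    (0 until n).flatMap { i =>
      for {
        (lk, a) <- lb.getOrElse(i, Seq.empty)
        (rk, b) <- rb.getOrElse(i, Seq.empty)
        if lk == rk
      } yield (lk, a, b)
    }
  }

  def main(args: Array[String]): Unit = {
    val users  = Seq("u1" -> "Ann", "u2" -> "Bob", "u3" -> "Cy")
    val orders = Seq("u2" -> 10, "u1" -> 5, "u2" -> 7)
    bucketJoin(users, orders, 4).foreach(println)
  }
}
```

If the two sides used different bucket counts, the same key could land in different bucket indexes, and the pairwise bucket-to-bucket join would no longer be valid without redistributing one side.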

Question 6

Can you explain the architecture of Apache Spark and its components?

Accepted Answer

**Section 1 — The Context (The 'Why')**
Apache Spark's distributed execution model faces the core challenge of coordinating hundreds of executors while avoiding driver bottlenecks and shuffle storms. At scale, the driver's single-threaded scheduling and result aggregation become failure points....

Question 7

When would you architecturally choose Dataset[T] over DataFrame in a Scala Spark pipeline, and what are the scalability and portability trade-offs? Include type-safety benefits vs. operational constraints.

Accepted Answer

DataFrame is an untyped collection of Row objects whose schema is checked at runtime; Dataset[T] is typed and checked at compile time. In Scala, DataFrame = Dataset[Row]. Architectural why: Dataset enables domain modeling (e.g., Dataset[Order]), which catches field and type errors at compile time and gives better IDE support. Scalability: both use Tungsten and Catalyst; Dataset adds encoder overhead, marginal for most workloads, but typed lambdas (e.g., `map` or `filter` with a Scala function) are opaque to Catalyst and can block optimizations such as predicate pushdown. Portability trade-off: PySpark has only DataFrame, with no typed Dataset....
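The type-safety difference is easiest to see with the same filter written both ways. This is a sketch assuming Spark 3.x on the classpath and a local master; `Order`'s fields and the object name are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical domain type; it must be defined at top level so an encoder can be derived.
case class Order(id: Long, amount: Double)

object TypedVsUntyped {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")                 // assumption: local run for illustration
      .appName("typed-vs-untyped")
      .getOrCreate()
    import spark.implicits._

    val orders: Dataset[Order] = Seq(Order(1L, 50.0), Order(2L, 150.0)).toDS()
    val df: DataFrame = orders.toDF()

    // Typed: a misspelled field name fails at compile time,
    // but the lambda body is a black box to the optimizer.
    val bigTyped: Dataset[Order] = orders.filter(_.amount > 100.0)

    // Untyped: the column name is resolved when the query is analyzed,
    // so a typo only fails at runtime, but the expression is visible to Catalyst.
    val bigUntyped: DataFrame = df.filter($"amount" > 100.0)

    println(bigTyped.count())
    println(bigUntyped.count())
    spark.stop()
  }
}
```

A common compromise is to keep `Dataset[T]` at pipeline boundaries, where the domain types document the contract, and use column expressions for the heavy transformations in between.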

Question 8

Can you explain your experience with Jenkins in your project?

Accepted Answer

Why Jenkins: CI/CD for data pipelines: validate before deploy; automate releases. Architectural logic: Pipeline as Code (Jenkinsfile) with stages for lint, unit test, integration test, deploy. Build: package code, build Docker images. Test: pytest, dbt test. Deploy: push artifacts, update Airflow DAGs. Integrations: Git webhooks, S3/Nexus, deployment targets. Scalability: the Jenkins controller is a single point of failure; a controller/agent setup runs builds in parallel. Trade-offs: Jenkins is powerful but heavy; GitHub Actions/GitLab CI may be simpler....

Coforge Data Engineer Interview Questions
