Spark & Big Data questions from Coforge data engineering interviews.
These Spark and big data questions are sourced from Coforge data engineering interviews. Each includes an expert-level answer.
What is the difference between cache() and persist() in Spark? When would you use each?
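As a hint at the answer: `cache()` is shorthand for `persist()` at the default storage level, while `persist()` lets you choose a `StorageLevel` explicitly. The payoff of either is avoiding recomputation of a lazy lineage. A plain-Python analogy (not the Spark API) of that recomputation-avoidance idea:

```python
# Plain-Python analogy (not Spark's API): Spark re-runs a lazy lineage on
# every action unless the intermediate result is cached/persisted.
compute_count = 0

def expensive_transform(data):
    global compute_count
    compute_count += 1            # count how often the "lineage" is recomputed
    return [x * 2 for x in data]

data = [1, 2, 3]

# Without caching: every "action" re-runs the whole transformation.
expensive_transform(data)
expensive_transform(data)
assert compute_count == 2

# With caching: compute once, then reuse the materialized result.
cached = expensive_transform(data)   # third and final computation
total = sum(cached)                  # "action" 1 reuses the cached result
size = len(cached)                   # "action" 2 reuses it too
assert compute_count == 3
```

In real Spark you would reach for `persist()` when memory-only caching is too fragile for the dataset size (e.g. a memory-and-disk or disk-only level); `cache()` when the default level is fine.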
What is the difference between groupByKey and reduceByKey in Spark?
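A plain-Python sketch (not the Spark API) of why `reduceByKey` is usually preferred: it can combine values per key on each partition before the shuffle, so only one partial result per key crosses the network, whereas `groupByKey` ships every individual value to the reducer.

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

# groupByKey-style: collect every value per key, reduce only afterwards.
# In Spark, all of these values would travel across the shuffle.
grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)
group_sums = {k: sum(vs) for k, vs in grouped.items()}

# reduceByKey-style: fold each value into a running total per key as it
# arrives, so only one partial sum per key would cross a shuffle boundary.
reduced = {}
for k, v in pairs:
    reduced[k] = reduced.get(k, 0) + v

print(group_sums)  # {'a': 9, 'b': 6}
print(reduced)     # {'a': 9, 'b': 6}
```

Both produce the same sums, but the `reduceByKey`-style fold never materializes the full list of values per key, which is exactly the map-side combine that makes it cheaper at scale.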
What is the difference between narrow and wide transformations in Apache Spark? Explain with examples.
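A plain-Python analogy (not the Spark API) of the distinction: in a narrow transformation each output partition depends on exactly one input partition, so no data moves; in a wide transformation an output partition may need rows from many input partitions, which forces a shuffle.

```python
from collections import defaultdict

parts = [[1, 2], [3, 4]]  # a "dataset" split into two partitions

# Narrow transformation analogy: each output partition is computed from one
# input partition in place (like Spark's map or filter).
narrow = [[x * 10 for x in p] for p in parts]

# Wide transformation analogy: rows must be redistributed by key before the
# result can be built (like Spark's groupByKey, reduceByKey, or join).
buckets = defaultdict(list)
for p in parts:
    for x in p:
        buckets[x % 2].append(x)   # "shuffle" every row to its key's bucket

print(narrow)          # [[10, 20], [30, 40]]
print(dict(buckets))   # {1: [1, 3], 0: [2, 4]}
```

The narrow step touches each partition independently; the wide step has to read from both partitions to fill either bucket, which is why wide transformations mark stage boundaries in Spark.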
Can you explain the architecture of Apache Spark and its components?
When would you architecturally choose Dataset[T] over DataFrame in a Scala Spark pipeline, and what are the scalability and portability trade-offs? Include type-safety benefits vs. operational constraints.
Can you explain dynamic resource allocation in Spark? How does it help optimize job performance?
Explain the DAG in Spark and how it plays a role in execution.
Have you worked with UDFs in Spark? When do you use them, and how do they differ from built-in functions?
How do you handle schema evolution in Spark, especially when reading data from sources like Parquet or Avro?
How do you handle very large datasets in Spark to ensure scalability and efficiency?
How many stages are created in a Spark job, and how are they formed?
How would you handle unstructured data in Hive?
What are the key performance tuning techniques you apply in Spark jobs to improve performance?
What is data shuffling in Spark, and how do you minimize its impact on job performance?
What is one disadvantage of using Scala for data engineering tasks?
What is the command to import data from HDFS to Hive?
What is the difference between map and flatMap in Spark transformations?
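The semantic difference is easy to show with a plain-Python sketch (list comprehensions standing in for the Spark transformations): `map` yields exactly one output element per input element, while `flatMap` may yield zero or more and flattens the results.

```python
lines = ["hello world", "spark rocks"]

# map analogy: one output per input, so splitting gives a nested result.
mapped = [line.split(" ") for line in lines]
# → [['hello', 'world'], ['spark', 'rocks']]

# flatMap analogy: each input may produce many outputs, flattened into
# a single collection of tokens.
flat_mapped = [token for line in lines for token in line.split(" ")]
# → ['hello', 'world', 'spark', 'rocks']

print(mapped)
print(flat_mapped)
```

This is why word-count examples use `flatMap` for tokenization: the downstream pairing and reduction want a flat stream of words, not a list per line.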
What is the difference between partitioning and repartitioning in Spark, and when do you use each?
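A plain-Python analogy (not the Spark API) may help frame the answer: a partition is simply one chunk of the distributed dataset, and repartitioning redistributes all rows into a new number of chunks, which in Spark means a full shuffle.

```python
# Plain-Python analogy: a "partitioned" dataset is data split into chunks.
data = list(range(10))

def partition(rows, n):
    # round-robin split into n partitions (roughly what a hash shuffle does)
    parts = [[] for _ in range(n)]
    for i, x in enumerate(rows):
        parts[i % n].append(x)
    return parts

parts4 = partition(data, 4)  # initial partitioning into 4 chunks

# repartition-style: gather every row and redistribute into a new chunk
# count (in Spark this is a full shuffle; it can grow or shrink the count).
flattened = [x for p in parts4 for x in p]
parts2 = partition(flattened, 2)

print(len(parts4), len(parts2))  # 4 2
```

Worth mentioning in an answer: Spark also offers `coalesce()`, which shrinks the partition count by merging neighbors without a full shuffle, making it cheaper than `repartition()` when you only need fewer partitions.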
Download the complete interview prep bundle with expert answers. Study offline, on your commute, anywhere.