Saying 'cache uses memory, persist lets you choose' isn't enough. See the production-grade answer that separates senior engineers from juniors.
What is the difference between cache() and persist() in Spark? When would you use each?
cache() stores the RDD/DataFrame using the default storage level. persist() lets you choose the storage level explicitly: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, etc. For DataFrames, cache() is equivalent to persist(MEMORY_AND_DISK). You should use cache() when you want to reuse a DataFrame multiple times in your pipeline.
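As a minimal sketch of the API difference (the paths, and the choice to cache both tables, are purely illustrative): cache() takes no arguments, while persist() accepts any StorageLevel.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = spark.read.parquet('/events')        # hypothetical input path
lookups = spark.read.parquet('/dim/lookups')  # hypothetical input path

events.cache()                                # DataFrame default: MEMORY_AND_DISK
lookups.persist(StorageLevel.DISK_ONLY)       # persist() lets you pick the level

events.count()   # caching is lazy; the first action materializes the cached blocks
lookups.count()
```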
cache() is syntactic sugar for persist(StorageLevel.MEMORY_AND_DISK) on DataFrames (RDD.cache() defaults to MEMORY_ONLY). But knowing when caching actually helps, and when it hurts, matters more:
```python
# Good: DataFrame used in multiple downstream branches
from pyspark.sql.functions import col, count, weekofyear

# assumes an active SparkSession named `spark` (e.g. a notebook session)
base_df = spark.read.parquet('/events').filter(col('date') >= '2026-01-01')
base_df.cache()  # lazy: materialized on the first action that scans base_df

metrics_daily = base_df.groupBy('date').agg(count('*'))
metrics_weekly = base_df.groupBy(weekofyear('date')).agg(count('*'))
```
When NOT to cache (common mistakes):
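One common case, shown as a hedged sketch (the paths and the single downstream write are hypothetical): caching a DataFrame that only one action ever reads. The cache costs executor memory and time to materialize, and nothing reuses it.

```python
from pyspark.sql.functions import col

# Anti-pattern: the cached data is consumed exactly once.
# Assumes an active SparkSession named `spark`; paths are illustrative.
cleaned = (
    spark.read.parquet('/events')
    .filter(col('status') == 'ok')
)
cleaned.cache()                                           # pays the caching cost...
cleaned.write.mode('overwrite').parquet('/events_clean')  # ...but this is the only reader

# Nothing else touches `cleaned`, so the cached blocks sit in storage memory
# until they are evicted. Drop the cache() call here, or unpersist() right away.
```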
Storage levels and habits that matter in production:

- MEMORY_AND_DISK (the cache() default for DataFrames): safe, but deserialized objects consume 2-5x more memory than the raw data.
- MEMORY_AND_DISK_SER: what I use 90% of the time; serialized storage uses ~50% less memory, worth the small CPU cost.
- DISK_ONLY: for very large intermediate results where recomputation is expensive.
- unpersist() when done: cached data pins memory and evicts other cached data via LRU.
- On Databricks, the disk cache (spark.conf.set('spark.databricks.io.cache.enabled', 'true')) often outperforms manual caching.

Junior engineers cache everything. Senior engineers cache strategically, monitor memory pressure, unpersist when done, and know that MEMORY_AND_DISK_SER is the production-safe default. A short persist-and-unpersist sketch follows.
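A minimal sketch of that pattern, assuming PySpark and hypothetical paths and column names. Note that MEMORY_AND_DISK_SER is exposed by the Scala/Java StorageLevel API; recent PySpark releases don't define that constant (PySpark's MEMORY_AND_DISK already stores data serialized), so the sketch uses MEMORY_AND_DISK.

```python
from pyspark import StorageLevel
from pyspark.sql.functions import col, count

# Assumes an active SparkSession named `spark`; paths and columns are illustrative.
enriched = (
    spark.read.parquet('/events')
    .filter(col('date') >= '2026-01-01')
)

# Explicit storage level instead of plain cache()
# (in Scala/Java this is where you would pass StorageLevel.MEMORY_AND_DISK_SER).
enriched.persist(StorageLevel.MEMORY_AND_DISK)

report_a = enriched.groupBy('country').agg(count('*').alias('events'))
report_b = enriched.groupBy('device').agg(count('*').alias('events'))
report_a.write.mode('overwrite').parquet('/reports/by_country')
report_b.write.mode('overwrite').parquet('/reports/by_device')

# Release the cached blocks once every consumer has run, so later stages
# aren't starved of storage memory.
enriched.unpersist()
```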