Hard-level Spark & big data questions from real data engineering interviews.
These hard Spark & big data questions are selected from real interviews at top companies. Each question includes a detailed expert answer and a pro tip to help you nail your interview.
What is the difference between SparkSession and SparkContext in Spark?
Can you explain the architecture of Apache Spark and its components?
Describe the difference between Spark RDDs, DataFrames, and Datasets.
How does Spark's Catalyst Optimizer work? Explain its stages.
How do you handle late-arriving data in Spark Structured Streaming?
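Answers here usually center on watermarks. The core rule, shown below in plain Python as a simplification (real Spark sets this per column with `withWatermark` and advances the watermark once per micro-batch, not per event), is that an event is kept only if its event time is no older than the maximum event time seen minus the allowed lateness:

```python
class Watermark:
    """Simplified model of Structured Streaming's event-time watermark."""

    def __init__(self, max_delay):
        self.max_delay = max_delay   # allowed lateness, e.g. in seconds
        self.max_event_time = None   # highest event time observed so far

    def accept(self, event_time):
        """Return True if the event is on time, False if it is too late."""
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return event_time >= self.max_event_time - self.max_delay
```

With a 10-second allowed lateness, an event at t=85 arriving after one at t=100 is dropped, while t=95 is still accepted.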
What is the small-file problem in Spark, and how do you solve it?
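A common fix is compaction: rewrite many small files into fewer, right-sized ones. A back-of-the-envelope helper (assumption: 128 MB is a typical target file size; in Spark you would then `df.repartition(n)` before writing, or on Delta run `OPTIMIZE`):

```python
import math

def target_partitions(total_bytes, target_file_bytes=128 * 1024 ** 2):
    """Number of output partitions so each written file lands near the target size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))
```

For example, 10 GB of data at a 128 MB target yields 80 output partitions instead of thousands of tiny files.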
How do you optimize Spark jobs for better performance? Mention at least 5 techniques.
Architecturally, how would you justify or challenge Hadoop vs. a cloud-native data lake (S3 + EMR/Databricks) for a greenfield enterprise data platform? Discuss scalability ceilings, cost model trade-offs, and operational complexity.
Design a cost-aware resource strategy for a Databricks workload with spiky and batch jobs. Explain Dynamic Resource Allocation, when to disable it, and how min/max executors and spot instances affect cost and SLAs.
Design an anti-skew strategy for a join on a high-cardinality key with a long-tail distribution (e.g., a few keys hold 80% of rows). Cover salting, split-skew, AQE, and cost/operational trade-offs.
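The mechanics of salting can be shown without a cluster: append a random suffix to the key on the skewed (large) side, and replicate each key across all suffixes on the small side so the join still matches. A plain-Python illustration (`N_SALTS` is an arbitrary value you would tune to your parallelism):

```python
import random

N_SALTS = 8  # illustrative; tune to cluster parallelism

def salt(key):
    """Large side: scatter one hot key across N_SALTS distinct join keys."""
    return f"{key}#{random.randrange(N_SALTS)}"

def explode(key):
    """Small side: replicate the key so every salted variant finds a match."""
    return [f"{key}#{i}" for i in range(N_SALTS)]
```

After the join, strip the suffix to recover the original key. Every salted key produced on the large side is guaranteed to appear in the exploded small side, so no rows are lost.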
Prioritize Spark optimizations by impact and effort. Discuss partitioning strategy, caching policy, join selection, shuffle reduction, and when each becomes a scalability or cost bottleneck.
Walk through the three AQE features in Spark 3.x (coalesce, join switch, skew join)—how they operate at shuffle boundaries, which configs enable them, and what happens when AQE cannot help.
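The configs the question refers to, as a conf fragment (Spark 3.x; the values shown are the defaults as I recall them, so verify against your version's documentation — and note the sort-merge-to-broadcast join switch has no dedicated flag of its own, being driven by `spark.sql.adaptive.enabled` together with the broadcast size threshold):

```properties
spark.sql.adaptive.enabled=true
# Partition coalescing at shuffle boundaries
spark.sql.adaptive.coalescePartitions.enabled=true
spark.sql.adaptive.advisoryPartitionSizeInBytes=64m
# Skew-join splitting
spark.sql.adaptive.skewJoin.enabled=true
spark.sql.adaptive.skewJoin.skewedPartitionFactor=5
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=256m
```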
Explain wide vs. narrow transformations and how they drive shuffle cost, failure domains, and pipeline design. When would you intentionally add a wide transformation, and how do you minimize its impact?
Design a Delta table layout for mixed workload: point lookups by user_id, range scans by date, and full partition scans. Compare partitioning vs. Z-ordering—when to use each, and the rewrite cost trade-off.
Architecturally, how do Job–Stage–Task boundaries in Spark's execution model impact cluster sizing and shuffle cost, and when would you deliberately collapse or split stages?
Design a fault-tolerant Spark Streaming checkpoint strategy: what to persist, recovery semantics, and cost/scalability trade-offs with checkpoint frequency.
Explain the Medallion Architecture (Bronze, Silver, Gold layers).
Explain the benefits of using DataFrames over RDDs.
Explain the concept of checkpointing in Spark and why it is important.
Explain the difference between batch and streaming data processing in Data Fusion.
Given a streaming dataset from Kafka, how would you ingest the data in real-time using Spark?
How do you optimize Spark jobs for performance?
How would you implement a sliding window aggregation in Spark Structured Streaming?
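The detail candidates often miss is that one event belongs to several overlapping windows. Spark's `window(col, windowDuration, slideDuration)` assigns an event to every window `[start, start + length)` whose start is a multiple of the slide; here is that assignment rule in plain Python (a simplification that ignores time zones and start offsets):

```python
def windows_for(event_ts, length, slide):
    """All sliding windows (start, end) that contain event_ts.

    Windows are [start, start + length) with start a multiple of slide.
    """
    starts = []
    start = (event_ts // slide) * slide  # latest window start at or before ts
    while start > event_ts - length:
        starts.append(start)
        start -= slide
    return [(s, s + length) for s in sorted(starts)]
```

For a 10-second window sliding every 5 seconds, an event at t=12 falls into two windows: [5, 15) and [10, 20).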
Implement a Spark job to find the top 10 most frequent words in a large text file.
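In PySpark this is the classic `flatMap → map → reduceByKey → takeOrdered` pipeline. The same shape in plain Python (stdlib only, so it runs without a cluster; the equivalent RDD chain is sketched in the comment and assumes a simple lowercase-word tokenizer):

```python
import re
from collections import Counter

# PySpark equivalent (sketch):
#   sc.textFile(path) \
#     .flatMap(lambda line: re.findall(r"[a-z']+", line.lower())) \
#     .map(lambda w: (w, 1)) \
#     .reduceByKey(lambda a, b: a + b) \
#     .takeOrdered(10, key=lambda kv: -kv[1])

def top_words(lines, n=10):
    """Top-n most frequent words across an iterable of text lines."""
    words = (w for line in lines
             for w in re.findall(r"[a-z']+", line.lower()))
    return Counter(words).most_common(n)
```

The point to make in an interview: `reduceByKey` pre-aggregates on each partition before the shuffle, unlike `groupByKey`, which ships every raw pair across the network.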
What are the key components of the Spark execution model (Job, Stage, Task)?
What is Spark's Catalyst Optimizer? Explain its stages.
What is the difference between Spark RDDs, DataFrames, and Datasets?
Why is SparkSession used in Spark 2.0 and later versions?
A data pipeline processes files for different clients stored in separate directories. Explain how you would use dynamic DAG creation to handle client-specific workflows in Airflow.
Adaptive Query Execution (AQE): Discuss how AQE optimizes query execution in Spark dynamically based on runtime stats.
After cleaning, how would you store the transformed data into Delta Lake?
Alternatives to the Medallion Architecture
Apache Spark Architecture - RDD, DAG, cluster manager, driver node, worker node
Apache Spark Fundamentals - discuss
Apache Spark Fundamentals - failures, job optimization, resource utilization
Basic Spark commands – Create RDD, Load data, Filter
Bloom Filters in Spark projects - explain use case
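Typical use case: pre-filtering a large table against the keys of a smaller one before an expensive join or shuffle, accepting a small false-positive rate in exchange for a tiny memory footprint. A minimal, dependency-free sketch (sizes are illustrative; production code would use a proper library or Spark's own Bloom-filter support rather than this):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash probes into a fixed-size bit array."""

    def __init__(self, size_bits=1024, n_hashes=3):
        self.size = size_bits
        self.n = n_hashes
        self.bits = 0  # bit array packed into one int

    def _positions(self, item):
        for i in range(self.n):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        # False means definitely absent; True means probably present.
        return all((self.bits >> p) & 1 for p in self._positions(item))
```

In a pipeline, you would build the filter from the small side's keys, broadcast it, and drop large-side rows whose key returns False before the join ever runs.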
cache() vs. persist() in Spark - what is the difference?
Calculating Databricks costs - explain DBU
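A DBU (Databricks Unit) is a normalized unit of compute consumption billed per hour of use; the total bill is the DBU charge plus the underlying cloud VM charge. A back-of-the-envelope helper (all rates below are made up for illustration — actual DBU rates depend on cloud, pricing tier, and workload type):

```python
def databricks_cost(dbu_per_hour, hours, usd_per_dbu, vm_usd_per_hour=0.0):
    """Rough estimate: DBU consumption charge + cloud VM charge."""
    return dbu_per_hour * hours * usd_per_dbu + vm_usd_per_hour * hours

# e.g. a cluster emitting 4 DBU/h for 10 h at a hypothetical $0.55/DBU,
# on VMs costing $1.20/h: 4*10*0.55 + 1.20*10 = $34.00
```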
Can Presto work with Near Real-Time Data (Streaming Data Source)?
Can you explain how streams and tasks handle data freshness in near real-time?
Challenges with Spark Jobs and Resolutions
Compare Hadoop and Spark. Which one would you choose for a real-time application, and why?
Compare Kafka Streams and Spark Structured Streaming for real-time processing
Compare Kafka and RabbitMQ for real-time message processing in a streaming platform.
Conceptualize and design a real-time streaming data pipeline end-to-end.
Databricks - platform, use cases
Define what a User-Defined Function (UDF) is and how to register it in PySpark.
Delta Lake: ACID compliance, time travel, streaming support
Describe how you would monitor ETL job performance and handle long-running tasks.
Describe how you would optimize a join between two large tables where one is significantly smaller, using broadcast joins in PySpark.
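In PySpark the one-liner is `large_df.join(broadcast(small_df), "key")`: Spark ships the small table to every executor, and the join degrades into a local hash lookup with no shuffle of the large side. The per-executor logic, in plain Python (inner join; assumes the small side's keys are unique for simplicity):

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner join: build a hash map from the small side, probe with the large."""
    lookup = {row[key]: row for row in small_rows}  # the 'broadcast' side
    for row in large_rows:                          # streamed, never shuffled
        match = lookup.get(row[key])
        if match is not None:
            yield {**row, **match}
```

Worth adding in an interview: Spark also broadcasts automatically below `spark.sql.autoBroadcastJoinThreshold`, and the small side must fit in each executor's memory or the job fails.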
Describe how you would optimize slow-running Spark jobs in a distributed environment.
Describe projects where you used Spark, Hadoop, or Azure for large-scale data processing
Describe the role of a DAG Scheduler in PySpark
Describe the stages of a Spark job and strategies to optimize Spark performance for large datasets.
Design an ETL pipeline using Kafka and Spark Streaming
Presto vs. Spark - differences in underlying architecture
Discuss common transformations used in Spark code.
Discuss file formats (Parquet, Avro, ORC) and storage strategies.
Download the complete interview prep bundle with expert answers. Study offline, on your commute, anywhere.