DataEngPrep.tech

Interview Questions

Real questions from top companies in Spark/Big Data

700+ Easy, 450+ Medium, 650+ Hard
101

Databricks - platform, use cases

Spark/Big Data | hard | etl, lakehouse, spark | 0.3 min read
NAB
102

Databricks Cluster Management - standalone vs YARN mode

Spark/Big Data | easy | spark | 0.3 min read
Meesho
103

Databricks Job Cluster and SQL Endpoint - discuss Photon

Spark/Big Data | easy | etl, spark, sql | 0.5 min read
JP Morgan
104

Databricks notebooks vs. Fabric notebooks - differences

Spark/Big Data | easy | lakehouse, spark | 0.3 min read
Nihilent
105

Databricks vs. PySpark?

Spark/Big Data | easy | python, spark | 0.3 min read
Comcast
106

Define Airflow and explain it as a workflow orchestration tool.

Spark/Big Data | easy | airflow | 0.3 min read
Fossil Group
107

Define what a User-Defined Function (UDF) is and how to register it in PySpark.

Spark/Big Data | hard | optimization, python, spark | 0.4 min read
Capgemini
108

Defining Tasks in DAG

Spark/Big Data | easy | airflow, python | 0.3 min read
Verizon
109

Delta Lake: ACID compliance, time travel, streaming support

Spark/Big Data | hard | lakehouse | 0.4 min read
Kaseya
110

Delta vs Parquet - explain

Spark/Big Data | easy | lakehouse | 0.3 min read
Myntra
111

Deploying DAGs

Spark/Big Data | easy | airflow, python | 0.3 min read
Verizon
112

Describe a custom EMR cluster configuration for Spark-based ETL with minimal cost.

Spark/Big Data | easy | etl, spark | 0.3 min read
Capco
113

Describe building custom JARs for Spark jobs

Spark/Big Data | easy | spark | 0.3 min read
LTIMindtree
114

Describe how to pass data between tasks in Airflow using XComs.

Spark/Big Data | easy | airflow | 0.4 min read
Citi
115

Describe how you would monitor ETL job performance and handle long-running tasks.

Spark/Big Data | hard | airflow, etl, optimization | 0.4 min read
Adidas
116

Describe how you would optimize a join between two large tables where one is significantly smaller, using broadcast joins in PySpark.

Spark/Big Data | hard | join, optimization, spark | 0.3 min read
Dunnhumby
117

Describe how you would optimize slow-running Spark jobs in a distributed environment.

Spark/Big Data | hard | optimization, partition, spark | 0.4 min read
EPAM
118

Describe how you would use PySpark to aggregate and summarize large transaction datasets.

Spark/Big Data | medium | partition, spark, window | 0.3 min read
Swiggy
119

Describe the cluster configuration used in your project, including memory allocation, number of nodes, and executor/driver settings.

Spark/Big Data | easy | spark | 0.3 min read
Capgemini
120

Describe the projects emphasizing Spark, Hadoop, or Azure for large-scale data processing

Spark/Big Data | hard | etl, spark | 0.4 min read
LTIMindtree

© 2026 DataEngPrep.tech. All rights reserved.