Easy-level Spark & big data questions from real data engineering interviews.
These easy Spark & big data questions are selected from real interviews at top companies. Each question includes a detailed expert answer and a pro tip to help you nail your interview.
What is the difference between Managed and External tables in Hive/Spark?
When would you architecturally choose Dataset[T] over DataFrame in a Scala Spark pipeline, and what are the scalability and portability trade-offs? Include type-safety benefits vs. operational constraints.
What is the difference between Managed and External Tables in Databricks?
A JSON file with evolving schema needs to be ingested into a DataFrame. How would you handle new fields dynamically in PySpark without breaking the job for previous structures?
A task intermittently fails due to external API limitations. How would you configure Airflow retries and alerts to manage this situation efficiently?
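For the retry side of this question, a minimal DAG sketch (assumes Airflow 2.x; the DAG id, placeholder callable, and alert callback are hypothetical). Exponential backoff spaces retries out so a rate-limited API has time to recover, and the failure callback fires only once retries are exhausted:

```python
from datetime import timedelta
import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator

def notify_on_failure(context):
    # Hypothetical alerting hook; swap in Slack/email/pager integration.
    print(f"Task {context['task_instance'].task_id} failed after all retries")

with DAG(
    dag_id="flaky_api_ingest",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule="@daily",
    catchup=False,
) as dag:
    call_api = PythonOperator(
        task_id="call_api",
        python_callable=lambda: None,  # placeholder for the real API call
        retries=5,
        retry_delay=timedelta(minutes=1),
        retry_exponential_backoff=True,        # 1m, 2m, 4m, ... between attempts
        max_retry_delay=timedelta(minutes=30),  # cap on the backoff
        on_failure_callback=notify_on_failure,
    )
```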
Explain Accumulator and Broadcast variables in Spark.
Approaches to handling multiple tasks within a sprint?
Cache() vs Persist(): Explain the difference and use cases for caching and persisting data in Spark with memory levels.
Can you explain dynamic resource allocation in Spark? How does it help optimize job performance?
Can you explain the concept of incremental loading in Sqoop and how to use it for job processing?
Can you give a use case where Delta Live Tables would be ideal?
Can you share a time when you had to shift focus due to urgent tasks?
Cluster Resource Allocation in Spark
Compare HDFS and cloud-based storage systems in terms of scalability and performance.
Compare ORC and Parquet
Compare Spark SQL and Hive performance.
Compare Spark and MapReduce for iterative workloads
Concatenate Columns in PySpark
Controlling mappers in MapReduce
Create a DataFrame with default column types
Explain data locality in Hadoop.
Databricks Cluster Management - standalone vs YARN mode
Databricks Job Cluster and SQL Endpoint - discuss Photon
Databricks notebooks vs. Fabric notebooks - differences
Databricks vs. PySpark?
Define Airflow and explain it as a workflow orchestration tool.
Defining Tasks in DAG
Explain the differences between Delta and Parquet.
Deploying DAGs
Describe a custom EMR cluster configuration for Spark-based ETL with minimal cost.
Describe building custom JARs for Spark jobs
Describe how to pass data between tasks in Airflow using XComs.
Describe the cluster configuration used in your project, including memory allocation, number of nodes, and executor/driver settings.
Describe the role of a workflow orchestrator like Airflow in a data pipeline.
Describe your approach to managing offsets in Kafka.
Discuss Delta Logs file format and its significance.
Discuss the process of moving files in Databricks File System (DBFS).
Executor vs Driver in Spark
Explain Bronze/Silver/Gold Layers.
Explain your approach to monitoring and logging Spark jobs in AWS. What tools would you use to identify performance bottlenecks?
How do you compare the time investment and value of a task?
How do you handle bad data in Databricks?
How do you handle failures in Airflow tasks, and what retry strategies can you use?
How do you handle schema evolution in Spark, especially when reading data from sources like Parquet or Avro?
How do you prioritize your tasks in a multi-project environment?
What is Sqoop incremental import?
Sqoop command for importing multiple tables
Suppose you have a DAG that ingests data from multiple databases. How would you increase task parallelism in Airflow to improve performance without overloading the system?
Suppose you need to import 5 tables from an external RDBMS (like MySQL) into Hadoop HDFS. Write the Sqoop command.
Task Dependencies in DAG
What are Hadoop commands for Get and Merge?
What are the advantages of using Dataproc over a traditional Hadoop setup?
What are the advantages of using Delta Lake over Parquet?
What are the differences between %pip and %conda commands in Databricks?
What are the different delivery semantics in Kafka (at least-once, at-most-once, exactly-once)?
What are the different modes in which you can submit Spark jobs? Explain each.
What are the performance considerations when using Auto Loader?
What are the steps to connect to Salesforce?
What are the steps to debug a failed workflow in Databricks?
What are the steps to execute a Python file with PySpark code on an EC2 environment?
Download the complete interview prep bundle with expert answers. Study offline, on your commute, anywhere.