Real questions from top companies · medium
What is a broadcast join, and why is it required?
What is a shuffle in Spark, and how do you handle it?
What is offset management in Kafka?
What is the advantage of caching in PySpark? When and why would you use it?
What is the command to import data from HDFS to Hive?
What is the difference between partitions and repartitions in Spark, and when do you use each?
What is the most common performance bottleneck in Spark jobs, and how would you resolve it?
What is the role of Zookeeper in Kafka?
What are the OPTIMIZE and REORG commands used for in Databricks?
What performance tuning techniques do you apply in both Sqoop and Spark to optimize their execution?
What role does executor memory and CPU configuration play in maximizing parallelism?
What strategies would you use to optimize Spark jobs for both performance and cost on AWS?
What techniques ensure deduplication in large datasets?
What's the difference between narrow and wide transformations?
When would you choose a broadcast join over a shuffle join? Any memory risks?
Which Spark property controls the number of shuffle partitions?
Write PySpark code to extract data from a CSV and create a table.
Write PySpark code to save a DataFrame in Parquet format to an S3 bucket.
Write a PySpark job that calculates the number of unique users who logged in per day, but exclude any logins from inactive users listed in a separate file.
Write a PySpark script to filter out invalid records from a dataset and calculate the average for a specific column, ensuring the schema is strictly defined at runtime.