Interview questions · medium
What is the difference between repartition and coalesce in Apache Spark?
What is the difference between narrow and wide transformations in Apache Spark? Explain with examples.
Explain the difference between Spark's map() and flatMap() transformations.
Explain the concept of Broadcast Join in Spark. When should it be used?
What is the difference between SQL and NoSQL databases?
Explain SQL Window Functions with examples.
Explain the use of the MERGE statement in SQL.
How do you handle NULL values in SQL? Mention functions like COALESCE and ISNULL.
How would you handle duplicate records in an SQL table?
What steps do you take to troubleshoot a slow-running Spark job?
Explain how you would use repartition or coalesce effectively to optimize processing when analyzing data only for a specific region.
How can you delete partitions from a table in Hive using a command?
If partition directories are created manually inside a Hive table's warehouse directory, will queries against those partitions return the data? If not, how can this be fixed?
What is the difference between static and dynamic partitioning in Hive?
Your Kafka consumer shows significant lag during peak hours. What strategies would you employ to reduce lag and ensure timely data processing?
Write a PySpark job that calculates the number of unique users who logged in per day, but exclude any logins from inactive users listed in a separate file.
Your Kafka producer schema has changed, and the new data includes additional fields. How would you ensure backward compatibility using Schema Registry while consuming data from the same topic?
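As a starting point for the unique-logins question above, here is a minimal plain-Python sketch of the core logic. It is not a PySpark job. The record layout (user ID plus ISO-8601 timestamp), the function name `unique_active_logins_per_day`, and the sample data are all assumptions made for illustration. In PySpark the same idea would typically be expressed as a `left_anti` join against the inactive-user list, followed by a `countDistinct` grouped by login date.

```python
from collections import defaultdict
from datetime import datetime


def unique_active_logins_per_day(logins, inactive_users):
    """Count distinct active users per calendar day.

    logins: iterable of (user_id, ISO-8601 timestamp) pairs (assumed layout).
    inactive_users: iterable of user IDs to exclude (assumed to be loaded
    from the separate file mentioned in the question).
    """
    inactive = set(inactive_users)      # set gives O(1) membership checks
    users_by_day = defaultdict(set)     # day -> set of distinct user IDs

    for user_id, timestamp in logins:
        if user_id in inactive:
            continue                    # drop logins from inactive users
        day = datetime.fromisoformat(timestamp).date().isoformat()
        users_by_day[day].add(user_id)  # the set de-duplicates repeat logins

    # Return the distinct-user count per day, ordered by date.
    return {day: len(users) for day, users in sorted(users_by_day.items())}


# Hypothetical sample data, for illustration only.
sample_logins = [
    ("u1", "2024-01-01T08:00:00"),
    ("u1", "2024-01-01T17:30:00"),  # repeat login, counted once
    ("u2", "2024-01-01T09:15:00"),
    ("u3", "2024-01-01T10:00:00"),  # inactive user, excluded
    ("u2", "2024-01-02T11:45:00"),
    ("u3", "2024-01-02T12:00:00"),  # inactive user, excluded
]
sample_inactive = ["u3"]

result = unique_active_logins_per_day(sample_logins, sample_inactive)
print(result)  # {'2024-01-01': 2, '2024-01-02': 1}
```

Filtering before grouping keeps the per-day sets small. In a distributed setting, a short inactive-user list is also a natural candidate for a broadcast join.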