Interview questions · hard
How does Spark's Catalyst Optimizer work? Explain its stages.
Have you worked on Data Warehousing projects?
What is the difference between OLTP and OLAP?
How do you optimize a long-running SQL query?
Name the tools and technologies you have worked with to date.
You need to create a workflow where Task B runs only if Task A succeeds, and Task C always runs regardless of the status of Task A or Task B. How would you define these dependencies in Airflow?
You need to design a Kafka topic for a logging service. How would you decide the number of partitions and the key for partitioning to balance throughput and ordering requirements?
A data pipeline processes files for different clients stored in separate directories. Explain how you would use dynamic DAG creation to handle client-specific workflows in Airflow.
Describe how you would optimize a join between two large tables where one is significantly smaller, using broadcast joins in PySpark.
Explain how you would use Kafka Connect to ingest data from a relational database into Kafka while ensuring minimal latency and exactly-once semantics.
Given a DataFrame with columns id and name, add a new column department: assign HR if id < 100, and admin if id >= 100 and id < 200.
If a consumer fails to process a message due to data corruption, describe how you would configure Kafka to handle retries and avoid message loss.
In Spark, what is the difference between cores and executors?
What is the difference between Pandas DataFrame and Spark DataFrame? When would you prefer using each?
When submitting Spark jobs, how does the process work in the backend? Explain.
Write a PySpark script to check for missing values and duplicate rows in a DataFrame. How would you ensure data quality before saving it to a storage system?