Interview questions
Name the tools and technologies you have worked with to date.
What steps do you take to troubleshoot a slow-running Spark job?
What would you do if you were assigned a task with a technology you've never used before?
Explain how you would use repartition or coalesce effectively to optimize processing when analyzing data only for a specific region.
How can you delete partitions from a table in Hive using a command?
If partition directories are created manually inside a Hive table's warehouse directory and you query records from those partitions, will you see the data? If not, how can this be fixed?
What is the difference between static and dynamic partitioning in Hive?
Write a SQL query to find the distinct IDs in a table that occur more than once and have a value greater than 200.
You need to create a workflow where Task B runs only if Task A is successful, and Task C should always run regardless of Task A or B's status. How would you define this dependency using Airflow?
You need to design a Kafka topic for a logging service. How would you decide the number of partitions and the key for partitioning to balance throughput and ordering requirements?
Your Kafka consumer shows significant lag during peak hours. What strategies would you employ to reduce lag and ensure timely data processing?
A JSON file with evolving schema needs to be ingested into a DataFrame. How would you handle new fields dynamically in PySpark without breaking the job for previous structures?
A data pipeline processes files for different clients stored in separate directories. Explain how you would use dynamic DAG creation to handle client-specific workflows in Airflow.
A task intermittently fails due to external API limitations. How would you configure Airflow retries and alerts to manage this situation efficiently?
Describe how you would optimize a join between two large tables where one is significantly smaller, using broadcast joins in PySpark.
Explain how you would use Kafka Connect to ingest data from a relational database into Kafka while ensuring minimal latency and exactly-once semantics.
Given a DataFrame with columns id and name, add a new column department: assign HR if id < 100, and admin if id >= 100 and id < 200.
If a consumer fails to process a message due to data corruption, describe how you would configure Kafka to handle retries and avoid message loss.
In Spark, what is the difference between cores and executors?
Suppose you have a DAG that ingests data from multiple databases. How would you increase task parallelism in Airflow to improve performance without overloading the system?
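For practice, here is a minimal sketch of the SQL question above, run against an in-memory SQLite database from Python. It assumes the question means IDs that appear in more than one row and whose value is greater than 200, which is only one possible reading. The table name records, its columns, and the sample rows are hypothetical and exist only to make the example runnable.

```python
import sqlite3

# Hypothetical table and sample rows, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO records (id, name) VALUES (?, ?)",
    [
        (150, "a"), (150, "b"),              # duplicated, but id is not above 200
        (250, "c"), (250, "d"),              # duplicated and above 200
        (300, "e"),                          # above 200, but appears only once
        (400, "f"), (400, "g"), (400, "h"),  # duplicated and above 200
    ],
)

# Assumed reading: keep IDs above 200, then keep only those seen more than once.
# GROUP BY yields one row per id, so the result is already distinct.
query = """
    SELECT id
    FROM records
    WHERE id > 200
    GROUP BY id
    HAVING COUNT(*) > 1
    ORDER BY id
"""
duplicate_ids = [row[0] for row in conn.execute(query)]
print(duplicate_ids)
conn.close()
```

Filtering on the ID value in WHERE and on the row count in HAVING keeps the two conditions separate: WHERE runs before grouping, HAVING runs after it.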