Interview questions
How does the join node work in an Oozie workflow?
How would you partition a table that holds card details and transactions?
How would you migrate from Teradata to Hadoop, and how would you handle SCD Type 2 data?
What is a Kafka topic, and how do you choose the number of partitions for it?
What is the role of a partition in Kafka, and how does it impact scalability?
Describe how to pass data between tasks in Airflow using XComs.
Explain the concept of RDD, DataFrame, and Dataset in PySpark.
Explain the concept of consumer groups in Kafka. How do they affect message processing?
Explain the difference between TriggerDagRunOperator and ExternalTaskSensor in Airflow.
How do you ensure data quality and consistency across different stages of a data pipeline?
How do you handle failures in Airflow tasks, and what retry strategies can you use?
How do you optimize a join operation in Spark for large datasets?
How would you design a Kafka-based pipeline for processing streaming data in real-time?
What methods can you use to avoid duplicates in PySpark or Scala?
When and how would you use UDFs?
What is a DAG in Apache Airflow, and how is it used for scheduling workflows?
Describe an end-to-end data pipeline project you worked on, highlighting your role and the technologies used.
Describe how Kafka ensures data durability and fault tolerance.
Introduce your recent project, explaining its goal, architecture, tools, and technologies.