Real questions from top companies · medium
Write a query to remove duplicate records from a table while retaining the earliest entry.
Write a query to retain only the latest record and delete others in case of duplicates.
Write a query to select the latest record based on a time_of_insertion column.
Write a self join query to get the manager's name for each employee.
Write an SQL query to find the top 3 performing products in each category.
Write code to find the third-highest salary in a dataset using Pandas.
Write optimized SQL queries involving window functions, CTEs, and joins.
Write queries combining Joins and Group By operations.
Your Kafka consumer shows significant lag during peak hours. What strategies would you employ to reduce lag and ensure timely data processing?
map() vs mapPartitions(): Highlight the difference between map (row-level transformation) and mapPartitions (partition-level transformation).
repartition() vs coalesce(): Explain when to use repartition() (full shuffle; can increase or decrease the number of partitions) vs coalesce() (reduces partitions while avoiding a full shuffle).
Accumulators - use as shared variable for write-only operations
Broadcast joins vs shuffle sort-merge joins: what is the difference, and when is each used?
Broadcast join - how it optimizes joins
Can you explain the concept of mappers in Spark, and how are they used in data transformations?
Code a simple PySpark job to read a JSON file, filter records, and write output in Parquet format.
Compare Spark's lineage recovery with Hadoop's block replication mechanism.
Daily tasks of a Data Engineer?
Data-Related Issues Encountered - handling skewed data
Describe how you would use PySpark to aggregate and summarize large transaction datasets.
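
As an illustration of how the first de-duplication question above (remove duplicates while retaining the earliest entry) might be approached, here is a minimal sketch using Python's built-in sqlite3 module. The `users` table, its `email` and `created_at` columns, and the rule that two rows are duplicates when they share an `email` are all assumptions made for the example; a real answer would depend on the table and database engine named by the interviewer. It also assumes an SQLite build with window-function support.

```python
import sqlite3

# Hypothetical schema: users(id, email, created_at).
# Rows sharing an email are treated as duplicates (an assumption for this sketch).
DEDUPE_SQL = """
DELETE FROM users
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (
                   PARTITION BY email
                   ORDER BY created_at, id
               ) AS rn
        FROM users
    )
    WHERE rn > 1
)
"""


def dedupe_keep_earliest(conn):
    """Delete every row except the earliest one per email; return rows deleted."""
    cur = conn.execute(DEDUPE_SQL)
    conn.commit()
    return cur.rowcount


def demo():
    """Build a small in-memory table, de-duplicate it, and return the result."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, created_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO users (email, created_at) VALUES (?, ?)",
        [
            ("a@example.com", "2024-01-02"),
            ("a@example.com", "2024-01-01"),
            ("b@example.com", "2024-01-03"),
            ("a@example.com", "2024-01-05"),
        ],
    )
    deleted = dedupe_keep_earliest(conn)
    rows = conn.execute(
        "SELECT email, created_at FROM users ORDER BY email"
    ).fetchall()
    conn.close()
    return deleted, rows
```

Changing the window's ordering to `ORDER BY created_at DESC, id DESC` would keep the latest row per group instead, which is one way to approach the second question in the list.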