Real interview questions on schema design, star and snowflake schemas, dimension and fact tables, and normalization vs denormalization for analytics.
Data modeling is foundational to analytics and data engineering. These questions cover dimensional modeling (star and snowflake schemas), fact and dimension tables, normalization vs denormalization tradeoffs, SCD types, and schema design for both OLTP and OLAP use cases. Essential for any data warehouse or lakehouse role.
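As a point of reference for the dimensional-modeling topics named above, here is a minimal sketch of a star schema: one fact table holding measures, joined to two dimension tables holding descriptive attributes. It uses Python's built-in sqlite3 module only so that it runs anywhere. The table names, column names, and sample rows (fact_sales, dim_product, dim_date) are hypothetical and chosen for illustration; a real warehouse would likely use a different engine and surrogate-key strategy.

```python
import sqlite3

# Hypothetical star schema: one fact table keyed to two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category     TEXT
    );
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,   -- e.g. 20240115
        year     INTEGER,
        month    INTEGER
    );
    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,            -- measure
        amount      REAL                -- measure
    );
""")

# Illustrative sample data (made up for this sketch).
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", [
    (1, "Laptop", "Electronics"),
    (2, "Desk", "Furniture"),
])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)", [
    (20240115, 2024, 1),
    (20240210, 2024, 2),
])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
    (20240115, 1, 2, 2000.0),
    (20240115, 2, 1, 300.0),
    (20240210, 1, 1, 1000.0),
])


def revenue_by_category_and_month(connection):
    """Sum the fact table's amount measure, grouped by dimension attributes."""
    query = """
        SELECT p.category, d.year, d.month, SUM(f.amount) AS revenue
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        JOIN dim_date    AS d ON d.date_key    = f.date_key
        GROUP BY p.category, d.year, d.month
        ORDER BY p.category, d.year, d.month
    """
    return connection.execute(query).fetchall()


rows = revenue_by_category_and_month(conn)
for row in rows:
    print(row)
```

In a snowflake variant of the same model, category would typically move out of dim_product into its own table that dim_product references, trading an extra join for less repeated data.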
What architecture are you following in your current project, and why?
CDC During Migration - explain approaches for real-time Change Data Capture
Explain the differences between a Data Lake and a Data Warehouse.
Explain the differences between Data Warehouse, Data Lake, and Delta Lake
What strategies can you use to handle skewed data in Spark?
Describe the difference between Spark RDDs, DataFrames, and Datasets.
Explain Fact and Dimension Tables with examples.
Explain the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN.
What is the difference between a list and a tuple in Python?
When would you choose a Snowflake schema over a Star schema?
Detail examples of inner, outer, left, and right joins.
Explain strategies for managing schema changes in PySpark over time.
Explain the Medallion Architecture (Bronze, Silver, Gold layers).
Have you worked on Data Warehousing projects?
How do you drop columns with null values in PySpark?
How do you ensure smooth communication between data scientists, business teams, and developers?
How would you read data from a web API? What steps would you follow after reading the data?
Prioritize Spark optimizations by impact and effort. Discuss partitioning strategy, caching policy, join selection, shuffle reduction, and when each becomes a scalability or cost bottleneck.
Tell me about a time when you faced a challenging situation at work and how you handled it.
What are the key components of AWS Glue, and how do they work together?
What is normalization and denormalization? When would you use each?
What is the difference between Managed and External Tables in Databricks?
What is the difference between OLTP and OLAP?
What is the difference between Spark RDDs, DataFrames, and Datasets?
What is the difference between SQL and NoSQL databases?
What is the purpose of the Bronze, Silver, and Gold layers in a data pipeline?
When and how do you use Broadcast Join?
When would you architecturally choose Dataset[T] over DataFrame in a Scala Spark pipeline, and what are the scalability and portability trade-offs? Include type-safety benefits vs. operational constraints.
Why should we hire you for this role?
A JSON file with evolving schema needs to be ingested into a DataFrame. How would you handle new fields dynamically in PySpark without breaking the job for previous structures?
ACID Properties
After cleaning, how would you store the transformed data into Delta Lake?
Agile methodologies used?
Alternatives to the Medallion Architecture
Apache Spark Architecture - RDD, DAG, cluster manager, driver node, worker node
Articulate the architectural decisions, scalability trade-offs, and cost implications of designing an AWS data platform. How would you justify Glue vs. EMR, Redshift vs. Athena, and when would each choice become cost-prohibitive at scale?
Azure Fabric in Cloud Architecture?
Bloom Filters in Spark projects - explain use case
Build an executive dashboard for reporting.
Business generates TBs of data daily. How would you design the data pipeline in Azure?
Can Schema Evolution lead to data inconsistencies? If so, how do you manage them?
Can you describe the role of user groups in setting up these policies?
Can you explain the trade-offs you made during the design process?
Can you give an example of processing nested JSON data using these functions?
Can you share a time when you had to shift focus due to urgent tasks?
Can you share a time you faced a significant challenge and how you overcame it?
Cloud Architecture - explain
Code a simple PySpark job to read a JSON file, filter records, and write output in Parquet format.
Command to Read JSON Data and Options