Write a PySpark script to check for missing values and duplicate rows in a DataFrame. How would you ensure data quality before saving it to a storage system?
Spark/Big Data · hard
2
Write a PySpark script to filter out invalid records from a dataset and calculate the average for a specific column, ensuring the schema is strictly defined at runtime.
Spark/Big Data · medium
3
Write a PySpark script to process data stored in Delta format and transform it into Parquet.
Spark/Big Data · medium
4
Write a PySpark script to read a CSV file, filter rows where the age column is less than 18, and write the result to a new CSV file.
Spark/Big Data · medium
5
Write a Spark job to count word occurrences from an S3 dataset.
Spark/Big Data · hard
6
Write a complete PySpark program from import statements to the stop statement, covering transformations and actions.
Spark/Big Data · medium
7
Write a transformation in PySpark to join and clean multiple raw input sources
Spark/Big Datamedium
8
Write code to read data from Delta Lake in S3 and perform an upsert based on a primary key.
Spark/Big Data · medium
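A sketch of a Delta Lake MERGE keyed on a primary key. It assumes the delta-spark package, a session configured with the Delta extensions, and an `s3a://` table path with credentials in place; the path and key columns are placeholders:

```python
# Sketch: upsert (MERGE) into a Delta table keyed on primary-key columns.

def build_merge_condition(keys, target_alias="t", source_alias="s"):
    """Build the MERGE ON clause for the given primary-key columns."""
    return " AND ".join(f"{target_alias}.{k} = {source_alias}.{k}" for k in keys)

def upsert(spark, updates_df, path, keys):
    """Merge updates_df into the Delta table at path (e.g. 's3a://bucket/table')."""
    from delta.tables import DeltaTable  # optional dependency, imported lazily

    target = DeltaTable.forPath(spark, path)
    (
        target.alias("t")
        .merge(updates_df.alias("s"), build_merge_condition(keys))
        .whenMatchedUpdateAll()      # existing keys: overwrite the row
        .whenNotMatchedInsertAll()   # new keys: insert the row
        .execute()
    )
```

`whenMatchedUpdateAll`/`whenNotMatchedInsertAll` cover the common upsert; per-column `set`/`values` maps can replace them when only some columns should change.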