Real questions from top companies · hard
Explain Spark Architecture: Driver, Executors, and Tasks.
Explain Spark transformations (lazy evaluation, wide vs narrow).
Explain Spark's execution process: Job/Stage/Task creation.
Explain Spark's narrow vs. wide transformations and when to use each.
Walk through a scenario-based Spark optimization problem and explain how you would troubleshoot the performance issues.
Explain aggregation functions in PySpark with examples and use cases.
Explain caching techniques in Databricks.
Explain data encryption in Databricks, both at rest and in transit.
Explain database drivers/connectors and their use cases.
Explain how Glue's Spark-based architecture handles data parallelism.
Explain how HDFS (Hadoop Distributed File System) stores data across nodes.
Explain how you handle performance optimizations, task scheduling, and DAG monitoring in Airflow.
Explain how Kafka handles real-time data streaming and guarantees message delivery.
Explain how Spark groups transformations into stages. What causes a stage boundary?
Explain how Spark handles data partitioning and the role of shuffles in performance tuning.
Explain how Spark processes a 500GB file, covering memory allocation, shuffles, and spillovers to disk.
Explain how spark.read.format("delta").load() works.
Explain how to overwrite a file stored in S3 using PySpark.
Explain how to schedule an automated task using Apache Airflow.
Explain how you would design a partition strategy for a large dataset in HDFS.