DataEngPrep.tech

medium | partition, python, spark | 1 min read

What is the difference between repartition and coalesce in Apache Spark?

BCG, Citi, Dunnhumby, Fragma Data Systems +3

medium | partition, sql, window | 0.8 min read

Write an SQL query to find the second-highest salary from an employee table.

Accenture, BCG, Cognizant, Incedo +2

medium | partition, spark | 0.7 min read

What is the difference between cache() and persist() in Spark? When would you use each?

Accenture, Coforge, Freecharge, Impetus +1

medium | partition, spark | 0.8 min read

What is the difference between groupByKey and reduceByKey in Spark?

Accenture, Capco, Coforge, Nagarro +1

medium | join, partition, python | 0.9 min read

What is the difference between narrow and wide transformations in Apache Spark? Explain with examples.

Coforge, Delivery Hero, Dunnhumby, Fragma Data Systems +1

hard | airflow, etl, join | 3.5 min read

What architecture are you following in your current project, and why?

Cognizant, HCL, Nagarro, Thoughtworks +1

Demonstrate the difference between DENSE_RANK() and RANK()

Capco, Impetus, KPMG, Wipro

medium | bigquery, partition, snowflake | 0.5 min read

Explain the differences between Data Warehouse, Data Lake, and Delta Lake

Fractal, KPMG, Matrix, Meesho

medium | join, partition | 0.5 min read

Explain the differences between Repartition and Coalesce. When would you use each?

Datametica, FedEx Dataworks, Nihilent, Presidio

medium | join, partition, spark | 0.5 min read

What is the difference between partitioning and bucketing in Spark, and when would you use bucketing?

Citi, Coforge, HCL, LTIMindtree

medium | join, partition, spark | 0.5 min read

What strategies can you use to handle skewed data in Spark?

BCG, Bitwise, Citi, HashedIn

hard | etl, join, partition | 0.5 min read

Briefly introduce yourself and walk us through your journey as a Data Engineer so far.

Accenture, EPAM, Yash Technologies

medium | join, partition | 0.7 min read

Describe a scenario where partitioning and bucketing would improve query performance.

Daniel Wellington, Goldman Sachs, Swiggy

Explain the types of triggers in ADF, including schedule, tumbling window, and event-based triggers.

FedEx Dataworks, Nihilent, Virtusa

medium | bigquery, partition | 0.6 min read

How do you remove duplicate rows in BigQuery?

EY, Incedo, Tech Mahindra

hard | join, partition, window | 0.7 min read

Joins and window functions - INNER, LEFT, RIGHT, FULL OUTER, ROW_NUMBER(), RANK(), DENSE_RANK()

Ford, KPMG, Nihilent

hard | join, optimization, partition | 3.2 min read

Can you explain the architecture of Apache Spark and its components?

Coforge, Freecharge, Nihilent

hard | optimization, partition, spark | 0.5 min read

Describe the difference between Spark RDDs, DataFrames, and Datasets.

Accenture, Fragma Data Systems

medium | partition, spark | 0.4 min read

Explain the difference between Spark's map() and flatMap() transformations.

Delivery Hero, Dunnhumby, Fragma Data Systems

hard | partition, spark | 0.5 min read

What is the small-file problem in Spark, and how do you solve it?

Daniel Wellington, Incedo, Swiggy

hard | join, optimization, partition | 0.5 min read

How do you optimize Spark jobs for better performance? Mention at least 5 techniques.

Fragma Data Systems, Presidio, Swiggy

medium | airflow, partition, spark | 0.7 min read

Tell me about a time when you faced a challenging situation at work and how you handled it.

Freecharge, Walmart

medium | join, partition, spark | 0.6 min read

What challenges did you face, and how did you tackle them?

Delivery Hero, Grover

medium | partition, spark | 0.7 min read

What would you do if a pipeline failed and you couldn't find the reason?

Delivery Hero, Grover

hard | bigquery, join, optimization | 3 min read

What is Snowflake's architecture, and why is it unique?

EY, Incedo, Tech Mahindra

hard | join, optimization, partition | 3 min read

Briefly explain the architecture of Kafka.

Delivery Hero, Grover

hard | join, optimization, partition | 3 min read

Describe the data pipeline architecture you've worked with.

Fragma Data Systems, Grover

hard | bigquery, etl, optimization | 0.7 min read

Have you worked on Data Warehousing projects?

Aarete, Dunnhumby

medium | etl, partition | 0.7 min read

How would you read data from a web API? What steps would you follow after reading the data?

Altimetrik, Infosys

hard | bigquery, partition, snowflake | 0.6 min read

Retrieve the most recent sale_timestamp for each product (Latest Transaction).

Presidio, Swiggy

hard | bigquery, etl, join | 0.7 min read

What is the difference between OLTP and OLAP?

Aarete, Dunnhumby, Fragma Data Systems

hard | bigquery, optimization, partition | 0.6 min read

Difference Between Internal and External Tables in BigQuery

EY, Incedo, Tech Mahindra

medium | partition | 0.6 min read

Difference between ROW_NUMBER(), RANK(), and DENSE_RANK() with examples.

Presidio, Swiggy

medium | join, partition, sql | 0.5 min read

Explain SQL Window Functions with examples.

medium | bigquery, join, partition | 0.5 min read

Explain the use of the MERGE statement in SQL.

How do you optimize a long-running SQL query?

medium | partition, sql | 0.6 min read

How would you handle duplicate records in an SQL table?

Implement a query to find the top 5 customers by total sales amount.

Daniel Wellington, Goldman Sachs, Swiggy

medium | partition, sql, window | 0.4 min read

SQL query to find the second highest salary from each department.

Accenture, Yash Technologies

medium | partition, sql, window | 0.5 min read

Write an SQL query to find duplicate emails in a users table.

Daniel Wellington, Goldman Sachs, Swiggy

Triggers in ADF, especially tumbling window triggers.

Accenture, Yash Technologies

medium | join, partition, window | 0.5 min read

What is a window function? Explain with an example.

Citi, Freecharge

medium | partition, sql | 0.4 min read

Write a SQL query to find top 3 earners in each department.

FedEx Dataworks, Incedo

medium | partition, window | 0.4 min read

Write a query to find the top three highest-paid employees in each department using window functions.

Bristol Myers Squibb, Wipro

medium | join, partition, sql | 0.7 min read

Write complex SQL queries involving multiple joins, subqueries, and data aggregation logic.

Apple, Tiger Analytics

medium | partition, python, spark | 0.8 min read

Convert complex SQL (CTEs, window functions, subqueries) to production-grade PySpark. Discuss when to use spark.sql() vs. DataFrame API, and the implications for testability, partitioning, and execution predictability.

Datametica, S&P Global

hard | join, optimization, partition | 2.9 min read

Design a cost-aware resource strategy for a Databricks workload with spiky and batch jobs. Explain Dynamic Resource Allocation, when to disable it, and how min/max executors and spot instances affect cost and SLAs.

LTIMindtree, PWC

hard | join, optimization, partition | 3 min read

Design an anti-skew strategy for a join on a high-cardinality key with a long-tail distribution (e.g., a few keys hold 80% of rows). Cover salting, split-skew, AQE, and cost/operational trade-offs.

Fragma Data Systems, Matrix

Prioritize Spark optimizations by impact and effort. Discuss partitioning strategy, caching policy, join selection, shuffle reduction, and when each becomes a scalability or cost bottleneck.

Freecharge, Snowflake

medium | join, partition, spark | 0.6 min read

Explain how Adaptive Query Execution changes the economics of Spark tuning. What problems does it solve at runtime, and when might you still need manual intervention (e.g., salting, broadcast hints)?

FedEx Dataworks, PWC

Walk through the three AQE features in Spark 3.x (coalesce, join switch, skew join): how they operate at shuffle boundaries, which configs enable them, and what happens when AQE cannot help.

HashedIn, Snowflake

hard | join, optimization, partition | 2.7 min read

Explain wide vs. narrow transformations and how they drive shuffle cost, failure domains, and pipeline design. When would you intentionally add a wide transformation, and how do you minimize its impact?

FedEx Dataworks, Zen Data Shastra

hard | join, optimization, partition | 2.6 min read

Design a Delta table layout for a mixed workload: point lookups by user_id, range scans by date, and full partition scans. Compare partitioning vs. Z-ordering: when to use each, and the rewrite cost trade-off.

BCG, Incedo

hard | optimization, partition, spark | 0.9 min read

Architecturally, how do Job–Stage–Task boundaries in Spark's execution model impact cluster sizing, shuffle cost, and when would you deliberately collapse or split stages?

FedEx Dataworks, Freight Tiger

hard | join, optimization, partition | 2.5 min read

Design a fault-tolerant Spark Streaming checkpoint strategy: what to persist, recovery semantics, and cost/scalability trade-offs with checkpoint frequency.

Meesho, TCS

medium | partition | 1 min read

Architect incremental load in ADF + Databricks with idempotency, late-arrival handling, and cost/scalability implications of watermark vs. change data capture.

Deloitte, Incedo

medium | partition, spark | 0.8 min read

Explain strategies for managing schema changes in PySpark over time.

Accenture, Yash Technologies

hard | join, optimization, partition | 2.6 min read

Explain the Medallion Architecture (Bronze, Silver, Gold layers).

Chubb, Kaseya

Explain the benefits of using DataFrames over RDDs.

Fragma Data Systems, Yash Technologies