DataEngPrep.tech

hard · airflow · etl · join · 3.5 min read

What architecture are you following in your current project, and why?

Cognizant, HCL, Nagarro, Thoughtworks, +1

hard · bigquery · optimization · snowflake · 0.4 min read

What is a Common Table Expression (CTE), and when would you use it?

Accenture, Cognizant, EPAM, Yash Technologies

hard · join · optimization · partition · 3.2 min read

Can you explain the architecture of Apache Spark and its components?

Coforge, Freecharge, Nihilent

hard · optimization · partition · spark · 0.5 min read

Describe the difference between Spark RDDs, DataFrames, and Datasets.

Accenture, Fragma Data Systems

hard · join · optimization · spark · 0.5 min read

How does Spark's Catalyst Optimizer work? Explain its stages.

Dunnhumby, Fragma Data Systems, HashedIn

hard · join · optimization · partition · 0.5 min read

How do you optimize Spark jobs for better performance? Mention at least 5 techniques.

Fragma Data Systems, Presidio, Swiggy

hard · bigquery · join · optimization · 3 min read

What is Snowflake's architecture, and why is it unique?

EY, Incedo, Tech Mahindra

hard · join · optimization · partition · 3 min read

Briefly explain the architecture of Kafka.

Delivery Hero, Grover

hard · join · optimization · partition · 3 min read

Describe the data pipeline architecture you've worked with.

Fragma Data Systems, Grover

hard · bigquery · etl · optimization · 0.7 min read

Have you worked on Data Warehousing projects?

Aarete, Dunnhumby

hard · bigquery · optimization · partition · 0.6 min read

Difference Between Internal and External Tables in BigQuery

EY, Incedo, Tech Mahindra

How do you optimize a long-running SQL query?

Aarete, Dunnhumby, Incedo

hard · join · optimization · partition · 2.9 min read

Design a cost-aware resource strategy for a Databricks workload with spiky and batch jobs. Explain Dynamic Resource Allocation, when to disable it, and how min/max executors and spot instances affect cost and SLAs.

LTIMindtree, PWC

hard · join · optimization · partition · 3 min read

Design an anti-skew strategy for a join on a high-cardinality key with a long-tail distribution (e.g., a few keys hold 80% of rows). Cover salting, split-skew, AQE, and cost/operational trade-offs.

Fragma Data Systems, Matrix

Prioritize Spark optimizations by impact and effort. Discuss partitioning strategy, caching policy, join selection, shuffle reduction, and when each becomes a scalability or cost bottleneck.

Freecharge, Snowflake

Walk through the three AQE features in Spark 3.x (coalesce, join switch, skew join)—how they operate at shuffle boundaries, which configs enable them, and what happens when AQE cannot help.

HashedIn, Snowflake

hard · join · optimization · partition · 2.7 min read

Explain wide vs. narrow transformations and how they drive shuffle cost, failure domains, and pipeline design. When would you intentionally add a wide transformation, and how do you minimize its impact?

FedEx Dataworks, Zen Data Shastra

hard · join · optimization · partition · 2.6 min read

Design a Delta table layout for mixed workload: point lookups by user_id, range scans by date, and full partition scans. Compare partitioning vs. Z-ordering—when to use each, and the rewrite cost trade-off.

BCG, Incedo

hard · optimization · partition · spark · 0.9 min read

Architecturally, how do Job–Stage–Task boundaries in Spark's execution model impact cluster sizing, shuffle cost, and when would you deliberately collapse or split stages?

FedEx Dataworks, Freight Tiger

hard · join · optimization · partition · 2.5 min read

Design a fault-tolerant Spark Streaming checkpoint strategy: what to persist, recovery semantics, and cost/scalability trade-offs with checkpoint frequency.

Meesho, TCS

hard · join · optimization · partition · 2.6 min read

Explain the Medallion Architecture (Bronze, Silver, Gold layers).

Chubb, Kaseya

Explain the benefits of using DataFrames over RDDs.

Fragma Data Systems, Yash Technologies

How do you optimize Spark jobs for performance?

Fragma Data Systems, Presidio

hard · join · optimization · partition · 0.7 min read

What are the key components of the Spark execution model (Job, Stage, Task)?

FedEx Dataworks, Freight Tiger

hard · join · optimization · spark · 0.7 min read

What is Spark's Catalyst Optimizer? Explain its stages.

Dunnhumby, Fragma Data Systems

hard · optimization · partition · python · 0.6 min read

What is the difference between Spark RDDs, DataFrames, and Datasets?

Accenture, Fragma Data Systems

hard · airflow · optimization · 0.6 min read

How do you stay updated with the latest trends and technologies in data engineering?

Presidio, Swiggy

hard · optimization · partition · 0.6 min read

Describe a time when you had to deal with a difficult coworker.

Persistent Systems

hard · airflow · optimization · 0.5 min read

How do you stay updated with the latest trends and technologies in data engineering?

Presidio

hard · join · optimization · partition · 1.2 min read

Discuss the data size challenges in your previous projects. How did you optimize storage and processing?

American Express

hard · optimization · partition · spark · 1.5 min read

Explain the projects you have worked on, focusing on challenges and solutions you implemented.

Capgemini

hard · airflow · etl · optimization · 1 min read

Explain your journey as a data engineer and the projects you have worked on.

Capgemini

hard · optimization · partition · spark · 0.6 min read

ADF Optimization Techniques?

Deloitte

Azure Fabric in Cloud Architecture?

Comcast

Business generates TBs of data daily. How would you design the data pipeline in Azure?

Cognizant

hard · join · lakehouse · optimization · 3.6 min read

Data Lakehouse architecture in Azure?

Persistent Systems

hard · bigquery · optimization · partition · 0.6 min read

Could you describe a specific cost optimization strategy you implemented in the cloud and its results?

Walmart

Design an end-to-end data pipeline using Glue, Lambda, EC2, S3, Redshift, and Athena.

Carelon

Design: Migrate data from multiple sources (Hadoop, S3, Oracle DB) into a final S3 bucket

PayPal

hard · optimization · partition · spark · 0.3 min read

Explain steps to optimize data read performance from cloud storage (S3 or Azure Blob).

Fragma Data Systems

Explain the purpose and architecture of Azure Synapse Analytics.

Fractal

hard · etl · optimization · partition · 0.2 min read

Glue ETL optimization: Performance improvement strategies?

Daniel Wellington

hard · optimization · 0.3 min read

How did you contribute to cost optimization initiatives while working with cloud technologies?

Walmart

hard · optimization · 0.2 min read

How do you handle cost optimization in AWS EMR clusters?

Persistent Systems

How would you design a data pipeline using AWS Glue, S3, and Redshift?

Wipro

In AWS Data Pipeline, how would you design a process to copy only recently modified files from one S3 bucket to another?

EPAM

hard · optimization · partition · 0.4 min read

What are the pricing models for queries in Athena?

Capco

hard · optimization · spark · sql · 0.5 min read

What is Azure Data Lake Storage (ADLS) Gen2, and how does it differ from Blob Storage?

Fractal

hard · bigquery · etl · optimization · 0.3 min read

What is your experience with cloud technologies?

Thoughtworks

hard · optimization · partition · 0.4 min read

What techniques do you use to balance compute costs and performance in cloud-based data solutions?

Swiggy

hard · join · optimization · partition · 0.4 min read

What types of queries would not be efficient in Athena?

Capco

Can you explain the trade-offs you made during the design process?

Thoughtworks

hard · optimization · partition · 0.1 min read

Data Storage and Retrieval Optimization techniques

Lumiq

hard · etl · join · optimization · 0.5 min read

Describe the ZS projects you worked on

Meesho

Designing Mixpanel - event-driven analytics platform

Walmart

hard · join · optimization · partition · 0.2 min read

Discuss Logical Plan vs Physical Plan

KPMG

hard · optimization · 0.2 min read

Explain your roles and responsibilities in your current project

Nihilent

How did you ensure scalability and reliability in your design?

Thoughtworks

hard · optimization · partition · 0.2 min read

How do you increase job performance? What techniques and optimizations?

Hexaware