Question 1

Tell me about yourself and your experience.

Accepted Answer

**Situation**: I joined the data org when our pipelines were monolithic, causing 4+ hour delays and frequent outages affecting downstream dashboards and ML models.

**Task**: I was tasked with redesigning the data platform to support real-time decisioning while improving reliability and cost efficiency.

**Action**: I led a cross-functional team of 5 engineers to architect a medallion (Bronze/Silver/Gold) architecture on Delta Lake....

Question 2

Explain the concept of checkpointing in Spark and why it is important.

Accepted Answer

Checkpointing truncates RDD lineage by materializing the data to reliable storage (such as HDFS), trading storage I/O for a shorter DAG. **Why it exists**: Long lineage (e.g., 1000+ stages in iterative algorithms like PageRank or ML training loops) can cause a stack overflow in the driver and makes fault recovery expensive, because Spark would have to replay the entire DAG. **Scalability trade-off**: A checkpoint is a full write; at scale, this is costly (network, disk). For batch jobs with short lineage (roughly under 50 stages, as a rule of thumb), prefer persist/cache, which keeps the lineage and can recompute lost partitions....
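As a minimal sketch of the iterative case, the helper below cuts lineage every few iterations. The names `run_iterations`, `step`, and `checkpoint_every` are hypothetical and not part of the original answer; the code assumes `df` is a Spark DataFrame and that a checkpoint directory was already set with `spark.sparkContext.setCheckpointDir(...)`.

```python
from typing import Callable, TypeVar

DF = TypeVar("DF")  # stands in for pyspark.sql.DataFrame


def run_iterations(df: DF, step: Callable[[DF], DF], n_iter: int,
                   checkpoint_every: int = 10) -> DF:
    """Apply `step` n_iter times, checkpointing periodically to cut lineage.

    Assumes spark.sparkContext.setCheckpointDir(...) was called beforehand;
    DataFrame.checkpoint() fails without a checkpoint directory.
    """
    if checkpoint_every < 1:
        raise ValueError("checkpoint_every must be >= 1")
    for i in range(1, n_iter + 1):
        df = step(df)
        if i % checkpoint_every == 0:
            # Eager by default: writes the data out and returns a DataFrame
            # whose plan starts from the checkpointed files.
            df = df.checkpoint()
    return df
```

A smaller `checkpoint_every` keeps the plan shorter at the cost of more full writes, which is the trade-off described above.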

Question 3

How do you drop columns with null values in PySpark?

Accepted Answer

Two distinct operations: (1) **Drop columns that are entirely null** (no non-null values): null_cols = [c for c in df.columns if df.filter(col(c).isNotNull()).count() == 0]; df = df.drop(*null_cols). This needs from pyspark.sql.functions import col. **Caveat**: each count() is a separate Spark job that scans the data, so N columns cost N scans, which is expensive on large tables. (2) **Drop rows with null in specified columns**: df.dropna(subset=["col1", "col2"]) removes every row where any of the listed columns is null. **Scalability**: The per-column null check grows with the number of columns times the data size; consider computing all non-null counts in a single aggregation, or sampling or inferring from schema/sample....
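One way to avoid a scan per column is to compute every non-null count in a single aggregation. The sketch below is one possible approach, not the original author's method; `all_null_columns` and `drop_all_null_columns` are hypothetical helper names, and the code assumes top-level columns only.

```python
from typing import Dict, List


def all_null_columns(non_null_counts: Dict[str, int]) -> List[str]:
    """Return the columns whose non-null count is zero."""
    return [name for name, n in non_null_counts.items() if n == 0]


def drop_all_null_columns(df):
    """Drop columns of a Spark DataFrame that contain only nulls.

    One aggregation job computes every column's non-null count,
    instead of one filter().count() job per column.
    """
    # Imported lazily so all_null_columns() can be used without Spark installed.
    from pyspark.sql import functions as F

    # F.count(column) counts non-null values only.
    # Backticks guard column names that contain dots or spaces.
    counts = df.select(
        [F.count(F.col(f"`{c}`")).alias(c) for c in df.columns]
    ).first().asDict()
    to_drop = all_null_columns(counts)
    return df.drop(*to_drop) if to_drop else df
```

The aggregation still reads the full dataset once, so the cost is a single scan rather than one per column.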

Question 4

Explain Dynamic Partition Pruning error and how to fix it.

Accepted Answer

**Architectural Logic**: DPP in Spark pushes dimension-table filters down to the fact table scan so that irrelevant partitions are skipped; a type mismatch or a missing broadcast can prevent it from applying. **Mechanism**: When the dimension side is filtered and broadcast, Spark reuses the filtered join keys as a partition filter on the fact table. **Error Causes**: Partition column type mismatch (e.g., INT vs STRING); filter not pushed down; broadcast hint missing. **Fix**: (1) Align join key types with the partition column. (2) Use `broadcast(dim_df)` for a small dimension....

Question 5

How do you convert 3 rows into one column in SQL?

Accepted Answer

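**STRING_AGG** (PostgreSQL): SELECT STRING_AGG(col, ', ' ORDER BY id) FROM (SELECT id, col FROM t ORDER BY id LIMIT 3) sub. The subquery must also select id, since the outer ORDER BY references it, and it needs its own ORDER BY so that LIMIT 3 picks the same rows on every run. **LISTAGG** (Oracle), **GROUP_CONCAT** (MySQL), **ARRAY_AGG** (returns an array instead of a string). For exactly 3 rows: LIMIT 3 in the subquery. **Determinism**: ORDER BY in agg....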
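The same idea can be tried locally with Python's built-in sqlite3 module, where `GROUP_CONCAT` stands in for `STRING_AGG`. This is a sketch under assumptions: the table `t`, its sample rows, and the helper `first_three_joined` are made up for illustration. SQLite versions before 3.44 do not accept `ORDER BY` inside the aggregate, so the ordering lives in the subquery here; engines that support in-aggregate ordering should prefer it.

```python
import sqlite3


def first_three_joined(conn: sqlite3.Connection) -> str:
    """Concatenate `col` from the three lowest-id rows of table `t`."""
    row = conn.execute(
        """
        SELECT GROUP_CONCAT(col, ', ')
        FROM (SELECT col FROM t ORDER BY id LIMIT 3) AS sub
        """
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, col TEXT)")
conn.executemany(
    "INSERT INTO t (id, col) VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")],
)

# Three input rows collapse into a single string value.
combined = first_three_joined(conn)
print(combined)
```

The `ORDER BY id LIMIT 3` in the subquery fixes which three rows are aggregated, so the fourth row never contributes.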

Question 6

How do you count occurrences in a column in SQL?

Accepted Answer

**Per value**: SELECT col, COUNT(*) FROM t GROUP BY col. **Specific value**: SELECT COUNT(*) FROM t WHERE col = 'X'. **Non-null**: COUNT(col) excludes NULLs; COUNT(*) counts all. **Distinct**: COUNT(DISTINCT col). **Conditional**: COUNT(CASE WHEN cond THEN 1 END)....
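A small sqlite3 session makes the differences between these counts concrete. The table `t` and its five sample rows (three 'X', one 'Y', one NULL) are assumptions chosen for illustration. The conditional form works because a CASE without ELSE yields NULL, which COUNT skips.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col TEXT)")
conn.executemany(
    "INSERT INTO t (col) VALUES (?)",
    [("X",), ("Y",), ("X",), (None,), ("X",)],
)


def scalar(sql: str) -> int:
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


total_rows = scalar("SELECT COUNT(*) FROM t")               # every row, NULLs included
non_null = scalar("SELECT COUNT(col) FROM t")               # rows where col is not NULL
distinct_vals = scalar("SELECT COUNT(DISTINCT col) FROM t")  # distinct non-NULL values
x_rows = scalar("SELECT COUNT(*) FROM t WHERE col = 'X'")
x_conditional = scalar("SELECT COUNT(CASE WHEN col = 'X' THEN 1 END) FROM t")

# GROUP BY puts NULLs in their own group, keyed by None on the Python side.
per_value = dict(conn.execute("SELECT col, COUNT(*) FROM t GROUP BY col").fetchall())
```

The WHERE form and the conditional form agree here; the conditional form is useful when several different counts are needed from one pass over the table.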

Question 7

How do you keep a specific column on top in SQL?

Accepted Answer

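**SELECT**: List the column first: SELECT priority_col, col2, col3 FROM t. **View**: CREATE VIEW v AS SELECT priority_col, col2, col3 FROM t. **Table**: The SQL model guarantees no physical column order, and SELECT * returns columns in the order the table declared them, so name columns explicitly rather than relying on it. Presentation layer (BI, app) controls the final display order.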
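The effect of a view on column order can be checked with sqlite3. In this sketch the base table deliberately declares `priority_col` last; the table layout and the helper `column_names` are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# priority_col is deliberately declared last in the base table.
conn.execute("CREATE TABLE t (col2 TEXT, col3 TEXT, priority_col TEXT)")
conn.execute("CREATE VIEW v AS SELECT priority_col, col2, col3 FROM t")


def column_names(sql: str) -> list:
    """Return result-set column names in the order the query yields them."""
    cursor = conn.execute(sql)
    return [d[0] for d in cursor.description]


table_order = column_names("SELECT * FROM t")  # follows the CREATE TABLE order
view_order = column_names("SELECT * FROM v")   # follows the view's SELECT list
```

Consumers that query `v` see `priority_col` first without the base table being rebuilt.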

Question 8

Explain read and write modes in Spark.

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

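Read modes: spark.read loads data; for CSV and JSON sources, the mode option controls how malformed records are handled: (1) PERMISSIVE (the default) keeps the row and sets fields it cannot parse to null; (2) DROPMALFORMED drops the row; (3) FAILFAST raises an error on the first malformed record. Write modes: (1) overwrite: replaces the target; (2) append: adds data; (3) ignore: no-op if the target exists; (4) errorIfExists (the default): fails if the target exists. Example: df.write.mode('append').parquet(path). For Delta: the overwriteSchema option allows a schema change on overwrite; replaceWhere performs a conditional overwrite....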
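A thin wrapper can make the save mode explicit at every call site instead of relying on the default. This is a sketch, not a Spark API: `write_parquet` and `VALID_WRITE_MODES` are hypothetical names, and `df` is assumed to be a Spark DataFrame.

```python
VALID_WRITE_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}


def write_parquet(df, path: str, mode: str = "errorifexists") -> None:
    """Write a Spark DataFrame as Parquet with an explicit save mode.

    The default mirrors Spark's own default: fail if the target exists.
    """
    if mode.lower() not in VALID_WRITE_MODES:
        raise ValueError(f"unsupported write mode: {mode!r}")
    # DataFrameWriter.mode() returns the writer, so the calls chain.
    df.write.mode(mode).parquet(path)
```

Rejecting unknown modes before calling Spark surfaces typos such as 'overwite' without starting a job.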

Question 9

How do you convert an array column to multiple columns in PySpark?

Accepted Answer

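Array to columns: from pyspark.sql.functions import col, first, posexplode; for a fixed number of elements, index into the array: df.select(col('array_col')[0].alias('col1'), col('array_col')[1].alias('col2')). For a variable length, explode with positions and pivot back: df.select('id', posexplode('array_col')).groupBy('id').pivot('pos').agg(first('col')). The pivot needs the id column kept in the select and an aggregation after pivot()....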
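The fixed-length indexing approach generalizes to any number of elements with a list comprehension. The helper below is a sketch: `array_to_columns`, `prefix`, and `keep` are hypothetical names, `df` is assumed to be a Spark DataFrame with an `id` column, and every array is assumed to hold at least `n` elements.

```python
def array_to_columns(df, array_col: str, n: int, prefix: str = "col", keep=("id",)):
    """Expand the first n elements of an array column into n scalar columns.

    df[array_col] yields a Column, and indexing a Column with an integer
    selects that array element, so no pyspark import is needed here.
    """
    elements = [df[array_col][i].alias(f"{prefix}{i + 1}") for i in range(n)]
    # Keep the identifying columns, then append col1..coln.
    return df.select(*keep, *elements)
```

For example, `array_to_columns(df, 'array_col', 3)` would select `id`, `col1`, `col2`, and `col3`.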

Question 10

How does Adaptive Query Execution (AQE) work?

Accepted Answer

Adaptive Query Execution (AQE) re-optimizes Spark query plans at runtime using statistics collected from completed shuffle stages. Introduced in Spark 3.0 and enabled by default since Spark 3.2 (`spark.sql.adaptive.enabled`), it: (1) Coalesces partitions after a shuffle based on actual data sizes (`spark.sql.adaptive.coalescePartitions.enabled`). (2) Converts a Sort-Merge Join to a Broadcast Join when runtime stats show a join side is small....
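The settings involved can be collected in one place and applied when the session is built. This is a sketch under assumptions: `AQE_SETTINGS` and `build_session` are hypothetical names, and the 10 MB broadcast threshold is Spark's documented default, shown only for illustration.

```python
AQE_SETTINGS = {
    # Master switch for adaptive query execution.
    "spark.sql.adaptive.enabled": "true",
    # Merge small post-shuffle partitions based on observed sizes.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Size limit (bytes) below which a join side may be broadcast.
    "spark.sql.autoBroadcastJoinThreshold": str(10 * 1024 * 1024),
}


def build_session(app_name: str = "aqe-demo"):
    """Create (or reuse) a SparkSession with the AQE settings applied."""
    # Imported lazily so the settings can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in AQE_SETTINGS.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Keeping the keys in a dictionary makes it straightforward to log or diff the effective configuration between jobs.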

Globant Data Engineer Interview Questions

