Question 1

Describe how metadata is stored and accessed for internal tables in a relational database.

Accepted Answer

**Architectural Logic**: Metadata drives optimization and discovery; system catalogs are the source of truth. **Storage**: pg_catalog (PostgreSQL), information_schema (standard). Tables: pg_class (tables, indexes), pg_attribute (columns), pg_index. **Access**: `SELECT * FROM information_schema.tables WHERE table_schema='public'`. **Why It Matters**: Optimizer uses statistics for plans; tools use for discovery and lineage....

Question 2

Does a Common Table Expression store data? If not, how does it function in SQL?

Accepted Answer

**Architectural Logic**: CTEs are inline view definitions, not materialized storage. **Mechanism**: `WITH cte AS (SELECT ...) SELECT * FROM cte`—CTE is a named subquery; optimizer may inline, materialize, or optimize. Evaluated when referenced (possibly multiple times—engine-dependent). **Why No Storage**: No persistent or temp table; scope is query. **Use Cases**: Readability; recursive queries; breaking complex logic....

Question 3

Find the second-highest salary in the employees table using three different methods.

Accepted Answer

**Method 1 (Subquery)**: SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM employees). **Method 2 (LIMIT/OFFSET)**: SELECT DISTINCT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 1—single value; ties pick one. **Method 3 (Window)**: SELECT salary FROM (SELECT salary, ROW_NUMBER() OVER (ORDER BY salary DESC) rn FROM employees) t WHERE rn = 2....

Question 4

How would you optimize a SQL query for better performance when working with large datasets?

Accepted Answer

SQL query optimization for large datasets: (1) Indexing—create indexes on filter and join columns; avoid over-indexing on write-heavy tables. (2) Partitioning—partition by date, region, or key columns to enable partition pruning. (3) Avoid SELECT *—select only needed columns. (4) Push filters early—apply WHERE before JOINs. (5) Replace subqueries with JOINs or CTEs. (6) Use EXPLAIN to analyze execution plans. (7) Denormalize where read performance outweighs storage....

Question 5

What is the purpose of Delta format, and how does it differ from Parquet in terms of storage and querying?

Accepted Answer

Delta format: Built on Parquet with transaction log (ACID), time travel, and upserts. Supports MERGE, DELETE, UPDATE. Parquet: Immutable columnar format; append-only; no built-in ACID. Delta adds: schema enforcement, compaction, change data feed. Storage: Both columnar; Delta adds _delta_log. Querying: Delta supports point-in-time reads; Parquet is file-atom. Use Delta for mutable tables, pipelines needing upserts; Parquet for immutable batches....

Question 6

PySpark Coding Challenge: Transform input dataset with columns id, dob, name to add age, firstname, lastname

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Transform id, dob, name → age, firstname, lastname: `from pyspark.sql.functions import datediff, current_date, split`; `df.withColumn('age', datediff(current_date(), col('dob'))/365)`. `df.withColumn('name_split', split(col('name'),' ')).withColumn('firstname', col('name_split')[0]).withColumn('lastname', col('name_split')[1])`....

Question 7

What is the advantage of caching in PySpark? When and why would you use it?

Accepted Answer

**Advantage**: Avoid recomputation. Same DF used multiple times = one compute, many reads. Critical for iterative algorithms (ML), multi-action pipelines.

**When**: (1) DF reused 2+ times. (2) Iterative (e.g., loop with same lookup). (3) Small dimension joined repeatedly.

**Storage Levels**: MEMORY_ONLY (fast, evictable), MEMORY_AND_DISK (spill), MEMORY_ONLY_SER (smaller). Cache = MEMORY_AND_DISK.

**Why Care**: Recompute of 1TB = hours....

Question 8

Write a PySpark script to process data stored in Delta format and transform it into Parquet.

Accepted Answer

**Why It Matters (Architectural Logic)**: Delta→Parquet conversion enables format migration while preserving ACID and time-travel during transition. Partition preservation is critical for query performance.

Reading Delta and writing Parquet is straightforward with Delta Lake's native APIs. Read: `delta_df = spark.read.format("delta").load("s3://bucket/path/to/delta")`. Write Parquet: `delta_df.write.format("parquet").mode("overwrite").save("s3://bucket/output/parquet")`....

Tredence Data Engineer Interview Questions

Difficulty Breakdown

Key Topics Covered

How to Use This Guide

Companies asking these questions

All 8 Questions

More Interview Prep Guides

Unlock All Expert Answers