Question 1

Tell me about yourself and your experience.

Accepted Answer

**Situation**: I joined the data org when our pipelines were monolithic, causing 4+ hour delays and frequent outages affecting downstream dashboards and ML models.

**Task**: I was tasked with redesigning the data platform to support real-time decisioning while improving reliability and cost efficiency.

**Action**: I led a cross-functional team of 5 engineers to architect a medallion (Bronze/Silver/Gold) architecture on Delta Lake....

Question 2

What is the difference between SparkSession and SparkContext in Spark?

Accepted Answer

**SparkContext** (the original entry point, from Spark 1.x): Low-level entry point for RDD operations. Manages the cluster connection, configuration, and RDD creation. Only one SparkContext can be active per JVM. It exposes the RDD API only.

**SparkSession** (Spark 2.0+): Unified entry point that wraps a SparkContext (exposed as `spark.sparkContext`) and replaces SQLContext and HiveContext. Provides the DataFrame, Dataset, SQL, and Structured Streaming APIs; the older DStream-based StreamingContext remains a separate class....

Question 3

What are traits in Scala, and how are they different from classes?

Accepted Answer

**Traits**: Interface-like constructs that can define both abstract and concrete methods and fields. A class can mix in several traits (`class C extends A with B`), which gives multiple inheritance of type.

**Classes**: Define objects with state and behavior. Single inheritance; one superclass.

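**Key Differences**: A class extends at most one superclass but can mix in any number of traits, so traits are the unit of composition while classes carry the primary state and behavior. Traits can be partially implemented, leaving abstract members for the mixing class to fill in....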

Question 4

What are Airflow Operators? Give examples.

Accepted Answer

Airflow Operators define a single unit of work in a DAG: each operator instance becomes one task, which should be atomic and idempotent so that retries and backfills are safe. **Why they matter**: They encapsulate work so DAGs remain declarative and schedulable; the scheduler doesn't need to understand task logic. **Examples**: BashOperator, PythonOperator, SQLExecuteQueryOperator, HttpOperator (formerly SimpleHttpOperator), DockerOperator, KubernetesPodOperator, and sensors (operators that wait for a condition, e.g. FileSensor). **Scalability**: Heavy logic should live in external scripts or services; operators should only orchestrate....

Question 5

How would you read data from a web API? What steps would you follow after reading the data?

Accepted Answer

**Why this matters**: APIs are external dependencies: unreliable, rate-limited, and schema-evolving. Production ingestion must be resilient, auditable, and idempotent. **Steps**: (1) **Contract & auth**: Define the schema (Pydantic, JSON Schema); store credentials in a secrets manager (Vault, AWS Secrets Manager). OAuth tokens need refresh logic. (2) **Pagination & rate limiting**: Offset- or cursor-based; loop until a page comes back empty or no next cursor is returned. Respect the `Retry-After` header on HTTP 429 responses; otherwise retry with exponential backoff....
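Step (2) can be sketched as follows. This is a minimal illustration, not a production client: the `fetch_page` callable, the `records` and `next_cursor` field names, and the `RateLimited` exception are all hypothetical stand-ins for whatever the real API and HTTP library provide. The fake API below exists only so the loop can be run without a network.

```python
import random
import time
from typing import Callable, Iterator, Optional


class RateLimited(Exception):
    """Hypothetical error a fetch function raises on an HTTP 429-style response."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds from a Retry-After header, if any


def fetch_all(
    fetch_page: Callable[[Optional[str]], dict],
    max_retries: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield every record from a cursor-paginated API, retrying rate-limited calls."""
    cursor = None
    while True:
        for attempt in range(max_retries + 1):
            try:
                page = fetch_page(cursor)
                break
            except RateLimited as exc:
                if attempt == max_retries:
                    raise  # give up after the last allowed retry
                if exc.retry_after is not None:
                    delay = exc.retry_after  # the server told us how long to wait
                else:
                    # Exponential backoff with a little jitter.
                    delay = base_delay * 2 ** attempt + random.uniform(0, base_delay)
                sleep(delay)
        records = page.get("records", [])  # assumed field name
        if not records:
            return  # empty page: nothing left
        yield from records
        cursor = page.get("next_cursor")  # assumed field name
        if cursor is None:
            return  # no next cursor: last page


def make_fake_api() -> Callable[[Optional[str]], dict]:
    """Two-page fake API that rate-limits the second call once."""
    pages = {
        None: {"records": [{"id": 1}, {"id": 2}], "next_cursor": "p2"},
        "p2": {"records": [{"id": 3}], "next_cursor": None},
    }
    calls = {"n": 0}

    def fetch_page(cursor: Optional[str]) -> dict:
        calls["n"] += 1
        if calls["n"] == 2:
            raise RateLimited(retry_after=0.0)
        return pages[cursor]

    return fetch_page


rows = list(fetch_all(make_fake_api(), sleep=lambda seconds: None))
print(rows)
```

Injecting `fetch_page` and `sleep` keeps the retry and pagination logic testable without real HTTP calls or real waiting; in practice `fetch_page` would wrap the HTTP client and translate a 429 response into the retry signal.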

Question 6

Architecturally, how would you justify or challenge Hadoop vs. a cloud-native data lake (S3 + EMR/Databricks) for a greenfield enterprise data platform? Discuss scalability ceilings, cost model trade-offs, and operational complexity.

Accepted Answer

Hadoop's architecture centers on HDFS (storage) + YARN (resource scheduling) + MapReduce (compute model). Components: NameNode (single point for metadata, held in memory, which sets the scalability ceiling for the namespace); DataNodes (block storage, 128 MB default blocks, 3x replication); ResourceManager/NodeManagers for CPU/memory allocation. Why it matters architecturally: HDFS storage capacity scales roughly linearly with nodes, but metadata scale is bounded by NameNode heap; YARN allows multi-tenant workloads but adds operational overhead....
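A rough back-of-the-envelope calculation shows why the namespace ceiling depends on object count rather than data volume. The block size and replication factor come from the answer above; the 150 bytes of NameNode heap per file or block object is a commonly cited rule of thumb and is an assumption here, not a measured figure.

```python
import math

BLOCK_SIZE = 128 * 1024**2  # 128 MiB default block size
REPLICATION = 3             # 3x replication
BYTES_PER_OBJECT = 150      # ASSUMPTION: rule-of-thumb NameNode heap per file/block object


def hdfs_footprint(num_files: int, avg_file_bytes: int) -> dict:
    """Estimate block count, raw disk usage, and NameNode heap for a set of files."""
    blocks_per_file = max(1, math.ceil(avg_file_bytes / BLOCK_SIZE))
    total_blocks = num_files * blocks_per_file
    return {
        "blocks": total_blocks,
        "raw_bytes": num_files * avg_file_bytes * REPLICATION,
        "namenode_heap_bytes": (num_files + total_blocks) * BYTES_PER_OBJECT,
    }


# The same 8,000 GiB of data, stored two ways.
large_files = hdfs_footprint(num_files=8_000, avg_file_bytes=1024**3)      # 1 GiB files
small_files = hdfs_footprint(num_files=8_192_000, avg_file_bytes=1024**2)  # 1 MiB files

for name, stats in (("large files", large_files), ("small files", small_files)):
    print(name, stats)
```

Under these assumptions both layouts consume the same raw disk, but the small-file layout needs a few hundred times more NameNode heap. An object store such as S3 has no equivalent single-process metadata limit, which is one of the scalability arguments for the cloud-native option.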

Question 7

How would you read data from a web API using PySpark?

Accepted Answer

PySpark has no built-in REST/HTTP data source; the usual patterns are to fetch on the driver, fetch on the executors, or land the data first. **Approaches**: (1) **Driver + parallelize**: `data = requests.get(url).json(); df = spark.createDataFrame(data)`. Works only while the response fits in driver memory (typically MBs); the driver is the bottleneck. (2) **mapPartitions on executors**: Pass a partition of IDs to each task; each task calls the API. Scales to many IDs but risks tripping rate limits and overloading the API. (3) **Orchestrator + landing**: Airflow/Prefect fetches from the API → lands to S3/GCS → Spark reads....
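The per-partition function in approach (2) is plain Python, so it can be sketched and tested without a cluster. In the sketch below `fake_fetch_one` is a hypothetical stand-in for a real HTTP call, and the two lists simulate two Spark partitions.

```python
from typing import Callable, Iterable, Iterator


def make_partition_fetcher(
    fetch_one: Callable[[int], dict],
) -> Callable[[Iterable[int]], Iterator[dict]]:
    """Build a function with the shape mapPartitions expects: iterator in, iterator out."""

    def fetch_partition(ids: Iterable[int]) -> Iterator[dict]:
        # In a real job, open one HTTP session here so it is reused for the
        # whole partition instead of being recreated per record.
        for record_id in ids:
            yield fetch_one(record_id)

    return fetch_partition


def fake_fetch_one(record_id: int) -> dict:
    """Hypothetical stand-in for an API call such as GET /items/<id>."""
    return {"id": record_id, "name": f"item-{record_id}"}


fetch_partition = make_partition_fetcher(fake_fetch_one)

# Local simulation: each inner list plays the role of one partition.
partitions = [[1, 2], [3]]
rows = [row for part in partitions for row in fetch_partition(iter(part))]
print(rows)
```

On a cluster the same function would be passed to `spark.sparkContext.parallelize(ids, numSlices=n).mapPartitions(fetch_partition)`. The number of partitions then caps how many tasks call the API concurrently, which is the main lever for staying under a rate limit.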

Question 8

What is broadcasting in Spark, and why is it used? Can you give an example of its use?

Accepted Answer

Broadcasting sends a full copy of a small table to every executor so the join runs locally without shuffling the large side. **Why it matters**: A shuffle join moves both sides across the network; a broadcast join moves only the small side, which is collected on the driver and shipped once to each executor. **Scalability trade-off**: The small table must fit in driver and executor memory; exceeding this causes OOM. As the cluster grows, more copies exist (N executors × table size), so large broadcasts waste memory....
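The mechanism can be illustrated in plain Python. This is a toy model of a broadcast hash join, not Spark's implementation: the small table is a dictionary that every simulated partition reads locally, and the table contents and sizes are made up for the example.

```python
from typing import Dict, Iterable, Iterator, Tuple


def broadcast_hash_join(
    large_partition: Iterable[Tuple[str, int]],
    small_lookup: Dict[str, str],
) -> Iterator[Tuple[str, int, str]]:
    """Inner-join one partition of the large side against the local small-table copy."""
    for key, value in large_partition:
        if key in small_lookup:  # local hash lookup, no data movement
            yield (key, value, small_lookup[key])


def broadcast_memory_bytes(table_bytes: int, num_executors: int) -> int:
    """Total cluster memory held by broadcast copies: N executors x table size."""
    return table_bytes * num_executors


# Small dimension table: the side that would be broadcast.
countries = {"US": "United States", "IN": "India"}

# Large fact table, already split into partitions that never move.
order_partitions = [
    [("US", 10), ("IN", 20)],
    [("FR", 5), ("US", 7)],
]

joined = [
    row
    for part in order_partitions
    for row in broadcast_hash_join(part, countries)
]
print(joined)

# A 50 MiB table copied to 200 executors.
print(broadcast_memory_bytes(50 * 1024**2, 200))
```

In PySpark the same intent is expressed with the `broadcast` hint from `pyspark.sql.functions`, as in `orders.join(broadcast(countries), "country_code")`, where the DataFrame and column names are illustrative.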

Altimetrik Data Engineer Interview Questions
