Question 1

What is the difference between SparkSession and SparkContext in Spark?

Accepted Answer

**SparkContext** (the primary entry point in Spark 1.x, still present in later versions): Low-level entry point for RDD operations. Manages the cluster connection, configuration, and RDD creation. Only one SparkContext may be active per JVM. RDD-only.

**SparkSession** (Spark 2.0+): Unified entry point that wraps SparkContext (exposed as `spark.sparkContext`) and replaces SQLContext and HiveContext. StreamingContext (DStreams) remains a separate entry point. Provides DataFrame, Dataset, SQL, and Structured Streaming APIs....

Question 2

What are traits in Scala, and how are they different from classes?

Accepted Answer

**Traits**: Interface-like constructs that can define abstract and concrete methods/fields. A class can mix in multiple traits, which gives multiple inheritance of type. Mixed in via `extends` for the first trait and `with` for each additional one.

**Classes**: Define objects with state and behavior. Single inheritance; one superclass.

**Key Differences**: Traits enable composition; classes define core logic. Traits can be partially implemented; classes hold primary behavior....

Question 3

How do you handle data security and compliance in a cloud environment?

Accepted Answer

Security is layered: (1) Encryption: At rest (KMS-managed keys, SSE-S3, Azure Storage encryption) and in transit (TLS). Why: Compliance (GDPR, HIPAA) and breach mitigation. Trade-off: Key management adds latency and complexity; managed services reduce operational burden. (2) Access: Least-privilege IAM, role-based access, no long-lived keys in code. Use VPC/VNet for network isolation; private endpoints for data stores....

Question 4

How would you read data from a web API? What steps would you follow after reading the data?

Accepted Answer

**Why this matters**: APIs are external dependencies: unreliable, rate-limited, and schema-evolving. Production ingestion must be resilient, auditable, and idempotent. **Steps**: (1) **Contract & auth**: Define the schema (Pydantic, JSON Schema); store credentials in a secrets manager (Vault, AWS Secrets Manager). OAuth tokens need refresh logic. (2) **Pagination & rate limiting**: Use offset- or cursor-based pagination and loop until a page comes back empty. Respect the `Retry-After` header; implement exponential backoff....
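
Step (2) can be sketched as follows. This is a minimal illustration, not a production client: `fetch_page`, the `RateLimited` exception, and the `items` / `next_cursor` response fields are hypothetical names standing in for whatever the real API and HTTP client provide.

```python
import random
import time
from typing import Callable, Iterator, Optional


class RateLimited(Exception):
    """Hypothetical error a fetch function raises when the API answers HTTP 429."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds from the Retry-After header, if present


def fetch_all(
    fetch_page: Callable[[Optional[str]], dict],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield every record from a cursor-paginated API, retrying on rate limits."""
    cursor = None
    while True:
        for attempt in range(max_retries + 1):
            try:
                page = fetch_page(cursor)
                break
            except RateLimited as exc:
                if attempt == max_retries:
                    raise  # give up after the final retry
                if exc.retry_after is not None:
                    delay = exc.retry_after  # the server's hint wins
                else:
                    # Exponential backoff with a little jitter
                    delay = base_delay * 2 ** attempt + random.uniform(0, base_delay)
                sleep(delay)
        yield from page["items"]
        cursor = page.get("next_cursor")
        # Stop on an empty page or when the API reports no further cursor
        if not page["items"] or not cursor:
            break
```

Passing `sleep` and `fetch_page` in as parameters keeps the loop testable without network access; in real use `fetch_page` would wrap the HTTP call and translate a 429 response into `RateLimited`.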

Question 5

Architecturally, how would you justify or challenge Hadoop vs. a cloud-native data lake (S3 + EMR/Databricks) for a greenfield enterprise data platform? Discuss scalability ceilings, cost model trade-offs, and operational complexity.

Accepted Answer

Hadoop's architecture centers on HDFS (storage) + YARN (resource scheduling) + MapReduce (compute model). Components: NameNode (single point for metadata, which makes it the scalability ceiling for the namespace); DataNodes (block storage, 128 MB default blocks, 3x replication); ResourceManager/NodeManagers for CPU/memory allocation. Why it matters architecturally: HDFS storage capacity scales roughly linearly with the number of DataNodes, but metadata scale is bounded by NameNode heap; YARN allows multi-tenant workloads but adds operational overhead....
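
The NameNode ceiling is easier to see with rough arithmetic. The sketch below uses the block size and replication factor quoted above; the 150 bytes per namespace object is an assumption (a commonly cited rule of thumb, not a measured figure), and `hdfs_footprint` is a hypothetical helper name.

```python
import math

BLOCK_SIZE_MB = 128     # HDFS default block size, as stated above
REPLICATION = 3         # HDFS default replication factor, as stated above
BYTES_PER_OBJECT = 150  # ASSUMPTION: rough rule of thumb per file/block object


def hdfs_footprint(file_sizes_mb):
    """Estimate block count, raw storage, and NameNode heap for a list of file sizes in MB."""
    # Every file occupies at least one block, however small it is
    blocks = sum(max(1, math.ceil(size / BLOCK_SIZE_MB)) for size in file_sizes_mb)
    objects = len(file_sizes_mb) + blocks  # one object per file plus one per block
    return {
        "blocks": blocks,
        "raw_storage_mb": sum(file_sizes_mb) * REPLICATION,
        "namenode_heap_bytes": objects * BYTES_PER_OBJECT,
    }


one_big_file = hdfs_footprint([1024])          # 1 GB stored as a single file
many_small_files = hdfs_footprint([1] * 1024)  # the same 1 GB as 1,024 files of 1 MB
```

Both layouts consume the same raw disk, but under these assumptions the small-file layout needs over 200 times more NameNode heap. Object stores such as S3 have no equivalent single metadata process, which is one reason the namespace ceiling features in the Hadoop versus cloud comparison.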

Question 6

How would you read data from a web API using PySpark?

Accepted Answer

PySpark has no native API source; the pattern is driver fetch or executor fetch. **Approaches**: (1) **Driver fetch**: `data = requests.get(url).json(); df = spark.createDataFrame(data)`. Works only while the response fits in driver memory (typically MBs); the driver is the bottleneck. (2) **mapPartitions on executors**: Pass a partition of IDs to each task; each task calls the API. Scales to many IDs but risks rate limiting and API abuse. (3) **Orchestrator + landing**: Airflow/Prefect fetches API → lands to S3/GCS → Spark reads....
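
Approach (2) might look like the sketch below. It is written as plain Python so the per-partition logic can be tested without a cluster; `make_partition_fetcher` and `fetch_one` are hypothetical names, and `fetch_one` stands in for the real HTTP call.

```python
import time
from typing import Callable, Iterable, Iterator


def make_partition_fetcher(
    fetch_one: Callable[[int], dict],
    min_interval_s: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[Iterable[int]], Iterator[dict]]:
    """Build a function with the shape rdd.mapPartitions expects: iterator in, iterator out."""

    def fetch_partition(ids: Iterable[int]) -> Iterator[dict]:
        # Runs once per partition, so per-partition setup (e.g. an HTTP session) belongs here
        for i, record_id in enumerate(ids):
            if i > 0 and min_interval_s > 0:
                # Crude per-task throttle; it does not coordinate across tasks
                sleep(min_interval_s)
            yield fetch_one(record_id)

    return fetch_partition
```

On a cluster the returned function would be passed to `spark.sparkContext.parallelize(ids, numSlices=4).mapPartitions(fetch_partition)`. The number of partitions then caps how many tasks can hit the API at once, which is the main lever against the rate-limiting risk mentioned above.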

Question 7

What is broadcasting in Spark, and why is it used? Can you give an example of its use?

Accepted Answer

Broadcasting sends a replica of a small table to every executor so joins execute locally without a shuffle. **Why it matters**: A shuffle join moves both sides across the network; a broadcast join moves only the small side, once, from the driver to the executors. **Scalability trade-off**: The small table must fit in memory on the driver (which collects it) and on every executor; exceeding this causes OOM. As cluster size grows, more copies exist (N executors × table size), so large broadcasts waste memory....
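
The mechanism can be imitated in plain Python. This is a simulation of the idea, not Spark's implementation: `broadcast_hash_join` is a hypothetical helper, partitions are modeled as lists, and keys on the small side are assumed unique.

```python
def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join each partition of the large side against a replicated small side."""
    # "Broadcast": build one hash table from the small side; every partition reads a copy
    lookup = {row[key]: row for row in small_rows}
    for partition in large_partitions:
        # Each partition is joined in place; no rows of the large side move between partitions
        yield [{**row, **lookup[row[key]]} for row in partition if row[key] in lookup]


orders = [  # large side, already split into two partitions
    [{"country": "IN", "amount": 10}, {"country": "US", "amount": 5}],
    [{"country": "IN", "amount": 7}, {"country": "XX", "amount": 1}],
]
countries = [  # small side
    {"country": "IN", "name": "India"},
    {"country": "US", "name": "United States"},
]
joined = list(broadcast_hash_join(orders, countries, "country"))
```

In PySpark the same intent is expressed with a hint, `orders_df.join(broadcast(countries_df), "country")`, where `broadcast` comes from `pyspark.sql.functions`. Spark also broadcasts automatically when a table's estimated size is below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default).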

Question 8

What is the difference between map and flatMap in Spark, and when would you use each?

Accepted Answer

**map**: 1 input element → 1 output element; the output collection has the same cardinality as the input. **flatMap**: 1 input element → 0 or more output elements; results are flattened into a single collection. **Why it matters**: flatMap handles one-to-many transformations (e.g., tokenizing a line into N words) and avoids nested structures that would require a separate explode step....
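
The cardinality difference shows up even without Spark. The helpers below, `map_` and `flat_map`, are hypothetical plain-Python stand-ins for `rdd.map` and `rdd.flatMap`, used only to make the shapes visible.

```python
from itertools import chain


def map_(func, items):
    """One output per input: the result has the same length as the input."""
    return [func(x) for x in items]


def flat_map(func, items):
    """Zero or more outputs per input, flattened into a single list."""
    return list(chain.from_iterable(func(x) for x in items))


lines = ["to be or", "", "not to be"]

mapped = map_(str.split, lines)         # 3 elements, each a list of words (one is empty)
flattened = flat_map(str.split, lines)  # 6 words; the empty line contributes nothing
```

With `map`, the empty line survives as an empty list and the words stay nested; with `flatMap`, it disappears and the words form one flat collection ready for counting.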

Question 9

What is the purpose of the Bronze, Silver, and Gold layers in a data pipeline?

Accepted Answer

**Bronze**: Raw ingestion; immutable; schema-on-read; data as-is from source. Single source of truth for replay and lineage. **Silver**: Cleansed, deduplicated, conformed; schema enforced; business-level quality. Trusted layer for analytics. **Gold**: Aggregated, modeled for consumption; star schema, metrics, reporting-ready. **Why it matters**: Clear ownership, incremental processing, and deterministic replay....
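
A toy version of the three layers, using lists of dicts in place of tables. The `order_id`, `country`, and `amount` columns and the "last record wins" deduplication rule are assumptions chosen for illustration; real pipelines define these per source.

```python
from collections import defaultdict


def to_silver(bronze_rows):
    """Enforce types, normalize values, and keep the last record seen per order_id."""
    latest = {}
    for raw in bronze_rows:
        try:
            row = {
                "order_id": int(raw["order_id"]),
                "country": raw["country"].strip().upper(),
                "amount": float(raw["amount"]),
            }
        except (KeyError, ValueError, TypeError, AttributeError):
            continue  # a real pipeline would quarantine rejects rather than drop them
        latest[row["order_id"]] = row
    return list(latest.values())


def to_gold(silver_rows):
    """Aggregate to a reporting-ready metric: total amount per country."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


bronze = [  # raw, as received; never modified in place
    {"order_id": "1", "country": " in ", "amount": "10.5"},
    {"order_id": "1", "country": "IN", "amount": "12.0"},  # duplicate; later record wins
    {"order_id": "2", "country": "us", "amount": "n/a"},   # fails schema enforcement
    {"order_id": "3", "country": "US", "amount": "4"},
]
silver = to_silver(bronze)
gold = to_gold(silver)
```

Because Bronze is left untouched, Silver and Gold can be rebuilt at any time by rerunning the two functions, which is the deterministic replay property described above.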

Question 10

What work is done by the executor memory in Spark?

Accepted Answer

Executor memory holds: **(1)** Cached RDD/DataFrame partitions (storage fraction), **(2)** Shuffle buffers used while map tasks write shuffle output and reduce tasks fetch it (the shuffle files themselves are written to local disk), **(3)** Working memory for task execution (joins, aggregations, sorting), **(4)** Off-heap memory (e.g., for native operations, Tungsten), which is sized separately from the executor heap. **Why it matters**: Execution memory competes with storage; undersized executors cause spills and OOM....
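
How the heap is divided can be worked through numerically. The sketch assumes Spark's unified memory manager with its documented defaults as I understand them (300 MB reserved, `spark.memory.fraction` = 0.6, `spark.memory.storageFraction` = 0.5); `unified_memory_mb` is a hypothetical helper, and off-heap memory and container overhead are not modeled.

```python
RESERVED_MB = 300  # fixed reservation inside the executor heap (assumed default)


def unified_memory_mb(executor_heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Split an executor heap (spark.executor.memory, in MB) into its main regions."""
    if executor_heap_mb <= RESERVED_MB:
        raise ValueError("executor heap must exceed the reserved memory")
    usable = executor_heap_mb - RESERVED_MB
    unified = usable * memory_fraction    # shared by execution and storage
    storage = unified * storage_fraction  # portion of storage protected from eviction
    return {
        "unified_mb": unified,
        "storage_protected_mb": storage,
        "execution_initial_mb": unified - storage,
        "user_mb": usable - unified,      # user data structures and internal metadata
    }


regions = unified_memory_mb(4096)  # a 4 GB executor heap
```

For a 4 GB heap this leaves roughly 2.2 GB for execution and storage combined. The boundary between the two is soft: execution can evict cached blocks down to the protected storage amount, which is why heavy caching and large shuffles on the same executor lead to spills.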

Infosys Data Engineer Interview Questions
