Medium-level cloud & tools questions from real data engineering interviews.
These medium cloud & tools questions are drawn from real interviews at top companies, and each one includes a detailed expert answer and a pro tip to help you nail your interview. All 27 questions sit in the medium-difficulty band where most real interviews actually live. Recurring themes are partitioning, joins, and Spark: these patterns appear most often in real interviews and reward the deepest preparation. The questions have been reported across 18 companies, including Capco and Virtusa. The average answer takes about 1 minute to read, so plan roughly 1 hour to work through the full set thoughtfully.
This collection contains 27 curated questions, all rated medium difficulty. The uniform difficulty makes the set well suited to focused, interview-level practice rather than warm-up drills.
The most frequently tested areas in this set are partitioning (19), joins (5), Spark (4), window functions (4), ETL (3), and BigQuery (3). Focusing on these topics will give you the highest return on your preparation time.
Medium-difficulty questions form the bulk of real interviews, so spend the most time here and practice explaining your reasoning out loud. For each question, try answering before revealing the solution. Use our AI Mock Interview to simulate real interview conditions and get instant feedback on your responses.
What is the role of AWS Lambda in a data engineering pipeline?
Copy Large Files from On-Premises to Azure in ADF
How do you load data into a Synapse table?
Describe Amazon Athena and how it interacts with S3.
Describe the use of side inputs in Dataflow.
Describe your experience with cloud platforms like AWS, Azure, or GCP
Difference between pipelines and data flows in ADF
Discuss S3's advantages, including scalability and durability.
Explain how AWS Glue interacts with on-premises SQL databases to extract data efficiently.
Explain how using a staging area in S3 can help.
Explain how you debug failed pipelines in ADF.
Explain job bookmarking in AWS Glue. How does it help in incremental data processing?
Explain the key components of Apache Beam in the context of Google Dataflow.
Explain the role of Glue Catalog in Athena.
Explain using AWS Glue for ETL. What challenges might you face with large datasets?
How can you increase parallelism in ADF pipelines?
How do you ensure message ordering in Kinesis Streams?
How do you handle data cleanup and lifecycle management in S3?
How do you handle data using AWS S3?
How do you manage data storage in AWS?
How do you merge data from different sources in ADF while maintaining data quality?
How would you migrate 1TB of data using ADF?
How would you optimize an ADF pipeline for high performance?
How would you optimize cost when using AWS for large-scale data processing?
Lambda vs. Glue: Discuss use cases for both services.
What alternatives to Kinesis would you consider for real-time data ingestion?
What integration challenges might you face with Glue Catalog in non-AWS environments?
Get full access to 1,800+ expert answers, AI mock interviews, and personalized progress tracking.