Question 1

What is the difference between cache() and persist() in Spark? When would you use each?

Accepted Answer

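**cache()**: Shorthand for `persist()` with the default storage level. For DataFrames and Datasets the default is `MEMORY_AND_DISK`: partitions are kept in memory and spill to disk if memory is insufficient. For RDDs the default is `MEMORY_ONLY`: partitions that do not fit are recomputed when needed.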

**persist(storage_level)**: Explicit control over storage: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY.

**Architectural Logic (Why It Matters)**: Caching trades memory/disk for recomputation cost....

Question 2

Demonstrate the difference between DENSE_RANK() and RANK()

Accepted Answer

**RANK()**: Tied rows share a rank, and the ranks that follow are skipped (e.g., 1, 2, 2, 4, 5).

**DENSE_RANK()**: Tied rows share a rank, with no gaps afterwards (e.g., 1, 2, 2, 3, 4).

**Why it matters**: RANK preserves "position" semantics (e.g., 4th place); DENSE_RANK yields consecutive integers, which is useful when filtering on the top N distinct values (e.g., the 10 highest salaries).

**Example**: `SELECT name, salary, RANK() OVER (ORDER BY salary DESC) AS rk, DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rk FROM employee`....
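To make the tie behaviour concrete, here is a minimal sketch that runs the same query against an in-memory SQLite database from Python. The table contents are made up for illustration, and the script assumes the bundled SQLite is version 3.25 or newer (the first release with window functions).

```python
import sqlite3

# Hypothetical sample data: two employees tie on salary 800.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Ana", 900), ("Bo", 800), ("Cy", 800), ("Di", 700), ("Ed", 600)],
)

rows = conn.execute(
    """
    SELECT name, salary,
           RANK()       OVER (ORDER BY salary DESC) AS rk,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rk
    FROM employee
    ORDER BY salary DESC, name
    """
).fetchall()

# Each row is (name, salary, rk, dense_rk).
for row in rows:
    print(row)
```

The `rk` column comes out as 1, 2, 2, 4, 5 and `dense_rk` as 1, 2, 2, 3, 4: after the tie at 800, RANK jumps to 4 while DENSE_RANK continues with 3.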

Question 3

Detail examples of inner, outer, left, and right joins.

Accepted Answer

**Architectural Logic**: Join choice directly affects query cost, data correctness, and downstream semantics. INNER keeps only rows that match on both sides; it is ideal when referential integrity is enforced or when orphans should be excluded, and it never returns more rows than the corresponding outer joins. LEFT preserves every row of A and fills B's columns with NULL where there is no match; use it when B is a "dimension" that may not exist (e.g., optional user attributes). RIGHT is the mirror of LEFT. FULL OUTER preserves unmatched rows from both sides, NULL-padding whichever side is missing, which usually makes it the largest and most expensive result....
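The four behaviours can be sketched in a few lines of plain Python. This is an illustration of the semantics only, not how a database executes joins: the `join` helper and the sample data are hypothetical, each side is assumed to have unique keys, and `None` stands in for SQL NULL.

```python
def join(left, right, how="inner"):
    """Join two {key: value} dicts; returns sorted (key, left_val, right_val) rows.

    Assumes keys are unique on each side (a one-to-one join).
    """
    if how == "inner":
        keys = left.keys() & right.keys()   # keys present on both sides
    elif how == "left":
        keys = left.keys()                  # every left key, matched or not
    elif how == "right":
        keys = right.keys()                 # every right key, matched or not
    elif how == "outer":
        keys = left.keys() | right.keys()   # union of both sides
    else:
        raise ValueError(f"unknown join type: {how}")
    # dict.get returns None for a missing key, mimicking NULL padding.
    return [(k, left.get(k), right.get(k)) for k in sorted(keys)]


users = {1: "Ana", 2: "Bo", 3: "Cy"}           # user_id -> name
orders = {2: "book", 3: "pen", 4: "lamp"}      # user_id -> item

for how in ("inner", "left", "right", "outer"):
    print(how, join(users, orders, how))
```

User 1 has no order and order 4 has no user, so INNER returns two rows, LEFT and RIGHT return three each (with one `None`-padded row), and FULL OUTER returns four.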

Question 4

How would you handle a deadline conflict between two high-priority projects?

Accepted Answer

Situation: Two projects, same deadline. Task: Resolve with shared ownership. Action: Gathered facts; convened stakeholders and manager. Presented options (extend, split team, reduce scope). Chose scope reduction with sign-off....

Question 5

Discuss versioning in S3.

Accepted Answer

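Architectural logic: With versioning enabled, a bucket keeps multiple versions of an object under the same key. Each PUT creates a new version; a DELETE without a version ID adds a delete marker instead of removing data. Use cases: recovery from accidental overwrites or deletes, retention, audit. Trade-off: every retained version adds storage cost, so use lifecycle rules to expire or transition old versions....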
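The PUT and DELETE semantics above can be modelled with a small in-memory class. This is a toy sketch, not the AWS API: `ToyVersionedBucket` and its methods are hypothetical, and the sequential version IDs are an assumption (real S3 version IDs are opaque strings).

```python
import itertools

_DELETE_MARKER = object()  # sentinel standing in for an S3 delete marker


class ToyVersionedBucket:
    """In-memory model of versioning semantics; hypothetical, not an AWS API."""

    def __init__(self):
        self._versions = {}                  # key -> [(version_id, body), ...], oldest first
        self._counter = itertools.count(1)

    def put(self, key, body):
        """Every PUT appends a new version instead of overwriting."""
        version_id = f"v{next(self._counter)}"   # assumed ID scheme
        self._versions.setdefault(key, []).append((version_id, body))
        return version_id

    def delete(self, key):
        """A DELETE without a version ID only adds a delete marker."""
        return self.put(key, _DELETE_MARKER)

    def get(self, key, version_id=None):
        """Return the current version, or a specific one if version_id is given."""
        versions = self._versions.get(key, [])
        if version_id is None:
            candidates = versions[-1:]       # newest version, if any
        else:
            candidates = [v for v in versions if v[0] == version_id]
        if not candidates or candidates[0][1] is _DELETE_MARKER:
            raise KeyError(key)              # behaves like "not found"
        return candidates[0][1]


bucket = ToyVersionedBucket()
first = bucket.put("report.csv", "draft")
second = bucket.put("report.csv", "final")
bucket.delete("report.csv")
```

After the delete, a plain `bucket.get("report.csv")` raises `KeyError`, yet `bucket.get("report.csv", version_id=first)` still returns the draft. That is the recovery property versioning buys, and also why old versions keep costing storage until a lifecycle rule removes them.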

Question 6

Modify a word count script to output results in descending frequency order.

Accepted Answer

**Why Descending Order Matters:** Top-N by frequency is the core of term frequency analysis, log analysis, and recommendation features (most-viewed items).

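**Scalability Tiers:** (1) Single file < 1GB: Counter + most_common(), in memory; most_common(k) is O(n log k) for the top k of n distinct words, while a full descending sort is O(n log n). (2) Multi-file/large: MapReduce pattern, where map emits (word, 1), reduce sums, and a final step sorts by count. (3) Spark: with RDDs, reduceByKey then sortBy(lambda kv: kv[1], ascending=False); with DataFrames, groupBy('word').count().orderBy(col('count').desc())....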
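For the single-file tier, a minimal sketch might look like the following. The function name `word_counts_desc`, the tokenising regex, and the alphabetical tie-break are illustrative choices, not part of the original script.

```python
import re
from collections import Counter


def word_counts_desc(text):
    """Return (word, count) pairs, highest count first; ties sorted alphabetically."""
    # Lowercase and split on anything that is not a letter, digit, or apostrophe.
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    # Negating the count sorts descending while keeping words ascending on ties.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


for word, count in word_counts_desc("the cat and the hat and the bat"):
    print(f"{count:>3}  {word}")
```

`Counter.most_common()` also returns descending order, but it leaves tied words in first-seen order; the explicit sort key makes the output deterministic, which helps when diffing results between runs.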

Question 7

What are docstrings? Use examples.

Accepted Answer

**Why Docstrings:** Documentation as code. A docstring is the string literal that opens a module, class, or function body; it is accessible via help(), __doc__, and auto-generated API docs (Sphinx). Critical for team scale and onboarding.

**Formats:** Google, NumPy, reST. Include: purpose, Args, Returns, Raises, Examples. Type hints complement docstrings.

**Production:** We enforce docstrings on public functions. CI runs pydocstyle. Sphinx generates docs for our internal ETL lib....
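As an example, here is a small function documented in Google style with the sections listed above. The function `divide` is hypothetical and exists only to carry the docstring.

```python
def divide(numerator: float, denominator: float) -> float:
    """Divide one number by another.

    Args:
        numerator: The value to be divided.
        denominator: The value to divide by.

    Returns:
        The quotient as a float.

    Raises:
        ZeroDivisionError: If denominator is zero.

    Examples:
        >>> divide(6, 3)
        2.0
    """
    return numerator / denominator


# The docstring is stored on the function object and is what help() displays.
print(divide.__doc__)
```

The `Examples` section uses doctest syntax, so the example can also be checked automatically with `python -m doctest` on the containing file.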

Question 8

Which data structure occupies more memory: list or tuple? Why?

Accepted Answer

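**Tuple < List:** A tuple is immutable, so it is allocated at exactly the size it needs, with no over-allocation for growth, and its item pointers are stored inline in the object. sys.getsizeof((1,2,3)) < sys.getsizeof([1,2,3]). On a typical 64-bit CPython build the gap is 16 bytes of extra list header plus any spare slots the list has accumulated, so the relative saving is largest for short sequences. getsizeof measures the container only, not the elements.

**Why:** A list keeps a separate, resizable array of item pointers and over-allocates it on append; a tuple's size is fixed at creation. CPython also reuses a single empty-tuple object and folds constant tuple literals at compile time (this is not interning, and it does not change per-object size).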

**Production:** Use tuple for fixed records (return values, dict keys), list for mutable collections. In pipelines: tuple for (id, value) pairs; list for accumulating results....
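The size difference is easy to check directly. The sketch below makes no assumption about exact byte counts, which vary by CPython version and platform; it only compares the containers.

```python
import sys

items = (1, 2, 3)

tuple_size = sys.getsizeof(items)        # fixed-size, items stored inline
list_size = sys.getsizeof(list(items))   # extra header fields for the resizable array

# A list built by repeated append may carry spare capacity on top of that.
grown = []
for x in items:
    grown.append(x)
grown_size = sys.getsizeof(grown)

print(f"tuple: {tuple_size} bytes")
print(f"list:  {list_size} bytes")
print(f"list built with append: {grown_size} bytes")
```

Both list figures come out larger than the tuple's. Note that `sys.getsizeof` reports the container only; the integers themselves are shared objects and are not included.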

Question 9

Write code for character frequency in a text file.

Accepted Answer

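**Counter:** Counter(f.read()) counts every character when the whole file fits in memory. For large files, read fixed-size chunks and call Counter.update(chunk) on each. To ignore case, lowercase the text first: f.read().lower(). To keep letters only: {c: n for c, n in freq.items() if c.isalpha()}.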

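from collections import Counter

# Count every character in the file, including spaces and newlines.
# The encoding is an assumption; adjust it to match the file.
with open('file.txt', encoding='utf-8') as f:
    freq = Counter(f.read())

# Print from most to least frequent; !r makes whitespace such as '\n' visible.
for c, n in freq.most_common():
    print(f'{c!r}: {n}')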
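The large-file, lowercase, and letters-only variants described above can be combined into one function. This is a sketch under stated assumptions: `char_frequency`, its parameter names, the 64 KiB default chunk size, and the UTF-8 encoding are all illustrative choices.

```python
from collections import Counter


def char_frequency(path, chunk_size=64 * 1024, ignore_case=False, letters_only=False):
    """Count characters in a text file without loading it all into memory."""
    freq = Counter()
    with open(path, encoding="utf-8") as f:   # encoding is an assumption
        # iter() with a sentinel keeps calling f.read until it returns "" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), ""):
            if ignore_case:
                chunk = chunk.lower()
            freq.update(chunk)
    if letters_only:
        # Drop digits, punctuation, and whitespace after counting.
        freq = Counter({c: n for c, n in freq.items() if c.isalpha()})
    return freq
```

Because the file is opened in text mode, `f.read(chunk_size)` returns whole characters, so a multi-byte character is never split across two chunks.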

Question 10

Write code for palindrome generation.

Accepted Answer

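**String:** half + half[::-1] (even length); half + char + half[::-1] (odd length). **Numbers:** treat the left half n as a digit string and append its reverse (even length), or append the reverse of n//10 so the last digit of n becomes the shared middle digit (odd length). Generate all k-digit palindromes: iterate over every left half and mirror it.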

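def gen_palindromes(digits):
    # Yield every palindromic integer with exactly `digits` digits (digits >= 1).
    if digits == 1:
        yield from range(10)
        return
    half = digits // 2
    # Start at 10**(half - 1) so the leading digit is never 0.
    for i in range(10 ** (half - 1), 10 ** half):
        s = str(i)
        # Odd length: try each middle digit; even length: mirror directly.
        for mid in ('0123456789' if digits % 2 else ('',)):
            yield int(s + mid + s[::-1])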
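The two building blocks named above, mirroring a string half and mirroring a numeric half, can be sketched separately. Both helper names are hypothetical; `mirror_number` uses integer arithmetic only, following the n + reverse(n) and n + reverse(n//10) idea.

```python
def make_palindrome(half, middle=""):
    """Mirror a string half; pass a one-character middle for odd length."""
    return half + middle + half[::-1]


def mirror_number(n, odd=False):
    """Build a palindromic integer from its left half n without string conversion.

    With odd=True the last digit of n becomes the shared middle digit.
    """
    result = n
    rest = n // 10 if odd else n   # digits still to be appended in reverse
    while rest:
        result = result * 10 + rest % 10
        rest //= 10
    return result


print(make_palindrome("ab"))          # even length
print(make_palindrome("ab", "c"))     # odd length
print(mirror_number(123))             # even number of digits
print(mirror_number(123, odd=True))   # odd number of digits
```

The arithmetic version avoids building intermediate strings, which can matter when generating very many palindromes; for readability the string version is usually enough.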

Impetus Data Engineer Interview Questions