Question 1

What are your strengths and weaknesses?

Accepted Answer

Strengths: (1) Technical depth in Spark/SQL/Cloud—I design for scale and cost. (2) Problem-solving and analytical thinking—I use data to drive decisions. (3) Collaboration and communication—I bridge technical and business stakeholders. Weaknesses: (1) I sometimes take on too much—I'm working on delegating and saying no. (2) I can be a perfectionist—I'm learning to ship MVP and iterate. (3) Public speaking—I'm attending workshops and doing more internal tech talks....

Question 2

What is your biggest failure, and what did you learn from it?

Accepted Answer

Situation: I led a project that failed to meet its goals due to poor requirement gathering and lack of stakeholder alignment. We delivered 'on spec' but it wasn't what the business needed. Task: Own the failure and fix the underlying process. Action: I took full responsibility and ran a post-mortem. I identified gaps: unclear success criteria, infrequent stakeholder check-ins, no early validation....

Question 3

Identify the Unix command that lists files with specific permissions

Accepted Answer

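**Command:** find . -type f -perm 644 lists regular files whose mode is exactly 644; find . -type f -perm 755 does the same for 755. Prefix the mode with a dash to match files that have at least those bits set, for example find . -perm -u=r for user-readable files. Use stat to inspect the permissions of a single file. **Format:** rwx rwx rwx (owner, group, other). **Why:** Permission audits. **Production:** Combine with -exec to act on each match.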
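The exact-match and at-least-these-bits behaviours of -perm can be mirrored in a short script, which may help when an audit has to run where find is unavailable. This is a minimal sketch using only the standard library; the function name find_by_mode and its exact flag are hypothetical names chosen for illustration.

```python
import os
import stat
import tempfile


def find_by_mode(root, mode, exact=True):
    """Yield regular files under root whose permission bits match mode.

    exact=True behaves like `find -type f -perm MODE` (bits equal MODE).
    exact=False behaves like `find -type f -perm -MODE` (all bits in MODE set).
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # do not follow symlinks
            if not stat.S_ISREG(st.st_mode):
                continue
            perms = stat.S_IMODE(st.st_mode)
            matched = perms == mode if exact else (perms & mode) == mode
            if matched:
                yield path


if __name__ == "__main__":
    # Small demo in a throwaway directory.
    with tempfile.TemporaryDirectory() as root:
        for name, mode in (("a.txt", 0o644), ("b.sh", 0o755)):
            path = os.path.join(root, name)
            open(path, "w").close()
            os.chmod(path, mode)
        print(list(find_by_mode(root, 0o644)))
        print(sorted(find_by_mode(root, 0o400, exact=False)))
```

Octal literals such as 0o644 correspond directly to the numeric modes passed to find.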

Question 4

Write pseudo code for an ETL pipeline using Python and Pandas

Accepted Answer

**E:** Read source (CSV, API, DB). **T:** dropna, cast types, filter, derive columns, validate. **L:** to_sql, to_parquet. Add: logging, error handling, idempotency (overwrite by date), incremental (filter by watermark)....
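The E, T, and L steps above might be sketched as follows. This is an illustrative outline rather than a production pipeline: the column names (order_id, amount, order_date), the orders table, and the SQLite target are all assumptions made for the example. Idempotency is approximated by deleting the dates being written before appending, and incremental loading by filtering on a watermark date.

```python
import io
import logging
import sqlite3

import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")


def extract(source) -> pd.DataFrame:
    """E: read raw rows from a CSV path or file-like object."""
    return pd.read_csv(source)


def transform(raw: pd.DataFrame, watermark: str) -> pd.DataFrame:
    """T: drop incomplete rows, cast types, filter, derive, validate."""
    df = raw.dropna(subset=["order_id", "amount", "order_date"]).copy()
    df["order_id"] = df["order_id"].astype("int64")
    df["amount"] = df["amount"].astype("float64")
    df["order_date"] = pd.to_datetime(df["order_date"])
    # Incremental: keep only rows newer than the watermark.
    df = df[df["order_date"] > pd.Timestamp(watermark)].copy()
    # Derived column (hypothetical): amount in integer cents.
    df["amount_cents"] = (df["amount"] * 100).round().astype("int64")
    if (df["amount"] < 0).any():
        raise ValueError("negative amount found")
    return df


def load(df: pd.DataFrame, conn: sqlite3.Connection) -> int:
    """L: overwrite-by-date load so that reruns do not duplicate rows."""
    out = df.assign(order_date=df["order_date"].dt.strftime("%Y-%m-%d"))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER, amount REAL, order_date TEXT, amount_cents INTEGER)"
    )
    dates = out["order_date"].unique().tolist()
    conn.executemany(
        "DELETE FROM orders WHERE order_date = ?", [(d,) for d in dates]
    )
    out.to_sql("orders", conn, if_exists="append", index=False)
    conn.commit()
    return len(out)


def run(source, conn, watermark="1970-01-01") -> int:
    """Run one ETL pass with logging and error handling."""
    try:
        loaded = load(transform(extract(source), watermark), conn)
        log.info("loaded %d rows", loaded)
        return loaded
    except Exception:
        log.exception("ETL run failed")
        raise


if __name__ == "__main__":
    csv_text = "order_id,amount,order_date\n1,10.5,2024-01-01\n2,,2024-01-02\n"
    run(io.StringIO(csv_text), sqlite3.connect(":memory:"))
```

Swapping the load step for to_parquet would follow the same shape, with one output file or directory per date taking the place of the delete-then-append step.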

Question 5

Design a relational data model for a sales database, incorporating normalization techniques

Accepted Answer

**Section 1 — The Context (The 'Why')**
The primary challenge in this design is balancing scale, cost, and reliability. At scale, naive approaches fail: single points of failure cause cascades, schema evolution breaks consumers, and over-provisioning explodes cost. Failure modes include silent data loss from non-idempotent writes, cascading job failures from tight coupling, and operational burden from manual intervention....

Question 6

Given two tables, calculate the row count for different types of joins (inner, left, right, and full outer)

Accepted Answer

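Given |A|=a, |B|=b, and m matching pairs, and assuming the join key is unique in both tables (so each row matches at most one row on the other side): **Inner** = m. **Left** = a. **Right** = b. **Full outer** = a + b - m. If keys repeat, one row can match several rows, so inner = m (the number of matching pairs), left = m + unmatched rows of A, right = m + unmatched rows of B, and full outer = m + unmatched rows of A + unmatched rows of B. **SQL**: SELECT 'inner', COUNT(*) FROM a INNER JOIN b ON ... UNION ALL SELECT 'left', COUNT(*) FROM a LEFT JOIN b ON ......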
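The counts can be checked without a database by working directly from the join-key columns. The sketch below assumes an equality join on a single key and uses None to stand in for SQL NULL, which never matches; the function name join_row_counts is hypothetical.

```python
from collections import Counter


def join_row_counts(a_keys, b_keys):
    """Return the row count of each join type for an equi-join on one key."""
    ca = Counter(k for k in a_keys if k is not None)
    cb = Counter(k for k in b_keys if k is not None)
    # Each key contributes (occurrences in A) * (occurrences in B) pairs.
    inner = sum(n * cb[k] for k, n in ca.items())
    # Rows with no partner on the other side, including NULL keys.
    a_unmatched = sum(1 for k in a_keys if k is None or k not in cb)
    b_unmatched = sum(1 for k in b_keys if k is None or k not in ca)
    return {
        "inner": inner,
        "left": inner + a_unmatched,
        "right": inner + b_unmatched,
        "full": inner + a_unmatched + b_unmatched,
    }


if __name__ == "__main__":
    # Unique keys: matches the a, b, a + b - m shortcut.
    print(join_row_counts([1, 2, 3], [2, 3, 4]))
    # Duplicate keys: the left join returns more rows than table A holds.
    print(join_row_counts([1, 1, 2], [1, 1, 3]))
```

With unique keys the result agrees with the shortcut formulas; with duplicates the left and right counts exceed a and b, which is a common source of unexpected row inflation.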

Question 7

What motivates you to join Morgan Stanley?

Accepted Answer

Structure the answer with STAR. **Situation**: Context of the challenge. **Task**: Your responsibility. **Action**: Specific steps, tools, collaboration. **Result**: Quantified outcome. Then tie it to the firm: Morgan Stanley's leadership in financial services and technology is compelling. The firm's scale (data across markets, risk, and clients) presents complex engineering challenges....

Question 8

Write a SQL query to calculate the highest salary in each department using a window function

Accepted Answer

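SELECT department_id, employee_id, salary, MAX(salary) OVER (PARTITION BY department_id) AS max_salary FROM employees. This keeps every employee row and attaches the department maximum to each one. To return only the top earners, use RANK: SELECT * FROM (SELECT *, RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS r FROM employees) AS ranked WHERE r = 1. RANK returns every employee tied for the top salary, and some databases (for example MySQL) require the alias on the derived table. **Why window**: Avoids a correlated subquery and keeps row-level detail next to the aggregate; the engine can usually compute it in a single pass....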
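The RANK variant can be tried end to end against an in-memory SQLite database. This is a sketch under stated assumptions: the sample rows are invented for the demo, the helper names make_demo_db and top_earners are hypothetical, and window functions require SQLite 3.25 or newer.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory employees table with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employees ("
        "employee_id INTEGER, department_id INTEGER, salary INTEGER)"
    )
    conn.executemany(
        "INSERT INTO employees VALUES (?, ?, ?)",
        [(1, 10, 100), (2, 10, 120), (3, 10, 120), (4, 20, 90)],
    )
    return conn


def top_earners(conn: sqlite3.Connection):
    """Return (department_id, employee_id, salary) for each top earner."""
    query = """
        SELECT department_id, employee_id, salary
        FROM (
            SELECT department_id, employee_id, salary,
                   RANK() OVER (
                       PARTITION BY department_id ORDER BY salary DESC
                   ) AS r
            FROM employees
        ) AS ranked
        WHERE r = 1
        ORDER BY department_id, employee_id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in top_earners(make_demo_db()):
        print(row)
```

Employees 2 and 3 tie in department 10, so both are returned; replacing RANK with ROW_NUMBER would pick one of them arbitrarily unless a tie-breaker is added to the ORDER BY.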

Question 9

Explain Spark's narrow vs. wide transformations and when to use each

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Narrow transformations process each partition independently—no shuffle (map, filter, mapPartitions). Wide transformations require shuffling data across partitions (groupBy, join, distinct, repartition). Use narrow when possible for speed. Use wide when aggregation or joins are required....
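The difference can be illustrated without a cluster. The sketch below is plain Python, not the Spark API: partitions are modelled as lists, and the function names are hypothetical. The narrow step touches each partition on its own, while the wide step first routes every record to an output partition chosen by its key, which is the shuffle.

```python
import operator
from collections import defaultdict
from functools import reduce


def narrow_map(partitions, fn):
    """Narrow: each output partition depends on exactly one input partition."""
    return [[fn(record) for record in part] for part in partitions]


def wide_reduce_by_key(partitions, num_out, combine):
    """Wide: records with the same key must meet, so data moves (shuffle)."""
    shuffled = [defaultdict(list) for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            # Every input partition may send data to every output partition.
            shuffled[hash(key) % num_out][key].append(value)
    return [
        {key: reduce(combine, values) for key, values in bucket.items()}
        for bucket in shuffled
    ]


if __name__ == "__main__":
    parts = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
    scaled = narrow_map(parts, lambda kv: (kv[0], kv[1] * 10))
    print(scaled)  # same partitioning as the input
    print(wide_reduce_by_key(scaled, 3, operator.add))
```

In the narrow step the partition boundaries never change, which is why Spark can pipeline such steps inside one stage; the wide step forces a stage boundary because no output partition is complete until every input partition has been read.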

Question 10

Explain the configuration of a Spark cluster for optimal performance

Accepted Answer

Spark cluster tuning: (1) Executor size: 4–5 cores and 10–20GB RAM each, set via spark.executor.cores and spark.executor.memory; (2) Partitions: 2–4 times total cores, set via spark.default.parallelism and spark.sql.shuffle.partitions; (3) Memory: spark.memory.fraction 0.6 (the default); (4) Dynamic allocation for variable load. Example: 10 nodes times 4 cores gives 40 cores, so target roughly 80–160 shuffle partitions rather than the default of 200....
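The partition rule of thumb (2–4 times total cores) is simple arithmetic and can be computed before setting spark.sql.shuffle.partitions. The helper below is a hypothetical sketch; the multipliers are the heuristic quoted above, not values taken from Spark itself, and the right number still depends on data volume per partition.

```python
def suggest_shuffle_partitions(nodes, cores_per_node, low=2, high=4):
    """Return total cores and the 2-4x partition range for a cluster."""
    if nodes <= 0 or cores_per_node <= 0:
        raise ValueError("nodes and cores_per_node must be positive")
    total_cores = nodes * cores_per_node
    return {
        "total_cores": total_cores,
        "min_partitions": low * total_cores,
        "max_partitions": high * total_cores,
    }


if __name__ == "__main__":
    print(suggest_shuffle_partitions(10, 4))
```

For 10 nodes with 4 cores each this gives 40 cores and a range of 80 to 160 partitions.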

Morgan Stanley Data Engineer Interview Questions
