What is the difference between Managed and External tables in…

Frequency

Low

Asked at 3 companies

Why This Question Matters

This easy-level Spark/Big Data question has been reported in data engineering interviews at Citi, Dunnhumby, and Fragma Data Systems. Although it comes up less often than other Spark questions, it tests whether you understand data ownership and lifecycle, which distinguishes strong candidates. Mastering the underlying Spark concepts will help you answer variations of this question confidently.

How to Approach This

Start by clearly defining the core concept being asked about. Interviewers want to see that you understand the fundamentals before diving into implementation details. Structure your answer with a definition, then explain the practical application with a concise example. The expert answer includes a code example that demonstrates the implementation pattern.

Expert Answer


Managed and External tables in Hive/Spark differ fundamentally in their ownership and lifecycle management of the underlying data.

Core Difference

With Managed tables (also known as Internal tables), Spark/Hive owns both the table's metadata (schema, partitions) and its data files. When you run DROP TABLE, both the table definition in the Metastore and the data files on storage (e.g., HDFS, S3) are deleted. The data typically resides in the warehouse directory configured for Spark/Hive.

With External tables, by contrast, Spark/Hive manages only the table's metadata. The data files reside in a specified LOCATION that Spark/Hive does not own. When you run DROP TABLE, only the metadata is removed from the Metastore; the underlying data files remain untouched.
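The drop behavior described above can be sketched in Hive-style SQL. This is a minimal illustration rather than a definitive recipe: the table names `staging_orders` and `raw_orders` and the S3 path are hypothetical, and the exact layout of the `DESCRIBE FORMATTED` output varies by engine and version.

```sql
-- Managed table: no LOCATION clause, so the data lands under the
-- configured warehouse directory. (Table name is hypothetical.)
CREATE TABLE staging_orders (
    id INT,
    name STRING
)
STORED AS PARQUET;

-- External table: the engine only registers metadata for the files
-- at LOCATION. (Table name and path are hypothetical.)
CREATE EXTERNAL TABLE raw_orders (
    id INT,
    name STRING
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/raw/orders/';

-- Inspect the table type; the output reports whether each table
-- is managed or external.
DESCRIBE FORMATTED staging_orders;
DESCRIBE FORMATTED raw_orders;

-- Managed: removes the Metastore entry and the data files.
DROP TABLE staging_orders;

-- External: removes the Metastore entry only; the files under
-- LOCATION are left in place.
DROP TABLE raw_orders;
```

Checking the table type with `DESCRIBE FORMATTED` before dropping is a cheap safeguard when you are unsure how a table was created.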

Mechanics and Why

Why use External tables?
* Data Sharing: Essential for data lake architectures where multiple compute engines (Spark, Presto/Trino, AWS Athena, Snowflake external tables) need to access the same physical data files without tight coupling.
* Data Governance & Resilience: Protects critical production datasets from accidental deletion. The data's lifecycle is independent of the table definition, allowing for separate data retention policies and preventing catastrophic data loss if a table is mistakenly dropped.
* Schema-on-Read: Allows defining multiple table schemas over the same raw data, useful for different analytical views or evolving data structures.
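The schema-on-read point can be illustrated with two external tables defined over the same files. This is a sketch under stated assumptions: the table names `my_dataset_full` and `my_dataset_ids` are hypothetical, the path is illustrative, and it assumes the engine resolves Parquet columns by name so that a narrower schema can be projected over the same files.

```sql
-- Full view over the raw Parquet files. (Name and path are hypothetical.)
CREATE EXTERNAL TABLE my_dataset_full (
    id INT,
    name STRING
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/raw/my_dataset/';

-- Narrower view over the SAME files, exposing only the id column.
CREATE EXTERNAL TABLE my_dataset_ids (
    id INT
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/raw/my_dataset/';

-- Dropping one definition removes only its metadata; the files stay,
-- so my_dataset_full keeps working.
DROP TABLE my_dataset_ids;
```

Because neither table owns the files, each definition can be created, changed, or dropped without affecting the other or the data itself.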

Why use Managed tables?
* Simplicity: Ideal for ephemeral data, temporary outputs, or staging tables where the data's lifecycle is directly tied to the table's existence. Spark/Hive handles all data path management, reducing the burden on the user and preventing "orphan data" (data without a table definition).
* Full Lifecycle Management: Guarantees that when a table is dropped, all associated data is also cleaned up, which can be beneficial for cost management and compliance in non-production environments.

Trade-offs and Best Practices

While DROP TABLE on an external table is a cheap metadata operation, DROP TABLE on a managed table can trigger a recursive delete on object storage (like S3 or ADLS). For large datasets, every object has to be listed and deleted, which adds latency and API request overhead.

Best practice: Use External tables for production datasets and any data shared across tools. Use Managed tables for temporary, intermediate, or staging data where the data's lifespan is explicitly tied to the table. Always document the ownership and lifecycle of the underlying data paths for external tables.

-- Example: Creating an External table
CREATE EXTERNAL TABLE my_external_table (
    id INT,
    name STRING
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/raw/my_dataset/';

In the interview, also mention that the Hive Metastore (or AWS Glue Data Catalog) serves as the central metadata repository for both types. Modern formats like Delta Lake can be registered as either managed or external tables in Spark. A Delta table created with an explicit LOCATION is external and points to the Delta transaction log and data files at that path. Delta Lake provides its ACID properties over the data in both cases.
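In Spark SQL, whether a Delta table is registered as managed or external depends on whether a LOCATION is given. The following is a minimal sketch, assuming a Spark session with Delta Lake configured; the table names `events_managed` and `events_external` and the path are hypothetical.

```sql
-- No LOCATION: Spark registers a managed Delta table under the
-- warehouse directory. (Table name is hypothetical.)
CREATE TABLE events_managed (
    id INT,
    name STRING
)
USING DELTA;

-- Explicit LOCATION: Spark registers an external (unmanaged) Delta
-- table over the transaction log and data files at that path.
-- (Table name and path are hypothetical.)
CREATE TABLE events_external (
    id INT,
    name STRING
)
USING DELTA
LOCATION 's3://my-data-lake/delta/events/';
```

The same drop semantics apply: dropping `events_external` leaves the files at its path in place, while dropping `events_managed` removes its data as well.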


Pro Tip

Red Flag: Only defining the difference without discussing data ownership or drop behavior. Pro-Move: Share an operational lesson, for example: 'All prod tables are External; we had an incident where a DROP on a Managed table deleted the data in S3, so now we never use Managed for shared data.'


According to DataEngPrep.tech, this is one of the most frequently asked Spark/Big Data interview questions, reported at 3 companies. DataEngPrep.tech maintains an editor-reviewed database of 1,863 data engineering interview questions across 7 categories.

