Hudi CREATE TABLE LIKE Bug: COW/MOR Table Failures

by ADMIN

Hey folks! Have you ever hit a snag using Hudi with Spark SQL, especially with CREATE TABLE ... LIKE ...? It's a real head-scratcher, but don't worry: we're going to dig into what's happening and how it affects both COW (Copy-On-Write) and MOR (Merge-On-Read) tables. We'll cover the root cause, the steps to reproduce the issue, how the actual behavior differs from what you'd expect, and a few workarounds. This matters because CREATE TABLE LIKE is a handy tool, and when it fails, it can disrupt your workflows. So, let's get into it.

The Problem: CREATE TABLE LIKE Fails

The core issue is how Hudi handles CREATE TABLE ... LIKE ... when you run Spark SQL against the AWS Glue catalog. When you try to create a new Hudi table modeled on an existing one, the statement fails with the error message "Creating ro/rt table need the existence of the base table." This error pops up for both COW and MOR tables, which is not what we want. Hudi's CREATE TABLE LIKE implementation has a check related to the hoodie.query.as.ro.table configuration property. Let's break down how this property affects the command.

When hoodie.query.as.ro.table is not set at all, Hudi happily lets you create a new table using CREATE TABLE LIKE. But if the property is present with any value, Hudi assumes you're trying to create a read-optimized or real-time (RO/RT) table, and it throws an error because it expects the base table to already exist. The problem is that Glue populates this property for COW tables (set to false) as well as for MOR tables (set to true or false). The check only looks at whether the property is present; it ignores the value, the table type, and whether read-optimized behavior is even relevant. So a value of false on a COW table is enough to block a perfectly valid CREATE TABLE LIKE.

Let's get into the code snippet from the bug report:

val queryAsProp = hoodieCatalogTable.catalogProperties.get(ConfigUtils.IS_QUERY_AS_RO_TABLE)
if (queryAsProp.isEmpty) {
  // init hoodie table for a normal table (not a ro/rt table)
  hoodieCatalogTable.initHoodieTable()
} else {
  if (!hoodieCatalogTable.hoodieTableExists) {
    throw new HoodieAnalysisException("Creating ro/rt table need the existence of the base table.")
  }
  if (HoodieTableType.MERGE_ON_READ != hoodieCatalogTable.tableType) {
    throw new HoodieAnalysisException("Creating ro/rt table should only apply to a mor table.")
  }
}

As you can see, the code checks whether the hoodie.query.as.ro.table property is present. If it's absent, it initializes the Hudi table normally. If it's present, the code treats the request as an RO/RT table creation and runs two checks: the table must already exist, and it must be a MOR table. Each failed check throws its own error. This is where the problem lies: the property can be set for COW tables in Glue, and the table that CREATE TABLE LIKE is about to create doesn't exist yet, so the first check fails and a valid CREATE TABLE LIKE is rejected.

How Glue and Hudi Interact

When you create a new Hudi table and insert data into it, Glue automatically stores the hoodie.query.as.ro.table property in its metadata. This happens no matter whether your table is COW or MOR. Since this property gets set, any subsequent CREATE TABLE ... LIKE ... operation will likely fail, triggering that frustrating error message we talked about. This is especially problematic because CREATE TABLE LIKE is a pretty common operation. Imagine you want to create a copy of a table for testing or to experiment with changes. You're going to hit a wall because of this bug.
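If you want to confirm that your catalog entry carries the property, you can inspect it from Spark. This is a minimal sketch, assuming a session where Hudi and the Glue catalog are already configured (for example a Glue job or a preconfigured spark-shell) and a table named hudi_demo_table, the example name used in this article:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Hudi and the Glue catalog are already configured for this session,
// so getOrCreate() simply returns the existing session.
val spark = SparkSession.builder().getOrCreate()

// SHOW TBLPROPERTIES is standard Spark SQL; passing a key limits the output to that property.
spark.sql("SHOW TBLPROPERTIES hudi_demo_table ('hoodie.query.as.ro.table')")
  .show(truncate = false)
```

If the result shows a value such as false for a COW table, that table will take the RO/RT branch of the check described above.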

Steps to Reproduce the Bug

To make things super clear, here's how you can reproduce the bug step-by-step. This is based on the information provided in the bug report:

  1. Create a Hudi Table: First, you need to create a Hudi table. This example uses a COW table, but the issue also affects MOR tables. Use Spark SQL in the Glue environment to run the following:

CREATE TABLE hudi_demo_table (
  id STRING,
  name STRING,
  ts BIGINT,
  dt STRING
)
USING hudi
OPTIONS (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'ts'
)
PARTITIONED BY (dt)
LOCATION 's3://<PATH>/hudi_demo_table';

Make sure you replace <PATH> with the actual S3 path where you want to store your Hudi table.

  2. Insert Some Data: Next, you need to populate your newly created table with some data. Run an INSERT statement:

INSERT INTO hudi_demo_table VALUES
('1', 'Bob', 1739660000000, '2025-10-16'),
('2', 'Alice', 1739660100000, '2025-10-16');

  3. Attempt CREATE TABLE LIKE: Now, try to create a table like the one you just created using the CREATE TABLE ... LIKE ... command:

CREATE TABLE hudi_demo_table_like2 LIKE hudi_demo_table USING HUDI;

This is where you'll likely encounter the error: "Creating ro/rt table need the existence of the base table." The bug will prevent you from successfully creating the new table, even though you'd expect it to work seamlessly.

Expected Behavior vs. Actual Behavior

What we expect is for CREATE TABLE LIKE to work for both COW and MOR tables. We should be able to create a new, empty Hudi table with the same definition as an existing one: the same schema, table properties, and configurations. The reality, however, is that the bug in Hudi currently prevents this. Instead of a successful table creation, we get that confusing error message, which breaks our workflows and forces us to find workarounds.

Environment Details

The bug report specifies a few key environment details, which are helpful for understanding the context:

  • Hudi Version: The issue was reported on Hudi version 1.0.2.
  • Query Engine: The query engine used is Spark.

Impact and Importance

So, why should you care about this bug? The CREATE TABLE LIKE operation is super useful for various tasks, including:

  • Creating Backups: You can set up an empty table with the same definition as the original, ready to receive a backup copy of the data before any major changes or upgrades.
  • Testing and Experimentation: It's great for creating a test table with the same structure, where you can try out new features or modifications without affecting your production data.
  • Data Analysis: You might want a structurally identical table as a starting point for different analytical queries or experiments.
  • Schema Evolution: You can keep copies of a table's definition at different schema versions for data versioning and compatibility.

When CREATE TABLE LIKE doesn't work, it adds friction to all of these processes. It means you might have to resort to manual table creation, which is more time-consuming and error-prone. This bug makes it harder to manage and manipulate your Hudi tables, ultimately slowing down your data engineering and analysis tasks.

The Root Cause Revisited

At the core of the problem is a check within Hudi's code that looks at the hoodie.query.as.ro.table property. This property is intended to indicate read-optimized behavior for MOR tables. If the property is non-empty, the code assumes you're trying to create a read-optimized table and expects the base table to already exist. However, the property gets set in Glue for all Hudi table types (COW and MOR). Because the code doesn't differentiate between table types or check the context of the operation, CREATE TABLE LIKE fails for all table types whenever the property is set. This means a valid and perfectly reasonable operation will fail because of this simple check.
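To make the difference concrete, here is a simplified, self-contained model of the check next to one possible context-aware variant. It is only a sketch: CatalogEntry, currentCheck, narrowedCheck, and the fromCreateTableLike flag are hypothetical names invented for illustration and are not Hudi's classes or its actual fix. It runs as a Scala script or in the REPL.

```scala
// Simplified model of the check; NOT Hudi's real classes (all names are hypothetical).
sealed trait TableType
case object CopyOnWrite extends TableType
case object MergeOnRead extends TableType

final case class CatalogEntry(
  tableType: TableType,
  properties: Map[String, String],
  existsOnStorage: Boolean
)

val QueryAsRoKey = "hoodie.query.as.ro.table"

// Mirrors the logic quoted from the bug report: any value for the key,
// even "false", sends the request down the ro/rt branch.
def currentCheck(t: CatalogEntry): Either[String, String] =
  t.properties.get(QueryAsRoKey) match {
    case None => Right("init as normal table")
    case Some(_) if !t.existsOnStorage =>
      Left("Creating ro/rt table need the existence of the base table.")
    case Some(_) if t.tableType != MergeOnRead =>
      Left("Creating ro/rt table should only apply to a mor table.")
    case Some(_) => Right("register ro/rt table")
  }

// One possible narrowing (an assumption, not the upstream fix): when the request
// comes from CREATE TABLE LIKE, ignore the property inherited from the source table.
def narrowedCheck(t: CatalogEntry, fromCreateTableLike: Boolean): Either[String, String] = {
  val effective =
    if (fromCreateTableLike) t.copy(properties = t.properties - QueryAsRoKey) else t
  currentCheck(effective)
}

// A COW table cloned via LIKE: property inherited as "false", nothing at the new location yet.
val likeCow = CatalogEntry(CopyOnWrite, Map(QueryAsRoKey -> "false"), existsOnStorage = false)
println(currentCheck(likeCow))                               // Left(Creating ro/rt table need the existence of the base table.)
println(narrowedCheck(likeCow, fromCreateTableLike = true))  // Right(init as normal table)
```

The model keeps the original RO/RT path intact for explicit requests and only changes what happens when the property was merely copied along from the source table.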

Workarounds and Mitigations

While there isn't a perfect fix without patching Hudi, here are a few workarounds you can try:

  1. Manual Table Creation: The most straightforward workaround is to create the new table manually, defining the schema and properties yourself. This can be time-consuming, but it gets the job done.

  2. Using CREATE TABLE AS SELECT: Another option is to use CREATE TABLE AS SELECT. This command creates a new table from the results of a SELECT query, so you can select the columns from your original table and create a new Hudi table with the same data and schema. Keep in mind that, unlike CREATE TABLE LIKE, this copies the data as well, and you have to restate table options such as the type and keys yourself. It can be a bit more work, depending on your table's structure, but it can be a reliable alternative.

  3. Patching Hudi (If Possible): If you're able to modify your Hudi installation, you could potentially patch the code to remove or modify the problematic check. This would involve changing the logic that checks the hoodie.query.as.ro.table property to ensure it only applies to MOR tables or to account for the context of the CREATE TABLE LIKE operation.

  4. Contact the Hudi Community: You can raise the issue with the Hudi community, or follow the existing bug report, and wait for a fix.
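As an illustration of the CREATE TABLE AS SELECT workaround, here is a minimal sketch based on the example table from the reproduction steps. The target name hudi_demo_table_copy and its location are hypothetical, and the session is assumed to have Hudi and the Glue catalog already configured. Because the table options are written out explicitly, hoodie.query.as.ro.table is not part of the new table's definition, which should sidestep the check; treat that as an expectation rather than something verified on every Glue setup.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Hudi and the Glue catalog are already configured for this session.
val spark = SparkSession.builder().getOrCreate()

// Columns are listed explicitly so Hudi's _hoodie_* metadata columns
// from the source table are not carried into the new table's schema.
spark.sql(
  """
    |CREATE TABLE hudi_demo_table_copy
    |USING hudi
    |TBLPROPERTIES (
    |  type = 'cow',
    |  primaryKey = 'id',
    |  preCombineField = 'ts'
    |)
    |PARTITIONED BY (dt)
    |LOCATION 's3://<PATH>/hudi_demo_table_copy'
    |AS SELECT id, name, ts, dt FROM hudi_demo_table
    |""".stripMargin)
```

As in the reproduction steps, replace <PATH> with your own S3 path. For large tables, remember that this rewrites all of the selected data, so it costs more than a metadata-only CREATE TABLE LIKE would.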

Conclusion

In a nutshell, this is a tricky bug that affects CREATE TABLE LIKE operations for both COW and MOR Hudi tables. It's caused by a code check that's too restrictive, which makes a perfectly valid operation fail. Hopefully, this helps you understand the problem, why it happens, and what you can do about it. Keep this in mind when you're working with Hudi and Spark SQL, and hopefully, a fix will be available soon. Keep on coding!