endaq.ide.get_doc() Bug: Out-of-Bounds Data Returns

by SLV Team
`endaq.ide.get_doc()` Bug: Returns Data Outside Requested Start/End

Hey guys! Today, we're diving into a tricky bug in the endaq-python library, specifically with the endaq.ide.get_doc() function. This issue can cause some headaches when you're trying to extract data from IDE files within specific timeframes. Let's break down the problem, how to reproduce it, and what the expected behavior should be.

Summary of the Issue

The core of the bug is this: when you call endaq.ide.get_doc() with timezone-aware datetime values for the start and end parameters, and these values fall completely outside the actual recording interval of the IDE file, the function still returns a document with data. Not only that, but the channels in this document are not empty, containing data points that lie outside the timeframe you specified. This is definitely not what we want!

Expected Behavior

Ideally, if you request data from a period that doesn't exist in the IDE file, the function should return an object where each channel contains zero samples. This would clearly indicate that no data was found within the requested window. Think of it like asking for a slice of cake that was never baked – you'd expect an empty plate, not a random piece of bread!
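
To make that expectation concrete, here's a minimal pure-Python sketch of the trimming contract. The helper name `trim_samples` and the sample values are hypothetical, and this is not how endaq-python implements trimming. It only illustrates that a window with no overlap should yield zero samples.

```python
from datetime import datetime, timedelta, timezone

def trim_samples(samples, start, end):
    """Keep only (timestamp, value) pairs with start <= timestamp <= end."""
    return [(t, v) for t, v in samples if start <= t <= end]

# Hypothetical recording: five samples, one second apart, on October 19th (UTC).
t0 = datetime(2025, 10, 19, 9, 57, 15, tzinfo=timezone.utc)
samples = [(t0 + timedelta(seconds=i), float(i)) for i in range(5)]

# A window that overlaps the recording keeps the samples inside it.
inside = trim_samples(samples, t0 + timedelta(seconds=1), t0 + timedelta(seconds=3))
print(len(inside))

# A window that ends before the recording starts should keep nothing.
day_before = t0 - timedelta(days=1)
outside = trim_samples(samples, day_before, day_before + timedelta(hours=1))
print(len(outside))
```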

Steps to Reproduce the Bug

Okay, let's get our hands dirty and see how to make this bug appear. Here’s a step-by-step guide, complete with code snippets, so you can follow along.

  1. Parse the Filename for Datetime Values:

    First, we need to extract the start and end times from the IDE filename. We'll use the datetime and zoneinfo modules for this. The key here is to make sure we're working with timezone-aware datetimes. This is crucial for the bug to manifest. Let's see how to do it:

    from datetime import datetime
    from zoneinfo import ZoneInfo  # standard library, Python 3.9+

    import endaq.ide  # importing the submodule also binds the name `endaq`

    filename = "20251018_RC63_LH_SAM4_BATCH_1_202510180930_202510181830"
    tz = ZoneInfo("Australia/Perth")  # AWST (UTC+08)

    # The last two underscore-separated fields hold the start and end times.
    start_str, end_str = filename.split("_")[-2:]
    start_time = datetime.strptime(start_str, "%Y%m%d%H%M").replace(tzinfo=tz)
    end_time = datetime.strptime(end_str, "%Y%m%d%H%M").replace(tzinfo=tz)

    print(start_time)  # 2025-10-18 09:30:00+08:00
    print(end_time)    # 2025-10-18 18:30:00+08:00
    

    In this code, we're taking a filename, extracting the date and time information, and then applying the Australia/Perth timezone (AWST, which is UTC+08). This ensures our datetimes are timezone-aware.

  2. Request a Trimmed Document with Out-of-Bounds Times:

    Now comes the crucial part. We'll call endaq.ide.get_doc() with start and end times that do not overlap the actual data in the IDE file. In the example below, the file contains data from October 19th UTC, but we're requesting data from October 18th AWST. This mismatch is what triggers the bug.

    doc = endaq.ide.get_doc(
        r"C:\...\20251018_RC63_LH_SAM4_BATCH_2_202510180930_202510181830.IDE",
        start=start_time,
        end=end_time,
    )
    

    Make sure to replace the file path with the actual path to your IDE file.

  3. Inspect the Channel Table:

    Next, we'll use endaq.ide.get_channel_table() to inspect the contents of the returned document. This will reveal the unexpected non-empty channels.

    endaq.ide.get_channel_table(doc)
    

    When you run this, you'll likely see a table of channels that still contain data, even though the requested window holds none. You might also see a RuntimeWarning about division by zero. That warning most likely comes from a channel with one sample and zero duration (channel 102.0 in the table below), whose rate works out to inf Hz, so treat it as a side issue rather than the bug itself. Here’s an example of the output you might see:

    ...\endaq\ide\info.py:201: RuntimeWarning: divide by zero encountered in scalar divide
      rate = samples / (duration / 10 ** 6)
    
    | channel | name                 | type                 | units   | start         | end           | duration   | samples | rate       |
    |---------|----------------------|----------------------|---------|---------------|---------------|------------|---------|------------|
    | 8.0     | X (2000g)            | Acceleration         | g       | 00:00.0008    | 00:00.0552    | 00:00.0543 | 2720    | 5000.78 Hz |
    | 8.1     | Y (2000g)            | Acceleration         | g       | 00:00.0008    | 00:00.0552    | 00:00.0543 | 2720    | 5000.78 Hz |
    | 8.2     | Z (2000g)            | Acceleration         | g       | 00:00.0008    | 00:00.0552    | 00:00.0543 | 2720    | 5000.78 Hz |
    | 80.0    | X (40g)              | Acceleration         | g       | 00:00.0015    | 00:00.0506    | 00:00.0491 | 248     | 504.87 Hz  |
    | 80.1    | Y (40g)              | Acceleration         | g       | 00:00.0015    | 00:00.0506    | 00:00.0491 | 248     | 504.87 Hz  |
    | 80.2    | Z (40g)              | Acceleration         | g       | 00:00.0015    | 00:00.0506    | 00:00.0491 | 248     | 504.87 Hz  |
    | 20.0    | Internal Pressure    | Pressure             | Pa      | 00:00.0039    | 00:06.0112    | 00:06.0073 | 62      | 10.21 Hz   |
    | 20.1    | Internal Temperature | Temperature          | °C      | 00:00.0039    | 00:06.0112    | 00:06.0073 | 62      | 10.21 Hz   |
    | 65.0    | X                    | Quaternion           | q       | 00:00.0111    | 00:01.0122    | 00:01.0010 | 102     | 100.94 Hz  |
    | 65.1    | Y                    | Quaternion           | q       | 00:00.0111    | 00:01.0122    | 00:01.0010 | 102     | 100.94 Hz  |
    | 65.2    | Z                    | Quaternion           | q       | 00:00.0111    | 00:01.0122    | 00:01.0010 | 102     | 100.94 Hz  |
    | 65.3    | W                    | Quaternion           | q       | 00:00.0111    | 00:01.0122    | 00:01.0010 | 102     | 100.94 Hz  |
    | 65.4    | Acc                  | Quaternion           | q       | 00:00.0111    | 00:01.0122    | 00:01.0010 | 102     | 100.94 Hz  |
    | 88.0    | Latitude             | Location             | Degrees | 01:33:55.0979 | 01:34:27.0322 | 00:31.0342 | 32      | 1.02 Hz    |
    | 88.1    | Longitude            | Location             | Degrees | 01:33:55.0979 | 01:34:27.0322 | 00:31.0342 | 32      | 1.02 Hz    |
    | 88.2    | Time                 | Unix Epoch           | s       | 01:33:55.0979 | 01:34:27.0322 | 00:31.0342 | 32      | 1.02 Hz    |
    | 88.3    | Ground Speed         | GNSS Speed           | m/s     | 01:33:55.0979 | 01:34:27.0322 | 00:31.0342 | 32      | 1.02 Hz    |
    | 102.0   | GNSS Time:00         | Unix Epoch Reference | s       | 01:33:59.0322 | 01:33:59.0322 | 00:00.0000 | 1       | inf Hz     |
    
  4. Retrieve Primary Sensor Data and Check Timestamps:

    Let's go a step further and retrieve the primary sensor data using endaq.ide.get_primary_sensor_data(). This will allow us to examine the actual timestamps of the data.

    df = endaq.ide.get_primary_sensor_data(doc=doc)
    print(df.head())
    print(len(df))
    

    You'll notice that the timestamps in the DataFrame are from October 19th UTC, well outside our requested window of 09:30 to 18:30 on October 18th AWST (that's 01:30 to 10:30 UTC on October 18th). This confirms that the function is returning data from the wrong timeframe.

                                        X (2000g)  Y (2000g)  Z (2000g)
    timestamp
    2025-10-19 09:57:15.008941+00:00      3.853685  -1.895579  -1.323955
    2025-10-19 09:57:15.009141042+00:00   3.853685  -1.998095  -0.949977
    ...
    2720
    
  5. Manually Filter the Data (for Comparison):

    To illustrate the correct behavior, let's manually filter the DataFrame using the same start_time and end_time. This will show us what we should have gotten: an empty DataFrame.

    df_trimmed = df.loc[(df.index > start_time) & (df.index < end_time)]
    print(df_trimmed.head())
    print(len(df_trimmed))
    

    As expected, this yields an empty DataFrame:

    Empty DataFrame
    Columns: [X (2000g), Y (2000g), Z (2000g)]
    Index: []
    0
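
The empty result follows from simple timezone arithmetic. The sketch below uses only the standard library to convert the requested AWST window to UTC and compare it with the first timestamp printed in step 4. The variable name `first_sample` is introduced here for illustration, and a fixed UTC+08:00 offset stands in for ZoneInfo("Australia/Perth"), which is safe because AWST has no daylight saving.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+08:00 offset, used here in place of ZoneInfo("Australia/Perth").
awst = timezone(timedelta(hours=8))

start_time = datetime(2025, 10, 18, 9, 30, tzinfo=awst)
end_time = datetime(2025, 10, 18, 18, 30, tzinfo=awst)

# First timestamp printed by get_primary_sensor_data() in step 4.
first_sample = datetime(2025, 10, 19, 9, 57, 15, 8941, tzinfo=timezone.utc)

start_utc = start_time.astimezone(timezone.utc)
end_utc = end_time.astimezone(timezone.utc)
print(start_utc)  # 2025-10-18 01:30:00+00:00
print(end_utc)    # 2025-10-18 10:30:00+00:00

# Aware datetimes compare by absolute instant, whatever their tzinfo.
print(start_time <= first_sample <= end_time)  # False

# Gap between the end of the requested window and the first sample.
print(first_sample - end_time)
```

The first sample lands almost a full day after the requested window closes, so no amount of timezone confusion on the caller's side explains the returned data.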
    

Actual Behavior (The Bug in Action)

So, to recap, here’s what's actually happening:

  • endaq.ide.get_doc() returns a document containing non-empty channels and data outside the requested start/end window when the window doesn't overlap the recording interval.
  • get_primary_sensor_data(doc) returns data with timestamps outside the requested range.
  • A RuntimeWarning: divide by zero encountered in scalar divide appears, likely due to a channel with zero duration and one sample resulting in infinite Hz. This is more of a side effect than the main issue, but it’s still worth noting.
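
The warning in the last bullet comes from the rate expression shown in the traceback, `samples / (duration / 10 ** 6)`, which divides by zero when a channel's duration is 0. As a rough illustration only, a guarded version of that arithmetic might look like the sketch below. The function name `safe_rate_hz` is hypothetical and not part of endaq-python, the inputs are made-up values, and the duration is assumed to be in microseconds, as the `10 ** 6` factor suggests.

```python
import math

def safe_rate_hz(samples, duration_us):
    """Return samples per second, or NaN when the duration is not positive."""
    if duration_us <= 0:
        return math.nan  # a rate is undefined for a zero-length channel
    return samples / (duration_us / 10 ** 6)

print(safe_rate_hz(1, 0))            # nan
print(safe_rate_hz(100, 2_000_000))  # 50.0
```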

Why This Bug Matters

This bug can lead to some serious problems if you're not careful. Imagine you're analyzing data from a specific event, and you rely on endaq.ide.get_doc() to filter the data. If the function returns data from the wrong timeframe, your analysis will be completely off! This can lead to incorrect conclusions and potentially flawed decision-making.

Conclusion and Next Steps

Alright, guys, we've thoroughly explored this bug in endaq.ide.get_doc(). We've seen how it can return data from outside the requested timeframe, leading to potential analysis errors. The key takeaway is that using timezone-aware datetimes with non-overlapping intervals triggers the issue.

Hopefully, this detailed explanation helps you understand the bug and avoid it in your own projects. Stay tuned for updates, and let's hope the endaq-python team squashes this bug soon! If you encounter this issue, be sure to report it on the endaq-python GitHub repository to help the developers track and fix it.

Happy data analyzing!