SLA-Aware Data Pipelines: Backfills & Late Data Handling
Summary
Add first-class support for late/straggler events and historical backfills, and make data SLAs explicit across both the batch and streaming paths. The design combines event-time watermarks in Spark Structured Streaming, partition-aware Airflow backfill DAGs, automatic gap detection, and self-healing reruns, so downstream stores and dashboards stay accurate without manual babysitting.
Motivation
Real-world data pipelines are messy: clock skew, upstream retries, and historical reprocessing are routine. Today, late data can be silently dropped or, worse, double-counted, and backfills are ad-hoc. Specifically, we need:
- Deterministic handling of late data: exactly-once semantics per partition/window.
- Automated gap detection and backfill: missing partitions or hours are detected and reprocessed automatically.
- Observable SLAs: e.g., "T-15 min freshness for the Kafka→Postgres anomalies table."
Goals
Here's what we're aiming for:
- Streaming: event-time processing with watermarks and stateful dedupe. Events that arrive after the watermark are routed to a DLQ in MinIO, with replay tooling.
- Batch: a parameterized backfill DAG that handles date/hour partition ranges with idempotent writes and partition overwrite/merge.
- SLA layer: declarative SLAs per dataset. Airflow sensors check freshness and completeness and automatically trigger backfills on breach.
- Observability: Prometheus metrics for lateness histograms, watermark lag, and SLA breach counts, plus Grafana dashboards to visualize them.
- Idempotency: per-partition job locks and upsert/merge semantics in Postgres.
Diving Deeper into the Goals
Let's break these goals down further. The streaming goal makes the streaming path more robust and accurate: event-time processing with watermarks handles late-arriving data deterministically instead of losing it or corrupting counts; stateful deduplication prevents double-counting, which is crucial for data integrity; and the dead-letter queue (DLQ) captures events that are too late to process in real time so they can be replayed later.
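To make this concrete, here is a minimal PySpark sketch of the watermark-plus-dedupe step, assuming a Kafka source with the Kafka connector on the classpath; the topic, column names, paths, and 15-minute delay are illustrative, and the DLQ capture for watermark-dropped events would be a separate path feeding MinIO.

```python
# Sketch: event-time dedupe bounded by a watermark; all names below are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("late-data-sketch").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "anomalies")
    .load()
    .select(
        F.from_json(
            F.col("value").cast("string"),
            "event_id STRING, event_time TIMESTAMP, payload STRING",
        ).alias("e")
    )
    .select("e.*")
)

deduped = (
    events
    # Events older than the watermark are dropped from this path; a separate
    # capture step would divert them to the MinIO DLQ for replay.
    .withWatermark("event_time", "15 minutes")
    # Including the event-time column in the key lets Spark expire dedupe state
    # once the watermark passes, keeping state bounded.
    .dropDuplicates(["event_id", "event_time"])
)

query = (
    deduped.writeStream.outputMode("append")
    .format("parquet")
    .option("path", "s3a://curated/anomalies/")  # MinIO-backed bucket
    .option("checkpointLocation", "s3a://checkpoints/anomalies/")
    .start()
)
```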
Next, the batch goal makes reprocessing efficient and reliable. A parameterized backfill DAG makes it easy to rebuild historical date/hour ranges; idempotent writes prevent duplication or inconsistency when a backfill is rerun; and partition overwrite/merge restricts each update to exactly the affected partitions.
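For the DAG shape, here is a minimal sketch assuming Airflow 2.4+ (for the `schedule` argument); the `anomalies_backfill` DAG id and the `mypipeline.jobs.rebuild_anomalies_partition` helper are hypothetical stand-ins for the real job.

```python
# Sketch of a partition-wise backfill DAG; dag_id and the project helper are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def rebuild_partition(ds: str, **_):
    """Recompute one date partition; rerunning the same `ds` must be idempotent."""
    # Hypothetical project helper: reads raw data for `ds`, recomputes the
    # aggregate, and writes it back with upsert / partition-overwrite semantics.
    from mypipeline.jobs import rebuild_anomalies_partition
    rebuild_anomalies_partition(partition_date=ds)


with DAG(
    dag_id="anomalies_backfill",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=True,        # lets `airflow dags backfill -s ... -e ...` cover a date range
    max_active_runs=1,   # one partition at a time keeps per-partition locking simple
) as dag:
    PythonOperator(
        task_id="rebuild_partition",
        python_callable=rebuild_partition,  # Airflow injects `ds`, the logical date
    )
```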
The SLA layer ensures each dataset meets its service level agreement. Declarative SLAs per dataset let us monitor freshness and completeness and trigger backfills automatically when a dataset falls behind, so downstream consumers always see accurate, timely data.
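One possible shape for the declarative SLA layer is sketched below; `DatasetSLA`, its fields, and the `breached` helper are hypothetical names for whatever the sensors end up evaluating.

```python
# Sketch: per-dataset SLA declaration plus the breach check a sensor could run.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DatasetSLA:
    dataset: str
    freshness: timedelta        # max age of the newest event in the target table
    completeness_lookback: int  # how many recent partitions must be present
    backfill_dag_id: str        # DAG to trigger when the SLA is breached


SLAS = [
    DatasetSLA(
        dataset="anomalies",
        freshness=timedelta(minutes=15),  # the "T-15 min freshness" goal above
        completeness_lookback=24,         # the last 24 hourly partitions must exist
        backfill_dag_id="anomalies_backfill",
    ),
]


def breached(sla: DatasetSLA, newest_event_time: datetime, missing_partitions: int) -> bool:
    """True if the dataset is stale or has gaps; the caller triggers sla.backfill_dag_id."""
    stale = datetime.now(timezone.utc) - newest_event_time > sla.freshness
    return stale or missing_partitions > 0
```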
Observability gives insight into pipeline health and performance. Metrics for lateness, watermark lag, and SLA breaches let us identify and resolve issues quickly, and Grafana dashboards make the overall state easy to monitor.
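As a sketch, the jobs could expose these series via `prometheus_client`; the metric names, label, buckets, and port are illustrative.

```python
# Sketch: lateness, watermark lag, and SLA-breach metrics; names are illustrative.
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# How late events arrive relative to their event time, in seconds.
EVENT_LATENESS = Histogram(
    "pipeline_event_lateness_seconds",
    "Arrival time minus event time",
    buckets=(1, 5, 30, 60, 300, 900, 3600),
)
# Current watermark lag per streaming query.
WATERMARK_LAG = Gauge("pipeline_watermark_lag_seconds", "Now minus the current watermark")
# Total SLA breaches, labeled by dataset.
SLA_BREACHES = Counter("pipeline_sla_breach_total", "SLA breaches", ["dataset"])

if __name__ == "__main__":
    start_http_server(9108)  # scrape endpoint for Prometheus; port is arbitrary
```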
Finally, idempotency keeps data consistent when jobs rerun. Per-partition job locks and upsert/merge semantics in Postgres prevent duplication or corruption from concurrent or repeated writes, which matters most for backfills and other tasks that may execute more than once.
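A minimal sketch of the Postgres write path with `psycopg2`, pairing a transaction-scoped advisory lock with an upsert; the table, columns, partition-key encoding, and DSN are illustrative.

```python
# Sketch: per-partition advisory lock plus upsert so reruns overwrite, never duplicate.
import psycopg2

UPSERT = """
INSERT INTO anomalies_hourly (partition_hour, sensor_id, anomaly_count)
VALUES (%(partition_hour)s, %(sensor_id)s, %(anomaly_count)s)
ON CONFLICT (partition_hour, sensor_id)
DO UPDATE SET anomaly_count = EXCLUDED.anomaly_count;
"""


def write_partition(conn, partition_key: int, rows: list[dict]) -> None:
    with conn, conn.cursor() as cur:
        # Transaction-scoped advisory lock: a second writer for the same partition
        # blocks here instead of interleaving writes; released on commit/rollback.
        cur.execute("SELECT pg_advisory_xact_lock(%s);", (partition_key,))
        cur.executemany(UPSERT, rows)


if __name__ == "__main__":
    conn = psycopg2.connect("dbname=metrics user=pipeline")  # placeholder DSN
    write_partition(
        conn,
        partition_key=2024060112,  # e.g. the partition hour encoded as an integer
        rows=[{"partition_hour": "2024-06-01 12:00", "sensor_id": "s1", "anomaly_count": 3}],
    )
```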
Non-Goals
Here's what we're not trying to do right now:
- Immediately replacing our storage format with Hudi/Iceberg/Delta; that may come later.
- Global CDC ingestion. That's a separate epic.
Delving into the Non-Goals
It's important to be clear about what this project does not cover. We acknowledge the benefits of table formats such as Hudi, Iceberg, or Delta, but migrating now would be a significant undertaking and would distract from the primary goals of late data handling and backfills; we may revisit a migration later.
Similarly, global change data capture (CDC) ingestion is out of scope. CDC is valuable for capturing changes in real time, but it is a large topic that deserves its own dedicated effort and is tracked as a separate epic.
Design Overview
1) Streaming path (Spark Structured Streaming)
- Enable event-time processing with `withWatermark(eventTimeColumn, delayThreshold)`.