How can I handle late-arriving data in a data pipeline?
#1
I’ve been trying to build a reliable data pipeline for our reporting, but I keep hitting a wall with late-arriving data messing up our daily aggregates. My current process just overwrites the old values, which I know is wrong, but I’m unsure how to properly handle updates to fact records that come in after the initial load. What’s the simplest way you’ve managed this without rebuilding everything from scratch each time?
Reply
#2
we stopped overwriting and started using an upsert flow. load into staging, dedupe by id, then merge into the fact table on the key. late rows update the existing facts. we keep a last_updated timestamp and a simple daily snapshot so we can re-run the aggregates if needed. not glamorous but it saved us from rebuilding everything.
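roughly this, using sqlite as a stand-in warehouse (table and column names like fact_sales/staging_sales are made up, adapt to your schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (id INTEGER PRIMARY KEY, amount REAL, last_updated TEXT);
    CREATE TABLE staging_sales (id INTEGER, amount REAL, event_time TEXT);
""")

# initial load, then late arrivals land in staging (two touch the same id)
conn.execute("INSERT INTO fact_sales VALUES (1, 100.0, '2024-01-01')")
conn.executemany("INSERT INTO staging_sales VALUES (?, ?, ?)", [
    (1, 120.0, '2024-01-02'),  # late correction to an existing fact
    (1, 125.0, '2024-01-03'),  # even later correction, should win
    (2, 50.0,  '2024-01-03'),  # brand-new late fact
])

# dedupe staging by id (latest event_time wins), then upsert into the fact table
conn.execute("""
    INSERT INTO fact_sales (id, amount, last_updated)
    SELECT id, amount, event_time FROM staging_sales s
    WHERE event_time = (SELECT MAX(event_time) FROM staging_sales WHERE id = s.id)
    ON CONFLICT(id) DO UPDATE SET
        amount = excluded.amount,
        last_updated = excluded.last_updated
""")

print(conn.execute("SELECT id, amount FROM fact_sales ORDER BY id").fetchall())
# [(1, 125.0), (2, 50.0)]
```

in a real warehouse this is one MERGE statement, but the shape is the same: dedupe first so back-to-back late rows can't fight each other inside the merge.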
Reply
#3
we added a delta approach. there is a daily summary and a delta table that captures late changes for that day. late arrivals apply to the delta and then we merge into the daily totals. even when the late data is large, the delta job finishes in a reasonable time and avoids a full rebuild.
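toy sketch of the summary + delta split (names are illustrative): late rows only touch a small per-day delta, and a merge step folds it into the totals instead of recomputing the day from scratch.

```python
from collections import defaultdict

daily_summary = {"2024-01-01": 1000.0}   # totals from the nightly job
delta = defaultdict(float)               # late adjustments keyed by event day

def record_late_row(event_day, amount):
    """Late arrivals only touch the small delta table, not the summary."""
    delta[event_day] += amount

def merge_deltas():
    """Fold accumulated deltas into the daily totals, then clear them."""
    for day, adjustment in delta.items():
        daily_summary[day] = daily_summary.get(day, 0.0) + adjustment
    delta.clear()

record_late_row("2024-01-01", 25.0)
record_late_row("2024-01-01", -5.0)   # corrections can be negative
merge_deltas()
print(daily_summary["2024-01-01"])  # 1020.0
```

the same idea works for any additive measure; non-additive metrics (distinct counts etc.) are where this pattern breaks down and you fall back to recomputing the affected day.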
Reply
#4
i tried the upsert path but hit race conditions when late records showed up back to back. counts drifted while the merge happened. we added a small transactional boundary and a retry; still not perfect but it reduces the damage.
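something like this for the boundary + retry, again with sqlite standing in for the warehouse (BEGIN IMMEDIATE takes the write lock up front so two merges can't interleave; on a lock conflict we back off and try again):

```python
import sqlite3
import time

def merge_with_retry(conn, row_id, amount, retries=3):
    """Upsert one late fact inside an explicit transaction, retrying on lock errors."""
    for attempt in range(retries):
        try:
            conn.execute("BEGIN IMMEDIATE")  # grab the write lock before touching data
            conn.execute(
                "INSERT INTO facts (id, amount) VALUES (?, ?) "
                "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount",
                (row_id, amount),
            )
            conn.commit()
            return True
        except sqlite3.OperationalError:     # e.g. database is locked
            conn.rollback()
            time.sleep(0.1 * (attempt + 1))  # simple linear backoff
    return False

# isolation_level=None = autocommit mode, so the explicit BEGIN above is honored
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE facts (id INTEGER PRIMARY KEY, amount REAL)")
merge_with_retry(conn, 1, 100.0)
merge_with_retry(conn, 1, 110.0)  # back-to-back late record lands cleanly
print(conn.execute("SELECT amount FROM facts").fetchone()[0])  # 110.0
```

doesn't fix the ordering problem (last write still wins), it just stops two merges from half-applying on top of each other.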
Reply
#5
do you have a reliable watermark or event time to tell late data from fresh data? i found that without a clear order the problem keeps chasing you.
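for what a watermark buys you, a minimal sketch (the 1-hour allowed lateness is an arbitrary example): track the max event time seen, and anything older than watermark minus the lateness budget gets routed to the late path instead of the fresh load.

```python
from datetime import datetime, timedelta

watermark = datetime.min
ALLOWED_LATENESS = timedelta(hours=1)

def classify(event_time):
    """Advance the watermark and report whether this row counts as late."""
    global watermark
    watermark = max(watermark, event_time)
    if event_time < watermark - ALLOWED_LATENESS:
        return "late"    # send to the upsert/delta path
    return "fresh"       # normal append path

print(classify(datetime(2024, 1, 1, 12, 0)))   # fresh
print(classify(datetime(2024, 1, 1, 12, 30)))  # fresh, watermark moves to 12:30
print(classify(datetime(2024, 1, 1, 10, 0)))   # late (more than 1h behind)
```

without a trustworthy event time to drive this, "late" vs "fresh" is a guess, which is why the problem keeps chasing you.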
Reply