How do I track column-level provenance when data is changed by external scripts?
#1
I’ve been trying to build a reliable data lineage map for our ETL pipelines, but I keep hitting a wall when transformations happen outside our core tools. How do you all handle tracking column-level provenance when a script in a completely separate system alters the data?
Reply
#2
Sounds familiar. We added a lightweight beacon to the out-of-band script: every time it touches a column, it emits a lineage event with the source column id, the target column id, the transform name, and a timestamp to a central store. It helped surface some gaps, but it also created noise when scripts failed or skipped runs.
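A minimal sketch of what our beacon does, assuming the central store is just any append-able sink (the function and field names here are ours, not from a standard library):

```python
import json
import time
import uuid

def emit_lineage_event(source_column_id, target_column_id, transform_name, sink):
    """Append one column-level lineage event to a central store.

    `sink` is anything with .append() -- in production ours is a queue
    writer, but a list works for testing.
    """
    event = {
        "event_id": str(uuid.uuid4()),       # lets us dedupe retried runs
        "source_column_id": source_column_id,
        "target_column_id": target_column_id,
        "transform": transform_name,
        "emitted_at": time.time(),
    }
    sink.append(json.dumps(event))
    return event
```

The `event_id` is what saved us from the noise problem: failed scripts that retry emit duplicate events, and a unique id per emission lets the consumer dedupe instead of double-counting edges.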
Reply
#3
We also built a mapping table in the data warehouse: source_column_id, transform_id, target_column_id, run_id, and a snapshot timestamp. The external system pushes an entry whenever it completes a local transform. It’s not perfect, but it lets us stitch together a chain when core tooling misses things.
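Roughly what that mapping table and the stitching query look like, sketched here with sqlite3 so it runs standalone (our warehouse is not SQLite, and the table/function names are illustrative):

```python
import sqlite3

def record_transform(conn, source_column_id, transform_id, target_column_id, run_id):
    """Insert one lineage row; the external system calls this on completion."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS column_lineage (
               source_column_id TEXT,
               transform_id     TEXT,
               target_column_id TEXT,
               run_id           TEXT,
               snapshot_ts      TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    conn.execute(
        "INSERT INTO column_lineage "
        "(source_column_id, transform_id, target_column_id, run_id) "
        "VALUES (?, ?, ?, ?)",
        (source_column_id, transform_id, target_column_id, run_id),
    )
    conn.commit()

def upstream_of(conn, target_column_id):
    """Walk one hop back up the chain for a given target column."""
    rows = conn.execute(
        "SELECT source_column_id FROM column_lineage WHERE target_column_id = ?",
        (target_column_id,),
    ).fetchall()
    return [r[0] for r in rows]
```

Stitching a full chain is just repeated calls to `upstream_of` (or a recursive CTE in the warehouse) until you hit a column with no recorded parent.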
Reply
#4
Sometimes I feel like we’re chasing a ghost. Scripts outside our control can rename columns, move data, or wipe logs. We started using checksums of the input and output values to flag unexpected changes, plus catalog tags, but it’s brittle and slow to reconcile across big pipelines.
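For anyone curious, the checksum approach is nothing fancy; a sketch of the idea (function names are ours):

```python
import hashlib

def column_checksum(values):
    """Order-sensitive digest of a column's values; any edit flips the hash."""
    h = hashlib.sha256()
    for v in values:
        h.update(repr(v).encode("utf-8"))
        h.update(b"\x1f")  # separator so ["ab", "c"] != ["a", "bc"]
    return h.hexdigest()

def changed_unexpectedly(before_values, after_values):
    """True if a column's contents differ between two snapshots."""
    return column_checksum(before_values) != column_checksum(after_values)
```

The brittleness comes from exactly what you'd expect: it only tells you *that* something changed, not *what* or *which script*, and computing digests over wide tables at every handoff is where the reconciliation cost piles up.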
Reply
#5
Do you have a sense that the real blocker is the lack of a canonical column-level contract across systems?
Reply

