How can I prevent a large language model from hallucinating in archival notes?
#1
I’ve been trying to use a large language model to help parse and categorize decades of old, unstructured lab notes in my field, but I’m hitting a wall with its tendency to confidently invent plausible-sounding but completely incorrect experimental parameters. This is making the whole process of building a reliable dataset from our historical records feel untrustworthy. Has anyone else dealt with this specific issue in their own archival work?
#2
I did something like this a while back. We fed decades of notes into an LLM to tag topics and pull out parameters, but the model kept producing plausible numbers that never existed in the records. It would claim a run used a temperature of 37 °C and a reagent mix that was never documented. The notes are messy and the acronyms vary, so the errors showed up wherever the text was ambiguous.
#3
One trick we tried was to disable automatic extraction of numbers and have a human verify every parameter before it was saved. The model still suggested topics, but we only captured values that a person had confirmed. It cut out the fabricated parameters, but it didn't remove the confusion about what the notes actually said.
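In case it helps, here is a minimal sketch of that flow in Python, assuming a verbatim grounding check plus a human review queue. extract_parameters is only a placeholder for whatever LLM call you actually make, and Candidate and its field names are invented for illustration, not anything from a real library.

import re
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    value: str
    snippet: str            # text around the value, shown to the reviewer
    confirmed: bool = False

def grounded_in_source(value: str, note_text: str) -> bool:
    """Keep a numeric value only if the exact token appears in the note text."""
    return re.search(rf"\b{re.escape(value)}\b", note_text) is not None

def extract_parameters(note_text: str) -> list[Candidate]:
    """Stand-in for the LLM extraction call; returns candidate parameters."""
    # A real implementation would parse the model's structured output here.
    return [Candidate("temperature_C", "37", note_text[:60])]

def build_review_queue(note_text: str) -> list[Candidate]:
    """Drop ungrounded values automatically; everything else waits for a human."""
    return [c for c in extract_parameters(note_text)
            if grounded_in_source(c.value, note_text)]

def confirm(candidate: Candidate, reviewer_ok: bool) -> Candidate:
    """Only candidates a reviewer signs off on ever reach the dataset."""
    candidate.confirmed = reviewer_ok
    return candidate

The grounding check means a value the model cannot point to verbatim in the note never even reaches the reviewer, so the human only spends time on parameters that at least exist in the text.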
#4
I wonder if the real barrier is the dataset, not the model. If the notes are sparse or the handwriting defeats the OCR, the model fills in the blanks with things that feel right. We tested on a small subset, compared the output against lab log entries we know to be correct, and the mismatch was large in some cases.
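Something along these lines is enough to put a number on that mismatch. A rough sketch only, assuming both the extraction output and the trusted log entries can be flattened into per-note dicts; mismatch_rate, the field names, and the example values are made up for illustration.

def mismatch_rate(extracted: dict[str, dict[str, str]],
                  ground_truth: dict[str, dict[str, str]]) -> float:
    """Fraction of fields the extraction got wrong or invented outright."""
    checked = 0
    wrong = 0
    for note_id, true_fields in ground_truth.items():
        got = extracted.get(note_id, {})
        for key, true_value in true_fields.items():
            checked += 1
            if got.get(key) != true_value:
                wrong += 1
        # Fields the model reports that the log never contained count as errors too.
        invented = set(got) - set(true_fields)
        wrong += len(invented)
        checked += len(invented)
    return wrong / checked if checked else 0.0

# Made-up example: one wrong value plus one invented field out of three checks.
truth = {"note_014": {"temperature_C": "25", "reagent": "EDTA"}}
model_out = {"note_014": {"temperature_C": "37", "reagent": "EDTA", "pH": "7.4"}}
print(f"mismatch rate: {mismatch_rate(model_out, truth):.0%}")   # -> 67%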
#5
Do you think the real problem is not the model but the missing context in the notes and the quality of the handwriting OCR?

