How can I get consistent results from a language model on lab notes?
#1
I’ve been trying to use a large language model to help categorize my lab’s unstructured experimental notes, but I’m hitting a wall with its consistency. It will correctly tag a procedure one time, then completely misclassify a nearly identical entry the next, even with very clear prompting. Has anyone else run into this problem where the model’s output seems arbitrarily variable on structured scientific text?
#2
I’ve been there. One entry gets tagged correctly, the next nearly identical one goes off the rails. It feels like the model latches onto a single cue in the prompt and ignores the bit of context that would keep the label stable. I tried pinning a small fixed exemplar set and enforcing a single label per entry, which helped somewhat but didn’t eliminate the misclassifications over time.
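For what it’s worth, the pinned-exemplar setup was roughly the following sketch. The exemplar notes and label names here are made-up placeholders; swap in your own taxonomy:

```python
# Minimal sketch of a pinned few-shot prompt: the same labeled
# exemplars are prepended to every request, and the model is asked
# for exactly one label from a closed set. Exemplars/labels below
# are hypothetical placeholders.
EXEMPLARS = [
    ("Ran 30 cycles, 55C annealing, checked product on 1% agarose.", "pcr"),
    ("Passaged HEK293 cells 1:10 into fresh DMEM + 10% FBS.", "cell_culture"),
]
LABELS = ["pcr", "cell_culture", "western_blot", "gel_electrophoresis"]

def build_prompt(entry: str) -> str:
    """Build a classification prompt with a fixed exemplar block."""
    lines = [
        "Classify each note with exactly one label from: "
        + ", ".join(LABELS) + "."
    ]
    for text, label in EXEMPLARS:
        lines.append(f"Note: {text}\nLabel: {label}")
    # The trailing "Label:" cues the model to emit only the label.
    lines.append(f"Note: {entry}\nLabel:")
    return "\n\n".join(lines)
```

Keeping the exemplar block byte-identical across calls at least removes one source of drift, even if it doesn’t fix the model itself.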
#3
We tried lowering the temperature to 0 and setting a deterministic seed so outputs don’t wander. Then we added a post-processing checker: if the tag doesn’t match our known categories, it gets sent to human review or mapped to the closest valid label. That cut down the obviously wrong tags, but edge cases still slip through.
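Our checker is basically a closed-vocabulary validator with fuzzy fallback. Something like this sketch (the tag set is a made-up example; we use `difflib` rather than anything model-based):

```python
import difflib

# Hypothetical closed tag vocabulary -- replace with your taxonomy.
KNOWN_TAGS = {"pcr", "gel_electrophoresis", "cell_culture", "western_blot"}

def check_tag(raw_tag: str, cutoff: float = 0.8):
    """Validate a model-produced tag against the known taxonomy.

    Returns (tag, needs_review): exact matches pass through,
    near matches are mapped to the closest known label, and
    anything else is flagged for human review.
    """
    tag = raw_tag.strip().lower().replace(" ", "_")
    if tag in KNOWN_TAGS:
        return tag, False
    close = difflib.get_close_matches(tag, KNOWN_TAGS, n=1, cutoff=cutoff)
    if close:
        return close[0], False
    return tag, True  # unknown label -> route to human review
```

The `cutoff` is the knob that decides how aggressively misspellings get auto-corrected versus escalated; we err toward escalation.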
#4
Do you think the model’s inconsistency is the real bottleneck here, or is it more that the input notes and terminology are so variable that the model can’t settle on stable labels?
#5
One time I tried cleaning up the notes a bit: standardizing abbreviations and units, and adding a short header listing the main variables. After that, the model seemed steadier for a short stretch, then it started wandering again. Maybe older notes in the context window are pulling attention away, or the taxonomy we use keeps drifting in practice.
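The cleanup pass was just a regex substitution table, roughly like this sketch (the abbreviation map is a made-up example; adapt it to your lab’s conventions):

```python
import re

# Hypothetical abbreviation/unit map -- extend for your own notes.
ABBREVIATIONS = {
    r"\bo/n\b": "overnight",
    r"\brt\b": "room temperature",
    r"\bμl\b": "uL",
}

def normalize(note: str) -> str:
    """Standardize abbreviations and units before classification,
    so nearly identical entries actually look identical."""
    out = note.lower()
    for pattern, replacement in ABBREVIATIONS.items():
        out = re.sub(pattern, replacement, out)
    # Collapse repeated whitespace left over from edits.
    return re.sub(r"\s+", " ", out).strip()
```

Even with the drift coming back, normalizing first made the failures easier to diagnose, because two notes that disagree on the label now differ only in content, not in spelling.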