What is a heisenbug and how do i debug it in production?
#1
I’ve been trying to debug a race condition in our API client, and I think the issue is a Heisenbug. It only fails in production under heavy load, but every time I try to trace it locally or add logging, the problem just disappears. Has anyone else had a bug that seems to actively avoid being observed?
Reply
#2
Yep, I’ve chased a Heisenbug too. It only shows up under load, and the moment you log or watch it, it disappears. We finally traced it to a race around lazy initialization in a shared client, where the timing between threads changes with CPU saturation.
Reply
#3
We added light-weight timing around the critical path and ran with tracing off by default in prod; the patterns were visible in production traces but the root cause remained elusive.
Reply
#4
Could it be that the problem isn’t the bug at all, but something in the proxy or load balancer under heavy load?
Reply
#5
Sometimes I drift into thinking maybe the logging framework itself is altering timing, especially with sync vs async flush paths.
Reply
#6
One quick move we tried was to remove a few noncritical async tasks to see if the timing window shrank; it didn’t give a clean signal, but it narrowed the field.
Reply


[-]
Quick Reply
Message
Type your reply to this message here.

Image Verification
Please enter the text contained within the image into the text box below it. This process is used to prevent automated spam bots.
Image Verification
(case insensitive)

Forum Jump: