Years ago, at a financial services company…
It started with some mild grumbling. Some people were complaining about getting signed out on our customer portal. But it was probably just user error, right? The reports seemed to happen more and more though. Then, employees who worked on the product started to get signed out.
At this point we have to admit that there is a problem. Every so often someone is using the site and they are suddenly booted out.
It happens seemingly at random. If you have a bug that you cannot reliably reproduce, your goal is to gather enough information until you can. “Seemingly at random” is a euphemism for “caused by some piece that we aren’t considering.”
Being able to reproduce a bug is important. First, you want to be confident that you have fixed the issue. If you can’t be sure you’ll trigger the bug, you can’t be sure your fix worked. Second, the factors that go into triggering the bug give you insight into where the problem is. Working toward a consistent reproduction is helping you triangulate the problem. So we started gathering information in hopes that we could reproduce it.
One thing you can do is ask yourself “what changed?” One of the things that changed was that we were rolling out a new version of our in-house PHP framework. This felt important, but also not, since the affected site was on the old framework. What’s more, both frameworks used the same way to delete sessions. Something like:
sprintf('DELETE FROM `sessions` WHERE `expires_at` < %d', now());
Which is a pretty reasonable way to expire sessions. The session creation code was similarly reasonable
sprintf('INSERT INTO `sessions` (`cookie`, …, `expires_at`) VALUES (:cookie, …, :expires )', $cookie, …, now() + SESSION_LENGTH);
Jason, one of our analytics folks, was able to add HTTP monitoring. We figured out what the problem looked like from the browser side (a request with a session id setting a different session id in less than n seconds) and started logging how often it happened.
What we learned from that the frequency was accelerating. The longer it took us to fix, the worse it was going to get.
We still cannot reliably reproduce this at this point, so we have to keep gathering information. I was tempted to deploy changes out to see if they worked. This is dangerous. If you don’t understand the problem, you don’t know if a fix is going to inadvertently make things worse. Remember: if it seems random, you don’t understand it.
Adam, one of the devs on the team, added logging to our session delete code. Before it ran the DELETE statement, it did something like
debug_log(query(sprintf('SELECT cookie, expires_at FROM `sessions` WHERE `expires_at` < %d', now())));
so we could see exactly which sessions were being deleted. That’s when things started to get weird.
We would compare that to our HTTP monitoring and find the session and see that it definitely should not have expired based on the timing. But the session logging says that it definitely should be deleted based on the time it was logged.
At this point we reach the “Eureka” moment. For context, “Eureka” is what old Archimedes shouted as he jumped out of the bathtub. While the bathtub is clearly useful in realizing that water displacement can measure volume, I don’t think enough credit is given to the bathtub as a place of thought in that story. I consider showering (along with dog walking, driving, and sleeping) to be an important part of debugging.
And so I was, when I was trying to figure out how to reconcile the two facts:
- Sessions were expiring before they should be
- Sessions were expiring on time
I realized the disconnect, hopped out of the shower and sent an email to our Unix admins asking them to check the clocks on our production servers. Sure enough, one of the load-balanced servers wasn’t running ntp and it’s clock had drifted out of sync.
What was happening was if the session code was called on the rogue server then
now() was wrong. That meant that sessions created on it would live extra long, but sessions deleted on it would die young.
Why was it accelerating? It turns out that in V1 of our framework the session-delete code was mistakenly not hooked up, but that was fixed in V2. The way PHP works is that every n requests it calls the cleanup function. We were moving more and more traffic to V2, which meant that the cleanup function was being called more often.
I like debugging. Well, I like having debugged. It can be maddening in the middle of it but there is a lot of satisfaction in solving a puzzle. A bug is a puzzle with stakes. If you are spending your time on a bug, that means its important. There’s also satisfaction in finally arriving at a simple, testable, theory of a bug.
I like this bug because it required us to understand so many pieces that make up the system: HTTP requests, session persistence, load balancing, and server clocks. It’s also nice when the fix is to adjust the clock instead of recode your entire app.
Image credit: Public Domain image NH 96566-KN The First “Computer Bug”