My Favorite Bug

Years ago, at a financial services company…

It started with some mild grumbling. Some people were complaining about getting signed out on our customer portal. But it was probably just user error, right? The reports seemed to happen more and more though. Then, employees who worked on the product started to get signed out.

At this point we have to admit that there is a problem. Every so often someone is using the site and they are suddenly booted out.

It happens seemingly at random. If you have a bug that you cannot reliably reproduce, your goal is to gather enough information until you can. “Seemingly at random” is a euphemism for “caused by some piece that we aren’t considering.”

Being able to reproduce a bug is important. First, you want to be confident that you have fixed the issue. If you can’t be sure you’ll trigger the bug, you can’t be sure your fix worked. Second, the factors that go into triggering the bug give you insight into where the problem is. Working toward a consistent reproduction is helping you triangulate the problem. So we started gathering information in hopes that we could reproduce it.

One thing you can do is ask yourself “what changed?” One of the things that changed was that we were rolling out a new version of our in-house PHP framework. This felt important, but also not, since the affected site was on the old framework. What’s more, both frameworks used the same way to delete sessions. Something like:

sprintf('DELETE FROM `sessions` WHERE `expires_at` < %d', now());

Which is a pretty reasonable way to expire sessions. The session creation code was similarly reasonable

sprintf('INSERT INTO `sessions` (`cookie`, …, `expires_at`)  VALUES (:cookie, …, :expires )', $cookie, …, now() + SESSION_LENGTH);

Jason, one of our analytics folks, was able to add HTTP monitoring. We figured out what the problem looked like from the browser side (a request with a session id setting a different session id in less than n seconds) and started logging how often it happened.

What we learned from that the frequency was accelerating. The longer it took us to fix, the worse it was going to get.

We still cannot reliably reproduce this at this point, so we have to keep gathering information. I was tempted to deploy changes out to see if they worked. This is dangerous. If you don’t understand the problem, you don’t know if a fix is going to inadvertently make things worse. Remember: if it seems random, you don’t understand it.

Adam, one of the devs on the team, added logging to our session delete code. Before it ran the DELETE statement, it did something like

debug_log(query(sprintf('SELECT cookie, expires_at FROM `sessions` WHERE `expires_at` < %d', now())));

so we could see exactly which sessions were being deleted. That’s when things started to get weird.

We would compare that to our HTTP monitoring and find the session and see that it definitely should not have expired based on the timing. But the session logging says that it definitely should be deleted based on the time it was logged.

At this point we reach the “Eureka” moment. For context, “Eureka” is what old Archimedes shouted as he jumped out of the bathtub. While the bathtub is clearly useful in realizing that water displacement can measure volume, I don’t think enough credit is given to the bathtub as a place of thought in that story. I consider showering (along with dog walking, driving, and sleeping) to be an important part of debugging.

And so I was, when I was trying to figure out how to reconcile the two facts:

  1. Sessions were expiring before they should be
  2. Sessions were expiring on time

I realized the disconnect, hopped out of the shower and sent an email to our Unix admins asking them to check the clocks on our production servers. Sure enough, one of the load-balanced servers wasn’t running ntp and it’s clock had drifted out of sync.

What was happening was if the session code was called on the rogue server then now() was wrong. That meant that sessions created on it would live extra long, but sessions deleted on it would die young.

Why was it accelerating? It turns out that in V1 of our framework the session-delete code was mistakenly not hooked up, but that was fixed in V2. The way PHP works is that every n requests it calls the cleanup function. We were moving more and more traffic to V2, which meant that the cleanup function was being called more often.

I like debugging. Well, I like having debugged. It can be maddening in the middle of it but there is a lot of satisfaction in solving a puzzle. A bug is a puzzle with stakes. If you are spending your time on a bug, that means its important. There’s also satisfaction in finally arriving at a simple, testable, theory of a bug.

I like this bug because it required us to understand so many pieces that make up the system: HTTP requests, session persistence, load balancing, and server clocks. It’s also nice when the fix is to adjust the clock instead of recode your entire app.

Image credit: Public Domain image NH 96566-KN The First “Computer Bug”

DRYer than thou

At work we have a project (launching Monday, w00t!) that has a lot of cool things happening on the browser side. You know kids these days, what with their jQueries and Web 2.0s, they need whiz bang special effects in their websites. In my day, we only had one type of input field in our forms, and we liked it! Don’t know how lucky they are… but I digress.

One problem we ran into was that we needed to do the same calculation on the server side and the client side. Let’s say for the sake of example that we need to calculate Michigan’s 6% sales tax or California’s 7.25% sales tax. The real calculation was only slightly more complex. One thing to note is that the calculation would need to be very responsive because it would be running a lot and affecting user input, so speed is a concern.

I came up with two options. The first is to write the same calculation in PHP for the server side and JavaScript for the client side. I didn’t love this because it isn’t DRY. If we had two calculations in two different files then when someone was updating the code they would likely miss the other one.

The other option I came up with was to put the calculation on the server side only and use AJAX to run it on the client side. This seemed overly complex, which will sound funny when you see what we decided on. We had avoided AJAX on the site because it wasn’t really needed; adding it for this seemed like a bad value/complexity trade off. I was also worried that server lag could cause a bad user experience, as I pointed out above the calculation needed to be fast.

So I was talking it over with my coworker Matt and he made a joke and we both laughed it off. Then we thought about it, and realized there there were fewer downsides to his approach than my two, so I started coding.

Continue reading “DRYer than thou”

PHP on Rails

The funny thing about working with Ruby on Rails is that its made me a better PHP developer.

A client wanted something developed and chose being able to support the app with their in-house PHP experience over Rails’ speed of development. I’m now using tools like DB_DataObject to emulate some of what I love about ActiveRecord. I’m making each PHP file its own controller, and using Smarty for the V in MVC. If you look at my include directory it looks suspiciously like a Rails directory.

Yes, I know about Cake, but I only have so much time to invest in learning frameworks, and I still prefer Rails.