Ian Fleming’s Goldfinger said, “Once is happenstance. Twice is coincidence. Three times is enemy action.” While managing web applications is not nearly as exciting as international spy thrillers, I often find myself thinking of the phrase when trying to debug phantom problems.
Phantom problems are the errors that come out of nowhere, go away just as quickly, and could have been caused by a thousand things. The specific error itself might be easy to understand (e.g. “there’s blocking in the database” or “the cookie was missing from the request”), but finding out what caused it is much harder. Good tools and logging can help, but oftentimes there is no smoking gun.
Once is Happenstance
It’s easy to discount a lone example. With so many different components involved (databases, web servers, load balancers, web browsers, users, etc.), some of which are outside of your control, there often is not enough to go on to start tracking it down. If the error is not continuing to happen, it’s not clear if there is anything to actually fix. Hey, maybe it was electromagnetic radiation from a solar flare!
Twice is Coincidence
Things get more serious when the phantom error happens a second time. Since it happened again, you can rule out the possibility of a one-time freak occurrence. The issue is real.
With two examples, it is possible to start drawing some real conjectures about what is going on. Other events start to correlate. “Hey, client X was loading data during both events” or “Server A logged a network anomaly a few minutes before each one”. You might even go so far as to form a theory about ways to work around it.
The danger here is that you are very vulnerable to just seeing coincidences. Sure, those events correlated, but there were a thousand things happening. You could be seeing a connection that is meaningful, but it might just be random.
More significantly, if you do try making a change, you will not have good confidence that the problem is really fixed. Sure, the error might not be happening anymore, but you’ve only seen it twice. You don’t know how often the error can be expected to occur, so you won’t know if you actually solved a problem, or if it just went away on its own.
Three Times is Enemy Action
When the error happens a third time, you know you have a real, recurring problem. It’s not going to go away. And you now have vital information for trying to fix it. Time to get really busy.
With three instances, most of the meaningless coinciding events will vanish (“Client X wasn’t loading any data this time”). Remaining correlations are much more likely to be related, since they have now happened three separate times. You also get a sense about whether there is a time-element to the pattern, such as an error that always occurs at 10:30 am or always occurs at 20 minutes past the hour.
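The winnowing described above is essentially a set intersection. Here’s a minimal sketch of the idea; the incident data and event names are invented for illustration:

```python
# Hypothetical events logged near each of three occurrences of the error.
# Coincidences appear in one or two incidents; real leads appear in all three.
incident_events = [
    {"client_x_load", "server_a_network_anomaly", "backup_job"},
    {"client_x_load", "server_a_network_anomaly", "cache_flush"},
    {"log_rotation", "server_a_network_anomaly"},
]

# Events common to every occurrence are the strongest candidates.
common = set.intersection(*incident_events)
print(common)  # {'server_a_network_anomaly'}
```

With only two incidents, "client_x_load" would have survived the intersection too; the third occurrence is what makes it drop out.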
Knowing how often the error occurs also gives you a yardstick that allows you to evaluate whether a possible fix worked. If the three occurrences tell you that a machine reboots from a bug check about once a day, a couple of days without a reboot will give you a lot of confidence. Two examples would not be enough to conclude this.
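You can put a rough number on that confidence. Assuming, for the sake of illustration, that the error arrives like a Poisson process at the rate the three sightings suggest, the chance of a quiet stretch happening on its own is easy to compute:

```python
import math

# Assumed rate, inferred from the three sightings: about once per day.
RATE_PER_DAY = 1.0

def chance_of_silence(days, rate=RATE_PER_DAY):
    """Probability of seeing zero occurrences in `days` days if nothing changed."""
    return math.exp(-rate * days)

# Two quiet days after a fix: roughly a 13.5% chance it's just luck.
print(round(chance_of_silence(2), 3))  # 0.135

# A week of silence: under a 0.1% chance, so the fix very likely worked.
print(round(chance_of_silence(7), 4))  # 0.0009
```

This is a back-of-the-envelope sketch, not rigorous statistics, but it shows why you need a baseline rate before a stretch of silence means anything.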
Just to be clear, I am not saying to ignore errors until they occur three times – jump on them immediately. Use all the tools at your disposal to track down root cause, including error logs, database diagnostic tools like Idera’s SQL Diagnostic Manager, web session recorders like HttpWatch, etc. Just know that you do not have enough information to see a pattern, so beware misleading conclusions.
And, yes, I know that three errors is not truly “statistically significant”. But, we humans are incredibly good at finding patterns for a reason. Three examples is often enough.