Applying “Once is happenstance, twice is coincidence, three times is enemy action” to troubleshooting

Ian Fleming’s Goldfinger said, “Once is happenstance.  Twice is coincidence.  Three times is enemy action.”  While managing web applications is not nearly as exciting as international spy thrillers, I often find myself thinking of the phrase when trying to debug phantom problems.

Phantom problems are the errors that come out of nowhere, go away just as quickly, and could have been caused by a thousand things.  The specific error itself might be easy to understand (e.g. “there’s blocking in the database” or “the cookie was missing from the request”), but finding out what caused it is much harder.  Good tools and logging can help, but often there is no smoking gun.

Once is Happenstance

It’s easy to discount a lone example.  With so many different components involved (databases, web servers, load balancers, web browsers, users, etc.), some of which are outside of your control, there often is not enough to go on to start trying to track it down.  If the error is not continuing to happen, it’s not clear if there is anything to actually fix.  Hey, maybe it was electromagnetic radiation from a solar flare!

Twice is Coincidence

Things get more serious when the phantom error happens a second time.  Since it happened again, you can rule out the possibility of a one-time freak occurrence. The issue is real.

With two examples, it is possible to start drawing some real conjectures about what is going on.  Other events start to correlate.  “Hey, client X was loading data during both events” or “Server A logged a network anomaly a few minutes before each one”.  You might even go so far as to form a theory about ways to work around it.

The danger here is that you are very vulnerable to just seeing coincidences.  Sure, those events correlated, but there were a thousand things happening.  It could be a connection that is meaningful, but it might just be random.

More significantly, if you do try making a change, you will not have good confidence that the problem is really fixed.  Sure, the error might not be happening anymore, but you’ve only seen it twice.  You don’t know how often the error can be expected to occur, so you won’t know if you actually solved a problem, or if it just went away on its own.

Three Times is Enemy Action

When the error happens a third time, you know you have a real, recurring problem.  It’s not going to go away.  And you now have vital information for trying to fix it.  Time to get really busy.

With three instances, most of the meaningless coinciding events will vanish (“Client X wasn’t loading any data this time”).  Remaining correlations are much more likely to be related, since they have now happened three separate times.  You also get a sense about whether there is a time-element to the pattern, such as an error that always occurs at 10:30 am or always occurs at 20 minutes past the hour.
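A minimal Python sketch of checking for that time element, assuming you have the timestamps of each occurrence.  The function name and the two-minute tolerance are illustrative choices, not part of any tool mentioned here.

```python
from datetime import datetime


def shared_time_pattern(timestamps, tolerance_minutes=2):
    """Report whether all occurrences share a time of day or minute of the hour.

    Returns "time-of-day", "minute-of-hour", or None.
    """
    if len(timestamps) < 3:
        return None  # too few examples to call anything a pattern

    def spread(values, cycle):
        # Smallest arc on a circle of length `cycle` that contains all values,
        # so that 23:59 and 00:01 count as two minutes apart.
        ordered = sorted(values)
        gaps = [b - a for a, b in zip(ordered, ordered[1:])]
        gaps.append(ordered[0] + cycle - ordered[-1])
        return cycle - max(gaps)

    minutes_of_day = [t.hour * 60 + t.minute for t in timestamps]
    minutes_of_hour = [t.minute for t in timestamps]
    if spread(minutes_of_day, 24 * 60) <= tolerance_minutes:
        return "time-of-day"
    if spread(minutes_of_hour, 60) <= tolerance_minutes:
        return "minute-of-hour"
    return None


if __name__ == "__main__":
    # Made-up occurrences: three different hours, all about 20 past the hour.
    seen = [
        datetime(2010, 11, 1, 9, 20),
        datetime(2010, 11, 1, 13, 21),
        datetime(2010, 11, 2, 17, 19),
    ]
    print(shared_time_pattern(seen))
```

The sketch deliberately refuses to answer with fewer than three timestamps, which mirrors the point above: two occurrences will almost always look like they share something.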

Knowing how often the error occurs also gives you a yardstick that allows you to evaluate whether a possible fix worked.  If the three occurrences tell you that a machine reboots from a bug check about once a day, a couple of days without a reboot will give you a lot of confidence.  Two examples would not be enough to conclude this.
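One rough way to put a number on that yardstick is to treat the error as a Poisson process, meaning it arrives independently at a steady average rate.  That is an assumption of this sketch, not something three occurrences can prove.  Under it, the chance of a quiet stretch happening purely by luck is exp(-rate × time).  The function name and the sample rate are illustrative.

```python
import math


def chance_of_quiet_by_luck(errors_per_day: float, quiet_days: float) -> float:
    """Probability of seeing zero errors in quiet_days if nothing was fixed.

    Assumes errors arrive independently at a steady average rate
    (a Poisson process), which real systems only approximate.
    """
    if errors_per_day < 0 or quiet_days < 0:
        raise ValueError("rate and duration must be non-negative")
    return math.exp(-errors_per_day * quiet_days)


if __name__ == "__main__":
    # Hypothetical case: roughly one bug-check reboot per day before the fix.
    for days in (1, 2, 3, 5):
        p = chance_of_quiet_by_luck(1.0, days)
        print(f"{days} quiet day(s): {p:.1%} chance it is just luck")
```

Each extra quiet day shrinks the chance that the silence is a coincidence.  With only two occurrences you would barely have a rate to plug in at all.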

The Caveat

Just to be clear, I am not saying to ignore errors until they occur three times – jump on them immediately.  Use all the tools at your disposal to track down the root cause, including error logs, database diagnostic tools like Idera’s SQL Diagnostic Manager, web session recorders like HttpWatch, etc.  Just know that you do not yet have enough information to see a pattern, so beware of misleading conclusions.

And, yes, I know that three errors is not truly “statistically significant”.  But, we humans are incredibly good at finding patterns for a reason.  Three examples is often enough.

 

Posted in Uncategorized | Tagged , | Leave a comment

#iMovie randomly not saving my project is solved – it’s the zombie project’s fault

A few weeks ago, my infatuation with my new Mac and iMovie was crushed when iMovie lost six hours of work.  I had been putting together a photo/video montage and was just putting on the finishing touches when all of a sudden my project became unplayable.  As far as I have been able to determine, this is iMovie’s behavior when a project file goes corrupt.

I still haven’t forgiven iMovie for this.  There was no specific triggering event that caused it to die, and the lack of any sort of error message is downright annoying.  But the product is so good, I had resolved to start over, being more careful to back up after every few changes.  It also provided a great opportunity to learn about Time Machine.

So, I started putting the project back together.  While the old project was unplayable, it was still around in a kind of zombie state.  I was able to access parts of it by using mouse-overs of the project timeline, which allowed me to see the order of the pictures and videos.  This was useful, since I could use it as a guide for what pictures I had originally selected and what order I had put them in.

And then I discovered I had a new problem.  I would periodically quit iMovie in order to back up the project, but when I returned I sometimes found that iMovie had not saved my changes.  It was erratic – sometimes my most recent edits would be there, and sometimes not.  It was like playing roulette, and I was seriously questioning how Apple could release such a buggy product.

A large chunk of my day job involves trying to play Sherlock Holmes with strange system problems, looking for patterns in errors and trying to come up with theories that explain them.  I don’t believe in gremlins, and I figured there must be some pattern as to why iMovie would sometimes save my work and not other times.  There is no way a bug of this magnitude could be widespread.

My suspicions soon fell on the zombie project, and after a few tests, I confirmed my theory.  Any time I would touch the zombie project to check on the order, iMovie was going into a silent failure mode.  It would let me make changes to the other project, and even allow me to preview them, but it was no longer actually saving any of my work.  If I would avoid touching the zombie project, everything would be fine.

There are no error messages coming out of iMovie, and no warnings on the console.  This kind of silent failure, particularly around saving work, is a serious design flaw.

The real killer is that despite all this, the product puts to shame anything I have seen back in the PC world.  Rather than quitting, I am going to just recreate the project from scratch, without the help of being able to refer to the old one.  Now I just have two rules – back up often, and run far away from any corruption I encounter, since it is contagious.

Making great products sure does allow Steve Jobs and Apple to get away with some pretty major slip-ups.

 


Performance Tuning Tip: Turn on time-taken in your web logs (IIS, Apache, Tomcat) to identify performance hot spots

The web logs generated by web servers like IIS, Apache, and Tomcat (JBOSS / Catalina) are a well-known goldmine of information.  They allow you to track site usage, find content errors, gather browser statistics, and collect all sorts of other incredibly useful data.  Slap together some basic Linux (or Cygwin) commands like grep, sort, cut, and uniq, and you can do some very sophisticated analysis.  If that’s not enough, there are thousands of free and paid tools to help with the analysis.

One often overlooked benefit is the ability to gather detailed performance information.  With time-taken turned on, each hit records how long it took to process the request in milliseconds.  You can use the data to answer important performance-related questions like:

  • How does your response time fluctuate based on time-of-day?
  • How does your response time fluctuate based on server load?
  • How does the response time of your servers compare with each other (assuming you have a web farm)?
  • What pages are the slow pages?
  • Does response time correlate with the size of the page being returned?
  • Do specific IP addresses (i.e. customers) see higher response times than normal?
  • My customers are complaining about a performance blip – how long did it last?  Was it all pages or just specific ones?

Being able to answer questions like this is vital to being able to identify performance hot spots.  Even more importantly, it allows you to objectively measure performance improvements after you make them.
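As a minimal sketch of answering the “slow pages” question, the Python below averages time-taken per URL from W3C extended format logs, the format IIS writes.  It assumes the log contains a “#Fields:” directive that includes both cs-uri-stem and time-taken.  The function name and the sample rows are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def slowest_pages(log_lines, top=5):
    """Return (uri, average_ms, hits) tuples, slowest average first.

    Expects W3C extended log lines, where a '#Fields:' directive names
    the space-separated columns that follow.
    """
    fields = []
    timings = defaultdict(list)
    for line in log_lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue
        row = dict(zip(fields, line.split()))
        try:
            timings[row["cs-uri-stem"]].append(int(row["time-taken"]))
        except (KeyError, ValueError):
            continue  # time-taken not logged, or a malformed row
    ranked = [(uri, mean(ms), len(ms)) for uri, ms in timings.items()]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:top]


if __name__ == "__main__":
    # Made-up sample data, not taken from a real server.
    sample = [
        "#Fields: date time cs-method cs-uri-stem sc-status time-taken",
        "2010-11-01 10:30:01 GET /report.aspx 200 1800",
        "2010-11-01 10:30:02 GET /home.aspx 200 120",
        "2010-11-01 10:31:15 GET /report.aspx 200 2200",
    ]
    for uri, avg_ms, hits in slowest_pages(sample):
        print(f"{uri}: {avg_ms:.0f} ms average over {hits} hits")
```

Running the same function on logs from before and after a change is one simple way to measure an improvement objectively.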

Unfortunately, for reasons I cannot fathom, time-taken is not turned on by default in either IIS or Tomcat.  It is incredibly useful and easy to collect, so why not turn it on by default?

Here is how to do it in IIS 6.0:

  1. Open your website properties
  2. Make sure “Enable Logging” is turned on, and then click Properties
    Open_logging_properties

     

  3. Click on the “Advanced” tab
  4. Turn on the Time Taken option
    Activate_time_taken

Here is how to do it in Tomcat / JBOSS:

This will vary depending on your specific Tomcat / JBOSS configuration, but here is what worked for me:

  1. Open up <JBOSS_HOME>\server\default\deploy\jboss-web.deployer\server.xml
  2. Find the entry for “org.apache.catalina.valves.AccessLogValve”
  3. Make sure that “%D” is included in your format string (the valve’s pattern attribute)
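Once %D is being logged, the values are easy to pull back out.  The Python sketch below assumes the pattern is the common log format with %D appended (%h %l %u %t "%r" %s %b %D).  That pattern, the function name, and the sample lines are assumptions for illustration, so adjust the regular expression to match your own format string.  It is also worth checking the unit of %D for your version: older Tomcat releases log milliseconds, while newer ones follow Apache httpd and log microseconds.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed pattern: %h %l %u %t "%r" %s %b %D
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) (?P<taken>\d+)$'
)


def average_time_by_hour(log_lines):
    """Map hour of day (0-23) to the mean %D value of requests in that hour."""
    buckets = defaultdict(list)
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed pattern
        when = datetime.strptime(match["when"], "%d/%b/%Y:%H:%M:%S %z")
        buckets[when.hour].append(int(match["taken"]))
    return {hour: sum(v) / len(v) for hour, v in sorted(buckets.items())}


if __name__ == "__main__":
    # Made-up sample lines in the assumed format.
    sample = [
        '10.0.0.5 - - [01/Nov/2010:10:30:01 -0400] "GET /app/report HTTP/1.1" 200 5123 1800',
        '10.0.0.6 - - [01/Nov/2010:10:45:10 -0400] "GET /app/home HTTP/1.1" 200 812 200',
        '10.0.0.5 - - [01/Nov/2010:11:02:00 -0400] "GET /app/report HTTP/1.1" 200 5123 500',
    ]
    for hour, avg in average_time_by_hour(sample).items():
        print(f"{hour:02d}:00 average time taken {avg:.0f}")
```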

 

 

 


The DNS Challenges of Using Akamai Sites for Global Corporations

Akamai’s web application accelerator product has been a tremendous benefit for our global operations.  Using their tools, we have observed a 400% performance boost between the United States and Asia versus accessing our site over the “public” internet.  The benefits come from three areas:

  1. Caching static elements (JavaScript, CSS, images, etc.) at Akamai edge servers around the world
  2. Enhanced delivery speeds through Akamai’s proprietary internal protocols, which are able to avoid the TCP/IP slow start
  3. Routing around internet bottlenecks

However, I frequently find myself talking to customers who are saying that many of their global offices are experiencing performance problems.  When I look into it, I frequently discover that many of the benefits of Akamai’s web application accelerator are being negated by the IT infrastructures of large, global corporations.

More specifically, it is being negated by the way that many large corporations handle DNS resolution.

To understand why, you first need to understand a little bit about how Akamai works.

How Akamai Works

Akamai maintains “edge servers” all over the world.  When a site is connected to Akamai’s web application accelerator, its DNS entry is set up as a CNAME pointing to Akamai, rather than an A record.  When an end user tries to resolve the website, Akamai’s nameservers identify an edge server near the end user, and the user starts talking to that edge server, thinking it is the site.

Standard_akamai

Static elements may be served straight out of the edge server’s cache (if they have been requested by another user recently), and missing elements or non-cacheable pages are routed through the Akamai network to another edge server near the host.  That edge server makes the actual requests to the host site and passes the responses back through the network to the original edge server, and from there they are returned to the end user.

Since the edge servers are internally communicating using Akamai’s proprietary protocols and routing around bottlenecks, traffic can flow much faster than over the public internet.

The Role of DNS

The key to this entire process working properly is DNS.  Akamai identifies which edge server to send the end user to based on which DNS server made the request.  It assumes that the end user is located near this DNS server, so as long as it accelerates traffic to that region, the traffic should have only a short journey over the public internet.

However, in a corporate environment, there are generally multiple DNS servers involved.  Most offices will have some sort of internal DNS server.  When a machine wants to connect to a site, it will make a “recursive” name lookup request to its internal DNS server.  Since the internal DNS server cannot provide translations for external sites, it will simply forward the request to its own external DNS server and then return the answer to the client.  The client is generally unaware that multiple DNS servers were involved in handling the request.

Corporate_akamai

This still works fine, assuming that the external DNS server is close to the client, just like the internal DNS server is.

Global Corporate DNS Resolution

Unfortunately, global corporations are complex places with their own complex IT infrastructures.  They have large numbers of internal sites and resources, and they need their DNS servers to ensure that all internal users are able to access them.  Many times, rather than forwarding DNS requests to a nearby external DNS server, remote corporate offices will forward DNS requests to another internal DNS server back at the company’s headquarters, somewhere else in the world.

Corporate_akamai_with_remote_dns

And here is where the problems come in.  When trying to resolve external sites, the centralized DNS server needs to pass the requests off to its nearby external DNS server.  If the site is on Akamai, then it is the external DNS server near the central corporate server that makes the request to Akamai.  Akamai now concludes that the user is in a completely different part of the world.

In the example in the diagram above, Akamai sees that the DNS request came from a DNS server in Europe, so it returns the IP address of a European edge server and routes the traffic there.  It has no way of knowing that the end user was actually in Asia.  Now, the end user starts communicating over the public internet to the Akamai edge server in Europe.  At this point, most of the benefits of the Akamai network are lost.  The client is still travelling huge distances over congested networks using slow-start TCP/IP to reach the edge server.
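The effect can be shown with a toy Python model.  This is not Akamai’s actual mapping logic, only a sketch of the one property that matters here: the edge is chosen from the location of the DNS server that asked, never from the location of the end user.  The region names, function names, and IP addresses (taken from the reserved documentation ranges) are all hypothetical.

```python
# Hypothetical regions and edge addresses, for illustration only.
EDGE_BY_REGION = {
    "asia": "192.0.2.10",
    "europe": "198.51.100.20",
    "north-america": "203.0.113.30",
}


def pick_edge(resolver_region: str) -> str:
    """Choose an edge from the location of the DNS server that asked.

    The mapping service in this model never sees where the end user is.
    """
    return EDGE_BY_REGION[resolver_region]


def edge_for_user(user_region: str, external_resolver_region: str) -> dict:
    """Describe which edge a user ends up on and whether it is local."""
    edge_region = external_resolver_region
    return {
        "user_region": user_region,
        "edge_region": edge_region,
        "edge_ip": pick_edge(edge_region),
        "accelerated_locally": user_region == edge_region,
    }


if __name__ == "__main__":
    # Office in Asia resolving through a nearby external DNS server.
    print(edge_for_user("asia", "asia"))
    # Same office, but its DNS requests are forwarded to headquarters in Europe.
    print(edge_for_user("asia", "europe"))
```

In the second case the user in Asia is handed the European edge, so the long haul happens over the public internet instead of inside the accelerated network.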

Fixing the Problem

There is no way to fix the Akamai configuration to correct the problem.  Changes need to be made at each remote office in the global corporation to change how it resolves DNS.  They need to modify their servers to pass external DNS requests to a nearby external DNS server and only forward requests for internal sites to the central server (conditional forwarding).  Or, at the very least, pass off DNS requests for the Akamai site in question to a nearby external DNS server.
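The decision a conditionally forwarding office DNS server makes can be sketched in a few lines of Python.  This is only a model of the rule, not configuration for any particular DNS product, and the function name, domain suffixes, and server addresses are hypothetical.

```python
def choose_forwarder(hostname, internal_suffixes, central_dns, local_external_dns):
    """Return the DNS server an office resolver should forward a query to.

    Names under an internal suffix go to the central corporate server;
    everything else goes to a nearby external resolver.
    """
    name = hostname.rstrip(".").lower()
    for suffix in internal_suffixes:
        suffix = suffix.strip(".").lower()
        # Match the zone itself or any name beneath it, on a label boundary.
        if name == suffix or name.endswith("." + suffix):
            return central_dns
    return local_external_dns


if __name__ == "__main__":
    internal = ["corp.example"]      # hypothetical internal zone
    central = "10.0.0.53"            # hypothetical headquarters DNS server
    nearby = "192.0.2.53"            # hypothetical local external resolver
    for host in ("wiki.corp.example", "www.example.com"):
        print(host, "->", choose_forwarder(host, internal, central, nearby))
```

With a rule like this in place, lookups for the Akamai-hosted site leave from a resolver near the office, so the edge that comes back is near the office too.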

And this is where the real challenge is.  It involves educating IT professionals all over the world about how Akamai works and convincing them that they need to make a change on their end to fix the problem.  This requires overcoming language barriers and time differences, and being very persuasive.

Welcome to performance tuning in a globalized world.

 

Posted in Uncategorized | Tagged , , , | 6 Comments

According to Mini-Boden, UK companies cannot legally mail unpriced gifts to the US

With the holidays approaching, my wife and I have started looking for gifts for family members.  The Jewish holiday of Hanukkah falls very early this year, so we have had to get an earlier jump than normal.

My wife found a cute pair of matching nightgowns at Mini-Boden for my nieces.  Since they live on the other side of the country, we planned to have them shipped direct.  However, when I went to the online checkout, I could not find any options that would allow me to specify that they were a gift.  No options to include a gift message or to hide the amounts on the packing list.

I initially assumed this was a bad user interface, since I have seen similar problems before.  I once had a similar problem with Pottery Barn, and after contacting their customer support, I learned that on their site, you had to mark something as a gift when you added it to the checkout cart.

Figuring that Mini-Boden must also have some non-intuitive way of doing this, I emailed their technical support.  I was very surprised by their response (emphasis mine):

I am very sorry but as most of our merchandise is dispatched from our UK warehouse, we are unable to provide a gift service. These parcels are shipped via International Mail to US Post and must have prices evident on the outside of the parcel. Additionally, the invoice will be sent with the parcel and the prices will appear on the invoice.

If you would like a message printed on the invoice, you may call on our customer service line, 1-866-206-9508 and a representative will attempt this service for you. However, if the order has already been processed, we are unable to add a message.

If I understand this correctly, US law is requiring companies to include the prices on everything they ship into the US.  I assume this is due to some sort of customs or tariff requirements.

Or maybe this is just an incentive to try to convince people to make their purchases from companies in the US.

 


More about my plan to fix the deficit, using the NYTimes budget balancing tool

This morning’s Week in Review had a fascinating section on all of the options on how to fix the deficit.  It puts you in charge of the budget, allowing you to mark off all of the ways you would save in order to cut the budget by $1.355 trillion by 2030.  If you went to the page online, there was an interactive version.

I instantly thought that this was a great opportunity for me to put my money where my mouth is.  You can see my detailed plan at http://t.co/YBebbxQ.

Overall, my plan represented 67% tax increases and 33% budget cuts – can you guess I’m a liberal?  Yes, I believe in most of the government’s programs, and I am willing to pay higher taxes for them.

Getting to around $1 trillion was pretty easy:

  • Eliminate Earmarks ($14 billion) – get rid of projects that favor specific lawmakers’ home states
  • Reduce nuclear arsenal and space spending ($38 billion) – the future of warfare is less and less likely to be nuclear
  • Reduce navy and air force fleets ($24 billion) – build fewer ships and retire more than were scheduled
  • Enact medical malpractice reform ($13 billion) – so much of medical spending is based on trying to avoid getting sued
  • Increase the Medicare eligibility to age 68 ($56 billion) – We are living longer and healthier
  • Reduce Social Security benefits for those with high incomes ($54 billion) – Higher income people can afford it and are saving anyways
  • Use an alternate measure for inflation ($82 billion) – use an updated method of measuring inflation
  • Fix estate tax rate at $3.5 million ($45 billion) – That’s still a lot of money to share amongst the kids, tax free
  • Raise capital gains rates to 20% ($24 billion) – still much less than income tax rates, and this is an investment vehicle for the wealthy
  • Expire Bush tax cuts for > $250,000 ($115 billion) – that’s easy, they can afford it, and it certainly hasn’t boosted the economy while it has been in place
  • Payroll tax applies over $106,000 ($100 billion) – it’s nice when the extra savings kick in at the end of the year, but it’s only a bonus.  My budgeting is based on the normal payroll taxes
  • Eliminate loopholes, but keep taxes slightly higher ($315 billion) – tax rates would still go down, and the loopholes would still go away
  • Reduce mortgage-interest deduction by converting to credit ($54 billion) – I benefit from this, but when it comes down to it, it’s not why I own my house. Cutting the deficit was more important.
  • Bank tax ($103 billion) – they got us into the financial crisis. They are going to have to help get us out.

Getting the last $355 billion was tougher.  It came from two options:

  • Raise Social Security retirement age to 70 ($247 billion) – this is going to sting for many people, but the reality is that we are all living a lot longer and a lot healthier. We need to contribute more to Social Security to compensate.
  • Millionaire’s tax on income above $1 million ($95 billion) – 5.4% on income over $1 million dollars; the millionaires can do more

I left most of the military spending intact.  I pared it down for programs that were geared toward global warfare, but we need the troops in Afghanistan and Iraq to protect our security interests.  And cutting payrolls and contractors puts people out of work.

I’m hoping that the New York Times will do more with the data it is collecting.  I’d love to see aggregate data about how popular each cut was, and perhaps a breakdown of who cut what by region.

What a great crowdsourcing approach to the deficit.

If only it were so easy to actually get it done.


Windows 7 channels Windows Vista with endless “Do you want to save” security boxes. Just copy them dammit!

I had thought that in Windows 7 Microsoft had fixed the endless security warning dialog boxes that used to constantly show up in Windows Vista.

Today I’m trying to use a freeware text editor, but rather than having a normal installer, it is just a cabinet file with everything in a single directory.  I’m trying to extract it into a folder for easier use, and as it copies the approximately 30 .ini files, EVERY SINGLE ONE prompts me with this stupid message:

Do_you_want_to_save

Given that I am not actually overwriting a .ini file, the harm should be pretty limited.  But okay, Windows is just trying to warn me.

But couldn’t it at least give me a “do not show this message again” box, or at least an “apply to all” box?  Instead, I have to hit the button for each and every one of them.

Please, just copy them, dammit!

Brings back memories of the Mac / Windows commercial about the annoying security dialogs in Vista.

 


Locked out of the car while holding the baby: think positive!

The last time I locked myself out of the house was in the summer of 1995 when I was 20 years old. It was a dumb mistake, and ever since then, I have been paranoid about making sure that I had my keys, wallet, and cellphone on my person at all times.

So, I was dumbfounded at finding myself out of the house at night, locked out of my car, without my cell phone, holding a baby.  I don’t make mistakes like this.  Sure, I make my fair share of dumb mistakes, but not ones like this.

I was out picking up our farm share, and normally I take our two-year-old daughter to do this, but tonight I had my four-month-old son.  I normally have just one set of keys in my pocket, but today I had two since I had taken our second car to a mechanic during the day.  And normally, I remember to take my cell phone out of its holder on the dashboard, but the baby was crying, and I had rushed to take him out and pick up the farm share.   

When I got back to the car laden down with two bags of groceries, 15 pounds of baby, and 10 pounds of car seat, I reached into my pocket to hit the button to pop the trunk.  I wasn’t used to having two sets of keys in there, and I found myself fumbling with the wrong one.  In the end, I had to pull it out to look at it and make sure I had the right key.  When I had it, I popped the trunk, threw the bags in, and closed the lid.

When I got to the side of the car to open the door to put in the baby, I discovered my pocket now only had the keys to the other car in it, and the keys to this car were no longer in my hands.  I circled the car three times, but no keys.  I checked all my other pockets, and this left me with the only possible conclusion – I had dropped the keys into the trunk along with the bags.

And so, there I was.  Locked out of my car, away from home, without my cell phone, holding the baby.

I knew there was only one option.  I was going to have to walk home, explain to my wife what happened, get the other set, and walk back.  After about 20 seconds to get over the shock, I started trudging back home, lugging the 15 pounds of baby and 10 pounds of car seat.  As my shoulder muscles started to spasm from holding this very awkward load (car seat ergonomics do not allow them to be carried more than 20 feet with any comfort), I started to think this must be the worst night ever.

As soon as I thought this, I realized how ridiculous it was.  In general, I am a pretty positive person, and when things go wrong, I try to think about all of the things that could have made a problem much worse.  For example:

  1. The baby could have been locked inside of the car
  2. It could have been pouring rain, like it had been all day
  3. The baby could have been still screaming his head off, rather than sleeping finally
  4. It could have been miles from the house, rather than a 10 minute walk

Fortunately, my wife was very understanding when I got home, and it was a much faster walk sans baby to go back and get the car.

I’ll have to be doubly paranoid about my keys from now on.

 


Yahoo’s YUI limiting support for IE6. Will this hurt YUI use in the enterprise market? #browser

Yahoo announced today that they are going to be reducing support for IE6 in YUI.  They won’t drop support altogether, but they will be reducing it.  This is not good news.

Don’t get me wrong – I hate IE6. I hate it with the depths of my soul.  I loathe it.

That said, I also recognize that I am stuck with it.  My company is a SaaS provider for the enterprise market.  While IE6 usage has dropped, it is still going strong in the corporate market with over 50% of our traffic.  IE6 is not going anywhere any time soon.

Large corporations have much slower roll out times than consumers.  They have a limited IT staff to support a large user base, and many Microsoft websites like older versions of SharePoint do not work in Windows 7 / IE8.  They are starting to test the waters with upgrading some users to Windows 7, but Windows XP with its IE6 default will still be around for years to come.

With YUI starting to limit support for IE6, it makes the YUI toolset a much less attractive option for our development, since we cannot depend on newer versions to work with a large chunk of our user base.

Do other corporations feel the same way?  Will usage of YUI in the enterprise market start to drop?  I’m curious to see what happens.


My crush on my new mac just abruptly ended when #iMovie corrupted six hours of my work #fail

Oh, my dear mac…

Just a few hours ago, I was still starstruck by iMovie.  After having previously tried to make some movies on my PC many months ago, I was blown away by how easy it is to create movies in iMovie.  The ease of clipping videos… the way you could just mouse over to see where the action was… the themes… the transitions… it was everything a video editing program should be.

Seeing how easy it was, I dived in over the weekend and put together a movie of photos and videos of the last year and a half, starting from just after my daughter turned one up to the birth of my son (my son will get a dedicated video when he turns one, just like my daughter did).  I spent about six hours on it over the weekend, getting it just right.  Even my wife was impressed.

Then, tragedy struck. I was trying to show the video to my daughter, and it wouldn’t play.  It had been working fine last night, and I hadn’t made any edits to it since. I would try to play, and just a blank screen.  No errors, just not working.

It feels a little like being strung along – it won’t dump me, but it won’t return my phone calls either.  After googling around for a bit, I learned this is what happens when an iMovie project becomes corrupted.  I found some suggestions about how to edit the project, but they don’t seem to work for iMovie 11.  And, since iMovie 11 is so new, there isn’t much technical support available for it.

The saddest part is that despite this rejection, the software is so impressive that I can’t walk away. I know I’m going to end up just recreating it.

Except now I’m going to start backing it up at regular intervals.  I won’t fall for this twice.
