When analyzing performance, looking at 95th percentile response times is one of the most useful metrics. While the average can be a good indicator, it
can be heavily skewed by lots of small requests or a few extremely long ones. Furthermore, when you look at the average number, you know that half your users got a response faster than the average, but half got a response that was worse.
The 95th percentile gives you a much better sense of response time, because it tells you what was seen by 19 out of every 20 users. If you are happy with your 95th percentile response time, you can be confident that it really is representative of the majority of your users’ experience.
The problem I run into is that the 95th percentile is a real pain to compute. Unlike the average (sum all the values and then divide by the count), there is no easy formula that you can compute in one or two lines. Generally, when I needed to compute it, I would have to open it up in Excel and use the percentile function. This is a pain for a bunch of reasons:
· You have to export your data to a machine where excel is installed and then manually load it
· Percentile just takes a range of cells, so if you want to do fine grained calculations (like 95th percentile for each distinct url in the log file), you have to go through real acrobatics like excel array functions to narrow down to just the relevant cells, or else do a lot of copy and pasting
Really what I wanted was a simple script that I could use on the command line to take a web log file (the kind you get from IIS or Tomcat or JBoss) and have it spit out the 95th percentiles for me, so I finally took the time to write one.
For a variety of reasons, I like to use awk for these kinds of tasks. Many people rightly point to the greater power and sophistication of perl, but awk has a key advantage for me – it’s part of the basic linux/cygwin installation. Many times I want to run this kind of analysis straight on the log files in a production environment, and hardened servers often don’t have perl installed at all. I would need to go through the hoops of exporting the data to another machine/environment or getting approval to have perl installed. The purpose of a tool like this to do quick, command line analysis to get pointed in the right direction, so being able to do it even on a basic environment is valuable.
In order to keep the awk script simple, I want to feed it data in an organized fashion. I’d like all the response for a page to come together, and I want them in order of increasing response time. This is best done with a sort command that can pipe the results into the awk script. Most log file lines are space delimited, like this:
· [11/Dec/2010:00:17:16 -0500] 220.127.116.11 GET /site/home ?login=1 200 24141 183
In the example above, url is in position 5 (as delimited by spaces), and the response time is in position 9. What I want to do is sort first by the url, then by the response time, and then cut away all the other columns:
· cat logfile | sort -t ” ” -k 5,5 -k 9,9n –ignore-case | cut -d ” ” -f 5,9
Some of these sort command0line options are not frequently used, so here is what each one does:
· -t ” ” – Tells sort to use the space character as a column delimiter
· -k 5,5 – Sort first by column 5
· -k 9,9n – Next sort by column 9, but treat the value as a number, so 175 should come before 1289
· –ignore-case – Treat text as case-insensitive, since I don’t care about case differences when computing page times
Okay, so now I have my log file sorted by with all urls together, and response time steadily increasing. Now I need to compute the page times.
I am doing this somewhat crudely – for each page, I keep track of all of the response time numbers in an array, and when I hit the last one, I just grab the one that is at the 95th percentile position. The biggest issue here is that it requires keeping the response numbers in memory, but since it is only for one url at a time, you would need an enormous log file on a machine with very limited ram to run into the issue.
Here’s the script:
As I said, it is crude, but effective. It differs from Excel’s numbers in that it doesn’t do fancy things like interpolate a 95th percentile if there are fewer than 100 values. However, this script is more for triage, and numbers at those low traffic levels lack statistical significance anyways.
Putting it all together, you would run this on the log file:
· cat logfile | sort -t ” ” -k 5,5 -k 9,9n –ignore-case | cut -d ” ” -f 5,9 | awk -f compute_performance.awk
And you would get an output like this:
From here, you can sort by hit frequency, 95th percentile time, etc. You can also use it to compute 95th percentiles by other metrics in the log file, such as by hour or by ip address.