I thought I had a full grasp of how Akamai routing works, but a recent dive into our web logs has taught me that the world of high-speed internet traffic routing is more nuanced than I realized.
One of the big selling points of using Akamai is their ability to route traffic around internet bottlenecks. The public internet routes traffic using plain old Border Gateway Protocol (BGP), which is designed to find a path from point A to point B based on how networks are interconnected. However, it is not concerned with throughput speeds for those networks, so if there is congestion at a particular node, BGP will continue to send traffic through it, effectively slowing everybody down.
In contrast, Akamai is constantly measuring throughput on various network segments, which lets it identify alternate paths. For example, when there is a cable break in Asia, BGP tends to overload certain nodes as traffic shifts onto the remaining paths. Akamai can detect that those nodes are overloaded and choose different paths for better performance.
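To make the contrast concrete, here is a toy sketch, not real BGP (whose decisions also involve policies and AS paths), showing the same tiny network routed first by hop count and then by measured latency. The graph and all the numbers are invented for illustration:

```python
# Toy illustration: hop-based routing ignores congestion, latency-aware
# routing avoids it. Invented topology and numbers, for illustration only.
import heapq

def shortest_path(graph, src, dst, weight):
    """Dijkstra over edges weighted by `weight` ('hops' or 'latency_ms')."""
    queue, seen = [(0, src, [src])], set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nbr, attrs in graph[node].items():
            if nbr not in seen:
                heapq.heappush(queue, (cost + attrs[weight], nbr, path + [nbr]))
    return None

# Node C is congested: it offers the fewest hops but high measured latency.
graph = {
    "client": {"C": {"hops": 1, "latency_ms": 300}, "D": {"hops": 1, "latency_ms": 40}},
    "C":      {"origin": {"hops": 1, "latency_ms": 300}},
    "D":      {"E": {"hops": 1, "latency_ms": 40}},
    "E":      {"origin": {"hops": 1, "latency_ms": 40}},
    "origin": {},
}

print(shortest_path(graph, "client", "origin", "hops"))        # (2, via congested C)
print(shortest_path(graph, "client", "origin", "latency_ms"))  # (120, via D and E)
```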
All of Akamai’s edge servers around the world can communicate with each other using this proprietary routing protocol. There are other enhancements to this edge-server-to-edge-server network: they use a data transmission protocol that avoids TCP’s slow-start penalty, and all content is compressed to minimize the number of packets sent.
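To get a feel for how much slow start can matter, here is a back-of-envelope sketch. The MSS and initial window values are standard TCP assumptions, not Akamai’s actual numbers:

```python
# Back-of-envelope: how many round trips TCP slow start adds to a response.
# Assumptions (not Akamai's real numbers): a 1460-byte MSS, an initial
# congestion window of 10 segments (RFC 6928), and a window that doubles
# every RTT until the transfer completes.

def rtts_under_slow_start(response_bytes, mss=1460, init_window=10):
    """Count round trips needed to deliver response_bytes from a cold start."""
    sent, window, rtts = 0, init_window * mss, 0
    while sent < response_bytes:
        sent += window   # each RTT we can send one full congestion window
        window *= 2      # slow start doubles the window per RTT
        rtts += 1
    return rtts

size = 500 * 1024  # a 500 KB page
print(f"cold connection: ~{rtts_under_slow_start(size)} RTTs")
# A persistent, already-ramped connection between edge servers can push
# the same bytes in roughly one round trip once the window is large.
```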
I had always understood the network to function like the diagram above. Normal, non-Akamai traffic meant a direct connection from the end-user machine to the host data center (route 1). Akamai traffic meant that the client connected to a nearby edge server, where static content was served out of its cache, and remaining content traveled over their high speed protocols (route 2) to another Akamai edge server near the host data center, at which point the traffic reverted back to normal protocols. Of course, the diagram above assumes that the end user’s DNS is properly configured.
Shouldn’t All Requests Come From Nearby?
It turns out that the diagram above is not the only routing scenario.
I was recently researching some data in our web logs, and I came across something very surprising. When I geolocated the IP address of one of the edge servers I saw connecting, I discovered that it was actually located in China, despite the fact that our origin servers for this site are in the United States.
As I dug deeper into our web logs, I found more than one thousand different Akamai edge servers connecting to our machines. While the traffic was concentrated with servers in the United States close to the origin, I did find requests from servers on the far side of the country, Europe, and other parts of the world as well.
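For anyone curious to reproduce this, the analysis boils down to counting distinct client IPs in the access log and geolocating them. Here is a rough sketch, assuming a combined-format log and a downloaded MaxMind GeoLite2 database; both file paths are placeholders:

```python
# Sketch: count distinct connecting IPs in an access log and geolocate them.
# "access.log" and "GeoLite2-City.mmdb" are placeholder paths; swap in your
# own log parser and a GeoLite2 database you have downloaded from MaxMind.
from collections import Counter
import geoip2.database  # pip install geoip2
import geoip2.errors

edge_ips = Counter()
with open("access.log") as log:
    for line in log:
        parts = line.split()
        if parts:
            edge_ips[parts[0]] += 1  # client IP is the first field

with geoip2.database.Reader("GeoLite2-City.mmdb") as reader:
    for ip, hits in edge_ips.most_common(20):
        try:
            rec = reader.city(ip)
            where = f"{rec.city.name}, {rec.country.iso_code}"
        except geoip2.errors.AddressNotFoundError:
            where = "unknown"
        print(f"{ip:>15}  {hits:>7} requests  {where}")
```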
Based on my understanding shown in the diagram above, I couldn’t see a reason why I would ever get requests from so far away. Shouldn’t it all be routed over Akamai’s high speed network, with requests always coming from an edge server near our origin? After talking to Akamai, I discovered that while this is usually the case, the reality is more nuanced.
Three Possible Routes
Akamai’s proprietary routing algorithm, dubbed “SureRoute”, repeatedly measures the transmission speed between an edge server and an origin for three possible routes. Two of the routes are exactly as I described above – the Akamai edge server talks over their high speed protocols to another edge server near the origin.
However, the third route is a “classic” route, where the edge server talks over normal BGP / TCP/IP protocols directly back to the origin:
The Akamai edge server will use the third route when it finds that it is actually faster than the “dynamic” routes through other edge servers. It could be that there is congestion along the other paths, or that the other edge servers are unavailable for some reason.
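Conceptually it is a race: periodically time each candidate path to the origin and use whichever is currently fastest. The sketch below is not Akamai’s implementation, and the probe URLs are hypothetical, but it captures the idea:

```python
# A minimal sketch of the "race" idea, not Akamai's actual implementation.
# The route names and probe URLs below are hypothetical.
import time
import urllib.request

def probe(url, timeout=5.0):
    """Return seconds to fetch url, or infinity if the probe fails."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
    except OSError:
        return float("inf")  # unreachable routes lose the race automatically
    return time.perf_counter() - start

routes = {
    "dynamic-path-1":   "https://parent-edge-1.example.net/test-object",
    "dynamic-path-2":   "https://parent-edge-2.example.net/test-object",
    "direct-to-origin": "https://origin.example.com/test-object",
}

timings = {name: probe(url) for name, url in routes.items()}
best = min(timings, key=timings.get)
print(f"fastest route right now: {best} ({timings[best]:.3f}s)")
```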
When this route is selected, the edge server talks directly to the origin. This is why we see such a long list of edge servers in our logs, and why those servers are occasionally very far away.
I traced a session from one of these far-away edge servers and saw that its requests lasted only a few minutes. Soon after, requests for that session started coming from another edge server much closer to us. Presumably, for a short period, the “traditional” route was determined to be faster than the “dynamic” routes, so the network fell back to good old BGP routing instead.
Impact of a Traditional Route
In general, Akamai’s decision to use the classic route makes sense; if that route is currently the fastest way to get to the origin, then it should be chosen. Since the client is still connecting first to a nearby edge server, they still get the performance benefit of the static cache, and all remaining requests are traveling back to origin on the most optimal path available.
One downside, however, is that this path loses the other two benefits of the Akamai network: avoiding TCP slow start and data compression. The edge server is now talking directly to the origin, and it must use plain TCP/IP to do it. Furthermore, if the origin sends any data uncompressed, significantly more data ends up being transmitted over the internet than would have been sent using Akamai’s proprietary protocols. This could in theory rapidly undo any benefit of the traditional path, since a route that is 20% faster will still be slower if you transmit 3x as much data.
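Running the numbers from that example makes the point clear. The rates here are invented, only the ratios matter:

```python
# The back-of-envelope from the paragraph above, with invented numbers:
# a direct route that is 20% faster per byte still loses if the payload
# is uncompressed and three times the size.
compressed_bytes = 100_000            # payload over Akamai's compressed path
uncompressed_bytes = 3 * compressed_bytes

overlay_rate = 1.0e6                  # bytes/sec on the dynamic route (assumed)
direct_rate = 1.2 * overlay_rate      # direct route is 20% faster

overlay_time = compressed_bytes / overlay_rate    # 0.10 s
direct_time = uncompressed_bytes / direct_rate    # 0.25 s
print(f"overlay: {overlay_time:.2f}s  direct: {direct_time:.2f}s")
# 3x the data over a route only 1.2x as fast takes 2.5x as long.
```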
Still, TCP slow start probably accounts for a small percentage of total transfer time, and most origin servers can be configured to compress content before transmission.