Solved: Bug in memcached client library .Net port causes inconsistent key distribution

Memcached is most popularly used in php and java applications, but it’s a great technology that can integrate with anything.  For those with programs in the .Net world, there is a .Net memcached client library, which is a port called of a java based memcached client create by a team led Greg Whalin.

When you are working with a network of memcached servers, there is no actual communication between the different nodes.  Each one simply accepts get/set requests independently from one another.  The only point that knows about the network is the actual client trying to use it, which has a list of all the nodes.  When it wants to access a value out, it hashes the key name against the list of nodes and selects a node to be the owner.  If will then check with that node for the value, and if it is missing, it will recreate the value and has hit there.

In a web farm, each server is independent, but as long as they have the same list of nodes and are using the same hashing algorithm, they will all act as a group select the same target node to hash any particular value.

At least this is how it is supposed to work in practice.  I think it’s a cool algorithm, and I wanted a way to see it in action, so I previously wrote a script that could query all the nodes in a memcached network and look for a particular value.  Everything looked good in my testing environment, but when we deployed the application, I noticed some very odd behavior – values were being duplicated all over the place.  In some cases, I was finding the same keys on half of the nodes in the network!

I was perplexed.  There were a couple of possible causes.  One would be that the clients weren’t operating off of a common list of servers, which would naturally throw off the hashing algorithm, but I quickly confirmed this wasn’t the case.  Everyone had the same identical list.

Another possibility would be node failures; since memcached is designed to be very robust, if a node goes down, the default behavior is to start looking at a different node to host it.  I started playing with individual servers, deleting values in the cache and then looking to see where they popped up.  What I saw is that it was very consistent – client A would always put its values on node 4 while client B would always put its values on node 7.  Since I knew my nodes were online, failures couldn’t explain the inconsistency.

I started debugging the .Net source code and soon found very behaviors.  I was passing in a list of 10 nodes to the client library, but when I would check the underlying list, there would be 16 nodes listed.  Some, but not all, of the nodes were duplicated, sometimes more than once.

After debugging for a while, I found the source of the problem.  When the client library initializes, it goes through the server list one-by-one creating connections and adding them to a pool.  Once the connection was created, it uses the code below associate it with the node:

if(_buckets.BinarySearch(host) < 0)

The code is using binary search to try to find a value in the array, and if it isn’t found, it adds the value to the end.  If it is intending to use a binary search, the data needs to be sorted.  Adding values to the end is going to create an unsorted jumble of data, causing the binary search algorithm to fail.  This causes nodes to get added to the list multiple times.

As it turns out, there is a very simple work-around to this problem (aside from modifying the source code) – just sort the list of nodes before they are passed in to the library.  If the list of nodes is sorted, the “_budgets.Add(host)” line will actually manage to keep the array in a sorted order, because each new entry will always belong at the end.

This could be accomplished just by adjusting the configuration data that contains the list, but my preference was to add an “Array.Sort” call to the string[] of nodes before passing it into the memcached client.  This would ensure that if sysops person ever tweaked the node list, they wouldn’t need to know that it would have to be sorted.

After we implemented the sort, our memcached array started behaving properly – keys were no longer duplicated across the nodes, but hosted on a single node as expected.

This entry was posted in Uncategorized and tagged , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s