The ABA Problem
published: Fri, 15-Sep-2006 | updated: Sun, 17-Sep-2006
Um, I have an apology to make. There is a very subtle bug in my C# implementations of the various lock-free data structures. Sigh. And by subtle I mean very subtle, like the difference in taste between Vittel and Evian.
It was weird, as multithreaded bugs usually are. Very rare, very intermittent, very annoying. In fact the only way I found it was when testing the rewrite of my lock-free code to Delphi32. I even went down this path of wondering how I'd managed to screw up the translation of the code to Delphi.
In the end, I stopped and pushed my chair away from the keyboard. Time for reflection with a pad of paper and a pen. The symptom of the problem was that some of the items in the queue seemed to disappear, as if the internal linked list was being corrupted. I suspected the free list: for some reason that free list had been the bane of my lock-free existence.
I reasoned like this. Suppose a thread gets a hold of a node as it's dequeuing an item. Let's imagine the thread freezes at that point while other threads are enqueuing and dequeuing. As a part of this activity, the node that's being held by the frozen thread is going to be reused over and over.
Let's be precise here. Although the pointer to the node will not change, the contents (that is, the item of data and the Next pointer) of that node will. However our frozen thread will not notice: all it has is a copy of the address of the node, but not of the node's data. So it is entirely possible that the conditions for the CAS operation could be satisfied, even though the node is in fact different.
And therein lies the problem. As it happens, it has a name: the ABA problem. (If you're reading this on my website I do apologize for the rather bad visual pun at the top of the page.) It's called "ABA" not as an abbreviation but because of how the problem manifests itself: the first thread thinks the node has contents A, another thread changes its contents to B, but when the first thread uses the node it still thinks the data is A.
What is the solution? well, by far the simplest for us C# developers is to simply chuck the free list into the bit bucket. In a garbage-collected environment like .NET, the node will only get disposed of (and the memory eventually reused) once no more threads have a reference to it, which is exactly what we want.
For a non-garbage-collected environment, such as Delphi32, we have an enormous problem. The algorithm is incredibly difficult and convoluted (and in fact its discoverer, John D. Valois, even had a race condition in his first published implementation). For Delphi, we could probably use the quasi garbage collection facilities of interfaces, but I've yet to prove this.
I've now updated the downloadable source code of the lock-free data structures to remove the free list.