Data Dependencies and wmb()
Version 1.0
Goal
Support lock-free algorithms without inefficient and ugly read-side code.
Obstacle
Some CPUs do not support synchronous invalidation in hardware.
Example Code
Insertion into an unordered lock-free circular singly linked list,
while allowing concurrent searches.
Data Structures
The data structures used in all these examples are
a list element, a header element, and a lock.
struct el {
    struct el *next;
    long key;
    long data;
};
struct el head;
spinlock_t mutex;
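The header element doubles as a sentinel: the search loops below terminate when they wrap back around to &head, so the list must start out circular, with head.next pointing at the header itself. A minimal initialization sketch (the list_init() helper is hypothetical, not part of the original example):

void list_init(void)
{
    head.next = &head;    /* empty circular list: the header points to itself */
    spin_lock_init(&mutex);
}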
Search and Insert Using Global Locking
The familiar globally locked implementations of search() and insert() are as follows:
struct el *insert(long key, long data)
{
    struct el *p;
    p = kmalloc(sizeof(*p), GFP_ATOMIC);
    spin_lock(&mutex);
    p->next = head.next;
    p->key = key;
    p->data = data;
    head.next = p;
    spin_unlock(&mutex);
    return (p);
}
struct el *search(long key)
{
    struct el *p;
    p = head.next;
    while (p != &head) {
        if (p->key == key) {
            return (p);
        }
        p = p->next;
    }
    return (NULL);
}
/* Example use. */
spin_lock(&mutex);
p = search(key);
if (p != NULL) {
    /* do stuff with p */
}
spin_unlock(&mutex);
These implementations are quite straightforward, but are subject to locking bottlenecks.
Search and Insert Using wmb() and rmb()
The existing wmb() and rmb() primitives can be used to do lock-free insertion. The
searching task will either see the new element or not, depending on the exact timing,
just like the locked case. In any case, we are guaranteed not to see an invalid pointer,
regardless of timing, again, just like the locked case. The problem is that wmb() is
guaranteed to enforce ordering only on the writing CPU --
the reading CPU must use rmb() to keep the ordering.
struct el *insert(long key, long data)
{
    struct el *p;
    p = kmalloc(sizeof(*p), GFP_ATOMIC);
    spin_lock(&mutex);
    p->next = head.next;
    p->key = key;
    p->data = data;
    wmb();
    head.next = p;
    spin_unlock(&mutex);
    return (p);
}
struct el *search(long key)
{
    struct el *p;
    p = head.next;
    while (p != &head) {
        rmb();
        if (p->key == key) {
            return (p);
        }
        p = p->next;
    }
    return (NULL);
}
(Note: see read-copy update for information on how to delete elements from this list
while still permitting lock-free searches.)
The rmb()s in search() cause unnecessary performance degradation on CPUs (such as the
i386, IA64, PPC, and SPARC) where data dependencies result in an implied rmb(). In
addition, code that traverses a chain of pointers would have to be broken up in order to
insert the needed rmb()s. For example:
d = p->next->data;
would have to be rewritten as:
q = p->next;
rmb();
d = q->data;
One could imagine much uglier code expansion where there are more dereferences in a
single expression. The inefficiencies and code bloat could be avoided if there were a
primitive like wmb() that allowed read-side data dependencies to act as implicit rmb()
invocations.
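To make the expansion concrete, a hypothetical chain with one more level of dereference, d = p->next->next->data, would need an rmb() after each pointer load (q and r are temporaries introduced only for this illustration):

q = p->next;
rmb();
r = q->next;
rmb();
d = r->data;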
Why Do You Need the rmb()?
Many CPUs have single instructions that cause other CPUs to see preceding stores before
subsequent stores, without the reading CPUs needing an explicit rmb() if a data dependency
forces the ordering.
However, some CPUs have no such instruction, the Alpha being a case in point. On these
CPUs, a wmb() only guarantees that the invalidate requests corresponding to the writes
will be emitted in order. The wmb() does not guarantee that the reading CPU will process
these invalidates in order.
For example, consider a CPU with a partitioned cache, as shown in the following diagram:
[Figure: barrier (wmb, mb, rmb) and cache coherence -- a CPU whose cache is split into two banks]
Here, even-numbered cachelines are maintained in cache bank 0, and odd-numbered cache
lines are maintained in cache bank 1. Suppose that head is maintained in cache bank 0,
and that a newly allocated element is maintained in cache bank 1. The insert() code's
wmb() would guarantee that the invalidates corresponding to the writes to the next, key,
and data fields appear on the bus before the invalidate corresponding to the write to head.next.
But suppose that the reading CPU's cache bank 1 was extremely busy, with lots of pending
invalidates and outstanding accesses, and that the reading CPU's cache bank 0 was idle.
The invalidation corresponding to head.next could then be processed before those of the
three fields. If search() were executing just at that time, it would pick up the
new value of head.next, but, since the invalidates corresponding to the three fields
had not yet been processed, it could pick up the old (garbage!) values of those fields,
possibly resulting in an oops or worse.
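One possible interleaving, sketched here purely for illustration with head in the reading CPU's idle bank 0 and the new element's fields in its busy bank 1, looks like this:

Writing CPU                        Reading CPU
-----------                        -----------
p->next = head.next;
p->key = key;
p->data = data;
wmb();
head.next = p;                     invalidates arrive in order, but...
                                   bank 1 (fields) is busy: invalidates queued
                                   bank 0 (head) is idle: invalidate processed
                                   p = head.next;  (new element seen)
                                   d = p->data;    (stale bank-1 cacheline
                                                    still cached: garbage value)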
Placing an rmb() between the access to head.next and the accesses to the three fields fixes
this problem. The rmb() forces all outstanding invalidates to be processed before any
subsequent reads are allowed to proceed. Since the invalidates corresponding to the three
fields arrived before the one for head.next, this guarantees that if the new value of
head.next was read, then the new values of the three fields will also be read.
No oopses (or worse).
However, all these rmb()s add complexity, are easy to leave out, and hurt performance on
all architectures. And if you forget a needed rmb(), you end up with very intermittent
and difficult-to-diagnose memory-corruption errors. Just what we don't need in Linux!
So, there is strong motivation for a way of eliminating the need for these rmb()s.
Solutions for Lock-Free Search and Insertion
Search and Insert Using wmbdd()
It would be much nicer (and faster, on many architectures) to have a primitive similar to
wmb(), but that allowed read-side data dependencies to substitute for an explicit rmb().
It is possible to do this (see patch). With such a primitive, the code looks as follows:
struct el *insert(long key, long data)
{
    struct el *p;
    p = kmalloc(sizeof(*p), GFP_ATOMIC);
    spin_lock(&mutex);
    p->next = head.next;
    p->key = key;
    p->data = data;
    wmbdd();
    head.next = p;
    spin_unlock(&mutex);
    return (p);
}
struct el *search(long key)
{
    struct el *p;
    p = head.next;
    while (p != &head) {
        if (p->key == key) {
            return (p);
        }
        p = p->next;
    }
    return (NULL);
}
This code is much nicer: no rmb()s are required, searches proceed
fully in parallel with no locks or writes, and there is no intermittent data corruption.
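How might wmbdd() itself be implemented? The following is purely an illustrative sketch, not the patch referenced above: non-Alpha architectures, where a data dependency already implies the needed read ordering, can map wmbdd() straight to wmb(), while an Alpha-style CPU could force every other CPU through a memory barrier, so that all outstanding invalidates are processed before the subsequent store to head.next is performed.

/*
 * Illustrative sketch only -- not the actual patch.  Assumes the
 * kernel's smp_call_function() facility for running a function on
 * all other CPUs and waiting for completion.
 */
static void wmbdd_sync(void *unused)
{
    mb();    /* make this CPU process its outstanding invalidates */
}

#ifdef CONFIG_ALPHA
#define wmbdd() \
    do { wmb(); smp_call_function(wmbdd_sync, NULL, 1, 1); } while (0)
#else
#define wmbdd() wmb()    /* data dependency provides the read-side ordering */
#endif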
Search and Insert Using read_barrier_depends()
Introduce a new primitive read_barrier_depends() that is defined to be an rmb() on
Alpha, and a nop on other architectures. This removes the read-side performance
problem for non-Alpha architectures, but still leaves the read_barrier_depends() calls
in the read-side code. It is almost possible for the compiler to do a good job of
generating these (assuming that a "lockfree" gcc struct-field attribute is created
and used), but, unfortunately, the compiler cannot reliably tell when the relevant lock
is held. (If the lock is held, the read_barrier_depends() calls should not be generated.)
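A sketch of what such a definition might look like, based purely on the description above rather than on the actual patch:

/* Assumed sketch: rmb() on Alpha, a no-op on CPUs where the data
 * dependency already orders the dependent read. */
#ifdef CONFIG_ALPHA
#define read_barrier_depends()    rmb()
#else
#define read_barrier_depends()    do { } while (0)
#endif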
After discussion on lkml, it was decided that placing explicit
read_barrier_depends() calls is the appropriate thing to do in the Linux kernel. Linus also
suggested that the barrier names be made more explicit. With such primitives,
the code looks as follows:
struct el *insert(long key, long data)
{
    struct el *p;
    p = kmalloc(sizeof(*p), GFP_ATOMIC);
    spin_lock(&mutex);
    p->next = head.next;
    p->key = key;
    p->data = data;
    write_barrier();
    head.next = p;
    spin_unlock(&mutex);
    return (p);
}
struct el *search(long key)
{
    struct el *p;
    p = head.next;
    while (p != &head) {
        read_barrier_depends();
        if (p->key == key) {
            return (p);
        }
        p = p->next;
    }
    return (NULL);
}
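For comparison with the locked example use shown earlier, the read side now needs no lock at all; only updaters serialize on the mutex inside insert(). A sketch, safe only because elements are never deleted (see the read-copy update note above):

/* Lock-free read side: no spin_lock() around search(). */
p = search(key);
if (p != NULL) {
    /* do stuff with p */
}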
A preliminary patch for this is barriers-2.5.7-1.patch. Future releases of this
patch can be found, along with the RCU package, here.
Other Approaches Considered
Just make wmb() work like wmbdd(), so that data dependencies act as implied rmb()s.
Although wmbdd()'s semantics are much more intuitive, there are a number of uses of
wmb() in Linux that do not require the stronger semantics of wmbdd(), and strengthening
the semantics would incur unnecessary overhead on many CPUs--or require many changes to
the code, and thus a much larger patch.
Just drop support for Alpha. After all, Compaq seems to be phasing it out, right? There
are nonetheless a number of Alphas out there running Linux, and Compaq (or perhaps HP)
will be manufacturing new Alphas for quite a few years to come. Microsoft would likely
focus quite a bit of negative publicity on Linux's dropping support for anything (never
mind that they currently support only two CPU architectures). And the code to make Alpha
work correctly is not all that complex, and it does not impact performance of other CPUs.
Besides, we cannot be 100% certain that there won't be some other CPU lacking a
synchronous invalidation instruction...