Re: [PATCH] Document Linux's memory barriers [try #2]

From: Linus Torvalds
Date: Wed Mar 08 2006 - 14:24:57 EST




On Wed, 8 Mar 2006, David Howells wrote:

> Alan Cox <alan@xxxxxxxxxx> wrote:
>
> > spin_lock(&foo->lock);
> > writel(0, &foo->regnum);
>
> I presume there only needs to be an mmiowb() here if you've got the
> appropriate CPU's I/O memory window set up to be weakly ordered.

Actually, since the different NUMA nodes may have different paths to the
PCI device, I don't think even the mmiowb() will really help. It has
nothing to serialize _with_.

It only orders mmio from within _one_ CPU and "path" to the destination.
The IO might be posted somewhere on a PCI bridge, and depending on the
posting rules, the mmiowb() just isn't relevant for IO coming through
another path.

Of course, to get into that deep doo-doo, your IO fabric must be separate
from the memory fabric, and the hardware must be pretty special, I think.

So for example, if you are using an Opteron with its NUMA memory setup
between CPUs over HT links, from an _IO_ standpoint it's not really
anything strange, since it uses the same fabric for memory coherency and
IO coherency, and from an IO ordering standpoint it's just normal SMP.

But if you have a separate IO fabric and basically two different CPUs can
get to one device through two different paths, no amount of write barriers
of any kind will ever help you.

So in the really general case, it's still basically true that the _only_
thing that serializes a MMIO write to a device is a _read_ from that
device, since then the _device_ ends up being the serialization point.

So in the extreme case, you literally have to do a read from the device
before you release the spinlock, if ordering to the device from two
different CPUs matters to you. The IO paths simply may not be
serializable with the normal memory paths, so spinlocks have absolutely
_zero_ ordering capability, and a write barrier on either the normal
memory side or the IO side doesn't affect anything.
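
As a rough sketch of that read-back pattern, something like the following.
This is a userspace mock-up, not driver code: mock_writel(), mock_readl(),
mock_spin_lock(), mock_spin_unlock() and foo_write_reg_flushed() are
made-up stand-ins for the kernel's writel(), readl(), spin_lock() and
spin_unlock(), and the struct layout is assumed from the foo->lock and
foo->regnum names in the quoted snippet. A plain variable cannot show
posting at all, so this only illustrates where the read goes.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical device state; field names follow the quoted snippet. */
struct foo {
	atomic_flag lock;		/* stand-in for spinlock_t */
	volatile uint32_t regnum;	/* stand-in for an MMIO register */
};

/* Userspace stand-ins for the kernel primitives (assumptions, not real APIs). */
static void mock_spin_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void mock_spin_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

static void mock_writel(uint32_t val, volatile uint32_t *addr)
{
	*addr = val;
}

static uint32_t mock_readl(const volatile uint32_t *addr)
{
	return *addr;
}

/*
 * Write a register and read it back before dropping the lock.  On real
 * hardware the read cannot complete until the posted write has reached
 * the device, so the device itself becomes the serialization point.
 */
static uint32_t foo_write_reg_flushed(struct foo *foo, uint32_t val)
{
	uint32_t seen;

	mock_spin_lock(&foo->lock);
	mock_writel(val, &foo->regnum);
	seen = mock_readl(&foo->regnum);	/* flush the posted write */
	mock_spin_unlock(&foo->lock);

	return seen;
}
```

The point of doing the read _inside_ the locked region is that the next
CPU to take the lock cannot issue its own write until ours is known to
have arrived at the device.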

Now, I'm by no means claiming that we necessarily get this right in
general, or even very commonly. The undeniable fact is that "big NUMA"
machines need to validate the drivers they use separately. The fact that
it works on a normal PC - and that it's been tested to death there - does
not guarantee much of anything.

The good news, of course, is that you don't use that kind of "big NUMA"
system the same way you'd use a regular desktop SMP. You don't plug in
random devices into it and just expect them to work. I'd hope ;)

Linus