Re: ECC and DMA to/from disk controllers

From: Alan Cox
Date: Mon Sep 10 2007 - 09:46:46 EST


> In thinking about this, I began to wonder about the following. Suppose
> that a (possibly RAID) disk controller correctly reads data from disk and
> has correct data in the controller memory and buffers. However when that
> data is DMA'd into system memory some errors occur (cosmic rays,
> electrical noise, etc). Am I correct that these errors would NOT be
> detected, even on a 'reliable' server with ECC memory? In other words the
> ECC bits would be calculated in server memory based on incorrect data from
> the disk.

Architecture specific.

> The alternative is that disk controllers (or at least ones that are meant
> to be reliable) DMA both the data AND the ECC byte into system memory.
> So that if an error occurs in this transfer, then it would most likely be
> picked up and corrected by the ECC mechanism. But I don't think that
> 'this is how it works'. Could someone knowledgeable please confirm or
> contradict?

It's almost entirely device-specific at every level. Some general
information and comments, however:

- Drives normally do error correction and shouldn't be fooled very often
by bad bits.
- The ECC level on the drive processors and memory cache varies by vendor.
Good luck getting any information on this, although maybe if you are
CERN-sized they will talk.

After the drive we cross the cable. For SATA this is pretty good, and
UDMA data transfer is CRC protected. For PATA the data is but the command
block is not, so on PATA there is a minute chance you send the
CRC-protected block to the wrong place.

Once the data is crossing the PCI bus, main memory and the CPU cache, what
is protected and how much is entirely down to the system you are running.
Note that a lot of systems won't report ECC errors unless you ask.
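
For example, on Linux boxes with the EDAC drivers loaded the corrected and
uncorrected counts sit in sysfs and nothing shouts about them unless you go
and look. A rough sketch of reading them; mc0 is an assumption, you may
have several memory controllers or none at all:

/* Sketch: read the EDAC corrected/uncorrected error counters that the
 * kernel exposes under sysfs when a suitable memory controller driver
 * is loaded.  mc0 is assumed; adjust or iterate over mc* as needed. */
#include <stdio.h>

static long read_counter(const char *path)
{
	FILE *f = fopen(path, "r");
	long v = -1;

	if (!f)
		return -1;	/* no EDAC support or driver not loaded */
	if (fscanf(f, "%ld", &v) != 1)
		v = -1;
	fclose(f);
	return v;
}

int main(void)
{
	long ce = read_counter("/sys/devices/system/edac/mc/mc0/ce_count");
	long ue = read_counter("/sys/devices/system/edac/mc/mc0/ue_count");

	if (ce < 0 && ue < 0) {
		fprintf(stderr, "no EDAC counters found (driver not loaded?)\n");
		return 1;
	}
	printf("corrected: %ld  uncorrected: %ld\n", ce, ue);
	return 0;
}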

If you have hardware RAID controllers it's all vendor-specific, including
the CPU, cache, etc. on the card.

The next usual mess is network transfers. The TCP checksum strength is
questionable for such workloads, but the Ethernet one is pretty good.
Unfortunately lots of high-performance people use checksum offload, which
removes much of the end-to-end protection and leads to problems with iffy
cards and the like. This is well studied and known to be very problematic,
but in the market speed sells, not correctness.
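
The usual answer when you can't trust the path is to checksum the payload
yourself at both ends. A rough sketch of that kind of application-level
check, using a plain bitwise CRC-32 (IEEE polynomial); the framing here is
illustrative, not any real protocol:

/* Sketch: sender appends a CRC-32 of the payload, receiver recomputes
 * over what actually arrived and compares, so corruption introduced
 * past the NIC's offload engine is still caught end to end. */
#include <stdint.h>
#include <stdio.h>

static uint32_t crc32_ieee(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = 0xffffffffu;

	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0xedb88320u & -(crc & 1));
	}
	return ~crc;
}

int main(void)
{
	char payload[] = "block of file data going over the wire";

	/* sender side: compute the CRC and ship it after the payload */
	uint32_t sent_crc = crc32_ieee(payload, sizeof(payload));

	/* receiver side: recompute over the received bytes and compare */
	uint32_t recv_crc = crc32_ieee(payload, sizeof(payload));

	if (recv_crc != sent_crc)
		fprintf(stderr, "payload corrupted in transit\n");
	else
		printf("payload intact, crc %08x\n", (unsigned)recv_crc);
	return 0;
}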

From the paper, type II sounds like slab might be a candidate on the
kernel side, but also CPU bugs: near OOM we will be paging hard, and any
L2 cache page-out/page-table race in software or hardware would fit what
it describes, especially the transient nature.

Type III (wrong block) on PATA fits with the fact that the block number
isn't protected, and also with the limits on drive cache quality and drive
firmware bugs.
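
The classic way to catch that class of error is to stamp every block with
its own block number and verify the stamp on read-back. A rough sketch
against an ordinary test file; the file name, block size and block count
are placeholders, this is not a real test tool:

/* Sketch: write each 512-byte block with its own block number embedded
 * in the first 8 bytes, then read everything back and check the tag
 * matches where it was read from.  A mismatch means the data landed in
 * (or came back from) the wrong block. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 512
#define NBLOCKS    1024

int main(void)
{
	unsigned char buf[BLOCK_SIZE];
	int fd = open("tagtest.dat", O_RDWR | O_CREAT, 0600);

	if (fd < 0)
		return 1;

	/* write pass: tag each block with its own number */
	for (uint64_t blk = 0; blk < NBLOCKS; blk++) {
		memset(buf, 0xA5, sizeof(buf));
		memcpy(buf, &blk, sizeof(blk));
		if (pwrite(fd, buf, BLOCK_SIZE, blk * BLOCK_SIZE) != BLOCK_SIZE)
			return 1;
	}
	fsync(fd);

	/* read pass: a mismatched tag means wrong-block data */
	for (uint64_t blk = 0; blk < NBLOCKS; blk++) {
		uint64_t tag;

		if (pread(fd, buf, BLOCK_SIZE, blk * BLOCK_SIZE) != BLOCK_SIZE)
			return 1;
		memcpy(&tag, buf, sizeof(tag));
		if (tag != blk)
			fprintf(stderr, "block %llu holds data for block %llu\n",
				(unsigned long long)blk, (unsigned long long)tag);
	}
	close(fd);
	return 0;
}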

For drivers/ide there are *lots* of problems with error handling, so that
might be implicated (one would want to do old vs new IDE tests on the same
hardware, which would be very intriguing).

Stale data from the disk cache I've seen reported, as well as offsets from
FIFO hardware bugs (the LOTR render farm hit the latter and had to disable
UDMA to work around it).

Chunks of zeroes sound like caches again; it would be interesting to know
what hardware and software changes occurred at the point they began to pop
up.

We also see chipset bugs under high contention, some of which are
explained and worked around (VIA ones in the past); others are clear
correlations we keep seeing, e.g. between Nvidia chipsets and Silicon
Image SATA controllers.
