Re: I/O throughput problem in newer kernels

From: Andrew Morton
Date: Tue Apr 07 2009 - 03:37:26 EST


On Thu, 2 Apr 2009 11:06:08 -0400 (EDT) adobra@xxxxxxxxxxxx wrote:

> While putting together a not-so-average machine for database research, I
> bumped into the following performance problem with newer kernels (I
> tested 2.6.27.11 and 2.6.29): the aggregate throughput drops drastically
> when more than 20 hard drives are involved in the operation. This problem
> does not happen on 2.6.22.9 or 2.6.20 (I did not test other kernels).

Well that's bad. You'd at least expect the throughput to level out.

> Since I am not subscribed to the mailing list, I would appreciate you
> cc-ing me on any reply or discussion.
>
> 1. Description of the machine
> -----------------------------------------------
> 8 Quad-Core AMD Opteron(tm) 8346 HE processors
> Each processor has independent memory banks (16GB in each bank for 128GB
> total)
> Two PCI busses (connected in different places in the NUMA architecture)
> 8 hard drives installed into the base system on SATA interfaces
> First hard drive dedicated to the OS
> 7 Western Digital hard drives (90 MB/s max throughput)
> Nvidia SATA chipset
> 4 Adaptec 5805 RAID cards installed in PCI-E 16X slots (all running at 8X
> speed)
> The 4 cards live on two separate PCI busses
> 6 IBM EXP3000 disk enclosures
> 2 cards connect to 2 enclosures each, the other 2 to 1 enclosure each
> 8 Western Digital Velociraptor HDs in each enclosure
> Max measured throughput: 110-120 MB/s per drive
>
> Total number of hard drives used in the tests: 7+47=54, or subsets thereof
> The Adaptec cards are configured to expose each disk individually to the
> OS. Any RAID configuration seems to limit the throughput to 300-350 MB/s,
> which is too low for the purpose of this system.
>
> 2. Throughput tests
> --------------------------------
> I did two types of tests: using dd (spawning parallel dd jobs that lasted
> at least 10s each) and using a multi-threaded program that simulates the
> intended usage of the system. Results from both are consistent, so I will
> only report the results from the custom program. Both the dd test and the
> custom one do reads in large chunks (at least 256K per request). All
> requests in the custom program are made with the "read" system call into
> page-aligned memory (allocated with mmap to make sure of it). The kernel
> must be doing zero-copy to user space; otherwise the speeds observed would
> not be possible.
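>
> For illustration, the per-thread read loop of the custom program is
> essentially the following sketch (threading and timing are omitted and
> the device name is a placeholder, so this is not the exact code):
>
>     #define _GNU_SOURCE
>     #include <fcntl.h>
>     #include <stdio.h>
>     #include <sys/mman.h>
>     #include <unistd.h>
>
>     #define CHUNK (256 * 1024)   /* 256K per request */
>
>     /* One reader: page-aligned buffer from mmap, sequential read()s */
>     static void read_device(const char *dev)
>     {
>         void *buf = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
>                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>         if (buf == MAP_FAILED) { perror("mmap"); return; }
>
>         int fd = open(dev, O_RDONLY);
>         if (fd < 0) { perror(dev); munmap(buf, CHUNK); return; }
>
>         while (read(fd, buf, CHUNK) > 0)
>             ;   /* data is discarded; only throughput is measured */
>
>         close(fd);
>         munmap(buf, CHUNK);
>     }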
>
> Here is what I observed in terms of throughput:
> a. Speed/WD disk: 90 MB/s
> b. Speed/Velociraptor disk: 110 MB/s
> c. Speed of all WD disks in base system: 700MB/s
> d. Speed of disks in one enclosure: 750 MB/s
> e. Speed of disks connected to one Adaptec card: 1000 MB/s
> f. Speed of disks connected on a single PCI bus: 2000 MB/s
>
> The above numbers look good and are consistent on all kernels that I tried.
>
> THE PROBLEM: when the number of disks exceeds 20, the throughput plummets
> on newer kernels.
>
> g. SPEED OF ALL DISKS: 600 MB/s on newer kernels, 2700 MB/s on older kernels
> The throughput drops drastically the moment 20-25 hard drives are involved
>
> 3. Tests I performed to ensure the number of hard drives is the culprit
> ----------------------------------------------------------------------------------------------------------------
> a. Took 1, 2, 3 and 4 disks from each enclosure to ensure uniform load on
> the buses
> Performance went up as expected until 20 drives were reached, then dropped
>
> b. Involved combinations of the regular WD drives and the Velociraptors
> No major influence on the observation
>
> c. Involved combinations of enclosures
> No influence
>
> d. Used the hard drives in decreasing order of measured speed (as reported
> by hdparm)
> Only a minor influence, and still a drastic drop at 20 drives
>
> e. Changed the I/O scheduler used for the hard drives
> No influence
>
> 4. Things that I do not think are wrong
> --------------------------------------------------------------
> a. The aacraid or sata_nv drivers
> The problem depends only on the number of hard drives, not on the
> particular combination of drives themselves
>
> b. Limitations on the buses
> The measured speeds of the subsystems indicate that no bottleneck on
> any individual bus is reached. Even if one were, the throughput
> should level off, not drop dramatically
>
> c. Failures in the system
> No errors reported in /var/log/messages or other logs related to I/O
>
> Of course, this raises the question: WHAT IS WRONG?
>
> I would be more than happy to run any tests you suggest on my system to
> find the problem.
>

Did you monitor the CPU utilisation?

It would be interesting to test with O_DIRECT (dd iflag=direct) to
remove the page allocator and page reclaim from the picture.
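
If you want to try the same thing from the custom program, something
along these lines should work (untested sketch; the device path is just
a placeholder): open the device with O_DIRECT and keep the buffer and
the request size block-aligned.

    #define _GNU_SOURCE          /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK (256 * 1024)   /* keep this a multiple of the block size */

    int main(int argc, char **argv)
    {
        void *buf;

        /* O_DIRECT wants an aligned buffer; page alignment is safe */
        if (posix_memalign(&buf, 4096, CHUNK))
            return 1;

        int fd = open(argc > 1 ? argv[1] : "/dev/sdb", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        /* sequential reads which bypass the page cache entirely */
        while (read(fd, buf, CHUNK) > 0)
            ;

        close(fd);
        free(buf);
        return 0;
    }

That would make it easy to compare throughput with and without the page
cache in the picture.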
