read stalls with large RAM: transparent huge pages, dirty buffers, or I/O (block) scheduler?

From: Ulrich Windl
Date: Thu Oct 10 2013 - 04:25:30 EST


Hi!

We are running some x86_64 servers with large RAM (128GB). To put that in perspective: at a memory bandwidth of a little more than 9GB/s it takes more than 10 seconds just to read all of RAM...

In the past and recently we had problems with read() stalls when the kernel was writing back large amounts (like 80GB) of dirty buffers to a somewhat slow (40MB/s) device. The problem is old and well-known, it seems, but not really solved.

One recommendation was to limit the amount of dirty buffers, which did not actually avoid the problem, specifically if new dirty buffers are consumed as soon as they become available (i.e. as soon as some were flushed). I had some success limiting the memory used (including dirty pages) with control groups (memory:iothrottle, SLES11 SP2), but the control framework (rccgconfig setting up proper rights for /sys/fs/cgroup/mem/iothrottle/tasks) is quite incomplete (no group write permission or ACL setup possible), so an end user can hardly use it.
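To illustrate why a percentage-based dirty limit misbehaves on a large-RAM box: the default vm.dirty_ratio of 20% on 128GB allows roughly 25GB of dirty data, which a 40MB/s device needs over 10 minutes to drain. A minimal sketch of sizing an absolute limit instead (the 512 MiB cap and 10 s drain target are illustrative assumptions, not recommendations):

```python
# Sketch: derive an absolute dirty-page limit (vm.dirty_bytes) from the
# writeback device's speed, instead of a percentage of a huge RAM.
# The 512 MiB cap and the 10-second drain target are made-up numbers.

def suggest_dirty_bytes(writeback_bps, target_drain_secs=10):
    """Limit dirty data so the backing device can drain it in roughly
    target_drain_secs, capped at 512 MiB."""
    by_time = writeback_bps * target_drain_secs
    return min(by_time, 512 * 1024 * 1024)

device_bps = 40 * 1024**2            # the 40 MB/s device from above
limit = suggest_dirty_bytes(device_bps)
print(limit)                         # 419430400 = 10 s of writeback at 40 MB/s

# Applying it would be something like (as root):
#   echo 419430400 > /proc/sys/vm/dirty_bytes
#   echo 104857600 > /proc/sys/vm/dirty_background_bytes
```

Note this only bounds how much dirty data can accumulate; as observed above, it does not by itself prevent stalls if writers immediately re-dirty pages as soon as some are flushed.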

I still don't know whether the read stalls are caused by the I/O channel or the device being saturated, or whether the kernel is waiting for free buffers to receive the read data, but I learned that I/O schedulers (and possibly the block-layer optimizations) can add extra delays, too.

We had one situation where a single sector could not be read with direct I/O for 10 seconds.

Recently we had the problem again, but it was clear that it was _not_ the device being overloaded, nor the I/O channel: the read stall was reported for a device that was almost idle, and the I/O channel (FC) can handle much more in both directions than the disk system can. So the problem seems to be inside the kernel.

Oracle recommends (in article 1557478.1, without explaining the details) turning off transparent huge pages. Before that I hadn't thought much about that feature. It seems the kernel does not just create huge pages when they are requested explicitly (which is what I had thought), but also implicitly, to reduce the number of pages to be managed. Collecting smaller pages to combine them into huge pages may also involve moving memory around (compaction), it seems. I still don't know whether the kernel will also try to compact dirty cache pages into huge pages, but we still see read stalls when there are many dirty pages (like when copying 400GB of data to a somewhat slow (30MB/s) disk).
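For reference, the Oracle advice boils down to writing "never" into /sys/kernel/mm/transparent_hugepage/enabled, whose contents show the active mode in brackets. A small sketch of checking that mode (the parsing function is mine, not a kernel interface):

```python
# Sketch: the current THP mode is the bracketed word in
# /sys/kernel/mm/transparent_hugepage/enabled, e.g. "[always] madvise never".

THP_ENABLED = "/sys/kernel/mm/transparent_hugepage/enabled"

def current_thp_mode(contents):
    """Extract the bracketed mode from the sysfs file's contents."""
    for word in contents.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None

# On a live system:
#   with open(THP_ENABLED) as f:
#       print(current_thp_mode(f.read()))
# and, as root, to disable THP:
#   echo never > /sys/kernel/mm/transparent_hugepage/enabled

print(current_thp_mode("[always] madvise never"))   # -> always
```

On SLES11 SP2 kernels like the one below, the khugepaged/compaction behavior can also be tuned under /sys/kernel/mm/transparent_hugepage/, which may be gentler than disabling THP outright.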

Now I wonder what the real solution to the problem (as opposed to the numerous work-arounds) would be. Obviously, simply stopping (yielding) the dirty-buffer flush to give reads a chance may not be sufficient when a read has to wait for free pages, especially if the disks being read from are faster than those being written to.
To my understanding, dirty pages have an "age" that is used to decide whether to flush them. The I/O scheduler also seems to prefer read requests over write requests. What I do not know is whether a read request is sent to the I/O scheduler before buffer pages are assigned to it, or after. So a read request only gets the chance to have an "age" once it has entered the I/O scheduler, right?

So if reads and writes both had an "age", some EDF (earliest deadline first) scheduling could be used to perform I/O (which would control buffer usage as a side effect). For transparent huge pages, a request for a huge page should also have an age, plus a priority significantly below that of I/O buffers. If an efficient algorithm and data model exist for these tasks, the problem may be solved.
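The EDF idea above can be sketched as a toy dispatcher: every request gets a deadline at submission, and the request with the earliest deadline is issued next. The latency budgets (50 ms for synchronous reads, 5 s for writeback) are invented for illustration:

```python
# Toy sketch of EDF I/O dispatch: requests carry a deadline; the dispatcher
# always issues the one with the earliest deadline. Reads get a short latency
# budget, background writeback a long one, so reads usually win, but an old
# enough write eventually overtakes new reads (bounding dirty-buffer age).
import heapq

class EDFQueue:
    def __init__(self):
        self._heap = []
        self._seq = 0            # tie-breaker for equal deadlines

    def submit(self, now, kind, sector, latency_budget):
        heapq.heappush(self._heap,
                       (now + latency_budget, self._seq, kind, sector))
        self._seq += 1

    def dispatch(self):
        deadline, _, kind, sector = heapq.heappop(self._heap)
        return kind, sector

q = EDFQueue()
q.submit(now=0.0, kind="write", sector=100, latency_budget=5.0)   # writeback
q.submit(now=0.1, kind="read",  sector=200, latency_budget=0.05)  # sync read
print(q.dispatch())   # -> ('read', 200): deadline 0.15 beats 5.0
```

A huge-page compaction request would fit the same model as a third kind with a very long budget, so it never displaces I/O under load.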

Unfortunately, if many buffers are dirtied at one moment and reads are requested significantly later, there may be an additional need for time slices when doing I/O (note: I'm not talking about quotas of some MB, but quotas of time). The I/O throughput may vary a lot, and time seems the only way to manage latency correctly. To avoid a situation where reads stall writes (so that the age of dirty buffers grows without bound), the priority of writes should be increased _carefully_, taking care not to create a "freight train of dirty buffers" to be flushed. So maybe "smuggle in" a few dirty buffers between read requests. As a high-level flow control (like the cgroups mechanism), processes with a large amount of dirty buffers should be suspended or scheduled with very low priority, to give the memory and I/O subsystems a chance to process the dirty buffers.
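The "quota of time" idea can be sketched as per-writer accounting of device busy time: a writer gets a time slice per period, and once it has consumed it, it is suspended until the next period. All names and numbers here are hypothetical, not an existing kernel interface:

```python
# Sketch of time-based (not byte-based) throttling of heavy dirtiers:
# each writer may keep the writeback path busy for at most slice_secs
# out of every period_secs, independent of how fast the device happens
# to be at the moment.

class TimeSliceThrottle:
    def __init__(self, slice_secs, period_secs):
        self.slice_secs = slice_secs      # allowed device busy time per period
        self.period_secs = period_secs
        self.used = 0.0
        self.period_start = 0.0

    def account(self, now, io_secs):
        """Charge io_secs of device time; return True if the writer
        should be suspended until the next period begins."""
        if now - self.period_start >= self.period_secs:
            self.period_start = now       # new period: reset the budget
            self.used = 0.0
        self.used += io_secs
        return self.used > self.slice_secs

t = TimeSliceThrottle(slice_secs=0.2, period_secs=1.0)
print(t.account(0.0, 0.15))   # False: within budget
print(t.account(0.5, 0.10))   # True: 0.25 s used, over the 0.2 s slice
print(t.account(1.2, 0.05))   # False: new period, budget was reset
```

Because the quota is time rather than bytes, a writer hitting a slow (30MB/s) device is throttled just as effectively as one hitting a fast device, which is exactly the property byte quotas lack.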

For reference: The machine in question is at 3.0.74-0.6.10-default with the latest SLES11 SP2 kernel being 3.0.93-0.5.

I'd like to know what the gurus think about this. I think that with increasing RAM sizes this issue will soon become extremely important.

Regards,
Ulrich
P.S.: I'm not subscribed to linux-kernel, so keep me on CC:, please.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/