Re: [PATCH 1/2] sched/wait: Break up long wake list walk

From: Linus Torvalds
Date: Mon Aug 14 2017 - 21:48:13 EST


On Mon, Aug 14, 2017 at 5:52 PM, Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx> wrote:
> We encountered workloads that have very long wake up list on large
> systems. A waker takes a long time to traverse the entire wake list and
> execute all the wake functions.
>
> We saw page wait list that are up to 3700+ entries long in tests of large
> 4 and 8 socket systems. It took 0.8 sec to traverse such list during
> wake up. Any other CPU that contends for the list spin lock will spin
> for a long time. As page wait list is shared by many pages so it could
> get very long on systems with large memory.

I really dislike this patch.

The patch seems a band-aid for really horrible kernel behavior, rather
than fixing the underlying problem itself.

Now, it may well be that we do end up needing this band-aid in the
end, so this isn't a NAK of the patch per se. But I'd *really* like to
see if we can fix the underlying cause for what you see somehow..

In particular, if this is about the page wait table, maybe we can just
make the wait table bigger. IOW, are people actually waiting on the
*same* page, or are they mainly waiting on totally different pages,
just hashing to the same wait queue?

Because right now that page wait table is a small fixed size, and the
only reason it's a small fixed size is that nobody reported any issues
with it - particularly since we now avoid the wait table entirely for
the common cases by having that "contention" bit.

But it really is a *small* table. We literally have

#define PAGE_WAIT_TABLE_BITS 8

so it's just 256 entries. We could easily it much bigger, if we are
actually seeing a lot of collissions.

We *used* to have a very complex per-zone thing for bit-waitiqueues,
but that was because we got lots and lots of contention issues, and
everybody *always* touched the wait-queues whether they waited or not
(so being per-zone was a big deal)

We got rid of all that per-zone complexity when the normal case didn't
hit in the page wait queues at all, but we may have over-done the
simplification a bit since nobody showed any issue.

In particular, we used to size the per-zone thing by amount of memory.
We could easily re-introduce that for the new simpler page queues.

The page_waitiqueue() is a simple helper function inside mm/filemap.c,
and thanks to the per-page "do we have actual waiters" bit that we
have now, we can actually afford to make it bigger and more complex
now if we want to.

What happens to your load if you just make that table bigger? You can
literally test by just changing the constant from 8 to 16 or
something, making us use twice as many bits for hashing. A "real"
patch would size it by amount of memory, but just for testing the
contention on your load, you can do the hacky one-liner.

Linus