Re: [PATCH] mm: disallow direct reclaim page writeback

From: Chris Mason
Date: Wed Apr 14 2010 - 07:22:09 EST


On Wed, Apr 14, 2010 at 12:06:36PM +0200, Andi Kleen wrote:
> Chris Mason <chris.mason@xxxxxxxxxx> writes:
> >
> > Huh, 912 bytes...for select, really? From poll.h:
> >
> > /* ~832 bytes of stack space used max in sys_select/sys_poll before allocating
> > additional memory. */
> > #define MAX_STACK_ALLOC 832
> > #define FRONTEND_STACK_ALLOC 256
> > #define SELECT_STACK_ALLOC FRONTEND_STACK_ALLOC
> > #define POLL_STACK_ALLOC FRONTEND_STACK_ALLOC
> > #define WQUEUES_STACK_ALLOC (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)
> > #define N_INLINE_POLL_ENTRIES (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))
> >
> > So, select is intentionally trying to use that much stack. It should be using
> > GFP_NOFS if it really wants to suck down that much stack...
>
> There are lots of other call chains which use multiple KB by themselves,
> so why not give select() that measly 832 bytes?
>
> You think only file systems are allowed to use stack? :)

Grin, most definitely.

>
> Basically if you cannot tolerate 1K (or more likely more) of stack
> used before your fs is called you're toast in lots of other situations
> anyways.

Well, on a 4K stack kernel, 832 bytes is a very large percentage for
just one function.
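
To put a number on "very large percentage", here is a minimal sketch of the arithmetic using the poll.h values quoted above. The 4096-byte figure assumes the whole 4K stack is usable, which overstates what is really available, and stack_percent() is a hypothetical helper written for this illustration, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/* Values quoted from poll.h earlier in this thread. */
#define MAX_STACK_ALLOC      832
#define FRONTEND_STACK_ALLOC 256
#define WQUEUES_STACK_ALLOC  (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)

/* Assumption: a "4K stack" is treated as 4096 fully usable bytes. */
#define ASSUMED_STACK_SIZE   4096

/* Hypothetical helper: share of the stack taken by `bytes`, in whole
 * percent, rounded down. */
static unsigned int stack_percent(size_t bytes, size_t stack_size)
{
	assert(stack_size > 0);
	return (unsigned int)((bytes * 100) / stack_size);
}
```

Under those assumptions, stack_percent(MAX_STACK_ALLOC, ASSUMED_STACK_SIZE) comes out at 20, and the 912 bytes reported for select comes out at 22. That is roughly a fifth of the stack gone before any filesystem code runs.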

Direct reclaim is a problem because it splices parts of the kernel that
normally aren't connected together. The people who code in select see
832 bytes and say: that's teeny, I should have taken 3832 bytes.

But they don't realize their function can dive down into ecryptfs, then
the filesystem, then maybe loop, and then perhaps raid6 on top of a
network block device.

>
> > kernel had some sort of way to dynamically allocate ram, it could try
> > that too.
>
> It does this for large inputs, but the whole point of the stack fast
> path is to avoid it for common cases when a small number of fds is
> only needed.
>
> It's significantly slower to go to any external allocator.

Yeah, but since the call chain does eventually go into the allocator,
this function needs to be more stack friendly.
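
For readers who have not looked at the pattern being argued over, this is a rough userspace sketch of a stack fast path with an allocator fallback. It is not the select() code: sum_with_scratch() and INLINE_BUF_BYTES are hypothetical names, malloc() stands in for whatever allocator the kernel would use, and the 256-byte size is only chosen to echo FRONTEND_STACK_ALLOC.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical inline buffer size, echoing FRONTEND_STACK_ALLOC. */
#define INLINE_BUF_BYTES 256

/*
 * Copy n longs into scratch space and sum them. Small inputs use the
 * on-stack buffer (fast path); larger ones fall back to malloc().
 * *used_heap reports which path was taken.
 * Returns 0 on success, -1 on overflow or allocation failure.
 */
static int sum_with_scratch(const long *in, size_t n, long *out, int *used_heap)
{
	long stack_buf[INLINE_BUF_BYTES / sizeof(long)];
	long *buf = stack_buf;
	long total = 0;
	size_t bytes, i;

	if (n > SIZE_MAX / sizeof(long))
		return -1;		/* n * sizeof(long) would overflow */
	bytes = n * sizeof(long);

	*used_heap = 0;
	if (bytes > sizeof(stack_buf)) {
		buf = malloc(bytes);	/* slow path: external allocator */
		if (!buf)
			return -1;
		*used_heap = 1;
	}

	if (n > 0)
		memcpy(buf, in, bytes);
	for (i = 0; i < n; i++)
		total += buf[i];

	if (buf != stack_buf)
		free(buf);
	*out = total;
	return 0;
}
```

The tradeoff is visible in the first few lines: every caller pays for stack_buf in its frame whether or not the fast path is taken, so shrinking INLINE_BUF_BYTES saves stack on every call chain and costs an allocation only for the inputs that no longer fit.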

I do agree that we can't really solve this with noinline_for_stack pixie
dust, the long call chains are going to be a problem no matter what.

Reading through all the comments so far, I think the short summary is:

Cleaning pages in direct reclaim helps the VM because it is able to make
sure that lumpy reclaim finds adjacent pages. This isn't a fast
operation; it has to wait for IO (infinitely slow compared to the CPU).

Will it be good enough for the VM if we add a hint to the bdi writeback
threads to work on a general area of the file? The filesystem will get
writepages(), the VM will get the IO it needs started.

I know Mel mentioned before he wasn't interested in waiting for helper
threads, but I don't see how we can work without them.

-chris