Re: [PATCH] Deadlock during heavy write activity to userspace NFSserver on local NFS mount
From: Nick Piggin
Date: Wed Jul 28 2004 - 05:34:20 EST
Avi Kivity wrote:
Nick Piggin wrote:
What's stopping the NFS server from ooming the machine then? Every
time some bit of memory becomes free, the server will consume it
instantly. Eventually ext3 will not be able to write anything out
because it is out of memory.
The NFS server should do the writeout a page at a time.
The NFS server writes not only in response to page reclaim (as a local
NFS client), but also in response to pressure from non-local clients. If
both ext3 and NFS have the same allocation limits, NFS may starve out ext3.
What do you mean starve out ext3? ext3 gets written to *by the NFS server*
which is PF_MEMALLOC.
(In my case the NFS server actually writes data asynchronously, so it
doesn't really know it is responding to page reclaim, but the problem
occurs even in a synchrounous NFS server.)
I can't see this being the responsibility of the kernel. The NFS server
could probably find out if it is servicing a loopback request or not.
Remote requests don't help to free memory... unless maybe you want a
filesystem on a remote nbd to be exported back to server via NFS or
something crazy.
An even more complex case is when ext3 depends on some other process,
say it is mounted on a loopback nbd.
dirty NFS data -> NFS server -> ext3 -> nbd -> nbd server on
localhost -> ext3/raw device
You can't have both the NFS server and the nbd server PF_MEMALLOC,
since the NFS server may consume all memory, then wait for the nbd
server to reclaim.
The memory allocators will block when memory reaches the reserved
mark. Page reclaim will ask NFS to free one page, so the server
will write something out to the filesystem, this will cause the nbd
server (also PF_MEMALLOC) to write out to its backing filesystem.
If NFS and nbd have the same limit, then NFS may cause nbd to stall.
We've already established that NFS must be PF_MEMALLOC, so nbd must be
PF_MEMALLOC_HARDER or something like that.
No, your NFS server has to be coded differently. You can't allow it
to use up all PF_MEMALLOC memory just because it can.
The solution I have in mind is to replace the sync allocation logic from
if (free_mem() < some_global_limit && !current->PF_MEMALLOC)
wait_for_kswapd()
to
if (free_mem() < current->limit)
wait_for_kswapd()
kswapd would have the lowest ->limit, other processes as their place
in the food chain dictates.
I think this is barking up the wrong tree. It really doesn't matter
what process is freeing memory. There isn't really anything special
about the way kswapd frees memory.
To free memory you need (a) to allocate memory (b) possibly wait for
some freeing process to make some progress. That means all processes in
the freeing chain must be able to allocate at least some memory. If two
processes in the chain share the same blocking logic, they may deadlock
on each other.
The PF_MEMALLOC path isn't to be used like that. If a *single*
PF_MEMALLOC task were to allocate all its memory then that would
be a bug too.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/