Re: [PATCH] mlx4: Use GFP_NOFS calls during the ipoib TX path when creating the QP

From: Or Gerlitz
Date: Thu Mar 06 2014 - 08:31:30 EST


On 21/02/2014 23:53, Jiri Kosina wrote:
> This was originally a patch from Matthew Finlay <matt@xxxxxxxxxxxx> that
> addressed a problem whereby NFS writes would enter uninterruptible sleep
> forever. The issue happened when using NFS over IPoIB. This is not a
> recommended configuration as RDMA is preferred but it is still a valid
> configuration and is important to have in situations where the NFS server
> does not support RDMA. The problem encountered was described as follows:
>
>     It's not memory reclamation that is the problem as such. There is
>     an indirect dependency between network filesystems writing back
>     pages and ipoib_cm_tx_init() due to how a kworker is used. Page
>     reclaim cannot make forward progress until ipoib_cm_tx_init()
>     succeeds and it is stuck in page reclaim itself waiting for network
>     transmission. Ordinarily this situation may be avoided by having
>     the caller use GFP_NOFS but ipoib_cm_tx_init() does not have that information.


Hi Jiri,

Having read the problem description again (*), the team here would be happy to clarify some details with you (possibly
a few MM newbie questions, but it will help us):

1. Just to make sure: the problem happens on the NFS client, not the NFS server, right? So "writing back" means the client
writing over the NFS mount --> network?

2. You wrote "due to how a kworker is used". Can you clarify if/why things go wrong because of the kworker usage, or is this just a matter of phrasing?

In an earlier post in this thread you wrote "There was a problem with swapping over NFS, as writeback was deadlocked with memory reclaim (memory needs to be allocated so that swap could be accessed to reclaim memory). That's fixed by allocating the buffers from PF_MEMALLOC reserve, introduced by Mel's and Peter's patchset back in 3.9 or so. Oh, and the same has been done for swapping over NBD, btw". In that respect:

3. You mentioned that the memory allocations in ipoib_cm_tx_init() and ib_create_qp() --> mlx4 driver require
page reclaim, which waits for network transmission. So does this client node have its swap on that NFS partition?

4. Can you shed more light on why the problem also hits kmalloc-based allocations and not only vmalloc-based
allocations, i.e. not only because of the vzalloc call in ipoib_cm_tx_init() but also because of the miscellaneous kmalloc calls within
the HW (here mlx4) driver?

thanks,

Or.

(*) and sorry for my stupid question from yesterday; sometimes it's a bad idea to ask questions on mailing lists when you are very tired