Re: swapoff() runs forever

From: Hugh Dickins
Date: Thu Apr 12 2012 - 02:40:45 EST


On Thu, 12 Apr 2012, Richard Weinberger wrote:
> Am 09.04.2012 20:40, schrieb Hugh Dickins:
> > I've not seen any such issue in recent months (or years), but
> > I've not been using UML either. The most likely cause that springs
> > to mind would be corruption of the vmalloc'ed swap map: that would
> > be very likely to cause such a hang.
>
> It does not look like a swap map corruption.
> If I restart most user space processes swapoff() terminates fine.

Right, thanks, that's very useful info.

> Maybe it is a refcounting problem?

You may prove to be correct; but since killing and restarting
processes fixes it up without (I presume) issuing warnings,
it doesn't sound like a refcounting problem to me.

>
> > You say "recent Linux kernels": I wonder what "recent" means.
> > Is this something you can reproduce quickly and reliably enough
> > to do a bisection upon?
> >
>
> I can reproduce the issue on any UML kernel.
> The oldest I've tested was 2.6.20.
> Therefore, bug was not introduced by me. B-)

More useful info, thank you.

I think I've spotted two problems in the UML swp_entry_t handling;
but checking if I'm right, and if they're relevant, and how to fix them,
I'll leave to you - it's years since I tried UML and I remember nothing of it.

One, likely to be your problem. Take a look at unuse_pte_range() in
mm/swapfile.c, where it searches the page table for the swp_pte it's
trying to "unuse". And take a look at set_pte() in
arch/um/include/asm/pgtable.h, which appears to add a mysterious
_PAGE_NEWPAGE bit into the page table entry. And UML doesn't provide
an alternative to generic pte_same() in include/asm-generic/pgtable.h.

My guess is that the _NEWPAGE bit prevents swapoff from matching pte
against swap entry in all or some cases (I didn't look to see if
_NEWPAGE is sometimes cleared later).

Probably a good fix to try would be providing a UML pte_same() which
takes that into account; but I don't know what conditionals it should
contain, and whether it would become too inefficient. Or, if _NEWPAGE
is always set in a swap pte, then swp_entry_to_pte() needs to set it.

(A word of warning if you're unfamiliar with swap entries: there's the
kernel's internal representation swp_entry_t, which has offset in the
low-order and type in the high-order, for efficient use with radix_tree
- see include/linux/swapops.h; and then there's the arch-dependent
representation as a page table entry, which rearranges the bits so
as not to be confused with a good present page table entry, and
traditionally has type on the lower side of offset.)

The other problem, which I noticed first, is probably not relevant to the
bug you're seeing, since I think you'd have mentioned if you had two
swapfiles; but the case of two or more swapfiles looks very broken to me.
_PAGE_PROTNONE is
0x010 but __swp_type(x) is (((x).val >> 4) & 0x3f): unless I'm confused,
a swap entry of type 1 will look just like a PROT_NONE pte.

Or maybe that's resolved by the _PAGE_NEWPAGE and _PAGE_NEWPROT bits,
I didn't spend time working out what they're up to.

include/linux/swap.h does not allow MAX_SWAPFILES to exceed 32,
so you can easily change __swp_type(x) to use 5 and 0x1f instead
(with 5 instead of 4 in __swp_entry too of course). Though it doesn't
cause an error, I wonder where the 11 in __swp_offset and __swp_entry
comes from: I think you can support larger swap by making it 10.

Hugh