Re: objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines)

From: Andrea Arcangeli
Date: Wed Mar 10 2004 - 07:34:58 EST

On Wed, Mar 10, 2004 at 12:35:01PM +0100, Ingo Molnar wrote:
> * Andrea Arcangeli <andrea@xxxxxxx> wrote:
> > the quality of such objrmap patch is still better than rmap. The DoS
> > thing is doable with vmtruncate too in any kernel out there.
> objrmap for now has a serious problem: test-mmap3.c locked up my box (i
> couldnt switch text consoles for 30 minutes when i turned the box off).
> I'm sure you'll fix it and i'm looking forward seeing it. However, i'd
> like to see the full fix instead of a promise to have this fixed
> sometime the future. There are valid application workloads that trigger
> _worse_ vma patterns than test-mmap3.c does (UML being one such thing,
> Oracle with indirect buffer-cache another - i'm sure there are other
> apps too.). Calling these applications 'exploits' doesnt help in
> getting this thing fixed. There's no problem with keeping this patchset
> separate until it's regression-free.
> > merging objrmap is the first step. Any other effort happens on top of
> > it.
> i'd like to see that effort combined with this code, and the full
> picture. Since this 'DoS property' is created by the current concept of
> the patch, it's not a 'bug' that is easily fixed so we must not (and
> cannot) sign up for it blindly, without seeing the full impact. But
> yes, it might be fixable. Anyway - the 2.6 kernel is a stable tree and
> i'm sure you know that avoiding regression is more important than
> anything else.

I'm fine to wait the whole work to be finished and to merge it all at
once (still from separate incremental patches) instead of merging it in
steps in mainline and your longer term confidence in our work is
promising, thanks.

since I need this fixed fast, I may have to go the rbtree way to go
safe (mainline could go with prio_trees in the long run instead).

however I still disagree the objrmap I posted is a regression for
applications like Oracle (dunno about uml). It's an obvious regression
for your test-mmap3.c and that's why I call test-mmap3.c an exploit and
not a "real app". Nobody would map 1 page per vma, get real, you have an
hard time to convince me a real app is going to scatter vma with 4k
aperture each. you wrote the very worst case that everybody is aware
about, a real app scenario would not do that. Note that there's quite an
huge amount of merging of file-vmas, you absolutely prevent that too.

Furthmore you said Oracle needs mlock to work "safe" with rmap. But with
2.6 if you use mlock it will still not work. If you use 2.6+objrmap
mlock will fix your DoS secenario too, and Oracle will work as fast as
rmap+mlock in your rmap 2.4 implementation.

Also you're advocating for the "merging in steps" and keeing "2.6
optimal", but you're ignoring the single reason you are forced to ship a
2.4 kernel with 4:4 for every >4G machine. 2.6 mainline (the current
2.6.3 step) has no way to be compiled with 4:4 model. So the current
great 2.6 kernel has no way to work with any machine >4G (if you ship
all PAE kernels with rmap compiled with 4:4 you must agree 2.6 mainline
has no way to work on any kernel with >4G of ram, so you should not be
surprised that I'm dealing with those issues currently). Is 2.6 an high
end kernel with rmap? I supported 4G on x86 the first time with bigmem
in 2.2.

Solving the problem by merging 4:4 instead of removing rmap is not the
way to go IMHO since it doesn't fix the memory waste for 64bit archs
compared to what we can do with 2.4 _mainline_ (64bit doesn't need
pte-highmem and there are no highmem issues to solve there).

At least with objrmap applied to 2.6, there would be a chance to survive
the load on >4G boxes in a 2.6 mainline kernel. Sure, you'd better be
careful not to swapout heavy or it would risk to hang badly (if the app
isn't using mlock, if the app uses mlock 2.6 will fly), but without
objrmap it would lockup before you can worry about reaching swap (mlock
or not).

So in practice I think it would been ok to merge objrmap as an
intermediate step (it's not that I didn't evaluate those possibilities
when I submitted it).

As for the DoS thing in security terms, truncate has the same issue. It
maybe easier to kill the "exploit" since it returns to userspace every
time, and userspace is not swapped out when it happens, but it would
still waste an indefinite amount of time in kernel space. So providing
an efficient means of the i_mmap vma lookup is a problem irrelevant to
the objrmap patch for the vm, I think we agree on this. Doing that will
fix all users (so the vm too).
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at