Re: [PATCH] oom_kill: use rss value instead of vm size for badness

From: David Rientjes
Date: Thu Oct 29 2009 - 05:02:02 EST


On Thu, 29 Oct 2009, KAMEZAWA Hiroyuki wrote:

> > This appears to actually prefer X more than total_vm in Vedran's test
> > case. He cited http://pastebin.com/f3f9674a0 in
> > http://marc.info/?l=linux-kernel&m=125678557002888.
> >
> > There are 12 ooms in this log, which has /proc/sys/vm/oom_dump_tasks
> > enabled. It shows the difference between the top total_vm candidates
> > and the top rss candidates.
> >
> > total_vm
> > 708945 test
> > 195695 krunner
> > 168881 plasma-desktop
> > 130567 ktorrent
> > 127081 knotify4
> > 125881 icedove-bin
> > 123036 akregator
> > 118641 kded4
> >
> > rss
> > 707878 test
> > 42201 Xorg
> > 13300 icedove-bin
> > 10209 ktorrent
> > 9277 akregator
> > 8878 plasma-desktop
> > 7546 krunner
> > 4532 mysqld
> >
> > This patch would pick the memory hogging task, "test", first every time,
> > just like the current implementation does. It would then prefer Xorg,
> > icedove-bin, and ktorrent next as a starting point.
> >
> > Admittedly, there are other heuristics that the oom killer uses to create
> > a badness score. But since this patch only changes the baseline from
> > mm->total_vm to get_mm_rss(mm), its behavior in this test case does not
> > match the patch description.
> >
> Yes, that's why I wrote "as a starting point". There are many environments.

And this environment has a particularly bad result.

> But I'm not sure why ntpd could become the first candidate...
> The scores you showed don't include the children's scores, right?
>

Right, it's just the get_mm_rss(mm) for each thread shown in the oom dump,
the same value you've used as the new baseline. The actual badness scores
could easily be calculated by cat'ing /proc/*/oom_score prior to oom, but
this data was meant to illustrate the preference given to rss compared to
total_vm in a heuristic sense.

> I believe I'll have to remove "adding the child's score to the parent's".
> I'm now considering how to implement a fork-bomb detector so it can be removed.
>

Agreed, I'm looking forward to your proposal.

> Yeah, I'm now considering dropping file_rss from the calculation.
>
> Some reasons:
>
> - file caches still in memory at OOM tend to be hard to remove.
> - file caches tend to be shared.
> - if the file caches are from shmem, we can never drop them when there is no swap or swap is full.
>
> Maybe we'll get a better result.
>

That sounds more appropriate.

I'm surprised you still don't see value in using the peak VM and RSS
sizes as part of your formula, though, since they would indicate the
proportion of a task's memory resident in RAM at the time of oom.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/