Re: [RFC] Fine-grained memory priorities and PI

From: Kyle Moffett
Date: Thu Dec 15 2005 - 07:51:19 EST


On Dec 15, 2005, at 04:04, Andi Kleen wrote:
>> When processes request memory through any subsystem, their memory priority would be passed through the kernel layers to the allocator, along with any associated information about how to free the memory in a low-memory condition. As a result, I could configure my database to have a much higher priority than SETI@home (or boinc or whatever), so that when the database server wants to fill memory with clean DB cache pages, the kernel will kill SETI@home for its memory, even if we could just leave some DB cache pages unfaulted.

> IIRC most of the freeing happens in process context anyway, so process priority information is already available. At least for the CPU cost it might even be taken into account during scheduling (freeing can take up quite a lot of CPU time).

> The problem with GFP_ATOMIC, though, is that someone else needs to free the memory in advance for you, because you cannot do it yourself.

> (you could call it a kind of "parasite" in the normally very cooperative society of memory allocators ...)

> That would mess up your scheme too. The priority cannot be expressed, because it's more a case of "at some point, someone in the future might need it".

Well, that's currently expressed as a reserved pool with watermarks, so under a PI scheme you would have a single pool carrying a collection of reservation watermarks at various priorities. I'm not sure what the best data structure would be; probably some sort of ordered priority tree. When allocating or freeing memory, the code would check the watermark data (which keeps some summary statistics so you don't need to walk the whole tree each time); if any of the watermarks would drop too low, with relative priority taken into account, you fail the allocation or move pages into the pool.
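Very roughly, the check might look like the toy user-space mock-up below. Everything here is invented for illustration: a flat array stands in for whatever ordered tree the real thing would use, 0 is the most important priority, and the "summary statistics" are reduced to a simple sum.

    #include <stdbool.h>

    #define NR_PRIOS 8                        /* 0 == most important */

    static unsigned long free_pages;          /* pages currently in the pool   */
    static unsigned long watermark[NR_PRIOS]; /* reserve owed to each priority */

    /*
     * May a request at 'prio' take 'nr' pages?  It must not dig into the
     * reserves of anything more important than itself, so add up those
     * watermarks and require that much to still be free afterwards.
     */
    static bool prio_alloc_ok(int prio, unsigned long nr)
    {
        unsigned long reserved = 0;
        int p;

        for (p = 0; p < prio && p < NR_PRIOS; p++)
            reserved += watermark[p];

        return free_pages >= reserved + nr;
    }

The freeing side would be the mirror image: while some watermark is unmet, pages go back into the pool (or get reclaimed from lower-priority users) until it is satisfied.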

>> Questions? Comments? "This is a terrible idea that should never have seen the light of day"? Both constructive and destructive criticism welcomed! (Just please keep the language clean! :-D)

> This won't help with this problem here - even with perfect priorities you could still get into situations where you can't make any progress if progress needs more memory.

Well, the point would be that the priorities could force a more extreme and selective OOM response (maybe even dropping dirty pages for noncritical filesystems if necessary!), or handle the situation described with the IPSec daemon and IPSec network traffic: the IPSec daemon would inherit the increased memory priority, and when it tries to do networking, its send path and the global receive path would inherit that increased priority as well.
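The inheritance step itself would be nothing more exotic than the usual PI boost, just applied to a memory priority instead of a scheduling priority. In completely made-up pseudo-C (lower number == more important, as in the mock-up above):

    /* per-task memory-priority state */
    struct task_mem {
        int own_prio;        /* configured memory priority            */
        int inherited_prio;  /* best priority of anyone waiting on us */
    };

    static int effective_mem_prio(const struct task_mem *t)
    {
        return t->own_prio < t->inherited_prio ? t->own_prio
                                               : t->inherited_prio;
    }

    /*
     * Called when 'waiter' starts depending on work 'worker' is doing,
     * e.g. a high-priority task blocking on the IPSec daemon or on the
     * network receive path.
     */
    static void mem_prio_inherit(struct task_mem *worker,
                                 const struct task_mem *waiter)
    {
        int p = effective_mem_prio(waiter);

        if (p < worker->inherited_prio)
            worker->inherited_prio = p;
    }

The boost would be dropped again once the dependency goes away, the same way scheduler PI drops the priority boost when a lock is released.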

Naturally this is all still in the vaporware stage, but I think the concept, if implemented, might at least improve the OOM/low-memory situation considerably. Starting to fail allocations for the cluster programs (including their kernel allocations) well before failing them for the swap-fallback tool would help the original poster, and I imagine various tweaked priorities would make a true OOM deadlock far less likely.
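For concreteness, the allocator-facing side might look vaguely like this; every name below is invented, and the only point is that a priority plus a "how to give this back" hint travel with the request:

    /*
     * Purely illustrative -- none of these names exist.  An allocation
     * carries the requester's memory priority and, optionally, a callback
     * describing how to reclaim the memory under pressure.
     */
    struct mem_prio {
        int prio;                                            /* task's memory priority */
        int (*release)(void *data, unsigned long nr_pages);  /* reclaim hint           */
        void *data;
    };

    void *alloc_pages_with_prio(unsigned int order, const struct mem_prio *mp);

    /*
     * A low-priority consumer (say a compute job's scratch cache) might
     * register a hook that just throws its clean, regenerable pages away.
     */
    static int drop_compute_cache(void *data, unsigned long nr_pages)
    {
        return 0;   /* report how many pages were actually released */
    }

That way the cluster programs' allocations (kernel-side ones included) would be tagged with their low priority from the start, and the allocator would know both whom to fail first and how to take cheap memory back before resorting to the OOM killer.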

Cheers,
Kyle Moffett

--
When you go into court you either want a very, very, very bright line or you want the stomach to outlast the other guy in trench warfare. If both sides are reasonable, you try to stay _out_ of court in the first place.
-- Rob Landley



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/