Re: [RFC][PATCH] fix swap entries is not reclaimed in proper way for memcg v3.

From: KAMEZAWA Hiroyuki
Date: Mon Apr 27 2009 - 04:23:18 EST


On Mon, 27 Apr 2009 13:42:06 +0530
Balbir Singh <balbir@xxxxxxxxxxxxxxxxxx> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> [2009-04-24 16:28:40]:
>
> > This is new one. (using new logic.) Maybe enough light-weight and caches all cases.
>
> You sure mean catches above :)
>
>
> >
> > Thanks,
> > -Kame
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
> >
> > Because free_swap_and_cache() is called under spinlocks,
> > it can't sleep, so it uses trylock_page() instead of lock_page().
> > As a result, a swp_entry that is no longer used after zap_xx can remain
> > as SwapCache that will never be used.
> > This kind of SwapCache is reclaimed by the global LRU when it is found
> > during LRU rotation.
> >
> > When memory cgroup is used, the global LRU will not be kicked and
> > stale SwapCache will not be reclaimed. This is problematic because
> > memcg's swap entry accounting is leaked and memcg can't know about it.
> > To catch this stale SwapCache, we have to chase it and check again
> > whether the swap entry is still alive.
> >
> > This patch adds a function to chase stale swap cache and reclaim it
> > in a moderate way. When zap_xxx fails to remove a swap entry, the entry
> > is recorded in a buffer and memcg's "work" reclaims it later.
> > No sleep, no memory allocation under free_swap_and_cache().
> >
> > This patch also adds stale-swap-cache-congestion logic and tries to avoid
> > having too many stale swap caches at the same time.
> >
> > The implementation is naive, but the cost is probably an acceptable trade-off.
> >
> > How to test:
> > 1. Set the memory limit to a very small value (1-2M?).
> > 2. Run some programs and trigger page reclaim/swap-in.
> > 3. Kill the programs with SIGKILL etc. Stale Swap Cache will then
> > increase. After this patch, stale swap caches are reclaimed
> > and the mem+swap controller will not go to OOM.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
>
> Quick comment on the design
>
> 1. I like the marking of swap cache entries as stale

I'd like to, but there is no space to record an entry as stale. And races
make that difficult even if we had enough space. If you read the whole
thread, you'll see there are many race patterns.

> 2. Can't we reclaim stale entries during memcg LRU reclaim? Why write
> a GC for it?
>
Because they are not on the memcg LRU, we can't reclaim them via the memcg LRU.
(See the first mail from Nishimura in this thread. It explains this well.)

Here is one easy case.

- CPU0 calls zap_pte()->free_swap_and_cache()
- CPU1 tries to swap it in.
In this case, free_swap_and_cache() doesn't free the swp_entry, and the
swp_entry is read into memory. But it will never be added to memcg's LRU
until it's mapped.
(What we have to consider here is swapin-readahead. It can swap in memory
even if it's not accessed, so this race window is larger than expected.)

We can't use memcg's LRU then, so what we can do is either:

- scan the whole global LRU
or
- use some trick to reclaim them in a lazy way.
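
A rough user-space sketch of the second option, only to illustrate the idea. This is not the code in the patch. All names here (toy_swap_entry, stale_buf, toy_zap, toy_drain_stale, STALE_BUF_SIZE) are made up for illustration, and the "page lock" is a plain flag standing in for a failed trylock_page(). The zap side records the entry in a fixed-size buffer (no sleep, no allocation), and a later pass checks again whether the entry is still stale before reclaiming it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one swap entry; NOT the kernel's swp_entry_t. */
struct toy_swap_entry {
    int  map_count;      /* users still referencing the entry */
    bool in_swap_cache;  /* a SwapCache page exists for it */
    bool page_locked;    /* e.g. held by a concurrent swap-in */
};

#define STALE_BUF_SIZE 8 /* hypothetical size, chosen arbitrarily */

/* Fixed-size buffer: recording needs no allocation and never sleeps. */
struct stale_buf {
    struct toy_swap_entry *ents[STALE_BUF_SIZE];
    size_t nr;
};

/*
 * Models the zap path: drop one reference and try to free the cache.
 * If the page is "locked" (trylock would fail), record the entry for
 * later instead. Returns true only if the cache was freed right away.
 */
bool toy_zap(struct toy_swap_entry *e, struct stale_buf *buf)
{
    if (e->map_count > 0)
        e->map_count--;
    if (e->map_count > 0 || !e->in_swap_cache)
        return false;               /* still in use, or nothing cached */
    if (!e->page_locked) {
        e->in_swap_cache = false;   /* trylock succeeded: free now */
        return true;
    }
    if (buf->nr < STALE_BUF_SIZE)
        buf->ents[buf->nr++] = e;   /* remember it for the lazy pass */
    /* If the buffer is full this sketch simply gives up on the entry. */
    return false;
}

/*
 * Models the deferred "work": re-check every recorded entry and reclaim
 * those that are still stale. Returns the number of entries reclaimed.
 */
size_t toy_drain_stale(struct stale_buf *buf)
{
    size_t kept = 0, reclaimed = 0;

    assert(buf->nr <= STALE_BUF_SIZE);
    for (size_t i = 0; i < buf->nr; i++) {
        struct toy_swap_entry *e = buf->ents[i];

        if (e->map_count > 0 || !e->in_swap_cache)
            continue;               /* alive again or already freed */
        if (e->page_locked) {
            buf->ents[kept++] = e;  /* still busy: retry on a later pass */
            continue;
        }
        e->in_swap_cache = false;   /* stale and unlocked: reclaim */
        reclaimed++;
    }
    buf->nr = kept;
    return reclaimed;
}
```

For example, calling toy_zap() on an entry with map_count 1 whose page is locked leaves it in the swap cache and puts it in the buffer. Once the lock flag is cleared, toy_drain_stale() reclaims it. The congestion logic from the patch description is not modelled here.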


Thanks,
-Kame

