Re: [PATCH 4/4] mm: vmscan: Only read new_classzone_idx from pgdatwhen reclaiming successfully

From: Minchan Kim
Date: Thu Jul 21 2011 - 20:22:03 EST


On Fri, Jul 22, 2011 at 2:01 AM, Mel Gorman <mgorman@xxxxxxx> wrote:
> On Fri, Jul 22, 2011 at 01:36:49AM +0900, Minchan Kim wrote:
>> > > > <SNIP>
>> > > > @@ -2740,17 +2742,23 @@ static int kswapd(void *p)
>> > > > Â Â Â tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
>> > > > Â Â Â set_freezable();
>> > > >
>> > > > - Â Â order = 0;
>> > > > - Â Â classzone_idx = MAX_NR_ZONES - 1;
>> > > > + Â Â order = new_order = 0;
>> > > > + Â Â classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
>> > > > Â Â Â for ( ; ; ) {
>> > > > - Â Â Â Â Â Â unsigned long new_order;
>> > > > - Â Â Â Â Â Â int new_classzone_idx;
>> > > > Â Â Â Â Â Â Â int ret;
>> > > >
>> > > > - Â Â Â Â Â Â new_order = pgdat->kswapd_max_order;
>> > > > - Â Â Â Â Â Â new_classzone_idx = pgdat->classzone_idx;
>> > > > - Â Â Â Â Â Â pgdat->kswapd_max_order = 0;
>> > > > - Â Â Â Â Â Â pgdat->classzone_idx = MAX_NR_ZONES - 1;
>> > > > + Â Â Â Â Â Â /*
>> > > > + Â Â Â Â Â Â Â* If the last balance_pgdat was unsuccessful it's unlikely a
>> > > > + Â Â Â Â Â Â Â* new request of a similar or harder type will succeed soon
>> > > > + Â Â Â Â Â Â Â* so consider going to sleep on the basis we reclaimed at
>> > > > + Â Â Â Â Â Â Â*/
>> > > > + Â Â Â Â Â Â if (classzone_idx >= new_classzone_idx && order == new_order) {
>> > > > + Â Â Â Â Â Â Â Â Â Â new_order = pgdat->kswapd_max_order;
>> > > > + Â Â Â Â Â Â Â Â Â Â new_classzone_idx = pgdat->classzone_idx;
>> > > > + Â Â Â Â Â Â Â Â Â Â pgdat->kswapd_max_order = Â0;
>> > > > + Â Â Â Â Â Â Â Â Â Â pgdat->classzone_idx = pgdat->nr_zones - 1;
>> > > > + Â Â Â Â Â Â }
>> > > > +
>> > >
>> > > But in this part.
>> > > Why do we need this?
>> >
>> > Lets say it's a fork-heavy workload and it is routinely being woken
>> > for order-1 allocations and the highest zone is very small. For the
>> > most part, it's ok because the allocations are being satisfied from
>> > the lower zones which kswapd has no problem balancing.
>> >
>> > However, by reading the information even after failing to
>> > balance, kswapd continues balancing for order-1 due to reading
>> > pgdat->kswapd_max_order, each time failing for the highest zone. It
>> > only takes one wakeup request per balance_pgdat() to keep kswapd
>> > awake trying to balance the highest zone in a continual loop.
>>
>> You made balace_pgdat's classzone_idx as communicated back so classzone_idx returned
>> would be not high zone and in [1/4], you changed that sleeping_prematurely consider only
>> classzone_idx not nr_zones. So I think it should sleep if low zones is balanced.
>>
>
> If a wakeup for order-1 happened during the last pgdat, the
> classzone_idx as communicated back from balance_pgdat() is lost and it
> will not sleep in this ordering of events
>
> kswapd                                 Âother processes
> ====== Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â===============
> order = balance_pgdat(pgdat, order, &classzone_idx);
> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Âwakeup for order-1
> kswapd balances lower zone
> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Âallocate from lower zone
> balance_pgdat fails balance for highest zone, returns
> Â Â Â Âwith lower classzone_idx and possibly lower order
> new_order = pgdat->kswapd_max_order   Â(order == 1)
> new_classzone_idx = pgdat->classzone_idx (highest zone)
> if (order < new_order || classzone_idx > new_classzone_idx) {
> Â Â Â Âorder = new_order;
> Â Â Â Âclasszone_idx = new_classzone_idx; (failure from balance_pgdat() lost)
> }
> order = balance_pgdat(pgdat, order, &classzone_idx);
>
> The wakup for order-1 at any point during balance_pgdat() is enough to
> keep kswapd awake even though the process that called wakeup_kswapd
> would be able to allocate from the lower zones without significant
> difficulty.
>
> This is why if balance_pgdat() fails its request, it should go to sleep
> if watermarks for the lower zones are met until woken by another
> process.

Hmm.

The role of kswapd is to reclaim pages by background until all of zone
meet HIGH_WMARK to prevent costly direct reclaim.(Of course, there is
another reason like GFP_ATOMIC). So it's not wrong to consume many cpu
usage by design unless other tasks are ready. It would be balanced or
unreclaimable at last so it should end up. However, the problem is
small part of highest zone is easily [set|reset] to be
all_unreclaimabe so the situation could be forever like our example.
So fundamental solution is to prevent it that all_unreclaimable is
set/reset easily, I think.
Unfortunately it have no idea now.

In different viewpoint, the problem is that it's too excessive
because kswapd is just best-effort and if it got fails, we have next
wakeup and even direct reclaim as last resort. In such POV, I think
this patch is right and it would be a good solution. Then, other
concern is on your reply about KOSAKI's question.

I think below your patch is needed.

Quote from
"
1. Read for balance-request-A (order, classzone) pair
2. Fail balance_pgdat
3. Sleep based on (order, classzone) pair
4. Wake for balance-request-B (order, classzone) pair where
balance-request-B != balance-request-A
5. Succeed balance_pgdat
6. Compare order,classzone with balance-request-A which will treat
balance_pgdat() as fail and try go to sleep

This is not the same as new_classzone_idx being "garbage" but is it
what you mean? If so, is this your proposed fix?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fe854d7..1a518e6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2770,6 +2770,8 @@ static int kswapd(void *p)
kswapd_try_to_sleep(pgdat, order, classzone_idx);
order = pgdat->kswapd_max_order;
classzone_idx = pgdat->classzone_idx;
+ new_order = order;
+ new_classzone_idx = classzone_idx;
"



-
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/