Re: swapping and the value of /proc/sys/vm/swappiness

From: Con Kolivas
Date: Thu Sep 16 2004 - 19:27:40 EST


Marcelo Tosatti wrote:
Con!

Spent some time reading your patch...

Great!

Well if "distress" is getting higher (with similar workload/pressure) thats because VM is having a harder time freeing pages (priority increases,
distress increases).

You say "distress is getting higher in later kernels". Can you expand
more on that? How did you find this out, and can you be more especific
wrt "later kernels".

When I say earlier kernels I mean prior to 2.6.8.

I'm still referring to the "hard_swappiness patch" that Florin was using to fix his problem which diagnosed that distress increased. I'm sorry if my newer patch confuses the issue. hard_swappiness effectively changed this:

-distress = 100 >> zone->prev_priority
-mapped_ratio = (sc->nr_mapped * 100) / total_memory;
-swap_tendency = mapped_ratio / 2 + distress + vm_swappiness
-if (swap_tendency >= 100)
- reclaim_mapped = 1;

into this:

+mapped_ratio = (sc->nr_mapped * 100) / total_memory;
+swap_tendency = mapped_ratio / 2 + vm_swappiness
+if (swap_tendency >= 100)
+ reclaim_mapped = 1;

This made swap_tendency dependant _only_ on the mapped_ratio. Now if you load up the same desktop and applications your mapped_ratio will be virtually identical regardless of the kernel. If you then copy a large file or convert a large video file etc, then the mapped ratio will be unchanged. Therefore if the swapping increased with this workload in 2.6.8 and later kernels but did _not_ increase with hard_swappiness it must be the "distress" value which is entirely dependant on zone->prev_priority. Does that make my conclusion clearer?


Below here you're referring to my mapped_watermark patch so I'll address that separately to avoid confusion.

I see you add a "z->nr_unmapped" watermark a bit above "z->pages_high", and use that to set "pgdat->mapped_nrpages" to what needs to be freed so z->free_pages reaches "z->nr_unmapped".

And then you use that per-pgdat "mapped_nrpages" count to avoid:

- moving mapped pages to inactive list (wasting the swappiness algorithm)
- swapping out pages at shrink_list

Those two only happen when pgdat->mapped_nrpages is zero, which becomes true when we go below pages_low.

To resume, deactivation/swapout of mapped pages only happens when we go any zone pages_low.

Correct?

Yes apart from one big caveat. scanning is expensive, so it only scans at lowest priority (DEF_PRIORITY). If it fails to release enough memory it simply returns quietly. This means that if vm pressure is hard enough and occurs frequently/fast enough it will still drop down below pages_high even if the watermarks have not been re-achieved. Then the normal algorithm will take over.

Now with v2.6 stock kernel, kswapd will deactivate (using vm_swappiness algorithm)
and swapout pages between the low and high zone watermarks.

That avoids swapping out as hard as possible until we go below pages_low.

IMHO this might be OK for common desktop workloads where people complain about swap, but might be harmful for other workloads where swapping out on
advance unused anonymous process memory is a _gain_.

As I said, it only does it lightly, and it's tunable.

I dont understand this check on balance_pgdat (kswapd worker function):

+ if (maplimit && sc.nr_mapped * 100 / total_memory > vm_mapped)
+ return 0;
+

So "if not any zone is under pages_low, and more than vm_mapped % of ram
is mapped, bail out."

This will only be hit if "maplimit" is true. This means we have entered balance_pgdat only due to the unmapped watermark (zone->pages_min * 4). Here is where the real "tunable" comes into play. If greater than vm_mapped % of ram is mapped (ie application) pages, it will not do anything at this watermark. By default it is set to 66%. Setting it to 0 inactivates this patch entirely and makes the vm behave much like setting swappiness to 100 in mainline.

I still think swapout behaviour can be correctly tuned with vm_swappiness,
and agree with Andrew on that we should not change anything in the algorithm
if this can be tuned.

I agree it can be, but something in the logic has definitely changed, and a different value is not giving users like Florin the desired result any more.

Cheers,
Con
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/