Re: [PATCH] slub: limit number of slabs to scan in count_partial()

From: Vlastimil Babka
Date: Tue Apr 16 2024 - 16:14:54 EST


On 4/16/24 8:58 PM, Jianfeng Wang wrote:
>
>
> On 4/15/24 12:35 AM, Vlastimil Babka wrote:
>> On 4/13/24 3:17 AM, Jianfeng Wang wrote:
>>>
>>>> On Apr 12, 2024, at 1:44 PM, Jianfeng Wang <jianfeng.w.wang@xxxxxxxxxx> wrote:
>>>>
>>>> On 4/12/24 1:20 PM, Vlastimil Babka wrote:
>>>>> On 4/12/24 7:29 PM, Jianfeng Wang wrote:
>>>>>>
>>>>>> On 4/12/24 12:48 AM, Vlastimil Babka wrote:
>>>>>>> On 4/11/24 7:02 PM, Christoph Lameter (Ampere) wrote:
>>>>>>>> On Thu, 11 Apr 2024, Jianfeng Wang wrote:
>>>>>>>>
>>>>>>>>> So, the fix is to limit the number of slabs to scan in
>>>>>>>>> count_partial() and output an approximate result if the list is too
>>>>>>>>> long. Default to 10000, which should be enough for most sane cases.
>>>>>>>>
>>>>>>>>
>>>>>>>> That is a creative approach. The problem though is that the slabs on
>>>>>>>> the partial lists are kind of sorted. The partial slabs with only a
>>>>>>>> few free objects available are at the start of the list, so that
>>>>>>>> allocations cause them to become full and be removed from the partial
>>>>>>>> list fast. Full slabs do not need to be tracked on any list.
>>>>>>>>
>>>>>>>> The partial slabs with only a few objects in use are put at the end of
>>>>>>>> the partial list, in the hope that the few remaining objects will also
>>>>>>>> be freed, which would allow the slab folio itself to be freed.
>>>>>>>>
>>>>>>>> So the object density may be higher at the beginning of the list.
>>>>>>>>
>>>>>>>> kmem_cache_shrink() will explicitly sort the partial lists to put the
>>>>>>>> partial slabs in that order.
>>>>>>>>
>>>
>>> Realized that I could do "echo 1 > /sys/kernel/slab/dentry/shrink" to sort the list explicitly.
>>> After that, the numbers become:
>>> N = 10000 -> diff = 7.1 %
>>> N = 20000 -> diff = 5.7 %
>>> N = 25000 -> diff = 5.4 %
>>> So, expecting ~5-7% difference after shrinking.
>>>
>>>>>>>> Can you run some tests showing the difference between the estimation and
>>>>>>>> the real count?
>>>>>>
>>>>>> Yes.
>>>>>> On a server with one NUMA node, I created a test case that uses many dentry objects.
>>>>>
>>>>> Could you describe in more detail how you make the dentry cache grow
>>>>> such a large partial slab list? Thanks.
>>>>>
>>>>
>>>> I utilized the fact that creating a folder will create a new dentry
>>>> object, and deleting a folder will delete all of its sub-folders' dentry
>>>> objects.
>>>>
>>>> Then, I created N folders, each of which has M empty sub-folders. These
>>>> operations consume a large number of dentry objects in sequential order,
>>>> so their slabs were very likely to be full slabs. After all folders were
>>>> created, I deleted a subset of the N folders (i.e., one out of every two
>>>> folders). This created many holes, which turned a subset of the full
>>>> slabs into partial slabs.
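
For reference, that kind of workload can be reproduced with something along
these lines (an illustrative sketch only; the folder counts N and M and the
/tmp base path are arbitrary placeholders, not taken from the report above):
---
/* Illustrative sketch: create N folders with M empty sub-folders each,
 * then delete every other top-level folder (and its sub-folders) to
 * punch free-object holes into the dentry slabs. */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define N 100000	/* arbitrary */
#define M 10		/* arbitrary */

int main(void)
{
	char path[256];
	int i, j;

	for (i = 0; i < N; i++) {
		snprintf(path, sizeof(path), "/tmp/d%d", i);
		mkdir(path, 0755);
		for (j = 0; j < M; j++) {
			snprintf(path, sizeof(path), "/tmp/d%d/s%d", i, j);
			mkdir(path, 0755);
		}
	}

	/* Delete every other top-level folder and its sub-folders. */
	for (i = 0; i < N; i += 2) {
		for (j = 0; j < M; j++) {
			snprintf(path, sizeof(path), "/tmp/d%d/s%d", i, j);
			rmdir(path);
		}
		snprintf(path, sizeof(path), "/tmp/d%d", i);
		rmdir(path);
	}
	return 0;
}
---
Reading /proc/slabinfo afterwards then has to walk the resulting long
dentry partial list.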
>>
>> Thanks, right, so that's quite a deterministic way to achieve long partial
>> lists with a very close to uniform free/used ratio, so no wonder the
>> resulting accuracy is good and the diff is very small. But in practice the
>> workloads that may lead to long lists will not be so uniform. The result
>> after shrinking shows what happens if there's a bias in which slabs we
>> inspect due to the sorting. But still, most of the slabs will have the
>> near-uniform free/used ratio, so the sorting will not make much of a
>> difference. Another workload might, though.
>>
>> So what happens if you inspect X slabs from the head and X from the tail as
>> I suggested? That should help your test case even after you sort, and also
>> should in theory be more accurate even for less uniform workloads.
>
> Yes, the approach of counting from both directions and then approximating
> works better after sorting the partial list.

Yeah, I think we could go with that approach then. Let's do 5000 from each
side. You can check whether n->nr_partial < 10000 and then just scan the
whole list in a single direction with no approximation, and otherwise scan
5000 from each side with approximation. I think the code as you show it
below will scan some slabs in the middle of the list twice if there are
between 5000 and 10000 slabs on the list, so checking n->nr_partial would
avoid that. Thanks!
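
Something like the following (an untested sketch; it assumes a single
MAX_PARTIAL_TO_SCAN limit of 10000 in total, i.e. 5000 from each side, and
reuses the node_nr_objs() and mult_frac() helpers from your snippet):
---
static unsigned long count_partial(struct kmem_cache_node *n,
					int (*get_count)(struct slab *))
{
	unsigned long flags;
	unsigned long x = 0;
	unsigned long scanned = 0;
	struct slab *slab;

	spin_lock_irqsave(&n->list_lock, flags);
	if (n->nr_partial <= MAX_PARTIAL_TO_SCAN) {
		/* Short list: count every slab exactly. */
		list_for_each_entry(slab, &n->partial, slab_list)
			x += get_count(slab);
	} else {
		/*
		 * Long list: count MAX_PARTIAL_TO_SCAN / 2 slabs from the
		 * head and the same number from the tail, then extrapolate
		 * from the average per scanned slab.
		 */
		list_for_each_entry(slab, &n->partial, slab_list) {
			x += get_count(slab);
			if (++scanned == MAX_PARTIAL_TO_SCAN / 2)
				break;
		}
		list_for_each_entry_reverse(slab, &n->partial, slab_list) {
			x += get_count(slab);
			if (++scanned == MAX_PARTIAL_TO_SCAN)
				break;
		}
		/* Approximate total count of objects on the partial list. */
		x = mult_frac(x, n->nr_partial, scanned);
		x = min(x, node_nr_objs(n));
	}
	spin_unlock_irqrestore(&n->list_lock, flags);
	return x;
}
---
This way no slab is visited twice, and the exact count is still returned
whenever the list fits under the limit.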

> Here is what I did.
> ---
> +static unsigned long count_partial(struct kmem_cache_node *n,
> +					int (*get_count)(struct slab *))
> +{
> +	unsigned long flags;
> +	unsigned long x = 0;
> +	unsigned long scanned = 0;
> +	struct slab *slab;
> +
> +	spin_lock_irqsave(&n->list_lock, flags);
> +	list_for_each_entry(slab, &n->partial, slab_list) {
> +		x += get_count(slab);
> +		if (++scanned > MAX_PARTIAL_TO_SCAN)
> +			break;
> +	}
> +
> +	if (scanned > MAX_PARTIAL_TO_SCAN) {
> +		scanned = 0;
> +		list_for_each_entry_reverse(slab, &n->partial, slab_list) {
> +			x += get_count(slab);
> +			if (++scanned > MAX_PARTIAL_TO_SCAN) {
> +				/* Approximate total count of objects */
> +				x = mult_frac(x, n->nr_partial, scanned * 2);
> +				x = min(x, node_nr_objs(n));
> +				break;
> +			}
> +		}
> +	}
> +	spin_unlock_irqrestore(&n->list_lock, flags);
> +	return x;
> +}
> ---
>
> Results:
> ---
> * Pre-shrink:
> MAX_SLAB_TO_SCAN | Diff (single-direction) | Diff (double-direction) |
> 1000 | 0.43 % | 0.80 % |
> 5000 | 0.06 % | 0.16 % |
> 10000 | 0.02 % | -0.003 % |
> 20000 | 0.009 % | 0.03 % |
>
> * After-shrink:
> MAX_SLAB_TO_SCAN | Diff (single-direction) | Diff (double-direction) |
> 1000 | 12.46 % | 3.60 % |
> 5000 | 5.38 % | 0.22 % |
> 10000 | 4.99 % | -0.06 % |
> 20000 | 4.86 % | -0.17 % |
> ---
>
> For |MAX_SLAB_TO_SCAN| >= 5000, count_partial() returns the exact
> object count if the length of the partial list is not larger than
> |MAX_SLAB_TO_SCAN|. Otherwise, it counts |MAX_SLAB_TO_SCAN| slabs from
> both the head and the tail, and outputs an approximation that shows a
> <1% difference.
>
> With a slightly larger limit (like 10000), count_partial() should
> produce the exact number for most cases (those that won't lead to a
> lockup) and, for the rest, avoid lockups by returning a good estimate.
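
To make the approximation concrete with made-up numbers: if n->nr_partial is
50000 and the 5000 slabs scanned from each end contain 150000 objects in
total (an average of 15 per scanned slab), the estimate becomes
mult_frac(150000, 50000, 10000) = 750000 objects, clamped to node_nr_objs(n).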