Re: [mm, slub] 8ff60eb052: stress-ng.rawpkt.ops_per_sec -47.9% regression

From: Linus Torvalds
Date: Tue Mar 09 2021 - 18:57:53 EST


Jann,
it looks like that change of yours made a rather big negative impact
on this load.

On Sun, Feb 28, 2021 at 11:49 PM kernel test robot
<oliver.sang@xxxxxxxxx> wrote:
>
> FYI, we noticed a -47.9% regression of stress-ng.rawpkt.ops_per_sec due to commit:

Looking at the profile, nothing really obvious stands out, although
some of the numbers imply more polling, and less waiting, ie:

Lots less context switching:

> 12810 ± 17% -63.6% 4658 ± 8% sched_debug.cpu.nr_switches.avg

Less time spent sending packets:

> 68.52 -5.0 63.55 perf-profile.calltrace.cycles-pp.__dev_queue_xmit.packet_snd.sock_sendmsg.__sys_sendto.__x64_sys_sendto
> 69.00 -4.9 64.14 perf-profile.calltrace.cycles-pp.__sys_sendto.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe

and quite a lot more time spent in what looks like the receive path,
which allocates the packets:

> 5.47 ± 12% +4.5 9.94 ± 2% perf-profile.calltrace.cycles-pp.kmem_cache_alloc.skb_clone.packet_rcv.__netif_receive_skb_core.__netif_receive_skb_one_core
> 5.46 ± 12% +4.5 9.92 ± 2% perf-profile.calltrace.cycles-pp.skb_clone.packet_rcv.dev_queue_xmit_nit.dev_hard_start_xmit.__dev_queue_xmit
> 5.37 ± 12% +4.5 9.88 ± 2% perf-profile.calltrace.cycles-pp.kmem_cache_alloc.skb_clone.packet_rcv.dev_queue_xmit_nit.dev_hard_start_xmit
> 5.18 ± 13% +4.5 9.71 ± 2% perf-profile.calltrace.cycles-pp.__slab_alloc.kmem_cache_alloc.skb_clone.packet_rcv.__netif_receive_skb_core
> 5.08 ± 13% +4.6 9.66 ± 2% perf-profile.calltrace.cycles-pp.__slab_alloc.kmem_cache_alloc.skb_clone.packet_rcv.dev_queue_xmit_nit
> 29.57 ± 2% +5.5 35.12 perf-profile.calltrace.cycles-pp.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe
> 29.48 ± 2% +5.6 35.08 perf-profile.calltrace.cycles-pp.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe
> 26.62 ± 2% +7.1 33.67 perf-profile.calltrace.cycles-pp.packet_recvmsg.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe
> 8.83 ± 15% +8.8 17.66 ± 2% perf-profile.calltrace.cycles-pp._raw_spin_lock.get_partial_node.___slab_alloc.__slab_alloc.kmem_cache_alloc
> 8.58 ± 15% +8.9 17.48 ± 2% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.get_partial_node.___slab_alloc.__slab_alloc
> 9.46 ± 14% +9.0 18.44 ± 2% perf-profile.calltrace.cycles-pp.get_partial_node.___slab_alloc.__slab_alloc.kmem_cache_alloc.skb_clone
> 10.19 ± 13% +9.1 19.28 ± 2% perf-profile.calltrace.cycles-pp.___slab_alloc.__slab_alloc.kmem_cache_alloc.skb_clone.packet_rcv
> 14.31 ± 6% +14.0 28.27 ± 3% perf-profile.calltrace.cycles-pp.skb_free_datagram.packet_recvmsg.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64
> 9.50 ± 11% +16.7 26.21 ± 4% perf-profile.calltrace.cycles-pp.kmem_cache_free.skb_free_datagram.packet_recvmsg.__sys_recvfrom.__x64_sys_recvfrom
> 6.09 ± 18% +17.8 23.88 ± 5% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.unfreeze_partials.put_cpu_partial.kmem_cache_free
> 6.37 ± 18% +17.9 24.23 ± 5% perf-profile.calltrace.cycles-pp._raw_spin_lock.unfreeze_partials.put_cpu_partial.kmem_cache_free.skb_free_datagram
> 6.92 ± 17% +18.0 24.94 ± 5% perf-profile.calltrace.cycles-pp.unfreeze_partials.put_cpu_partial.kmem_cache_free.skb_free_datagram.packet_recvmsg
> 7.03 ± 16% +18.0 25.06 ± 5% perf-profile.calltrace.cycles-pp.put_cpu_partial.kmem_cache_free.skb_free_datagram.packet_recvmsg.__sys_recvfrom

.. and I think the reason is here:

> 26.26 ± 6% +23.8 50.07 ± 3% perf-profile.children.cycles-pp._raw_spin_lock

Look at that +23.8 for _raw_spin_lock, and it really shows up here too:

> 25.86 ± 6% +23.4 49.22 ± 3% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath

I think what is going on is that your change makes the "contention on
the freelist" case loop - possibly several times, and expensively, with
atomic operations - while still holding the 'n->list_lock' spinlock in
get_partial_node().

End result: contention on the freelist now becomes *huge* contention
on that list_lock instead.
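
To make that concrete, this is roughly the shape of the code I'm
talking about - a simplified sketch of get_partial_node() in mm/slub.c
from memory, with details elided, so don't take it as the exact source:

static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
                              struct kmem_cache_cpu *c, gfp_t flags)
{
        struct page *page, *page2;
        void *object = NULL;
        int objects;

        spin_lock(&n->list_lock);
        list_for_each_entry_safe(page, page2, &n->partial, slab_list) {
                /*
                 * acquire_slab() races against concurrent freelist users
                 * with a cmpxchg, so under contention it can fail.
                 */
                void *t = acquire_slab(s, n, page, object == NULL, &objects);

                if (!t)
                        continue;       /* was "break" before your change */

                /* ... grab the objects, maybe park the page as a cpu partial ... */
        }
        spin_unlock(&n->list_lock);
        return object;
}

So every time acquire_slab() loses that race, we now keep walking the
partial list with the node lock held, instead of getting out from under
it.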

Before, it would release the list lock and then generally (maybe) try
again - or, more likely, just get another page and avoid the contention
entirely.
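
And the free side that dominates the profile above is fighting over the
very same lock: kmem_cache_free() -> put_cpu_partial() eventually has
to drain the per-cpu partial pages, and that drain runs under the
per-node list_lock too. Again a rough, simplified sketch from memory
(assuming CONFIG_SLUB_CPU_PARTIAL), not the exact code:

static void unfreeze_partials(struct kmem_cache *s, struct kmem_cache_cpu *c)
{
        struct page *page;

        while ((page = c->partial)) {
                struct kmem_cache_node *n = get_node(s, page_to_nid(page));

                c->partial = page->next;        /* pop it off the cpu-local list */

                /* same lock that get_partial_node() holds across its scan */
                spin_lock(&n->list_lock);
                /* ... unfreeze the page and put it back on n->partial ... */
                spin_unlock(&n->list_lock);
        }
}

(The real code batches the locking per node, but the point stands: the
free path and the now-longer partial-list scan on the allocation path
serialize on the same n->list_lock.)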

So when you wrote:

    However, the current code accidentally stops looking at the partial list
    completely in that case. Especially on kernels without CONFIG_NUMA set,
    this means that get_partial() fails and new_slab_objects() falls back to
    new_slab(), allocating new pages. This could lead to an unnecessary
    increase in memory fragmentation.

it really looks like this might well have been very intentional
indeed. Or at least very beneficial for _some_ loads.
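
(For reference, the fallback path you describe there is roughly this -
again a simplified sketch from memory, not the exact code:

static void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
                              int node, struct kmem_cache_cpu **pc)
{
        void *freelist = get_partial(s, flags, node, *pc);
        struct page *page;

        if (freelist)
                return freelist;

        /*
         * The partial lists gave us nothing - and with the old "break",
         * "somebody beat us to the freelist" counted as nothing - so we
         * allocate a brand new slab page instead of fighting over the
         * existing ones.
         */
        page = new_slab(s, flags, node);
        if (!page)
                return NULL;

        /* ... install it as the cpu slab and hand back its freelist ... */
        return page->freelist;
}

which is exactly the "avoid the contention" behaviour I mean above.)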

Comments?

Linus