Re: [PATCH v3 0/5] Defer throttle when task exits to user
From: Matteo Martelli
Date: Fri Aug 08 2025 - 12:38:14 EST
Hi Aaron,
On Mon, 4 Aug 2025 15:52:04 +0800, Aaron Lu <ziqianlu@xxxxxxxxxxxxx> wrote:
> Hi Matteo,
>
> On Fri, Aug 01, 2025 at 04:31:25PM +0200, Matteo Martelli wrote:
> ... ...
> > I encountered this issue on a test image with both PREEMPT_RT and
> > CFS_BANDWIDTH kernel options enabled. The test image is based on
> > freedesktop-sdk (v24.08.10) [1] with custom system configurations on
> > top, and it was being run on qemu x86_64 with 4 virtual CPU cores. One
> > notable system configuration is having most of system services running
> > on a systemd slice, restricted on a single CPU core (with AllowedCPUs
> > systemd option) and using CFS throttling (with CPUQuota systemd option).
> > With this configuration I encountered RCU stalls during boots, I think
> > because of the increased probability given by multiple processes being
> > spawned simultaneously on the same core. After the first RCU stall, the
> > system becomes unresponsive and successive RCU stalls are detected
> > periodically. This seems to match with the deadlock situation described
> > in your cover letter. I could only reproduce RCU stalls with the
> > combination of both PREEMPT_RT and CFS_BANDWIDTH enabled.
> >
> > I previously already tested this patch set at v2 (RFC) [2] on top of
> > kernel v6.14 and v6.15. I've now retested it at v3 on top of kernel
> > v6.16-rc7. I could no longer reproduce RCU stalls in all cases with the
> > patch set applied. More specifically, in the last test I ran, without
> > patch set applied, I could reproduce 32 RCU stalls in 24 hours, about 1
> > or 2 every hour. In this test the system was rebooting just after the
> > first RCU stall occurrence (through panic_on_rcu_stall=1 and panic=5
> > kernel cmdline arguments) or after 100 seconds if no RCU stall occurred.
> > This means the system rebooted 854 times in 24 hours (about 3.7%
> > reproducibility). You can see below two RCU stall instances. I could not
> > reproduce any RCU stall with the same test after applying the patch set.
> > I obtained similar results while testing the patch set at v2 (RFC)[1].
> > Another possibly interesting note is that the original custom
> > configuration was with the slice CPUQuota=150%, then I retested it with
> > CPUQuota=80%. The issue was reproducible in both configurations, notably
> > even with CPUQuota=150% that to my understanding should correspond to no
> > CFS throttling due to the CPU affinity set to 1 core only.
>
> Agree. With cpu affinity set to 1 core, 150% quota should never hit. But
> from the test results, it seems quota is hit somehow because if quota is
> not hit, this series should make no difference.
>
> Maybe fire a bpftrace script and see if quota is actually hit? A
> reference script is here:
> https://lore.kernel.org/lkml/20250521115115.GB24746@bytedance/
>
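For reference, a minimal probe in the spirit of the referenced script
could look like this (a sketch, not the script itself; it assumes
throttle_cfs_rq is not inlined and can be attached as a kprobe):

  bpftrace -e 'kprobe:throttle_cfs_rq { printf("%s (pid %d) throttled on cpu %d\n", comm, pid, cpu); }'
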
I looked into this further and it turns out there was another slice
(user.slice) configured with CPUQuota=25%. After disabling the CPUQuota
limit on the first mentioned slice (system.slice), I could still
reproduce the RCU stalls. It looks like the throttling was happening
during the first login after boot, as shown by the following ftrace
logs.
[ 12.019263] podman-user-gen-992 [000] dN.2. 12.023684: throttle_cfs_rq <-pick_task_fair
[ 12.051074] systemd-981 [000] dN.2. 12.055502: throttle_cfs_rq <-pick_task_fair
[ 12.150067] systemd-981 [000] dN.2. 12.154500: throttle_cfs_rq <-put_prev_entity
[ 12.251448] systemd-981 [000] dN.2. 12.255839: throttle_cfs_rq <-put_prev_entity
[ 12.369867] sshd-session-976 [000] dN.2. 12.374293: throttle_cfs_rq <-pick_task_fair
[ 12.453080] bash-1002 [000] dN.2. 12.457502: throttle_cfs_rq <-pick_task_fair
[ 12.551279] bash-1012 [000] dN.2. 12.555701: throttle_cfs_rq <-pick_task_fair
[ 12.651085] podman-998 [000] dN.2. 12.655505: throttle_cfs_rq <-pick_task_fair
[ 12.750509] podman-1001 [000] dN.2. 12.754931: throttle_cfs_rq <-put_prev_entity
[ 12.868351] podman-1030 [000] dN.2. 12.872780: throttle_cfs_rq <-put_prev_entity
[ 12.961076] podman-1033 [000] dN.2. 12.965504: throttle_cfs_rq <-put_prev_entity
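For reference, a sketch of one way such throttle_cfs_rq hits can be
captured with the function tracer (assuming tracefs is mounted at
/sys/kernel/tracing):

  # trace only throttle_cfs_rq and stream the matching events
  echo throttle_cfs_rq > /sys/kernel/tracing/set_ftrace_filter
  echo function > /sys/kernel/tracing/current_tracer
  echo 1 > /sys/kernel/tracing/tracing_on
  cat /sys/kernel/tracing/trace_pipe
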
By increasing the user.slice CPUQuota limit to 50%, the same test
mentioned in my previous email produced fewer RCU stalls and fewer
throttling events in the ftrace logs. After setting the user.slice
CPUQuota to 100%, I could no longer reproduce either RCU stalls or
traced throttling events.
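For reference, such a per-slice quota can be adjusted at runtime with
systemctl set-property user.slice CPUQuota=50%, or persistently with a
drop-in along these lines (illustrative path; value matching the 50%
case above) followed by a daemon-reload:

  # /etc/systemd/system/user.slice.d/override.conf
  [Slice]
  CPUQuota=50%
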
> > I also ran some quick tests with stress-ng and systemd CPUQuota parameter to
> > verify that CFS throttling was behaving as expected. See details below after
> > RCU stall logs.
>
> Thanks for all these tests. If I read them correctly, in all these
> tests, CFS throttling worked as expected. Right?
>
Yes, correct.
> Best regards,
> Aaron
>
Best regards,
Matteo Martelli