Re: [PATCH V3] perf: Fix the throttle error of some clock events

From: Liang, Kan
Date: Fri Jun 06 2025 - 14:38:24 EST




On 2025-06-06 1:42 p.m., Alexei Starovoitov wrote:
> On Fri, Jun 6, 2025 at 6:05 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>>
>>
>>
>> On 2025-06-05 8:39 p.m., Alexei Starovoitov wrote:
>>> On Thu, Jun 5, 2025 at 4:50 PM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>>>>
>>>>
>>>>
>>>> On 2025-06-05 4:46 p.m., Alexei Starovoitov wrote:
>>>>> On Thu, Jun 5, 2025 at 1:24 PM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2025-06-05 2:45 p.m., Alexei Starovoitov wrote:
>>>>>>> On Thu, Jun 5, 2025 at 6:46 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2025-06-04 7:21 p.m., Alexei Starovoitov wrote:
>>>>>>>>> On Wed, Jun 4, 2025 at 10:16 AM <kan.liang@xxxxxxxxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>> From: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>
>>>>>>>>>>
>>>>>>>>>> Both the ARM and IBM CIs report an RCU stall, which can be
>>>>>>>>>> reproduced by the below perf command:
>>>>>>>>>> perf record -a -e cpu-clock -- sleep 2
>>>>>>>>>>
>>>>>>>>>> The issue is introduced by the generic throttle patch set, which
>>>>>>>>>> unconditionally invokes event_stop() when the throttle is triggered.
>>>>>>>>>>
>>>>>>>>>> The cpu-clock and task-clock are two special SW events which rely
>>>>>>>>>> on the hrtimer. For them, the throttle is invoked from within the
>>>>>>>>>> hrtimer handler. The event_stop()->hrtimer_cancel() then waits for
>>>>>>>>>> that very handler to finish, which is a deadlock. Instead of
>>>>>>>>>> invoking stop(), HRTIMER_NORESTART should be used to stop the timer.
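>>>>>>>>>>
>>>>>>>>>> To illustrate the constraint, a simplified sketch of the handler
>>>>>>>>>> (details elided, not the exact code):
>>>>>>>>>>
>>>>>>>>>> static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
>>>>>>>>>> {
>>>>>>>>>>         struct perf_event *event =
>>>>>>>>>>                 container_of(hrtimer, struct perf_event, hw.hrtimer);
>>>>>>>>>>
>>>>>>>>>>         /* ... __perf_event_overflow() runs here and may throttle ... */
>>>>>>>>>>
>>>>>>>>>>         /*
>>>>>>>>>>          * The only safe way to stop the timer from inside its own
>>>>>>>>>>          * handler is to return HRTIMER_NORESTART. Calling
>>>>>>>>>>          * hrtimer_cancel() from here would wait for this very
>>>>>>>>>>          * handler to finish: a self-deadlock.
>>>>>>>>>>          */
>>>>>>>>>>         if (event->state != PERF_EVENT_STATE_ACTIVE)
>>>>>>>>>>                 return HRTIMER_NORESTART;
>>>>>>>>>>
>>>>>>>>>>         hrtimer_forward_now(hrtimer, ns_to_ktime(event->hw.sample_period));
>>>>>>>>>>         return HRTIMER_RESTART;
>>>>>>>>>> }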
>>>>>>>>>>
>>>>>>>>>> There are two possible ways to fix it:
>>>>>>>>>> - Introduce a PMU flag to track the case, and avoid the event_stop()
>>>>>>>>>> in perf_event_throttle() if the flag is set (see the sketch after
>>>>>>>>>> this list). This was implemented in
>>>>>>>>>> https://lore.kernel.org/lkml/20250528175832.2999139-1-kan.liang@xxxxxxxxxxxxxxx/
>>>>>>>>>> but the new flag was considered overkill for the issue.
>>>>>>>>>> - Add a check in event_stop(), and return immediately if the throttle
>>>>>>>>>> is invoked from the hrtimer handler, relying on the existing
>>>>>>>>>> HRTIMER_NORESTART method to stop the timer.
>>>>>>>>>>
>>>>>>>>>> The latter is implemented here.
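>>>>>>>>>>
>>>>>>>>>> For reference, option 1 would have looked roughly like the below
>>>>>>>>>> (a hypothetical sketch; the capability name is made up here, and
>>>>>>>>>> the actual flag in the linked patch may differ):
>>>>>>>>>>
>>>>>>>>>> static void perf_event_throttle(struct perf_event *event)
>>>>>>>>>> {
>>>>>>>>>>         event->hw.interrupts = MAX_INTERRUPTS;
>>>>>>>>>>         /* PMUs with this (made-up) flag stop via HRTIMER_NORESTART */
>>>>>>>>>>         if (!(event->pmu->capabilities & PERF_PMU_CAP_NO_THROTTLE_STOP))
>>>>>>>>>>                 event->pmu->stop(event, 0);
>>>>>>>>>>         if (event == event->group_leader)
>>>>>>>>>>                 perf_log_throttle(event, 0);
>>>>>>>>>> }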
>>>>>>>>>>
>>>>>>>>>> Move event->hw.interrupts = MAX_INTERRUPTS before the stop(). This
>>>>>>>>>> makes the order the same as in perf_event_unthrottle(). Apart from
>>>>>>>>>> this patch, nothing checks hw.interrupts in stop(), so the order
>>>>>>>>>> change has no other impact.
>>>>>>>>>>
>>>>>>>>>> Reported-by: Leo Yan <leo.yan@xxxxxxx>
>>>>>>>>>> Reported-by: Aishwarya TCV <aishwarya.tcv@xxxxxxx>
>>>>>>>>>> Closes: https://lore.kernel.org/lkml/20250527161656.GJ2566836@xxxxxxxxxxxxxxx/
>>>>>>>>>> Reported-by: Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx>
>>>>>>>>>> Closes: https://lore.kernel.org/lkml/djxlh5fx326gcenwrr52ry3pk4wxmugu4jccdjysza7tlc5fef@ktp4rffawgcw/
>>>>>>>>>> Reported-by: Venkat Rao Bagalkote <venkat88@xxxxxxxxxxxxx>
>>>>>>>>>> Closes: https://lore.kernel.org/lkml/8e8f51d8-af64-4d9e-934b-c0ee9f131293@xxxxxxxxxxxxx/
>>>>>>>>>> Signed-off-by: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>
>>>>>>>>>
>>>>>>>>> It seems the patch fixes one issue and introduces another?
>>>>>>>>>
>>>>>>>>> Looks like the throttle event is sticky.
>>>>>>>>> Once it's reached, the perf_event no longer works?
>>>>>>>>
>>>>>>>> No. It should still work even after the throttle is triggered.
>>>>>>>>
>>>>>>>> sdp@d404e6bce080:~$ sudo bash -c 'echo 10 > /proc/sys/kernel/perf_event_max_sample_rate'
>>>>>>>> sdp@d404e6bce080:~$ sudo perf record -a -e cpu-clock -c10000 -- sleep 1
>>>>>>>> [ perf record: Woken up 1 times to write data ]
>>>>>>>> [ perf record: Captured and wrote 0.559 MB perf.data (584 samples) ]
>>>>>>>
>>>>>>> With the patch applied, the above command hangs in my VM:
>>>>>>>
>>>>>>> # perf record -a -e cpu-clock -c10000 -- sleep 1
>>>>>>> [ 43.656855] hrtimer: interrupt took 21640 ns
>>>>>>> [ 68.561052] watchdog: BUG: soft lockup - CPU#0 stuck for 41s! [perf:2253]
>>>>>>> [ 68.561056] Modules linked in: bpf_preload
>>>>>>> [ 68.561060] CPU: 0 UID: 0 PID: 2253 Comm: perf Not tainted
>>>>>>> 6.15.0-12294-gc89e5202e569 #1172 PREEMPT
>>>>>>> [ 68.561062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>>>>>>> BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>>>>>>> [ 68.561063] RIP: 0010:handle_softirqs+0x77/0x2a0
>>>>>>> [ 68.561070] Code: 6b 96 02 00 01 00 00 bd 0a 00 00 00 44 89 64 24
>>>>>>> 14 89 6c 24 10 40 88 7c 24 04 31 c0 65 66 89 05 5f 6b 96 02 fb bb ff
>>>>>>> ff ff ff <48> c7 c0 c0 f
>>>>>>> [ 68.561071] RSP: 0018:ffa0000000003fa0 EFLAGS: 00000246
>>>>>>> [ 68.561072] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: 00000000000006e0
>>>>>>> [ 68.561073] RDX: 0000000000000007 RSI: ff1100010212e100 RDI: 0000000000000000
>>>>>>> [ 68.561074] RBP: 000000000000000a R08: 0000000000000000 R09: 7fffffffffffffff
>>>>>>> [ 68.561074] R10: 00000005a3af2140 R11: 0000000000004601 R12: 0000000000400100
>>>>>>> [ 68.561075] R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000000
>>>>>>> [ 68.561111] FS: 00007fa1b536f780(0000) GS:ff110004abe26000(0000)
>>>>>>> knlGS:0000000000000000
>>>>>>> [ 68.561112] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>> [ 68.561113] CR2: 0000000001c67a30 CR3: 000000010f2f3003 CR4: 00000000003716f0
>>>>>>> [ 68.561113] Call Trace:
>>>>>>> [ 68.561170] <IRQ>
>>>>>>> [ 68.561174] irq_exit_rcu+0x91/0xb0
>>>>>>> [ 68.561176] sysvec_apic_timer_interrupt+0x71/0x90
>>>>>>> [ 68.561181] </IRQ>
>>>>>>> [ 68.561181] <TASK>
>>>>>>> [ 68.561182] asm_sysvec_apic_timer_interrupt+0x1a/0x20
>>>>>>> [ 68.561184] RIP: 0010:generic_exec_single+0x33/0x120
>>>>>>> [ 68.561188] Code: 65 39 3d 5c 4d 89 02 74 28 3b 3d b8 f5 17 02 0f
>>>>>>> 83 de 00 00 00 89 f8 48 0f a3 05 d8 f1 17 02 0f 83 ce 00 00 00 e8 bd
>>>>>>> fe ff ff <31> c0 5b 5d 9
>>>>>>> [ 68.561189] RSP: 0018:ffa00000025a3cc0 EFLAGS: 00000206
>>>>>>> [ 68.561190] RAX: 0000000000000000 RBX: ffffffff8145b310 RCX: ff110004abe26000
>>>>>>> [ 68.561190] RDX: 0000000000000007 RSI: 0000000000000000 RDI: ff1100042fa26540
>>>>>>> [ 68.561191] RBP: 0000000000000202 R08: 0000000000000000 R09: 0000000000000000
>>>>>>> [ 68.561191] R10: ffa00000025a3ee0 R11: 0000000000000000 R12: ffa00000025a3d40
>>>>>>> [ 68.561192] R13: ff11000100881220 R14: ff11000100880fc0 R15: 0000000000000000
>>>>>>> [ 68.561192] ? sw_perf_event_destroy+0x70/0x70
>>>>>>> [ 68.561197] smp_call_function_single+0xc4/0x110
>>>>>>> [ 68.561199] ? sw_perf_event_destroy+0x70/0x70
>>>>>>> [ 68.561200] event_function_call+0x160/0x170
>>>>>>> [ 68.561202] ? ctx_resched+0x2d0/0x2d0
>>>>>>> [ 68.561205] ? perf_event_set_state+0x60/0x60
>>>>>>> [ 68.561206] ? _perf_event_disable+0x50/0x50
>>>>>>> [ 68.561208] perf_event_for_each_child+0x37/0x80
>>>>>>> [ 68.561209] ? _perf_event_disable+0x50/0x50
>>>>>>> [ 68.561211] _perf_ioctl+0x1df/0xad0
>>>>>>> [ 68.561213] ? __set_cpus_allowed_ptr+0x71/0x80
>>>>>>> [ 68.561216] ? avc_has_perm+0x72/0x160
>>>>>>> [ 68.561219] ? ldsem_down_write+0x1bb/0x1fc
>>>>>>> [ 68.561222] perf_ioctl+0x42/0x70
>>>>>>> [ 68.561224] __x64_sys_ioctl+0x8f/0xd0
>>>>>>> [ 68.561226] do_syscall_64+0x46/0x160
>>>>>>> [ 68.561228] entry_SYSCALL_64_after_hwframe+0x4b/0x53
>>>>>>>
>>>>>>> but only after a fresh boot.
>>>>>>>
>>>>>>> If I run the bpf selftests before that, it works.
>>>>>>> E.g., test_progs -t stacktrace_build_id_nmi
>>>>>>> followed by the same
>>>>>>> perf record -a -e cpu-clock ..
>>>>>>> works fine.
>>>>>>>
>>>>>>
>>>>>> There is a bug in V3. When stopping in the throttle path, the event
>>>>>> should not be updated, i.e., stop(event, 0). But cpu_clock_event_stop()
>>>>>> doesn't honor the flag, which changes the behavior a little bit.
>>>>>>
>>>>>> Could you please try the below patch and see if it helps?
>>>>>>
>>>>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>>>>> index cd85b1820e7d..b6c57ba24e78 100644
>>>>>> --- a/kernel/events/core.c
>>>>>> +++ b/kernel/events/core.c
>>>>>> @@ -2656,8 +2656,8 @@ static void perf_event_unthrottle(struct perf_event *event, bool start)
>>>>>>
>>>>>>  static void perf_event_throttle(struct perf_event *event)
>>>>>>  {
>>>>>> -        event->pmu->stop(event, 0);
>>>>>>          event->hw.interrupts = MAX_INTERRUPTS;
>>>>>> +        event->pmu->stop(event, 0);
>>>>>>          if (event == event->group_leader)
>>>>>>                  perf_log_throttle(event, 0);
>>>>>>  }
>>>>>> @@ -11777,7 +11777,12 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
>>>>>>  {
>>>>>>          struct hw_perf_event *hwc = &event->hw;
>>>>>>
>>>>>> -        if (is_sampling_event(event)) {
>>>>>> +        /*
>>>>>> +         * The throttle can be triggered in the hrtimer handler.
>>>>>> +         * The HRTIMER_NORESTART should be used to stop the timer,
>>>>>> +         * rather than hrtimer_cancel(). See perf_swevent_hrtimer()
>>>>>> +         */
>>>>>> +        if (is_sampling_event(event) && (hwc->interrupts != MAX_INTERRUPTS)) {
>>>>>>                  ktime_t remaining = hrtimer_get_remaining(&hwc->hrtimer);
>>>>>>                  local64_set(&hwc->period_left, ktime_to_ns(remaining));
>>>>>>
>>>>>> @@ -11832,7 +11837,8 @@ static void cpu_clock_event_start(struct perf_event *event, int flags)
>>>>>>  static void cpu_clock_event_stop(struct perf_event *event, int flags)
>>>>>>  {
>>>>>>          perf_swevent_cancel_hrtimer(event);
>>>>>> -        cpu_clock_event_update(event);
>>>>>> +        if (flags & PERF_EF_UPDATE)
>>>>>> +                cpu_clock_event_update(event);
>>>>>>  }
>>>>>
>>>>> Nope. The last hunk didn't make any difference.
>>>>> Same soft lockup.
>>>>
>>>> Thanks for the verification.
>>>>
>>>> I have some questions, could you please help to clarify?
>>>>
>>>> - What's the value of /proc/sys/kernel/perf_event_max_sample_rate in
>>>> the test?
>>>
>>> 100000
>>>
>>>> - Can the soft lockup issue be reproduced by both the bpf selftests
>>>> and the perf record command? Or just the perf record command?
>>>
>>> just perf record.
>>>
>>>> - Can the soft lockup issue be reproduced with the v2 patchset?
>>>> https://lore.kernel.org/lkml/20250528175832.2999139-1-kan.liang@xxxxxxxxxxxxxxx/
>>>
>>> yes. With v2
>>> perf record -a -e cpu-clock -c10000 -- sleep 1
>>>
>>> soft locks up as well with the same stack trace.
>>>
>>>> - Furthermore, can the soft lockup issue be reproduced after reverting
>>>> the recent generic throttle logic fixes?
>>>> commit e800ac51202f ("perf: Only dump the throttle log for the leader")
>>>> commit 9734e25fbf5a ("perf: Fix the throttle logic for a group")
>>>
>>> yes.
>>>
>>> perf record -a -e cpu-clock -- sleep 1
>>> works
>>>
>>> but
>>> perf record -a -e cpu-clock -c100 -- sleep 1
>>> locks up.
>>> -c1, -c100, and -c10000 all lock up.
>>
>> Thanks for the confirmation.
>>
>> The soft lockup issue is likely an existing issue in virtualization,
>> unrelated to the recent throttle patches.
>>
>> The issue should be caused by the overwhelming number of overflows.
>> The -c option means a fixed period, which is too small for a SW event
>> in a VM. For a 1 GHz CPU, -c10000 triggers 100,000 samples per second,
>> and -c1 would trigger 1,000,000,000 samples per second.
>> The perf_event_max_sample_rate is set to 100000, so the throttle
>> mechanism is triggered for the above cases, but even 100,000 samples
>> per second still seems to be too big a number in virtualization.
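>>
>> For reference, the per-overflow throttle decision is roughly this
>> (heavily simplified from __perf_event_account_interrupt()):
>>
>>         /* max_samples_per_tick = perf_event_max_sample_rate / HZ */
>>         if (++hwc->interrupts >= max_samples_per_tick)
>>                 perf_event_throttle(event);
>>
>> The throttled event is unthrottled again on the next timer tick, so up
>> to roughly perf_event_max_sample_rate samples per second keep arriving,
>> which is apparently still more than the VM can keep up with.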
>>
>> There is no such issue without -c. I think that's because of the
>> auto-adjust mechanism of the freq mode: the period is automatically
>> adjusted, so a reasonable number of samples is generated per second.
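>>
>> Conceptually, the freq-mode adjustment recomputes the period once per
>> tick along the lines of the below (simplified; count_delta and
>> nsec_delta stand for the event count and the time elapsed since the
>> last adjustment):
>>
>>         /* aim for attr.sample_freq samples per second */
>>         u64 period = div64_u64(count_delta * NSEC_PER_SEC,
>>                                event->attr.sample_freq * nsec_delta);
>>         event->hw.sample_period = period;
>>
>> A too-small effective period is therefore scaled back up quickly,
>> keeping the sample rate bounded.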
>>
>> If you change /proc/sys/kernel/perf_event_max_sample_rate to a small
>> value, e.g., 1000, I think the issue should go away.
>>
>> Could you please give it a try?
>
> A smaller value helps, but this is not an acceptable workaround.
> The kernel shouldn't lock up like that with a simple perf command.

Right, but the soft lockup is a different issue. Even without the recent
generic throttle logic change, it can still be triggered, right?

How does the below sound?
I will send a V4 with the small change I made yesterday. Please help
verify it and see whether it fixes your original report (the test_progs
cases). If it does, let's merge it first, so it can unblock other people
(ARM CI, IBM CI, Vince...).

Then I will go back and see what I can do about this long-standing issue
in VMs.

Thanks,
Kan