Re: [BUG] Kernel panic in __migrate_swap_task() on 6.16-rc2 (NULL pointer dereference)
From: Jirka Hladky
Date: Wed Jun 18 2025 - 07:39:34 EST
Hi Abhigyan,
The testing is done on bare metal. The kernel panics occur after
several hours of benchmarking.
Out of 20 servers, the problem has occurred on 6 of them:
intel-sapphire-rapids-gold-6448y-2s
intel-emerald-rapids-platinum-8558-2s
amd-epyc5-turin-9655p-1s
amd-epyc4-zen4c-bergamo-9754-1s
amd-epyc3-milan-7713-2s
intel-skylake-2s
The number in the name is the CPU model. 1s: single socket, 2s: dual socket.
We were not able to find a clear pattern. It appears to be a race
condition of some kind.
We run various performance benchmarks, including Linpack, Stream, NAS
(https://www.nas.nasa.gov/software/npb.html), and Stress-ng. Testing
is conducted with various thread counts and settings. The whole suite
runs for ~24 hours; a single benchmark takes ~4 hours. Please
also note that we repeat each benchmark to collect performance
statistics. In many cases, the kernel panic occurred during a
repeated run of a benchmark rather than the first one.
The crashes occurred while running these tests:
Stress_ng: Starting test 'fork' (#29 out of 41), number of threads 32,
iteration 1 out of 5
SPECjbb2005: Starting DEFAULT run with 4 SPECJBB2005 instances, each
with 24 warehouses, iteration 2 out of 3
Stress_ng: test 'sem' (#30 out of 41), number of threads 24, iteration
2 out of 5
Stress_ng: test 'sem' (#30 out of 41), number of threads 64, iteration
4 out of 5
SPECjbb2005: SINGLE run with 1 SPECJBB2005 instances, each with 128
warehouses, iteration 2 out of 3
Linpack: Benchmark-utils/linpackd, iteration 3, testType affinityRun,
number of threads 128
NAS: NPB_sources/bin/is.D.x
No single benchmark clearly triggers the kernel panic. However, looping
stress-ng's sem test looks worth trying, since two of the crashes
listed above happened there.
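A minimal sketch of such a loop, assuming stress-ng is installed and in
PATH. The thread counts (24 and 64) and the 5 iterations come from the
crashing runs listed above; the 600-second timeout per run and the helper
names are placeholders of mine, not what our test harness actually uses.

```python
import subprocess

# Thread counts taken from the two crashing 'sem' runs listed above.
SEM_THREAD_COUNTS = (24, 64)


def build_sem_command(workers, timeout_s):
    """Return the stress-ng argv for one 'sem' run with `workers` workers."""
    return ["stress-ng", "--sem", str(workers), "--timeout", f"{timeout_s}s"]


def loop_sem_test(iterations=5, timeout_s=600,
                  thread_counts=SEM_THREAD_COUNTS, runner=subprocess.run):
    """Run the 'sem' stressor repeatedly; return the number of runs started.

    timeout_s is a guess, not a value from the real harness. `runner` is
    injectable so the loop can be dry-run without starting stress-ng.
    """
    runs = 0
    for iteration in range(1, iterations + 1):
        for workers in thread_counts:
            cmd = build_sem_command(workers, timeout_s)
            print(f"iteration {iteration}/{iterations}: {' '.join(cmd)}",
                  flush=True)
            # check=False: keep looping even if one run exits non-zero.
            runner(cmd, check=False)
            runs += 1
    return runs
```

Calling loop_sem_test() with the defaults starts 10 runs back to back.
Given that the panics showed up only after several hours, a much larger
iteration count is probably needed for a useful reproduction attempt.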
I hope this helps. Please let me know if there's anything I can help
with to pinpoint the problem.
Thanks
Jirka
On Wed, Jun 18, 2025 at 7:19 AM Abhigyan ghosh
<zscript.team.zs@xxxxxxxxx> wrote:
>
> Hi Jirka,
>
> Thanks for the detailed report.
>
> I'm curious about the specific setup in which this panic was triggered. Could you share more about the exact configuration or parameters you used for running `stress-ng` or Linpack? For instance:
>
> - How many threads/cores were used?
> - Was it running inside a VM, container, or bare-metal?
> - Was this under any thermal throttling or power-saving mode?
>
> I'd like to try reproducing it locally to study the failure further.
>
> Best regards,
> Abhigyan Ghosh
>
> On 18 June 2025 1:35:30 am IST, Jirka Hladky <jhladky@xxxxxxxxxx> wrote:
> >Hi all,
> >
> >I’ve encountered a reproducible kernel panic on 6.16-rc1 and 6.16-rc2
> >involving a NULL pointer dereference in `__migrate_swap_task()` during
> >CPU migration. This occurred on various AMD and Intel systems while
> >running a CPU-intensive workload (Linpack, Stress_ng - it's not
> >specific to a benchmark).
> >
> >Full trace below:
> >---
> >BUG: kernel NULL pointer dereference, address: 00000000000004c8
> >#PF: supervisor read access in kernel mode
> >#PF: error_code(0x0000) - not-present page
> >PGD 4078b99067 P4D 4078b99067 PUD 0
> >Oops: Oops: 0000 [#1] SMP NOPTI
> >CPU: 74 UID: 0 PID: 466 Comm: migration/74 Kdump: loaded Not tainted
> >6.16.0-0.rc2.24.eln149.x86_64 #1 PREEMPT(lazy)
> >Hardware name: GIGABYTE R182-Z91-00/MZ92-FS0-00, BIOS M07 09/03/2021
> >Stopper: multi_cpu_stop+0x0/0x130 <- migrate_swap+0xa7/0x120
> >RIP: 0010:__migrate_swap_task+0x2f/0x170
> >Code: 41 55 4c 63 ee 41 54 55 53 48 89 fb 48 83 87 a0 04 00 00 01 65
> >48 ff 05 e7 14 dd 02 48 8b af 50 0a 00 00 66 90 e8 61 93 07 00 <48> 8b
> >bd c8 04 00 00 e8 85 11 35 00 48 85 c0 74 12 ba 01 00 00 00
> >RSP: 0018:ffffce79cd90bdd0 EFLAGS: 00010002
> >RAX: 0000000000000001 RBX: ffff8e9c7290d1c0 RCX: 0000000000000000
> >RDX: ffff8e9c71e83680 RSI: 000000000000001b RDI: ffff8e9c7290d1c0
> >RBP: 0000000000000000 R08: 00056e36392913e7 R09: 00000000002ab980
> >R10: ffff8eac2fcb13c0 R11: ffff8e9c77997410 R12: ffff8e7c2fcf12c0
> >R13: 000000000000001b R14: ffff8eac71eda944 R15: ffff8eac71eda944
> >FS: 0000000000000000(0000) GS:ffff8eac9db4a000(0000) knlGS:0000000000000000
> >CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >CR2: 00000000000004c8 CR3: 0000003072388003 CR4: 0000000000f70ef0
> >PKRU: 55555554
> >Call Trace:
> > <TASK>
> > migrate_swap_stop+0xe8/0x190
> > multi_cpu_stop+0xf3/0x130
> > ? __pfx_multi_cpu_stop+0x10/0x10
> > cpu_stopper_thread+0x97/0x140
> > ? __pfx_smpboot_thread_fn+0x10/0x10
> > smpboot_thread_fn+0xf3/0x220
> > kthread+0xfc/0x240
> > ? __pfx_kthread+0x10/0x10
> > ? __pfx_kthread+0x10/0x10
> > ret_from_fork+0xf0/0x110
> > ? __pfx_kthread+0x10/0x10
> > ret_from_fork_asm+0x1a/0x30
> > </TASK>
> >---
> >
> >**Kernel Version:**
> >6.16.0-0.rc2.24.eln149.x86_64 (Fedora rawhide)
> >https://koji.fedoraproject.org/koji/buildinfo?buildID=2732950
> >
> >**Reproducibility:**
> >Happened multiple times during routine CPU-intensive operations. It
> >happens with various benchmarks (Stress_ng, Linpack) after several
> >hours of performance testing. `migration/*` kernel threads hit a NULL
> >dereference in `__migrate_swap_task`.
> >
> >**System Info:**
> >- Platform: GIGABYTE R182-Z91-00 (dual socket EPYC)
> >- BIOS: M07 09/03/2021
> >- Config: Based on Fedora’s debug kernel (`PREEMPT(lazy)`)
> >
> >**Crash Cause (tentative):**
> >NULL dereference at offset `0x4c8` from a task struct pointer in
> >`__migrate_swap_task`. Possibly an uninitialized or freed
> >`task_struct` field.
> >
> >Please let me know if you’d like me to test a patch or if you need
> >more details.
> >
> >Thanks,
> >Jirka
> >
> >
>
> aghosh
>
--
-Jirka