Re: [PATCH v9 0/8] Parallel CPU bringup for x86_64

From: Kim Phillips
Date: Mon Feb 20 2023 - 18:24:05 EST


On 2/20/23 3:39 PM, David Woodhouse wrote:
On 20 February 2023 21:23:38 GMT, Oleksandr Natalenko <oleksandr@xxxxxxxxxxxxxx> wrote:
Hello.

On 20.02.2023 21:31, David Woodhouse wrote:
On Mon, 2023-02-20 at 17:40 +0100, Oleksandr Natalenko wrote:
On Monday, 20 February 2023 17:20:13 CET David Woodhouse wrote:
On Mon, 2023-02-20 at 17:08 +0100, Oleksandr Natalenko wrote:

I've applied this to the v6.2 kernel, and suspend/resume broke on my
Ryzen 5950X desktop. The machine suspends just fine, but on resume
the screen stays blank, and there's no visible disk I/O.

Reverting the series brings suspend/resume back to a working state.

Hm, thanks. What if you add 'no_parallel_bringup' on the command line?

If the `no_parallel_bringup` param is added, suspend/resume works.

Thanks for the testing. Can I ask you to do one further test: apply the
series only as far as patch 6/8 'x86/smpboot: Support parallel startup
of secondary CPUs'.

That will do the new startup asm sequence where each CPU finds its own
per-cpu data so it *could* work in parallel, but doesn't actually do
the bringup in parallel yet.
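
For readers following along, here is a rough C sketch of the kind of lookup this implies: a CPU that knows only its own APIC ID scans a table to find its logical CPU number, which in turn selects its per-cpu data. The series does this in startup assembly; the function name and table layout below are hypothetical and are not taken from the patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch series): map a hardware APIC ID
 * to a logical CPU number by scanning a table indexed by CPU number.
 * Returns -1 if no entry matches.
 */
static int find_cpu_by_apic_id(const uint32_t *apic_ids, size_t nr_cpus,
                               uint32_t apic_id)
{
    assert(apic_ids != NULL);

    for (size_t cpu = 0; cpu < nr_cpus; cpu++) {
        if (apic_ids[cpu] == apic_id)
            return (int)cpu;
    }
    return -1; /* this APIC ID is not in the table */
}
```

If the APIC ID a CPU reads back differs from the one recorded in the table, a lookup like this fails or picks the wrong entry, and that CPU never finds its per-cpu data.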

With patches 1 to 6 (inclusive) applied and no extra cmdline params added, resume doesn't work.

Hm. Kim, is there some weirdness with the way AMD CPUs get their APIC ID in CPUID 0x1? Especially after resume?

Not to my knowledge. Mario?

Perhaps we turn it off for any AMD CPU that doesn't have X2APIC and CPUID 0xB?

Perhaps.
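
As background for the CPUID question, a minimal C sketch of the architectural bit layout involved: CPUID leaf 0x1 reports the initial APIC ID in EBX[31:24] (8 bits only) and the x2APIC feature flag in ECX bit 21, while leaf 0xB reports the full 32-bit x2APIC ID in EDX. The helper names are hypothetical, and the "leaf 0xB usable" check is simplified to "maximum basic leaf is at least 0xB and x2APIC is advertised"; it is not the check used in the series.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.1:ECX bit 21 advertises x2APIC support. */
#define CPUID1_ECX_X2APIC (1u << 21)

/* CPUID.1:EBX[31:24] holds the 8-bit initial APIC ID. */
static uint32_t apic_id_from_leaf1(uint32_t ebx)
{
    return ebx >> 24;
}

static bool has_x2apic(uint32_t leaf1_ecx)
{
    return (leaf1_ecx & CPUID1_ECX_X2APIC) != 0;
}

/*
 * Simplified, hypothetical predicate: prefer leaf 0xB (32-bit ID in EDX)
 * only when the CPU reports that leaf and advertises x2APIC.
 */
static bool can_use_leaf_0xb(uint32_t max_basic_leaf, uint32_t leaf1_ecx)
{
    return max_basic_leaf >= 0xb && has_x2apic(leaf1_ecx);
}
```

These helpers take raw register values rather than executing CPUID, so they can be checked against values captured on an affected machine.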

Does your box have a proper serial port?

No, sorry. I know it'd help with getting logs, and I do have a serial-to-USB cable that I use for another machine, but on this one the port is not routed to the outside. I think I can put a header there, as the motherboard does have pins, but I'd have to buy one first. In theory I can do that, but it won't happen within the next few weeks.

P.S. Piotr Gorski (in Cc) also reported this: "My friend from CachyOS can confirm bugs with smpboot patches. AMD FX 6300 only shows 1 core when using smp boot patchset". Probably, he can reply to this thread and provide more details.


I ran the mem and disk variants of 'sudo rtcwake --mode mem -s 60'
on my Rome server. Multiple suspend/resume cycles succeeded, with
all CPUs coming back up, but then the NETDEV WATCHDOG fired. I'm
not sure whether it's related:

...
[ 2751.335882] smpboot: Booting Node 1 Processor 127 APIC 0x7f
[ 2751.340124] ACPI: \_SB_.C07F: Found 2 idle states
[ 2751.392591] CPU127 is up
[ 2751.455650] nvme nvme0: 7/0/0 default/read/poll queues
[ 2751.466112] e1000e 0000:41:00.0 enp65s0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[ 2751.635573] PM: Cannot find swap device, try swapon -a
[ 2751.641315] PM: Cannot get swap writer
[ 2751.926594] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 2751.928527] ata7.00: supports DRM functions and may not be fully accessible
[ 2751.933208] ata7.00: supports DRM functions and may not be fully accessible
[ 2751.937797] ata7.00: configured for UDMA/133
[ 2751.948170] ata7.00: Enabling discard_zeroes_data
[ 2762.428397] PM: hibernation: Basic memory bitmaps freed
[ 2762.429004] OOM killer enabled.
[ 2762.429008] Restarting tasks ... done.
[ 2762.433155] PM: hibernation: hibernation exit
[ 2762.447318] systemd-journald[1387]: Sent WATCHDOG=1 notification.
[ 2830.013372] systemd-journald[1387]: Sent WATCHDOG=1 notification.
[ 2919.099718] systemd-journald[1387]: Sent WATCHDOG=1 notification.
[ 2992.729927] ------------[ cut here ]------------
[ 2992.729965] NETDEV WATCHDOG: enp65s0 (e1000e): transmit queue 0 timed out
[ 2992.730012] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x234/0x250
[ 2992.730032] Modules linked in: intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd rapl ipmi_si wmi_bmof binfmt_misc kvm_amd kvm nls_iso8859_1 joydev input_leds k10temp mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_devintf ipmi_msghandler msr ramoops reed_solomon efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear ast i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd cryptd hid_generic sysimgblt nvme usbhid ahci drm libahci hid nvme_core i2c_piix4 wmi
[ 2992.730711] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.2.0-rc8+ #20
[ 2992.730720] Hardware name: AMD Corporation ETHANOL_X/ETHANOL_X, BIOS RXM1007C 11/08/2021
[ 2992.730727] RIP: 0010:dev_watchdog+0x234/0x250
[ 2992.730732] Code: 00 e9 12 ff ff ff 4c 89 e7 c6 05 6f d5 42 01 01 e8 c1 55 f8 ff 44 89 f1 4c 89 e6 48 c7 c7 28 ff 88 b1 48 89 c2 e8 85 fc 1c 00 <0f> 0b e9 22 ff ff ff 0f 1f 44 00 00 e9 0b ff ff ff 66 66 2e 0f 1f
[ 2992.730740] RSP: 0018:ffffbf6100003e30 EFLAGS: 00010286
[ 2992.730754] RAX: 0000000000000000 RBX: ffff9a8606da8550 RCX: 0000000000000000
[ 2992.730764] RDX: 0000000000000103 RSI: ffffffffb1741bcc RDI: 00000000ffffffff
[ 2992.730770] RBP: ffffbf6100003e50 R08: 0000000000000000 R09: 00000000ffefffff
[ 2992.730775] R10: ffffbf6100003ca8 R11: ffff9b038d9fd668 R12: ffff9a8606da8000
[ 2992.730780] R13: ffff9a8606da8460 R14: 0000000000000000 R15: ffff9ac346fe2480
[ 2992.730787] FS: 0000000000000000(0000) GS:ffff9ac346e00000(0000) knlGS:0000000000000000
[ 2992.730788] systemd-journald[1387]: Compressed data object 717 -> 433 using ZSTD
[ 2992.730797] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2992.730804] CR2: 00007f4c419a0890 CR3: 0008004e47412003 CR4: 0000000000770ef0
[ 2992.730812] PKRU: 55555554
[ 2992.730819] Call Trace:
[ 2992.730826] <IRQ>
[ 2992.730840] ? __pfx_dev_watchdog+0x10/0x10
[ 2992.730852] call_timer_fn+0xac/0x250
[ 2992.730868] ? __pfx_dev_watchdog+0x10/0x10
[ 2992.730876] __run_timers+0x22d/0x2e0
[ 2992.730884] ? seqcount_lockdep_reader_access.constprop.0+0x45/0x60
[ 2992.730893] ? ktime_get+0x28/0xc0
[ 2992.730903] ? __pfx_read_tsc+0x10/0x10
[ 2992.730915] ? ktime_get+0x56/0xc0
[ 2992.730922] ? sched_clock+0xd/0x20
[ 2992.730930] ? sched_clock_cpu+0x14/0xd0
[ 2992.730941] run_timer_softirq+0x33/0x60
[ 2992.730947] __do_softirq+0x12f/0x380
[ 2992.730960] __irq_exit_rcu+0xaf/0x120
[ 2992.730970] irq_exit_rcu+0x12/0x20
[ 2992.730976] sysvec_apic_timer_interrupt+0xb4/0xd0
[ 2992.730984] </IRQ>
[ 2992.730989] <TASK>
[ 2992.730993] asm_sysvec_apic_timer_interrupt+0x1f/0x30
[ 2992.731004] RIP: 0010:cpuidle_enter_state+0x12d/0x4d0
[ 2992.731015] Code: 00 31 ff e8 e5 c0 45 ff 80 7d d7 00 74 16 9c 58 0f 1f 40 00 f6 c4 02 0f 85 7b 03 00 00 31 ff e8 99 de 4d ff fb 0f 1f 44 00 00 <45> 85 ff 0f 88 d1 01 00 00 49 63 d7 4c 89 f1 48 2b 4d c8 48 8d 04
[ 2992.731018] RSP: 0018:ffffffffb1a03dd8 EFLAGS: 00000246
[ 2992.731024] RAX: ffff9ac346e00000 RBX: ffff9ac4ec3ce400 RCX: 000000000000001f
[ 2992.731027] RDX: 0000000000000000 RSI: ffffffffb1741bcc RDI: ffffffffb17470fa
[ 2992.731035] RBP: ffffffffb1a03e10 R08: 000002b8cc9a617d R09: 000002b8c7e0b83a
[ 2992.731041] R10: 000000000002bc82 R11: ffff9ac346ff2d44 R12: 0000000000000002
[ 2992.731048] R13: ffffffffb1f60bc0 R14: 000002b8cc9a617d R15: 0000000000000002
[ 2992.731059] ? cpuidle_enter_state+0x10b/0x4d0
[ 2992.731067] cpuidle_enter+0x32/0x50
[ 2992.731072] call_cpuidle+0x23/0x50
[ 2992.731080] do_idle+0x1dc/0x250
[ 2992.731090] cpu_startup_entry+0x24/0x30
[ 2992.731096] rest_init+0x108/0x110
[ 2992.731101] arch_call_rest_init+0x12/0x35
[ 2992.731114] start_kernel+0x6f3/0x71d
[ 2992.731120] x86_64_start_reservations+0x28/0x2e
[ 2992.731127] x86_64_start_kernel+0x80/0x8a
[ 2992.731133] secondary_startup_64_no_verify+0x186/0x18b
[ 2992.731151] </TASK>
[ 2992.731158] ---[ end trace 0000000000000000 ]---
[ 2992.731371] e1000e 0000:41:00.0 enp65s0: Reset adapter unexpectedly
[ 2992.870854] systemd-journald[1387]: Successfully sent stream file descriptor to service manager.
[ 2996.338900] e1000e 0000:41:00.0 enp65s0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

Thanks,

Kim