Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop

From: Thiago Jung Bauermann
Date: Fri Dec 07 2018 - 21:41:14 EST

Next message: Martin K. Petersen: "Re: [PATCH] scsi: megaraid_sas: NULL check before some freeing functions is not needed."
Previous message: Martin K. Petersen: "Re: [PATCH 01/41] scsi: BusLogic: mark expected switch fall-through"
In reply to: Gautham R Shenoy: "Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Gautham R Shenoy <ego@xxxxxxxxxxxxxxxxxx> writes:
> On Fri, Dec 07, 2018 at 04:13:11PM +0530, Gautham R Shenoy wrote:
>> Sure. I will test the patch and report back.
>
> I added the following debug patch on top of your patch, and after an
> hour's run, the system crashed. Appending the log at the end.

Thank you very much for testing! Your debug patch was very helpful as
well.

> I suppose we still need to increase the number of tries since we wait
> only for 2.5ms looping before giving up.

Do you think it would have helped? In the debug output you posted I
would have expected the following message to show up if the loop
finished too early, and it didn't:

"Querying DEAD? cpu %i (%i) shows %i\n"

So I don't think increasing the loop length would have made a
difference. In fact, the call to smp_query_cpu_stopped() always
succeeded on the first iteration.

I think there is something else going on which we don't fully understand
yet. From your other email:

> I agree that the Kernel has to respect RTAS's restriction. The PAPR
> v2.8.1, Requirement R1-7.2.3-8 under section 7.2.3 says the following:
>
> "The stop-self service needs to be serialized with calls to the
> stop-self, start-cpu, and set-power-level services. The OS must
> be able to call RTAS services on other processors while the
> processor is stopped or being stopped"
>
> Thus the onus is on the OS to ensure that there are no concurrent rtas
> calls with "stop-self" token.

As you say perhaps there's another call to stop-self, start-cpu or
set-power-level being made concurrently. I don't currently see how more
than one stop-self or start-cpu call could be in flight at the same time
given that there are a number of locks being grabbed during CPU hotplug
and unplug. OTOH the CPU that actually calls stop-self doesn't seem to
grab any locks itself so it's a possibility.

As for set-power-level, it's only used in the case of PCI hotplug from
what I can see, and that isn't part of the picture in this case, right?

We could address this problem directly by adding another lock separate
from rtas.lock to serialize just these calls. The challenge is with
stop-self, because the CPU calling it will never return to release the
lock. Is it possible to grab a lock (or down a semaphore) in the CPU
calling stop-self and then release the lock (or up the semaphore) in the
CPU running pseries_cpu_die()?

> > There's also a race between the CPU driving the unplug and the CPU
> > being unplugged which I think is not easy for the CPU being
> > unplugged to win, which makes the busy loop in pseries_cpu_die() a
> > bit fragile. I describe the race in the patch description.
> >
> > My solution to make the race less tight is to make the CPU driving
> > the unplug to only start the busy loop only after the CPU being
> > unplugged is in the CPU_STATE_OFFLINE state. At that point, we know
> > that it either is about to call RTAS or it already has.
>
> Ah, yes this is good optimization. Though, I think we ought to
> unconditionally wait until the target CPU has woken up from CEDE and
> changed its state to CPU_STATE_OFFLINE. After if PROD failed, then we
> would have caught it in dlpar_offline_cpu() itself.

I recently saw a QEMU-implemented hcall (H_LOGICAL_CI_LOAD) return
success when it had been given an invalid memory address to load from,
so my confidence in the error reporting of hcalls is a bit shaken. :-)

In that case the CPU would wait forever for the CPU state to change. If
you believe 100 ms is too short a timeout, we could make it 500 ms or
even 1s. What do you think?

> cpu 112 (hwid 112) Ready to die...
> [DEBUG] Waited for CPU 112 to enter rtas: tries=0, time=65
> cpu 113 (hwid 113) Ready to die...
> [DEBUG] Waited for CPU 113 to enter rtas: tries=0, time=1139
> cpu 114 (hwid 114) Ready to die...
> [DEBUG] Waited for CPU 114 to enter rtas: tries=0, time=1036
> cpu 115 (hwid 115) Ready to die...
> [DEBUG] Waited for CPU 115 to enter rtas: tries=0, time=133
> cpu 116 (hwid 116) Ready to die...
> [DEBUG] Waited for CPU 116 to enter rtas: tries=0, time=1231
> cpu 117 (hwid 117) Ready to die...
> [DEBUG] Waited for CPU 117 to enter rtas: tries=0, time=1231
> cpu 118 (hwid 118) Ready to die...
> [DEBUG] Waited for CPU 118 to enter rtas: tries=0, time=1231
> cpu 119 (hwid 119) Ready to die...
> [DEBUG] Waited for CPU 119 to enter rtas: tries=0, time=1131
> cpu 104 (hwid 104) Ready to die...
> [DEBUG] Waited for CPU 104 to enter rtas: tries=0, time=40

Interesting, so 1.2 ms can pass before the dying CPU actually gets close
to making the stop-self call. And even in those cases the retry loop is
succeeding on the first try! So this shows that changing the code to
wait for the CPU_STATE_OFFLINE state is worth it.

> ******* RTAS CALL BUFFER CORRUPTION *******
> 393: rtas32_call_buff_ptr=
> 0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
> 0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
> 0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
> 0000 0060 0800 E07F ACA7 0000 0000 00C0 [...`............]
> 2500 0000 0000 0000 0000 0000 0000 0000 [%...............]
> 0000 0000 0000 0000 0000 0000 306E 7572 [............0nur]
> 4800 0008 .... .... .... .... .... .... [H...........0nur]
> 394: rtas64_map_buff_ptr=
> 0000 0000 5046 5743 0000 0000 4F44 4500 [....PFWC....ODE.]
> 0000 0000 6000 0000 0000 0000 0000 0069 [....`..........i]
> 0000 0000 0000 0000 0000 0000 0000 0000 [................]
> 0000 0000 0000 0005 0000 0000 0000 0001 [................]
> 0000 0000 1A00 0000 0000 0000 0000 0000 [................]
> 0000 0000 8018 6398 0000 0000 0300 00C0 [......c.........]
> 0000 0000 .... .... .... .... .... .... [......c.........]

Ah! I never saw this error message. So indeed the kernel is causing RTAS
to blow up. Perhaps it would be useful to instrument more RTAS calls
(especially start-cpu and set-power-level) to see if it's one of them
that is being called at the time this corruption happens.

> cpu 105 (hwid 105) Ready to die...
> Bad kernel stack pointer 1fafb6c0 at 0
> Oops: Bad kernel stack pointer, sig: 6 [#1]
> LE SMP NR_CPUS=2048 NUMA pSeries
> Modules linked in:
> CPU: 105 PID: 0 Comm: swapper/105 Not tainted 4.20.0-rc5-thiago+ #45
> NIP: 0000000000000000 LR: 0000000000000000 CTR: 00000000007829c8
> REGS: c00000001e63bd30 TRAP: 0700 Not tainted (4.20.0-rc5-thiago+)
> MSR: 8000000000081000 <SF,ME> CR: 28000004 XER: 00000010
> CFAR: 000000001ec153f0 IRQMASK: 8000000000009033
> GPR00: 0000000000000000 000000001fafb6c0 000000001ec236a0 0000000000000040
> GPR04: 00000000000000c0 0000000000000080 00046c4fb4842557 00000000000000cd
> GPR08: 000000000001f400 000000001ed035dc 0000000000000000 0000000000000000
> GPR12: 0000000000000000 c00000001eb5e480 c0000003a1b53f90 000000001eea3e20
> GPR16: 0000000000000000 c0000006fd845100 c00000000004c8b0 c0000000013d5300
> GPR20: c0000006fd845300 0000000000000008 c0000000019d2cf8 c0000000013d6888
> GPR24: 0000000000000069 c0000000013d688c 0000000000000002 c0000000013d688c
> GPR28: c0000000019cecf0 0000000000000348 0000000000000000 0000000000000000
> NIP [0000000000000000] (null)
> LR [0000000000000000] (null)
> Call Trace:
> Instruction dump:
> XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
> XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 60000000 60000000 60000000 60000000
> ---[ end trace 1aa3b4936949457e ]---

Ok, so at about the time CPU 105 makes the stop-self call there is this
RTAS call buffer corruption and this bad kernel stack pointer in CPU 105.
We need to understand better what is causing this.

> Bad kernel stack pointer 1fafb4b0 at 1ec15004
> rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> rcu: 88-...!: (0 ticks this GP) idle=2ce/1/0x4000000000000000 softirq=28076/28076 fqs=78
> rcu: (detected by 72, t=10866 jiffies, g=180529, q=2526)
> Sending NMI from CPU 72 to CPUs 88:
> CPU 88 didn't respond to backtrace IPI, inspecting paca.
> irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 22978 (drmgr)
> Back trace of paca->saved_r1 (0xc0000006f94ab750) (possibly stale):
> Call Trace:
> [c0000006f94ab750] [c0000006f94ab790] 0xc0000006f94ab790 (unreliable)
> [c0000006f94ab930] [c0000000000373f8] va_rtas_call_unlocked+0xc8/0xe0
> [c0000006f94ab950] [c000000000037a98] rtas_call+0x98/0x200
> [c0000006f94ab9a0] [c0000000000d7d28] smp_query_cpu_stopped+0x58/0xe0
> [c0000006f94aba20] [c0000000000d9dbc] pseries_cpu_die+0x1ec/0x240
> [c0000006f94abad0] [c00000000004f284] __cpu_die+0x44/0x60
> [c0000006f94abaf0] [c0000000000d8e10] dlpar_cpu_remove+0x160/0x340
> [c0000006f94abbc0] [c0000000000d9184] dlpar_cpu_release+0x74/0x100
> [c0000006f94abc10] [c000000000025a74] arch_cpu_release+0x44/0x70
> [c0000006f94abc30] [c0000000009bd1bc] cpu_release_store+0x4c/0x80
> [c0000006f94abc60] [c0000000009ae000] dev_attr_store+0x40/0x70
> [c0000006f94abc80] [c000000000495810] sysfs_kf_write+0x70/0xb0
> [c0000006f94abca0] [c00000000049453c] kernfs_fop_write+0x17c/0x250
> [c0000006f94abcf0] [c0000000003ccb6c] __vfs_write+0x4c/0x1f0
> [c0000006f94abd80] [c0000000003ccf74] vfs_write+0xd4/0x240
> [c0000006f94abdd0] [c0000000003cd338] ksys_write+0x68/0x110
> [c0000006f94abe20] [c00000000000b288] system_call+0x5c/0x70

So CPU 88 is the one driving the hot unplug and waiting for CPU 105 to
die. But it is stuck inside RTAS. Perhaps because of the call buffer
corruption?

> rcu: rcu_sched kthread starved for 10709 jiffies! g180529 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=72
> rcu: RCU grace-period kthread stack dump:
> rcu_sched I 0 11 2 0x00000808
> Call Trace:
> [c0000000061ab840] [c0000003a4a84800] 0xc0000003a4a84800 (unreliable)
> [c0000000061aba20] [c00000000001e24c] __switch_to+0x2dc/0x430
> [c0000000061aba80] [c000000000e5fb94] __schedule+0x3d4/0xa20
> [c0000000061abb50] [c000000000e6022c] schedule+0x4c/0xc0
> [c0000000061abb80] [c000000000e64ffc] schedule_timeout+0x1dc/0x4e0
> [c0000000061abc80] [c0000000001af40c] rcu_gp_kthread+0xc3c/0x11f0
> [c0000000061abdb0] [c00000000013c7c8] kthread+0x168/0x1b0
> [c0000000061abe20] [c00000000000b658] ret_from_kernel_thread+0x5c/0x64

I don't know what to make of CPU 72. :-) Perhaps it's the one making
the other "rogue" RTAS call interfering with stop-self in CPU 105?

It must be some RTAS call made with rtas_call_unlocked, because CPU 88
is holding the RTAS lock.

-- Thiago Jung Bauermann
IBM Linux Technology Center

Next message: Martin K. Petersen: "Re: [PATCH] scsi: megaraid_sas: NULL check before some freeing functions is not needed."
Previous message: Martin K. Petersen: "Re: [PATCH 01/41] scsi: BusLogic: mark expected switch fall-through"
In reply to: Gautham R Shenoy: "Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]