Re: [PATCH] pwm: core: Zero-initialize the temp state

From: Munehisa Kamata
Date: Tue Feb 28 2023 - 03:54:11 EST


Hi

On Mon, 2023-02-27 09:16:21 +0000, Uwe Kleine-König wrote:
>
> Hello,
>
> On Sun, Feb 26, 2023 at 06:48:30PM -0800, Munehisa Kamata wrote:
> > On Sun, 2023-02-26 09:17:52 +0000, Uwe Kleine-König <u.kleine-koenig@xxxxxxxxxxxxxx> wrote:
> > > On Sat, Feb 25, 2023 at 05:37:21PM -0800, Munehisa Kamata wrote:
> > > > Zero-initialize the on-stack structure to avoid unexpected behaviors. Some
> > > > drivers may not set or initialize all the values in pwm_state through their
> > > > .get_state() callback and therefore some random values may remain there and
> > > > be set into pwm->state eventually.
> > > >
> > > > This actually caused regression on ODROID-N2+ as reported in [1]; kernel
> > > > fails to boot due to random panic or hang-up.
> > > >
> > > > [1] https://forum.odroid.com/viewtopic.php?f=177&t=46360
> > >
> > > Looking through the report I wonder what actually made the machine fail
> > > to boot. Doesn't this paper over a problem that should be fixed (also)
> > > somewhere else?
> >
> > Yes, you're right and I think the commit message could have described more
> > details. This patch is for ensuring all drivers see zeroed state same as
> > before, but I still don't fully understand how it ends up such random-ish
> > crashes. There could be another or bigger problem that should be fixed.
> >
> > > Which driver is the one that the problem occur for?
> >
> > It's pwm-meson and seems to be caused by random polarity value somehow. If
> > meson_pwm_get_state() sets polarity to zero instead, I don't see the
> > problem. According the comment, looks like the driver does not set polarity
> > by design.
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pwm/pwm-meson.c?h=v6.2&id=c9c3395d5e3dcc6daee66c6908354d47bf98cb0c#n9
> >
> > Before commit c73a3107624d, the memory was kcalloc'ed and always zeroed,
> > but I don't know if the driver was (is) assuming that. I'm adding Meson SoC
> > people to CC.
> >
> > Apart from how polarity should be handled in the driver, I'm very puzzled
> > by the crashes I've observed so far. There seems to be some patterns, but
> > they don't seem obviously related to PWM.
> >
> > [ 1.360542] soc soc0: Amlogic Meson G12B (S922X) Revision 29:c (40:2) Detected
> > [ 1.363906] Insufficient stack space to handle exception!
> > [ 1.363913] ESR: 0x0000000096000047 -- DABT (current EL)
> > [ 1.363917] FAR: 0xffff800009a47ff0
> > [ 1.363920] Task stack: [0xffff800009a48000..0xffff800009a4c000]
> > [ 1.363923] IRQ stack: [0xffff8000099a8000..0xffff8000099ac000]
> > [ 1.363927] Overflow stack: [0xffff000077b76100..0xffff000077b77100]
> > [ 1.363931] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 6.2.0-odroid-arm64 #47
> > [ 1.363938] Hardware name: Hardkernel ODROID-N2Plus (DT)
> > [ 1.363941] pstate: 400000c5 (nZcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > [ 1.363947] pc : __do_kernel_fault+0x4/0x180
> > [ 1.363961] lr : do_page_fault+0xd0/0x3d0
> > [ 1.363968] sp : ffff800009a48020
> > [ 1.363970] x29: ffff800009a48020 x28: ffff000002948000 x27: 0000000000000000
> > [ 1.363980] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
> > [ 1.363987] x23: 0000000096000004 x22: ffff800009a48110 x21: 00000000ffffffff
> > [ 1.363994] x20: ffff800009a48110 x19: 00000000ffffffff x18: 000000000000001c
> > [ 1.364001] x17: 00000000863047f6 x16: ffff800009860a80 x15: 0000000000000003
> > [ 1.364008] x14: 00000000000003ef x13: 0000000000000000 x12: 0000000000000279
> > [ 1.364015] x11: 0000000000000001 x10: 0000000000000001 x9 : 0000000000000400
> > [ 1.364021] x8 : ffff000077b7fc40 x7 : ffff000077b7fbc0 x6 : ffff8000094f7990
> > [ 1.364028] x5 : 0000000000000000 x4 : ffff000002948000 x3 : 0001000000000000
> > [ 1.364035] x2 : ffff800009a48110 x1 : 0000000096000004 x0 : 00000000ffffffff
> > [ 1.364043] Kernel panic - not syncing: kernel stack overflow
> > [ 1.364047] SMP: stopping secondary CPUs
> >
> > Another example:
> >
> > [ 1.360997] soc soc0: Amlogic Meson G12B (S922X) Revision 29:c (40:2) Detected
> > [ 1.364333] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
> > [ 1.364435] Unable to handle kernel paging request at virtual address 0000000008003388
> > [ 1.367470] Internal error: Oops - Undefined instruction: 0000000002000000 [#1] PREEMPT SMP
> > [ 1.367483] Modules linked in:
> > [ 1.367490] CPU: 2 PID: 66 Comm: kworker/u12:1 Not tainted 6.2.0-odroid-arm64 #47
> > [ 1.367498] Hardware name: Hardkernel ODROID-N2Plus (DT)
> > [ 1.367502] Workqueue: events_unbound async_run_entry_fn
> > [ 1.367516] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > [ 1.367523] pc : arch_timer_handler_phys+0x0/0x44
> > [ 1.367535] lr : handle_percpu_devid_irq+0x84/0x140
> > [ 1.367543] sp : ffff8000099abf70
> > [ 1.367546] x29: ffff8000099abf70 x28: ffff000002a40000 x27: 0000000000000006
> > [ 1.367556] x26: ffff800009d208a4 x25: ffff800009d20b44 x24: 0000000000000008
> > [ 1.367565] x23: ffff8000094f8b50 x22: 000000000000000b x21: ffff000002829000
> > [ 1.367573] x20: ffff800008f89fb0 x19: ffff00000282a600 x18: ffff0000021792e8
> > [ 1.367581] x17: ffff80006e685000 x16: ffff8000099ac000 x15: 0000000000004000
> > [ 1.367589] x14: ffff800009d27bde x13: ffff800009d2887b x12: 0000000000000308
> > [ 1.367597] x11: 0000000000000a92 x10: 0000000000000068 x9 : ef01a5948f440b31
> > [ 1.367606] x8 : 0000000000003273 x7 : ffff800009d262a8 x6 : 000000003754d375
> > [ 1.367614] x5 : ffff80006e685000 x4 : ffff8000099abf70 x3 : ffff80006e685000
> > [ 1.367622] x2 : ffff800008c26c30 x1 : ffff000077b83a00 x0 : 000000000000000b
> > [ 1.367630] Call trace:
> > [ 1.367633] arch_timer_handler_phys+0x0/0x44
> > [ 1.367641] generic_handle_domain_irq+0x2c/0x44
> > [ 1.367650] gic_handle_irq+0x44/0xc4
> > [ 1.367659] call_on_irq_stack+0x2c/0x5c
> > [ 1.367666] do_interrupt_handler+0x80/0x84
> > [ 1.367672] el1_interrupt+0x34/0x70
> > [ 1.367682] el1h_64_irq_handler+0x18/0x2c
> > [ 1.367686] el1h_64_irq+0x64/0x68
> > [ 1.367690] HUF_decompress4X1_usingDTable_internal_default+0x2fc/0xd60
> > [ 1.367702] HUF_decompress4X_hufOnly_wksp_bmi2+0xec/0x140
> > [ 1.367711] ZSTD_decodeLiteralsBlock+0x580/0x630
> > [ 1.367717] ZSTD_decompressBlock_internal.part.0+0x5c/0x1b4
> > [ 1.367723] ZSTD_decompressBlock_internal+0x1c/0x30
> > [ 1.367729] ZSTD_decompressContinue.part.0+0x364/0x444
> > [ 1.367734] ZSTD_decompressContinueStream+0x98/0x180
> > [ 1.367738] ZSTD_decompressStream+0x5b0/0x8c0
> > [ 1.367743] zstd_decompress_stream+0x10/0x20
> > [ 1.367751] unzstd+0x290/0x37c
> > [ 1.367760] unpack_to_rootfs+0x174/0x298
> > [ 1.367767] do_populate_rootfs+0x84/0x1dc
> > [ 1.367773] async_run_entry_fn+0x34/0x150
> > [ 1.367778] process_one_work+0x1d0/0x320
> > [ 1.367785] worker_thread+0x14c/0x444
> >
> > Perhaps, the driver or the PWM core could do something wrong based on the
> > invalid polarity and corrupt certain memory location, but I'm still not
> > able to relate the random value to these crashes.
>
> I can imagine that the compiler creates code (a jump table) that relies
> on .polarity being either PWM_POLARITY_NORMAL or PWM_POLARITY_INVERSED
> and that exibits UB if that isn't true. (I think the compiler is allowed
> to do that.)
>
> Looking at the driver, there are several problems:
>
> - The driver does just
>
> if (state->polarity == PWM_POLARITY_INVERSED)
> duty = period - duty;
>
> which is broken because each period is supposed to start with the
> active part. (You could argue this being ridiculous and I'd agree.
> But that's what we have and just doing it differently in a driver is
> wrong.) The fix is to check if the hardware actually emits normal or
> inversed polarity and refuse the other one. Then in .get_state() set
> the appropriate polarity.

Looking at the comment in the source, I'm not sure if such checking is
possible with this hardware.

>From pwm-meson.c:
* The hardware has no "polarity" setting. This driver reverses the period
* cycles (the low length is inverted with the high length) for
* PWM_POLARITY_INVERSED. This means that .get_state cannot read the polarity
* from the hardware.

As far as I see, there seems to be some drivers that hard-code polarity in
.get_state() without getting it from hardware. Perhaps, we could do that
instead? Honestly, I don't understand how it's a bad idea here.

> - In meson_pwm_calc() we have:
>
> unsigned int duty, ...;
>
> duty = state->duty_cycle;
>
> which is wrong for state->duty_cycle > U32_MAX. The same for period.
>
> - After period is fixed to be proper u64, fin_freq * (u64)period might
> overflow.
>
> - harmless: The check
>
> duty_cnt = div64_u64(fin_freq * (u64)duty,
> NSEC_PER_SEC * (pre_div + 1));
> if (duty_cnt > 0xffff)
> ...
>
> never triggers, as duty <= period (after fixing the first issue) and
> we already know that
>
> div64_u64(fin_freq * (u64)period, NSEC_PER_SEC * (pre_div + 1)) <= 0xffff.

Ah, I believe this requires a separate fix.


Regards,
Munehisa