Re: Crashing 'kzm' target in next-20160913 due to 'gpio: mxc: shift gpio_mxc_init() to subsys_initcall level'

From: Guenter Roeck
Date: Thu Sep 15 2016 - 10:35:29 EST


On 09/15/2016 07:23 AM, Uwe Kleine-König wrote:
On Thu, Sep 15, 2016 at 04:35:04PM +0300, Vladimir Zapolskiy wrote:
Hi Guenter,

On 09/14/2016 06:20 AM, Guenter Roeck wrote:
Hi Vladimir,

your commit e188cbf7564f ("gpio: mxc: shift gpio_mxc_init() to subsys_initcall level")
in -next causes the following crash when running the 'kzm' target (and most likely
the real thing) with qemu.

[ 1.211426] Unable to handle kernel NULL pointer dereference at virtual address 0000000c
[ 1.211600] pgd = c0004000
[ 1.211680] [0000000c] *pgd=00000000
[ 1.212067] Internal error: Oops: 5 [#1] SMP ARM
[ 1.212245] Modules linked in:
[ 1.212542] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.8.0-rc6-next-20160913 #1
[ 1.212671] Hardware name: Kyoto Microcomputer Co., Ltd. KZM-ARM11-01
[ 1.212825] task: c6848000 task.stack: c683e000
[ 1.213231] PC is at platform_get_irq+0xc0/0xe8

See http://kerneltests.org/builders/qemu-arm-next/builds/525/steps/qemubuildcommand/logs/stdio
for a complete log.

Problem is quite subtle. The change causes the gpio driver to be installed later.
As a result, kzm_init_smsc9118() fails to initialize the gpio pins correctly.
gpio_request() in that function returns -EPROBE_DEFER, which is ignored,
gpio_to_irq() then returns -22 which is unconditionally assigned as interrupt number.
platform_get_irq(), as called from the smsc driver, gets this negative interrupt
number, and passes it unconditionally to irq_get_irq_data(), which returns NULL.
The NULL pointer is then passed to irqd_set_trigger_type() which, not entirely
surprisingly, crashes.

So, in other words, lots of bugs here. Nevertheless, I would suggest to keep using
postcore_initcall(), at least until it is sure that all gpio clients handle -EPROBE_DEFER
correctly.

I'm inviting Shawn and Uwe to the discussion.

The proper fix in this particular case should be like this one:

diff --git a/arch/arm/mach-imx/mach-kzm_arm11_01.c b/arch/arm/mach-imx/mach-kzm_arm11_01.c
index 31df4361996f..8288acfe7221 100644
--- a/arch/arm/mach-imx/mach-kzm_arm11_01.c
+++ b/arch/arm/mach-imx/mach-kzm_arm11_01.c
@@ -245,13 +245,17 @@ static void __init kzm_board_init(void)
mxc_iomux_setup_multiple_pins(kzm_pins,
ARRAY_SIZE(kzm_pins), "kzm");
- kzm_init_ext_uart();
- kzm_init_smsc9118();
kzm_init_imx_uart();
pr_info("Clock input source is 26MHz\n");
}
+static void __init kzm_late_init(void)
+{
+ kzm_init_ext_uart();
+ kzm_init_smsc9118();
+}
+
/*
* This structure defines static mappings for the kzm-arm11-01 board.
*/
@@ -291,5 +295,6 @@ MACHINE_START(KZM_ARM11_01, "Kyoto Microcomputer Co., Ltd. KZM-ARM11-01")
.init_irq = mx31_init_irq,
.init_time = kzm_timer_init,
.init_machine = kzm_board_init,
+ .init_late = kzm_late_init,
.restart = mxc_restart,
MACHINE_END

That + checking the return code of gpio_request and the other calls.
Or better, convert the machine to dt.

But I agree that there might be more legacy boards (i.MX31 only IMHO),
which may attempt to manipulate GPIO lines before subsys_initcall()
level.

I wouldn't revert anything for legacy boards. That's the chance to say
in the near future: They stopped working in September 2016, obviously
nobody cares, let's rip them. :-)

New kernel development philosophy ? Regressions are acceptable as long as
they affect a board older than X years ? What is your cut-off date for accepting
regressions like that ?

If that is the new philosophy, can it be made official ? That would help me
understand when I don't need to bother sending bug reports.

Following that new development paradigm, should I refrain from reporting that
arm-realview-eb-11mp-revb has been broken in -next since next-20160914 ?

Thanks,
Guenter