APIC timer checked before it is set up, boot fails on Connex L1430

From: Daniel Drake
Date: Fri Dec 28 2018 - 01:33:38 EST


Hi,

On the Connex L1430 laptop based on Intel Apollo Lake N3350, Linux
doesn't boot. It hangs early on a blank screen. Reproduced with Linus
git, 4.18 and 4.19 (there is no previous known working kernel
version). EFI earlyprintk shows:

APIC: switch to symmetric I/O mode setup
x2apic: IRQ remapping doesn't support X2APIC mode
..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
..MP-BIOS bug: 8254 timer not connected to IO-APIC
...tryign to set up timer (IRQ0) through the 8259A ...
..... (found apic 0 pin 2) ...
....... failed.
...trying to set up timer as Virtual Wire IRQ...
..... failed.
...trying to set up timer as ExtINT IRQ...
do_IRQ: 0.55 No irq handler for vector
..... failed :(.
Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with
apic=debug and send a report.

Looking closer, check_timer() is observing that the IOAPIC timer
doesn't tick, so it then tries some other approaches but doesn't
manage to get them working either.

The strange thing is, I booted with the no_timer_check parameter and
the system works fine! With this parameter it assumes the IOAPIC timer
is ticking and just continues the boot sequence anyway. Here is the
boot log with apic=debug no_timer_check:
https://gist.github.com/dsd/6f40d8ecc7102dd5dcb90c5dedc69214#file-dmesg-txt

/proc/interrupts shows that APIC Local timer interrupts are working
fine on both CPUs:
https://gist.github.com/dsd/6f40d8ecc7102dd5dcb90c5dedc69214#file-interrupts-txt

So, check_timer() is incorrectly deducing that the IOAPIC timer isn't
working. The way it checks this is to do a delay loop and then check
if jiffies has advanced. I presume the expectation here is that during
this delay, the hardware IRQ will invoke local_apic_timer_interrupt()
which will then increment jiffies. Indeed, during check_timer()
execution this interrupt does not fire, however by using
no_timer_check and adding a log message I can see that it fires for
the first time quite some time after check_timer() is done:

..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles:
0xfc66f4fc7c, max_idle_ns: 440795224246 ns
Calibrating delay loop (skipped), value calculated using timer
frequency.. 2188.80 BogoMIPS (lpj=1094400)
pid_max: default: 32768 minimum: 301
LSM: Security Framework initializing
SELinux: Initializing.
Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
Mount-cache hash table entries: 4096 (order: 3, 32768 bytes)
Mountpoint-cache hash table entries: 4096 (order: 3, 32768 bytes)
mce: CPU supports 7 MCE banks
mce: CPU0: Thermal monitoring enabled (TM1)
Last level iTLB entries: 4KB 48, 2MB 0, 4MB 0
Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
Spectre V2 : Spectre mitigation: kernel not compiled with retpoline;
no mitigation available!
Freeing SMP alternatives memory: 44K
TSC deadline timer enabled
smpboot: CPU0: Intel(R) Celeron(R) CPU N3350 @ 1.10GHz (family: 0x6,
model: 0x5c, stepping: 0x9)
Performance Events: PEBS fmt3+, Goldmont events, 32-deep LBR,
full-width counters, Intel PMU driver.
... version: 4
... bit width: 48
... generic registers: 4
... value mask: 0000ffffffffffff
... max period: 00007fffffffffff
... fixed-purpose events: 3
... event mask: 000000070000000f
rcu: Hierarchical SRCU implementation.
smp: Bringing up secondary CPUs ...
!!! local_apic_timer_interrupt for the first time cpu0 !!!

Experimenting further, I used the same approach of adding delays and
checking for the interrupt during the delay to figure out at which
precise point during the boot sequence the timer interrupt starts
working. It's here:

static void setup_APIC_timer(void)
{
[...]
if (this_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER)) {
[...]
clockevents_config_and_register(levt,
tsc_khz * (1000 / TSC_DIVISOR),
0xF, ~0UL);
}
}

We reach clockevents_register_device() which does:
1. Take a spinlock and disable IRQs
2. lapic_set_oneshot() which leads to "TSC deadline timer enabled" message
3. lapic_next_deadline()
4. Spin unlock & re-enable IRQs

At the exact point where IRQs are re-enabled above, which is at the
time of return from clockevents_config_and_register(), timer
interrupts start working.


The overall ordering here seems surprising. check_timer() is probing
whether the APIC timer works well before setup_APIC_timer() has been
called. Shouldn't the timer be checked only after it has been set up?

Or is Linux assuming that the BIOS will boot with the APIC timer
already running?

Any other debugging suggestions much appreciated.

Thanks
Daniel