Re: Solid freezes with 2.6.25

From: Andrew Morton
Date: Mon Apr 28 2008 - 12:26:10 EST


On Mon, 28 Apr 2008 16:29:35 +0200 Gabor Gombas <gombasg@xxxxxxxxx> wrote:

> Hi,
>
> I'm seeing solid freezes with 2.6.25. 2.6.24.x works fine, but 2.6.25 has
> never managed an uptime longer than 4-6 hours so far. netconsole captured
> the following:
>
> NMI Watchdog detected LOCKUP on CPU 1
> CPU 1
> Modules linked in: edd netconsole configfs i915 radeon drm rfcomm l2cap bluetooth xfrm_user xfrm4_tunnel tunnel4 ipcomp esp4 aead ah4 nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs ipt_ULOG microcode ipt_REJECT nf_conntrack_ipv4 xt_state nf_conntrack xt_tcpudp ipt_LOG xt_limit iptable_filter ip_tables x_tables deflate zlib_deflate zlib_inflate ctr twofish twofish_common camellia serpent blowfish des_generic cbc aes_x86_64 aes_generic xcbc sha256_generic sha1_generic md5 crypto_null af_key fuse dm_crypt crypto_blkcipher dm_snapshot dm_mirror dm_mod coretemp w83627ehf hwmon_vid snd_hda_intel snd_pcm 8250_pnp snd_timer 8250 sg snd 8139too serial_core video r8169 snd_page_alloc usbhid i2c_i801 sr_mod iTCO_wdt floppy cdrom [last unloaded: netconsole]
> Pid: 2535, comm: postgres Not tainted 2.6.25 #11
> RIP: 0010:[<ffffffff8021aa54>] [<ffffffff8021aa54>] hpet_rtc_interrupt+0x11a/0x2fd
> RSP: 0000:ffff81012fc77ec8 EFLAGS: 00200097
> RAX: 0000000000000000 RBX: 0000000000200002 RCX: 0000000000000000
> RDX: 000000000000c6c6 RSI: 0000000000200002 RDI: ffffffff80655ef8
> RBP: 000000010011144c R08: ffffffffff5fc128 R09: 0000000000000000
> R10: 0000000000200046 R11: 0000000000000000 R12: 00000000000000a6
> R13: ffff81012fcf8800 R14: 0000000000000000 R15: 0000000000000000
> FS: 0000000000000000(0000) GS:ffff81012fc0f480(0063) knlGS:00000000f7f228e0
> CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
> CR2: 00000000f1559000 CR3: 0000000128cd8000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process postgres (pid: 2535, threadinfo ffff810128d18000, task ffff81012cbb6930)
> Stack: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> ffffffff00000000 0000000000000001 ffffffff806432c0 ffff81012fe25bc0
> 0000000000000000 0000000000000000 0000000000000008 ffffffff8025d6d0
> Call Trace:
> <IRQ> [<ffffffff8025d6d0>] ? handle_IRQ_event+0x25/0x53
> [<ffffffff8025ec3a>] ? handle_edge_irq+0xdd/0x11c
> [<ffffffff8020c0cc>] ? call_softirq+0x1c/0x28
> [<ffffffff8020e26a>] ? do_IRQ+0xf1/0x15f
> [<ffffffff8020b451>] ? ret_from_intr+0x0/0xa
> <EOI>
>
> Code: a0 28 00 bf 0a 00 00 00 48 89 c3 e8 73 6b ff ff 48 89 de 41 88 c4 48 c7 c7 f8 5e 65 80 e8 14 a1 28 00 45 84 e4 78 04 eb 12 f3 90 <48> 8b 05 25 1e 3e 00 48 29 e8 48 83 f8 04 76 ee 48 c7 c7 f8 5e
> ---[ end trace 8625c90c6582673f ]---
> Kernel panic - not syncing: Aiee, killing interrupt handler!
>
> Also, I have these messages in syslog:
>
> Apr 28 13:13:31 boogie kernel: rtc: lost 157 interrupts
> Apr 28 13:13:32 boogie kernel: rtc: lost 37 interrupts
> Apr 28 13:25:37 boogie kernel: rtc: lost 60 interrupts
>
> More info about the machine is attached. I've also seen similar hangs with
> 2.6.25-rc6 on an nforce4/Athlon64 box, but I'm reluctant to re-test there
> because the RAID rebuild takes too long.

I don't see any loop in hpet_rtc_interrupt() which could lock up, so I assume
that for some reason we stop clearing the interrupt source and continuously
re-enter the interrupt handler.
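
For reference, the re-arm step that runs from hpet_rtc_interrupt() and is
supposed to push the timer 1 comparator back ahead of the HPET main counter
looks roughly like the sketch below (paraphrased from memory, not a verbatim
quote of 2.6.25's arch/x86/kernel/hpet.c); as far as I recall it is also
where the "rtc: lost N interrupts" lines further down in the report come
from. If that catch-up ever fails to leave the comparator ahead of the
counter, timer 1 keeps firing and we land straight back in the handler:

	/*
	 * Rough paraphrase of the comparator catch-up in
	 * hpet_rtc_timer_reinit() -- from memory, not the verbatim
	 * 2.6.25 code.
	 */
	static void hpet_rtc_timer_reinit(void)
	{
		unsigned int delta = hpet_default_delta; /* ticks per emulated RTC period */
		int lost_ints = -1;

		/*
		 * Bump the comparator until it is ahead of the free-running
		 * main counter again; every extra pass is one "lost" RTC tick.
		 */
		do {
			hpet_t1_cmp += delta;
			hpet_writel(hpet_t1_cmp, HPET_T1_CMP);
			lost_ints++;
		} while ((long)(hpet_readl(HPET_COUNTER) - hpet_t1_cmp) > 0);

		if (lost_ints && printk_ratelimit())
			printk(KERN_WARNING "rtc: lost %d interrupts\n", lost_ints);
	}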

I think this could also happen if someone runs
hpet_unregister_irq_handler() while the hpet is still active.
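
For what it's worth, the unregister path in that era does roughly the
following (again a simplified paraphrase from memory, not the exact code).
It only clears the software state; timer 1 itself stays programmed until the
next interrupt notices hpet_rtc_flags == 0 and disables it, so there is a
window where the hardware is still firing:

	void hpet_unregister_irq_handler(rtc_irq_handler handler)
	{
		if (!is_hpet_enabled())
			return;

		irq_handler = NULL;	/* nobody consumes the emulated ticks any more */
		hpet_rtc_flags = 0;	/* the *next* interrupt is what disables timer 1 */
		hpet_pie_limit = 0;
	}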

Ugly. If it were sanely reproducible then you could perhaps bisect it, but a
wait of four to six hours per kernel makes that infeasible :(

Suspicion would have to be directed at the 2.6.25 CONFIG_HPET_EMULATE_RTC
changes.

I think our best bet here would be to persuade someone who knows what's
going on in there to prepare a debugging patch for you to run (please), so
we can see what the code is doing at the time it freezes up.
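
Pending a proper patch from someone who knows this code, here is an untested
sketch of the kind of instrumentation I mean: count back-to-back invocations
within one jiffy and, past some absurd threshold, dump the main counter and
the timer 1 comparator so we can see whether the comparator ever gets ahead
of the counter again.

	/* Untested sketch, not a real patch: storm detector in the RTC emulation IRQ. */
	irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id)
	{
		static unsigned long last_jiffy, storm_count;

		if (time_after(jiffies, last_jiffy)) {
			last_jiffy = jiffies;
			storm_count = 0;
		} else if (++storm_count > 10000 && printk_ratelimit()) {
			printk(KERN_WARNING
			       "hpet: rtc irq storm: counter=%#lx t1_cmp=%#lx cmp_reg=%#lx\n",
			       (unsigned long)hpet_readl(HPET_COUNTER),
			       (unsigned long)hpet_t1_cmp,
			       (unsigned long)hpet_readl(HPET_T1_CMP));
		}

		/* ... existing body unchanged ... */

		return IRQ_HANDLED;
	}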
