Re: [BUG] [BISECTED] System gets unresponsive since 2.6.35-rc1

From: Thomas Gleixner
Date: Wed Sep 15 2010 - 07:15:14 EST

Next message: Haojian Zhuang: "Re: [PATCH] mmc: Add architecture dependency for Marvell SoC controller"
Previous message: Ming Lei: "Re: [Discussion] USB: musb-gadget: how to fix ZLP issue in musb_g_tx"
In reply to: Martin Kepplinger: "[BUG] [BISECTED] System gets unresponsive since 2.6.35-rc1"
Next in thread: Martin Kepplinger: "Re: [BUG] [BISECTED] System gets unresponsive since 2.6.35-rc1"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed, 15 Sep 2010, Martin Kepplinger wrote:

> Hi,
>
> My Ubuntu 10.04, 32bit on a Lenovo G550 Laptop has the following
> issue since .35-rc1: After a short uptime doing little things like
> video playback, the system gets slow and stuck after some time. At
> first, compiz-animations and everything (video playback,
> terminal-commands) gets unusably slow. /var/log/messages comes up
> with the following _during_ video playback. These warnings stop when
> the video is stopped. The system stays (quite) unresponsive /
> unusable.
>
> Sep 15 12:15:51 mobil pulseaudio[1456]: ratelimit.c: 19 events suppressed
> Sep 15 12:15:57 mobil pulseaudio[1456]: ratelimit.c: 25 events suppressed
> Sep 15 12:16:02 mobil pulseaudio[1456]: ratelimit.c: 6 events suppressed
> Sep 15 12:16:07 mobil pulseaudio[1456]: ratelimit.c: 11 events suppressed
> Sep 15 12:16:17 mobil pulseaudio[1456]: ratelimit.c: 12 events suppressed
> Sep 15 12:16:23 mobil pulseaudio[1456]: ratelimit.c: 3 events suppressed
>
> Shutdown also hangs very soon. The problem is still present in
> 2.6.36-rc3! I'm about to test -rc4. The last "good" one is 2.6.34.
>
> I bisected the problem. Unfortunately I could not revert the
> patch. Additionally, I'm only 90% sure about the result because I
> had to skip one step of the bisect (but since the toplevel-Makefile
> showed "2.6.34-something", I could guess it to be "good")
>
> The rest of this E-Mail is the result (git show). I can only report
> it. I'd happily test patches to it and hope this makes sense to
> someone of you.

I doubt that this is the real culprit. We just decoded a HPET problem,
which might be the real cause. Patch is below.

Can you provide the output of /proc/timer_list please ?

Thanks,

tglx
---
commit 54ff7e595d763d894104d421b103a89f7becf47c
Author: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Date: Tue Sep 14 22:10:21 2010 +0200

x86: hpet: Work around hardware stupidity

This more or less reverts commits 08be979 (x86: Force HPET
readback_cmp for all ATI chipsets) and 30a564be (x86, hpet: Restrict
read back to affected ATI chipsets) to the status of commit 8da854c
(x86, hpet: Erratum workaround for read after write of HPET
comparator).

The delta to commit 8da854c is mostly comments and the change from
WARN_ONCE to printk_once as we know the call path of this function
already.

This needs really in depth explanation:

First of all the HPET design is a complete failure. Having a counter
compare register which generates an interrupt on matching values
forces the software to do at least one superfluous readback of the
counter register.

While it is nice in theory to program "absolute" time events it is
practically useless because the timer runs at some absurd frequency
which can never be matched to real world units. So we are forced to
calculate a relative delta and this forces a readout of the actual
counter value, adding the delta and programming the compare
register. When the delta is small enough we run into the danger that
we program a compare value which is already in the past. Due to the
compare for equal nature of HPET we need to read back the counter
value after writing the compare rehgister (btw. this is necessary for
absolute timeouts as well) to make sure that we did not miss the timer
event. We try to work around that by setting the minimum delta to a
value which is larger than the theoretical time which elapses between
the counter readout and the compare register write, but that's only
true in theory. A NMI or SMI which hits between the readout and the
write can easily push us beyond that limit. This would result in
waiting for the next HPET timer interrupt until the 32bit wraparound
of the counter happens which takes about 306 seconds.

So we designed the next event function to look like:

match = read_cnt() + delta;
write_compare_ref(match);
return read_cnt() < match ? 0 : -ETIME;

At some point we got into trouble with certain ATI chipsets. Even the
above "safe" procedure failed. The reason was that the write to the
compare register was delayed probably for performance reasons. The
theory was that they wanted to avoid the synchronization of the write
with the HPET clock, which is understandable. So the write does not
hit the compare register directly instead it goes to some intermediate
register which is copied to the real compare register in sync with the
HPET clock. That opens another window for hitting the dreaded "wait
for a wraparound" problem.

To work around that "optimization" we added a read back of the compare
register which either enforced the update of the just written value or
just delayed the readout of the counter enough to avoid the issue. We
unfortunately never got any affirmative info from ATI/AMD about this.

One thing is sure, that we nuked the performance "optimization" that
way completely and I'm pretty sure that the result is worse than
before some HW folks came up with those.

Just for paranoia reasons I added a check whether the read back
compare register value was the same as the value we wrote right
before. That paranoia check triggered a couple of years after it was
added on an Intel ICH9 chipset. Venki added a workaround (commit
8da854c) which was reading the compare register twice when the first
check failed. We considered this to be a penalty in general and
restricted the readback (thus the wasted CPU cycles) to the known to
be affected ATI chipsets.

This turned out to be a utterly wrong decision. 2.6.35 testers
experienced massive problems and finally one of them bisected it down
to commit 30a564be which spured some further investigation.

Finally we got confirmation that the write to the compare register can
be delayed by up to two HPET clock cycles which explains the problems
nicely. All we can do about this is to go back to Venki's initial
workaround in a slightly modified version.

Just for the record I need to say, that all of this could have been
avoided if hardware designers and of course the HPET committee would
have thought about the consequences for a split second. It's out of my
comprehension why designing a working timer is so hard. There are two
ways to achieve it:

1) Use a counter wrap around aware compare_reg <= counter_reg
implementation instead of the easy compare_reg == counter_reg

Downsides:

- It needs more silicon.

- It needs a readout of the counter to apply a relative
timeout. This is necessary as the counter does not run in
any useful (and adjustable) frequency and there is no
guarantee that the counter which is used for timer events is
the same which is used for reading the actual time (and
therefor for calculating the delta)

Upsides:

- None

2) Use a simple down counter for relative timer events

Downsides:

- Absolute timeouts are not possible, which is not a problem
at all in the context of an OS and the expected
max. latencies/jitter (also see Downsides of #1)

Upsides:

- It needs less or equal silicon.

- It works ALWAYS

- It is way faster than a compare register based solution (One
write versus one write plus at least one and up to four
reads)

I would not be so grumpy about all of this, if I would not have been
ignored for many years when pointing out these flaws to various
hardware folks. I really hate timers (at least those which seem to be
designed by janitors).

Though finally we got a reasonable explanation plus a solution and I
want to thank all the folks involved in chasing it down and providing
valuable input to this.

Bisected-by: Nix <nix@xxxxxxxxxxxxx>
Reported-by: Artur Skawina <art.08.09@xxxxxxxxx>
Reported-by: Damien Wyart <damien.wyart@xxxxxxx>
Reported-by: John Drescher <drescherjm@xxxxxxxxx>
Cc: Venkatesh Pallipadi <venki@xxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxx>
Cc: H. Peter Anvin <hpa@xxxxxxxxx>
Cc: Arjan van de Ven <arjan@xxxxxxxxxxxxxxx>
Cc: Andreas Herrmann <andreas.herrmann3@xxxxxxx>
Cc: Borislav Petkov <borislav.petkov@xxxxxxx>
Cc: stable@xxxxxxxxxx
Acked-by: Suresh Siddha <suresh.b.siddha@xxxxxxxxx>
Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 004e6e2..1d5c08a 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -68,7 +68,6 @@ extern unsigned long force_hpet_address;
extern u8 hpet_blockid;
extern int hpet_force_user;
extern u8 hpet_msi_disable;
-extern u8 hpet_readback_cmp;
extern int is_hpet_enabled(void);
extern int hpet_enable(void);
extern void hpet_disable(void);
diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index e5cc7e8..ebdb85c 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -18,7 +18,6 @@
#include <asm/apic.h>
#include <asm/iommu.h>
#include <asm/gart.h>
-#include <asm/hpet.h>

static void __init fix_hypertransport_config(int num, int slot, int func)
{
@@ -192,21 +191,6 @@ static void __init ati_bugs_contd(int num, int slot, int func)
}
#endif

-/*
- * Force the read back of the CMP register in hpet_next_event()
- * to work around the problem that the CMP register write seems to be
- * delayed. See hpet_next_event() for details.
- *
- * We do this on all SMBUS incarnations for now until we have more
- * information about the affected chipsets.
- */
-static void __init ati_hpet_bugs(int num, int slot, int func)
-{
-#ifdef CONFIG_HPET_TIMER
- hpet_readback_cmp = 1;
-#endif
-}
-
#define QFLAG_APPLY_ONCE 0x1
#define QFLAG_APPLIED 0x2
#define QFLAG_DONE (QFLAG_APPLY_ONCE|QFLAG_APPLIED)
@@ -236,8 +220,6 @@ static struct chipset early_qrk[] __initdata = {
PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },
{ PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
- { PCI_VENDOR_ID_ATI, PCI_ANY_ID,
- PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_hpet_bugs },
{}
};

diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 351f9c0..410fdb3 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -35,7 +35,6 @@
unsigned long hpet_address;
u8 hpet_blockid; /* OS timer block num */
u8 hpet_msi_disable;
-u8 hpet_readback_cmp;

#ifdef CONFIG_PCI_MSI
static unsigned long hpet_num_timers;
@@ -395,23 +394,27 @@ static int hpet_next_event(unsigned long delta,
* at that point and we would wait for the next hpet interrupt
* forever. We found out that reading the CMP register back
* forces the transfer so we can rely on the comparison with
- * the counter register below.
+ * the counter register below. If the read back from the
+ * compare register does not match the value we programmed
+ * then we might have a real hardware problem. We can not do
+ * much about it here, but at least alert the user/admin with
+ * a prominent warning.
*
- * That works fine on those ATI chipsets, but on newer Intel
- * chipsets (ICH9...) this triggers due to an erratum: Reading
- * the comparator immediately following a write is returning
- * the old value.
+ * An erratum on some chipsets (ICH9,..), results in
+ * comparator read immediately following a write returning old
+ * value. Workaround for this is to read this value second
+ * time, when first read returns old value.
*
- * We restrict the read back to the affected ATI chipsets (set
- * by quirks) and also run it with hpet=verbose for debugging
- * purposes.
+ * In fact the write to the comparator register is delayed up
+ * to two HPET cycles so the workaround we tried to restrict
+ * the readback to those known to be borked ATI chipsets
+ * failed miserably. So we give up on optimizations forever
+ * and penalize all HPET incarnations unconditionally.
*/
- if (hpet_readback_cmp || hpet_verbose) {
- u32 cmp = hpet_readl(HPET_Tn_CMP(timer));
-
- if (cmp != cnt)
+ if (unlikely((u32)hpet_readl(HPET_Tn_CMP(timer)) != cnt)) {
+ if (hpet_readl(HPET_Tn_CMP(timer)) != cnt)
printk_once(KERN_WARNING
- "hpet: compare register read back failed.\n");
+ "hpet: compare register read back failed.\n");
}

return (s32)(hpet_readl(HPET_COUNTER) - cnt) >= 0 ? -ETIME : 0;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Haojian Zhuang: "Re: [PATCH] mmc: Add architecture dependency for Marvell SoC controller"
Previous message: Ming Lei: "Re: [Discussion] USB: musb-gadget: how to fix ZLP issue in musb_g_tx"
In reply to: Martin Kepplinger: "[BUG] [BISECTED] System gets unresponsive since 2.6.35-rc1"
Next in thread: Martin Kepplinger: "Re: [BUG] [BISECTED] System gets unresponsive since 2.6.35-rc1"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]