Re: [tip:x86/urgent] x86: hpet: Work around hardware stupidity

From: Venkatesh Pallipadi
Date: Tue Sep 14 2010 - 19:36:50 EST


On Tue, Sep 14, 2010 at 4:10 PM, tip-bot for Thomas Gleixner
<tglx@xxxxxxxxxxxxx> wrote:
> Commit-ID:  54ff7e595d763d894104d421b103a89f7becf47c
> Gitweb:     http://git.kernel.org/tip/54ff7e595d763d894104d421b103a89f7becf47c
> Author:     Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> AuthorDate: Tue, 14 Sep 2010 22:10:21 +0200
> Committer:  Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> CommitDate: Wed, 15 Sep 2010 00:55:13 +0200
>
> x86: hpet: Work around hardware stupidity
>
> This more or less reverts commits 08be979 (x86: Force HPET
> readback_cmp for all ATI chipsets) and 30a564be (x86, hpet: Restrict
> read back to affected ATI chipsets) to the status of commit 8da854c
> (x86, hpet: Erratum workaround for read after write of HPET
> comparator).
>
> The delta to commit 8da854c is mostly comments and the change from
> WARN_ONCE to printk_once as we know the call path of this function
> already.
>
> This needs really in depth explanation:
>
> First of all the HPET design is a complete failure. Having a counter
> compare register which generates an interrupt on matching values
> forces the software to do at least one superfluous readback of the
> counter register.
>
> While it is nice in theory to program "absolute" time events it is
> practically useless because the timer runs at some absurd frequency
> which can never be matched to real world units. So we are forced to
> calculate a relative delta and this forces a readout of the actual
> counter value, adding the delta and programming the compare
> register. When the delta is small enough we run into the danger that
> we program a compare value which is already in the past. Due to the
> compare for equal nature of HPET we need to read back the counter
> value after writing the compare rehgister (btw. this is necessary for
> absolute timeouts as well) to make sure that we did not miss the timer
> event. We try to work around that by setting the minimum delta to a
> value which is larger than the theoretical time which elapses between
> the counter readout and the compare register write, but that's only
> true in theory. A NMI or SMI which hits between the readout and the
> write can easily push us beyond that limit. This would result in
> waiting for the next HPET timer interrupt until the 32bit wraparound
> of the counter happens which takes about 306 seconds.
>
> So we designed the next event function to look like:
>
>   match = read_cnt() + delta;
>   write_compare_ref(match);
>   return read_cnt() < match ? 0 : -ETIME;
>
> At some point we got into trouble with certain ATI chipsets. Even the
> above "safe" procedure failed. The reason was that the write to the
> compare register was delayed probably for performance reasons. The
> theory was that they wanted to avoid the synchronization of the write
> with the HPET clock, which is understandable. So the write does not
> hit the compare register directly instead it goes to some intermediate
> register which is copied to the real compare register in sync with the
> HPET clock. That opens another window for hitting the dreaded "wait
> for a wraparound" problem.
>
> To work around that "optimization" we added a read back of the compare
> register which either enforced the update of the just written value or
> just delayed the readout of the counter enough to avoid the issue. We
> unfortunately never got any affirmative info from ATI/AMD about this.
>
> One thing is sure, that we nuked the performance "optimization" that
> way completely and I'm pretty sure that the result is worse than
> before some HW folks came up with those.
>
> Just for paranoia reasons I added a check whether the read back
> compare register value was the same as the value we wrote right
> before. That paranoia check triggered a couple of years after it was
> added on an Intel ICH9 chipset. Venki added a workaround (commit
> 8da854c) which was reading the compare register twice when the first
> check failed. We considered this to be a penalty in general and
> restricted the readback (thus the wasted CPU cycles) to the known to
> be affected ATI chipsets.
>
> This turned out to be a utterly wrong decision. 2.6.35 testers
> experienced massive problems and finally one of them bisected it down
> to commit 30a564be which spured some further investigation.
>
> Finally we got confirmation that the write to the compare register can
> be delayed by up to two HPET clock cycles which explains the problems
> nicely. All we can do about this is to go back to Venki's initial
> workaround in a slightly modified version.
>
> Just for the record I need to say, that all of this could have been
> avoided if hardware designers and of course the HPET committee would
> have thought about the consequences for a split second. It's out of my
> comprehension why designing a working timer is so hard. There are two
> ways to achieve it:
>
>  1) Use a counter wrap around aware compare_reg <= counter_reg
>    implementation instead of the easy compare_reg == counter_reg
>
>    Downsides:
>
>        - It needs more silicon.
>
>        - It needs a readout of the counter to apply a relative
>          timeout. This is necessary as the counter does not run in
>          any useful (and adjustable) frequency and there is no
>          guarantee that the counter which is used for timer events is
>          the same which is used for reading the actual time (and
>          therefor for calculating the delta)
>
>    Upsides:
>
>        - None
>
>  2) Use a simple down counter for relative timer events
>
>    Downsides:
>
>        - Absolute timeouts are not possible, which is not a problem
>          at all in the context of an OS and the expected
>          max. latencies/jitter (also see Downsides of #1)
>
>   Upsides:
>
>        - It needs less or equal silicon.
>
>        - It works ALWAYS
>
>        - It is way faster than a compare register based solution (One
>          write versus one write plus at least one and up to four
>          reads)
>
> I would not be so grumpy about all of this, if I would not have been
> ignored for many years when pointing out these flaws to various
> hardware folks. I really hate timers (at least those which seem to be
> designed by janitors).
>
> Though finally we got a reasonable explanation plus a solution and I
> want to thank all the folks involved in chasing it down and providing
> valuable input to this.
>
> Bisected-by: Nix <nix@xxxxxxxxxxxxx>
> Reported-by: Artur Skawina <art.08.09@xxxxxxxxx>
> Reported-by: Damien Wyart <damien.wyart@xxxxxxx>
> Reported-by: John Drescher <drescherjm@xxxxxxxxx>
> Cc: Venkatesh Pallipadi <venki@xxxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxx>
> Cc: H. Peter Anvin <hpa@xxxxxxxxx>
> Cc: Arjan van de Ven <arjan@xxxxxxxxxxxxxxx>
> Cc: Andreas Herrmann <andreas.herrmann3@xxxxxxx>
> Cc: Borislav Petkov <borislav.petkov@xxxxxxx>
> Cc: stable@xxxxxxxxxx
> Acked-by: Suresh Siddha <suresh.b.siddha@xxxxxxxxx>
> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> ---
>  arch/x86/include/asm/hpet.h    |    1 -
>  arch/x86/kernel/early-quirks.c |   18 ------------------
>  arch/x86/kernel/hpet.c         |   31 +++++++++++++++++--------------
>  3 files changed, 17 insertions(+), 33 deletions(-)
>
> diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
> index 004e6e2..1d5c08a 100644
> --- a/arch/x86/include/asm/hpet.h
> +++ b/arch/x86/include/asm/hpet.h
> @@ -68,7 +68,6 @@ extern unsigned long force_hpet_address;
>  extern u8 hpet_blockid;
>  extern int hpet_force_user;
>  extern u8 hpet_msi_disable;
> -extern u8 hpet_readback_cmp;
>  extern int is_hpet_enabled(void);
>  extern int hpet_enable(void);
>  extern void hpet_disable(void);
> diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
> index e5cc7e8..ebdb85c 100644
> --- a/arch/x86/kernel/early-quirks.c
> +++ b/arch/x86/kernel/early-quirks.c
> @@ -18,7 +18,6 @@
>  #include <asm/apic.h>
>  #include <asm/iommu.h>
>  #include <asm/gart.h>
> -#include <asm/hpet.h>
>
>  static void __init fix_hypertransport_config(int num, int slot, int func)
>  {
> @@ -192,21 +191,6 @@ static void __init ati_bugs_contd(int num, int slot, int func)
>  }
>  #endif
>
> -/*
> - * Force the read back of the CMP register in hpet_next_event()
> - * to work around the problem that the CMP register write seems to be
> - * delayed. See hpet_next_event() for details.
> - *
> - * We do this on all SMBUS incarnations for now until we have more
> - * information about the affected chipsets.
> - */
> -static void __init ati_hpet_bugs(int num, int slot, int func)
> -{
> -#ifdef CONFIG_HPET_TIMER
> -       hpet_readback_cmp = 1;
> -#endif
> -}
> -
>  #define QFLAG_APPLY_ONCE       0x1
>  #define QFLAG_APPLIED          0x2
>  #define QFLAG_DONE             (QFLAG_APPLY_ONCE|QFLAG_APPLIED)
> @@ -236,8 +220,6 @@ static struct chipset early_qrk[] __initdata = {
>          PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },
>        { PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
>          PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
> -       { PCI_VENDOR_ID_ATI, PCI_ANY_ID,
> -         PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_hpet_bugs },
>        {}
>  };
>
> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
> index 351f9c0..410fdb3 100644
> --- a/arch/x86/kernel/hpet.c
> +++ b/arch/x86/kernel/hpet.c
> @@ -35,7 +35,6 @@
>  unsigned long                          hpet_address;
>  u8                                     hpet_blockid; /* OS timer block num */
>  u8                                     hpet_msi_disable;
> -u8                                     hpet_readback_cmp;
>
>  #ifdef CONFIG_PCI_MSI
>  static unsigned long                   hpet_num_timers;
> @@ -395,23 +394,27 @@ static int hpet_next_event(unsigned long delta,
>         * at that point and we would wait for the next hpet interrupt
>         * forever. We found out that reading the CMP register back
>         * forces the transfer so we can rely on the comparison with
> -        * the counter register below.
> +        * the counter register below. If the read back from the
> +        * compare register does not match the value we programmed
> +        * then we might have a real hardware problem. We can not do
> +        * much about it here, but at least alert the user/admin with
> +        * a prominent warning.
>         *
> -        * That works fine on those ATI chipsets, but on newer Intel
> -        * chipsets (ICH9...) this triggers due to an erratum: Reading
> -        * the comparator immediately following a write is returning
> -        * the old value.
> +        * An erratum on some chipsets (ICH9,..), results in
> +        * comparator read immediately following a write returning old
> +        * value. Workaround for this is to read this value second
> +        * time, when first read returns old value.
>         *
> -        * We restrict the read back to the affected ATI chipsets (set
> -        * by quirks) and also run it with hpet=verbose for debugging
> -        * purposes.
> +        * In fact the write to the comparator register is delayed up
> +        * to two HPET cycles so the workaround we tried to restrict
> +        * the readback to those known to be borked ATI chipsets
> +        * failed miserably. So we give up on optimizations forever
> +        * and penalize all HPET incarnations unconditionally.
>         */
> -       if (hpet_readback_cmp || hpet_verbose) {
> -               u32 cmp = hpet_readl(HPET_Tn_CMP(timer));
> -
> -               if (cmp != cnt)
> +       if (unlikely((u32)hpet_readl(HPET_Tn_CMP(timer)) != cnt)) {
> +               if (hpet_readl(HPET_Tn_CMP(timer)) != cnt)

Minor nit.
I guess (u32) in first check above is not needed as hpet_readl
actually returns unsigned int.
Otherwise
Acked-by: Venkatesh Pallipadi <venki@xxxxxxxxxx>

>                        printk_once(KERN_WARNING
> -                           "hpet: compare register read back failed.\n");
> +                               "hpet: compare register read back failed.\n");
>        }
>
>        return (s32)(hpet_readl(HPET_COUNTER) - cnt) >= 0 ? -ETIME : 0;
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/