Re: [git-pull -tip] x86: msr architecture debug code

From: Andreas Herrmann
Date: Thu Mar 05 2009 - 12:02:16 EST


On Thu, Mar 05, 2009 at 03:08:09PM +0100, Ingo Molnar wrote:
> * Andreas Herrmann <andreas.herrmann3@xxxxxxx> wrote:
> > Having this stuff in the kernel unnecessarily bloats up kernel code.
>
> it should be a default-off Kconfig option and it is in debugfs
> so there's no real bloat issue here.

I have attached parts of an autogenerated file that contains MSR
definitions for AMD family 10h in a condensed format. I stripped out
some lines; the full file has 487 lines and is about 30 KB.
Do you really want similar definitions for all x86 CPUs in-kernel?

> > What the kernel needs to provide is a reliable interface to
> > access MSRs -- to pass the data to userspace. This interface
> > is already there.
> >
> > IMHO all kinds of parsing and grouping of that data belong in
> > user space.
> >
> > One exception is MSRs that need to be checked early during
> > boot (e.g. MTRRs). For debugging purposes you might want to
> > dump certain MSRs early. But then you will use printk and not
> > debugfs.
>
> Well it's really nice to know the _kernel's_ enumeration of MSRs
> and its knowledge about the structure of those MSRs.
>
> Sure, we can and do export the flat MSR space to user-space, but
> the kernel also enumerates them internally, in various places.
> The debugfs interface shows them in one way - and as such also
> acts as a central force to keep these things tidy.
>
> a VFS namespace is also pretty educative. You can see which MSRs
> matter to the lapic for example, you can see their symbolic
> names, their current state, etc. etc.

> > > Maybe a symlink pointing it back to the topic directory
> > > would be useful as well. For example:
> > >
> > > /debug/x86/cpu/msr/raw/0x372/topic_dir -> /debug/x86/cpu/msr/pmu/pmc_0/
> > >
> > > Other "topic directories" are possible too: a
> > > /debug/x86/cpu/msr/apic/ layout would be very useful and
> > > informative as well, and so are some of the other MSRs we
> > > tweak during bootup.
> >
> > All nice suggestions but why in-kernel?
> >
> > Just hack some script to do this. This is much more
> > maintainable. You don't need a kernel update to add support
> > for new CPUs or to fix bugs in this code itself -- you just
> > have to tweak your script.
>
> the kernel tends to know a lot about these MSRs already so we
> just provide that information in a more structured form as well.
>
> Such more structured form, beyond the debugging and
> education/development advantages, also acts as a counter-force
> back to the MSR enumeration code of the kernel and makes them
> more structured. It will no doubt also extend the kernel's
> knowledge of MSRs - read-only MSRs we don't normally read.

If we don't read them
we don't need them --
in kernel code.

Knowledge of MSRs is usually required by specific code: drivers or
subsystems. I think we should only add MSR information if it is
needed for real kernel functionality. Some examples are:

- MCA MSRs for mce
- Pstate and FIDVID MSRs for powernow-k8
- MTRRs for cpu/mtrr code
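
As an illustration of the kind of MSR knowledge the mce code actually
needs, here is a minimal sketch that decodes the architectural flag
bits of an MCi_STATUS value. The bit positions are taken from the
LSMCAstatus and MC5_STATUS layouts in the attached file (PCC is bit 57,
VAL is bit 63); the macro and function names are hypothetical and are
not the kernel's own definitions.

```c
#include <stdint.h>

/* Flag bits of an MCi_STATUS MSR, following the top seven fields of
 * fam10h_LSMCAstatus / fam10h_MC5_STATUS. Names are illustrative. */
#define MCS_PCC   (1ULL << 57) /* processor context corrupt */
#define MCS_ADDRV (1ULL << 58) /* MCi_ADDR holds a valid address */
#define MCS_MISCV (1ULL << 59) /* MCi_MISC holds valid data */
#define MCS_EN    (1ULL << 60) /* error reporting was enabled */
#define MCS_UC    (1ULL << 61) /* error was not corrected */
#define MCS_OVER  (1ULL << 62) /* an earlier error was overwritten */
#define MCS_VAL   (1ULL << 63) /* register contents are valid */

/* ErrorCode occupies bits 0-15 of the status value. */
static inline unsigned mc_error_code(uint64_t status)
{
	return (unsigned)(status & 0xffff);
}

/* ErrorCodeExt occupies bits 16-19. */
static inline unsigned mc_error_code_ext(uint64_t status)
{
	return (unsigned)((status >> 16) & 0xf);
}

/* Nonzero if the bank holds a valid, uncorrected error. */
static inline int mc_uncorrected(uint64_t status)
{
	return (status & MCS_VAL) && (status & MCS_UC);
}

/* Nonzero if the paired MCi_ADDR register is worth reading. */
static inline int mc_addr_valid(uint64_t status)
{
	return (status & MCS_VAL) && (status & MCS_ADDRV);
}
```

This is the narrow, function-driven view: the mce code only has to
know the handful of bits it acts on, not a full decode of every field.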

We don't have interfaces for PCI devices that show all of their config
space register values in decoded form. The kernel provides an
interface to retrieve the raw values from userspace, and usually you
call lspci to decode the standard information and to dump all the
rest.

For MSRs we have such an interface, too. What is lacking is a standard
tool to do the decoding. (As a start, you can use lsmsr.)
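
For reference, the existing interface is the msr driver's
/dev/cpu/N/msr device: the file offset passed to pread() selects the
MSR, and each read returns its 8-byte value. A minimal userspace sketch
follows; the function names are hypothetical, and it assumes the msr
driver is loaded and the caller has sufficient privileges (typically
root).

```c
#define _XOPEN_SOURCE 700
#define _FILE_OFFSET_BITS 64 /* MSR indices like 0xc0011030 need a 64-bit off_t */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Read one 64-bit MSR through an already open msr device.
 * The file offset is the MSR index; returns 0 on success, -1 on error. */
int msr_read_fd(int fd, uint32_t msr, uint64_t *val)
{
	ssize_t n = pread(fd, val, sizeof(*val), (off_t)msr);

	return n == (ssize_t)sizeof(*val) ? 0 : -1;
}

/* Read MSR 'msr' on CPU 'cpu' via /dev/cpu/<cpu>/msr. */
int msr_read(int cpu, uint32_t msr, uint64_t *val)
{
	char path[64];
	int fd, ret;

	snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	ret = msr_read_fd(fd, msr, val);
	close(fd);
	return ret;
}
```

For example, msr_read(0, 0x1b, &val) would fetch APIC_BASE on CPU 0;
the raw value can then be handed to a userspace decoder, much like
lspci works on raw config space.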

> There's also a few other things like the IRR readout in the APIC
> code or the perfcounters status dump can also be done cleanly
> via /debug/x86/cpu/msr/.
>
> Eventually i'd like /debug/x86/ to become a full CPU state dump:
> the kernel pagetable dumping code could go there, we could show
> control registers, we could show the GDT and IDT settings and
> contents, etc. etc.

Yes, we could do a lot in the kernel. But should we?

I agree that dumping and decoding MSRs (and also the CPU config space
registers of AMD CPUs) is sometimes needed for debugging. But doing
all of this in-kernel? I don't think that's cool.


Regards,
Andreas

--
/*
* Licensed under the terms of the GNU GENERAL PUBLIC LICENSE version 2.
* See file COPYING for details.
*/

#ifndef fam10h_h
#define fam10h_h

#include "../msr.h"

_RANGE(fam10h_LSMCAaddr,48,16,0);
_NAMES(fam10h_LSMCAaddr,"ADDR",0);
_RANGE(fam10h_LSMCAstatus,16,4,25,1,1,8,2,1,1,1,1,1,1,1,0);
_NAMES(fam10h_LSMCAstatus,"ErrorCode","ErrorCodeExt",0,"UECC","CECC","SYND",0,"PCC","ADDRV","MISCV","EN","UC","OVER","VAL");
_RANGE(fam10h_TSC,64,0);
_NAMES(fam10h_TSC,"TSC");
_RANGE(fam10h_APIC_BASE,8,1,2,1,36,16,0);
_NAMES(fam10h_APIC_BASE,0,"BSC",0,"ApicEn","ApicBar",0);
_RANGE(fam10h_EBL_CR_POWERON,16,2,46,0);
_NAMES(fam10h_EBL_CR_POWERON,0,"ClusterID",0);
_RANGE(fam10h_PATCH_LEVEL,32,32,0);
_NAMES(fam10h_PATCH_LEVEL,"PATCH_LEVEL",0);
_RANGE(fam10h_MTRRcap,8,1,1,1,53,0);
_NAMES(fam10h_MTRRcap,"MtrrCapVCnt","MtrrCapFix",0,"MtrrCapWc",0);
_RANGE(fam10h_SYSENTER_CS,16,48,0);
_NAMES(fam10h_SYSENTER_CS,"SYSENTER_CS",0);
_RANGE(fam10h_SYSENTER_ESP,32,32,0);
_NAMES(fam10h_SYSENTER_ESP,"SYSENTER_ESP",0);
_RANGE(fam10h_SYSENTER_EIP,32,32,0);
_NAMES(fam10h_SYSENTER_EIP,"SYSENTER_EIP",0);
_RANGE(fam10h_MCG_CAP,8,1,55,0);
_NAMES(fam10h_MCG_CAP,"Count","MCG_CTL_P",0);
_RANGE(fam10h_MCG_STAT,1,1,1,61,0);
_NAMES(fam10h_MCG_STAT,"RIPV","EIPV","MCIP",0);
_RANGE(fam10h_MCG_CTL,1,1,1,1,1,1,58,0);
_NAMES(fam10h_MCG_CTL,"DCE","ICE","BUE","LSE","NBE","FRE",0);
_RANGE(fam10h_DBG_CTL_MSR,1,1,1,1,1,1,58,0);
_NAMES(fam10h_DBG_CTL_MSR,"LBR","BTF","PB0","PB1","PB2","PB3",0);
_RANGE(fam10h_BR_FROM,64,0);
_NAMES(fam10h_BR_FROM,"LastBranchFromIP");

...

_RANGE(fam10h_MC5_CTL,1,63,0);
_NAMES(fam10h_MC5_CTL,"CPUWDT",0);
_RANGE(fam10h_MC5_STATUS,16,4,4,8,8,1,4,1,1,8,2,1,1,1,1,1,1,1,0);
_NAMES(fam10h_MC5_STATUS,"ErrorCode","ErrorCodeExt",0,"Syndrome",0,"Scrub",0,"UECC","CECC","Syndrome",0,"PCC","AddrV","MiscV","En","UC","OVER","VAL");
_RANGE(fam10h_MC5_ADDR,48,16,0);
_NAMES(fam10h_MC5_ADDR,"ADDR",0);
_RANGE(fam10h_MC5_MISC,12,52,0);
_NAMES(fam10h_MC5_MISC,"State",0);
_RANGE(fam10h_EFER,1,7,1,1,1,1,1,1,1,49,0);
_NAMES(fam10h_EFER,"SYSCALL",0,"LME",0,"LMA","NXE","SVME","LMSLE","FFXSE",0);
_RANGE(fam10h_STAR,32,16,16,0);
_NAMES(fam10h_STAR,"Target","SysCallSel","SysRetSel");
_RANGE(fam10h_STAR64,64,0);
_NAMES(fam10h_STAR64,"LSTAR");
_RANGE(fam10h_STARCOMPAT,64,0);
_NAMES(fam10h_STARCOMPAT,"CSTAR");
_RANGE(fam10h_SYSCALL_FLAG_MASK,32,32,0);
_NAMES(fam10h_SYSCALL_FLAG_MASK,"MASK",0);
_RANGE(fam10h_FS_BASE,64,0);
_NAMES(fam10h_FS_BASE,"FS_BASE");
_RANGE(fam10h_GS_BASE,64,0);
_NAMES(fam10h_GS_BASE,"GS_BASE");
_RANGE(fam10h_KernelGSbase,64,0);
_NAMES(fam10h_KernelGSbase,"KernelGSBase");
_RANGE(fam10h_TSC_AUX,32,32,0);
_NAMES(fam10h_TSC_AUX,"TscAux",0);
_RANGE(fam10h_MC4_MISC1,24,8,12,4,1,2,1,4,5,1,1,1,0);
_NAMES(fam10h_MC4_MISC1,0,"BlkPtr","ErrCnt",0,"Ovrflw","IntType","CntEn","LvtOffset",0,"Locked","CntP","Valid");
_RANGE(fam10h_MC4_MISC2,24,8,12,4,1,2,1,4,5,1,1,1,0);
_NAMES(fam10h_MC4_MISC2,0,"BlkPtr","ErrCnt",0,"Ovrflw","IntType","CntEn","LvtOffset",0,"Locked","CntP","Valid");
_RANGE(fam10h_MC4_MISC3,24,8,32,0);
_NAMES(fam10h_MC4_MISC3,0,"BlkPtr",0);
_RANGE(fam10h_PERF_CTL0,8,8,1,1,1,1,1,1,1,1,8,4,4,1,1,22,0);
_NAMES(fam10h_PERF_CTL0,"EventSelect","UnitMask","User","OS","Edge",0,"Int",0,"En","Inv","CntMask","EventSelect",0,"GuestOnly","HostOnly",0);
_RANGE(fam10h_PERF_CTL1,8,8,1,1,1,1,1,1,1,1,8,4,4,1,1,22,0);
_NAMES(fam10h_PERF_CTL1,"EventSelect","UnitMask","User","OS","Edge",0,"Int",0,"En","Inv","CntMask","EventSelect",0,"GuestOnly","HostOnly",0);
_RANGE(fam10h_PERF_CTL2,8,8,1,1,1,1,1,1,1,1,8,4,4,1,1,22,0);
_NAMES(fam10h_PERF_CTL2,"EventSelect","UnitMask","User","OS","Edge",0,"Int",0,"En","Inv","CntMask","EventSelect",0,"GuestOnly","HostOnly",0);
_RANGE(fam10h_PERF_CTL3,8,8,1,1,1,1,1,1,1,1,8,4,4,1,1,22,0);
_NAMES(fam10h_PERF_CTL3,"EventSelect","UnitMask","User","OS","Edge",0,"Int",0,"En","Inv","CntMask","EventSelect",0,"GuestOnly","HostOnly",0);
_RANGE(fam10h_PERF_CTR0,48,16,0);
_NAMES(fam10h_PERF_CTR0,"CTR",0);

...

_RANGE(fam10h_IbsFetchCtl,16,16,16,1,1,1,1,1,2,1,1,1,6,0);
_NAMES(fam10h_IbsFetchCtl,"IbsFetchMaxCnt","IbsFetchCnt","IbsFetchLat","IbsFetchEn","IbsFetchVal","IbsFetchComp","IbsIcMiss","IbsPhyAddrValid","IbsL1TlbPgSz","IbsL1TlbMiss","IbsL2TlbMiss","IbsRandEn",0);
_RANGE(fam10h_IbsFetchLinAd,64,0);
_NAMES(fam10h_IbsFetchLinAd,"IbsFetchLinAd");
_RANGE(fam10h_IbsFetchPhysAd,64,0);
_NAMES(fam10h_IbsFetchPhysAd,"IbsFetchPhysAd");
_RANGE(fam10h_IbsOpCtl,16,1,1,1,45,0);
_NAMES(fam10h_IbsOpCtl,"IbsOpMaxCnt",0,"IbsOpEn","IbsOpVal",0);
_RANGE(fam10h_IbsOpRip,64,0);
_NAMES(fam10h_IbsOpRip,"IbsOpRip");
_RANGE(fam10h_IbsOpData,16,16,1,1,1,1,1,1,26,0);
_NAMES(fam10h_IbsOpData,"IbsCompToRetCtr","IbsTagToRetCtr","IbsOpBrnResync","IbsOpMispReturn","IbsOpReturn","IbsOpBrnTaken","IbsOpBrnMisp","IbsOpBrnRet",0);
_RANGE(fam10h_IbsOpData2,3,1,1,1,58,0);
_NAMES(fam10h_IbsOpData2,"NbIbsReqSrc",0,"NbIbsReqDstProc","NbIbsReqCacheHitSt",0);
_RANGE(fam10h_IbsOpData3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,13,16,16,0);
_NAMES(fam10h_IbsOpData3,"IbsLdOp","IbsStOp","IbsDcL1tlbMiss","IbsDcL2tlbMiss","IbsDcL1tlbHit2M","IbsDcL1tlbHit1G","IbsDcL2tlbHit2M","IbsDcMiss","IbsDcMisAcc","IbsDcLdBnkCon","IbsDcStBnkCon","IbsDcStToLdFwd","IbsDcStToLdCan","IbsDcUcMemAcc","IbsDcWcMemAcc","IbsDcLockedOp","IbsDcMabHit","IbsDcLinAddrValid","IbsDcPhyAddrValid",0,"IbsDcMissLat",0);
_RANGE(fam10h_IbsDcLinAd,64,0);
_NAMES(fam10h_IbsDcLinAd,"IbsDcLinAd");
_RANGE(fam10h_IbsDcPhysAd,64,0);
_NAMES(fam10h_IbsDcPhysAd,"IbsDcPhysAd");
_RANGE(fam10h_IbsControl,4,4,1,55,0);
_NAMES(fam10h_IbsControl,"LvtOffset",0,"LvtOffsetVal",0);

struct reg_spec fam10h_spec [] = {
_SPEC(0x0000, LSMCAaddr, "load-store MCA address", fam10h_),
_SPEC(0x0001, LSMCAstatus, "load-store MCA status", fam10h_),
_SPEC(0x0010, TSC, "time-stamp counter", fam10h_),
_SPEC(0x001b, APIC_BASE, "APIC base address", fam10h_),
_SPEC(0x002a, EBL_CR_POWERON, "cluster ID", fam10h_),
_SPEC(0x008b, PATCH_LEVEL, "microcode patch level", fam10h_),
_SPEC(0x00fe, MTRRcap, "MTRR capabilities", fam10h_),
_SPEC(0x0174, SYSENTER_CS, "SYSENTER/SYSEXIT code segment selector", fam10h_),
_SPEC(0x0175, SYSENTER_ESP, "SYSENTER/SYSEXIT stack pointer", fam10h_),
_SPEC(0x0176, SYSENTER_EIP, "SYSENTER/SYSEXIT instruction pointer", fam10h_),
_SPEC(0x0179, MCG_CAP, "global MC capabilities", fam10h_),
_SPEC(0x017a, MCG_STAT, "global MC status", fam10h_),
_SPEC(0x017b, MCG_CTL, "global MC control", fam10h_),
_SPEC(0x01d9, DBG_CTL_MSR, "debug control", fam10h_),
_SPEC(0x01db, BR_FROM, "last branch from IP", fam10h_),
_SPEC(0x01dc, BR_TO, "last branch to IP", fam10h_),
_SPEC(0x01dd, LastExceptionFromIP, "last exception from IP", fam10h_),
_SPEC(0x01de, LastExceptionToIP, "last exception to IP", fam10h_),
_SPEC(0x0200, MTRRphysBase0, "base of variable-size MTRR (0)", fam10h_),
_SPEC(0x0201, MTRRphysMask0, "mask of variable-size MTRR (0)", fam10h_),

...

_SPEC(0xc0011023, BU_CFG, "bus unit configuration", fam10h_),
_SPEC(0xc001102a, BU_CFG2, "bus unit configuration 2", fam10h_),
_SPEC(0xc0011030, IbsFetchCtl, "IBS fetch control", fam10h_),
_SPEC(0xc0011031, IbsFetchLinAd, "IBS fetch linear address", fam10h_),
_SPEC(0xc0011032, IbsFetchPhysAd, "IBS fetch physical address", fam10h_),
_SPEC(0xc0011033, IbsOpCtl, "IBS execution control", fam10h_),
_SPEC(0xc0011034, IbsOpRip, "IBS Op logical address", fam10h_),
_SPEC(0xc0011035, IbsOpData, "IBS Op data", fam10h_),
_SPEC(0xc0011036, IbsOpData2, "IBS Op data 2", fam10h_),
_SPEC(0xc0011037, IbsOpData3, "IBS Op data 3", fam10h_),
_SPEC(0xc0011038, IbsDcLinAd, "IBS DC linear address", fam10h_),
_SPEC(0xc0011039, IbsDcPhysAd, "IBS DC physical address", fam10h_),
_SPEC(0xc001103a, IbsControl, "IBS control", fam10h_),
{0, NULL, NULL, NULL, NULL},
};

#endif /* fam10h_h */
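
The condensed format above can be decoded generically in userspace.
Judging from the data, each _RANGE line seems to list bit-field widths
from bit 0 upward, terminated by 0 (the widths always add up to 64),
and each _NAMES line gives one name per field, with 0 marking reserved
bits. The sketch below assumes that reading (the real macros live in
../msr.h, which is not shown) and uses the APIC_BASE layout as an
example; the array and function names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for _RANGE(fam10h_APIC_BASE, ...) and
 * _NAMES(fam10h_APIC_BASE, ...): widths end with 0, NULL = reserved. */
static const unsigned apic_base_range[] = { 8, 1, 2, 1, 36, 16, 0 };
static const char *const apic_base_names[] = {
	NULL, "BSC", NULL, "ApicEn", "ApicBar", NULL
};

/* Extract 'width' bits of 'v' starting at bit 'shift'. */
static uint64_t msr_bits(uint64_t v, unsigned shift, unsigned width)
{
	if (width >= 64)
		return v;
	return (v >> shift) & ((1ULL << width) - 1);
}

/* Print every named field of 'val'; returns how many were printed. */
int decode_msr(uint64_t val, const unsigned *range,
	       const char *const *names)
{
	unsigned shift = 0, i;
	int printed = 0;

	for (i = 0; range[i] != 0; i++) {
		if (names[i]) {
			printf("%s=0x%llx\n", names[i],
			       (unsigned long long)msr_bits(val, shift, range[i]));
			printed++;
		}
		shift += range[i]; /* next field starts above this one */
	}
	return printed;
}
```

With the typical boot-CPU value 0xfee00900,
decode_msr(0xfee00900, apic_base_range, apic_base_names) would print
BSC=0x1, ApicEn=0x1 and ApicBar=0xfee00. Supporting a new CPU family
then means shipping a new table, not a new kernel.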
