MTRR on Xen - BIOS use and implications for Linux

From: Luis R. Rodriguez
Date: Wed Mar 16 2016 - 16:09:22 EST


As of v4.3, Linux no longer makes any direct use of MTRR calls: the
mtrr_add() symbol is no longer exported to drivers, which must use the
PAT-compatible arch_phys_wc_add() from then on. This is a huge win for
Linux on the x86 front for a few non-Xen related reasons, one being
that on x86 Linux we may soon be able to flip the default for
ioremap_nocache() from UC- to UC on PAT systems. On Xen, however, this
has a further series of benefits. The biggest one is that we did not
end up having to implement MTRR hypervisor call support for Linux
guests. A side benefit is that, long term, none of the Linux x86 MTRR
code is ever called on Linux Xen guests, as Xen guests are blessed
with the MTRR MSR unset. If you look at the Linux MTRR code, it's a
huge convoluted mess, which includes calling stop_machine() on each
CPU during bootup, resume, and CPU online, and of course whenever an
MTRR would have been set.

I'd like to review some last concerns and notes. First, long ago Toshi
had mentioned that even if the kernel does not use MTRR directly the
BIOS may have, and in some cases this is very likely. It only
recently became clear why: he notes that, as far as he can tell, the
BIOS can *only* use MTRR to specify a UC cache attribute on x86. I
recently asked for confirmation on this [0]: as I see it, if future
BIOSes could deprecate MTRR it would benefit Linux, since we could
avoid all that convoluted MTRR code and also gain parity in
functionality with Linux Xen guests as far as MTRR goes. Toshi notes
this is likely not possible even in the future, so we'll see why. In
the meantime this also means that, as a last MTRR consideration on
Xen, we should at the very least account for BIOS use of MTRR, so we
can ensure the setup done by the BIOS is translated to guests, even if
they only use PAT. I started looking at what Xen does, and I have a
few notes on the existing hypervisor implementation that I'd like to
review with you, to confirm the behavior and see if we need anything
else.

Keir Fraser long ago, in 2009, committed a change to Xen which clips
the RAM available to Xen based on the MTRRs [1], and later made it
apply only to Intel systems [2] while also adding a boot parameter,
[no-]e820-mtrr-clip, to either force-enable this clipping on any
system or force-disable it. Reviewing that logic, it seems it's
trying to confirm that the MTRRs default to WB, and if so it leaves
the MTRR ranges as part of the e820 memory given to a guest. If the
default is not WB, it iterates over the variable ranges (Linux
set_num_var_ranges() on Intel uses rdmsr(MSR_MTRRcap, config, dummy);
to get the count, as does the Xen implementation for mtrr_cap) and
looks for the highest WB range (and notes "overlapping UC/WT ranges
dominate"?). Once it has that, it trims the memory to the end of the
highest WB MTRR. This looks very similar to what Linux does in
mtrr_cleanup(), only, if I understand this correctly, Linux also
disregards this cleanup if the number of variable WB MTRRs + number of
variable UC entries matches the total number of variable ranges (see
Linux mtrr_need_cleanup()). I wonder if this needs fixing / updating.
Also, since it seems we trim up to the last WB MTRR if any UC MTRR is
encountered, I wonder if this would suffice to address the BIOS UC
concerns. If anything, and if I understood this correctly, it would
seem this just discards UC MTRRs from the guests. Fan control was
mentioned as one example use of UC by the BIOS; is the BIOS then in
full control of the fan when this is done? This also seems to be done
only for Intel, matching Linux's mtrr_cleanup(); do we not need such
considerations for non-Intel CPUs?
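
For illustration, the clipping logic described above can be sketched
in user-space C roughly as follows. This is not the Xen code: struct
var_mtrr, wb_top_of_ram() and example_wb_top() are made-up names, the
register values are passed in as plain integers instead of being read
with rdmsr, and phys_bits (the CPU's physical address width) is
assumed to be supplied by the caller. Only the architectural layout is
relied on: the low byte holds the memory type (6 is WB), and bit 11 is
the enable bit in MTRRdefType and the valid bit in PHYSMASK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MTRR_TYPE_WRBACK 6u           /* architectural WB memory type */
#define MTRR_ENABLE_BIT  (1ull << 11) /* E in MTRRdefType, V in PHYSMASK */
#define PAGE_SZ          4096ull

/* Raw PHYSBASE/PHYSMASK pair of one variable range (hypothetical type). */
struct var_mtrr {
    uint64_t base;
    uint64_t mask;
};

/*
 * Return the end of the highest WB variable range, or 0 when no
 * clipping applies (MTRRs disabled, or default type already WB).
 */
static uint64_t wb_top_of_ram(uint64_t def_type, const struct var_mtrr *v,
                              size_t n, unsigned int phys_bits)
{
    uint64_t addr_mask, top = 0;
    size_t i;

    assert(phys_bits > 12 && phys_bits < 64);
    addr_mask = ((1ull << phys_bits) - 1) & ~(PAGE_SZ - 1);

    if (!(def_type & MTRR_ENABLE_BIT) ||
        (uint8_t)def_type == MTRR_TYPE_WRBACK)
        return 0;

    for (i = 0; i < n; i++) {
        uint64_t base, mask, end;

        /* Skip invalid ranges and anything that is not WB. */
        if (!(v[i].mask & MTRR_ENABLE_BIT) ||
            (uint8_t)v[i].base != MTRR_TYPE_WRBACK)
            continue;

        base = v[i].base & addr_mask;
        mask = v[i].mask & addr_mask;
        end = ((base | ~mask) & addr_mask) + PAGE_SZ;
        if (end > top)
            top = end;
    }
    return top;
}

/* Example: WB at 0-2 GiB and 2-3 GiB, UC at 3-4 GiB, default type UC. */
static uint64_t example_wb_top(void)
{
    const struct var_mtrr v[] = {
        { 0x000000000ull | 6, 0xF80000000ull | MTRR_ENABLE_BIT },
        { 0x080000000ull | 6, 0xFC0000000ull | MTRR_ENABLE_BIT },
        { 0x0C0000000ull | 0, 0xFC0000000ull | MTRR_ENABLE_BIT },
    };

    return wb_top_of_ram(MTRR_ENABLE_BIT, v, 3, 36);
}
```

With this made-up layout example_wb_top() returns 0xC0000000, so RAM
would be clipped at 3 GiB, and the UC range is never inspected at all.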

Toshi noted a while ago as well that if the BIOS/firmware enables MTRR
but the kernel does not have it enabled, one issue is ensuring that
any MTRRs set up by the BIOS are respected, in particular UC settings;
that concern is raised above. Another issue, though, is that the
kernel would be "unable to verify if a large page mapping is aligned
with MTRRs" [3], so mtrr_type_lookup() would have to return a valid
type as it runs on the platform. For Linux this means we'd have to
implement a get_mtrr() call for Xen... only to use the hypercall
XENPF_read_memtype. I looked at this prospect but it is rather odd:
given that MTRRs would be disabled on the MSR, I'm rather afraid of
hacking stuff on top of Linux's MTRR code to address this requirement.
We currently do not support enabling just one MTRR hook when MTRRs are
disabled but you're a guest and need to ask the hypervisor for the
MTRR type... Supporting this for Xen seems to me like a terrible idea.
Instead I wonder if we can address this concern on Xen by simply
taking all MTRR ranges completely out of the memory given to guests in
the e820 map, keeping all MTRRs internal to the hypervisor. Or does
the cleanup solution in place already suffice? I would hope this is
reasonable given that all drivers, on Linux at least, now don't use
MTRR directly (with the exception of ivtv, a legacy driver, and ipath,
on its way out of the kernel).
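
To make the large page concern concrete, here is a minimal user-space
sketch of the kind of check involved. It does not reproduce Linux's
mtrr_type_lookup(); struct mtrr_region, mapping_is_uniform() and the
example helpers are hypothetical names, and the regions are assumed to
be already decoded into [start, end) byte ranges. The idea is that a
mapping has a single effective type only if every MTRR range that
touches it also covers it completely.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One decoded MTRR range, [start, end) in bytes (hypothetical type). */
struct mtrr_region {
    uint64_t start;
    uint64_t end;
};

/*
 * Return true when no MTRR range only partially overlaps the mapping
 * [start, end), i.e. the same set of ranges applies to every byte.
 */
static bool mapping_is_uniform(const struct mtrr_region *r, size_t n,
                               uint64_t start, uint64_t end)
{
    size_t i;

    for (i = 0; i < n; i++) {
        bool overlaps = r[i].start < end && start < r[i].end;
        bool covers = r[i].start <= start && end <= r[i].end;

        if (overlaps && !covers)
            return false;
    }
    return true;
}

/* Assumed layout: WB over 2-3 GiB, with a 1 MiB UC range at its top. */
static const struct mtrr_region example_regions[] = {
    { 0x80000000ull, 0xC0000000ull }, /* WB */
    { 0xBFF00000ull, 0xC0000000ull }, /* UC */
};

/* A 2 MiB page at 2 GiB sits entirely inside the WB range. */
static bool example_2m_page_ok(void)
{
    return mapping_is_uniform(example_regions, 2,
                              0x80000000ull, 0x80200000ull);
}

/* A 1 GiB page at 2 GiB is split by the UC range near its end. */
static bool example_1g_page_ok(void)
{
    return mapping_is_uniform(example_regions, 2,
                              0x80000000ull, 0xC0000000ull);
}
```

In this layout the 2 MiB mapping is uniform while the 1 GiB mapping is
not, which is the situation a guest cannot detect if it has no way to
look up the MTRR type.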

With regards to the Linux cleanup code, some discussion on this also
stalled a while ago. Back in August 2015 Stewards proposed extending
the mtrr_cleanup() code in Linux to support systems with more than
4 GiB of RAM, where it seems mtrr_cleanup() fails with large memory
configurations because it limits chunk_size to 2 GiB, meaning each
MTRR can only cover 2 GiB of memory [4]. He noted that since some
systems with, say, 256 GiB of RAM may have only ten variable MTRRs, it
may not be possible to use MTRRs to cover all of memory. He extended
the MTRR chunk size to support larger memory -- but since we no longer
use MTRR in kernel drivers upstream I raised the question of whether
we even need the MTRR cleanup code anymore. If we don't want to
interfere with the BIOS MTRR setup, can we just ignore MTRRs on Linux
on the e820 map as well? Or do we need that for some kernel
interfaces? If so, which are they?
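
The arithmetic behind the chunk size limit can be shown with a toy
model. It is not what mtrr_cleanup() does (the real code can also
carve UC holes out of a larger WB range); mtrrs_needed() is a
hypothetical helper that only counts how many power-of-two, naturally
aligned ranges are needed to cover memory starting at address 0 when
no single range may exceed max_chunk.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count the power-of-two, size-aligned ranges needed to cover
 * [0, mem) when no single range may be larger than max_chunk.
 */
static unsigned int mtrrs_needed(uint64_t mem, uint64_t max_chunk)
{
    uint64_t addr = 0;
    unsigned int count = 0;

    /* max_chunk must be a non-zero power of two. */
    assert(max_chunk && !(max_chunk & (max_chunk - 1)));

    while (addr < mem) {
        uint64_t size = max_chunk;

        /* Shrink until the range is aligned at addr and fits. */
        while ((addr & (size - 1)) || size > mem - addr)
            size >>= 1;

        addr += size;
        count++;
    }
    return count;
}
```

Under this model mtrrs_needed(256ull << 30, 2ull << 30) is 128, far
more than ten variable MTRRs, while raising the limit to 1 TiB
(1ull << 40) brings the same 256 GiB down to a single range.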

[0] http://lkml.kernel.org/r/20160315232916.GJ1990@xxxxxxxxxxxxx
[1] http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=522b335995907366ff995a36a8098bc6b1e4cdf1
[2] http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=9d85142e675142191f64d73aa40791a03a9f7389
[3] http://lkml.kernel.org/r/1441322474.3277.78.camel@xxxxxxx
[4] http://lkml.kernel.org/r/55E47B4D.1050103@xxxxxxxxx
[5] http://lkml.kernel.org/r/CAB=NE6X3ix5pSp2u6owraV73CfP+JBh+Ct0Ek8bNvw1Ft-5bGw@xxxxxxxxxxxxxx

Luis