Re: [RFC PATCH 2/7] lib: Add huge I/O map capability interfaces

From: Toshi Kani
Date: Thu Feb 05 2015 - 15:56:44 EST


On Mon, 2015-01-26 at 18:01 -0700, Toshi Kani wrote:
> On Mon, 2015-01-26 at 15:54 -0800, Andrew Morton wrote:
> > On Mon, 26 Jan 2015 16:13:24 -0700 Toshi Kani <toshi.kani@xxxxxx> wrote:
> >
> > > Add ioremap_pud_enabled() and ioremap_pmd_enabled(), which
> > > return 1 when I/O mappings of pud/pmd are enabled on the kernel.
> > >
> > > ioremap_huge_init() calls arch_ioremap_pud_supported() and
> > > arch_ioremap_pmd_supported() to initialize the capabilities.
> > >
> > > A new kernel option "nohgiomap" is also added, so that user can
> > > disable the huge I/O map capabilities if necessary.
> >
> > Why? What's the problem with leaving it enabled?
>
> No, there should not be any problem with leaving it enabled. This
> option is added as a way to workaround a problem when someone hit an
> issue unexpectedly.

The Intel SDM describes "large page size considerations", quoted at the
bottom of this email (thanks to Robert Elliott for this info). Two
cases are mentioned:

1) When a large page is mapped to a region where MTRRs have multiple
different memory types, the processor can behave in an undefined manner.
2) When a large page is mapped to the first 1MB and conflicts with the
fixed MTRRs, the processor maps the range with multiple 4KB pages.

Case 2) is not an issue here: ioremap() does not remap the ISA space in
the first 1MB, and the processor's "special" support keeps the behavior
correct in that range anyway.

For case 1), MTRR is a legacy feature and a driver calling ioremap() for
a large range covered by multiple MTRRs with two different types sounds
very unlikely to me, but it is theoretically possible. (Note, /dev/mem
uses remap_pfn_range(), not ioremap().)
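
To make case 1) concrete, here is a minimal user-space sketch (not
kernel code) of the check that would be needed. The names mem_type,
mtrr_range and range_has_mixed_mtrr_types() are hypothetical, and the
sketch assumes the MTRR ranges do not overlap each other and that any
gap takes the default type. Real MTRR precedence rules are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory-type codes, for illustration only. */
enum mem_type { MT_UC, MT_WC, MT_WT, MT_WP, MT_WB };

/* Hypothetical description of one MTRR range: [base, base + size). */
struct mtrr_range {
	uint64_t base;
	uint64_t size;
	enum mem_type type;
};

/*
 * Return true if [start, start + len) contains more than one memory
 * type.  Assumes the entries in r[] do not overlap each other; any
 * part of the range not covered by an entry takes def_type.
 */
bool range_has_mixed_mtrr_types(const struct mtrr_range *r, size_t n,
				enum mem_type def_type,
				uint64_t start, uint64_t len)
{
	uint64_t end = start + len;
	uint64_t covered = 0;
	unsigned int seen = 0;	/* bitmask of memory types found */

	assert(r != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		uint64_t r_end = r[i].base + r[i].size;
		uint64_t lo = r[i].base > start ? r[i].base : start;
		uint64_t hi = r_end < end ? r_end : end;

		if (lo >= hi)
			continue;	/* no overlap with this entry */
		seen |= 1u << r[i].type;
		covered += hi - lo;
	}
	if (covered < len)
		seen |= 1u << def_type;	/* gaps take the default type */

	/* More than one bit set means more than one memory type. */
	return (seen & (seen - 1)) != 0;
}
```

With a single 1MB UC entry at 2GB and a WB default type, a 2MB range
starting at 2GB is reported as mixed, while the 1MB range covered by
the entry alone is not.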

Here are three options I can think of for case 1):

A) ioremap() changes the requested type to UC in case 1)
B) ioremap() forces 4KB mappings in case 1)
C) ioremap() has no special handling for case 1)

In option A), pat_x_mtrr_type(), called from reserve_memtype(), already
has special handling to convert a WB request to UC-. This handling
needs to be changed to convert all request types to UC (not UC-) in
case 1). reserve_memtype() is shared by other interfaces, so it needs
an additional argument that tells whether the caller supports large page
mappings, since this conversion is only needed for large pages.
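
As a rough illustration of the decision in option A), here is a
stand-alone sketch. It is not the pat_x_mtrr_type() code. The names
req_type and resolve_req_type() and both boolean parameters are
hypothetical, and the existing WB to UC- conversion is left out on
purpose.

```c
#include <stdbool.h>

/* Hypothetical request types, for illustration only. */
enum req_type { REQ_UC, REQ_UC_MINUS, REQ_WC, REQ_WB };

/*
 * Option A) sketch: when the range hits case 1) (mixed MTRR types) and
 * the caller may use large pages, every request type becomes UC.
 * Otherwise the request is returned unchanged.
 */
enum req_type resolve_req_type(enum req_type req, bool mixed_mtrr_types,
			       bool large_page_caller)
{
	if (mixed_mtrr_types && large_page_caller)
		return REQ_UC;
	return req;
}
```

The large_page_caller parameter plays the role of the additional
argument: callers that never use large pages keep their requested type
even when the MTRR types are mixed.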

In option B), reserve_memtype() returns 1 to tell the caller that 4KB
mappings must be used in case 1). All callers need to handle this new
return value properly. ioremap_page_range() is then extended with an
additional flag that forces 4KB mappings.
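
The effect of such a flag can be sketched in user space as below. This
is not the ioremap_page_range() implementation. pick_map_size(),
count_mappings() and the force_4k parameter are hypothetical names, and
the sketch assumes 4KB, 2MB and 1GB mapping sizes with 4KB-aligned
addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_4K (1ULL << 12)
#define SZ_2M (1ULL << 21)
#define SZ_1G (1ULL << 30)

/*
 * Pick the largest mapping size usable at this position.  A large
 * size needs enough remaining length and both addresses aligned to
 * that size.  With force_4k set, only 4KB is ever chosen.
 */
uint64_t pick_map_size(uint64_t phys, uint64_t virt, uint64_t remaining,
		       bool force_4k)
{
	if (!force_4k) {
		if (remaining >= SZ_1G && !((phys | virt) & (SZ_1G - 1)))
			return SZ_1G;
		if (remaining >= SZ_2M && !((phys | virt) & (SZ_2M - 1)))
			return SZ_2M;
	}
	return SZ_4K;
}

/* Count the mapping entries needed to cover len bytes. */
uint64_t count_mappings(uint64_t phys, uint64_t virt, uint64_t len,
			bool force_4k)
{
	uint64_t n = 0;

	/* Both addresses must be 4KB aligned. */
	assert(((phys | virt) & (SZ_4K - 1)) == 0);

	/* Round the length up to a whole number of 4KB pages. */
	len = (len + SZ_4K - 1) & ~(SZ_4K - 1);

	while (len) {
		uint64_t sz = pick_map_size(phys, virt, len, force_4k);

		phys += sz;
		virt += sz;
		len -= sz;
		n++;
	}
	return n;
}
```

For an aligned 4MB range this gives 2 entries normally and 1024 entries
with force_4k set, which is the cost of option B) for ranges that hit
case 1).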

In option C), we only document this potential issue and add no special
handling for case 1), at least until we know this case really exists in
the real world.

Case 1) is handled best by B), then A), then C), but the complexity and
risk of the changes grow in the same order. I am willing to make the
necessary changes (A or B), but I also think we may be better off with
C) since MTRRs are legacy.

Do you think we need to protect the ioremap callers from case 1)? Any
thoughts/suggestions would be much appreciated.

Thanks,
-Toshi

=====
11.11.9 Large Page Size Considerations

The MTRRs provide memory typing for a limited number of regions that
have a 4 KByte granularity (the same granularity as 4-KByte pages). The
memory type for a given page is cached in the processor's TLBs. When
using large pages (2 MBytes, 4 MBytes, or 1 GBytes), a single page-table
entry covers multiple 4-KByte granules, each with a single memory type.
Because the memory type for a large page is cached in the TLB, the
processor can behave in an undefined manner if a large page is mapped to
a region of memory that MTRRs have mapped with multiple memory types.

Undefined behavior can be avoided by insuring that all MTRR memory-type
ranges within a large page are of the same type. If a large page maps to
a region of memory containing different MTRR-defined memory types, the
PCD and PWT flags in the page-table entry should be set for the most
conservative memory type for that range. For example, a large page used
for memory mapped I/O and regular memory is mapped as UC memory.
Alternatively, the operating system can map the region using multiple
4-KByte pages each with its own memory type.

The requirement that all 4-KByte ranges in a large page are of the same
memory type implies that large pages with different memory types may
suffer a performance penalty, since they must be marked with the lowest
common denominator memory type. The same consideration apply to 1 GByte
pages, each of which may consist of multiple 2-Mbyte ranges.

The Pentium 4, Intel Xeon, and P6 family processors provide special
support for the physical memory range from 0 to 4 MBytes, which is
potentially mapped by both the fixed and variable MTRRs. This support is
invoked when a Pentium 4, Intel Xeon, or P6 family processor detects a
large page overlapping the first 1 MByte of this memory range with a
memory type that conflicts with the fixed MTRRs. Here, the processor
maps the memory range as multiple 4-KByte pages within the TLB. This
operation insures correct behavior at the cost of performance. To avoid
this performance penalty, operating-system software should reserve the
large page option for regions of memory at addresses greater than or
equal to 4 MBytes.

