Re: [patch 1/2] x86_64 page fault NMI-safe

From: Frederic Weisbecker
Date: Fri Jul 16 2010 - 06:47:35 EST


On Thu, Jul 15, 2010 at 10:46:13AM -0400, Steven Rostedt wrote:
> On Thu, 2010-07-15 at 16:11 +0200, Frederic Weisbecker wrote:
>
> > > - make sure that you only ever use _one_ single top-level entry for
> > > all vmalloc issues, and can make sure that all processes are created
> > > with that static entry filled in. This is optimal, but it just doesn't
> > > work on all architectures (eg on 32-bit x86, it would limit the
> > > vmalloc space to 4MB in non-PAE, whatever)
> >
> >
> > But then, even if you ensure that, don't we also need to fill the
> > lower-level entries for the new mapping?
>
> If I understand your question, you do not need to worry about the lower
> level entries because all the processes will share the same top level.
>
> process 1's PGD ------,
>                       |
>                       +------> PMD --> ...
>                       |
> process 2's PGD ------'
> 
> Thus we have one page table entry shared by all processes. The issue
> happens when the vmalloc space crosses the PMD boundary and we need to
> update the PGDs of all processes to point to the new PMD we need to add
> to handle the spread of the vmalloc space.




Oh right. We point to that shared PMD, and the update itself is made in the
lower-level entries the PMD points to. Indeed.
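
Just to be sure I picture the structure right, here is a toy userspace
illustration (nothing kernel-specific, all names made up): two top-level
tables whose slot for the shared region points to one common lower-level
table, so a single write into the shared table is visible through both
top levels:

/*
 * Toy model of the sharing: two "processes" each have their own
 * top-level table, but one slot of each points to the same lower-level
 * table.  Adding a mapping writes only into the shared table; neither
 * top level needs to be touched.
 */
#include <stdio.h>

#define ENTRIES 4

struct lower_table {                    /* stands in for the shared PMD */
        unsigned long entry[ENTRIES];
};

int main(void)
{
        static struct lower_table shared_pmd;

        /* Each "process" has its own top level, but slot 3 of both
         * points to the same lower-level table. */
        struct lower_table *proc1_pgd[ENTRIES] = { [3] = &shared_pmd };
        struct lower_table *proc2_pgd[ENTRIES] = { [3] = &shared_pmd };

        /* "vmalloc adds a mapping": a single write into the shared table. */
        shared_pmd.entry[0] = 0x1234;

        /* Both processes observe it through their own top level. */
        printf("proc1 sees %#lx, proc2 sees %#lx\n",
               proc1_pgd[3]->entry[0], proc2_pgd[3]->entry[0]);
        return 0;
}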



>
> >
> > Also, why is this a worry for vmalloc but not for kmalloc? Don't we also
> > risk adding a new memory mapping for memory newly allocated with kmalloc?
>
> Because all of memory (well, 800-some megs on 32-bit) is mapped into
> the address space of all processes. That is, kmalloc only uses this
> memory (as does get_free_page()). All processes have a PMD (or PUD,
> whatever) that maps this memory. The issue only arises when we use new
> virtual addresses, which vmalloc does. Vmalloc may map physical memory
> that is already mapped into all processes, but the address that vmalloc
> uses to access that memory is not yet mapped.



Ok I see.
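
To make that concrete for myself, I imagine a throwaway debug module
along these lines (purely illustrative, not part of the patch) would
show the difference: the kmalloc pointer lands in the direct mapping
that every task's page tables already cover, while the vmalloc pointer
lands in the separate vmalloc area that needs fresh page-table entries:

/*
 * Throwaway illustration: kmalloc() returns an address in the direct
 * mapping of physical RAM (virt_to_phys() is plain arithmetic there),
 * while vmalloc() returns an address in the vmalloc area, backed by
 * newly created page-table entries (virt_to_phys() is not even valid
 * on it).
 */
#include <linux/module.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/io.h>

static void *kbuf, *vbuf;

static int __init where_init(void)
{
        kbuf = kmalloc(PAGE_SIZE, GFP_KERNEL);
        vbuf = vmalloc(PAGE_SIZE);
        if (!kbuf || !vbuf) {
                kfree(kbuf);            /* both helpers tolerate NULL */
                vfree(vbuf);
                return -ENOMEM;
        }

        /* Direct-map address: already mapped in every process. */
        pr_info("kmalloc: virt %p phys %llx\n",
                kbuf, (unsigned long long)virt_to_phys(kbuf));
        /* Vmalloc-area address: the mapping had to be created. */
        pr_info("vmalloc: virt %p (in the VMALLOC_START..VMALLOC_END range)\n",
                vbuf);
        return 0;
}

static void __exit where_exit(void)
{
        kfree(kbuf);
        vfree(vbuf);
}

module_init(where_init);
module_exit(where_exit);
MODULE_LICENSE("GPL");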




>
> The usual reason the kernel uses vmalloc is to get a contiguous range of
> memory. Vmalloc can map several pages as one contiguous piece of memory
> that in reality is several different pages scattered around physical
> memory. kmalloc can only hand out pages that are contiguous in physical
> memory. That is, if kmalloc gets 8192 bytes on an arch with 4096-byte
> pages, it will allocate two consecutive pages in physical memory. If two
> contiguous pages are not available, even if thousands of single pages
> are, the kmalloc will fail, whereas the vmalloc will not.
>
> An allocation with vmalloc can use two different pages and just update
> the page tables to make them appear contiguous from the kernel's point
> of view. Note, this comes at a cost. One is that when we do this, we
> need to update a bunch of page tables. The other is that we waste TLB
> entries to point to these separate pages. Kmalloc and get_free_page()
> use the big memory mappings. That is, if the TLB allows us to map large
> pages, we can do that for kernel memory, since we just want the memory
> contiguous as it already is in physical memory.
>
> Thus the kernel maps physical memory with as few TLB entries as needed
> (large pages and large TLB entries). If we can map 64K pages, we do
> that. Then kmalloc just allocates within this range; it does not need
> to map any pages, they are already mapped.
>
> Does this make a bit more sense?



Totally! You've made it very clear to me.
Moreover, I did not know we could have such variable page sizes. I mean, I
thought page sizes could vary, but that the same size would apply to every
page.
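
And if I read you correctly, the trick vmalloc does internally could be
spelled out roughly with alloc_page() + vmap(), something like the
sketch below (just an illustration, error handling kept minimal):

/*
 * Sketch of "stitch scattered pages together virtually": the two
 * physical pages can come from anywhere, and only the page-table
 * entries vmap() creates in the vmalloc area make them look like one
 * contiguous 8KB buffer to the kernel.
 */
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

static void *map_two_scattered_pages(struct page *pages[2])
{
        void *virt;

        pages[0] = alloc_page(GFP_KERNEL);
        pages[1] = alloc_page(GFP_KERNEL);
        if (!pages[0] || !pages[1])
                goto free_pages;

        /* Build new page-table entries in the vmalloc range so the two
         * scattered pages appear virtually contiguous. */
        virt = vmap(pages, 2, VM_MAP, PAGE_KERNEL);
        if (!virt)
                goto free_pages;
        return virt;

free_pages:
        if (pages[0])
                __free_page(pages[0]);
        if (pages[1])
                __free_page(pages[1]);
        return NULL;
}

(vunmap() on the returned address plus __free_page() on both pages
would undo it.)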





>
> >
> >
> >
> > > - at vmalloc time, when adding a new page directory entry, walk all
> > > the tens of thousands of existing page tables under a lock that
> > > guarantees that we don't add any new ones (ie it will lock out fork())
> > > and add the required pgd entry to them.
> > >
> > > - or just take the fault and do the "fill the page tables" on demand.
> > >
> > > Quite frankly, most of the time it's probably better to make that last
> > > choice (unless your hardware makes it easy to make the first choice,
> > > which is obviously simplest for everybody). It makes it _much_ cheaper
> > > to do vmalloc. It also avoids that nasty latency issue. And it's just
> > > simpler too, and has no interesting locking issues with how/when you
> > > expose the page tables in fork() etc.
> > >
> > > So the only downside is that you do end up taking a fault in the
> > > (rare) case where you have a newly created task that didn't get an
> > > even newer vmalloc entry.
> >
> >
> > But then how did the previous tasks get this new mapping? You said
> > we don't walk through every process's page tables for vmalloc.
>
> Actually we don't even need to walk the page tables in the first task
> (although we might do that). When the kernel accesses that memory, we
> take a page fault; the page fault handler will see that this memory is
> vmalloc data and fill in the page tables for the task at that time.



Right.
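
So if I try to sketch that "fill the page tables on demand" step, it
would look something like the fragment below (heavily simplified
compared to the real x86 vmalloc_fault(), which also walks and checks
the lower levels):

/*
 * Rough sketch of the lazy fill-in: on a kernel fault at an address in
 * the vmalloc range, copy the missing top-level entry from the
 * reference page table (init_mm's) into the faulting task's page
 * table, then let the access be retried.
 */
#include <linux/mm.h>
#include <linux/sched.h>
#include <asm/pgtable.h>

static int sketch_vmalloc_fault(unsigned long address)
{
        pgd_t *pgd, *pgd_ref;

        if (address < VMALLOC_START || address >= VMALLOC_END)
                return -1;                      /* not a vmalloc address */

        pgd_ref = pgd_offset_k(address);        /* master kernel page table */
        if (pgd_none(*pgd_ref))
                return -1;                      /* genuinely bad access */

        pgd = pgd_offset(current->active_mm, address);
        if (pgd_none(*pgd))
                set_pgd(pgd, *pgd_ref);         /* share the lower-level tables */

        return 0;                               /* filled, retry the access */
}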




> >
> > I would understand this race if we were to walk every process's page
> > tables and add the new mapping to them, but missed a new task that had
> > just forked, because we didn't lock (or used just RCU).
> >
> >
> >
> > > And that fault can sometimes be in an
> > > interrupt or an NMI. Normally it's trivial to handle that fairly
> > > simple nested fault. But NMI has that inconvenient "iret unblocks
> > > NMI's, because there is no dedicated 'nmiret' instruction" problem on
> > > x86.
> >
> >
> > Yeah.
> >
> >
> > So the parts of the problem I don't understand are:
> >
> > - why don't we have this problem with kmalloc()?
>
> I hope I explained that above.



Yeah :)

Thanks a lot for your explanations!
