Re: [PATCH 05/11] mm: Introduce arch_pgd_init_late()

From: Andy Lutomirski
Date: Tue Sep 22 2015 - 14:00:42 EST


On Tue, Sep 22, 2015 at 10:55 AM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Mon, Sep 21, 2015 at 11:23 PM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>> Add a late PGD init callback to places that allocate a new MM
>> with a new PGD: copy_process() and exec().
>>
>> The purpose of this callback is to allow architectures to implement
>> lockless initialization of task PGDs, to remove the scalability
>> limit of pgd_list/pgd_lock.
>
> Do we really need this?
>
> Can't we just initialize the pgd when we allocate it, knowing that
> it's not in sync, but just depend on the vmalloc fault to add in any
> kernel entries that we might have missed?

I really really hate the vmalloc fault thing. It seems to work,
rather to my surprise. It doesn't *deserve* to work, because of
things like the percpu TSS accesses in the entry code that happen
without a valid stack.

For all I know, there's a long history of this hitting on monster
non-SMAP systems that are all buggy and rootable but no one notices
because it's rare. On SMAP with non-malicious userspace, it's an
instant double fault. With malicious userspace, it's rootable
regardless of SMAP, but it's much harder with SMAP.

If we start every mm with a fully zeroed pgd (which is what I think
you're suggesting), then this starts affecting small systems in
addition to monster systems.

I'd really rather go in the other direction and completely eliminate
vmalloc faults. We could do that by eagerly initializing all pgds, or
we could do it by tracking, per-pgd, how up-to-date it is and fixing
it up in switch_mm. The latter is a bit nasty on SMP.
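
As a rough userspace illustration of the second option (a sketch only,
not kernel code): keep a generation counter on the reference kernel
mappings and have each pgd remember the generation it was last synced
at, so the switch path copies only when the two differ. Every name
below (kernel_pgd_ref, mm_pgd, ref_set_entry, pgd_sync_on_switch,
KERNEL_PGD_ENTRIES) is made up for the example, and it deliberately
ignores locking and memory ordering, which is where the SMP nastiness
would live.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical number of kernel-half top-level entries; real layouts differ. */
#define KERNEL_PGD_ENTRIES 8

/* Hypothetical stand-in for the reference kernel page table. */
struct kernel_pgd_ref {
	uint64_t entries[KERNEL_PGD_ENTRIES];
	uint64_t generation;	/* bumped whenever a kernel entry changes */
};

/* Hypothetical per-mm state: a private copy plus the generation it matches. */
struct mm_pgd {
	uint64_t entries[KERNEL_PGD_ENTRIES];
	uint64_t synced_generation;
};

/* Change one reference entry and mark every per-mm copy as stale. */
static void ref_set_entry(struct kernel_pgd_ref *ref, unsigned int idx,
			  uint64_t val)
{
	assert(idx < KERNEL_PGD_ENTRIES);
	ref->entries[idx] = val;
	ref->generation++;
}

/*
 * What a switch_mm-style hook might do: bring the incoming pgd up to date
 * before it is used.  Returns 1 if a copy was needed, 0 if the pgd was
 * already current.  Single-threaded only: no locking, no barriers.
 */
static int pgd_sync_on_switch(struct mm_pgd *mm,
			      const struct kernel_pgd_ref *ref)
{
	if (mm->synced_generation == ref->generation)
		return 0;

	memcpy(mm->entries, ref->entries, sizeof(mm->entries));
	mm->synced_generation = ref->generation;
	return 1;
}
```

With this shape the common case costs one compare per switch, and a
stale pgd is repaired before anything can fault on it. It does nothing
for a pgd that is already live on another CPU when the reference
changes, which is the hard part.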

--Andy