Re: [PATCH 2/4 v0.5] sched/umcg: RFC: add userspace atomic helpers

From: Andy Lutomirski
Date: Tue Sep 14 2021 - 12:52:39 EST

On Thu, Sep 9, 2021, at 2:20 PM, Jann Horn wrote:
> On Thu, Sep 9, 2021 at 9:07 PM Peter Oskolkov <posk@xxxxxxxxxx> wrote:
> > On Wed, Sep 8, 2021 at 4:39 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
> >
> > Thanks a lot for the reviews, Jann!
> >
> > I understand how to address most of your comments. However, one issue
> > I'm not sure what to do about:
> >
> > [...]
> >
> > > If this function is not allowed to sleep, as the comment says...
> >
> > [...]
> >
> > > ... then I'm pretty sure you can't call fix_pagefault() here, which
> > > acquires the mmap semaphore (which may involve sleeping) and then goes
> > > through the pagefault handling path (which can also sleep for various
> > > reasons, like allocating memory for pagetables, loading pages from
> > > disk / NFS / FUSE, and so on).
> >
> > <quote from peterz@ from
> > https://lore.kernel.org/lkml/20210609125435.GA68187@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/>:
> > So a PF_UMCG_WORKER would be added to sched_submit_work()'s PF_*_WORKER
> > path to capture these tasks blocking. The umcg_sleeping() hook added
> > there would:
> >
> > put_user(BLOCKED, umcg_task->umcg_status);
> > ...
> > </quote>
> >
> > Which is basically what I am doing here: in sched_submit_work() I need
> > to read/write to userspace; and we cannot sleep in
> > sched_submit_work(), I believe.
> >
> > If you are right that it is impossible to deal with pagefaults from
> > within non-sleepable contexts, I see two options:
> >
> > Option 1: as you suggest, pin pages holding struct umcg_task in sys_umcg_ctl;
>
> FWIW, there is a variant on this that might also be an option:
>
> You can create a new memory mapping from kernel code and stuff pages
> into it that were originally allocated as normal kernel pages. This is
> done in a bunch of places, e.g.:

With a custom mapping, you don’t need to pin pages at all, I think. As long as you can reconstruct the contents of the shared page and you’re willing to do some slightly careful synchronization, you can detect that the page is missing when you try to update it and skip the update. The vm_ops->fault handler can repopulate the page the next time it’s accessed.
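
To make the "skip the update, repopulate on fault" idea concrete, here is a minimal userspace model of the logic. It is not kernel code: struct shared_state, struct task_mapping and the three helpers are hypothetical names, malloc() stands in for page allocation, and the "slightly careful synchronization" is left out entirely (the model is single-threaded).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-task state shared with userspace. */
struct shared_state {
	unsigned int status;
};

/* Kernel-side bookkeeping: an authoritative copy plus the optional page. */
struct task_mapping {
	struct shared_state truth;	/* always valid, never paged out */
	struct shared_state *page;	/* NULL while the shared page is absent */
};

/*
 * Non-sleeping update path: always record the authoritative value, but
 * touch the shared page only if it is currently present.
 * Returns true if the page was updated, false if the update was skipped.
 */
static bool update_status(struct task_mapping *m, unsigned int status)
{
	m->truth.status = status;
	if (!m->page)
		return false;	/* page missing: the fault path fixes it up later */
	m->page->status = status;
	return true;
}

/*
 * Models what a vm_ops->fault handler would do, in a context that may
 * sleep: allocate a page and rebuild its contents from the authoritative copy.
 */
static int handle_fault(struct task_mapping *m)
{
	if (m->page)
		return 0;
	m->page = malloc(sizeof(*m->page));
	if (!m->page)
		return -1;
	*m->page = m->truth;
	return 0;
}

/* Models reclaim dropping the page; nothing is lost because truth remains. */
static void drop_page(struct task_mapping *m)
{
	free(m->page);
	m->page = NULL;
}
```

The point of the split is that update_status() never allocates or waits, so it is the only part that would have to run from a non-sleepable context.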

All that being said, I feel like I’m missing something. The point of this is to send what the old M:N folks called “scheduler activations”, right? Wouldn’t it be better to explicitly wake something blockable/pollable and write the message into a more efficient data structure? Polling one page per task from userspace seems like it will have inherently high latency, bounded below by the polling interval, and very poor locality.
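
For illustration only, here is roughly the shape I have in mind, sketched in userspace: one shared event queue plus one pollable fd, instead of one polled page per task. Everything here is an assumption made for the sketch, not something from the patch set: the choice of eventfd, struct event, struct event_ring, RING_SLOTS and the ring_*() helpers are all made up, and the ring is single-threaded (a real one would need atomics and would have the kernel as the producer).

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define RING_SLOTS 64	/* hypothetical capacity; must be a power of two */

/* Hypothetical message: which task changed state, and to what. */
struct event {
	uint32_t task_id;
	uint32_t status;
};

struct event_ring {
	struct event slots[RING_SLOTS];
	unsigned int head;	/* next slot to write */
	unsigned int tail;	/* next slot to read */
	int efd;		/* becomes readable when events are queued */
};

static int ring_init(struct event_ring *r)
{
	r->head = r->tail = 0;
	r->efd = eventfd(0, EFD_NONBLOCK);
	return r->efd < 0 ? -1 : 0;
}

/* Producer: queue one event, then wake whoever is blocked on efd. */
static int ring_post(struct event_ring *r, uint32_t task_id, uint32_t status)
{
	uint64_t one = 1;

	if (r->head - r->tail == RING_SLOTS)
		return -1;	/* ring full */
	r->slots[r->head % RING_SLOTS] = (struct event){ task_id, status };
	r->head++;
	return write(r->efd, &one, sizeof(one)) == sizeof(one) ? 0 : -1;
}

/* Consumer: block for up to timeout_ms until something is queued. */
static int ring_wait(struct event_ring *r, int timeout_ms)
{
	struct pollfd pfd = { .fd = r->efd, .events = POLLIN };

	return poll(&pfd, 1, timeout_ms);
}

/*
 * Consumer: clear the eventfd counter and drain every queued event into
 * out[], which must hold RING_SLOTS entries. Returns the number drained.
 */
static int ring_drain(struct event_ring *r, struct event *out)
{
	uint64_t count;
	int n = 0;

	if (read(r->efd, &count, sizeof(count)) != sizeof(count))
		return 0;	/* nothing pending (EAGAIN) */
	while (r->tail != r->head)
		out[n++] = r->slots[r->tail++ % RING_SLOTS];
	return n;
}
```

A scheduler thread would sit in ring_wait() and call ring_drain() on wakeup, so all state changes land in one contiguous array and nobody has to scan a page per task.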

>
>
> Note that what I'm suggesting here is a bit unusual - normally only
> the vDSO is a "special mapping", other APIs tend to use mappings that
> are backed by files. But I think we probably don't want to have a file
> involved here...
>

A file would be weird — the lifetime and SCM_RIGHTS interactions may be unpleasant.

> If you decide to go this route, you should probably CC
> linux-mm@xxxxxxxxx (for general memory management) and Andy Lutomirski
> (who has tinkered around in vDSO-related code a lot).
>

Who’s that? :)

> > or
> >
> > Option 2: add more umcg-related kernel state to task_struct so that
> > reading/writing to userspace is not necessary in sched_submit_work().
> >
> > The first option sounds much better from the code simplicity point of
> > view, but I'm not sure if it is a viable approach, i.e. I'm afraid
> > we'll get a hard NACK here, as a non-privileged process will be able
> > to force the kernel to pin a page per task/thread.
>
> To clarify: It's entirely normal that userspace processes can force
> the kernel to hold on to some amounts of memory that can't be paged
> out - consider e.g. pagetables and kernel objects referenced by file
> descriptors. So an API that pins limited amounts of memory that are
> also mapped in userspace isn't inherently special. But pinning pages
> that were originally allocated as normal userspace memory can be more
> problematic because that memory might be hugepages, or file pages, or
> it might prevent khugepaged from being able to defragment memory
> because the pinned page was allocated in ZONE_MOVABLE.
>
>
> > We may get around
> > it by first pinning a limited number of pages, then having the
> > userspace allocate structs umcg_task on those pages, so that a pinned
> > page would cover more than a single task/thread. And have a sysctl
> > that limits the number of pinned pages per MM.
>
> I think that you wouldn't necessarily need a sysctl for that if the
> kernel can enforce that you don't have more pages allocated than you
> need for the maximum number of threads that have ever been running
> under the process, and you also use __GFP_ACCOUNT so that cgroups can
> correctly attribute the memory usage.
>
> > Peter Z., could you, please, comment here? Do you think pinning pages
> > to hold structs umcg_task is acceptable?
>