Re: [PATCH v15 04/13] task_isolation: add initial support

From: Chris Metcalf
Date: Fri Sep 30 2016 - 15:33:02 EST


On 8/30/2016 2:43 PM, Andy Lutomirski wrote:
> On Aug 30, 2016 10:02 AM, "Chris Metcalf" <cmetcalf@xxxxxxxxxxxx> wrote:
>> We really want to run task isolation last, so we can guarantee that
>> all the isolation prerequisites are met (dynticks stopped, per-cpu lru
>> cache empty, etc). But achieving that state can require enabling
>> interrupts - most obviously if we have to schedule, e.g. for vmstat
>> clearing or whatnot (see the cond_resched in refresh_cpu_vm_stats), or
>> just while waiting for that last dyntick interrupt to occur. I'm also
>> not sure that even something as simple as draining the per-cpu lru
>> cache can be done holding interrupts disabled throughout - certainly
>> there's a !SMP code path there that just re-enables interrupts
>> unconditionally, which gives me pause.
>>
>> At any rate, at that point you need to retest for signals, resched,
>> etc, all as usual, and then you need to recheck the task isolation
>> prerequisites once more.
>>
>> I may be missing something here, but it's really not obvious to me
>> that there's a way to do this without having task isolation integrated
>> into the usual return-to-userspace loop.
>>
> What if we did it the other way around: set a percpu flag saying
> "going quiescent; disallow new deferred work", then finish all
> existing work and return to userspace. Then, on the next entry, clear
> that flag. With the flag set, vmstat would just flush anything that
> it accumulates immediately, nothing would be added to the LRU list,
> etc.
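
The retest loop described in the quoted text can be sketched as a user-space simulation. The loop has to quiesce with interrupts enabled, and once it has done so it must recheck signals, resched, and the isolation prerequisites. This is a hedged illustration, not kernel code. `struct sim_cpu`, `isolation_ready()`, `quiesce_with_irqs_on()` and `exit_to_user_sim()` are hypothetical names. The booleans and counters only stand in for thread flags, dynticks state, vmstat deltas and the per-cpu LRU cache.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-cpu state; every field is an illustrative stand-in. */
struct sim_cpu {
    bool irqs_enabled;
    bool need_resched;   /* stands in for a pending reschedule request */
    bool sigpending;     /* stands in for a pending signal */
    int  vmstat_dirty;   /* un-synced per-cpu vmstat deltas */
    int  lru_pending;    /* pages sitting in the per-cpu LRU cache */
    bool tick_running;   /* dynticks not yet stopped */
    int  resched_count;  /* how many times we "scheduled" */
};

/* True when nothing is left that would interrupt userspace later. */
static bool isolation_ready(const struct sim_cpu *c)
{
    return c->vmstat_dirty == 0 && c->lru_pending == 0 && !c->tick_running;
}

/* Quiescing may need interrupts on; flushing vmstat here models a
 * cond_resched()-like point that raises a reschedule request. */
static void quiesce_with_irqs_on(struct sim_cpu *c)
{
    c->irqs_enabled = true;
    if (c->vmstat_dirty > 0) {
        c->vmstat_dirty = 0;
        c->need_resched = true;
    }
    c->lru_pending = 0;
    c->tick_running = false;   /* pretend the last tick has fired */
    c->irqs_enabled = false;
}

/* Returns the number of loop passes before it is safe to return. */
static int exit_to_user_sim(struct sim_cpu *c)
{
    int passes = 0;

    c->irqs_enabled = false;
    for (;;) {
        passes++;
        if (c->need_resched) {           /* schedule with irqs on, then recheck all */
            c->irqs_enabled = true;
            c->need_resched = false;
            c->resched_count++;
            c->irqs_enabled = false;
            continue;
        }
        if (c->sigpending) {             /* deliver the signal, then recheck all */
            c->irqs_enabled = true;
            c->sigpending = false;
            c->irqs_enabled = false;
            continue;
        }
        if (!isolation_ready(c)) {       /* irqs were on: new work may be pending */
            quiesce_with_irqs_on(c);
            continue;
        }
        break;
    }
    assert(!c->irqs_enabled && isolation_ready(c));
    return passes;
}
```

Take a CPU that starts with dirty vmstat deltas, a non-empty LRU cache and a running tick. The loop quiesces on the first pass and handles the reschedule that quiescing raised on the second. Only on the third pass does it exit, with interrupts disabled and every prerequisite met.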

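The percpu "going quiescent" flag proposed in the quoted reply might look roughly like the sketch below. It assumes a batching threshold of 32 for vmstat deltas. It also reads "nothing would be added to the LRU list" as "bypass the per-cpu LRU cache and go straight to the shared list", which is an interpretation. All names (`struct sim_global`, `struct sim_percpu`, `sim_mod_vmstat()`, `sim_lru_add()`, `sim_enter_quiescence()`, `sim_kernel_entry()`) are hypothetical.

```c
#include <stdbool.h>

#define SIM_VMSTAT_THRESHOLD 32   /* hypothetical batching threshold */

/* Hypothetical shared state: global counters and the shared LRU list. */
struct sim_global {
    long vmstat;
    int  lru;
};

/* Hypothetical per-cpu state, including the proposed quiescing flag. */
struct sim_percpu {
    bool quiescing;      /* "going quiescent; disallow new deferred work" */
    long vmstat_delta;   /* deferred counter delta, folded in later */
    int  lru_cached;     /* pages parked in the per-cpu LRU cache */
};

static void sim_mod_vmstat(struct sim_global *g, struct sim_percpu *p, long delta)
{
    if (p->quiescing) {
        g->vmstat += delta;          /* flag set: flush immediately */
        return;
    }
    p->vmstat_delta += delta;        /* normal case: batch per cpu */
    if (p->vmstat_delta > SIM_VMSTAT_THRESHOLD ||
        p->vmstat_delta < -SIM_VMSTAT_THRESHOLD) {
        g->vmstat += p->vmstat_delta;
        p->vmstat_delta = 0;
    }
}

static void sim_lru_add(struct sim_global *g, struct sim_percpu *p)
{
    if (p->quiescing)
        g->lru++;                    /* flag set: skip the per-cpu cache */
    else
        p->lru_cached++;             /* normal case: batched, drained later */
}

/* On the way out to userspace: set the flag, then finish existing work. */
static void sim_enter_quiescence(struct sim_global *g, struct sim_percpu *p)
{
    p->quiescing = true;
    g->vmstat += p->vmstat_delta;
    p->vmstat_delta = 0;
    g->lru += p->lru_cached;
    p->lru_cached = 0;
}

/* On the next kernel entry: clear the flag so deferral is allowed again. */
static void sim_kernel_entry(struct sim_percpu *p)
{
    p->quiescing = false;
}
```

Setting the flag before draining means that anything generated during the drain is also handled immediately rather than deferred. Batching resumes only after `sim_kernel_entry()` clears the flag.
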
Thinking about this some more, I was struck by an even simpler way
to approach this. What if we just said that on task isolation cores, no
kernel subsystem should do something that would require a future
interruption? So vmstat would just always sync immediately on task
isolation cores, the mm subsystem wouldn't use per-cpu LRU caching on
task isolation cores, etc. That way we don't have to worry about the
status of those things as we are returning to userspace for a task
isolation process, since it's just always kept "pristine".
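
A rough sketch of this "always pristine" policy is below. The decision depends only on a fixed per-core mask, so nothing is toggled on kernel entry or exit. It is a user-space simulation under stated assumptions: `struct sim_system`, `sim_is_isolation_cpu()`, `sim_mod_vmstat()`, `sim_cpu_pristine()`, the 8-CPU limit and the threshold of 32 are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SIM_NR_CPUS 8             /* hypothetical CPU count */
#define SIM_VMSTAT_THRESHOLD 32   /* hypothetical batching threshold */

/* Hypothetical system state; not a real kernel structure. */
struct sim_system {
    uint64_t isolation_mask;          /* fixed at boot: bit n set => cpu n isolated */
    long vmstat_delta[SIM_NR_CPUS];   /* per-cpu deferred counter deltas */
    long vmstat_global;               /* stands in for the global counters */
};

static bool sim_is_isolation_cpu(const struct sim_system *s, int cpu)
{
    return (s->isolation_mask >> cpu) & 1;
}

/* Isolation cores never defer; all other cores batch as usual. */
static void sim_mod_vmstat(struct sim_system *s, int cpu, long delta)
{
    if (sim_is_isolation_cpu(s, cpu)) {
        s->vmstat_global += delta;    /* always synced: nothing to flush later */
        return;
    }
    s->vmstat_delta[cpu] += delta;
    if (s->vmstat_delta[cpu] > SIM_VMSTAT_THRESHOLD ||
        s->vmstat_delta[cpu] < -SIM_VMSTAT_THRESHOLD) {
        s->vmstat_global += s->vmstat_delta[cpu];
        s->vmstat_delta[cpu] = 0;
    }
}

/* What the return-to-userspace path would check: no deferred state. */
static bool sim_cpu_pristine(const struct sim_system *s, int cpu)
{
    return s->vmstat_delta[cpu] == 0;
}
```

The return path has nothing to drain on an isolation core. The cost is that every counter update there takes the non-batched path, which is the overhead the following paragraph argues nobody on those cores would notice.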

The per-core task-isolation setting is not user-customizable, and the
task-stealing scheduler doesn't even run there, so no arbitrary
processes will land on those cores and be in a position to complain
about the performance overhead of never creating deferred work...

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com