Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration

From: Marcelo Tosatti
Date: Mon May 15 2017 - 15:16:16 EST


On Fri, May 12, 2017 at 11:57:15AM -0500, Christoph Lameter wrote:
> On Fri, 12 May 2017, Marcelo Tosatti wrote:
>
> > > What exactly is the issue you are seeing and want to address? I think we
> > > have similar aims and as far as I know the current situation is already
> > > good enough for what you may need. You may just not be aware of how to
> > > configure this.
> >
> > I want to disable the vmstat worker thread completely on an isolated CPU,
> > because it adds overhead to a latency figure that we want to be as low
> > as possible.
>
> NOHZ already does that. I wanted to know what the problem is that you
> see. The latency issue has already been solved as far as I can tell.
> Please tell me why the existing solutions are not sufficient for you.

We don't want the vmstat worker to execute on a given CPU at all, even if
that CPU updates vm statistics.

Because:

The vmstat worker increases the latency of the application. I can
measure, if you want, how many nanoseconds the following sequence takes
on a given CPU:

    schedule_out(qemu-kvm-vcpu)
    schedule_in(kworker_thread)
    execute function to drain local vmstat counters to global counters
    schedule_out(kworker_thread)
    schedule_in(qemu-kvm-vcpu)
    x86 instruction to enter guest.
(*)

But even without numbers you can see right away that the sequence above
is undesirable. (A rough way to measure it is sketched below.)
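For instance, pinning a busy loop to the isolated CPU and recording the
largest gap between two consecutive CLOCK_MONOTONIC reads already exposes
the interruption. This is only a rough sketch (cyclictest or ftrace would
be the proper tools), and CPU 3 below is just a stand-in for an isolated
CPU:

/* Rough sketch: detect interruptions of a pinned, busy-looping thread
 * by watching for gaps between consecutive CLOCK_MONOTONIC reads. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
	cpu_set_t set;
	uint64_t prev, cur, max_gap = 0;

	CPU_ZERO(&set);
	CPU_SET(3, &set);			/* example isolated CPU */
	sched_setaffinity(0, sizeof(set), &set);

	prev = now_ns();
	for (;;) {
		cur = now_ns();
		if (cur - prev > max_gap) {
			max_gap = cur - prev;
			printf("new max gap: %llu ns\n",
			       (unsigned long long)max_gap);
		}
		prev = cur;
	}
}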

Why the existing solutions are not sufficient:

1) The task-isolation patchset seems too heavy for our use case (we do
want IPIs, signals, etc.).

2) With upstream linux-2.6.git, if DPDK running inside a guest happens
to trigger any vmstat update (say, for example, due to migration), we want
the statistics transferred directly at the point where they are generated,
rather than via the sequence (*) above.
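What we want instead is roughly the following (the names are purely
illustrative, this is not the actual mm/vmstat.c code): with the per-cpu
threshold configured to 0 on an isolated CPU, the delta is folded into the
global counter at the update site itself, so there is never anything left
over for a deferred worker to drain:

/* Illustrative sketch only; names and layout are not the real
 * mm/vmstat.c code.  The point is where the flush happens. */
struct counter {
	long global;      /* stands in for the global vm counter  */
	long pcp_delta;   /* stands in for this CPU's local delta */
	int  threshold;   /* configured to 0 on an isolated CPU   */
};

static void stat_update(struct counter *c, long delta)
{
	c->pcp_delta += delta;

	/* threshold == 0: any non-zero delta is flushed inline, right
	 * at the update site, so no vmstat worker needs to run later
	 * on this CPU to drain it. */
	if (c->pcp_delta > c->threshold || c->pcp_delta < -c->threshold) {
		c->global += c->pcp_delta;
		c->pcp_delta = 0;
	}
}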

> > > I doubt that doing inline updates will do much good compared to what we
> > > already have and what the dataplane mode can do.
> >
> > Can the dataplane mode disable the vmstat worker thread completely on a
> > given CPU?
>
> That already occurs when you call quiet_vmstat() and is used by the NOHZ
> logic. Configure that correctly and you should be fine.

quiet_vmstat() is not called by anyone in today's upstream code. Are you
talking about the task-isolation patches?

Those seem a little heavy to me, for example:

1)
"Each time through the loop of TIF work to do, if TIF_TASK_ISOLATION
is set, we call the new task_isolation_enter() routine. This
takes any actions that might avoid a future interrupt to the core,
such as a worker thread being scheduled that could be quiesced now
(e.g. the vmstat worker) or a future IPI to the core to clean up some
state that could be cleaned up now (e.g. the mm lru per-cpu cache).
In addition, it requests rescheduling if the scheduler dyntick is
still running."

For example, what about the IPIs sent to every CPU when kernel text is
patched, e.g. to enable tracing:

static void do_sync_core(void *data);    /* executes sync_core() */

on_each_cpu(do_sync_core, NULL, 1);

Does that mean you can't enable tracing while this feature is in use?

"Prior to returning to userspace,
isolated tasks will arrange that no future kernel
activity will interrupt the task while the task is running
in userspace. By default, attempting to re-enter the kernel
while in this mode will cause the task to be terminated
with a signal; you must explicitly use prctl() to disable
task isolation before resuming normal use of the kernel."

2)

A qemu-kvm vcpu thread, a process running on the host system, executes
guest code through:

ioctl(KVM_RUN) --> vcpu_enter_guest --> x86 instruction to execute guest code

So the "isolation period, where the task does not want to be interrupted",
contains kernel code.
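To make that concrete, here is a stripped-down sketch of the vcpu run loop
on the userspace side (VM/vcpu setup, memory regions and error handling
omitted; see Documentation/virtual/kvm/api.txt for the full API). The time
spent running guest code is time spent inside the ioctl(KVM_RUN) call,
i.e. inside the kernel from the host's point of view:

/* Minimal sketch of the vcpu run loop; setup and error handling omitted. */
#include <linux/kvm.h>
#include <sys/ioctl.h>

static void run_vcpu(int vcpu_fd, struct kvm_run *run)
{
	for (;;) {
		ioctl(vcpu_fd, KVM_RUN, 0);	/* enter the guest */

		switch (run->exit_reason) {
		case KVM_EXIT_HLT:
			return;
		default:
			/* handle MMIO/PIO etc., then re-enter the guest */
			break;
		}
	}
}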

3) Before using any operating system service through a syscall, the
application has to clear the TIF_TASK_ISOLATION flag, then do the syscall,
and set the flag again when returning to userspace.
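If I read the patchset correctly, that means bracketing every syscall with
something like the following (PR_SET_TASK_ISOLATION and
PR_TASK_ISOLATION_ENABLE come from the task-isolation patches, not from
upstream headers, so take the exact names as illustrative):

/* Sketch based on my reading of the task-isolation patchset; the prctl
 * constants below are provided by those patches, not by upstream. */
#include <sys/prctl.h>
#include <unistd.h>

static void syscall_from_isolated_task(void)
{
	prctl(PR_SET_TASK_ISOLATION, 0);	/* drop isolation      */
	write(1, "hello\n", 6);			/* any kernel service  */
	prctl(PR_SET_TASK_ISOLATION,
	      PR_TASK_ISOLATION_ENABLE);	/* re-arm isolation    */
}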

Now, what guarantees do you provide about a low number of interruptions
while this task is in kernel mode?

4)

"We also support a new "task_isolation_debug" flag which forces
the console stack to be dumped out regardless. We try to catch the
original source of the interrupt, e.g. if an IPI is dispatched to a
task-isolation task, we dump the backtrace of the remote core that is
sending the IPI, rather than just dumping out a trace showing the core
received an IPI from somewhere."

KVM uses IPIs to, for example, inject virtual interrupts and to update the
guest clock under certain conditions (for example, after VM migration).

So this seems a little heavy for our use case.