Re: [RFC] [PATCH] Pre-emption control for userspace

From: Khalid Aziz
Date: Thu Mar 06 2014 - 11:10:01 EST


On 03/06/2014 02:57 AM, Peter Zijlstra wrote:
> On Wed, Mar 05, 2014 at 12:58:29PM -0700, Khalid Aziz wrote:
>> Looking at the current problem I am trying to
>> solve with databases and JVM, I run into the same issue I described in my
>> earlier email. Proxy execution is a post-contention solution. By the time
>> proxy execution can do something for my case, I have already paid the price
>> of contention and a context switch which is what I am trying to avoid. For a
>> critical section that is very short compared to the size of execution
>> thread, which is the case I am looking at, avoiding preemption in the middle
>> of that short critical section helps much more than dealing with lock
>> contention later on.

> Like others have already stated; its likely still cheaper than the
> pile-up you get now. It might not be optimally fast, but it sure takes
> out the worst case you have now.
>
>> The goal here is to avoid lock contention and
>> associated cost. I do understand the cost of dealing with lock contention
>> poorly and that can easily be much bigger cost, but I am looking into
>> avoiding even getting there.

> The thing is; unless userspace is an RT program or practises the same
> discipline to such an extent that it makes no practical difference,
> there's always going to be the case where you fail to cover the entire
> critical section, at which point you're back to your pile-up fail.
>
> So while the limited preemption guard helps the best case, it doesn't
> help the worst case at all.

That is true. I am breaking this problem into two parts - (1) avoid the pile up, and (2) if a pile up happens, deal with it efficiently. The worst case scenario you point out is the second part of the problem. Possible solutions for that are the PTHREAD_PRIO_PROTECT protocol for threads that use POSIX threads, or proxy execution. Once a pile up has happened, the cost of a system call to boost thread priority becomes a much smaller part of the overall cost of handling the pile up.
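For reference, a minimal sketch of the PTHREAD_PRIO_PROTECT setup I am referring to. The ceiling value and error handling are illustrative only, and the threads generally need a realtime policy (and the privileges to use one) for the ceiling boost to take effect:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock;

static int init_ceiling_mutex(void)
{
        pthread_mutexattr_t attr;
        int err;

        pthread_mutexattr_init(&attr);

        /* Boost whichever thread holds this mutex to the ceiling priority. */
        err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_PROTECT);
        if (!err)
                err = pthread_mutexattr_setprioceiling(&attr, 50);
        if (!err)
                err = pthread_mutex_init(&lock, &attr);

        pthread_mutexattr_destroy(&attr);
        return err;
}

int main(void)
{
        if (init_ceiling_mutex()) {
                fprintf(stderr, "ceiling mutex init failed\n");
                return 1;
        }

        pthread_mutex_lock(&lock);
        /* short critical section protected by the priority ceiling */
        pthread_mutex_unlock(&lock);
        return 0;
}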

Part (1) of this problem is what my patch attempts to solve. Here the cost of a system call to boost priority, or to do anything else, is too high. The mechanism to avoid the pile up has to be very lightweight to be of any use.
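To illustrate how lightweight that fast path needs to be, here is a hypothetical sketch of the userspace side of such a mechanism. The /proc path, the flag layout and the "delay granted, yield now" handshake below are made up for illustration and are not the actual interface of the patch; the point is only that entering and leaving the critical section is a pair of plain stores to a shared page, with no system call unless a delay was actually granted:

#include <fcntl.h>
#include <sched.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

struct preempt_delay {                  /* hypothetical shared layout */
        uint32_t delay_req;             /* userspace: please defer preemption */
        uint32_t delay_granted;         /* kernel: delay was granted, yield ASAP */
};

static volatile struct preempt_delay *pd;

static int preempt_delay_setup(void)
{
        /* hypothetical per-thread file exported by the kernel */
        int fd = open("/proc/self/task_preempt_delay", O_RDWR);

        if (fd < 0)
                return -1;
        pd = mmap(NULL, sizeof(*pd), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return pd == MAP_FAILED ? -1 : 0;
}

static inline void critical_begin(void)
{
        pd->delay_req = 1;              /* plain store, no syscall */
}

static inline void critical_end(void)
{
        pd->delay_req = 0;
        if (pd->delay_granted) {        /* we ran past our timeslice */
                pd->delay_granted = 0;
                sched_yield();          /* give the CPU back promptly */
        }
}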


> So supposing we went with this now; you (or someone else) will come back
> in a year's time and tell us that if we only just stretch this window a
> little, their favourite workload will also benefit.
>
> Where's the end of that?
>
> And what about CONFIG_HZ; suppose you compile your kernel with HZ=100
> and your 1 extra tick is sufficient. Then someone compiles their kernel
> with HZ=1000 and it all comes apart.



My goal here is to help the cases where the critical section is short and executes quickly, as it should for well designed critical sections in threads that want to run under CFS. I see this as an incremental improvement over the current situation. With CFS, the timeslice is adaptive and depends upon the workload, so it is not directly tied to CONFIG_HZ, but you are right that CONFIG_HZ does have a bearing on this.

A critical section that can easily run over a single timeslice and cause a pile up looks to me like a workload designed to create these problems. Such a workload needs to use SCHED_FIFO or the deadline scheduler with properly designed yield points and priorities, or live with the pile ups caused by using CFS. Trying to help such cases within CFS is not beneficial and will only make CFS more and more complex.

What I am trying to do is help the cases where a short critical section ends up being preempted simply because execution reached the critical section only towards the end of the current timeslice, resulting in an unintended pile up. So give these cases a tool to avoid pile ups, with restrictions on its use (yield the processor as soon as you can if you got amnesty, and pay a penalty if you don't). At this point, the two workloads I know of that fit this group are databases and the JVM, both of which are in significant use.
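For the workloads whose critical sections can span an entire timeslice, here is a minimal sketch of opting a thread into SCHED_FIFO as suggested above. The priority value is illustrative, and the caller needs CAP_SYS_NICE or a suitable RLIMIT_RTPRIO:

#include <sched.h>
#include <stdio.h>

int main(void)
{
        struct sched_param sp = { .sched_priority = 10 };

        /* 0 selects the calling thread */
        if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
                perror("sched_setscheduler");
                return 1;
        }

        /* ... run the workload, yielding at its designed yield points ... */
        sched_yield();
        return 0;
}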

Makes sense?

Thanks,
Khalid