Re: RFC for a new Scheduling policy/class in the Linux-kernel

From: Peter Zijlstra
Date: Thu Jul 16 2009 - 04:58:39 EST


On Wed, 2009-07-15 at 19:16 -0400, Ted Baker wrote:
> > > 1) The priority of a group seemed to be defined by the priority of
> > > the highest-priority thread in the group's run-queue, which means
> > > it varies dynamically according to which threads in the group are
> > > contending.
> > >
> >
> > This is true, but it also ensures that the time allocated to the group
> > is also consumed by the group if it wants to.
>
> I don't see how schedulability analysis can be done with this model,
> since a single budget is being expended at varying priorities/deadlines.
>
> > > 4) On an SMP, more than one thread could be running against
> > > the same budget at the same time, resulting in budget over-charges.
> > >
> >
> > The rt group scheduler does split the budget per cpu. On expiring the
> > budget, it tries to borrow from other CPUs if possible.
>
> First, how is the splitting of the budget between CPU's controlled
> by the application?
>
> Second, I don't see how schedulability analysis could be done if
> CPU's can "borrow" budget from other CPUs, unless there is some
> mechanism in place to "pay it back". How do you do the analysis?

Right, so control groups (cgroups for short) are a form of
virtualization. Each controller is specific to a resource: we have
memory controllers, namespace controllers, and also a scheduler
controller.

If you apply all controllers to the same task groups you get something
like chroot on steroids, or jails, etc. But you can also use them
individually to control resources in creative ways.

In order to manage RT resources you want:

- a minimum bandwidth guarantee
- isolation

So ideally you want a CBS server that schedules your RT (FIFO/RR) tasks.

Now, let me first state that the current code is a hack, and I know it's
nowhere near proper. But it was the best I could come up with on short
notice -- and Fabio is now looking at doing better :-)

Furthermore, the whole feature is still marked EXPERIMENTAL, basically
because I do recognize it for the hack it is -- that said, some people
might find it useful.


So a task can only belong to one group per controller (that is, it can
belong to multiple groups, but only one per controller), and since
there is only one scheduler controller we can, for the purposes of this
discussion, say it belongs to a single group.

These groups get assigned a bandwidth through two per-group parameters,
a period and a runtime -- the documentation states that using different
periods per group is currently broken and would require a deadline server.

Therefore, for the rest of this description, we can assume all periods
are equal and set them to 1s.
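
For illustration only, a minimal userspace sketch of setting those two
parameters. The cpu.rt_period_us and cpu.rt_runtime_us attribute names
follow the RT group-scheduling documentation; the /cgroup mount point
and the group name A are assumptions made up for this example:

/* Sketch: give group A a 1s period and 400ms of runtime per period.
 * Paths assume the cpu controller is mounted at /cgroup.
 */
#include <stdio.h>

static int write_val(const char *path, long long val)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fprintf(f, "%lld\n", val);
        return fclose(f);
}

int main(void)
{
        write_val("/cgroup/A/cpu.rt_period_us",  1000000); /* 1s    */
        write_val("/cgroup/A/cpu.rt_runtime_us",  400000); /* 400ms */
        return 0;
}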


So what does it do? It's a hierarchical FIFO scheduler, with the prio of
a group being the max prio of its children. If a group runs out of quota
it is dequeued; when the period timer comes along and refreshes the
quota it is requeued.
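
Roughly, and with made-up names (a sketch of the idea, not the kernel's
actual data structures or code):

/* Sketch: a group's prio follows the max prio of its runnable children;
 * running out of quota dequeues (throttles) the group, and the period
 * timer refresh requeues it.
 */
struct grp {
        int             nr_children;
        int             child_prio[16]; /* prio of each runnable child */
        long long       runtime;        /* budget left this period, ns */
        long long       quota;          /* budget per period, ns */
        int             throttled;
};

static int grp_prio(struct grp *g)
{
        int i, prio = -1;               /* -1: no runnable children */

        for (i = 0; i < g->nr_children; i++)
                if (g->child_prio[i] > prio)
                        prio = g->child_prio[i];
        return prio;                    /* higher number = higher prio here */
}

static void grp_account(struct grp *g, long long delta)
{
        g->runtime -= delta;
        if (g->runtime <= 0)
                g->throttled = 1;       /* dequeue the whole group */
}

static void grp_refresh(struct grp *g)  /* period timer */
{
        g->runtime = g->quota;
        g->throttled = 0;               /* requeue the group */
}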

     R
    / \
   A   B
  / \  |\
 1   4 C 3
       |
       2

groups in letters, tasks in digits.

Assume the tasks are hogs with descending priority relative to their
numbering (1 is the highest), and that A has 40% bandwidth, B has 30%,
and C has 20%.

Let's first look at UP.

Then, since 1 is the highest priority task, A has the priority of 1,
and we'd select A->1 to run.

This continues for 400ms, after which the whole of A is dequeued. Next
in line is B->C->2, which can run for 200ms before C gets dequeued,
leaving B->3 as the only option, with the remaining 100ms of B's budget
left to run.

Once the refresh timer comes along things get replenished and can run
again.
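
To make that concrete, a toy simulation of the UP timeline above, in
100ms ticks. It is a sketch under the stated assumptions -- 1s period,
hog tasks, and a child's runtime also charging its parent, so C eats
into B:

#include <stdio.h>

/* Toy UP simulation of the example: A has 400ms, B has 300ms,
 * C (nested in B) has 200ms.  Tasks in descending prio:
 * 1 in A, 2 in C, 3 in B, 4 in A.
 */
int main(void)
{
        int a = 400, b = 300, c = 200;  /* remaining budgets, ms */
        int t;

        for (t = 0; t < 1000; t += 100) {
                const char *pick;

                if (a >= 100) {                         /* highest prio: A->1 */
                        pick = "A->1";
                        a -= 100;
                } else if (b >= 100 && c >= 100) {      /* next: B->C->2 */
                        pick = "B->C->2";
                        b -= 100;
                        c -= 100;       /* C's runtime also charges B */
                } else if (b >= 100) {                  /* then: B->3 */
                        pick = "B->3";
                        b -= 100;
                } else {
                        pick = "idle (everything throttled)";
                }
                printf("%4d-%4dms: %s\n", t, t + 100, pick);
        }
        return 0;
}

Running it prints A->1 for the first 400ms, B->C->2 for the next 200ms,
B->3 for 100ms, and idle until the refresh -- the same sequence as
described above.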

SMP

The cgroup interface specifies bandwidth per single cpu; when more cpus
are present in the load-balance domain (see cpusets), the total
bandwidth available to the group is the specified amount multiplied by
the number of available cpus.

The initial distribution is such that each cpu gets equal bandwidth.

Now on 2 cpus, we'd want A->1 and B->C->2 running, since those are the
highest prio tasks on the system.

Since we have 2 cpus the budget for C is 200ms per cpu, 400ms total.

For the first 200ms we'll run 1 on cpu0 and 2 on cpu1. At that point
we'll find that cpu1's budget for C is depleted.

It will then look at other cpus in the load-balance domain (cpu0) and
transfer half of C's unused budget over to itself (with rounding up so
we can indeed leave the other cpus with 0).

This way C->2 can get to run for up to 400ms on cpu1. After that C gets
dequeued and B->3 will take over as the next highest task.
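
That borrowing step, as a sketch (made-up names and a fixed per-cpu
array, not the kernel's actual runtime-balancing code): when a cpu runs
dry it pulls half of every other cpu's unused budget, rounding up so
repeated borrowing can indeed drain them to 0.

#define NR_CPUS 2

static long long per_cpu_runtime[NR_CPUS];      /* unused budget, ns */

/* Called when this_cpu has depleted its share of the group's budget. */
static long long borrow_runtime(int this_cpu)
{
        long long pulled = 0;
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++) {
                long long half;

                if (cpu == this_cpu)
                        continue;
                half = (per_cpu_runtime[cpu] + 1) / 2;  /* round up */
                per_cpu_runtime[cpu] -= half;
                pulled += half;
        }
        per_cpu_runtime[this_cpu] += pulled;
        return pulled;
}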

Now, after 600ms total, B will have depleted its quota and the only
tasks left are A->{1,4}. A will have consumed 600 of its 800ms, and will
now have to spread the remainder over 2 cpus. [ basically cpu0 gets to
transfer less off of cpu1, because 4 will be consuming A's quota there ]

Locks.

Suppose the tasks are not hogs but behave like proper tasks and complete
their work within budget. In this case nobody gets throttled and we all
live happily together.

Now suppose one application, say 3, has a bug and runs out of quota
whilst holding a lock.

Further suppose that 1 contends on that lock. The implemented behaviour
is that we PI-boost 3: the group is temporarily placed back on the
runqueue and we allow 3 to overcommit its budget in order to release
the lock.

This overcommit is accounted, and only once the budget turns positive
again (due to sufficient refreshes) will the group be requeued.
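
A sketch of that bookkeeping (illustrative only, not the kernel's PI or
throttling code): while boosted, the group keeps running and its budget
is allowed to go negative; each refresh adds quota, and only when the
budget is positive again does the group go back on the runqueue.

struct boosted_grp {
        long long runtime;      /* may go negative while boosted */
        long long quota;        /* budget added per period */
        int       boosted;      /* a child holds a lock and is PI-boosted */
        int       queued;
};

static void grp_charge(struct boosted_grp *g, long long delta)
{
        g->runtime -= delta;
        if (g->runtime <= 0 && !g->boosted)
                g->queued = 0;          /* throttle: dequeue the group */
        /* if boosted, keep running and let the budget go negative */
}

static void grp_period_refresh(struct boosted_grp *g)
{
        /* refreshes accumulate, so a large overcommit may need more
         * than one period to pay back */
        g->runtime += g->quota;
        if (g->runtime > 0)
                g->queued = 1;          /* only now requeue the group */
}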


Now, why I did things like this.

Because doing a deadline CBS server
- is non-trivial.
- would mean we'd have to deal with deadline inheritance.
- would significantly complicate the load-balancing.

Is any of that an excuse? No, not really, but it is something useful for
some people, especially in the case of normal usage where you don't hit
the limits -- and in case you do, you only penalize the one who does.

So as a first approximation it seems to work in practice.

I'm very glad Fabio is working on improving things.
