Re: [PATCH] 'select' failure or signal should not update timeout

From: Eric W. Biederman (ebiederm@xmission.com)
Date: Thu Jul 25 2002 - 11:35:28 EST


Linus Torvalds <torvalds@transmeta.com> writes:

> On Wed, 24 Jul 2002, Jamie Lokier wrote:
> >
> > Typical soft real-time code looks a bit like this pseudo-code (excuse
> > the bugs :-):
>
> Yup, looks familiar.
>
> The thing is, we cannot change existing select semantics, and the question
> is whether what most soft-realtime wants is actually select, or whether
> people really want a "waittimeofday()".
>
> Like your example, the only uses I've had personally (DVD playback) have
> really had an empty select, so it wasn't really select itself that was
> horribly important.

Baring minor quibbles waittimeofday is essentially what we have
today. Fixing up the interface to take an absolute time, from an
absolute timer cleans up some races but doesn't attack the fundamental
problem.

There are two fundamental problems with the current interface. The
timer granularity is much to large, and we don't know the granularity
that user space cares about.

The posted wait_for_time implementation had one very interesting
aspect the timer granularity of the kernel (HZ) was known to the
application, and it very deliberately rounded the sleep interval down
based on the kernel timer granularity, so it could busy wait all by
itself for the rest. Problem, user space applications can only get
what they want through busy waiting. But they can tell when they have
gotten what they want because the gettimeofday resolution is much
better than the kernel timer resolution.

There are two states a unix box can be in. cpu load_average < 1. In
this state multiple processes run per scheduler quantum. They run for
short periods of time and then go back to sleep. Latency is very good
because the other processes get out of the way, and sleeping process
can count on running at the next timer tick. cpu load_average > 1.
In this state one or several processes run for their full cpu
quantum. Latency is bad, and opportunistically the timer resolution
does not get better.

When we have a cpu load average < 1, it is trivial to increase the
timer granularity to something resembling the gettimeofday resolution
simply by internally doing gettimeofday when schedule is called, and
adding those processes that have just become runnable to the run
queue. To get the most out of this the idle task would need to busy
wait looking for timer events, when we have an event scheduled before
the next timer tick.

The goal is twofold, to remove the need for user space applications to
busy wait, so sometimes the system can get something done another
process is waiting, and to increase the internal kernel timer
granularity to the point where user space doesn't care anymore. With
the only timer we sleep past the desired time is when the kernel
decides there is some higher priority task to run.

On the timer queue I would use either microseconds (the resolution of
struct timeval), or the natural resolution of the timer, instead of
something artificial like HZ. This allows faster machines under the
same load as slower machines to become more precise, with the same
code. Polling for timers only in schedule and the idle task means
that when the load is low the kernel offers more precision than when
the load is high. The frequency of timer interrupts would have to up,
at some point to handle some loads, but just keeping the load light
would allow the program to work as expected even without a kernel
recompile.

Having user space know HZ and use it for their internal calculations
is dangerous because HZ will change as time goes by. But having user
space specify it's desired timer resolution to the kernel will allow
the kernel to round the time to a place where it can efficiently
handle the timer, and still meet the user space deadline.

The most interesting use I have seen is a high performance local area
data transfer utility, that would do short sleeps in between sending
packets to avoid pushing the switch to the point where it would drop
packets. But it was perfectly fine if a new packet came in before it
was done waiting for the old packet to go out the wire.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Tue Jul 30 2002 - 14:00:21 EST