Re: lmbench regression due to cond_resched nullification change, 2.6.26-rc5 vs. 2.6.25

From: Linus Torvalds
Date: Fri Jun 20 2008 - 15:11:48 EST




On Fri, 20 Jun 2008, Christian Borntraeger wrote:
>
> On a 6-way s390 I have seen an interesting regression in 2.6.26-rc5 vs.
> 2.6.25 for the lmbench benchmark.
>
> For example, select file 500:
> 2.6.25:      23 microseconds
> 2.6.26-rc5:  32 microseconds
>
> Several lmbench tests show a regression, but so far I have only bisected
> the select test case:
> -------------------------<snip---------------------
>
> commit c714a534d85576af21b06be605ca55cb2fb887ee
>
> Make 'cond_resched()' nullification depend on PREEMPT_BKL
>
> Reverting that patch gives me the 2.6.25 performance.
>
> I think the patch is fine from the correctness point of view (do resched
> inside BKL-protected zones if it's safe), but I don't understand why it
> has such a large impact on the select microbenchmark. Any ideas? Is it
> simply the overhead of _cond_resched?

Yeah, I do think that it really is simply the overhead of _cond_resched.

Most places that use cond_resched() and friends do it in fairly heavy
loops. That inner select() loop, in contrast, is rather heavily optimized,
in part precisely because it's one of the hottest loops in one of the lmbench
benchmarks.
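
For reference: with that change, cond_resched() in this configuration is a
real out-of-line call. A simplified sketch of the 2.6.26-era helper in
kernel/sched.c (the real thing has a few more config-dependent guards):

	int _cond_resched(void)
	{
		/* Reschedule only if somebody flagged us, and we're not
		 * already in the middle of a preemption */
		if (need_resched() && !(preempt_count() & PREEMPT_ACTIVE)) {
			__cond_resched();	/* schedule() with PREEMPT_ACTIVE */
			return 1;
		}
		return 0;
	}

So even when there is nothing to do, every pass through that inner loop
pays for the function call and the flag tests.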

Of course, it *can* also be simply that it now reschedules more, and a lot
of lmbench benchmarks are very sensitive to scheduling behaviour.
Batch-mode is almost invariably more efficient than any fine-grained
scheduling can ever be, so there is always a performance cost to having
low latency.

There's also a third (and rather funny) reason the "cond_resched()" may be
unnecessarily expensive in that loop: the "need_resched" bit may be set
because that very same running thread was woken up! So we may end up
deciding to reschedule because the thread thinks it needs to wake itself
up ;)

[ That third case happens because the way we avoid race conditions with
going to sleep while a wakeup event happens is that we *mark* ourselves
as sleeping before we actually test whether anything is ready. ]
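
The idiom, in a minimal sketch (this is the generic kernel sleep pattern,
not the actual do_select() code, and "event_pending" is a made-up
condition):

	set_current_state(TASK_INTERRUPTIBLE);	/* mark ourselves sleeping first */
	if (!event_pending)			/* then test the condition */
		schedule();			/* really go to sleep */
	__set_current_state(TASK_RUNNING);

A wakeup racing with the test cannot be lost, but it can leave the
need_resched flag set on the very thread that then decides not to sleep.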

So there are a few different things that may make that conditional
reschedule a bad thing in the poll() loop.

But quite frankly, regardless of exactly why it happens, it makes
absolutely no sense to even have that cond_resched() in the _innermost_
loop - the one that runs for every single fd. It's much better to move the
conditional reschedule out a bit.

That inner loop was very much designed to compile into nice assembly
language, and it's possible that the cond_resched() simply causes extra
register pressure and prevents us from keeping the bitmasks in registers, etc.
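
To make the shape concrete, the scan loop is structured roughly like this
(paraphrased, not the literal fs/select.c source):

	for (;;) {				/* until ready, timeout or signal */
		for (i = 0; i < n; ) {		/* one word of the fd bitmaps */
			for (j = 0; j < __NFDBITS; ++j, ++i) {
				/* f_op->poll() the fd, collect result bits */
				/* 2.6.26-rc5: cond_resched() sat here, per fd */
			}
			/* the patch below: cond_resched() moves here, per word */
		}
		/* nothing ready: set up the timeout, sleep, rescan */
	}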

So this trivial patch, which moves the cond_resched() to be outside the
single-word select() loop, would be a good idea _regardless_. Does it make
any difference for you? If it's purely a "code generation got worse"
issue, it should help a bit.

Linus

---
fs/select.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 8dda969..da0e882 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -249,7 +249,6 @@ int do_select(int n, fd_set_bits *fds, s64 *timeout)
 						retval++;
 					}
 				}
-				cond_resched();
 			}
 			if (res_in)
 				*rinp = res_in;
@@ -257,6 +256,7 @@ int do_select(int n, fd_set_bits *fds, s64 *timeout)
 				*routp = res_out;
 			if (res_ex)
 				*rexp = res_ex;
+			cond_resched();
 		}
 		wait = NULL;
 		if (retval || !*timeout || signal_pending(current))
--