Re: Linux 2.6.29

From: Ingo Molnar
Date: Fri Apr 03 2009 - 04:16:13 EST

Next message: KAMEZAWA Hiroyuki: "[RFC][PATCH 6/9] active inactive ratio for private"
Previous message: KAMEZAWA Hiroyuki: "[RFC][PATCH 5/9] add more hooks and check in lazy manner"
In reply to: Jens Axboe: "Re: Linux 2.6.29"
Next in thread: Bill Davidsen: "Re: Linux 2.6.29"

* Jens Axboe <jens.axboe@xxxxxxxxxx> wrote:

> On Thu, Apr 02 2009, Linus Torvalds wrote:

> > On Fri, 3 Apr 2009, Lennart Sorensen wrote:

> > > So so far I would rank anticipatory at about 1000x better than
> > > cfq for my work load. It sure acts a lot more like it used to
> > > back in 2.6.18 times.
[...]

> > Jens - remind us what the problem with AS was wrt CFQ?
>
> CFQ was just faster, plus it supported things like io priorities
> that AS does not.

Btw., pluggable IO schedulers do have their upsides:

- They are easier to test during development and deployment.

- The uptake of a new, experimental IO scheduler is faster due to
easier availability.

- Regressions in the primary IO scheduler are easier to prove.

And the technical case for pluggable IO schedulers is much stronger
than the case for pluggable process schedulers:

- Persistent media has persistent workloads - and each workload has
different access patterns.

- The inefficiencies of mixed workloads on the same rotating media
have forced a clear separation of the 'one disk, one workload'
usage model, and have hammered this into people's minds. (Nobody
in their right mind is going to put a big Oracle and SAP
installation on the same [rotating] disk.)

- the 'NOP' scheduler makes sense on media with RAM-like
properties. 90% of CFQ's overhead is useless fluff on such media.

- [ These properties are not there for CPU schedulers: CPUs are
data processors not persistent data storage so they are
fundamentally shared by all workloads and have a lot less
persistent state - so mixing workloads on CPUs is common and
having one good scheduler is paramount. ]

At the risk of restarting the "to plug or not to plug" scheduler
flamewars ;-), the pluggable IO scheduler design has its very clear
downsides as well:

- 99% of users use CFQ, so any bugs in it will hit 99% of the Linux
community and we have not actually won much in terms of helping
real people out in the field.

- We are many years down the road of having replaced AS with the
supposedly better CFQ - and AS is still (or again?) markedly
better for some common tests.

- The 1% of testers/users who find that CFQ sucks and track it down
to CFQ can easily switch back to another IO scheduler: NOP or AS.

This dilutes the quality of _CFQ_, our crown jewel IO scheduler,
as it removes critical participants from the pool of testers.
They might be only 1% of all Linux users, but they are the 1% who
make things happen upstream.

The result: even if CFQ sucks for some important workloads, the
combined social pressure is IMO never strong enough on upstream
to get our act together. While we might fix the bugs reported
here, the time to realize and address these bugs was way too
long. Power-users configure their way out and take the path of least
resistance and the rest suffers in silence.

- There's not even any feedback in the common case: people think
"hey, what I'm doing must be some oddball thing" and leave it at
that. Even if that oddball thing is not odd at all. Furthermore,
getting feedback _after_ someone has solved their problems by
switching to AS is a lot harder than getting feedback while they
are still hurting and cursing. Yesterday's solved problem is
boring and a lot less worthy to report than today's high-prio
ticket.

- It is _too easy_ to switch to AS, and shops with critical data
will not be as eager to report CFQ problems, and will not be as
eager to test experimental kernel patches that fix CFQ problems,
if they can switch to AS at the flip of a switch.
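
As a rough sketch of what "the flip of a switch" amounts to in
practice: the active IO scheduler of a block device is exposed in
/sys/block/<dev>/queue/scheduler, with the active one shown in
brackets, and writing another name to that file switches it. The
helper names below are made up for illustration, and the device name
"sda" in the usage note is only an assumption.

```python
from pathlib import Path


def parse_scheduler_line(text):
    """Split a line such as 'noop anticipatory deadline [cfq]'
    into (active, available). The active scheduler is the bracketed one."""
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


def scheduler_path(dev):
    """Path of the per-device scheduler switch in sysfs."""
    return Path("/sys/block") / dev / "queue" / "scheduler"


def current_scheduler(dev):
    """Read and parse the scheduler line of one block device."""
    return parse_scheduler_line(scheduler_path(dev).read_text())


def switch_scheduler(dev, name):
    """Switch the IO scheduler of a device. Needs root, and the
    requested scheduler has to be built in or loaded."""
    scheduler_path(dev).write_text(name + "\n")
```

With these helpers, current_scheduler("sda") would report which
scheduler is active, and switch_scheduler("sda", "anticipatory") is
the one-line escape route that the points above are worried about.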

Ergo, i think pluggable designs for something as critical and as
central as IO scheduling have their clear downsides, as they created two
mediocre schedulers:

- CFQ with all the modern features but performance problems on
certain workloads

- Anticipatory with legacy features only but works (much!) better
on some workloads.

... instead of giving us just a single well-working CFQ scheduler.

This, IMHO, in its current form, seems to trump the upsides of
pluggable IO schedulers.

So i do think that late during development (i.e. now), _years_ down
the line, we should make it gradually harder for people to use AS.

I'd not remove the AS code per se (it _is_ convenient to test it
without having to patch the kernel - especially now that we _know_
that there is a common problem, and there _are_ genuinely oddball
workloads where it might work better due to luck or design), but
still we should:

- Make it harder to configure in.

- Change the /sys switch-to-AS method to break any existing scripts
that switched CFQ to AS. Add a warning to the syslog if an old
script uses the old method and document the change prominently but
do _not_ switch the IO scheduler to AS.

- If the user still switches to AS, emit some scary warning about
this being an obsolete IO scheduler, that it is not being tested
as widely as CFQ and hence might have bugs, and that if the user
still feels absolutely compelled to use it, to report his problem
to the appropriate mailing lists so that upstream can fix CFQ
instead.
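
The last two points can be modelled roughly as follows. This is only
a userspace sketch of the policy, not kernel code: the function name,
the NEW_STYLE_AS token and the warning texts are all hypothetical,
since nothing above specifies what the new switch method would look
like.

```python
# Hypothetical new-style token; the proposal does not name one.
NEW_STYLE_AS = "anticipatory-obsolete"

OLD_METHOD_WARNING = (
    "old-style switch to AS ignored, IO scheduler unchanged; "
    "see the documentation for the new method"
)
OBSOLETE_WARNING = (
    "AS is an obsolete IO scheduler and is not tested as widely as CFQ; "
    "please report your problem to the appropriate mailing lists "
    "so that CFQ can be fixed instead"
)


def handle_scheduler_write(current, requested):
    """Model a write to the scheduler switch.

    Returns (new_scheduler, warnings), where warnings is the list of
    messages that would go to the syslog.
    """
    if requested == "anticipatory":
        # Old scripts land here: warn, but do _not_ switch.
        return current, [OLD_METHOD_WARNING]
    if requested == NEW_STYLE_AS:
        # Deliberate opt-in: switch, but emit the scary warning.
        return "anticipatory", [OBSOLETE_WARNING]
    # Any other scheduler is switched as before, without a warning.
    return requested, []
```

An old script writing "anticipatory" would thus stay on CFQ and leave
a trace in the logs, while a user who really wants AS has to ask for
it explicitly and gets pointed at the mailing lists.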

By splintering the pool of testers and by removing testers from that
pool who are the most important in getting our default IO scheduler
tested we are not doing ourselves any favors.

Btw., my personal opinion is that even such extreme measures don't
work fully right due to social factors, so _my_ preferred choice for
doing such things is well known: to implement one good default
scheduler and to fix all bugs in it ;-)

For IO schedulers i think there are just two sane technical choices
for plugins: one good default scheduler (CFQ) or no IO scheduler at
all (NOP).

The rest is development fuzz or migration fuzz - and such fuzz needs
to be forced to zero after years of stabilization.

What do you think?

Ingo
