RE: Mainline kernel OLTP performance update

From: Ma, Chinang
Date: Thu Jan 15 2009 - 02:12:23 EST


Trying to answer some of the questions below:

-Chinang

>-----Original Message-----
>From: Steven Rostedt [mailto:srostedt@xxxxxxxxxx]
>Sent: Wednesday, January 14, 2009 6:27 PM
>To: Andrew Morton
>Cc: Matthew Wilcox; Wilcox, Matthew R; Ma, Chinang; linux-
>kernel@xxxxxxxxxxxxxxx; Tripathi, Sharad C; arjan@xxxxxxxxxxxxxxx; Kleen,
>Andi; Siddha, Suresh B; Chilukuri, Harita; Styner, Douglas W; Wang, Peter
>Xihong; Nueckel, Hubert; chris.mason@xxxxxxxxxx; linux-scsi@xxxxxxxxxxxxxxx;
>Andrew Vasquez; Anirban Chakraborty; Ingo Molnar; Thomas Gleixner; Peter
>Zijlstra; Gregory Haskins
>Subject: Re: Mainline kernel OLTP performance update
>
>(added Ingo, Thomas, Peter and Gregory)
>
>On Wed, 2009-01-14 at 18:04 -0800, Andrew Morton wrote:
>> On Wed, 14 Jan 2009 18:21:47 -0700 Matthew Wilcox <matthew@xxxxxx> wrote:
>>
>> > On Wed, Jan 14, 2009 at 04:35:57PM -0800, Andrew Morton wrote:
>> > > On Tue, 13 Jan 2009 15:44:17 -0700
>> > > "Wilcox, Matthew R" <matthew.r.wilcox@xxxxxxxxx> wrote:
>> > > >
>> > >
>> > > (top-posting repaired. That @intel.com address is a bad influence ;))
>> >
>> > Alas, that email address goes to an Outlook client. Not much to be
>> > done about that.
>>
>> aspirin?
>>
>> > > (cc linux-scsi)
>> > >
>> > > > > This is the latest 2.6.29-rc1 kernel OLTP performance result.
>> > > > > Compared to 2.6.24.2, the regression is around 3.5%.
>> > > > >
>> > > > > Linux OLTP Performance summary
>> > > > > Kernel#     Speedup(x)  Intr/s  CtxSw/s  us%  sys%  idle%  iowait%
>> > > > > 2.6.24.2    1.000       21969   43425    76   24    0      0
>> > > > > 2.6.27.2    0.973       30402   43523    74   25    0      1
>> > > > > 2.6.29-rc1  0.965       30331   41970    74   26    0      0
>> >
>> > > But the interrupt rate went through the roof.
>> >
>> > Yes. I forget why that was; I'll have to dig through my archives for
>> > that.
>>
>> Oh. I'd have thought that this alone could account for 3.5%.
>>
>> > > A 3.5% slowdown in this workload is considered pretty serious,
>> > > isn't it?
>> >
>> > Yes. Anything above 0.3% is statistically significant. 1% is a big
>> > deal. The fact that we've lost 3.5% in the last year doesn't make
>> > people happy. There's a few things we've identified that have a big
>> > effect:
>> >
>> > - Per-partition statistics. Putting in a sysctl to stop doing them gets
>> > some of that back, but not as much as taking them out (even when
>> > the sysctl'd variable is in a __read_mostly section). We tried a
>> > patch from Jens to speed up the search for a new partition, but it
>> > had no effect.
>>
>> I find this surprising.
>>
>> > - The RT scheduler changes. They're better for some RT tasks, but not
>> > the database benchmark workload. Chinang has posted about
>> > this before, but the thread didn't really go anywhere.
>> > http://marc.info/?t=122903815000001&r=1&w=2
>
>I read the whole thread before I found what you were talking about here:
>
>http://marc.info/?l=linux-kernel&m=122937424114658&w=2
>
>With this comment:
>
>"When setting foreground and log writer to rt-prio, the log latency reduced
>to 4.8ms. \
>Performance is about 1.5% higher than the CFS result.
>On a side note, we had been using rt-prio on all DBMS processes and log
>writer ( in \
>higher priority) for the best OLTP performance. That has worked pretty well
>until \
>2.6.25 when the new rt scheduler introduced the pull/push task for lower
>scheduling \
>latency for rt-task. That has negative impact on this workload, probably
>due to the \
>more elaborated load calculation/balancing for hundred of foreground rt-
>prio \
>processes. Also, there is that question of no production environment would
>run DBMS \
>with rt-prio. That is why I am going back to explore CFS and see whether I
>can drop \
>rt-prio for good."
>

>A couple of questions:
>
>1) how does the latest rt scheduler compare? There has been a lot of
>improvements.

It is difficult for me to isolate the recent rt scheduler improvements, as so many other changes were introduced to the kernel at the same time. A more accurate comparison would revert just the rt scheduler to the previous version and measure the delta, but I am not sure how to get that done.

>2) how many rt tasks?
Around 250 rt tasks.

>3) what were the prios, producer compared to consumers, not actual numbers
I suppose the single log writer is the main producer (rt-prio 49, the highest rt-prio in this workload); it wakes up all foreground processes when the log write is done. The 240 foreground processes are the consumers (rt-prio 48). At any given time, some number of the 240 foreground processes are waiting for the log writer to finish flushing out the log data.
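
For concreteness, a minimal sketch of how those priorities could be set with sched_setscheduler(); the rt-prio values 49 and 48 come from the setup above, while the helper name and the pid variables are hypothetical:

#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

/* Sketch only: put a process under SCHED_FIFO at the given rt-prio.
 * In our setup the log writer gets 49, each foreground process 48. */
static int set_rt_prio(pid_t pid, int prio)
{
        struct sched_param sp = { .sched_priority = prio };

        if (sched_setscheduler(pid, SCHED_FIFO, &sp) == -1) {
                perror("sched_setscheduler");
                return -1;
        }
        return 0;
}

/* e.g. set_rt_prio(log_writer_pid, 49);
 *      set_rt_prio(foreground_pid, 48);   (pids are placeholders) */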

>4) have you tried pinning tasks?
>
We did try pinning the foreground rt-processes to CPUs. That recovered about 1% performance but introduced idle time on some CPUs. Without load balancing, my workaround was to pin more processes to the idle CPUs. I don't think this is a practical solution to the idle time problem, as the process distribution needs to be adjusted again when upgrading to a different server.
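
The pinning itself was just the standard affinity call, roughly like the sketch below; choosing the cpu argument per process is the part that needs manual re-tuning on each server:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Sketch only: pin the calling process to a single CPU. Pinned rt
 * tasks skip the push/pull balancing logic mentioned below. */
static int pin_to_cpu(int cpu)
{
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
                perror("sched_setaffinity");
                return -1;
        }
        return 0;
}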

>RT is more about determinism than performance. The old scheduler
>migrated rt tasks the same as other tasks, which helped performance
>by keeping several rt tasks on the same CPU and cache hot even when
>an rt task could migrate, but it killed determinism (I was seeing
>10 ms wake-up times from the next-highest-prio task on a cpu, even
>when another CPU was available).
>
>If you pin a task to a cpu, then it skips over the push and pull logic
>and will help with performance too.
>
>-- Steve
>
>
>
>>
>> Well. It's more a case that it wasn't taken anywhere. I appear to
>> have recently been informed that there have never been any
>> CPU-scheduler-caused regressions. Please persist!
>>
>> > SLUB would have had a huge negative effect if we were using it -- on
>> > the order of 7% iirc. SLQB is at least performance-neutral with SLAB.
>>
>> We really need to unblock that problem somehow. I assume that
>> enterprise distros are shipping slab?
>>
