Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock

From: Frederic Weisbecker
Date: Tue Oct 01 2013 - 10:50:01 EST


2013/10/1 Frederic Weisbecker <fweisbec@xxxxxxxxx>:
> 2013/10/1 Frederic Weisbecker <fweisbec@xxxxxxxxx>:
>> On Wed, Aug 21, 2013 at 06:41:46PM +0200, Oleg Nesterov wrote:
>>> On 08/21, Peter Zijlstra wrote:
>>> >
>>> > The other consideration is that this adds two branches to the normal
>>> > schedule path. I really don't know what the regular ratio between
>>> > schedule() and io_schedule() is -- and I suspect it can very much depend
>>> > on workload -- but it might be a net loss due to that, even if it makes
>>> > io_schedule() 'lots' cheaper.
>>>
>>> Yes, agreed. Please ignore it for now, I didn't try to actually suggest
>>> this change. And even if this is fine perfomance wise, this needs some
>>> benchmarking.
>>>
>>> Well. actually I have a vague feeling that _perhaps_ this change can
>>> help to solve other problems we are discussing, but I am not sure and
>>> right now I can't even explain the idea to me.
>>>
>>> In short: please ignore ;)
>>>
>>> Oleg.
>>>
>>
>> Ok, the discussion is hard to synthesize but I think I now have more
>> clues to send a better new iteration.
>>
>> So we have the choice between keeping atomics or using rq->lock. It seems
>> that using rq->lock is going to be worrisome for several reasons. So let's
>> stick to atomics (but tell me if you prefer the other way).
>>
>> So the idea for the next try is to do something along the lines of:
>>
>> struct cpu_idletime {
>> nr_iowait,
>> seqlock,
>> idle_start,
>> idle_time,
>> iowait_time,
>> } __cacheline_aligned_in_smp;
>>
>> DEFINE_PER_CPU(struct cpu_idletime, cpu_idletime);
>>
>> io_schedule()
>> {
>> int prev_cpu;
>>
>> preempt_disable();
>> prev_cpu_idletime = __this_cpu_ptr(&cpu_idletime);
>> atomic_inc(prev_cpu_idletime->nr_iowait);
>> WARN_ON_ONCE(is_idle_task(current));
>> preempt_enable_no_resched();
>>
>> schedule();
>>
>> write_seqlock(prev_cpu_idletime->seqlock)
>> if (!atomic_dec_return(prev_cpu_idletime->nr_iowait))
>> flush_cpu_idle_time(prev_cpu_idletime, 1)
>
> I forgot...
> cpu_idletime->idle_start;
>
> after the update.
>
> Also now I wonder if we actually should lock the inc part. Otherwise
> it may be hard to get the readers right...
>
> Thanks.
>
>> write_sequnlock(prev_cpu_idletime->seqlock)
>>
>> }
>>
>> flush_cpu_idle_time(cpu_idletime, iowait)
>> {
>> if (!cpu_idletime->idle_start)
>> return;
>>
>> if (nr_iowait)
>> cpu_idletime->iowait_time = NOW() - cpu_idletime->idle_start;
>> else
>> cpu_idletime->idle_time = NOW() - cpu_idletime->idle_start;
>> }
>>
>> idle_entry()
>> {
>> write_seqlock(this_cpu_idletime->seqlock)
>> this_cpu_idletime->idle_start = NOW();
>> write_sequnlock(iowait(cpu)->seqlock)
>> }
>>
>> idle_exit()
>> {
>> write_seqlock(this_cpu_idletime->seqlock)
>> flush_cpu_idle_time(this_cpu_idletime, atomic_read(&this_cpu_idletime->nr_iowait));
>> this_cpu_idletime->idle_start = 0;
>> write_sequnlock(this_cpu_idletime->seqlock)
>> }
>>
>>
>> Now this all realy on the fact that atomic_inc(cpu_idletime->nr_iowait) can't happen
>> in a CPU that is already idle. So it can't happen between idle_entry() and idle_exit().
>> Hence the WARN_ON(is_idle_task(current)) below after the inc.
>>
>> Hmm?

Actually if we record the source of the idle entry time (whether we
enter idle as iowait or idlewait) and protect it inside the seqlock,
we can probably make the readers safe. So they would rely on that
instead of the nr_io_wait which is only used by the updaters.

I'm probably forgetting something obvious. Anyway I'll try to make a
patchset out of that, unless somebody points me out the mistakes I'm
missing here.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/