Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context

From: Milton Miller
Date: Fri Jan 13 2012 - 15:58:48 EST


On Thu, 12 Jan 2012 about 14:14:57 -0500 (EST), Gilad Ben-Yossef wrote:
> I also think there are still some problems with IPIs somewhere that
> cause some corruption when a lot of IPIs are sent, and that
> the patch simply lowered the very big number of IPIs that are sent
> via the direct reclaim code path, so the problem
> is hidden, not solved, by this patch.
>
> I've seen something related when trying to test the IPI reduction
> patches. Interestingly enough, it was not related to CPU hotplug at all -
>
> When a lot of IPIs are being sent, I sometimes got an assert from low
> level platform code that I'm trying to send IPIs with an empty mask
> although the mask was NOT empty. I didn't manage to debug it then but
> I did manage to recreate it quite easily.

That is a known scenario and an expected race; the check must be
removed from the platform code.

>
> I will see if I can recreate it with recent master and report.
>

The code in kernel/smp checks the mask and looks for opportunities to
convert an smp_call_function_many call into smp_call_function_single.
But it now tolerates callers passing a mask that other cpus are
changing, so while it tries to make sure the mask has more than one
cpu, that is not guaranteed. That, however, is not what causes the
assert to fire.

There are a few opportunities for the architecture code to detect an
empty mask. The first case is when a cpu starts processing the list
due to an interrupt generated by a cpu other than the requester,
between the time the requester put its work element on the list and
the time it executed the architecture code to send the interrupt. A
second case is when a work element is deleted by a third cpu as its
ref count goes to zero and that element is then reused by its owning
cpu, so we process part of the request list twice.

The solution is to tell the architecture code it is not an error for
the mask to be empty.
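
In code terms, a minimal sketch of what that tolerance could look like
(the body here is hypothetical; only arch_send_call_function_ipi_mask()
and cpumask_empty() are real kernel symbols, and the actual low-level
send is of course platform specific):

/*
 * Hypothetical arch hook: an empty mask just means every target cpu
 * already noticed the work (or a third cpu completed and reused the
 * element), so there is nothing left to notify; it is not an error.
 */
void arch_send_call_function_ipi_mask(const struct cpumask *mask)
{
	if (cpumask_empty(mask))
		return;				/* raced, nothing to send */

	platform_trigger_ipi(mask);		/* hypothetical low-level send */
}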

Here is a walk-through of how smp_call_function_many works. Each cpu
has a queue element in per-cpu space that can be placed on the queue.
When a cpu calls smp_call_function_many, it does a quick check to see
if the call could be converted to smp_call_function_single, but at this
point other cpus can still be changing the mask (we only guarantee that
cpus which remain in the mask from the start of the call through the
list copy will be called). Once it decides multiple cpus are needed,
it first waits for its per-cpu data to become free (if the previous
caller said not to wait, it could still be in use). After it is free,
it copies the mask supplied by the caller into its per-cpu work
element. At that point it removes itself from the copied mask and
then counts the bits. If the count is 0 it is done; otherwise it
obtains the lock, adds its work to the front of the global list, and
stores the count of cpus. Starting at this point other cpus could
start doing the work requested, even though they have not received an
interrupt yet and we have not unlocked the list manipulation lock.
The requesting cpu proceeds to unlock the list and then calls the
architecture code to request that all cpus in the mask be notified
there is work for them. The low-level code does its processing and
then iterates over the cpus still set in the work element's mask,
sending each one an IPI. If the caller requested to wait for all
(other) cpus to perform the work before returning, then the requesting
cpu waits for its per-cpu work request element to be unlocked by the
last cpu that does the work, after that cpu has removed it from the
list.
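
Condensed into code, the request side looks roughly like this (a
from-memory sketch of kernel/smp.c of that era, with the single-cpu
fast path and the memory barriers trimmed; field and helper names are
approximate):

void smp_call_function_many(const struct cpumask *mask,
			    smp_call_func_t func, void *info, bool wait)
{
	struct call_function_data *data;
	int this_cpu = smp_processor_id();
	unsigned long flags;
	int refs;

	/* (fast path: zero or one other cpu is handed to
	 *  smp_call_function_single instead) */

	data = &__get_cpu_var(cfd_data);
	csd_lock(&data->csd);		/* previous non-waited use may linger */

	data->csd.func = func;
	data->csd.info = info;
	cpumask_and(data->cpumask, mask, cpu_online_mask);
	cpumask_clear_cpu(this_cpu, data->cpumask);
	refs = cpumask_weight(data->cpumask);

	if (!refs) {			/* callers may race the mask to empty */
		csd_unlock(&data->csd);
		return;
	}

	raw_spin_lock_irqsave(&call_function.lock, flags);
	list_add_rcu(&data->csd.list, &call_function.queue);
	/* once refs is set, other cpus may legitimately start the work */
	atomic_set(&data->refs, refs);
	raw_spin_unlock_irqrestore(&call_function.lock, flags);

	arch_send_call_function_ipi_mask(data->cpumask);

	if (wait)
		csd_lock_wait(&data->csd);	/* last worker unlocks it */
}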

Whenever a cpu gets an IPI for generic_smp_call_function_interrupt, it
starts at the top of the list and works its way down. If a given
element has the cpu's bit set in the cpu mask in the work element,
then there will be work to do in the future. It then proceeds to check
references, to make sure the element has been placed on the list and is
allowed to be processed. If not, we are guaranteed the requester has
not unlocked the list and therefore has not called the architecture
notifier, so the cpu will get another interrupt in the future; it
proceeds to the next element. If the bit is set and references
are set, then the code will execute the requested function with the
supplied data argument. This function must not enable interrupts!
(There is a defensive check, but that will only trigger if we take
another call-function interrupt.) After the call, the cpu removes
itself from the cpu mask and decrements the ref count. If it removes
the last reference, it grabs the list manipulation lock, removes the
element with list_del_rcu, unlocks the list (flushing the list
manipulation writes), and then unlocks the element, allowing the
requesting cpu to reuse it. It then goes on to the next element.

If the cpu suddenly becomes slow here and the element's owning cpu
sees the unlock and is able to reuse the element, our traversal of
next will lead us back to the start of the list and we will check some
entries an additional time. This means we can see entries that were
added to the list after we started processing it, before we have
received the interrupt for them. Eventually we wrap back to the head
of the list and return from the interrupt.
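
The per-element checks described above map roughly onto this loop
(again a from-memory sketch of generic_smp_call_function_interrupt()
of that era, with the memory barriers and defensive warnings trimmed):

void generic_smp_call_function_interrupt(void)
{
	struct call_function_data *data;
	int cpu = smp_processor_id();

	list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
		int refs;

		/* not targeted at us (or a reused element not yet filled in) */
		if (!cpumask_test_cpu(cpu, data->cpumask))
			continue;

		/* on the list but refs not set: requester will IPI us later */
		if (atomic_read(&data->refs) == 0)
			continue;

		data->csd.func(data->csd.info);	/* must not enable interrupts */

		cpumask_test_and_clear_cpu(cpu, data->cpumask);
		refs = atomic_dec_return(&data->refs);
		if (refs)
			continue;

		/* last cpu: take it off the list, then let the owner reuse it */
		raw_spin_lock(&call_function.lock);
		list_del_rcu(&data->csd.list);
		raw_spin_unlock(&call_function.lock);
		csd_unlock(&data->csd);
	}
}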

It would be easy to write some debugging assertions for the list.
For instance: after grabbing the lock, walk the list and mark which
cpus' per_cpu data is on it. Then, still under the lock, verify
(1) if refs is set, the element is on the list, (2) refs is <= the
number of cpus in the cpumask, and (3) any element with refs set is
csd_locked. Valid transient states are: (1) refs < bits in mask
(a short transient), (2) refs == 0 but still on the list (waiting
for the lock to remove it), and (3) off the list but csd_locked
(either a short transient, or the owning cpu preparing or waiting
to put it on the list).

For added safety, while under the lock we could also count the
traversals: it should never take more than one above nr_cpu_ids (or
num_possible_cpus(), for that matter) to get back to the list head.
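
A sketch of what such a check might look like (a hypothetical helper,
not in the tree; it assumes it lives in kernel/smp.c so it can see
call_function, cfd_data and CSD_FLAG_LOCK, and it only covers checks
(2) and (3) plus the traversal count):

/* Hypothetical debug helper sketched against kernel/smp.c internals. */
static void debug_check_call_function_queue(void)
{
	struct call_function_data *data;
	int walked = 0;

	raw_spin_lock(&call_function.lock);
	list_for_each_entry(data, &call_function.queue, csd.list) {
		int refs = atomic_read(&data->refs);

		/* (2) refs never exceeds the cpus still in the mask */
		WARN_ON(refs > cpumask_weight(data->cpumask));
		/* (3) anything with refs set must still be csd_locked */
		WARN_ON(refs && !(data->csd.flags & CSD_FLAG_LOCK));
		/* traversal sanity: the list can never be this long */
		WARN_ON(++walked > nr_cpu_ids + 1);
	}
	raw_spin_unlock(&call_function.lock);
}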

We could also put an assert-type check in the cpu-offline path, before
taking a cpu dead, that looks at every cpu's per_cpu work element and
verifies our cpu bit is not set in any of them. It should never be
set, even transiently, while we are offline.
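
Something along these lines (again a hypothetical helper, assuming
access to kernel/smp.c's per-cpu cfd_data):

/* Hypothetical offline-time assertion: no element should still target us. */
static void assert_no_ipi_work_targets_cpu(int dying_cpu)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		struct call_function_data *data = &per_cpu(cfd_data, cpu);

		WARN_ON(cpumask_test_cpu(dying_cpu, data->cpumask));
	}
}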

milton