Re: [PATCH] cpuset semaphore depth check optimize

From: Paul Jackson
Date: Tue Sep 13 2005 - 12:37:58 EST


Thank-you, Roman, for looking into cpuset locks.

I am starting to see the picture you are painting, with a global
cpuset_sem to guard changes in the cpuset hierarchy (add/remove)
and a per-cpuset spinlock to briefly pin down an active cpuset
and guard select values (such as mems_allowed) for access from
allocator callbacks and other tasks.

Roman wrote:
> If I read the source correctly, a cpuset cannot be removed or moved while
> it's attached to a task, which makes it a lot simpler.

Yes - a cpuset cannot be removed while attached (count > 0). And there
is no 'move' that I know of. A rename(2) system call on a cpuset in
the cpuset filesystem gets the vfs default -EINVAL response.

So, yes, if I can pin a cpuset with a per-cpuset spinlock on it, then
its parent chain (and whatever else I take care to guard with that
spinlock) is held as well (the cpuset->parent chain is pinned). I
guess this some of what you meant by your phrase "makes it a lot
simpler".


> This means you have to take the second lock ...

By "second lock", did you mean what you described in your earlier
message as:

> low-level lock (maybe even a spinlock) which manages the state of an
> active cpuset.


You also wrote:
> You can BTW avoid locking in cpuset_exit() completely in the common case:
>
> tsk->cpuset = NULL;
> if (atomic_dec_and_test(&cs->count) && notify_on_release(cs)) {

I don't think that works. And I suspect you are proposing the same bug
that I had, and fixed with the following patch:

> changeset: 1793:feaa2e5b1e0026994365daf8d1ad3bd745ba8a14
> user: Paul Jackson <pj@xxxxxxx>
> date: Fri May 27 08:07:26 2005 +0019
> files: kernel/cpuset.c
> description:
> [PATCH] cpuset exit NULL dereference fix
>
> There is a race in the kernel cpuset code, between the code
> to handle notify_on_release, and the code to remove a cpuset.
> The notify_on_release code can end up trying to access a
> cpuset that has been removed. In the most common case, this
> causes a NULL pointer dereference from the routine cpuset_path.
> However all manner of bad things are possible, in theory at least.
>
> The existing code decrements the cpuset use count, and if the
> count goes to zero, processes the notify_on_release request,
> if appropriate. However, once the count goes to zero, unless we
> are holding the global cpuset_sem semaphore, there is nothing to
> stop another task from immediately removing the cpuset entirely,
> and recycling its memory.


You also wrote:
> There may be a subtle problem with cpuset_fork()

Hmmm ... interesting. I will think about this some more.

> The only (simple) solution I see is to do this:
>
> lock();
> tsk->cpuset = current->cpuset;
> atomic_inc(&tsk->cpuset->count);
> unlock();

What "lock()" and "unlock()" is this? Your "second lock",
aka "low-level lock (maybe even a spinlock)" ?

Separate question:

When cpuset_excl_nodes_overlap() or cpuset_zone_allowed()
callbacks are walking backup the cpuset hierarchy in the
nearest_exclusive_ancestor() code, how do they handle the
per-cpuset spinlocks?

My assumption is that it would be like this. Say my current task
is attached to cpuset "/dev/cpuset/A/B/C", and it happens I will
have to walk all the way to the top to find the exclusive ancestor
that I am searching for. I'd expect the spinlocks to get taken
and released in the following order on these cpusets:

lock /A/B/C
lock /A/B
lock /A
lock /
==> found what I was searching for
unlock /
unlock /A
unlock /A/B
unlock /A/B/C

If at any point, perhaps in the creation of a cpuset, I had reason
to want to get a child cpusets spinlock while already holding the
parents, I would have to release the parent, and relock starting
with the child first, working upward. The proper order to obtain
these locks would always be bottom up.

Does this resemble what you are suggesting, Roman?

Your review of this is most appreciated. Once again - thanks.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@xxxxxxx> 1.925.600.0401
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/