Re: epoll oops.

From: Pekka Enberg
Date: Wed Oct 23 2013 - 05:08:38 EST


Hi Linus,

On Mon, Oct 14, 2013 at 10:57 PM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> [ Adding Pekka to verify the SLAB_DESTROY_BY_RCU semantics, and Peter
> Hurley due to the possible tty association ]
>
> On Mon, Oct 14, 2013 at 10:31 AM, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> Oleg, does this trigger any memory for you? Commit 971316f0503a
>> ("epoll: ep_unregister_pollwait() can use the freed pwq->whead") just
>> makes me go "Hmm, this is *exactly* what that commit is talking
>> about.."
>
> Ok, Oleg, going back to that whole thread, I think that old bug went like this:
>
> (a) normally all the wait-queues that epoll accesses are associated
> with files, and as such they cannot go away for any normal file
> activity. If the file exists, the waitqueue used for poll() on that
> file must exist.
>
> (b) signalfd is special, and it does a
>
> poll_wait(file, &current->sighand->signalfd_wqh);
>
> which means that the wait-queue isn't associated with the file
> lifetime at all. It cleans it up with signalfd_cleanup() if the signal
> handlers are removed. Normal (non-epoll) handling is safe, because
> "current->sighand" obviously cannot go away as long as the current
> thread (doing the polling) is in its poll/select handling.
>
> (c) as a result, epoll and exit() can race, since the normal epoll
> cleanup() is serialized by the file being closed, and we're missing
> that for the case of sighand going away.
>
> (d) we have this magic POLLFREE protocol to make signal handling
> cleanup inform the epoll logic that "oops, this is going away", and we
> depend on the underlying sighand data not going away thanks to the
> eventual destruction of the slab being delayed by RCU.
>
> (e) we are also very careful to only ever initialize the signalfd_wqh
> entry in the SLAB *constructor*, because we cannot do it at every
> allocation: it might still be in use as long as it exists in the
> slab cache: the SLAB_DESTROY_BY_RCU flag does *not* delay individual
> slab entries, it only delays the final free of the underlying memory
> allocation.
>
> (f) to make things even more exciting, the SLAB_DESTROY_BY_RCU semantics depend
> on the slab implementation: slub and slob seem to delay each
> individual allocation (and do ctor/dtor on every allocation), while
> slab does that "delay only the underlying big page allocator" thing.

I'm not completely sure what you wanted me to verify, Linus, but yes:
SLAB_DESTROY_BY_RCU only guarantees that the underlying page is not
freed until an RCU grace period has elapsed. The object itself may be
reused immediately. Anyone who touches an object that may already have
been passed to kmem_cache_free() on a SLAB_DESTROY_BY_RCU cache must
check that it is in fact still the object they are interested in.

There's example code in a SLAB_DESTROY_BY_RCU comment in
<linux/slab.h> added by PeterZ.
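
For illustration, the "take a reference, then re-check identity" rule
can be modelled in plain userspace C. This is only a rough,
single-threaded sketch and not the kernel code: struct obj, the static
pool, pool_alloc(), obj_try_get(), obj_put() and obj_validate() are all
hypothetical names made up here, and the fixed array merely stands in
for a cache whose memory stays valid while individual objects get
recycled. Real code would use RCU read-side locking and atomic
reference counts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object in a type-stable pool (stand-in for a
 * SLAB_DESTROY_BY_RCU cache): the memory stays valid, the occupant may change. */
struct obj {
	int key;	/* identity of the current occupant */
	int refcount;	/* 0 means the slot is free */
};

#define POOL_SIZE 4
static struct obj pool[POOL_SIZE];	/* zero-initialised: all slots free */

/* Allocate a slot for 'key'; a just-freed slot may be recycled at once. */
static struct obj *pool_alloc(int key)
{
	for (size_t i = 0; i < POOL_SIZE; i++) {
		if (pool[i].refcount == 0) {
			pool[i].key = key;
			pool[i].refcount = 1;
			return &pool[i];
		}
	}
	return NULL;	/* pool exhausted */
}

/* Drop one reference; the slot becomes free when the count hits zero. */
static void obj_put(struct obj *o)
{
	assert(o->refcount > 0);
	o->refcount--;
}

/* Take a reference unless the slot is currently free. */
static bool obj_try_get(struct obj *o)
{
	if (o->refcount == 0)
		return false;
	o->refcount++;
	return true;
}

/*
 * Validate a possibly stale pointer: pin the object first, then confirm
 * it is still the one we looked up. Returns NULL if the slot is free or
 * has been reused for a different key.
 */
static struct obj *obj_validate(struct obj *stale, int key)
{
	if (!stale || !obj_try_get(stale))
		return NULL;
	if (stale->key != key) {	/* recycled for someone else */
		obj_put(stale);
		return NULL;
	}
	return stale;
}
```

If a caller keeps a pointer to the object for key 1, frees it, and the
slot is then handed out again for key 2, obj_validate(ptr, 1) returns
NULL even though the pointer still refers to valid memory.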

Pekka