Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

From: Benjamin Herrenschmidt
Date: Sat Jun 30 2018 - 23:49:50 EST


On Sat, 2018-06-30 at 19:18 -0700, Linus Torvalds wrote:
> On Sat, Jun 30, 2018 at 7:07 PM Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > Those locks won't protect kobj races in _general_ (ie there is no
> > locking between two totally unrelated buses), but they *should*
> > serialize the case of a device being added within one class. No?
>
> Side note: there had *better* be some locking whenever there is a way
> to find an object, because otherwise you have a fundamental lifetime
> problem: one thread finding the object at the same time another thread
> frees it for the last time. Even the "unless_zero()" won't fix it,
> because the final free will release the underlying object itself, so
> the "zero" state is ephemeral.

Right. We agree. The problem here relates to that built-in cleanup in
kobject_release() which violates that rule: it removes the object from
sysfs (and from any kset) after the reference has been dropped.

Users such as the gluedir code try to workaround this by using outer
mutexes or locks, but it's fundamentally racy if there's any chance
that another 3rd party holds the last reference, since in that case,
the cleanup happens outside of any added locking.

That's why I tend to think that anything that relies on that "late
removal from sysfs" is broken by design :) Anything that uses kobject
should ensure they are taken out of sysfs and their respective ksets
before the last reference drops.

Now it's probably all fine for most cases, I assume sysfs itself checks
the reference to be non-0 in lookup path (I haven't checked), at least
to prevent use-after-free. This is still open to the duplicate name
issue. The bigger issue of use-after-free only happens in the rare
cases where there's a separate list being maintained and cleaned up
late.

The gluedir is such an example because it "manually" thrawls the kset
list.

So you are right, kobject_get should probably always be _unless_zero,
but my point is we even with that, we probably still need to ensure the
gluedir is taken out of the kset before we drop the last reference, so
this can be done with the appropriate upper layer subsystem locking,
and thus without racing with a new device being bound.

> That locking might be just RCU during lookup, and rcu-delaying the
> release, of course. I think that's all the sysfs code needs, for
> example, since that's what lookup uses.
>
> And for any other embedded kobj cases, where you can reach the object
> using some random subsystem pointers, there had better be other
> locking in place for that pointer lookup vs the last removal.
>
> kobject itself doesn't provide that locking, it only provides the
> reference counting. But that's partly why it really has to disallow
> any kobject_get() of a zero object, because it means that the
> tear-down has been started, but the tear-down itself may not have had
> time to get the lock yet (ie kobject_release() may be just about to
> call the t->release() function).
>
> But maybe I'm missing something subtle.
>
> Linus