Re: lockdep: holding locks across syscall boundaries

From: Peter Zijlstra
Date: Fri Oct 27 2023 - 12:12:44 EST


On Fri, Oct 27, 2023 at 10:06:33AM -0600, Jens Axboe wrote:
> On 10/27/23 9:59 AM, Peter Zijlstra wrote:
> > On Fri, Oct 27, 2023 at 09:14:53AM -0600, Jens Axboe wrote:
> >> Hi,
> >>
> >> Normally we'd expect locking state to be clean and consistent across
> >> syscall entry and exit, as that is always the case for sync syscalls.
> >
> >> We currently have a work-around for holding a lock from aio, see
> >> kiocb_start_write(), which pretends to drop the lock from lockdeps
> >> perspective, as it's held from submission to until kiocb_end_write() is
> >> called at completion time.
> >
> > I was not aware of this, the only such hack I knew about was the
> > filesystem freezer thing.
> >
> > The problem with holding locks past the end of a syscall is that you'll
> > nest whatever random lock hierarchies possibly by every other syscall
> > under that lock.
>
> Can you expand on that bit, not quite sure I follow. Do we reset the
> locking dependencies between syscalls?

What I'm saying is that if syscall-a (see below) returns with lock-A
held, and userspace goes on and does syscall-b syscall-c ... syscall-z
before the lock gets released. Then all the lock chains of
syscall-[b..z] will be under lock-a.

This very easily leads to inversions, and is thus a strong reason to not
allow syscalls to 'leak' locks.

> >> This is a bit of an ugly work-around, and defeats the purpose of
> >> lockdep.
> >>
> >> Since I've now got another case where I want to hold a resource across
> >> syscalls, is there a better way to do this?
> >>
> >> This is for inode_dio_start(), which increments an inode int count, and
> >> inode_dio_end() which decrements it. If a task is doing
> >> inode_dio_start() and then inode_dio_wait(), I want to trigger this. I
> >> have a hack that does this, but it disables lockdep_sys_exit() as
> >> otherwise I just get that warning rather than the more useful one.
> >
> > Suppose syscall-a returns with your kiocb thing held, call it lock A
> > Suppose syscall-b returns with your inode thing held, call it lock B
> >
> > Then userspace does:
> >
> > syscall-a
> > syscall-b
> >
> > while it also does:
> >
> > syscall-b
> > syscall-a
> >
> > and we're up a creek, no?
>
> Should this not get caught by the usual lock ordering rules? Because

Well, yes, lockdep will yell, and a machine without lockdep will
deadlock.

This is bad syscall design if you can deadlock like this. We must assume
userspace is out to get us, if possible it will do this.

> that is obviously a bug, ordering would have to be consistent, just like
> if we have:
>
> syscall-a
> lock(a);
> lock(b);
>
> syscall-b
> lock(b);
> lock(a)

The difference is that in this case the full lock order is determined by
kernel code (under our full control), while in the earlier example, the
lock order is determined by syscall order -- out of our control.