Re: Race-free block device opening

From: Demi Marie Obenour
Date: Sat May 07 2022 - 07:40:46 EST


On Wed, Apr 27, 2022 at 09:29:12AM -0400, James Bottomley wrote:
> On Tue, 2022-04-26 at 14:12 -0400, Demi Marie Obenour wrote:
> > Right now, opening block devices in a race-free way is incredibly
> > hard.
>
> Could you be more specific about what the race you're having problems
> with is? What is racing.

If I open /dev/mapper/qubes_dom0-vm--sys--net--private, it is possible
that something has destroyed the corresponding device and created a new
one with the same kernel name, *before* udev has managed to unlink the
device node. As a result, I wind up opening the wrong device.

> > The only reasonable approach I know of is sd_device_new_from_path() +
> > sd_device_open(), and is only available in systemd git main. It also
> > requires waiting on systemd-udev to have processed udev rules, which
> > can be a bottleneck.
>
> This doesn't actually seem to be in my copy of systemd.

That’s because it is not in any release yet.

> > There are better approaches in various special cases, such as using
> > device-mapper ioctls to check that the device one has opened still
> > has the name and/or UUID one expects. However, none of them works
> > for a plain call to open(2).
>
> Just so we're clear: if you call open on, say /dev/sdb1 and something
> happens to hot unplug and then replug a different device under that
> node, the file descriptor you got at open does *not* point to the new
> node. It points to a dead device responder that errors everything.
>
> The point being once you open() something, the file descriptor is
> guaranteed to point to the same device (or error).

That doesn’t help if the unplug and replug happen after the path has
been handed to whatever will open it, but before udev has purged the
now-stale symlink. In that case open() succeeds, on the wrong device.

> > A much better approach would be for udev to point its symlinks at
> > "/dev/disk/by-diskseq/$DISKSEQ" for non-partition disk devices, or at
> > "/dev/disk/by-diskseq/${DISKSEQ}p${PARTITION}" for partitions. A
> > filesystem would then be mounted at "/dev/disk/by-diskseq" that
> > provides for race-free opening of these paths. This could be
> > implemented in userspace using FUSE, either with difficulty using the
> > current kernel API, or easily and efficiently using a new kernel API
> > for opening a block device by diskseq + partition. However, I think
> > this should be handled by the Linux kernel itself.
> >
> > What would be necessary to get this into the kernel? I would like to
> > implement this, but I don’t have the time to do so anytime soon. Is
> > anyone else interested in taking this on? I suspect the kernel code
> > needed to implement this would be quite a bit smaller than the FUSE
> > implementation.
>
> So it sounds like the problem is you want to be sure that the device
> doesn't change after you've called libblkid to identify it but before
> you call open? If that's so, the way you do this in userspace is to
> call libblkid again after the open. If the before and after id match,
> you're as sure as you can be the open was of the right device.

The devices I am working with are raw-format VM disks that contain
untrusted data. They are identified not by their content, which the VM
has complete control over, but by various sysfs attributes such as
dm/name and dm/uuid. And they need to be passed to interfaces, such as
libvirt and cryptsetup, that only accept device paths.

I can work around this in the case of cryptsetup by using the
libcryptsetup library and/or holding a file descriptor open, but neither
of those will work for libvirt since libvirtd is a separate process and
I cannot pass a file descriptor to it. Furthermore, there is no way to
make libvirtd do any post-open() checking on the file descriptor it has
obtained. While I plan to add a workaround in libxl and blkback for
loop and device-mapper devices, it is not reasonable to expect every
userspace tool to do the same.
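
To illustrate the kind of post-open() check I mean, here is a minimal
Python sketch. It is not the actual libxl or blkback code. It assumes
the usual sysfs layout, where /sys/dev/block/MAJOR:MINOR/dm/name holds
the device-mapper name, and it assumes the device number cannot be
reassigned while the descriptor is held open. The function names and
the sysfs_root parameter are made up for this example.

```python
import os
import stat


def dm_attr_for_rdev(rdev, attr, sysfs_root="/sys"):
    """Read dm/<attr> (e.g. "name" or "uuid") for a block device number."""
    path = os.path.join(
        sysfs_root, "dev", "block",
        f"{os.major(rdev)}:{os.minor(rdev)}", "dm", attr,
    )
    with open(path, "r") as f:
        return f.read().rstrip("\n")


def open_dm_checked(path, expected_name, sysfs_root="/sys"):
    """Open path, then verify the opened device carries expected_name.

    The descriptor stays open during the check. Assumption: the device
    number is not handed to another device while it is open.
    """
    fd = os.open(path, os.O_RDONLY | os.O_CLOEXEC)
    try:
        st = os.fstat(fd)
        if not stat.S_ISBLK(st.st_mode):
            raise ValueError(f"{path} is not a block device")
        actual = dm_attr_for_rdev(st.st_rdev, "name", sysfs_root)
        if actual != expected_name:
            raise ValueError(f"expected {expected_name!r}, got {actual!r}")
        return fd
    except BaseException:
        # Do not leak the descriptor if any check fails.
        os.close(fd)
        raise
```

Every tool that opens a device by path would need something like this,
which is exactly the burden I would like to avoid.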

The approach I am suggesting avoids this problem entirely, because
/dev/mapper/qubes_dom0-vm--sys--net--private is now a symlink to the
device node /dev/disk/by-diskseq/$DISKSEQ. Disk sequence numbers are
never, ever reused, so neither are these paths. When the device goes
away, the device node goes away too, and so any attempt to open the
symlink (without O_PATH|O_NOFOLLOW) gets -ENOENT as it should.
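
A small Python simulation of that behaviour, as a sketch only: it uses
regular files in a temporary directory as stand-ins for device nodes,
and the helper names and the diskseq values 41 and 42 are hypothetical.
Nothing here touches a real /dev/disk/by-diskseq, which does not exist
as a kernel-provided filesystem today.

```python
import os
import tempfile


def by_diskseq_path(diskseq, partition=None, base="/dev/disk/by-diskseq"):
    """Build the proposed path: $DISKSEQ, or ${DISKSEQ}p${PARTITION}."""
    name = str(diskseq) if partition is None else f"{diskseq}p{partition}"
    return os.path.join(base, name)


def simulate_stale_symlink():
    """Return True if opening the stale symlink fails with ENOENT."""
    with tempfile.TemporaryDirectory() as root:
        base = os.path.join(root, "by-diskseq")
        os.mkdir(base)
        link = os.path.join(root, "qubes_dom0-vm--sys--net--private")

        # A device with diskseq 41 appears; the friendly name points at it.
        old = by_diskseq_path(41, base=base)
        open(old, "w").close()
        os.symlink(old, link)

        # The device is destroyed and a new one is created. The new one
        # gets a fresh diskseq, so the old path is never repopulated.
        os.unlink(old)
        open(by_diskseq_path(42, base=base), "w").close()

        # The symlink has not been cleaned up yet, but it cannot reach 42.
        try:
            os.close(os.open(link, os.O_RDONLY))
        except FileNotFoundError:
            return True
        return False
```

With kernel names the second device could land on the same node and
the stale symlink would silently resolve to it. Here the stale symlink
can only dangle.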
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
