Re: [PATCH] udevadm-info: Don't access sysfs 'resource<N>' files

From: Myron Stowe
Date: Tue Mar 19 2013 - 13:07:17 EST


On Tue, 2013-03-19 at 10:57 -0600, Myron Stowe wrote:
> On Mon, 2013-03-18 at 12:59 -0600, Alex Williamson wrote:
> > On Mon, 2013-03-18 at 19:25 +0100, BjÃrn Mork wrote:
> > > Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:
> > >
> > > >On Mon, 2013-03-18 at 18:20 +0100, BjÃrn Mork wrote:
> > > >> Alex Williamson <alex.williamson@xxxxxxxxxx> writes:
> > > >>
> > > >> > At least for KVM the kernel fix is the addition of the vfio driver
> > > >which
> > > >> > gives us a non-sysfs way to do this. If this problem was found a
> > > >few
> > > >> > years later and we were ready to make the switch I'd support just
> > > >> > removing these resource files. In the meantime we have userspace
> > > >that
> > > >> > depends on this interface, so I'm open to suggestions how to fix
> > > >it.
> > > >>
> > > >> I am puzzled by a couple of things in this discussion:
> > > >>
> > > >> 1) do you seriously mean that a userspace application (any, not just
> > > >> udevadm or qemu or whatever) should be able to read and write
> > > >these
> > > >> registers while the device is owned by a driver? How is that ever
> > > >> going to work?
> > > >
> > > >The expectation is that the user doesn't mess with the device through
> > > >pci-sysfs while it's running. This is really no different than config
> > > >space or MMIO space in that respect.
> > >
> > > But it is. That's the problem. As a user I expect to be able to run
> > > e.g "grep . /sys/devices/whatever/*" with no ill effects. This holds
> > > for config space or MMIO space. It does not for any reset-on-read
> > > register.
> >
> > As a non-admin user you can
> >
> > > > You can use setpci to break your
> > > >PCI card while it's used by the driver today. The difference is that
> > > >MMIO spaces side-step the issue by only allowing mmap and config space
> > > >is known not to have read side-effects.
> > >
> > > Yes. And that is why there is no problem exporting those. This
> > > difference is fundamental.
> >
> > So how do we side-step the problem with I/O port registers? If we
> > remove them then KVM needs to run with iopl which is a pretty serious
> > security hole should QEMU be exploited. We could activate the resource
> > files only when the device is bound to pci-assign, but that only limits
> > the scope and might break UIO drivers. We could modify the file to have
> > an enable sequence, but we can't do this without breaking current
> > userspace. As I mentioned, the VFIO driver is intended to replace KVM's
> > use of these files, but we're not ready to rip it out, perhaps not even
> > ready to declare it deprecated.
> >
> > > >> 2) is it really so that a device can be so fundamentally screwed up
> > > >by
> > > >> reading some registers, that a later driver probe cannot properly
> > > >> reinitialize it?
> > > >
> > > >Never underestimate how broken hardware can be,
> > >
> > > True :)
> > >
> > > > though in this case
> > > >reading a device register seems to be causing a system hang/reset.
> > >
> > > I understand that it does so if the ahci driver is bound to the device
> > > while reading the registers, but does it also hang the system with no
> > > bound driver? How does it do that? By killing the bus?
> >
> > I don't know, Myron?
>
> Yes - the system hangs when BAR1's (and likely BAR3's) I/O port space is
> read.
Sorry - that wasn't very explicit. Just accessing BAR1's region as
udevadm does is enough to hang the system - even when no driver is
bound.
>
> Here are the details that I've been able to put together from the two
> linux-pci threads and various online sources -
>
>
> From Robert Hancock - "... BAR5 is the MMIO region used by the AHCI
> driver. BARs 0-4 are the legacy SFF-compatible ATA ports. Nothing
> should be messing with those IO ports while AHCI is enabled. ..." This
> likely explains why the system boots and runs fine as long as the
> 'udevadm ...' command is *not* ran (i.e. the driver never accesses the
> I/O port BARs).
>
> Using a SATA controller I have access to as an example for the details
> (Note: I do not have access to a system with the Marvell 9125 device):
> 00:1f.2 SATA controller: Intel Corporation 5 Series/3400 Series Chipset 6 port SATA AHCI Controller (rev 06) (prog-if 01 [AHCI 1.0])
> Subsystem: Lenovo Device 2168
> Region 0: I/O ports at 1860 [size=8]
> Region 1: I/O ports at 1814 [size=4]
> Region 2: I/O ports at 1818 [size=8]
> Region 3: I/O ports at 1810 [size=4]
> Region 4: I/O ports at 1840 [size=32]
> Region 5: Memory at f2827000 (32-bit, non-prefetchable) [size=2K]
>
> I/O port registers [1][2]:
> Primary IDE controller [0x1860-0x1867; 0x1814-0x1817]
> BAR0 Base address for the command block registers for ATA Channel X
> 0x1860 (Read/Write): Data Register
> 0x1861 (Read): Error Register
> 0x1861 (Write): Features Register
> 0x1862 (Read/Write): Sector Count Register
> 0x1863 (Read/Write): LBA Low Register
> 0x1864 (Read/Write): LBA Mid Register
> 0x1865 (Read/Write): LBA High Register
> 0x1866 (Read/Write): Drive/Head Register
> 0x1867 (Read): Status Register
> 0x1867 (Write): Command Register
> BAR1* Base address for the control register for ATA Channel X
> 0x1814 Reserved
> 0x1815 Reserved
> 0x1816 (Read): Alternate Status Register
> 0x1816 (Write): Device Control Register
> 0x1817 Reserved
>
> * The base must be Dword aligned; a PCI requirement. The Device Control
> and Alternate Status Registers are at ofset 0x2 from this base.
>
> [1] www.t13.org/documents/UploadedDocuments/project/d1510r1-Host-Adapter.pdf
> [2] lateblt.tripod.com/atapi.htm
>
> From Xiangliang - executing 'udevadm ...' causes a 32-bit I/O port read
> to BAR1's region. This is shown by the BE (Byte Enable) value of
> 0x1111. So apparently reads to this region that include any of reserved
> Bytes causes "the chip will go bad."
>
> So, only a Byte access at offset 2 is successful. I have not been able
> to get any more details as to the exact cause of the hang. I would have
> thought that the PCI transaction would have just timed out, or errored
> out, or something but apparently the platform ends up hanging.
>
> It appears that this device did not implement the reserved registers
> such that they would return 0 on reads or something more similarly sane.
>
> Since BARs 2 and 3 are not 0, indicating the device only supports one
> channel, I expect the same issue will occur when accessing BAR3. Again,
> I do not have access to a system with this device to test with.
>
> >
> > > >> I would have thought that the solution to all this was to return
> > > >-EINVAL
> > > >> on any attemt to read or write these files while a driver is bound to
> > > >> the device. If userspace is going to use the API, then the
> > > >application
> > > >> better unbind any driver first.
> > > >>
> > > >> Or? Am I missing something here?
> > > >
> > > >That doesn't really solve anything though. Let's pretend the resource
> > > >files only work while the device is bound to pci-stub. Now what
> > > >happens
> > > >when you run this udevadm command as admin while it's in use by the
> > > >userspace driver? All we've done is limit the scope of the problem.
> > >
> > > Assuming that the system hangs without driver help and that this
> > > brokenness is widespread. I don't think any of those assumptions hold.
> > > Do they?
> >
> > I thought it was true that for this device a system hang happened
> > regardless of the host driver, but haven't seen the original bug report.
> > As for widespread, this is the first I've heard of problems in the 2.5+
> > years that we've supported these I/O port resource files. The rest is
> > probably just FUD about random userspace apps trolling through device
> > registers.
> >
> > > >> > If we want to blacklist this specific device, that's fine, but as
> > > >others
> > > >> > have pointed out it's really a class problem. Perhaps we report 1
> > > >byte
> > > >> > extra for the file length where EOF-1 is an enable byte? Is there
> > > >> > anything else in file ops that we could use to make it slightly
> > > >more
> > > >> > complicated than open(), read() to access the device? Thanks,
> > > >>
> > > >> If there really are devices which cannot handle reading at all, and
> > > >> cannot be reset to a sane state by later driver initialization, then
> > > >a
> > > >> blacklist could be added for those devices. This should not be a
> > > >common
> > > >> problem.
> > > >
> > > >Yes, if these are dead registers, let's blacklist and move along. I
> > > >suspect though that these registers probably work fine if you access
> > > >them according to the device programming model, so blacklisting just
> > > >prevents full use through something like KVM device assignment.
> > >
> > > Well, if the device is that broken then I think it will require the
> > > kernel to police the device programming. I don't see how you can leave
> > > a bomb like that because it might be useful in a rare and very
> > > theoretical case.
> > >
> > > Easier to just blacklist it...
> >
> > Easier, yes. But it likely just kicks the problem down the road until
> > the next device. Thanks,
> >
> > Alex
> >
> >
>
>


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/