Re: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported

From: Alex Williamson
Date: Fri May 13 2016 - 01:32:57 EST


On Fri, 13 May 2016 02:33:18 +0000
"Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:

> > From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx]
> > Sent: Friday, May 13, 2016 1:48 AM
> >
> > On Thu, 12 May 2016 04:53:19 +0000
> > "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
> >
> > > > From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx]
> > > > Sent: Thursday, May 12, 2016 10:21 AM
> > > >
> > > > On Thu, 12 May 2016 01:19:44 +0000
> > > > "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
> > > >
> > > > > > From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx]
> > > > > > Sent: Wednesday, May 11, 2016 11:54 PM
> > > > > >
> > > > > > On Wed, 11 May 2016 06:29:06 +0000
> > > > > > "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
> > > > > >
> > > > > > > > From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx]
> > > > > > > > Sent: Thursday, May 05, 2016 11:05 PM
> > > > > > > >
> > > > > > > > On Thu, 5 May 2016 12:15:46 +0000
> > > > > > > > "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > > > From: Yongji Xie [mailto:xyjxie@xxxxxxxxxxxxxxxxxx]
> > > > > > > > > > Sent: Thursday, May 05, 2016 7:43 PM
> > > > > > > > > >
> > > > > > > > > > Hi David and Kevin,
> > > > > > > > > >
> > > > > > > > > > On 2016/5/5 17:54, David Laight wrote:
> > > > > > > > > >
> > > > > > > > > > > From: Tian, Kevin
> > > > > > > > > > >> Sent: 05 May 2016 10:37
> > > > > > > > > > > ...
> > > > > > > > > > >>> Acutually, we are not aimed at accessing MSI-X table from
> > > > > > > > > > >>> guest. So I think it's safe to passthrough MSI-X table if we
> > > > > > > > > > >>> can make sure guest kernel would not touch MSI-X table in
> > > > > > > > > > >>> normal code path such as para-virtualized guest kernel on PPC64.
> > > > > > > > > > >>>
> > > > > > > > > > >> Then how do you prevent malicious guest kernel accessing it?
> > > > > > > > > > > Or a malicious guest driver for an ethernet card setting up
> > > > > > > > > > > the receive buffer ring to contain a single word entry that
> > > > > > > > > > > contains the address associated with an MSI-X interrupt and
> > > > > > > > > > > then using a loopback mode to cause a specific packet be
> > > > > > > > > > > received that writes the required word through that address.
> > > > > > > > > > >
> > > > > > > > > > > Remember the PCIe cycle for an interrupt is a normal memory write
> > > > > > > > > > > cycle.
> > > > > > > > > > >
> > > > > > > > > > > David
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > If we have enough permission to load a malicious driver or
> > > > > > > > > > kernel, we can easily break the guest without exposed
> > > > > > > > > > MSI-X table.
> > > > > > > > > >
> > > > > > > > > > I think it should be safe to expose MSI-X table if we can
> > > > > > > > > > make sure that malicious guest driver/kernel can't use
> > > > > > > > > > the MSI-X table to break other guest or host. The
> > > > > > > > > > capability of IRQ remapping could provide this
> > > > > > > > > > kind of protection.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > With IRQ remapping it doesn't mean you can pass through MSI-X
> > > > > > > > > structure to guest. I know actual IRQ remapping might be platform
> > > > > > > > > specific, but at least for Intel VT-d specification, MSI-X entry must
> > > > > > > > > be configured with a remappable format by host kernel which
> > > > > > > > > contains an index into IRQ remapping table. The index will find a
> > > > > > > > > IRQ remapping entry which controls interrupt routing for a specific
> > > > > > > > > device. If you allow a malicious program random index into MSI-X
> > > > > > > > > entry of assigned device, the hole is obvious...
> > > > > > > > >
> > > > > > > > > Above might make sense only for a IRQ remapping implementation
> > > > > > > > > which doesn't rely on extended MSI-X format (e.g. simply based on
> > > > > > > > > BDF). If that's the case for PPC, then you should build MSI-X
> > > > > > > > > passthrough based on this fact instead of general IRQ remapping
> > > > > > > > > enabled or not.
> > > > > > > >
> > > > > > > > I don't think anyone is expecting that we can expose the MSI-X vector
> > > > > > > > table to the guest and the guest can make direct use of it. The end
> > > > > > > > goal here is that the guest on a power system is already
> > > > > > > > paravirtualized to not program the device MSI-X by directly writing to
> > > > > > > > the MSI-X vector table. They have hypercalls for this since they
> > > > > > > > always run virtualized. Therefore a) they never intend to touch the
> > > > > > > > MSI-X vector table and b) they have sufficient isolation that a guest
> > > > > > > > can only hurt itself by doing so.
> > > > > > > >
> > > > > > > > On x86 we don't have a), our method of programming the MSI-X vector
> > > > > > > > table is to directly write to it. Therefore we will always require QEMU
> > > > > > > > to place a MemoryRegion over the vector table to intercept those
> > > > > > > > accesses. However with interrupt remapping, we do have b) on x86, which
> > > > > > > > means that we don't need to be so strict in disallowing user accesses
> > > > > > > > to the MSI-X vector table. It's not useful for configuring MSI-X on
> > > > > > > > the device, but the user should only be able to hurt themselves by
> > > > > > > > writing it directly. x86 doesn't really get anything out of this
> > > > > > > > change, but it helps this special case on power pretty significantly
> > > > > > > > aiui. Thanks,
> > > > > > > >
> > > > > > >
> > > > > > > Allowing guest direct write to MSI-x table has system-wide impact.
> > > > > > > As I explained earlier, hypervisor needs to control "interrupt_index"
> > > > > > > programmed in MSI-X entry, which is used to associate a specific
> > > > > > > IRQ remapping entry. Now if you expose whole MSI-x table to guest,
> > > > > > > it can program random index into MSI-X entry to associate with
> > > > > > > any IRQ remapping entry and then there won't be any isolation per se.
> > > > > > >
> > > > > > > You can check "5.5.2 MSI and MSI-X Register Programming" in VT-d
> > > > > > > spec.
> > > > > >
> > > > > > I think you're extrapolating beyond the vfio interface. The change
> > > > > > here is to remove the vfio protection of the MSI-X vector table when
> > > > > > the system provides interrupt isolation. The argument is that this is
> > > > > > safe to do because the hardware protects the host from erroneous and
> > > > > > malicious user programming, but it is not meant to provide a means to
> > > > > > program MSI-X directly through the vector table. This is effectively
> > > > >
> > > > > Sorry I didn't get this point. Once we allow userspace to mmap MSI-X
> > > > > table, isn't it equivalent to allowing userspace directly program vector
> > > > > table? Or is there other mechanism to prevent direct programming?
> > > >
> > > > This would remove the mechanism that prevents direct programming.
> > > > Users can configure the device to perform DMA w/o setting up an IOMMU
> > > > mapping, which generates a fault. Users can incorrectly manipulate the
> > > > MSI-X vector table instead of using the proper vfio programming
> > > > interface, which generates a fault.
> > > >
> > > > > > the same as general DMA programming, if the vfio programming model is
> > > > > > not followed the device generates iommu faults. I do have a concern
> > > > > > that userspace driver writers are going to more often presume they can
> > > > > > use the vector table directly because of this change, but I don't know
> > > > > > that that is sufficient reason to prevent such a change. They'll
> > > > > > quickly discover the device generates faults on interrupt rather than
> > > > > > working as expected.
> > > > >
> > > > > If userspace can actually program vector table directly, there is not
> > > > > always fault triggered. As long as MSI-X table is fully under control
> > > > > of userspace, any interrupt index can be used here which may link
> > > > > to a IRQ remapping entry allocated for other devices.
> > > >
> > > > Is not part of the vector lookup comparing the source ID of the DMA
> > > > write? If not then VT-d interrupt remapping offers us no protection
> > > > from a malicious guest performing DMA to the same address. Thanks,
> > > >
> > >
> > > For x86 IRQ remapping doesn't rely on existing DMAR remapping table for
> > > DMA. Instead IRQ remapping table is a global table per VT-d engine, shared
> > > by all devices behind this VT-d engine. Each IRQ remapping table entry
> > > (IRTE) can specify:
> > >
> > > - whether to validate source-id of the interrupt requests;
> > > - whether to verify the whole source id;
> > > - whether to verify some bits of the source id;
> > > ...
> > > (section 9.10 Interrupt Remapping Table Entry (IRTE) in VT-d spec)
> > >
> > > IRTE index is programmed into corresponding MSI-X entry of a device.
> > > When an interrupt message is triggered from that device, IRQ remapping
> > > hardware will use the index to find the IRTE entry.
> > >
> > > As long as host kernel has strict control of MSI-X table, it can choose to
> > > not validate source-id as long as IRTE index is trusted.
> > >
> > > That is why I concerned you cannot do this simply based on whether
> > > IRQ remapping is available on x86. If such usage is really required,
> > > you'll need some more accurate indicator per iommu structure to tell
> > > whether safe to allow mmap MSI-X to user space.
> >
> > As argued previously in this thread, there's nothing special about a
> > DMA write to memory versus a DMA write to a special address that
> > triggers an MSI vector. If the device is DMA capable, which we assume
> > all are, it can be fooled into generating those DMA writes regardless
> > of whether we actively block access to the MSI-X vector table itself.
>
> But with DMA remapping above can be blocked.

How? VT-d explicitly ignores DMA writes to 0xFEEx_xxxx, section 3.13:

Write requests without PASID of DWORD length are treated as interrupt
requests. Interrupt requests are not subjected to DMA remapping[...]
Instead, remapping hardware can be enabled to subject such interrupt
requests to interrupt remapping.

> > With respect to the IRTE, these entries are always under host control,
> > and aside from the potential opt-in gap when using the intremap=nosid
> > boot option, they always make use of source-id validation as Linux is
> > written today. Compatibility mode support is also disabled. Nothing
> > here changes that. In fact, I suspect the only advantage to blocking
>
> VFIO as a standalone module shouldn't make assumption on how
> IRQ remapping is actually implemented in kernel (especially when
> VT-d spec allows no source ID verification).

vfio is NOT a stand alone module, it is part of the kernel that happens
to be configurable as a module. We have the ability to change with the
kernel and influence how the kernel behaves.

> > MSI-X vector table access w/o interrupt remapping is to avoid obvious
> > collisions if it were to be programmed directly, it doesn't actually
> > prevent an identical DMA transaction from being generated by other
>
> Kernel can enable DMA remapping but disable IRQ remapping. In such
> case identical DMA transaction can be prevented.

Not according to the VT-d spec as quoted above. If so, how?

> > means. The MSI-X vector table of a device is always considered
> > untrusted which is why we require user opt-ins to subvert that
> > protection. Thanks,
> >
>
> I only partially agree with this statement since there is different
> trust level for kernel space driver and user space driver.
>
> OS may choose to trust kernel space driver then it can enable IRQ
> remapping only for load balance purpose w/o source id verfification.
> In such case MSI-X vector table is trusted and fully under kernel
> control. Allowing to mmap MSI-X table in user space definitely
> breaks that boundary.

The OS may choose to do this, but we're part of the OS and it doesn't
do this and has no reason to do this.

> Anyway my point is simple. Let's ignore how Linux kernel implements
> IRQ remapping on x86 (which may change time to time), and just
> focus on architectural possibility. Non-x86 platform may implement
> IRQ remapping completely separate from device side, then checking
> availability of IRQ remapping is enough to decide whether mmap
> MSI-X table is safe. x86 with VT-d can be configured to a mode
> requiring host control of both MSI-X entry and IRQ remapping hardware
> (without source id check). In such case it's insufficient to make
> decision simply based on IRQ remapping availability. We need a way
> to query from IRQ remapping module whether it's actually safe to
> mmap MSI-X.

We're going in circles here. This patch is attempting to remove
protection from the MSI-X vector table that is really nothing more than
security theater already. That "protection" only actually prevents
casual misuse of the API which is really only a problem when the
platform offers no form of interrupt isolation, such as VT-d with DMA
remapping but not interrupt remapping. Disabling source-id checking in
VT-d should be handled as the equivalent of disabling interrupt
remapping altogether as far as the IOMMU API is concerned. That's
a trivial gap that should be fixed. There is no such thing as a secure
MSI-X vector table when untrusted userspace drivers are involved, we
must always assume that a device can generate DMA writes that are
indistinguishable from actual interrupt requests and if the platform
does not provide interrupt isolation we should require the user to
opt-in to an unsafe mode.

Simply denying direct writes to the vector table or preventing mapping
of the vector table into the user address space does not provide any
tangible form of protection. Many devices make use of window registers
that allow backdoors to arbitrary device registers. Some drivers even
use this as the primary means for configuring MSI-X, which makes them
incompatible with device assignment without device specific quirks to
enable virtualization of these paths.

If you have an objection to this patch, please show me how preventing
direct CPU access to the MSI-X vector table provides any kind of
security guarantee of the contents of the vector table and also prove
to me that a device cannot spoof a DMA write that is indistinguishable
from one associated with an actual interrupt, regardless of the
contents of the MSI-X vector table. Thanks,

Alex