Re: Question about: AMD-Vi: Event logged [IO_PAGE_FAULT ...

From: Joerg Roedel
Date: Mon Jan 05 2015 - 11:49:23 EST


Hello Raimonds,

On Mon, Jan 05, 2015 at 05:25:25PM +0200, Raimonds Cicans wrote:
> After a kernel upgrade (3.13 => 3.17) I started to receive the
> following string in my logs:
> AMD-Vi: Event logged [IO_PAGE_FAULT device=08:00.0 domain=0x001c
> address=0x0000000001355000 flags=0x0000]
>
> I would like to understand this problem more deeply, so it
> would be nice if somebody could correct my assumptions and
> answer my questions.
>
>
> Assumptions:
>
> 1) This message is generated by AMD IOMMU subsystem
> because PCIe device 08:00.0 tried to access memory
> region which was not mapped to any real memory
> (lspci show that this device is DVB-S2 receiver card
> TBS 6981)
>
> 2) Because flags are 0 and because in general receivers
> write to memory not read from memory it is memory
> write operation

Almost right, but flags=0 for this fault means it was a read
operation, not a write. The read targeted a page marked as non-present,
and that caused the fault.
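
As a rough illustration of how the flags value can be read, here is a
minimal Python sketch. The two bit masks are an assumption based on my
reading of the IO_PAGE_FAULT event layout in the AMD IOMMU specification
(RW and PR bits), and describe_flags() is a made-up helper name; please
check the masks against the spec revision for your hardware.

```python
# Bit positions within the 12-bit "flags" value printed by the kernel.
# Assumption: taken from the IO_PAGE_FAULT event layout in the AMD IOMMU
# specification; verify against the spec revision for your hardware.
FLAG_PR = 0x010  # 1 = page was marked present, 0 = page was non-present
FLAG_RW = 0x020  # 1 = write access, 0 = read access


def describe_flags(flags: int) -> str:
    """Return a short human-readable summary of an IO_PAGE_FAULT flags value."""
    access = "write" if flags & FLAG_RW else "read"
    page = "present" if flags & FLAG_PR else "non-present"
    return f"{access} to a {page} page"


# flags=0x0000 is the value from the log message quoted above.
print(describe_flags(0x0000))  # prints: read to a non-present page
```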

> 3) Possible causes:
> a) memory region was never mapped
> b) device accessed memory region before it was mapped
> c) device accessed memory region after it was unmapped

I'd vote for option c). The address reported in the fault is a device
virtual address. The value looks like it was handed out from the
DMA-address allocator in the AMD IOMMU driver, which means the address
was once mapped for the device.

>
> 3) Suspects:
> a) kernel's DMA subsystem: very unlikely
> b) kernel's IOMMU subsystem: very unlikely
> c) AMD IOMMU driver: unlikely? - I had problems with the AMD IOMMU
> itself in kernels 3.14 - 3.17 (AMD-Vi: Completion-Wait loop
> timed out)
> So maybe this problem is not fully fixed?

IO_PAGE_FAULTs are almost always a bug in the device driver for the
peripheral (or a bug in the firmware, but that is unlikely here).

But the "Completion-Wait loop timed out" message is also worrying. It
usually indicates broken firmware or broken hardware.

> d) Receiver's driver: likely

Yes, my guess is that the driver for the receiver device calls
dma_unmap_$foo on a memory region it still uses for DMA. That call
makes the AMD IOMMU driver unmap the region, so the device's next DMA
access to it fails with the message you see.
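
To make the suspected sequence concrete, here is a toy Python model of
it. This is only a sketch: ToyIommu and its methods are hypothetical names
invented for illustration and do not mirror the real driver code. It just
shows that an access which works while the page is mapped faults once the
page has been unmapped too early.

```python
PAGE_SIZE = 4096


class ToyIommu:
    """Toy model of per-device DMA translation; not the real driver logic."""

    def __init__(self):
        self.mapped_pages = set()  # device-virtual page numbers currently mapped
        self.faults = []           # addresses whose access could not be translated

    def dma_map(self, addr: int) -> None:
        self.mapped_pages.add(addr // PAGE_SIZE)

    def dma_unmap(self, addr: int) -> None:
        self.mapped_pages.discard(addr // PAGE_SIZE)

    def device_access(self, addr: int) -> bool:
        """Return True if the access translates, False if it faults."""
        if addr // PAGE_SIZE in self.mapped_pages:
            return True
        self.faults.append(addr)  # stands in for the logged IO_PAGE_FAULT
        return False


iommu = ToyIommu()
buf = 0x1355000                        # address taken from the log message
iommu.dma_map(buf)
ok_before = iommu.device_access(buf)   # mapped: the access succeeds
iommu.dma_unmap(buf)                   # driver unmaps while the device still uses buf
ok_after = iommu.device_access(buf)    # unmapped: the access faults
```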

> Questions:
> 1) What does 'domain=0x001c' mean?

This is just an internal handle: the ID of the DMA domain. It is
reported in the fault structure by the hardware and indicates whether the
device has been attached to a DMA domain at all.

> 2) Where can I find the definition of the possible flags?

In the AMD IOMMU specification, look for the IO_PAGE_FAULT reporting
structure. The flags reported in the kernel message are bits 16-27 of
the second 32-bit value.
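
A minimal Python sketch of that bit extraction follows. The helper names
and the sample words are made up for illustration, and placing the
domain-id in the low 16 bits of the same word is my assumption about the
layout; only the flags position (bits 16-27) is the part stated above.

```python
FLAGS_SHIFT = 16     # flags start at bit 16 of the second 32-bit word
FLAGS_MASK = 0xFFF   # 12 bits wide, i.e. bits 16-27
DOMID_MASK = 0xFFFF  # assumption: domain-id sits in bits 0-15 of the same word


def extract_flags(word1: int) -> int:
    """Return the 12-bit flags field from the second 32-bit event word."""
    return (word1 >> FLAGS_SHIFT) & FLAGS_MASK


def extract_domain_id(word1: int) -> int:
    """Return the domain-id, assuming it occupies the low 16 bits."""
    return word1 & DOMID_MASK


# Hypothetical raw words, constructed by hand for this example.
read_fault_word = 0x2000001C   # flags field is 0x000
write_fault_word = 0x2020001C  # flags field is 0x020

print(hex(extract_flags(read_fault_word)))      # prints: 0x0
print(hex(extract_flags(write_fault_word)))     # prints: 0x20
print(hex(extract_domain_id(read_fault_word)))  # prints: 0x1c
```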

> 3) What kind of address is written in message?
> - physical?
> - virtual?
> - address from the device's point of view?

It is a device virtual address, the address the device tried to access
but which was not mapped.
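
If you want to collect these faults from your logs, the fields can be
pulled apart with a small Python script like the one below. This is just a
sketch that assumes the message format shown in your mail (joined onto
one line); parse_fault() is a hypothetical helper name, and other kernel
versions may print the event differently.

```python
import re

# Pattern for the message format quoted in this thread (assumed stable only
# for this kernel version).
FAULT_RE = re.compile(
    r"IO_PAGE_FAULT device=(?P<device>[0-9a-fA-F:.]+) "
    r"domain=(?P<domain>0x[0-9a-fA-F]+) "
    r"address=(?P<address>0x[0-9a-fA-F]+) "
    r"flags=(?P<flags>0x[0-9a-fA-F]+)"
)


def parse_fault(line: str):
    """Return the fault fields as a dict, or None if the line does not match."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    return {
        "device": match.group("device"),             # PCI bus:device.function
        "domain": int(match.group("domain"), 16),    # internal domain-id
        "address": int(match.group("address"), 16),  # device virtual address
        "flags": int(match.group("flags"), 16),
    }


line = ("AMD-Vi: Event logged [IO_PAGE_FAULT device=08:00.0 domain=0x001c "
        "address=0x0000000001355000 flags=0x0000]")
fault = parse_fault(line)
print(fault["device"], hex(fault["address"]))  # prints: 08:00.0 0x1355000
```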


HTH,

Joerg
