Re: [PATCH v2 1/1] powerpc/iommu: Enable remaining IOMMU Pagesizes present in LoPAR

From: Alexey Kardashevskiy
Date: Fri Apr 09 2021 - 00:36:27 EST




On 08/04/2021 19:04, Michael Ellerman wrote:
> Alexey Kardashevskiy <aik@xxxxxxxxx> writes:
>> On 08/04/2021 15:37, Michael Ellerman wrote:
>>> Leonardo Bras <leobras.c@xxxxxxxxx> writes:
>>>> According to LoPAR, the ibm,query-pe-dma-window output named "IO Page Sizes"
>>>> will let the OS know all possible page sizes that can be used for creating a
>>>> new DDW.
>>>>
>>>> Currently Linux will only try using 3 of the 8 available options:
>>>> 4K, 64K and 16M. According to LoPAR, the hypervisor may also offer 32M, 64M,
>>>> 128M, 256M and 16G.

>>> Do we know of any hardware & hypervisor combination that will actually
>>> give us bigger pages?


>> On P8, 16MB host pages and 16MB hardware IOMMU pages worked.
>>
>> On P9, the VM's 16MB IOMMU pages worked on top of 2MB host pages + 2MB
>> hardware IOMMU pages.

> The current code already tries 16MB though.
>
> I'm wondering if we're going to ask for larger sizes that have never
> been tested and possibly expose bugs. But it sounds like this is mainly
> targeted at future platforms.


I tried for fun to pass through a PCI device to a guest with this patch as:

pbuild/qemu-killslof-aiku1904le-ppc64/qemu-system-ppc64 \
-nodefaults \
-chardev stdio,id=STDIO0,signal=off,mux=on \
-device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
-mon id=MON0,chardev=STDIO0,mode=readline \
-nographic \
-vga none \
-enable-kvm \
-m 16G \
-kernel ./vmldbg \
-initrd /home/aik/t/le.cpio \
-device vfio-pci,id=vfio0001_01_00_0,host=0001:01:00.0 \
-mem-prealloc \
-mem-path qemu_hp_1G_node0 \
-global spapr-pci-host-bridge.pgsz=0xffffff000 \
-machine cap-cfpc=broken,cap-ccf-assist=off \
-smp 1,threads=1 \
-L /home/aik/t/qemu-ppc64-bios/ \
-trace events=qemu_trace_events \
-d guest_errors,mmu \
-chardev socket,id=SOCKET0,server=on,wait=off,path=qemu.mon.1_1_0_0 \
-mon chardev=SOCKET0,mode=control


The guest created a huge window:

xhci_hcd 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 22 22 returned 0 (liobn = 0x80000001 starting addr = 8000000 0)

The first "22" is the page_shift in hex (0x22 = 34, i.e. 16GB pages); the second "22" is the window_shift, so the window contains just 1 TCE.

On the host side the window#1 was created with 1GB pages:
pci 0001:01 : [PE# fd] Setting up window#1 800000000000000..80007ffffffffff pg=40000000


The XHCI device seems to be working. Without the patch, 16MB was the maximum.



>>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>>>> index 9fc5217f0c8e..6cda1c92597d 100644
>>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>>> @@ -53,6 +53,20 @@ enum {
>>>> 	DDW_EXT_QUERY_OUT_SIZE = 2
>>>> };

>>> A comment saying where the values come from would be good.

>>>> +#define QUERY_DDW_PGSIZE_4K	0x01
>>>> +#define QUERY_DDW_PGSIZE_64K	0x02
>>>> +#define QUERY_DDW_PGSIZE_16M	0x04
>>>> +#define QUERY_DDW_PGSIZE_32M	0x08
>>>> +#define QUERY_DDW_PGSIZE_64M	0x10
>>>> +#define QUERY_DDW_PGSIZE_128M	0x20
>>>> +#define QUERY_DDW_PGSIZE_256M	0x40
>>>> +#define QUERY_DDW_PGSIZE_16G	0x80

>>> I'm not sure the #defines really gain us much vs just putting the
>>> literal values in the array below?

>> Then someone says "uuuuu magic values" :) I do not mind either way. Thanks,

> Yeah that's true. But #defining them doesn't make them less magic, if
> you only use them in one place :)

Defining them with "QUERY_DDW" in the names kinda tells where they are from. Can also grep QEMU using these to see how the other side handles it. Dunno.

btw the bot complained about __builtin_ctz(SZ_16G) which should be __builtin_ctzl(SZ_16G) so we have to ask Leonardo to repost anyway :)



--
Alexey