Re: [PATCH] scatterlist: Validate page before calling PageSlab()

From: Alan Mikhak
Date: Mon Oct 07 2019 - 17:14:07 EST


On Mon, Oct 7, 2019 at 9:44 AM Alan Mikhak <alan.mikhak@xxxxxxxxxx> wrote:
>
> On Sun, Oct 6, 2019 at 11:13 PM Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> >
> > On Mon, Sep 30, 2019 at 04:22:35PM -0700, Alan Mikhak wrote:
> > > From: Alan Mikhak <alan.mikhak@xxxxxxxxxx>
> > >
> > > Modify sg_miter_stop() to validate the page pointer
> > > before calling PageSlab(). This check prevents a crash
> > > that will occur if PageSlab() gets called with a page
> > > pointer that is not backed by page struct.
> > >
> > > A virtual address obtained from ioremap() for a physical
> > > address in PCI address space can be assigned to a
> > > scatterlist segment using the public scatterlist API
> > > as in the following example:
> >
> > As Jason pointed out that is not a valid use of scatterlist. What
> > are you trying to do here at a higher level?
>
> I am developing a PCI endpoint framework function driver to bring-up
> an NVMe device over PCIe. The NVMe endpoint function driver connects
> to an x86_64 or other root-complex host over PCIe. Internally, the
> NVMe endpoint function driver connects to the unmodified Linux NVMe
> target driver running on the embedded CPU. The Linux NVMe target
> operates an NVMe namespace as determined for the application.
> Currently, the Linux NVMe target code operates a file-based namespace
> which is backed by the loop device. However, the application can be
> expanded to operate on non-volatile storage such as flash or
> battery-backed RAM. Currently, I am able to mount such an NVMe
> namespace from the x86_64 Debian Linux host across PCIe using the
> Disks App and perform Partition Benchmarking. I am also able to save
> and load files, such as trace files for debugging the NVMe endpoint
> with KernelShark, on the NVMe namespace partition nvme0n1p1.
>
> My goal is to not modify the Linux NVMe target code at all. The NVMe
> endpoint function driver currently does the work that is required. It
> maps NVMe PRPs and PRP Lists from the host, formats a scatterlist that
> NVMe target driver can consume, and executes the NVMe command with the
> scatterlist on the NVMe target controller on behalf of the host. The
> NVMe target controller can therefore read and write directly to host
> buffers using the scatterlist as it does if the scatterlist had
> arrived over the NVMe fabric.
>
> In my current platform, there are no page struct backing for the PCIe
> memory address space. Nevertheless, I am able to feed the virtual
> addresses I obtain from ioremap() to the scatterlist as shown in my
> example earlier. The scatterlist code has no problem traversing the
> scatterlist that is formed from such addresses that were obtained from
> ioremap(). The only place the scatterlist code prevents such usage is
> in sg_miter_stop() when it calls PageSlab() to decide if it should
> flush the page. I added a check to see if the page is valid and not
> call PageSlab() if it is not a page struct backed page. That is all I
> had to do to be able to pass scatterlists to the NVMe target.
>
> Given that the PCIe memory address space is large, the cost of adding
> page structs for that region is substantial enough for me to ask that
> it be considered here to modify scatterlist code to support such
> memory pointers that were obtained from ioremap(). If not acceptable,
> the solution would be to pay to price and add page structs for the
> PCIe memory address space.
>

Please consider the following information and cost estimate in
bytes for requiring page structs for PCI memory if used with
scatterlists. For example, a 128GB PCI memory address space
could require as much as 256MB of system memory just for
page struct backing. In a 1GB 64-bit system with flat memory
model, that consumes 25% of available memory. However,
not all of the 128GB PCI memory may be mapped for use at
a given time depending on the application. The cost of PCI
page structs is an upfront cost to be paid at system start.

pci memory start: 0x2000000000
pci memory size: 128GB 0x2000000000

pci page_size: 64KB 0x10000
pci page_shift: 16 0x10
pci pages: 2MB 0x200000
pci epc bitmap_size: 256KB 0x40000

pci page_size: 4KB 0x1000
pci page_shift: 12 0xc
pci pages: 32MB 0x2000000
pci epc bitmap_size: 4MB 0x400000

system page size: 4KB 0x1000
linux page struct size: 8B
pci page struct cost: 256MB 0x10000000

Regards,
Alan