Re: Phyr Starter

From: John Hubbard
Date: Tue Jan 11 2022 - 03:17:27 EST


On 1/10/22 11:34, Matthew Wilcox wrote:
TLDR: I want to introduce a new data type:

struct phyr {
phys_addr_t addr;
size_t len;
};

and use it to replace bio_vec as well as using it to replace the array
of struct pages used by get_user_pages() and friends.

---

This would certainly solve quite a few problems at once. Very compelling.

Zooming in on the pinning aspect for a moment: last time I attempted to
convert O_DIRECT callers from gup to pup, I recall wanting very much to
record, in each bio_vec, whether these pages were acquired via FOLL_PIN,
or some non-FOLL_PIN method. Because at the end of the IO, it is not
easy to disentangle which pages require put_page() and which require
unpin_user_page*().

And changing the bio_vec for *that* purpose was not really acceptable.

But now that you're looking to change it in a big way (and with some
spare bits avaiable...oohh!), maybe I can go that direction after all.

Or, are you looking at a design in which any phyr is implicitly FOLL_PIN'd
if it exists at all?

Or any other thoughts in this area are very welcome.


There are two distinct problems I want to address: doing I/O to memory
which does not have a struct page and efficiently doing I/O to large
blobs of physically contiguous memory, regardless of whether it has a
struct page. There are some other improvements which I regard as minor.

There are many types of memory that one might want to do I/O to that do
not have a struct page, some examples:
- Memory on a graphics card (or other PCI card, but gfx seems to be
the primary provider of DRAM on the PCI bus today)
- DAX, or other pmem (there are some fake pages today, but this is
mostly a workaround for the IO problem today)
- Guest memory being accessed from the hypervisor (KVM needs to
create structpages to make this happen. Xen doesn't ...)
All of these kinds of memories can be addressed by the CPU and so also
by a bus master. That is, there is a physical address that the CPU
can use which will address this memory, and there is a way to convert
that to a DMA address which can be programmed into another device.
There's no intent here to support memory which can be accessed by a
complex scheme like writing an address to a control register and then
accessing the memory through a FIFO; this is for memory which can be
accessed by DMA and CPU loads and stores.

For get_user_pages() and friends, we currently fill an array of struct
pages, each one representing PAGE_SIZE bytes. For an application that
is using 1GB hugepages, writing 2^18 entries is a significant overhead.
It also makes drivers hard to write as they have to recoalesce the
struct pages, even though the VM can tell it whether those 2^18 pages
are contiguous.

On the minor side, struct phyr can represent any mappable chunk of memory.
A bio_vec is limited to 2^32 bytes, while on 64-bit machines a phyr
can represent larger than 4GB. A phyr is the same size as a bio_vec
on 64 bit (16 bytes), and the same size for 32-bit with PAE (12 bytes).
It is smaller for 32-bit machines without PAE (8 bytes instead of 12).

Finally, it may be possible to stop using scatterlist to describe the
input to the DMA-mapping operation. We may be able to get struct
scatterlist down to just dma_address and dma_length, with chaining
handled through an enclosing struct.

I would like to see phyr replace bio_vec everywhere it's currently used.
I don't have time to do that work now because I'm busy with folios.
If someone else wants to take that on, I shall cheer from the sidelines.

I'm starting to wonder if I should jump in here, in order to get this
as a way to make the O_DIRECT conversion much cleaner. But let's see.

What I do intend to do is:

- Add an interface to gup.c to pin/unpin N phyrs
- Add a sg_map_phyrs()
This will take an array of phyrs and allocate an sg for them
- Whatever else I need to do to make one RDMA driver happy with
this scheme

At that point, I intend to stop and let others more familiar with this
area of the kernel continue the conversion of drivers.

P.S. If you've had the Prodigy song running through your head the whole
time you've been reading this email ... I'm sorry / You're welcome.
If people insist, we can rename this to phys_range or something boring,
but I quite like the spelling of phyr with the pronunciation of "fire".

A more conservative or traditional name might look like:

phys_vec (maintains some resemblance to what it's replacing)
phys_range
phys_addr_range

phyr is rather cool, but it also is awfully too close to "phys" for
reading comfort. And there is a lot to be said for self-descriptive names,
which phyr is not.

And of course, you're signing for another huge naming debate with Linus
if you go with the "cool" name here. :)


thanks,
--
John Hubbard
NVIDIA