Re: Could it be made possible to offer "supplementary" data to a DIO write ?

From: David Howells
Date: Thu Aug 05 2021 - 10:38:14 EST


Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:

> You can already get 400Gbit ethernet.

Sorry, but that's not likely to become relevant any time soon. Besides, my
laptop's wifi doesn't really do that yet.

> Saving 500 bytes by sending just the 12 bytes that changed is optimising the
> wrong thing.

In one sense, at least, you're correct. The cost of setting up an RPC to do
the write and setting up crypto is high compared to the difference between
transmitting 3 bytes and transmitting 4k bytes.

> If you have two clients accessing the same file at byte granularity, you've
> already lost.

Doesn't stop people doing it, though. People have sqlite, dbm, mail stores,
whatever in their homedirs, courtesy of their desktop environments. Granted,
most of the time people don't log in twice with the same homedir from two
different machines (and that doesn't, or at least didn't use to, work with
Gnome or KDE).

> Extent based filesystems create huge extents anyway:

Okay, so it's not feasible. That's fine.

> This has already happened when you initially wrote to the file backing
> the cache. Updates are just going to write to the already-allocated
> blocks, unless you've done something utterly inappropriate to the
> situation like reflinked the files.

Or the file is being read random-access and we now have a block we didn't have
before that is contiguous with another block we already have.

> If you want to take leases at byte granularity, and then not writeback
> parts of a page that are outside that lease, feel free. It shouldn't
> affect how you track dirtiness or how you writethrough the page cache
> to the disk cache.

Indeed. Handling writes to the local disk cache is different from handling
writes to the server(s). The cache has a larger block size but I don't have
to worry about third-party conflicts on it, whereas the server can be taken as
having no minimum block size, but my write can clash with someone else's.

Generally, I prefer to write back the minimum I can get away with (as does the
Linux NFS client AFAICT).
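
To illustrate what "the minimum I can get away with" might mean in practice,
here is a rough sketch of tracking a single dirty byte span per page and only
writing that span back. The struct and function names are made up for the
example and are not taken from any existing kernel code; a real client would
also have to deal with multiple disjoint spans, locking and so on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-page record of the dirty byte span [start, end). */
struct dirty_span {
	size_t start;	/* first dirty byte */
	size_t end;	/* one past the last dirty byte; start == end means clean */
};

/* Widen the span so that it also covers a write of len bytes at off. */
static void dirty_span_add(struct dirty_span *s, size_t off, size_t len)
{
	if (len == 0)
		return;
	if (s->start == s->end) {	/* page was clean */
		s->start = off;
		s->end = off + len;
		return;
	}
	if (off < s->start)
		s->start = off;
	if (off + len > s->end)
		s->end = off + len;
}

/* Number of bytes that would have to be sent to the server. */
static size_t dirty_span_len(const struct dirty_span *s)
{
	assert(s->end >= s->start);
	return s->end - s->start;
}
```

With this, a 3-byte change at offset 100 yields a 3-byte write; a second
change nearby merely widens the span rather than dirtying the whole page.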

However, if everyone agrees that we should only ever write back a multiple of
a certain block size, even to network filesystems, what block size should that
be? Note that PAGE_SIZE varies across arches and folios are going to
exacerbate this. What I don't want to happen is that you read from a file, it
creates, say, a 4M (or larger) folio; you change three bytes and then you're
forced to write back the entire 4M folio.

Note that when content crypto or compression is employed, some multiple of the
size of the encrypted/compressed blocks would be a requirement.
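
To put rough numbers on the two paragraphs above, the following sketch rounds
a dirty span outwards to a given granule and reports how much would be written
back. The function names are invented for illustration, and the granule is
just a parameter: 1 for byte granularity, 4096 for a typical page, 4M for a
large folio, or whatever the crypto/compression block size happens to be.

```c
#include <assert.h>
#include <stdint.h>

/* Round [*start, *end) outwards to multiples of gran bytes (gran > 0). */
static void round_out(uint64_t *start, uint64_t *end, uint64_t gran)
{
	assert(gran > 0);
	*start -= *start % gran;
	if (*end % gran)
		*end += gran - *end % gran;
}

/* Bytes written back for len dirty bytes at off, given a granule of gran. */
static uint64_t writeback_bytes(uint64_t off, uint64_t len, uint64_t gran)
{
	uint64_t start = off, end = off + len;

	round_out(&start, &end, gran);
	return end - start;
}
```

For three bytes changed at offset 5000, writeback_bytes() gives 3 with a
granule of 1, 4096 with a granule of 4096 and 4194304 with a 4M granule. With
a hypothetical 16-byte cipher block it gives 16, which is the sort of minimum
content crypto would impose.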

David