Re: O_DIRECT question

From: Bodo Eggert
Date: Sun Jan 14 2007 - 13:56:59 EST


On Sat, 13 Jan 2007, Bill Davidsen wrote:

> Bodo Eggert wrote:
>
> > (*) This would allow fadvise_size(), too, which could reduce fragmentation
> > (and give an early warning on full disks) without forcing e.g. fat to
> > zero all blocks. OTOH, fadvise_size() would allow users to reserve the
> > complete disk space without his filesizes reflecting this.
>
> Please clarify how this would interact with quota, and why it wouldn't
> allow someone to run me out of disk.

I fell into the "write-will-never-fail"-pit. Therefore I have to talk
about the original purpose, write with O_DIRECT, too.

- Reserved blocks should be taken out of the quota, since they are about
to be written right now. If you emptied your quota doing this, it's
your fault. It it was the group's quota, just run fast enough.-)

- If one write failed that extended the reserved range, the reserved area
should be shrunk again. Obviously you'll need something clever here.
* You can't shrink carelessly while there are O_DIRECT writes.
* You can't just try to grab the semaphore[0] for writing, this will
deadlock with other write()s.
* If you drop the read lock, it will work out, because you aren't
writing anymore, and if you get the write lock, there won't be anybody
else writing. Therefore you can clear the reservation for the not-
written blocks. You may unreserve blocks that should stay reserved,
but that won't harm much. At worst, you'll get fragmentation, loss
of speed and an aborted (because of no free space) write command.
Document this, it's a feature.-)

- If you fadvise_size on a non-quota-disk, you can possibly reserve it
completely, without being the easy-to-spot offender. You can do the
same by actually writing these files, keeping them open and unlinking
them. The new quality is: You can't just look at the file sizes in
/proc in order to spot the offender. However, if you reflect the
reserved blocks in the used-blocks-field of struct stat, du will
work as expected and the BOFH will know whom to LART.

BTW: If the fs supports holes, using du would be the right thing
to do anyway.


BTW2: I don't know if reserving without actually assigning blocks is
supported or easy to support at all. These reservations are the result of
"These blocks are not yet written, therefore they contain possibly secret
data that would leak on failed writes, therefore they may not be actually
assigned to the file before write finishes. They may not be on the free
list either. And hey, if we support pre-reserving blocks to the file, we
may additionally use it for fadvise_size. I'll mention that briefly."




[0] r/w semaphore, read={r,w}_odirect, write=ftruncate

--
Fun things to slip into your budget
Paradigm pro-activator (a whole pack)
(you mean beer?)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/