Re: [sqlite] light weight write barriers

From: Vladislav Bolkhovitin
Date: Fri Oct 26 2012 - 21:53:13 EST



Nico Williams, on 10/24/2012 05:17 PM wrote:
Yes, SCSI has full support for ordered/simple commands designed exactly for
that task: [...]

[...]

But historically, for some reason, Linux storage developers were stuck with
the "barriers" concept, which is obviously not the same as ORDERED commands,
and hence had a lot of trouble with its ambiguous semantics. As far as I can
tell, the reason for that was a lack of sufficiently deep SCSI understanding
(how to handle errors, the belief that ACA is something legacy from parallel
SCSI times, etc.).

Barriers are a very simple abstraction, so there's that.

It isn't simple at all. If you think for some time about barriers from the storage point of view, you will soon realize how bad and ambiguous they are.

Before that happens, people will keep returning again and again with those
simple questions: why must the queue be flushed for any ordered operation?
Isn't it obvious overkill?

That [cache flushing]

It isn't cache flushing, it's _queue_ flushing. You can call it queue draining, if you like.

It often makes a big difference where that is done: on the system side or on the storage side.

Actually, the performance improvement from NCQ in many cases comes not from letting the drive reorder requests, as is commonly thought, but from keeping the drive's internal processing stages busy with no idle time. Drives often have a long internal pipeline, hence the need to keep every stage of it busy at all times, and hence why using ORDERED commands instead of draining the queue is important for performance.
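To put rough numbers on the pipeline argument, here is a toy model in Python. Everything in it is an assumption made for illustration: a pipeline of `depth` equal stages, one time unit per stage, and the hypothetical function name `total_time`. It is not a measurement of any real drive.

```python
def total_time(batches, depth):
    """Time units needed to process the given batches of commands.

    Toy assumptions: the drive is a pipeline of `depth` equal stages,
    each taking one time unit, and the queue is drained (the pipeline
    runs empty) between consecutive batches.
    """
    # A batch of n commands entering an empty pipeline finishes after
    # n + depth - 1 time units; the pipeline then sits empty until the
    # next batch is submitted.
    return sum(n + depth - 1 for n in batches if n > 0)


# 100 commands, 8-stage pipeline.
kept_full = total_time([100], depth=8)      # queue never drained
drained = total_time([10] * 10, depth=8)    # drained after every 10 commands
print(kept_full, drained)
```

In this model the only cost of draining is the refill time of the pipeline, paid once per drain, which is the idle time described above.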

is not what's being asked for here. Just a
light-weight barrier. My proposal works without having to add new
system calls: a) use a COW format, b) have background threads doing
fsync()s, c) in each transaction's root block note the last
known-committed (from a completed fsync()) transaction's root block,
d) have an array of well-known uberblocks large enough to accommodate
as many transactions as possible without having to wait for any one
fsync() to complete, e) do not reclaim space from any one past
transaction until at least one subsequent transaction is fully
committed. This obtains ACI- transaction semantics (survives power
failures but without durability for the last N transactions at
power-failure time) without requiring changes to the OS at all, and
with support for delayed D (durability) notification.
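The recovery side of such a scheme could be sketched in Python roughly as follows. This is only an illustration under assumptions that are not in the proposal itself: the slot layout (transaction id, root block address, last known-committed transaction id), the CRC32 check, and the `tree_is_intact` callback are all hypothetical.

```python
import struct
import zlib

# Hypothetical slot layout: txn id, root block address,
# last known-committed txn id, followed by a CRC32 of those fields.
SLOT_FMT = "<QQQ"
SLOT_SIZE = struct.calcsize(SLOT_FMT) + 4


def encode_slot(txn, root, last_committed):
    payload = struct.pack(SLOT_FMT, txn, root, last_committed)
    return payload + struct.pack("<I", zlib.crc32(payload))


def decode_slot(raw):
    """Return (txn, root, last_committed), or None for a torn slot."""
    if len(raw) != SLOT_SIZE:
        return None
    payload = raw[:-4]
    (crc,) = struct.unpack("<I", raw[-4:])
    if zlib.crc32(payload) != crc:
        return None
    return struct.unpack(SLOT_FMT, payload)


def recover(slots, tree_is_intact):
    """Pick the newest transaction that survived the power failure.

    tree_is_intact(root) is an assumed callback that verifies the COW
    tree under `root` (for example by checking block checksums). It is
    needed because, without ordering guarantees, a slot may reach the
    disk before the blocks it points to.
    """
    valid = [s for s in map(decode_slot, slots) if s is not None]
    for slot in sorted(valid, key=lambda s: s[0], reverse=True):
        if tree_is_intact(slot[1]):
            return slot
    return None
```

Because space from past transactions is not reclaimed until a later one is fully committed, the walk from newest to oldest should always reach a transaction whose tree is intact, at the cost of losing the last N uncommitted ones.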

I believe what you really want is to be able to send to the storage a sequence of your favorite operations (FS operations, async IO operations, etc.) like:

Write back caching disabled:

data op11, ..., data op1N, ORDERED data op1, data op21, ..., data op2M, ...

Write back caching enabled:

data op11, ..., data op1N, ORDERED sync cache, ORDERED FUA data op1, data op21, ..., data op2M, ...

Right?

(ORDERED means that the command is guaranteed never, under any circumstances, to be executed before all preceding commands have completed, or after any subsequent command has completed.)
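The ORDERED rule above can be written down as a small checker. This is a sketch in Python, not anything taken from a SCSI stack: the function name `respects_ordered` and the representation of commands as (name, is_ordered) pairs are made up for illustration.

```python
def respects_ordered(submitted, executed):
    """Check an execution order against the ORDERED rule.

    submitted: list of (name, is_ordered) pairs in submission order.
    executed:  list of command names in the order the device ran them.
    Commands that are not ORDERED may be reordered freely, as long as
    they do not cross an ORDERED command.
    """
    pos = {name: k for k, name in enumerate(executed)}
    names = {name for name, _ in submitted}
    if len(executed) != len(submitted) or set(pos) != names:
        return False  # not a permutation of the submitted commands
    for i, (name, is_ordered) in enumerate(submitted):
        if not is_ordered:
            continue
        # Everything submitted earlier must have run before it...
        if any(pos[prev] > pos[name] for prev, _ in submitted[:i]):
            return False
        # ...and everything submitted later must run after it.
        if any(pos[nxt] < pos[name] for nxt, _ in submitted[i + 1:]):
            return False
    return True


submitted = [("op11", False), ("op12", False), ("op1", True),
             ("op21", False), ("op22", False)]
print(respects_ordered(submitted, ["op12", "op11", "op1", "op22", "op21"]))
print(respects_ordered(submitted, ["op11", "op1", "op12", "op21", "op22"]))
```

The first order only shuffles commands within each group around op1 and is accepted; the second lets op12 slip past the ORDERED op1 and is rejected. Note that no queue draining is involved: the device can keep accepting commands the whole time.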

Vlad