On Thu, Apr 02 2009, Mikulas Patocka wrote:
> On Tue, 31 Mar 2009, Jens Axboe wrote:
> > On Mon, Mar 30 2009, Mikulas Patocka wrote:
> > > On Thu, 26 Mar 2009, Jens Axboe wrote:
> > > > On Wed, Mar 25 2009, Mikulas Patocka wrote:
> > > > > > > > > So I think there should be a flag (this device does/doesn't support data
> > > > > > > > > consistency) that the journaled filesystems can use to mark the disk dirty
> > > > > > > > > for fsck. And if you implement this flag, you can always accept barriers
> > > > > > > > > to all kinds of devices regardless of whether they support consistency. You
> > > > > > > > > can then get rid of that -EOPNOTSUPP and simplify filesystem code because
> > > > > > > > > they'd no longer need two commit paths and a clumsy way to restart
> > > > > > > > > -EOPNOTSUPPed requests.
> > > > > > > >
> > > > > > > > And my point is that this case isn't interesting, because most setups
> > > > > > > > don't guarantee proper ordering.
> > > > > > >
> > > > > > > If the ordering isn't guaranteed, the filesystem should know about it, and
> > > > > > > mark the partition for fsck. That's why I'm suggesting to use a flag for
> > > > > > > that. That flag could be also propagated up through md and dm.
> > > > > >
> > > > > > We can do that, not a problem. The problem is that ordering is almost
> > > > > > never preserved, SCSI does not use ordered tags because it hasn't
> > > > > > verified that its error path doesn't reorder by mistake. So right now
> > > > > > you can basically use 'false' as that flag.
> > > > >
> > > > > There are three ordering guarantees:
> > > > >
> > > > > 1. - nothing (for devices with write cache without cache control)
> > > > >
> > > > > 2. - non-cached ordering: the sequence [submit req a, end req a, submit
> > > > > req b, end req b] will make the ordering. It is guaranteed that when the
> > > > > request ends successfully, it is on medium. This is what all the
> > > > > filesystems, md and dm assume about disks. This consistency model was used
> > > > > long before barriers came in.
> > > > >
> > > > > 3. - barrier ordering: ordering is done with barriers, [submit req a, end
> > > > > req a, submit req b, end req b] won't guarantee ordering of a and b, a
> > > > > barrier must be inserted.
> > > >
> > > > Plus the barrier also allows [submit req a, submit req b] and you can still
> > > > count on ordering if either one of them is a barrier. It doesn't have to
> > > > be sync, like (2).
> > > >
> > > > > --- so you can make two bitflags that differentiate these models. In
> > > > > the current kernel, models (1) and (2) cannot be differentiated in any way. (3)
> > > > > can be differentiated only after a trial write and it won't guarantee that
> > > > > (3) will stay valid afterwards.
> > > >
> > > > But what's the point? Basically no devices are naturally ordered by
> > > > default. Either you need cache flushes, or you need to tell the device
> > > > not to reorder on a per-command basis.
> > > >
> > > > > The reasoning: "write barriers aren't supported => the device doesn't
> > > > > guarantee consistency" isn't valid.
> > > >
> > > > It's valid in the sense that it's the only RELIABLE primitive we have.
> > > > Are you really suggesting that we just assume any device is fully
> > > > ordered, unless proven otherwise?
> > >
> > > If someone implements "write barriers aren't supported => run fsck", then
> > > a lot of systems start fscking needlessly (for example those using md or
> > > dm without write cache) and become inoperable for a long time because of
> > > that. So no one can really implement this logic and filesystems don't run
> > > fsck at all when operated over a device that doesn't support ordering. So
> > > you get data corruption if you get a crash on those devices.
> > >
> > > I am saying that the filesystem should run fsck if a journaled filesystem is
> > > mounted on an unsafe device and a crash happens.
> >
> > Nobody is suggesting that, it's just not a feasible approach. But you
> > have to warn if you don't know whether it provides the ordering
> > guarantee you expect to provide consistency and integrity.
>
> The warning of missing barriers (or other actions) should be printed only
> if write cache is enabled. But there's no way a filesystem on top
> of several dm or md layers can find out if the disk is running with hdparm
> -W 0 or hdparm -W 1.

Right, you can't possibly know that. Hence we have to print the warning.

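
As an illustration of the flag idea and the stacking problem discussed above, here is a minimal userspace sketch in Python. It is not kernel code and no such interface exists in the thread: `Ordering`, `Device`, `effective_ordering` and `needs_warning` are hypothetical names. The two bits mirror ordering models (2) and (3) from the list above, and the sketch assumes that a stacked md/dm device can promise no more than its weakest component, and that one component of unknown behaviour makes the whole stack unknown.

```python
from dataclasses import dataclass, field
from enum import Flag
from typing import List, Optional


class Ordering(Flag):
    """Two independent capability bits, one per ordering model."""
    NONE = 0            # model (1): no guarantee at all
    ON_COMPLETION = 1   # model (2): a completed write is on the medium
    BARRIER = 2         # model (3): barrier requests are honoured


@dataclass
class Device:
    name: str
    own: Optional[Ordering] = None   # leaf capability; None means "cannot tell"
    children: List["Device"] = field(default_factory=list)


def effective_ordering(dev: Device) -> Optional[Ordering]:
    """Return what dev can promise, or None if any layer below is unknown."""
    if not dev.children:
        return dev.own
    # Assumption: a stacking layer adds nothing, it only passes guarantees up.
    result = Ordering.ON_COMPLETION | Ordering.BARRIER
    for child in dev.children:
        caps = effective_ordering(child)
        if caps is None:
            return None
        result &= caps   # keep only what every component guarantees
    return result


def needs_warning(dev: Device) -> bool:
    """Warn unless at least one ordering guarantee is known to hold."""
    caps = effective_ordering(dev)
    return caps is None or caps == Ordering.NONE
```

With one fully capable disk and one that only offers model (2), the mirror built on both can promise model (2) only. Adding a disk whose write-cache state cannot be determined makes `needs_warning` return true for everything stacked on top of it, which is the situation described above.
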
> > > The barrier can be cancelled with -EOPNOTSUPP at any time. Andi Kleen
> > > submitted a patch that implements failing barriers for device mapper and
> > > he says that md-raid1 does the same thing.
> >
> > You are right, if a device is reconfigured beneath you it may very well
> > begin to return -EOPNOTSUPP much later. I didn't take that into account,
> > I was considering only "plain" devices.
>
> And it makes barriers useless for ordering.
>
> Filesystems handle these randomly failed barriers but the downside is that
> they must not submit any request concurrently with the barrier. Also, that
> -EOPNOTSUPP restarting code is really crap, the request cannot be
> restarted from bi_end_io, so bi_end_io needs to hand off to another thread
> for retry without barrier.

It can, but it requires you to operate at the request level. So for file
systems that is problematic, it won't work of course. It would not be
THAT hard to provide a helper to reissue the request. Not that pretty,
but...

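
The hand-off described above, where the completion handler cannot resubmit the failed barrier itself and instead passes it to another context that retries it as a plain write, can be sketched as a userspace model. This is Python, not kernel code: `Request`, `FakeDevice`, `end_io`, `retry_queue` and `retry_worker` are hypothetical names, and a plain `deque` drained by an explicit call stands in for whatever thread or work queue a real implementation would use.

```python
import errno
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    data: str
    barrier: bool = False


class FakeDevice:
    """Hypothetical device that rejects every barrier request."""

    def __init__(self):
        self.written = []

    def submit(self, req, end_io):
        if req.barrier:
            end_io(req, -errno.EOPNOTSUPP)   # completion reports the failure
        else:
            self.written.append(req.data)
            end_io(req, 0)


# Stands in for the list a separate retry thread would sleep on.
retry_queue = deque()


def end_io(req, error):
    """Completion handler: it must not resubmit here, so it only queues."""
    if error == -errno.EOPNOTSUPP and req.barrier:
        retry_queue.append(req)


def retry_worker(dev):
    """Runs later, in a context that is allowed to submit I/O."""
    while retry_queue:
        req = retry_queue.popleft()
        req.barrier = False   # retried as a plain write, ordering is not enforced
        dev.submit(req, end_io)
```

The point of the sketch is only the shape of the control flow: the write does not reach the device until `retry_worker` runs, so every barrier failure costs an extra context switch and a second submission.
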
> The filesystem can't do [submit req a], [submit barrier req b], [submit
> req c] and assume that the requests will be ordered. If [b] fails with
> -EOPNOTSUPP, [a] and [c] could be already reordered and data corruption
> has already happened. Even if you catch [b]'s error and resubmit it as
> non-barrier request, it's too late.
>
> So, as a result of this complication, all the existing filesystems send
> just one barrier request and do not try to overlay it with any other write
> requests.
>
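
The failure case in the two paragraphs above can be reproduced with a small deterministic model. This is a userspace Python sketch under stated assumptions, not a description of any real driver: `ReorderingDevice` is a hypothetical device that rejects barriers and writes its queue out in reverse order (one legal worst case when nothing enforces ordering), and `commit_overlapped` and `commit_serialized` are hypothetical stand-ins for the two submission strategies.

```python
import errno


class ReorderingDevice:
    """Hypothetical device: no barrier support, drains its queue in reverse."""

    def __init__(self):
        self.queue = []
        self.medium = []

    def submit(self, name, barrier=False):
        if barrier:
            return -errno.EOPNOTSUPP   # barrier rejected, nothing is queued
        self.queue.append(name)
        return 0

    def drain(self):
        # Without barriers any order is legal; reverse order is the worst case.
        while self.queue:
            self.medium.append(self.queue.pop())


def commit_overlapped(dev):
    """Submit a, barrier b, c back to back, then react to b's failure."""
    dev.submit("a")
    err_b = dev.submit("b", barrier=True)
    dev.submit("c")
    dev.drain()               # the device keeps working while the error travels back
    if err_b == -errno.EOPNOTSUPP:
        dev.submit("b")       # retried as a plain write, but a and c are already down
        dev.drain()
    return dev.medium


def commit_serialized(dev):
    """Wait for each step before issuing the next, relying on model (2)."""
    dev.submit("a")
    dev.drain()               # wait until a has completed
    if dev.submit("b", barrier=True) == -errno.EOPNOTSUPP:
        dev.submit("b")       # safe here: nothing else is in flight
    dev.drain()
    dev.submit("c")
    dev.drain()
    return dev.medium
```

In this model `commit_overlapped` leaves the medium in the order c, a, b, while `commit_serialized` yields a, b, c. The second variant only works because it never has another write in flight alongside the barrier, which is the restriction described above.
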
> So I'm wondering why Linux developers designed a barrier interface with
> a complex specification and a complex implementation, when the interface is
> useless for providing any request ordering and is no better than the
> q->issue_flush_fn method or whatever was there before. Obviously, the whole
> barrier thing was designed by a person who never used it in a filesystem.

That's not quite true, it was done in conjunction with file system
people. At a certain level, we are restricted by what the hardware can
actually do. It's certainly possible to make sure your storage stack can
support barriers and be safe in that regard, but it's certainly also
true that reconfiguring devices may void that guarantee. So it's not
perfect, but it's the best we can do. The worst part is that it's
virtually impossible to inform of such limitations.

If we get rid of -EOPNOTSUPP and just warn in such cases, then you
should never see -EOPNOTSUPP in the above sequence. You may not actually
be safe, hence we print a warning. It'll also make the whole thing a lot
less complex.

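
The "warn instead of failing" behaviour proposed above can be modelled as follows. This is a hedged userspace sketch in Python, not the block layer's implementation: `PlainDevice` and its `supports_barrier` attribute are hypothetical, and Python's standard `warnings` module stands in for a kernel log message.

```python
import warnings


class PlainDevice:
    """Hypothetical device without barrier support."""

    supports_barrier = False

    def __init__(self):
        self.written = []
        self.warned = False

    def submit(self, data, barrier=False):
        if barrier and not self.supports_barrier:
            if not self.warned:
                # Warn once; the caller is told it may not actually be safe.
                warnings.warn("barrier not supported, ordering not guaranteed")
                self.warned = True
            barrier = False   # degrade to a plain write instead of failing
        self.written.append((data, barrier))
        return 0              # the submitter never sees -EOPNOTSUPP
```

Under this scheme the submitter needs only one commit path, since every request succeeds from its point of view. The cost is that the ordering guarantee is silently absent and the warning is the only trace of that.
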
And to wrap up with the history of barriers, there was NOTHING before.
->issue_flush_fn is a later addition to just force a flush for fsync()
and friends, the original implementation was just a data bio/bh with
barrier semantics, providing no reordering before and after the data
passed.

Nobody was interested in barriers when they were done. Nobody. The fact
that it's taken 6 years or so to actually emerge as a hot topic for data
consistency should make that quite obvious. So the original
implementation was basically a joint effort with Chris on the reiser
side and EMC as the hw vendor and me doing the block implementation.