Re: Wrong DIF guard tag on ext2 write

From: Vladislav Bolkhovitin
Date: Mon Jul 26 2010 - 15:26:19 EST


Gennadiy Nerubayev, on 07/26/2010 09:00 PM wrote:
On Mon, Jul 26, 2010 at 8:22 AM, Vladislav Bolkhovitin<vst@xxxxxxxx> wrote:
Gennadiy Nerubayev, on 07/24/2010 12:51 AM wrote:

The real-life problem can be seen in an active-active DRBD setup. In
this configuration, 2 nodes act as a single SCST-powered SCSI device and
both run DRBD to keep their backstorage in sync. The initiator uses them
as a single multipath device in an active-active round-robin
load-balancing configuration, i.e. it sends requests to both nodes in
parallel, and DRBD then takes care of replicating the requests to the
other node.

The problem is that sometimes DRBD complains about concurrent local
writes, like:

kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected! [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192

This message means that DRBD detected that both nodes received
overlapping writes on the same block(s) and DRBD can't figure out which
one to store. This is possible only if the initiator sent the second
write request before the first one completed.
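
As a rough illustration of what "overlapping writes" means for the two
extents printed in that message, here is a minimal sketch. It assumes
that each extent is printed as a start sector (with an "s" suffix)
followed by a length in bytes, and that sectors are 512 bytes; both are
assumptions about the log format, not taken from DRBD's source. The
names Extent, parse_extent and overlaps are hypothetical.

```python
import re
from dataclasses import dataclass

SECTOR_SIZE = 512  # assumption: 512-byte sectors


@dataclass(frozen=True)
class Extent:
    start_sector: int
    size_bytes: int

    @property
    def end_sector(self) -> int:
        # Exclusive end sector; a partial trailing sector is rounded up.
        return self.start_sector + -(-self.size_bytes // SECTOR_SIZE)


def parse_extent(text: str) -> Extent:
    """Parse a string like '144072784s +8192' (assumed log format)."""
    match = re.fullmatch(r"\s*(\d+)s \+(\d+)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized extent: {text!r}")
    return Extent(int(match.group(1)), int(match.group(2)))


def overlaps(a: Extent, b: Extent) -> bool:
    # Two half-open sector ranges overlap if each starts before the other ends.
    return a.start_sector < b.end_sector and b.start_sector < a.end_sector


new = parse_extent("144072784s +8192")
pending = parse_extent("144072784s +8192")
print(overlaps(new, pending))
```

With the values from the log line above, both extents cover the same 16
sectors, so the check reports an overlap.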

The topic of the discussion could well explain the cause of that. But,
unfortunately, the people who reported it forgot to note which OS they
run on the initiator, i.e. I can't say for sure it's Linux.

Sorry for the late chime in, but here's some more information of
potential interest as I've previously inquired about this to the drbd
mailing list:

1. It only happens when using blockio mode in IET or SCST. Fileio,
nv_cache, and write_through do not generate the warnings.

Some explanations for those who are not familiar with the terminology:

- "Fileio" means Linux IO stack on the target receives IO via
vfs_readv()/vfs_writev()

- "NV_CACHE" means all the cache synchronization requests
(SYNCHRONIZE_CACHE, FUA) from the initiator are ignored

- "WRITE_THROUGH" means write through, i.e. the corresponding backend
file for the device is opened with the O_SYNC flag.
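
To illustrate the last item, here is a minimal user-space sketch of
opening a backing file with O_SYNC. It is not the target's actual code;
the function name open_write_through and the file name are hypothetical,
and os.O_SYNC is only available on POSIX systems.

```python
import os
import tempfile


def open_write_through(path: str) -> int:
    """Open (or create) a file so that every write is synchronous."""
    # O_SYNC: write() returns only after the data has been handed to the
    # underlying storage, instead of just landing in the page cache.
    flags = os.O_WRONLY | os.O_CREAT | os.O_SYNC
    return os.open(path, flags, 0o600)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical backing file, created in a scratch directory.
    path = os.path.join(tmp, "backing_file.img")
    fd = open_write_through(path)
    try:
        written = os.write(fd, b"\0" * 4096)  # one synchronous 4 KiB write
    finally:
        os.close(fd)
    size = os.path.getsize(path)
    print(written, size)
```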

2. It happens on active/passive drbd clusters (on the active node
obviously), NOT active/active. In fact, I've found that doing round
robin on active/active is a Bad Idea (tm) even with a clustered
filesystem, until at least the target software is able to synchronize
the command state of either node.
3. Linux and ESX initiators can generate the warning, but I've so far
only been able to reliably reproduce it using a Windows initiator and
sqlio or iometer benchmarks. I'll be trying again using iometer when I
have the time.
4. It only happens using a random write io workload (any block size),
with initiator threads>1, OR initiator queue depth>1. The higher
either of those is, the more spammy the warnings become.
5. The transport does not matter (reproduced with iSCSI and SRP)
6. If DRBD is disconnected (primary/unknown), the warnings are not
generated. As soon as it's reconnected (primary/secondary), the
warnings will reappear.

It would be great if you prove or disprove our suspicions that Linux can
produce several write requests for the same blocks simultaneously. To be
sure we need:

1. The initiator is Linux. Windows and ESX are not needed for this
particular case.

2. If you are able to reproduce it, we will need a full description of
which application was used on the initiator to generate the load, and in
which mode.

Target and DRBD configuration doesn't matter, you can use any.

I just tried, and this particular DRBD warning is not reproducible
with io (iometer) coming from a Linux initiator (2.6.30.10). The same
iometer parameters were used as on windows, and both the base device
as well as filesystem (ext3) were tested, both negative. I'll try a
few more tests, but it seems that this is a nonissue with a Linux
initiator.

OK, but to be completely sure, can you also check with load generators
other than IOmeter, please? IOmeter on Linux is a lot less effective
than on Windows, because it uses sync IO, while we need a big multi-IO
load to trigger the problem we are discussing, if it exists. Plus, to
catch it we need an FS on the initiator side, not raw devices. So,
something like fio over files on an FS or diskbench should be more
appropriate. Please don't use direct IO, to avoid the bug Dave Chinner
pointed out to us.

I tried both fio and dbench, with the same results. With fio in
particular, I think I used pretty much every possible combination of
engines, directio, and sync settings with 8 threads, 32 queue depth
and random write workload.
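
The exact job parameters were not posted. Purely as an illustration, a
test matrix of the kind described (engines x direct x sync, 8 threads,
queue depth 32, random writes) could be generated like this. The engine
list, block size, file size and directory are assumptions picked for the
example; fio_command is a hypothetical helper.

```python
import itertools
import shlex

# Assumptions for illustration only: which engines to try, plus the
# block size, file size and target directory used below.
ENGINES = ["sync", "psync", "libaio", "posixaio"]
DIRECT = [0, 1]  # 0 = buffered IO, 1 = O_DIRECT
SYNC = [0, 1]    # 0 = normal writes, 1 = synchronous writes


def fio_command(engine: str, direct: int, sync: int,
                directory: str = "/mnt/test") -> str:
    """Build one fio command line for a random-write job over files."""
    args = [
        "fio",
        f"--name=randwrite-{engine}-d{direct}-s{sync}",
        f"--directory={directory}",  # files on a mounted FS, not a raw device
        "--rw=randwrite",
        "--bs=4k",
        "--size=1g",
        "--thread",
        "--numjobs=8",
        # An iodepth above 1 only has an effect with asynchronous engines.
        "--iodepth=32",
        f"--ioengine={engine}",
        f"--direct={direct}",
        f"--sync={sync}",
        "--group_reporting",
    ]
    return shlex.join(args)


commands = [
    fio_command(engine, direct, sync)
    for engine, direct, sync in itertools.product(ENGINES, DIRECT, SYNC)
]
for command in commands:
    print(command)
```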

Also, you mentioned above that Linux can generate the warning. Can you
recall on which configuration, including the kernel version, the load
application and its configuration, you have seen it?

Sorry, after double checking, it's only ESX and Windows that generate
them. The majority of the ESX virtuals in question are Windows, though
I can see some indications of ESX servers that have Linux-only
virtuals generating one here and there. It's somewhat difficult to
tell historically, and I probably would not be able to determine what
those virtuals were running at the time.

OK, I see. A negative result is also a result. Now we know that Linux (in contrast to VMware and Windows) works well in this area.

Thank you!
Vlad