Re: kernel panic in skb_copy_bits

From: Alex Bligh
Date: Thu Jul 04 2013 - 08:57:38 EST




--On 4 July 2013 03:12:10 -0700 Eric Dumazet <eric.dumazet@xxxxxxxxx> wrote:

It looks like a typical COW issue to me.

If the page content is written while there is still a reference on this
page, we should allocate a new page and copy the previous content.

And this has little to do with networking.

I suspect this would get more attention if we could make Ian's case
below trigger (a) outside Xen, (b) outside networking.

memset(buf, 0xaa, 4096);
write(fd, buf, 4096)
memset(buf, 0x55, 4096);
(where fd is O_DIRECT on NFS) Can result in 0x55 being seen on the wire
in the TCP retransmit.

We know this should fail using O_DIRECT+NFS. We've had reports suggesting
it fails in O_DIRECT+iSCSI. However, that's been with a kernel panic
(under Xen) rather than data corruption as per the above.

Historical trawling suggests this is an issue with DRDB (see Ian's
original thread from the mists of time).

I don't quite understand why we aren't seeing corruption with standard
ATA devices + O_DIRECT and no Xen involved at all.

My memory is a bit misty on this but I had thought the reason why
this would NOT be solved simply by O_DIRECT taking a reference to
the page was that the O_DIRECT I/O completed (and thus the reference
would be freed up) before the networking stack had actually finished
with the page. If the O_DIRECT I/O did not complete until the
page was actually finished with, we wouldn't see the problem in the
first place. I may be completely off base here.

--
Alex Bligh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/