Re: Bug with "fix partial page writes" [3.2-rc regression]

From: Hugh Dickins
Date: Thu Dec 08 2011 - 12:40:31 EST


On Wed, 7 Dec 2011, Allison Henderson wrote:
> On 12/07/2011 10:04 AM, Allison Henderson wrote:
> > On 12/07/2011 01:28 AM, Yongqiang Yang wrote:
> > > Hi Allison and Hugh,
> > >
> > > I think I found the problem and it has nothing to do with punching
> > > hole. The patch [ext4: let ext4_bio_write_page handle EOF correctly]
> > > would fix up the problem.
> > >
> > > I post the patch so that it can be tested as early as possible. The
> > > problem has not appeared on my machine since the patch is applied.
> > >
> > > Yongqiang.
> >
> > Great! I will try it out with your other set in my sandbox and let you
> > know what happens. Thx!
> >
> > Allison Henderson
>
> Well, it's been running several hours now without problems, so I think it
> will be ok, but I will let it run the full day.
>
> Andy, I know you were also seeing issues in this area. Could you try
> Yongqiang's patches? The code you were modifying needed to be removed, so I
> think they will resolve the issues you were seeing too. Please try the
> following patch sets:
>
> [PATCH 1/2] ext4: let mpage_submit_io works well when blocksize < pagesize
> [PATCH 2/2] ext4: let ext4_discard_partial_buffers handle pages without
> buffers correctly
>
> and
>
> [PATCH 1/2] ext4: remove a wrong BUG_ON in ext4_ext_convert_to_initialized
> [PATCH 2/2] ext4: let ext4_bio_write_page handle EOF correctly

Those patches are working well for me, many thanks to Yongqiang.

The last of them (or perhaps more than one) fixes behaviour going back
several releases, and ought to be sent to -stable after verification.

I ran fsx (args as before on 1024k block ext2fs under memory pressure)
for 8 hours on three machines, and no problem showed up on any.
I didn't have time to try ext4, but I expect that you did.

And I've run kernel builds under memory pressure for 7.5 hours;
no problem has shown up there either. That's not long enough yet
to validate the oops fix by itself, but we've earlier run long
enough with the first 2/2 to be sure that it fixes the oops,
and the "corruption" that I saw.

Quotes around corruption now because, from Yongqiang's description,
I'm guessing that ld was mmap'ing objfiles and acting on "data"
from beyond EOF. ld does have the right to do that: the part of
the last page beyond EOF should indeed be zeroed.
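
For illustration, that guarantee can be checked from user space. The
sketch below is not taken from this thread: it assumes a Linux/POSIX
system, and the function name nonzero_bytes_past_eof and the scratch
file under /tmp are hypothetical. It writes a short file of non-zero
bytes, maps the file up to the end of its last page, and counts any
non-zero bytes between EOF and the end of that page. On a correct
filesystem the count should be zero.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a scratch file of file_len bytes (all 0xAA), map it read-only
 * up to the end of its last page, and return the number of non-zero
 * bytes found between EOF and the end of the mapping.
 * Returns -1 on any error, or if file_len is 0.
 */
long nonzero_bytes_past_eof(size_t file_len)
{
    char path[] = "/tmp/eof_tail_XXXXXX";  /* hypothetical scratch path */
    long page = sysconf(_SC_PAGESIZE);
    unsigned char *buf, *map;
    size_t map_len, i;
    long nonzero = 0;
    int fd;

    if (file_len == 0 || page <= 0)
        return -1;

    /* Round the mapping length up to a whole number of pages. */
    map_len = (file_len + (size_t)page - 1) / (size_t)page * (size_t)page;
    assert(map_len >= file_len);

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                          /* file vanishes once closed */

    buf = malloc(file_len);
    if (buf == NULL) {
        close(fd);
        return -1;
    }
    memset(buf, 0xAA, file_len);
    if (write(fd, buf, file_len) != (ssize_t)file_len) {
        free(buf);
        close(fd);
        return -1;
    }
    free(buf);

    map = mmap(NULL, map_len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                             /* the mapping stays valid */
    if (map == MAP_FAILED)
        return -1;

    /* Bytes in [file_len, map_len) lie beyond EOF in the last page. */
    for (i = file_len; i < map_len; i++)
        if (map[i] != 0)
            nonzero++;

    munmap(map, map_len);
    return nonzero;
}
```

Calling nonzero_bytes_past_eof(100) from a small main() and printing the
result is enough to try it. To exercise a particular filesystem, the
scratch path would have to point at a mount of that filesystem, and a
clean result from a single run says little about behaviour under
memory pressure.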

Only once, before the fixes, did I ever see an unexplained EINVAL
(from cp), like Andy reports: I'm very hopeful his case is fixed too.

Thanks!
Hugh