Re: Sync writeback still broken

From: Jan Kara
Date: Sun Oct 31 2010 - 18:40:23 EST


On Sun 31-10-10 13:24:37, Jan Kara wrote:
> On Mon 25-10-10 01:41:48, Jan Engelhardt wrote:
> > On Sunday 2010-06-27 18:44, Jan Engelhardt wrote:
> > >On Monday 2010-02-15 16:41, Jan Engelhardt wrote:
> > >>On Monday 2010-02-15 15:49, Jan Kara wrote:
> > >>>On Sat 13-02-10 13:58:19, Jan Engelhardt wrote:
> > >>>> >>
> > >>>> >> This fixes it by using the passed in page writeback count, instead of
> > >>>> >> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance
> > >>>> >> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1)
> > >>>> >> finish properly even when new pages are being dirtied.
> > >>>> >
> > >>>> >This seems broken.
> > >>>>
> > >>>> It seems so. Jens, Jan Kara, your patch does not entirely fix this.
> > >>>> While there is no sync/fsync to be seen in these traces, I can
> > >>>> tell there's a livelock, without Dirty decreasing at all.
> > >
> > >What ultimately became of the discussion and/or the patch?
> > >
> > >Your original ad-hoc patch certainly still does its job; had no need to
> > >reboot in 86 days and still counting.
> >
> > I still observe this behavior on 2.6.36-rc8. This is starting to
> > get frustrating, so I will be happily following akpm's advice to
> > poke people.
> Yes, that's a good way :)
>
> > Thread entrypoint: http://lkml.org/lkml/2010/2/12/41
> >
> > Previously, many concurrent extractions of tarballs and so on have been
> > one way to trigger the issue; I now also have a rather small testcase
> > (below) that freezes the box here (which has 24G RAM, so even if I'm
> > not calling msync, I should be fine) sometime after memset finishes.
> I've tried your test but didn't succeed in freezing my laptop.
> Everything was running smoothly, the machine even felt reasonably responsive
> although it was constantly reading from and writing to disk. Also sync(1)
> finished in a couple of seconds, as one would expect in an optimistic case.
> Admittedly, my laptop has only 1G of RAM, so I had to downsize the hash
> table from 16G to 1G to be able to run the test, and the disk is an
> Intel SSD, so the performance of the backing storage compared to the amount
> of needed IO is much in my favor.
> OK, so I've taken a machine with a standard rotational drive and 28GB of
> RAM, and there I can see sync(1) hanging (but otherwise the machine looks
> OK). Investigating further...
So with the writeback tracing, I verified that the trouble is indeed that
the work queued by sync(1) sits behind the background writeback that is
currently running. And background writeback won't stop because your
process is dirtying pages so aggressively. Strictly speaking, it would stop
after writing LONG_MAX pages, but that's effectively infinity. I have a patch
(http://www.kerneltrap.com/mailarchive/linux-fsdevel/2010/8/3/6886244)
that stops background writeback when other work is queued, but it's kind
of hacky so I can see why Christoph doesn't like it ;)
So I'll have to code something different to fix this issue...
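
To make the starvation concrete, here is a toy model in Python. It is not
kernel code: the function name, the tick-based loop, and all the numbers
(starting dirty count, threshold, rates) are assumptions made up for
illustration. It only mimics the behaviour described above: one flusher
runs background writeback with a LONG_MAX page budget while a process keeps
dirtying pages, and the sync(1) work waits behind it. The
yield_to_queued_work flag imitates the idea of stopping background
writeback when other work is queued.

```python
LONG_MAX = 2**63 - 1  # page budget handed to background writeback


def sync_start_tick(steps, dirty_rate, write_rate, yield_to_queued_work):
    """Toy model: return the tick at which the queued sync work may start.

    Returns None if background writeback is still running after `steps`
    ticks. All quantities are in pages and are illustrative only.
    """
    dirty = 1000              # pages dirty when sync(1) is called (assumed)
    background_thresh = 100   # background writeback runs above this (assumed)
    budget = LONG_MAX         # remaining pages background may write
    sync_work_queued = True   # sync(1) work is waiting behind background

    for tick in range(steps):
        dirty += dirty_rate                        # the process keeps dirtying
        written = min(write_rate, dirty, budget)   # flusher writes some pages
        dirty -= written
        budget -= written

        # Background writeback ends on its own only when its budget is
        # exhausted or the dirty count falls below the threshold.
        if budget == 0 or dirty <= background_thresh:
            return tick
        # The hack: give up early because other work is waiting.
        if yield_to_queued_work and sync_work_queued:
            return tick
    return None
```

With dirty_rate equal to write_rate the dirty count never drops, so
sync_start_tick(10000, 50, 50, False) reports that sync never gets to run,
while passing True for the last argument lets it start right away. With
dirty_rate set to 0 the background work drains by itself and sync starts
after a finite number of ticks.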

Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR