Re: [PATCH 01/45] writeback: reduce calls to global_page_state inbalance_dirty_pages()

From: Peter Zijlstra
Date: Tue Oct 13 2009 - 04:43:03 EST


On Tue, 2009-10-13 at 11:24 +0800, Wu Fengguang wrote:

> You are right too :) I followed you and Peter's advice to do the loop
> and the recheck of stats as follows:

> This patch slightly changes behavior by replacing clip_bdi_dirty_limit()
> with the explicit check (nr_reclaimable + nr_writeback >= dirty_thresh)
> to avoid exceeding the dirty limit. Since the bdi dirty limit is mostly
> accurate we don't need to do routinely clip. A simple dirty limit check
> would be enough.
>
> The check is necessary because, in principle we should throttle
> everything calling balance_dirty_pages() when we're over the total
> limit, as said by Peter.
>
> We now set and clear dirty_exceeded not only based on bdi dirty limits,
> but also on the global dirty limits. This is a bit counterintuitive, but
> the global limits are the ultimate goal and shall be always imposed.
>
> We may now start background writeback work based on outdated conditions.
> That's safe because the bdi flush thread will (and have to) double check
> the states. It reduces overall overheads because the test based on old
> states still have good chance to be right.

> static void balance_dirty_pages(struct address_space *mapping,
> unsigned long write_chunk)
> {
> long nr_reclaimable, bdi_nr_reclaimable;
> long nr_writeback, bdi_nr_writeback;
> unsigned long background_thresh;
> unsigned long dirty_thresh;
> unsigned long bdi_thresh;
> int dirty_exceeded;
> struct backing_dev_info *bdi = mapping->backing_dev_info;
>
> /*
> * If sync() is in progress, curb the to-be-synced inodes regardless
> * of dirty limits, so that a fast dirtier won't livelock the sync.
> */
> if (unlikely(bdi->sync_time &&
> S_ISREG(mapping->host->i_mode) &&
> time_after_eq(bdi->sync_time,
> mapping->host->dirtied_when))) {
> write_chunk *= 2;
> bdi_writeback_wait(bdi, write_chunk);
> }
>
> for (;;) {
> nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> global_page_state(NR_UNSTABLE_NFS);
> nr_writeback = global_page_state(NR_WRITEBACK) +
> global_page_state(NR_WRITEBACK_TEMP);
>
> global_dirty_thresh(&background_thresh, &dirty_thresh);
>
> /*
> * Throttle it only when the background writeback cannot
> * catch-up. This avoids (excessively) small writeouts
> * when the bdi limits are ramping up.
> */
> if (nr_reclaimable + nr_writeback <
> (background_thresh + dirty_thresh) / 2)
> break;
>
> bdi_thresh = bdi_dirty_thresh(bdi, dirty_thresh);
>
> /*
> * In order to avoid the stacked BDI deadlock we need
> * to ensure we accurately count the 'dirty' pages when
> * the threshold is low.
> *
> * Otherwise it would be possible to get thresh+n pages
> * reported dirty, even though there are thresh-m pages
> * actually dirty; with m+n sitting in the percpu
> * deltas.
> */
> if (bdi_thresh < 2*bdi_stat_error(bdi)) {
> bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
> bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
> } else {
> bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> }
>
> /*
> * The bdi thresh is somehow "soft" limit derived from the
> * global "hard" limit. The former helps to prevent heavy IO
> * bdi or process from holding back light ones; The latter is
> * the last resort safeguard.
> */
> dirty_exceeded =
> (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
> || (nr_reclaimable + nr_writeback >= dirty_thresh);
>
> if (!dirty_exceeded)
> break;
>
> bdi->dirty_exceed_time = jiffies;
>
> bdi_writeback_wait(bdi, write_chunk);
> }
>
> /*
> * In laptop mode, we wait until hitting the higher threshold before
> * starting background writeout, and then write out all the way down
> * to the lower threshold. So slow writers cause minimal disk activity.
> *
> * In normal mode, we start background writeout at the lower
> * background_thresh, to keep the amount of dirty memory low.
> */
> if (!laptop_mode && (nr_reclaimable > background_thresh) &&
> can_submit_background_writeback(bdi))
> bdi_start_writeback(bdi, NULL, WB_FOR_BACKGROUND);
> }

Looks good, Thanks Wu!

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/