Re: [PATCH] mm: strictlimit feature -v2

From: Maxim Patlasov
Date: Fri Jul 05 2013 - 09:16:03 EST


Hi Jan,

First of all, thanks a lot for review, I highly appreciate it. I agree with most of your comments, see please inline replies below.

To make further reviews easier, I'm going to send the next version in the form of small mm-only patchset (i.e. putting all that "fuse writeback cache policy" stuff aside).

07/04/2013 03:16 AM, Jan Kara ÐÐÑÐÑ:
On Wed 03-07-13 15:01:19, Maxim Patlasov wrote:
07/02/2013 11:38 PM, Andrew Morton ÐÐÑÐÑ:
On Tue, 02 Jul 2013 21:44:47 +0400 Maxim Patlasov <MPatlasov@xxxxxxxxxxxxx> wrote:

From: Miklos Szeredi <mszeredi@xxxxxxx>

The feature prevents mistrusted filesystems to grow a large number of dirty
pages before throttling. For such filesystems balance_dirty_pages always
check bdi counters against bdi limits. I.e. even if global "nr_dirty" is under
"freerun", it's not allowed to skip bdi checks. The only use case for now is
fuse: it sets bdi max_ratio to 1% by default and system administrators are
supposed to expect that this limit won't be exceeded.

The feature is on if address space is marked by AS_STRICTLIMIT flag.
A filesystem may set the flag when it initializes a new inode.

Changed in v2 (thanks to Andrew Morton):
- added a few explanatory comments
- cleaned up the mess in backing_dev_info foo_stamp fields: now it's clearly
stated that bw_time_stamp is measured in jiffies; renamed other foo_stamp
fields to reflect that they are in units of number-of-pages.

Better, thanks.

The writeback arithemtic makes my head spin - I'd really like Fengguang
to go over this, please.

A quick visit from the spelling police:
Great! Thank you, Andrew. I'll wait for Fengguang' feedback for a
while before respin.
Sorry for the bad mail threading but I've noticed the thread only now and
I don't have email with your patches in my mailbox anymore. Below is a
review of your strictlimit patch. In principle, I'm OK with the idea (I
even wanted to have a similar ability e.g. for NFS mounts) but I have some
reservations regarding the implementation:

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 4beb8e3..00a28af 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -305,6 +305,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
inode->i_flags |= S_NOCMTIME;
inode->i_generation = generation;
inode->i_data.backing_dev_info = &fc->bdi;
+ set_bit(AS_STRICTLIMIT, &inode->i_data.flags);
fuse_init_inode(inode, attr);
unlock_new_inode(inode);
} else if ((inode->i_mode ^ attr->mode) & S_IFMT) {
It seems wrong to use address space bits for this. Using BDI capabilities for
this would look more appropriate. Sure you couldn't then always tune this on
per-fs basis since filesystems can share BDI (e.g. when they are on different
partitions of one disk) but if several filesystems are sharing a BDI and some
would want 'strict' behavior and some don't I don't believe the resulting
behavior would be sane - e.g. non-strict fs could be getting bdi over per-bdi
limit without any throttling and strictly limited fs would be continuously
stalled. Or do I miss any reason why this is set on address space?

As a bonus you don't have to pass the 'strictlimit' flag to writeback
functions when it is a bdi flag.

Completely agree. Address space flag was Miklos' idea. I had to convert it to bdi cap from the beginning.


diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c388155..6b12d01 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -33,6 +33,8 @@ enum bdi_state {
BDI_sync_congested, /* The sync queue is getting full */
BDI_registered, /* bdi_register() was done */
BDI_writeback_running, /* Writeback is in progress */
+ BDI_idle, /* No pages under writeback at the moment of
+ * last update of write bw */
BDI_unused, /* Available bits start here */
};
@@ -41,8 +43,15 @@ typedef int (congested_fn)(void *, int);
enum bdi_stat_item {
BDI_RECLAIMABLE,
BDI_WRITEBACK,
- BDI_DIRTIED,
- BDI_WRITTEN,
+
+ /*
+ * The three counters below reflects number of events of specific type
+ * happened since bdi_init(). The type is defined in comments below:
+ */
+ BDI_DIRTIED, /* a page was dirtied */
+ BDI_WRITTEN, /* writeout completed for a page */
+ BDI_WRITTEN_BACK, /* a page went to writeback */
+
NR_BDI_STAT_ITEMS
};
@@ -73,9 +82,12 @@ struct backing_dev_info {
struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
- unsigned long bw_time_stamp; /* last time write bw is updated */
- unsigned long dirtied_stamp;
- unsigned long written_stamp; /* pages written at bw_time_stamp */
+ unsigned long bw_time_stamp; /* last time (in jiffies) write bw
+ * is updated */
+ unsigned long dirtied_nr_stamp;
+ unsigned long written_nr_stamp; /* pages written at bw_time_stamp */
+ unsigned long writeback_nr_stamp; /* pages sent to writeback at
+ * bw_time_stamp */
unsigned long write_bandwidth; /* the estimated write bandwidth */
unsigned long avg_write_bandwidth; /* further smoothed write bw */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4514ad7..83c7434 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -680,28 +712,55 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
return 0;
/*
- * global setpoint
+ * The strictlimit feature is a tool preventing mistrusted filesystems
+ * to grow a large number of dirty pages before throttling. For such
+ * filesystems balance_dirty_pages always checks bdi counters against
+ * bdi limits. Even if global "nr_dirty" is under "freerun". This is
+ * especially important for fuse who sets bdi->max_ratio to 1% by
+ * default. Without strictlimit feature, fuse writeback may consume
+ * arbitrary amount of RAM because it is accounted in
+ * NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty".
*
- * setpoint - dirty 3
- * f(dirty) := 1.0 + (----------------)
- * limit - setpoint
+ * Here, in bdi_position_ratio(), we calculate pos_ratio based on
+ * two values: bdi_dirty and bdi_thresh. Let's consider an example:
+ * total amount of RAM is 16GB, bdi->max_ratio is equal to 1%, global
+ * limits are set by default to 10% and 20% (background and throttle).
+ * Then bdi_thresh is 1% of 20% of 16GB. This amounts to ~8K pages.
+ * bdi_dirty_limit(bdi, bg_thresh) is about ~4K pages. bdi_setpoint is
+ * about ~6K pages (as the average of background and throttle bdi
+ * limits). The 3rd order polynomial will provide positive feedback if
+ * bdi_dirty is under bdi_setpoint and vice versa.
*
- * it's a 3rd order polynomial that subjects to
+ * Note, that we cannot use global counters in these calculations
+ * because we want to throttle process writing to strictlimit address
+ * space much earlier than global "freerun" is reached (~23MB vs.
+ * ~2.3GB in the example above).
+ */
+ if (unlikely(strictlimit)) {
+ if (bdi_dirty < 8)
+ return 2 << RATELIMIT_CALC_SHIFT;
+
+ if (bdi_dirty >= bdi_thresh)
+ return 0;
+
+ bdi_setpoint = bdi_thresh + bdi_dirty_limit(bdi, bg_thresh);
+ bdi_setpoint /= 2;
+
+ if (bdi_setpoint == 0 || bdi_setpoint == bdi_thresh)
+ return 0;
+
+ pos_ratio = pos_ratio_polynom(bdi_setpoint, bdi_dirty,
+ bdi_thresh);
+ return min_t(long long, pos_ratio, 2 << RATELIMIT_CALC_SHIFT);
+ }
But if global limits are exceeded but a BDI in strict mode is below limit,
this would allow dirtying on that BDI which seems wrong. Also the logic
in bdi_position_ratio() is already supposed to take bdi limits in account
(although the math is somewhat convoluted) so you shouldn't have to touch it.
Only maybe the increasing of bdi_thresh to (limit - dirty) / 8 might be too
much for strict limitting so that may need some tweaking (although setting it
at 8 pages as your patch does seems *too* strict to me).

I agree that if global nr_dirty comes close to global limit (or already exceeded it), we must take global pos_ratio in consideration (even if bdi_dirty << bdi_thresh). But I don't think we can keep bdi_position_ratio() intact. Because: 1) the math there goes crazy if dirty < freerun; 2) "(limit - dirty) / 8" is useless for FUSE which account internal writeback in NR_WRITEBACK_TEMP; i.e. you can easily observe "dirty" close to zero while bdi_dirty is huge. Please let me know if you need more details about NR_WRITEBACK_TEMP peculiarities.


+
+ /*
+ * global setpoint
*
- * (1) f(freerun) = 2.0 => rampup dirty_ratelimit reasonably fast
- * (2) f(setpoint) = 1.0 => the balance point
- * (3) f(limit) = 0 => the hard limit
- * (4) df/dx <= 0 => negative feedback control
- * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
- * => fast response on large errors; small oscillation near setpoint
+ * See comment for pos_ratio_polynom().
*/
setpoint = (freerun + limit) / 2;
- x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
- limit - setpoint + 1);
- pos_ratio = x;
- pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
- pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
- pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+ pos_ratio = pos_ratio_polynom(setpoint, dirty, limit);
/*
* We have computed basic pos_ratio above based on global situation. If
@@ -892,7 +951,8 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
unsigned long bdi_thresh,
unsigned long bdi_dirty,
unsigned long dirtied,
- unsigned long elapsed)
+ unsigned long elapsed,
+ bool strictlimit)
{
unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
unsigned long limit = hard_dirty_limit(thresh);
@@ -910,10 +970,10 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
* The dirty rate will match the writeout rate in long term, except
* when dirty pages are truncated by userspace or re-dirtied by FS.
*/
- dirty_rate = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
+ dirty_rate = (dirtied - bdi->dirtied_nr_stamp) * HZ / elapsed;
pos_ratio = bdi_position_ratio(bdi, thresh, bg_thresh, dirty,
- bdi_thresh, bdi_dirty);
+ bdi_thresh, bdi_dirty, strictlimit);
/*
* task_ratelimit reflects each dd's dirty rate for the past 200ms.
*/
@@ -994,6 +1054,26 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
* keep that period small to reduce time lags).
*/
step = 0;
+
+ /*
+ * For strictlimit case, balanced_dirty_ratelimit was calculated
+ * above based on bdi counters and limits (see bdi_position_ratio()).
+ * Hence, to calculate "step" properly, we have to use bdi_dirty as
+ * "dirty" and bdi_setpoint as "setpoint".
+ *
+ * We rampup dirty_ratelimit forcibly if bdi_dirty is low because
+ * it's possible that bdi_thresh is close to zero due to inactivity
+ * of backing device (see the implementation of bdi_dirty_limit()).
+ */
+ if (unlikely(strictlimit)) {
+ dirty = bdi_dirty;
+ if (bdi_dirty < 8)
+ setpoint = bdi_dirty + 1;
+ else
+ setpoint = (bdi_thresh +
+ bdi_dirty_limit(bdi, bg_thresh)) / 2;
+ }
+
if (dirty < setpoint) {
x = min(bdi->balanced_dirty_ratelimit,
min(balanced_dirty_ratelimit, task_ratelimit));
@@ -1034,12 +1114,14 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
unsigned long dirty,
unsigned long bdi_thresh,
unsigned long bdi_dirty,
- unsigned long start_time)
+ unsigned long start_time,
+ bool strictlimit)
{
unsigned long now = jiffies;
unsigned long elapsed = now - bdi->bw_time_stamp;
unsigned long dirtied;
unsigned long written;
+ unsigned long writeback;
/*
* rate-limit, only update once every 200ms.
@@ -1049,6 +1131,7 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
+ writeback = bdi_stat_sum(bdi, BDI_WRITTEN_BACK);
/*
* Skip quiet periods when disk bandwidth is under-utilized.
@@ -1057,18 +1140,32 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
goto snapshot;
+ /*
+ * Skip periods when backing dev was idle due to abscence of pages
+ * under writeback (when over_bground_thresh() returns false)
+ */
+ if (test_bit(BDI_idle, &bdi->state) &&
+ bdi->writeback_nr_stamp == writeback)
+ goto snapshot;
+
Hum, I understand the desire behind BDI_idle but that seems to be solving
only a special case, isn't it? When there is a small traffic on the bdi, the
bandwidth will get updated anyway and the computed bandwidth will go down
from the real maximum value when there are enouch pages to write. So I'm not
sure how much this really helps. Plus this 'idle' logic seems completely
independent to balance_dirty_pages() tweaks so it would be better done as a
separate patch (in case you have convincing reasons we really need that
logic).

Yes, I'll move it to separate patch. The rationale behind BDI_idle looks like this:

BDI_idle along with BDI_WRITTEN_BACK exists to distinguish two cases:

1st. BDI_WRITTEN has not been incremented since we looked at it last
time because backing dev is unresponding. I.e. it had some pages under
writeback but it have not made any progress for some reasons.

2nd. BDI_WRITTEN has not been incremented since we looked at it last
time because backing dev had nothing to do. I.e. there are some dirty
pages on bdi, but they have not been passed to backing dev yet. This is
the case when bdi_dirty is under bdi background threshold and flusher
refrains from flushing even if we woke it up explicitly by
bdi_start_background_writeback.

We have to skip bdi_update_write_bandwidth() in the 2nd case because
otherwise bdi_update_write_bandwidth() will see written==0 and
mistakenly decrease write_bandwidth. The criterion to skip is the
following: BDI_idle is set (i.e. there were no pages under writeback
when we looked at the bdi last time) && BDI_WRITTEN_BACK counter has not
changed (i.e. no new pages has been sent to writeback since we looked at
the bdi last time).

Notice, that BDI_idle is useful right now only for strictlimit feature because w/o strictlimit, we do not call bdi_update_write_bandwidth() until "dirty" exceeds "freerun" and over_bground_thresh() guarantees that flusher has passed some amount of work to backing dev by then.


if (thresh) {
global_update_bandwidth(thresh, dirty, now);
bdi_update_dirty_ratelimit(bdi, thresh, bg_thresh, dirty,
bdi_thresh, bdi_dirty,
- dirtied, elapsed);
+ dirtied, elapsed, strictlimit);
}
bdi_update_write_bandwidth(bdi, elapsed, written);
snapshot:
- bdi->dirtied_stamp = dirtied;
- bdi->written_stamp = written;
+ bdi->dirtied_nr_stamp = dirtied;
+ bdi->written_nr_stamp = written;
bdi->bw_time_stamp = now;
+
+ bdi->writeback_nr_stamp = writeback;
+ if (bdi_stat_sum(bdi, BDI_WRITEBACK) == 0)
+ set_bit(BDI_idle, &bdi->state);
+ else
+ clear_bit(BDI_idle, &bdi->state);
}
static void bdi_update_bandwidth(struct backing_dev_info *bdi,
@@ -1077,13 +1174,14 @@ static void bdi_update_bandwidth(struct backing_dev_info *bdi,
unsigned long dirty,
unsigned long bdi_thresh,
unsigned long bdi_dirty,
- unsigned long start_time)
+ unsigned long start_time,
+ bool strictlimit)
{
if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
return;
spin_lock(&bdi->wb.list_lock);
__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
- bdi_thresh, bdi_dirty, start_time);
+ bdi_thresh, bdi_dirty, start_time, strictlimit);
spin_unlock(&bdi->wb.list_lock);
}
@@ -1226,6 +1324,7 @@ static void balance_dirty_pages(struct address_space *mapping,
unsigned long dirty_ratelimit;
unsigned long pos_ratio;
struct backing_dev_info *bdi = mapping->backing_dev_info;
+ bool strictlimit = test_bit(AS_STRICTLIMIT, &mapping->flags);
unsigned long start_time = jiffies;
for (;;) {
@@ -1250,7 +1349,7 @@ static void balance_dirty_pages(struct address_space *mapping,
*/
freerun = dirty_freerun_ceiling(dirty_thresh,
background_thresh);
- if (nr_dirty <= freerun) {
+ if (nr_dirty <= freerun && !strictlimit) {
current->dirty_paused_when = now;
current->nr_dirtied = 0;
current->nr_dirtied_pause =
I'd rather change this to check bdi_dirty <= bdi_freerun in strictlimit
case.

An excellent idea, thank you very much! Some complication comes from the fact that bdi_dirty and bdi_freerun are not calculated by that time, but we can check bdi_dirty <= bdi_freerun a bit later -- that shouldn't make big difference.


@@ -1258,7 +1357,7 @@ static void balance_dirty_pages(struct address_space *mapping,
break;
}
- if (unlikely(!writeback_in_progress(bdi)))
+ if (unlikely(!writeback_in_progress(bdi)) && !strictlimit)
bdi_start_background_writeback(bdi);
This can then go away.

Yes.


/*
@@ -1296,19 +1395,24 @@ static void balance_dirty_pages(struct address_space *mapping,
bdi_stat(bdi, BDI_WRITEBACK);
}
+ if (unlikely(!writeback_in_progress(bdi)) &&
+ bdi_dirty > bdi_thresh / 4)
+ bdi_start_background_writeback(bdi);
+
Why is this?

We attempted to avoid kicking flusher every time we dive into balance_dirty_pages(). This will surely go away.


dirty_exceeded = (bdi_dirty > bdi_thresh) &&
- (nr_dirty > dirty_thresh);
+ ((nr_dirty > dirty_thresh) || strictlimit);
if (dirty_exceeded && !bdi->dirty_exceeded)
bdi->dirty_exceeded = 1;
bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
nr_dirty, bdi_thresh, bdi_dirty,
- start_time);
+ start_time, strictlimit);
dirty_ratelimit = bdi->dirty_ratelimit;
pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
background_thresh, nr_dirty,
- bdi_thresh, bdi_dirty);
+ bdi_thresh, bdi_dirty,
+ strictlimit);
task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
RATELIMIT_CALC_SHIFT;
max_pause = bdi_max_pause(bdi, bdi_dirty);
@@ -1362,6 +1466,8 @@ static void balance_dirty_pages(struct address_space *mapping,
}
pause:
+ if (unlikely(!writeback_in_progress(bdi)))
+ bdi_start_background_writeback(bdi);
trace_balance_dirty_pages(bdi,
dirty_thresh,
background_thresh,
And this shouldn't be necessary either after updating the freerun test
properly.

Yep!

Thanks,
Maxim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/