Re: [RFC v3] writeback: add elastic bdi in cgwb bdp

From: David Hildenbrand
Date: Thu Nov 14 2019 - 08:23:55 EST


On 14.11.19 10:38, Hillf Danton wrote:

On Tue, 12 Nov 2019 17:02:36 -0800 Andrew Morton wrote:

On Tue, 12 Nov 2019 11:42:27 +0800 Hillf Danton <hdanton@xxxxxxxx> wrote:

The elastic bdi (ebdi), a mirror bdi of the spinning disks, SSDs and
USB keys on the market, is introduced for balancing dirty pages
(bdp).

The risk arises that the system runs out of free memory when too many
dirty pages are produced too soon, so bdp is needed in the field.

Ebdi facilitates bdp in elastic time intervals, e.g. from one jiffy
up to HZ jiffies (one second), depending on the time it would take to
increase dirty pages by the amount defined by the variable
ratelimit_pages.

During cgroup writeback (cgwb) bdp, ebdi helps observe the changes
both in the cgwb's dirty pages (dirty speed) and in its written-out
pages (laundry speed) in elastic time intervals, until a balance is
established between the two parties, i.e. the two speeds become
statistically equal.

The above mechanism of elastic equilibrium effectively prevents dirty
page hogs, as dirty pages are given no chance to pile up, thus cutting
the risk that system free memory falls to an unsafe level.
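
In code terms, the equilibrium reduces to comparing two per-wb
counters (see cgwb_bdp_should_throttle() in the hunks below): a cgwb
is deemed in balance once

	wb_stat(wb, WB_DIRTIED) <=
	wb_stat(wb, WB_WRITTEN) + wb_stat_error()

holds, with wb_stat_error() absorbing the per-CPU counter drift.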

Thanks to Rong Chen for testing.

That sounds like a Tested-by:

Yes, Sir, will add Tested-by: Rong Chen <rong.a.chen@xxxxxxxxx>

The changelog has no testing results. Please prepare results which
show, amongst other things, the change in performance when the kernel
isn't tight on memory, as well as the alteration in behaviour when
memory is short.

Will do.

Generally, please work on making this code much more understandable?

Will do.


...

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -811,6 +811,8 @@ static long wb_split_bdi_pages(struct bd
 	if (nr_pages == LONG_MAX)
 		return LONG_MAX;

+	return nr_pages;
+
 	/*
 	 * This may be called on clean wb's and proportional distribution
 	 * may not make sense, just use the original @nr_pages in those
@@ -1604,6 +1606,7 @@ static long writeback_chunk_size(struct
 		pages = min(pages, work->nr_pages);
 		pages = round_down(pages + MIN_WRITEBACK_PAGES,
 				   MIN_WRITEBACK_PAGES);
+		pages = work->nr_pages;

It's unclear what this is doing, but it makes the three preceding
statements non-operative.

This change, like the one above, is trying to bypass the current
bandwidth heuristics; a couple of rounds of cleanup will be needed
once it survives the LTP.
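
In effect, wb_split_bdi_pages() now degenerates to (signature as in
mainline fs/fs-writeback.c):

	static long wb_split_bdi_pages(struct bdi_writeback *wb,
				       long nr_pages)
	{
		if (nr_pages == LONG_MAX)
			return LONG_MAX;
		/* proportional split bypassed: each wb gets the full amount */
		return nr_pages;
	}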

 	}

 	return pages;
@@ -2092,6 +2095,9 @@ void wb_workfn(struct work_struct *work)
 		wb_wakeup_delayed(wb);

 	current->flags &= ~PF_SWAPWRITE;
+
+	if (waitqueue_active(&wb->bdp_waitq))
+		wake_up_all(&wb->bdp_waitq);

Please add a comment explaining why this is being done here.

After writing out some dirty pages, this is a checkpoint to see
whether a balance has been established between the dirty speed and
the laundry speed. Tasks under throttling will be unthrottled once a
balance is in place.

A comment will be added.
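
Something along these lines, perhaps:

	/*
	 * Checkpoint after writing out dirty pages: if the laundry
	 * speed has caught up with the dirty speed, unthrottle the
	 * tasks waiting in cgwb_bdp().
	 */
	if (waitqueue_active(&wb->bdp_waitq))
		wake_up_all(&wb->bdp_waitq);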

 }

 /*
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1830,6 +1830,67 @@ pause:
 		wb_start_background_writeback(wb);
 }
+
+/**
+ * cgwb_bdp_should_throttle - tell if a wb should be throttled
+ * @wb: bdi_writeback to throttle
+ *
+ * To avoid the risk of exhausting the system free memory, we check
+ * and try hard to prevent too many dirty pages from being produced
+ * too soon.
+ *
+ * For cgroup writeback, it is essencially to keep an equilibrium

"it is essential"?

Yes Sir.

+ * between its dirty speed and laundry speed i.e. dirty pages are
+ * written out as fast as they are produced in an ideal state.
+ */
+static bool cgwb_bdp_should_throttle(struct bdi_writeback *wb)
+{
+	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
+
+	if (fatal_signal_pending(current))
+		return false;
+
+	gdtc.avail = global_dirtyable_memory();
+
+	domain_dirty_limits(&gdtc);
+
+	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
+		     global_node_page_state(NR_UNSTABLE_NFS) +
+		     global_node_page_state(NR_WRITEBACK);
+
+	if (gdtc.dirty < gdtc.bg_thresh)
+		return false;
+
+	if (!writeback_in_progress(wb))
+		wb_start_background_writeback(wb);

This is a bit ugly. Something called "bool cgwb_bdp_should_throttle()"
should just check whether we should throttle. But here it is, also
initiating writeback. That's an inappropriate thing for this function
to do?

It is the current bdp behavior, trying to keep dirty pages below the
user-configurable background threshold by waking up the flushers,
because no dirty page will be sent to disk without the flusher's
efforts; please see 143dfe8611a6 ("writeback: IO-less
balance_dirty_pages()").

Will try to find a chance to pull it out.

Also, we don't know *why* this is being done here, because there's no
code comment explaining the reasoning to us.

Will add a comment.
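
Perhaps something like:

	/*
	 * Dirty pages are over the background threshold and can only
	 * reach disk through the flusher, so kick it now rather than
	 * wait; see 143dfe8611a6 ("writeback: IO-less
	 * balance_dirty_pages()").
	 */
	if (!writeback_in_progress(wb))
		wb_start_background_writeback(wb);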


+	if (gdtc.dirty < gdtc.thresh)
+		return false;
+
+	/*
+	 * Throttle wb if there is the risk that wb's dirty speed is
+	 * running away from its laundry speed, better with statistical
+	 * error taken into account.
+	 */
+	return wb_stat(wb, WB_DIRTIED) >
+	       wb_stat(wb, WB_WRITTEN) + wb_stat_error();
+}
+
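
cgwb_bdp() itself is not quoted in this hunk; judging from the wakeup
added to wb_workfn() above, its shape is presumably something like the
following sketch (not the posted code; the HZ cap merely stands in for
the elastic interval described in the changelog):

	static void cgwb_bdp(struct bdi_writeback *wb)
	{
		/* throttle the dirtier until the two speeds rebalance */
		while (cgwb_bdp_should_throttle(wb))
			wait_event_timeout(wb->bdp_waitq,
					   !cgwb_bdp_should_throttle(wb),
					   HZ);
	}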

...

@@ -1888,29 +1945,38 @@ void balance_dirty_pages_ratelimited(str
 	 * 1000+ tasks, all of them start dirtying pages at exactly the same
 	 * time, hence all honoured too large initial task->nr_dirtied_pause.
 	 */
-	p = this_cpu_ptr(&bdp_ratelimits);
-	if (unlikely(current->nr_dirtied >= ratelimit))
-		*p = 0;
-	else if (unlikely(*p >= ratelimit_pages)) {
-		*p = 0;
-		ratelimit = 0;
-	}
+	dirty = this_cpu_ptr(&bdp_ratelimits);
+
 	/*
 	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
 	 * short-lived tasks (eg. gcc invocations in a kernel build) escaping
 	 * the dirty throttling and livelock other long-run dirtiers.
 	 */
-	p = this_cpu_ptr(&dirty_throttle_leaks);
-	if (*p > 0 && current->nr_dirtied < ratelimit) {
-		unsigned long nr_pages_dirtied;
-		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
-		*p -= nr_pages_dirtied;
-		current->nr_dirtied += nr_pages_dirtied;
+	leak = this_cpu_ptr(&dirty_throttle_leaks);
+
+	if (*dirty + *leak < ratelimit_pages) {
+		/*
+		 * nothing to do as it would take some more time to
+		 * eat out ratelimit_pages
+		 */
+		try_bdp = false;
+	} else {
+		try_bdp = true;
+
+		/*
+		 * bdp in flight helps detect dirty page hogs soon
+		 */

How? Please expand on this comment a lot.

We should be cautious here in the red zone: after paying the
ratelimit_pages price, we might soon have to tackle a deluge of dirty
page hogs.

Will cut it.

+		flights = this_cpu_ptr(&bdp_in_flight);
+
+		if ((*flights)++ & 1) {

What is that "& 1" doing?

It helps to tell whether a bdp is already in flight.

It would have been something like

	if (*flights == 0) {
		(*flights)++;
	} else {
		*flights = 0;
		*dirty = *dirty + *leak - ratelimit_pages;
		*leak = 0;
	}

but I was curious to see the flights in the long run.
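
Spelled out, the posted form relies on post-increment semantics: the
test sees the counter's old value, so the branch is taken on every
second call on a CPU, presumably with the same fold inside it (a
sketch; the branch body is elided in the hunk above):

	if ((*flights)++ & 1) {
		/* old value odd: a bdp round was already in flight */
		*dirty = *dirty + *leak - ratelimit_pages;
		*leak = 0;
	}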

Thanks
Hillf

 	}
 	preempt_enable();

-	if (unlikely(current->nr_dirtied >= ratelimit))
-		balance_dirty_pages(wb, current->nr_dirtied);
+	if (try_bdp)
+		cgwb_bdp(wb);

 	wb_put(wb);



A friendly note: your mail client is messing up the thread hierarchy
again (I think it was correct for a while):

Message-Id: <20191114093832.8504-1-hdanton@xxxxxxxx>
In-Reply-To: <20191112034227.3112-1-hdanton@xxxxxxxx>
References: <20191112034227.3112-1-hdanton@xxxxxxxx>

I assume a kernel developer can set up a mail client or switch to a sane one. Please don't prove me wrong. ;)

--

Thanks,

David / dhildenb