[RFC] Transparent on-demand memory setup initialization embedded in the (GFP) buddy allocator

From: Ingo Molnar
Date: Wed Jun 26 2013 - 05:23:03 EST



(Changed the subject, to make it more apparent what we are talking about.)

* Mike Travis <travis@xxxxxxx> wrote:

> On 6/25/2013 11:43 AM, H. Peter Anvin wrote:
> > On 06/25/2013 10:22 AM, Mike Travis wrote:
> >>
> >> On 6/25/2013 12:38 AM, Ingo Molnar wrote:
> >>>
> >>> * Nathan Zimmer <nzimmer@xxxxxxx> wrote:
> >>>
> >>>> On Sun, Jun 23, 2013 at 11:28:40AM +0200, Ingo Molnar wrote:
> >>>>>
> >>>>> That's 4.5 GB/sec initialization speed - that feels a bit slow and the
> >>>>> boot time effect should be felt on smaller 'a couple of gigabytes'
> >>>>> desktop boxes as well. Do we know exactly where the 2 hours of boot
> >>>>> time on a 32 TB system is spent?
> >>>>
> >>>> There are other several spots that could be improved on a large system
> >>>> but memory initialization is by far the biggest.
> >>>
> >>> My feeling is that deferred/on-demand initialization triggered from the
> >>> buddy allocator is the better long term solution.
> >>
> >> I haven't caught up with all of Nathan's changes yet (just
> >> got back from vacation), but there was an option to either
> >> start the memory insertion on boot, or trigger it later
> >> using the /sys/.../memory interface. There is also a monitor
> >> program that calculates the memory insertion rate. This was
> >> extremely useful to determine how changes in the kernel
> >> affected the rate.
> >>
> >
> > Sorry, I *totally* did not follow that comment. It seemed like a
> > complete non-sequitur?
> >
> > -hpa
>
> It was I who was not following the question. I'm still reverting
> back to "work mode".
>
> [There is more code in a separate patch that Nate has not sent
> yet that instructs the kernel to start adding memory as early
> as possible, or not. That way you can start the insertion process
> later and monitor its progress to determine how changes in the
> kernel affect that process. It is controlled by a separate
> CONFIG option.]

So, just to repeat (and expand upon) the solution hpa and I suggest:
it's not based on /sys, delayed-initialization lists or any similar
(essentially memory-hotplug based) approach.

It's a transparent on-demand initialization scheme, based on doing the
very early memory setup only in 1GB (or 2MB) steps - not in 4K steps
as we do today.

Any subsequent split-up initialization is done on demand, in alloc_pages()
et al: a batch of 512 (or 1024) struct page heads is initialized when an
uninitialized portion is first encountered.

This leaves the principal logic of early init largely untouched: we still
have the same amount of RAM during and after bootup, except that on 32 TB
systems we don't spend ~2 hours initializing 8,589,934,592 page heads.

This scheme could be implemented by introducing a new PG_initialized flag,
which is seen by an unlikely() branch in alloc_pages() and which triggers
the on-demand initialization of pages.

[ It could probably be made zero-cost for the post-initialization state:
we already check a bunch of rare PG_ flags, one more flag would not
introduce any new branch in the page allocation hot path. ]

It's a technically different solution from what was submitted in this
thread.

Cons:

- it works after bootup, via GFP. If done in a simple fashion it adds one
more branch to the GFP fastpath. [ If done a bit more cleverly it can
merge into an existing unlikely() branch and become essentially
zero-cost for the fastpath. ]

- it adds initialization non-determinism to GFP, to the tune of
initializing ~512 page heads when RAM is first used.

- initialization is done when memory is needed - not during or shortly
after bootup. This (slightly) increases first-use overhead. [I don't
think this factor is significant - and I think we'll quickly see
speedups to initialization, once the overhead becomes more easily
measurable.]

Pros:

- it's transparent to the boot process. ('free' shows the same full
amount of RAM all the time, there are no weird effects of RAM coming
online asynchronously. You see all the RAM you have - etc.)

- it helps the boot time of every single Linux system, not just large-RAM
ones. On a smallish, 4GB system memory init can take up precious
hundreds of milliseconds, so this is a practical issue.

- it spreads initialization overhead to later portions of the system's
lifetime: when there's typically more idle time and more parallelism
available.

- initialization overhead, because it's a natural part of first-time
memory allocation with this scheme, becomes more measurable (and thus
more prominently optimized) than any deferred lists processed in the
background.

- as an added bonus it probably speeds up your use case even more than the
patches you are providing: on a 32 TB system the primary initialization
would only have to enumerate memory, allocate page heads and buddy
bitmaps, and initialize the 1GB granular page heads: there are only
32,768 of them.

So unless I overlooked some factor this scheme would be unconditional
goodness for everyone.

Thanks,

Ingo