Re: [RFC PATCH] x86, vmlinux.lds: Add debug option to force all data sections aligned
From: Feng Tang
Date: Tue Nov 16 2021 - 00:57:26 EST
On Mon, Sep 27, 2021 at 03:04:48PM +0800, Feng Tang wrote:
[...]
> > >For data alignment, it has a huge impact on size and occupies more
> > >cache/TLB entries, plus it hurts some normal functionality like
> > >dynamic-debug. So I'm afraid it can only be used as a debug option.
> > >
> > >>In a similar vein, I think we should re-explore permanently enabling
> > >>cacheline-sized function alignment i.e. making something like
> > >>CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B the default. Ingo did some
> > >>research on that a while back:
> > >>
> > >> https://lkml.kernel.org/r/20150519213820.GA31688@xxxxxxxxx
> > >
> > >Thanks for sharing this, I learned a lot from it, and I wish I had
> > >known about this thread when we first checked strange regressions in 2019 :)
> > >
> > >>At the time, the main reported drawback of -falign-functions=64 was that
> > >>even small functions got aligned. But now I think that can be mitigated
> > >>with some new options like -flimit-function-alignment and/or
> > >>-falign-functions=64,X (for some carefully-chosen value of X).
> >
> > -falign-functions=64,7 should be about right, I guess.
[...]
> I cannot run it with 0Day's benchmark service right now, but I'm afraid
> there may be some performance change.
>
> Btw, I'm still interested in the 'selective isolation' method, which
> chooses a few .o files from different kernel modules and adds alignment
> to one function and one global data object in each .o file, setting up
> an isolation buffer so that any alignment change caused by the code
> before this .o will _not_ affect the alignment of the .o files after it.
>
> This will have a minimal size cost: for one .o file, the worst-case
> waste is 128 bytes, so even if we pick 128 .o files, the total cost is
> 8KB of text and 8KB of data space.
>
> And surely we need to test whether this method can really make kernel
> performance more stable; one testing method is to pick some reported
> "strange" performance change cases, and check whether they are gone
> with this method.
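To make the quoted "selective isolation" idea a bit more concrete, here
is a rough sketch of what one such isolation point could look like (the
file, function and variable names are made up for illustration, and 64
bytes is just one possible boundary, not a measured choice):

/* bar.c -- hypothetical translation unit picked as an isolation point */

/*
 * One global data object forced to a 64-byte boundary, so that
 * data-section layout changes in the objects linked before this one
 * stop propagating past this file.
 */
unsigned long bar_state __attribute__((aligned(64)));

/* One function forced to a 64-byte boundary, doing the same for .text. */
int __attribute__((aligned(64))) bar_probe(void)
{
	return bar_state ? 0 : -1;
}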
Some updates on the "selective isolation" experiments: I tried three
code-alignment related cases that have been discussed before, and found
that the method does help to reduce the performance bumps: one was cut
from 7.5% to 0.1%, and another from 3.1% to 1.3%.
The 3 cases are:
1. a y2038 code cleanup causing a +11.7% improvement in the 'mmap' test
of will-it-scale
https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/#r
2. a hugetlb fix causing a +15.9% improvement in the 'page_fault3' test
of will-it-scale
https://lore.kernel.org/lkml/20200114085637.GA29297@shao2-debian/#r
3. a one-line mm fix causing a -30.7% regression in the 'scheduler' test
of stress-ng
https://lore.kernel.org/lkml/20210427090013.GG32408@xsang-OptiPlex-9020/#r
These cases are old (one or two years), and case 3 can't be reproduced
now. Case 1's current performance delta is +3.1%, while case 2's is
+7.5%, so we experimented with cases 1 and 2.
The experiment is: find which file the patch touches, say a.c, then
choose the b.c that follows a.c in the Makefile, which means b.o will be
linked right after a.o (this is a simplification, as there are other
factors like special section definitions), and make one function of b.c
aligned on a 4096-byte boundary.
For case 2, the bisected commit c77c0a8ac4c only touches hugetlb.c, so
we made a debug patch for mempolicy.c, which follows it:
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 067cf7d3daf5..036c93abdf9b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -804,7 +804,7 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
 }
 
 /* Set the process memory policy */
-static long do_set_mempolicy(unsigned short mode, unsigned short flags,
+static long __attribute__((aligned(4096))) do_set_mempolicy(unsigned short mode, unsigned short flags,
 			     nodemask_t *nodes)
 {
 	struct mempolicy *new, *old;
With it, the performance delta is reduced from 7.5% to 0.1%.
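(For reference, the kernel's __aligned() helper from
<linux/compiler_attributes.h> expands to the same attribute, so if we
end up carrying such debug hunks, the declaration could equivalently be
spelled in the more idiomatic in-kernel form:

static long __aligned(4096) do_set_mempolicy(unsigned short mode, unsigned short flags,
					     nodemask_t *nodes)

The effect on function placement is the same either way.)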
And for case 1, we tried a similar approach (adding 128B alignment to
several files), and saw the performance change reduced from 3.1% to 1.3%.
So generally, this seems to be helpful for making kernel performance
more stable and more controllable. And the cost is not high either: if
we pick 32 files and make one function of each 128B aligned, the space
waste is 8KB in the worst case (128KB for 4096-byte alignment).
Thoughts?
Thanks,
Feng