Re: plumbers session on profiling?

From: Fangrui Song
Date: Fri Jul 01 2022 - 15:34:19 EST


On 2022-07-01, Bill Wendling wrote:
On Fri, Jul 1, 2022 at 4:49 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
On Fri, Jul 01, 2022 at 03:17:54AM -0700, Bill Wendling wrote:
> On Fri, Jul 1, 2022 at 2:02 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > On Tue, Jun 28, 2022 at 07:08:48PM +0200, Jose E. Marchesi wrote:
> > >
> > > [Added linux-toolchains@vger in CC]
> > >
> > > It would be interesting to have some discussion in the Toolchains track
> > > on building the kernel with PGO/FDO. I have seen a rise in interest in
> > > the topic in several companies, but it would make very little sense if
> > > no kernel hacker is interested in participating... anybody?
> >
> > I know there's been a lot of work in this area, but none of it seems to
> > have trickled down to be easy enough for me to use it.
>
> We use an instrumented kernel to collect the data we need. It gives us
> the best payoff, because the profiling data is more fine-grained and
> accurate. (PGO does much more than make inlining decisions.)
>
> If I recall correctly, you previously suggested using sampling data.
> (Correct?) Is there a document or article that outlines that process?

IIRC Google has LBR sample driven PGO somewhere as well. ISTR that being
the whole motivation for that gruesome Zen3 BRS hack.

Google got me this: https://research.google.com/pubs/archive/45290.pdf


I strongly support adding instrumentation-based PGO to the mainline
kernel, but I vaguely recall that it was NAKed by Linus (because he
thought sample-based is better).

Right. However, there's a chicken-and-egg issue with AutoFDO for the
production kernel. We can't release a kernel that hasn't been compiled
with PGO/FDO. We could only release it in a test environment, in which
case we could use AutoFDO. However, the document says that AutoFDO
only reaches ~90% of the gains of FDO. They list some reasons for this, but
nonetheless I suspect that the delta would be too severe for us to
release the kernel.

As for LBR, that will work with Intel/AMD, but I thought that LBR
doesn't exist for Arm processors (my knowledge could be out of date on
this).

Some folks are trying Arm's Embedded Trace Macrocell (ETM).
I am not at all familiar with it, but retrieving profiles does not seem
easy. The effort needed seems even higher than for instrumentation-based
PGO.

Instrumentation-based PGO has the nice property that it works with all
architectures that the compiler supports and does not rely on hardware
support. In addition, it collects indirect call targets and string
operation sizes, which are very difficult or impossible to obtain with
sample-based PGO.

What would make PGO (sample-based or instrumented) easy enough for you
to use? What're the key elements missing?

-bw