Re: Linux-kernel examples for LKMM recipes

From: Paul E. McKenney
Date: Tue Oct 17 2017 - 17:55:35 EST


On Tue, Oct 17, 2017 at 05:03:01PM -0400, Alan Stern wrote:
> On Tue, 17 Oct 2017, Paul E. McKenney wrote:
>
> > On Tue, Oct 17, 2017 at 03:38:23PM -0400, Alan Stern wrote:
> > > On Tue, 17 Oct 2017, Paul E. McKenney wrote:
> > >
> > > > How about this?
> > > >
> > > > 0. Simple special cases
> > > >
> > > > If there is only one CPU on the one hand or only one variable
> > > > on the other, the code will execute in order. There are (as
> > > > usual) some things to be careful of:
> > > >
> > > > a. There are some aspects of the C language that are
> > > > unordered. For example, the compiler can output code
> > > > computing arguments of a multi-parameter function in
> > > > any order it likes, or even interleaved if it so chooses.
> > >
> > > That parses a little oddly. I wouldn't agree that the compiler outputs
> > > the code in any order it likes!
> >
> > When was the last time you talked to a compiler writer? ;-)
> >
> > > In fact, I wouldn't even mention the compiler at all. Just say that
> > > (with a few exceptions) the language doesn't specify the order in which
> > > the arguments of a function or operation should be evaluated. For
> > > example, in the expression "f(x) + g(y)", the order in which f and g
> > > are called is not defined; the object code is allowed to use either
> > > order or even to interleave the computations.
> >
> > Nevertheless, I took your suggestion:
> >
> > a. There are some aspects of the C language that are
> > unordered. For example, in the expression "f(x) + g(y)",
> > the order in which f and g are called is not defined;
> > the object code is allowed to use either order or even
> > to interleave the computations.
>
> Good.
>
> > > > b. Compilers are permitted to use the "as-if" rule.
> > > > That is, a compiler can emit whatever code it likes,
> > > > as long as the results appear just as if the compiler
> > > > had followed all the relevant rules. To see this,
> > > > compile with a high level of optimization and run
> > > > the debugger on the resulting binary.
> > >
> > > You might omit the last sentence. Furthermore, if the accesses don't
> > > use READ_ONCE/WRITE_ONCE then the code might not get the same result as
> > > if it had executed in order (even for a single variable!), and if you
> > > do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code
> > > it likes.
> >
> > Ah, I omitted an important qualifier:
> >
> > b. Compilers are permitted to use the "as-if" rule. That is,
> > a compiler can emit whatever code it likes, as long as
> > the results of a single-threaded execution appear just
> > as if the compiler had followed all the relevant rules.
> > To see this, compile with a high level of optimization
> > and run the debugger on the resulting binary.
>
> That's okay for the single-CPU case. I don't think it covers the
> multiple-CPU single-variable case correctly, though. If you don't use
> READ_ONCE or WRITE_ONCE, isn't the compiler allowed to tear the loads
> and stores? And won't that potentially cause the end result to be
> different from what you would get if the code had appeared to execute
> in order?

Ah, good point, I need yet another qualifier. How about the following?

b. Compilers are permitted to use the "as-if" rule. That is,
a compiler can emit whatever code it likes for normal
accesses, as long as the results of a single-threaded
execution appear just as if the compiler had followed
all the relevant rules. To see this, compile with a
high level of optimization and run the debugger on the
resulting binary.

I added "for normal accesses", which excludes READ_ONCE(), WRITE_ONCE(),
and atomics. This, in conjunction with the previously added
"single-threaded execution", means that yes, the compiler is permitted
to tear normal loads and stores. The reason is that a single-threaded
run could not tell the difference. Interrupt handlers or multiple
threads are required to detect load/store tearing.
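
To make the tearing point concrete, here is a rough userspace sketch,
not kernel code. It hand-models a torn 32-bit store as two 16-bit
halves, which is one of the things a compiler is allowed to do to a
normal store. Every name in it (SKETCH_READ_ONCE, fake_irq, torn_store,
and so on) is made up for illustration, the macro is only a simplified
stand-in for the kernel's READ_ONCE(), and __typeof__ assumes GCC or
Clang.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the kernel's READ_ONCE(); assumes GCC/Clang. */
#define SKETCH_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

#define OLD_VAL 0xAAAAAAAAu
#define NEW_VAL 0x55555555u

static uint32_t shared;   /* the one variable */
static uint32_t snapshot; /* what the pretend interrupt handler saw */

/* Hypothetical interrupt handler that samples the variable. */
static void fake_irq(void)
{
	snapshot = SKETCH_READ_ONCE(shared);
}

/* Hand-written model of a torn store: two 16-bit halves. */
static void torn_store(uint32_t val, void (*irq)(void))
{
	unsigned char *dst = (unsigned char *)&shared;
	unsigned char src[sizeof(val)];

	assert(sizeof(shared) == 4);
	memcpy(src, &val, sizeof(val));
	memcpy(dst, src, 2);         /* first half lands */
	if (irq)
		irq();               /* window in which tearing is visible */
	memcpy(dst + 2, src + 2, 2); /* second half lands */
}

/* Single-threaded run: only the final value is observable. */
int single_threaded_sees_new(void)
{
	shared = OLD_VAL;
	torn_store(NEW_VAL, NULL);
	return shared == NEW_VAL;
}

/* With an "interrupt" between the halves, a mixed value shows up. */
int irq_sees_torn_value(void)
{
	shared = OLD_VAL;
	snapshot = 0;
	torn_store(NEW_VAL, fake_irq);
	return snapshot != OLD_VAL && snapshot != NEW_VAL &&
	       shared == NEW_VAL;
}
```

Both functions return nonzero: the single-threaded caller cannot tell
that the store was torn, but the handler running between the two halves
sees a value that is neither the old one nor the new one.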

So, what am I still missing? ;-)

> > I have seen people (including kernel hackers) surprised by what optimizers
> > do, so I would prefer that the last sentence remain.
> >
> > > > c. If there is only one variable but multiple CPUs, all
> > > > accesses to that variable must be aligned and full sized.
> > >
> > > I would say that the variable is what needs to be aligned, not the
> > > accesses. (Although, if the variable is aligned and all the accesses
> > > are full sized, then they must necessarily be aligned as well.)
> >
> > I was thinking in terms of an unaligned 16-bit access to a 32-bit
> > variable.
>
> That wouldn't be full sized.
>
> > But how about this?
> >
> > c. If there is only one variable but multiple CPUs, all
>
> Extra "all". Otherwise okay.

Good catch, I removed the extra "all".

> > that variable must be properly aligned and all accesses
> > to that variable must be full sized.
> >
> > > > Variables that straddle cachelines or pages void your
> > > > full-ordering warranty, as do undersized accesses that
> > > > load from or store to only part of the variable.
> > >
> > > How can a variable straddle pages without also straddling cache lines?
> >
> > Well, a variable -can- straddle cachelines without straddling pages,
> > which justifies the "or". Furthermore, given that cacheline sizes have
> > been growing, but pages are still 4KB, it is probably only a matter
> > of time. ;-)
>
> By that time, we'll probably be using 64-KB pages. Or even bigger!

PowerPC's server builds have had a minimum page size of 64KB for quite
a few years. This helps in many cases, but of course hurts for those
occasional applications that insist on doing a pile of independent 4K
mappings. ;-)

So I would guess that the move from 4K pages to 64K (or whatever)
pages could be quite painful for some CPU families.
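
Coming back to the "straddle cachelines or pages" wording in item c,
the check itself can be sketched in a few lines. This is only an
illustration: the function names and the SKETCH_* sizes are assumptions
picked for the example (64-byte cachelines, 4KB and 64KB pages), not
values taken from any particular CPU.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes, for illustration only. */
#define SKETCH_CACHELINE 64u
#define SKETCH_PAGE_4K   4096u
#define SKETCH_PAGE_64K  65536u

/*
 * Nonzero if the object occupying [addr, addr + size) crosses a
 * multiple of boundary, that is, if its first and last bytes fall
 * in different boundary-sized blocks.
 */
int straddles(uintptr_t addr, size_t size, uintptr_t boundary)
{
	return addr / boundary != (addr + size - 1) / boundary;
}

/* Nonzero if addr is a multiple of the object's size. */
int is_naturally_aligned(uintptr_t addr, size_t size)
{
	return addr % size == 0;
}
```

With these assumed sizes, a 4-byte variable at offset 62 straddles a
cacheline but not a page, and one at offset 4094 straddles both a
cacheline and a 4KB page, but not a 64KB page. A naturally aligned
4-byte variable straddles none of them.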

Thanx, Paul