RE: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node

From: Elliott, Robert (Persistent Memory)
Date: Mon Nov 12 2018 - 17:15:55 EST




> -----Original Message-----
> From: Daniel Jordan <daniel.m.jordan@xxxxxxxxxx>
> Sent: Monday, November 12, 2018 11:54 AM
> To: Elliott, Robert (Persistent Memory) <elliott@xxxxxxx>
> Cc: Daniel Jordan <daniel.m.jordan@xxxxxxxxxx>; linux-mm@xxxxxxxxx;
> kvm@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; aarcange@xxxxxxxxxx;
> aaron.lu@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; alex.williamson@xxxxxxxxxx;
> bsd@xxxxxxxxxx; darrick.wong@xxxxxxxxxx; dave.hansen@xxxxxxxxxxxxxxx;
> jgg@xxxxxxxxxxxx; jwadams@xxxxxxxxxx; jiangshanlai@xxxxxxxxx;
> mhocko@xxxxxxxxxx; mike.kravetz@xxxxxxxxxx; Pavel.Tatashin@xxxxxxxxxxxxx;
> prasad.singamsetty@xxxxxxxxxx; rdunlap@xxxxxxxxxxxxx;
> steven.sistare@xxxxxxxxxx; tim.c.chen@xxxxxxxxx; tj@xxxxxxxxxx;
> vbabka@xxxxxxx
> Subject: Re: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> initialization within each node
>
> On Sat, Nov 10, 2018 at 03:48:14AM +0000, Elliott, Robert (Persistent
> Memory) wrote:
> > > -----Original Message-----
> > > From: linux-kernel-owner@xxxxxxxxxxxxxxx <linux-kernel-
> > > owner@xxxxxxxxxxxxxxx> On Behalf Of Daniel Jordan
> > > Sent: Monday, November 05, 2018 10:56 AM
> > > Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > > initialization within each node
> > >
...
> > > In testing, a reasonable value turned out to be about a quarter of the
> > > CPUs on the node.
> > ...
> > > +	/*
> > > +	 * We'd like to know the memory bandwidth of the chip to calculate
> > > +	 * the most efficient number of threads to start, but we can't.
> > > +	 * In testing, a good value for a variety of systems was a quarter
> > > +	 * of the CPUs on the node.
> > > +	 */
> > > +	nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);
> >
> >
> > You might want to base that calculation on and limit the threads to
> > physical cores, not hyperthreaded cores.
>
> Why? Hyperthreads can be beneficial when waiting on memory. That said, I
> don't have data that shows that in this case.

I think that's only beneficial if there are register-based calculations to do
while waiting. If both threads are just doing memory accesses, they'll both
stall, and there doesn't seem to be any benefit in having two contexts generate
the I/Os rather than one (at least on the systems I've used). I think it takes
longer to switch contexts than to just turn around the next I/O.
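For illustration, here is a rough userspace sketch (not the actual kernel code)
of what the suggested variant of the heuristic might look like: dividing the
node's CPU count by a hypothetical threads_per_core value, which in the kernel
would come from the SMT topology (sibling masks), before taking a quarter.

```c
#include <assert.h>

/* Round up n/d, mirroring the kernel's DIV_ROUND_UP() macro. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Hypothetical variant of the patch's heuristic: base the thread count
 * on physical cores rather than hyperthreaded (SMT) CPUs.  In the
 * kernel, nr_node_cpus would be cpumask_weight(cpumask) and
 * threads_per_core would be derived from topology sibling masks;
 * here both are plain parameters for illustration.
 */
static int deferred_init_threads(int nr_node_cpus, int threads_per_core)
{
	int nr_phys_cores = nr_node_cpus / threads_per_core;

	/* A quarter of the physical cores, with a minimum of one thread. */
	return DIV_ROUND_UP(nr_phys_cores, 4);
}
```

With SMT2 on a 64-CPU node this yields 8 threads instead of the 16 the
original quarter-of-all-CPUs calculation would start.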


---
Robert Elliott, HPE Persistent Memory