Re: [PATCH] mm: percpu: Add PCPU_FC_FIXED to pcpu_fc for setting fixed pcpu_atom_size.

From: Yanmin Zhang
Date: Fri Apr 27 2012 - 04:56:10 EST


On Fri, 2012-04-27 at 09:09 +0800, Yanmin Zhang wrote:
> On Thu, 2012-04-26 at 15:49 -0700, Tejun Heo wrote:
> > Hello,
> >
> > On Thu, Apr 26, 2012 at 10:01:12AM +0800, Yanmin Zhang wrote:
> > > [ 0.000000] SMP: Allowing 2 CPUs, 0 hotplug CPUs
> > > [ 0.000000] nr_irqs_gsi: 85
> > > [ 0.000000] Allocating PCI resources starting at 40000000 (gap: 40000000:bec00000)
> > > [ 0.000000] setup_percpu: NR_CPUS:2 nr_cpumask_bits:2 nr_cpu_ids:2 nr_node_ids:1
> > > [ 0.000000] PERCPU: Embedded 12 pages/cpu @f6400000 s25280 r0 d23872 u2097152
> > > [ 0.000000] pcpu-alloc: s25280 r0 d23872 u2097152 alloc=1*4194304
> > > [ 0.000000] pcpu-alloc: [0] 0 1
> >
> > Heh, I was getting confused, forget the distance thing, so it's single
> > group w/ 4MiB allocation size.
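The numbers in the quoted boot log can be cross-checked with a little arithmetic. The sketch below is not kernel code; it only parses the `pcpu-alloc:` summary line shown above, assuming 4 KiB pages and assuming that s/r/d/u are the static, reserved, dynamic and unit sizes in bytes. The function name `parse_pcpu_alloc` is made up for illustration.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on i386


def parse_pcpu_alloc(line):
    """Extract the s/r/d/u sizes and the alloc=N*SIZE pair from a pcpu-alloc line."""
    m = re.search(r"s(\d+) r(\d+) d(\d+) u(\d+) alloc=(\d+)\*(\d+)", line)
    if m is None:
        raise ValueError("not a pcpu-alloc summary line")
    keys = ("static", "reserved", "dynamic", "unit", "allocs", "alloc_size")
    return dict(zip(keys, map(int, m.groups())))


info = parse_pcpu_alloc("pcpu-alloc: s25280 r0 d23872 u2097152 alloc=1*4194304")

# Bytes each CPU actually uses inside its unit.
used = info["static"] + info["reserved"] + info["dynamic"]
pages_per_cpu = used // PAGE_SIZE

# How many per-CPU units share one allocation, and the total allocation size.
units_per_alloc = info["alloc_size"] // info["unit"]
chunk_bytes = info["allocs"] * info["alloc_size"]

print(pages_per_cpu, units_per_alloc, chunk_bytes // (1 << 20))  # prints "12 2 4"
```

So each CPU uses 12 pages (48 KiB) out of a 2 MiB unit, and the two units together form the single 4 MiB allocation mentioned above.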
> >
> > > PERCPU: allocation failed, size=252 align=4, failed to allocate new chunk
> >
> > Which later fails percpu allocation due to vmalloc space exhaustion.
> > How long does that take to happen?
> It depends. Sometimes it fails 400 seconds after booting. We run MTBF and other
> stress tests. Sometimes percpu allocation fails even under non-stress testing.
> Most drivers and upper layers expect percpu allocation to succeed. When it does
> not, the kernel usually does not oops, but upper-layer applications stop working.
>
> >
> > > vmallocinfo is attached. From the vmallocinfo, we could find the VM space
> > > is fragmented. We would write another patch to clean it up.
> >
> > Whee... ah well, 128M isn't that big after all.
> Indeed, so we need to tune memory utilization carefully on i386.
> We also worked out patches for other places/drivers to fix other OOM issues.
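To illustrate why a fragmented 128M vmalloc space runs out of room for percpu chunks, here is a rough sketch. It assumes that every new chunk needs one contiguous 4 MiB region aligned to 4 MiB (the allocation size from the log). The fragmentation layout is entirely hypothetical and is not taken from the attached vmallocinfo; `aligned_slots` is an illustrative helper, not a kernel function.

```python
KiB = 1 << 10
MiB = 1 << 20


def round_up(x, align):
    """Round x up to the next multiple of align."""
    return -(-x // align) * align


def aligned_slots(holes, size, align):
    """Count non-overlapping, align-aligned slots of `size` bytes in the free holes."""
    count = 0
    for start, end in holes:
        pos = round_up(start, align)
        while pos + size <= end:
            count += 1
            pos = round_up(pos + size, align)
    return count


VMALLOC = 128 * MiB  # vmalloc space size mentioned in the thread
CHUNK = 4 * MiB      # allocation size from the boot log

# Hypothetical layout: one small 64 KiB mapping every 3 MiB.
used = [(k * 3 * MiB, k * 3 * MiB + 64 * KiB) for k in range(43)]
holes = [(a_end, b_start)
         for (_, a_end), (b_start, _) in zip(used, used[1:] + [(VMALLOC, VMALLOC)])]
free_bytes = sum(end - start for start, end in holes)

print(aligned_slots([(0, VMALLOC)], CHUNK, CHUNK))  # empty space: 32 slots
print(aligned_slots(holes, CHUNK, CHUNK))           # fragmented space: 0 slots
print(aligned_slots(holes, 128 * KiB, 32 * KiB) > 0)  # smaller chunks still fit
```

In this made-up layout more than 97% of the space is free, yet no 4 MiB chunk fits, while a (hypothetical) 128 KiB chunk aligned to 32 KiB still finds plenty of room.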
>
> >
> > > > > With PCPU_FC_PAGE, the system can't enter deep sleep states.
> > > >
> > > > Why?
> > >
> > > Medfield has 2 CPU threads. The core enters a deep C state, for example C6,
> > > only when both threads have entered it. When booting the kernel with
> > > percpu_alloc=page, the core often aborts C6 entry. We don't know why. C6 entry
> > > is aborted under many conditions; one is a pending interrupt. I suspect that
> > > page-size allocation might trigger more cache misses. Just before calling mwait
> > > to enter C6, we record some statistics, and that might trigger a cache miss
> > > that aborts C6 entry. It's just a _GUESS_.
> > >
> > > We tried atom_size values of 32k, 128k and 256k. None showed a power regression.
> >
> > So, the difference between EMBED and PAGE is how the first chunk which
> > contains all the static percpu variables and some dynamic area for
> > optimization is allocated. For EMBED, it's just kmalloc'd which means
> > that it piggy backs on the default kernel linear mapping thus avoiding
> > adding any extra TLB pressure. For PAGE, all those percpu areas end
> > up getting re-mapped in vmalloc area using 4k pages, so if TLB
> > pressure can affect entering C6, that could be it.
> Thanks for the explanation.
>
> >
> > > We can't fix the FC_PAGE power regression. To do so, we would need to contact
> > > many hardware architects. The current kernel supports FC_PAGE and PMD_SIZE, so
> > > why not allow the admin to choose other values?
> >
> > If this is something which is met in the field commonly, we need to
> > fix the default behavior rather than introducing some arcane boot
> > param.
> We just add a new way to input a value rather than introducing a new parameter.
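As a rough illustration of what "a new way to input a value" could look like, the sketch below parses a percpu_alloc= string that is either one of the existing keywords ("embed", "page") or a size such as "4K". This is not the patch itself. The `memparse` helper only mimics the K/M/G suffix handling of the kernel function of the same name, and both the "fixed" label and the page-multiple check are assumptions made for this example.

```python
def memparse(text):
    """Rough analogue of the kernel's memparse(): a number plus optional K/M/G suffix."""
    shifts = {"K": 10, "M": 20, "G": 30}
    text = text.strip()
    suffix = text[-1:].upper()
    if suffix in shifts:
        return int(text[:-1], 0) << shifts[suffix]
    return int(text, 0)


def parse_percpu_alloc(value, page_size=4096):
    """Return (mode, atom_size); atom_size is None for the keyword modes."""
    if value in ("embed", "page"):
        return (value, None)
    size = memparse(value)
    # Assumption for this sketch: a fixed atom size must be a non-zero
    # multiple of the page size.
    if size == 0 or size % page_size != 0:
        raise ValueError("atom size must be a multiple of the page size")
    return ("fixed", size)


print(parse_percpu_alloc("embed"))  # ('embed', None)
print(parse_percpu_alloc("4K"))     # ('fixed', 4096)
print(parse_percpu_alloc("256k"))   # ('fixed', 262144)
```

The point of the sketch is that the same boot parameter carries either a keyword or a size, so no additional parameter name is needed.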
>
> > IIRC, the reasons PMD_SIZE is used for atom_size are so that
> > percpu areas are aligned to PSE mapping, maybe later we can make use
> > of PSE mapping in vmalloc area too, and it didn't seem to hurt
> > anything.
> Well, vmalloc areas might use different prot values to map physical pages.
> So sharing one PMD huge page among many vmalloc areas might not be good.
>
> >
> > If the large unit size is becoming a problem on i386, we can just use
> > PAGE_SIZE as atom_size. Can you please verify that atom_size of 4k w/
> > EMBED also resolves the power issue?
> We are enabling Android ICS, which is based on kernel 3.0.8. There does not seem
> to be much change between 3.0.8 and the latest kernel.
> With 3.0.8, although we could set percpu_alloc=embed, atom_size would become
> PMD_SIZE.
> With our patch, we can run the experiment, since percpu_alloc=4K is easy to
> configure. We will let you know the test result for atom_size=4K && first_chunk_embedded.
Liu Shuo did initial testing of this scenario and found no power
regression.

Yanmin

