Re: [PATCH V3 2/2] powerpc/pseries: init fault_around_order for pseries

From: Ingo Molnar
Date: Tue Apr 29 2014 - 03:06:45 EST



* Madhavan Srinivasan <maddy@xxxxxxxxxxxxxxxxxx> wrote:

> Performance data for different FAULT_AROUND_ORDER values from a
> 4-socket Power7 system (128 threads, 128GB memory). perf stat with a
> repeat count of 5 is used to get the stddev values. The tests ran on a
> v3.14 kernel (Baseline) and on v3.15-rc1 with the different fault
> around order values.
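
(Side note: 'repeat of 5' presumably means something like
'perf stat -r 5 -- make -j64'; the '( +- N% )' stddev figures in the
tables below are perf stat's repeat-run output.)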
>
> FAULT_AROUND_ORDER           Baseline              1              3              4              5              8
>
> Linux build (make -j64)
> minor-faults               47,437,359     35,279,286     25,425,347     23,461,275     22,002,189     21,435,836
> times in seconds        347.302528420  344.061588460  340.974022391  348.193508116  348.673900158  350.986543618
> stddev for time          ( +- 1.50% )   ( +- 0.73% )   ( +- 1.13% )   ( +- 1.01% )   ( +- 1.89% )   ( +- 1.55% )
> %chg time to baseline               -          -0.9%          -1.8%           0.2%          0.39%          1.06%

Probably too noisy.

> Linux rebuild (make -j64)
> minor-faults                  941,552        718,319        486,625        440,124        410,510        397,416
> times in seconds         30.569834718   31.219637539   31.319370649   31.434285472   31.972367174   31.443043580
> stddev for time          ( +- 1.07% )   ( +- 0.13% )   ( +- 0.43% )   ( +- 0.18% )   ( +- 0.95% )   ( +- 0.58% )
> %chg time to baseline               -           2.1%           2.4%           2.8%          4.58%          2.85%

Here it looks like a speedup. Optimal value: 5+.

> Binutils build (make all -j64 )
> minor-faults                  474,821        371,380        269,463        247,715        235,255        228,337
> times in seconds         53.882492432   53.584289348   53.882773216   53.755816431   53.607824348   53.423759642
> stddev for time          ( +- 0.08% )   ( +- 0.56% )   ( +- 0.17% )   ( +- 0.11% )   ( +- 0.60% )   ( +- 0.69% )
> %chg time to baseline               -         -0.55%           0.0%         -0.23%         -0.51%         -0.85%

Probably too noisy, but looks like a potential slowdown?

> Two synthetic tests: access every word in a file in sequential/random order.
>
> Sequential access 16GiB file
> FAULT_AROUND_ORDER           Baseline              1              3              4              5              8
> 1 thread
> minor-faults                  263,148        131,166         32,908         16,514          8,260          1,093
> times in seconds         53.091138345   53.113191672   53.188776177   53.233017218   53.206841347   53.429979442
> stddev for time          ( +- 0.06% )   ( +- 0.07% )   ( +- 0.08% )   ( +- 0.09% )   ( +- 0.03% )   ( +- 0.03% )
> %chg time to baseline               -          0.04%          0.18%          0.26%          0.21%          0.63%

Speedup, optimal value: 8+.
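
For anyone wanting to reproduce this: the sequential test presumably
boils down to a loop like the one below (a sketch with invented
details - the real harness may differ, and the random variant would
visit the words in shuffled order):

  /*
   * Sketch of the sequential-access test: mmap a file and read every
   * word once; the first touch of each page takes a minor fault.
   */
  #include <fcntl.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
          struct stat st;
          volatile long sum = 0;
          long *p;
          size_t i;
          int fd;

          fd = open(argv[1], O_RDONLY);
          if (fd < 0 || fstat(fd, &st) < 0)
                  return 1;

          p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
          if (p == MAP_FAILED)
                  return 1;

          for (i = 0; i < st.st_size / sizeof(long); i++)
                  sum += p[i];    /* sequential: every word, in order */

          munmap(p, st.st_size);
          close(fd);
          return 0;
  }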

> 8 threads
> minor-faults                2,097,267      1,048,753        262,237        131,397         65,621          8,274
> times in seconds         55.173790028   54.591880790   54.824623287   54.802162211   54.969680503   54.790387715
> stddev for time          ( +- 0.78% )   ( +- 0.09% )   ( +- 0.08% )   ( +- 0.07% )   ( +- 0.28% )   ( +- 0.05% )
> %chg time to baseline               -         -1.05%         -0.63%         -0.67%         -0.36%         -0.69%

Looks like a regression?

> 32 threads
> minor-faults                8,388,751      4,195,621      1,049,664        525,461        262,535         32,924
> times in seconds         60.431573046   60.669110744   60.485336388   60.697789706   60.077959564   60.588855032
> stddev for time          ( +- 0.44% )   ( +- 0.27% )   ( +- 0.46% )   ( +- 0.67% )   ( +- 0.31% )   ( +- 0.49% )
> %chg time to baseline               -          0.39%          0.08%          0.44%         -0.58%          0.25%

Probably too noisy.

> 64 threads
> minor-faults               16,777,409      8,607,527      2,289,766      1,202,264        598,405         67,587
> times in seconds         96.932617720  100.675418760  102.109880836  103.881733383  102.580199555  105.751194041
> stddev for time          ( +- 1.39% )   ( +- 1.06% )   ( +- 0.99% )   ( +- 0.76% )   ( +- 1.65% )   ( +- 1.60% )
> %chg time to baseline               -          3.86%          5.34%          7.16%          5.82%          9.09%

Speedup, optimal value: 4+

> 128 threads
> minor-faults               33,554,705     17,375,375      4,682,462      2,337,245      1,179,007        134,819
> times in seconds        128.766704495  115.659225437  120.353046307  115.291871270  115.450886036  113.991902150
> stddev for time          ( +- 2.93% )   ( +- 0.30% )   ( +- 2.93% )   ( +- 1.24% )   ( +- 1.03% )   ( +- 0.70% )
> %chg time to baseline               -        -10.17%         -6.53%        -10.46%        -10.34%        -11.47%

Rather significant regression at order 1 already.

> Random access 1GiB file
> FAULT_AROUND_ORDER           Baseline              1              3              4              5              8
> 1 thread
> minor-faults                   17,155          8,678          2,126          1,097            581            134
> times in seconds         51.904430523   51.658017987   51.919270792   51.560531738   52.354431597   51.976469502
> stddev for time          ( +- 3.19% )   ( +- 1.35% )   ( +- 1.56% )   ( +- 0.91% )   ( +- 1.70% )   ( +- 2.02% )
> %chg time to baseline               -         -0.47%          0.02%         -0.66%          0.86%          0.13%

Probably too noisy.

> 8 threads
> minor-faults                  131,844         70,705         17,457          8,505          4,251            598
> times in seconds         58.162813956   54.991706305   54.952675791   55.323057492   54.755587379   53.376722828
> stddev for time          ( +- 1.44% )   ( +- 0.69% )   ( +- 1.23% )   ( +- 2.78% )   ( +- 1.90% )   ( +- 2.91% )
> %chg time to baseline               -         -5.45%         -5.52%         -4.88%         -5.86%         -8.22%

Regression.

> 32 threads
> minor-faults                  524,437        270,760         67,069         33,414         16,641          2,204
> times in seconds         69.981777072   76.539570015   79.753578505   76.245943618   77.254258344   79.072596831
> stddev for time          ( +- 2.81% )   ( +- 1.95% )   ( +- 2.66% )   ( +- 0.99% )   ( +- 2.35% )   ( +- 3.22% )
> %chg time to baseline               -          9.37%         13.96%          8.95%         10.39%         12.98%

Speedup, optimal value hard to tell due to noise - 3+ or 8+.

> 64 threads
> minor-faults                1,049,117        527,451        134,016         66,638         33,391          4,559
> times in seconds        108.024517536  117.575067996  115.322659914  111.943998437  115.049450815  119.218450840
> stddev for time          ( +- 2.40% )   ( +- 1.77% )   ( +- 1.19% )   ( +- 3.29% )   ( +- 2.32% )   ( +- 1.42% )
> %chg time to baseline               -          8.84%          6.75%          3.62%           6.5%          10.3%

Speedup, optimal value again hard to tell due to noise.

> 128 threads
> minor-faults                2,097,440      1,054,360        267,042        133,328         66,532          8,652
> times in seconds        155.055861167  153.059625968  152.449492156  151.024005282  150.844647770  155.954366718
> stddev for time          ( +- 1.32% )   ( +- 1.14% )   ( +- 1.32% )   ( +- 0.81% )   ( +- 0.75% )   ( +- 0.72% )
> %chg time to baseline               -         -1.28%         -1.68%         -2.59%         -2.71%          0.57%

Slowdown for most orders.

> In case of kernel compilation, fault around orders (fao) of 1 and 3
> provide faster compilation times than a value of 4. On closer look,
> an fao of 3 has higher gains. In case of the sequential access
> synthetic tests, an fao of 1 has higher gains, and in the random
> access test, an fao of 3 has marginal gains. Going by compilation
> time, an fao value of 3 is suggested in this patch for the pseries
> platform.

So I'm really at a loss to understand where you get the optimal value
of '3' from. The data does not seem to match your claim that orders of
'1 and 3 provide faster compilation times than a value of 4':

> FAULT_AROUND_ORDER           Baseline              1              3              4              5              8
>
> Linux rebuild (make -j64)
> minor-faults                  941,552        718,319        486,625        440,124        410,510        397,416
> times in seconds         30.569834718   31.219637539   31.319370649   31.434285472   31.972367174   31.443043580
> stddev for time          ( +- 1.07% )   ( +- 0.13% )   ( +- 0.43% )   ( +- 0.18% )   ( +- 0.95% )   ( +- 0.58% )
> %chg time to baseline               -           2.1%           2.4%           2.8%          4.58%          2.85%

Orders 5 and 8, and probably 6 and 7 as well, are better than 4.

3 is probably _slower_ than the current default - but it's hard to
tell due to inherent noise.

But the other two build tests were too noisy, and if anything they
showed a slowdown.

> Worst case scenario: we touch one page every 16M to demonstrate overhead.
>
> Touch only one page in page table in 16GiB file
> FAULT_AROUND_ORDER           Baseline              1              3              4              5              8
> 1 thread
> minor-faults                    1,104          1,090          1,071          1,068          1,065          1,063
> times in seconds          0.006583298    0.008531502    0.019733795    0.036033763    0.062300553    0.406857086
> stddev for time          ( +- 2.79% )   ( +- 2.42% )   ( +- 3.47% )   ( +- 2.81% )   ( +- 2.01% )   ( +- 1.33% )
> 8 threads
> minor-faults                    8,279          8,264          8,245          8,243          8,239          8,240
> times in seconds          0.044572398    0.057211811    0.107606306    0.205626815    0.381679120    2.647979955
> stddev for time          ( +- 1.95% )   ( +- 2.98% )   ( +- 1.74% )   ( +- 2.80% )   ( +- 2.01% )   ( +- 1.86% )
> 32 threads
> minor-faults                   32,879         32,864         32,849         32,845         32,839         32,843
> times in seconds          0.197659343    0.218486087    0.445116407    0.694235883    1.296894038    9.127517045
> stddev for time          ( +- 3.05% )   ( +- 3.05% )   ( +- 4.33% )   ( +- 3.08% )   ( +- 3.75% )   ( +- 0.56% )
> 64 threads
> minor-faults                   65,680         65,664         65,646         65,645         65,640         65,647
> times in seconds          0.455537304    0.489688780    0.866490093    1.427393118    2.379628982   17.059295051
> stddev for time          ( +- 4.01% )   ( +- 4.13% )   ( +- 2.92% )   ( +- 1.68% )   ( +- 1.79% )   ( +- 0.48% )
> 128 threads
> minor-faults                  131,279        131,265        131,250        131,245        131,241        131,254
> times in seconds          1.026880651    1.095327536    1.721728274    2.808233068    4.662729948   31.732848290
> stddev for time          ( +- 6.85% )   ( +- 4.09% )   ( +- 1.71% )   ( +- 3.45% )   ( +- 2.40% )   ( +- 0.68% )

There are no '%chg' values shown, but the slowdown looks significant,
and this is the worst case: for example, with 1 thread, order 3 looks
about 3x slower than order 0 (0.0197s vs. 0.0066s).
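
In terms of the sketch further above, this worst case just strides
through the mapping (again, invented details):

          /*
           * Touch one page every 16 MiB: nearly every page that
           * fault-around maps speculatively is never used.
           */
          for (i = 0; i < st.st_size / sizeof(long);
               i += (16UL << 20) / sizeof(long))
                  sum += p[i];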

All in all, looking at your latest data, I don't think the conclusion
from the first version of this optimization patch from a month ago
holds anymore:

> + /* Measured on a 4 socket Power7 system (128 Threads and 128GB memory) */
> + fault_around_order = 3;

The data is rather conflicting and inconclusive, and if it shows a
sweet spot, it's not at order 3. New data should in general trigger a
reanalysis of your original optimization value.
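
For context, some arithmetic (mine, not from the patch): on the
v3.15-rc1 side, fault_around_order is the log2 of the number of pages
mapped in around the faulting address, roughly:

  /* Paraphrased from mm/memory.c: */
  static inline unsigned long fault_around_pages(void)
  {
          return 1UL << fault_around_order;
  }

So, assuming the usual 64K pages on pseries, order 3 means 8 pages,
i.e. a 512 KiB fault-around window - already 8x the 64 KiB window that
the default of order 4 gives with 4K pages.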

I'm starting to suspect that maybe workloads ought to be given a
choice in this matter, via madvise() or such.
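
We already have madvise() hints that steer file readahead, so
something analogous for fault-around is conceivable - the second flag
below is purely made up, for illustration:

  #include <sys/mman.h>

  /*
   * Existing hint: expect random access, which disables file
   * readahead on the mapping.
   */
  madvise(addr, length, MADV_RANDOM);

  /* A hypothetical per-VMA fault-around hint could look similar: */
  madvise(addr, length, MADV_NO_FAULT_AROUND);    /* invented flag */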

Thanks,

Ingo