Kernel optimizations... (PC-arcitechture/long)

Jukka Tapani Santala (e75644@UWasa.Fi)
Tue, 16 Sep 1997 22:45:13 +0300 (EET DST)


This message is in MIME format. The first part should be readable text,
while the remaining parts are likely unreadable without MIME-aware tools.
Send mail to mime@docserver.cac.washington.edu for more info.

---559023410-959030623-874439113=:7536
Content-Type: TEXT/PLAIN; charset=US-ASCII

Readprofile

Having caught a bad cold, I had a while to play around with things, so
the dice went up on Linux kernel, and I decided to take a look on what I
could optimize. First things first, though, ofcourse, so I took a look at
the readprofile-program for kernel profiling, and found out I could make
few changes there as well.

This is, by all means, a quick and dirty hack, but in concert with gdb
vmlinux and the -g compiled assembler-source of the modules it proved an
invaluable tool for finding what _really_ slowed things up in Linux... As
a fair warning, this is something that worked for me; most people won't
probably be satisfied with this.

In the process I found a slight bug in readprofile, though - in summing
the ticks it forgets the profile entry zero is used for granularity. This
makes all the addresses off by two bytes; however, this doesn't seem to
have any effect on result because all functions are null-padded or
otherwise exit without going to end. I've attached the patch to reaprofile
2.0 to the end of this mail.

uni2ascii

Next to the kernel-optimizations themselves. I'm not going to provide
patches for these, because being optimizations, I don't expect a lot of
people to agree with the way I've done them. Given quick descriptions,
though, any kernel-hacker worth their salts can repeat them. It would be
nicer if we got GCC auto-optimize these, though, but I can see trouble in
that happening.

First thing I noted was that during my conversion of several HPFS drives
to VFAT, readprofile showed that most of the kernels time was being spent
in uni2ascii. A quick look to the source gave a reason for this. Now, the
obivious optimization would be to form a complete unicode-translation
table to use, however in kernel we need to worry about optimizing space as
well, and we just don't have 64k of memory to spare for this, so...

Next thing in turn was getting the assembler source for the function in
fs/fat/dir.c. Using GCC 2.7.2.1 and equivalent libs on i486 and kernel
2.1.55, it turns out the compiler is a bit too faithful to the loop
exit-condition used : "while (*ip || ip[1]) {" translates into two
separate compares. Until such time as GCC realizes to combine these,
replacing the check with "while (*(short *)ip) {" leads to considerable
speed-up.

However, even this done the function is still unforgivably slow. Enters
something that may work in many other places of unicode-translation in
Linux kernel as well. Given that actually only two pages are defined in
the current code, further optimizations suggest itself, but with the
support for multi-page there, I went for a general solution.

In essence, knowing that most of the time there will be several characters
of the same unicode-page in line we'll predict that. Dividing the function
into two nested loops with the inner one breaking out if the page changes
letting the outer one fetch a new page gives _major_ speed-improvement in
most contexts. Unfortunately page zero seems to be used, so we need to
maintain the check for *(short *)ip in the inner loop. A shame.

The next optimization may be the most controversial, however, I believe
it was worth it, at least in my case. The code in the function behaves
very differently depending on whether uni_xlate in the driver is set or
not, and this check is done inside the loops - in this case the inner
loop - for _every_ letter in the filename. Tsk-tsk. GCC should really try
to optimize things like this out, but in this case I simply divided the
function into two different loop-constructs that are chosen between at
the start.

fat_readdirx

Okay, what next..? Well, surprising enough, fat_readdirx still pushes up
on the front... A quick look thru the new profiler-output, contrasted to
the gdb debug of vmlinux for finding the respective code, and finally
matched out from the assembler-output with debug symbols from dir.c
reveals an slightly surprising result : At least on my arcitecture the
conversions from FAT-filesystem charactersets & case to unix take major
cycles.

So, what to do..? Here the initial instinct might be right, and building
separate conversion-tables does indeed save some CPU here. However, we're
still losing about half-k of memory for the tables, which might be better
used elsewhere (For some strange reason the FAT filesystem uses different
mappings for the name and the extension).

Using the table lookups may also not be the fastest solution - and even if
it is, sharing the tables with some other similiar functions in the kernel
would provide more advantage for the memory. This is something I'll leave
for others to solve and find out.

Console

Okay, we've now made some major process in optimizing out the FAT-based
filesystems. For those who use them, the above optimizations may prove out
to be worth their weight in gold. There's no easy way to make a
subjective benchmark for them, but let's just say that in some
file-expensive activities the difference may be noticeable by eye.
Speaking of which, next we'll optimize something fairly visible ;)

A word of caution here : I have very little idea of the issues related
herein, however the changes described here worked on my system setup -
given that the video-adapter is rather old, there's good changes this
applies to most people with VGA/EGA video-card on intel-based machine.
Your mileage, however, may vary - those with more information, feel free
to butt in.

After doing some extensive work on the computer in shell-mode, I found
out that scrdown-function pressed itself on the front in readprofile -
output. This was a slight surprise to me - at least until I took a look
at the generated code. GCC does really terrible work here, but I guess
the kernel source is to blame in part for this.

First let's take a look at /usr/include/linux/selection.h which provides
many of the small parts used for building up the console access. Of
particular interest to us now are the scr_readw and scr_writew functions,
which are at the helm of the console CPU-usage. "normal VGA console
access" is what we're looking for, and the next comment pops up :

/*
* NOTE: "(long) addr < 0" tests for an Alpha kernel virtual address; this
* indicates a VC's backing store; otherwise, it's a bus memory address, for
* the VGA's screen memory, so we do the Alpha "swizzle"... :-)
*/

One wonders... if this is intended for Alpha, why aren't those bits of
code bounded out in #ifdef __alpha__ checks? Well, they aren't, and from
running for quite a while with them commented out it doesn't seem as if
leaving them out on non-alpha architectures causes any trouble. Slight
improvement, but we're not there yet...

Out next point of concern is the memcpyw() - function used extensively
for screen-scrolling. In fact, while we're at it, why not deal with
memsetw() at the same time - although there's a gotcha there, to which
I'll get promptly. But first memcpyw()... this particular function
translates into awful assembler-code; the target-address will be
maintained in the stack, and the addresses keep being "fixed" during
every loop.

So, what to do..? Personally, I could see only one thing to do to save
the day : using memcpy() function instead. The fact that this isn't done
in the code suggests there may be a reason for it, however doing so
hasn't manifested up as any negative effects on my machine at least as of
yet. I suppose some systems might have trouble with the generated 32-bit
transfers, but in that case we should have a way to instruct GCC to use
repz movsw instead.

On the negative side, GCC generates one unneccessary piece of code: at
the end of each memcpy(), the size is checked for parity, and missing
bytes are padded at the end. Because the size-operand for the memcpy()
command is already even, this is unneccessary. I guess it's not an easy
thing to optimize out for the compiler, and since it's out of the loop
it's little concern to me. Of course direct inline-assembler is one
choice, as we're dealing with non-portable code-section anyway.

Then over to the memsetw() calls. One would think that using memset()
here in the spirit of the last optimization would do, however since we
need to set both the character _and_ attribute we get into slight trouble
here - namely, a green background which I've come quite fond of
personally ;) If there's a command for setting the memory to specific
word instead of byte on GCC/lib, let me know; otherwise down to
inline-assy we go (using repz movsb is a bad choice for the CPU as well).

After applying these same changes to places in devices/char/console.c
where the memsetw and memcpyw functions are replaced by inline-code doing
the same thing - and remember to use __io_virt() on the video-pointers! -
you can, literally, see the difference. (In my case it's green - but
that's just because I'm too lazy to write the inline-assembler code for
word-fill;) In particular, after this an ls -lR on (V)FAT partition will
just fly. And sometimes you'd think that's what most people use their
computers for ;)

Comments?

Well, anything you'd have done differently..? Any major problems,
ideas..? Patches can be had by request, but as said I haven't included
them here, as I can't see a lot of people agreeing with the ways I've
chosen to go (I can already hear the shouts of "Kernel bloat", "Redundant
code!" and "Inline assembler! <grin>) but here's the mainlines at any
case. Now I just wish I could somehow optimize getphase() on the Adaptec
aha1520 SCSI-card... ;)

-Donwulff

Our cause is a secret within a secret,
a secret that only another secret can explain;
it is a secret about a secret veiled by a secret.
---559023410-959030623-874439113=:7536
Content-Type: TEXT/PLAIN; charset=US-ASCII; name="readprofile.patch"
Content-Transfer-Encoding: BASE64
Content-ID: <Pine.SOL.3.90.970916224513.7536G@ankkuri.uwasa.fi>
Content-Description:

LS0tIHJlYWRwcm9maWxlLmMJV2VkIE1heSAxNSAyMzoyODoxNiAxOTk2DQor
KysgcmVhZHByb2YuYwlUdWUgU2VwIDE2IDIxOjM0OjA1IDE5OTcNCkBAIC0z
NSw3ICszNSw3IEBADQogLyogVGhlc2UgYXJlIHRoZSBkZWZhdWx0cyAqLw0K
IHN0YXRpYyBjaGFyIGRlZmF1bHRtYXBbXT0iL3Vzci9zcmMvbGludXgvU3lz
dGVtLm1hcCI7DQogc3RhdGljIGNoYXIgZGVmYXVsdHByb1tdPSIvcHJvYy9w
cm9maWxlIjsNCi1zdGF0aWMgY2hhciBvcHRzdHJpbmdbXT0ibTpwOml0dmFy
ViI7DQorc3RhdGljIGNoYXIgb3B0c3RyaW5nW109Im06cDppdHZsYXJWIjsN
CiANCiB2b2lkIHVzYWdlKCkNCiB7DQpAQCAtNDUsNiArNDUsNyBAQA0KIAkJ
ICAiXHQgLXAgPHByby1maWxlPiAoZGVmYXVsdCA9IFwiJXNcIilcbiINCiAJ
CSAgIlx0IC1pICAgICAgICAgICAgcHJpbnQgb25seSBpbmZvIGFib3V0IHRo
ZSBzYW1wbGluZyBzdGVwXG4iDQogCQkgICJcdCAtdiAgICAgICAgICAgIHBy
aW50IHZlcmJvc2UgZGF0YVxuIg0KKwkJICAiXHQgLWwgICAgICAgICAgICBw
cmludCBpbiBsb25nIGZvcm1hdCAobG9hZCBwZXIgcG9pbnQpXG4iDQogCQkg
ICJcdCAtYSAgICAgICAgICAgIHByaW50IGFsbCBzeW1ib2xzLCBldmVuIGlm
IGNvdW50IGlzIDBcbiINCiAJCSAgIlx0IC1yICAgICAgICAgICAgcmVzZXQg
YWxsIHRoZSBjb3VudGVycyAocm9vdCBvbmx5KVxuIg0KIAkJICAiXHQgLVYg
ICAgICAgICAgICBwcmludCB2ZXJzaW9uIGFuZCBleGl0XG4iDQpAQCAtNzcs
OCArNzgsOCBAQA0KIHVuc2lnbmVkIGludCBmbl9hZGQsIG5leHRfYWRkOyAg
ICAgICAgICAgLyogY3VycmVudCBhbmQgbmV4dCBhZGRyZXNzICovDQogY2hh
ciBmbl9uYW1lW1NfTEVOXSwgbmV4dF9uYW1lW1NfTEVOXTsgICAvKiBjdXJy
ZW50IGFuZCBuZXh0IG5hbWUgKi8NCiBjaGFyIG1vZGVbOF07DQotaW50IGM7
DQotaW50IG9wdEFsbD0wLCBvcHRJbmZvPTAsIG9wdFJlc2V0PTAsIG9wdFZl
cmJvc2U9MDsNCitpbnQgYywgbm90eWV0PTAsIGxhcmdlc3Q9MDsNCitpbnQg
b3B0QWxsPTAsIG9wdEluZm89MCwgb3B0UmVzZXQ9MCwgb3B0VmVyYm9zZT0w
LCBvcHRMb25nPTA7DQogY2hhciBtYXBsaW5lW1NfTEVOXTsNCiBpbnQgbWFw
bGluZW5vPTE7DQogaW50IHBvcGVuTWFwOyAgIC8qIGZsYWcgdG8gdGVsbCBp
ZiBwb3BlbigpIGhhcyBiZWVuIHVzZWQgKi8NCkBAIC05Nyw2ICs5OCw3IEBA
DQogICAgICAgY2FzZSAncCc6IHByb0ZpbGU9b3B0YXJnOyBicmVhazsNCiAg
ICAgICBjYXNlICdhJzogb3B0QWxsKys7ICAgICAgIGJyZWFrOw0KICAgICAg
IGNhc2UgJ2knOiBvcHRJbmZvKys7ICAgICAgYnJlYWs7DQorICAgICAgY2Fz
ZSAnbCc6IG9wdExvbmcrKzsJYnJlYWs7DQogICAgICAgY2FzZSAncic6IG9w
dFJlc2V0Kys7ICAgICBicmVhazsNCiAgICAgICBjYXNlICd2Jzogb3B0VmVy
Ym9zZSsrOyAgIGJyZWFrOw0KICAgICAgIGNhc2UgJ1YnOiBwcmludGYoIiVz
IFZlcnNpb24gJXNcbiIscHJnbmFtZSxSRUxFQVNFKTsgZXhpdCgwKTsNCkBA
IC0xNDQsNiArMTQ2LDkgQEANCiAgICAgZXhpdCgwKTsNCiAgICAgfSANCiAN
CisgIGZvcihpbmRleD0xOyBpbmRleDwobGVuL3NpemVvZih1bnNpZ25lZCBp
bnQpKTsgaW5kZXgrKykgaWYoYnVmW2luZGV4XT5sYXJnZXN0KSBsYXJnZXN0
PWJ1ZltpbmRleF07DQorICBpbmRleD0xOw0KKw0KICAgdG90YWw9MDsNCiAN
CiAgIGlmICghKG1hcD1teW9wZW4obWFwRmlsZSwiciIsJnBvcGVuTWFwKSkp
DQpAQCAtMTg1LDIxICsxOTAsNDAgQEANCiAgICAgICB9DQogICAgIGlmICgq
bW9kZSE9J1QnICYmICptb2RlIT0ndCcpIGJyZWFrOyAvKiBvbmx5IHRleHQg
aXMgcHJvZmlsZWQgKi8NCiANCi0gICAgd2hpbGUgKGluZGV4IDwgKG5leHRf
YWRkLWFkZDApL3N0ZXApDQorICAgIHdoaWxlIChpbmRleCA8PSAobmV4dF9h
ZGQtYWRkMCkvc3RlcCkgew0KKyAgICAgIGlmIChvcHRMb25nKSB7DQorICAg
ICAgICBpZiAoYnVmW2luZGV4XSkgew0KKyAgICAgICAgICBpZihub3R5ZXQ9
PTEpDQorICAgICAgICAgICAgcHJpbnRmKCIlMDh4XG4iLCAoKGluZGV4LTIp
KnN0ZXApK2FkZDApOw0KKyAgICAgICAgICBpZihub3R5ZXQ+MSkgcHJpbnRm
KCI9PT09PT09PSAlMDh4XG4iLCBub3R5ZXQqc3RlcCk7DQorICAgICAgICAg
IHByaW50ZigiJTA4eCAlNGkgIiwgKChpbmRleC0xKSpzdGVwKSthZGQwLCBi
dWZbaW5kZXhdKTsNCisgICAgICAgICAgZm9yKG5vdHlldD0wOyAobm90eWV0
PDYwKSYmKG5vdHlldDwoYnVmW2luZGV4XS8obGFyZ2VzdC82MCkpKTsNCisg
ICAgICAgICAgICBwcmludGYoIioiKSwgbm90eWV0KyspOw0KKyAgICAgICAg
ICBwcmludGYoIlxuIik7DQorICAgICAgICAgIG5vdHlldD0wOw0KKyAgICAg
ICAgfSBlbHNlDQorICAgICAgICAgIG5vdHlldCsrOw0KKyAgICAgIH0NCiAg
ICAgICB0aGlzICs9IGJ1ZltpbmRleCsrXTsNCisgICAgfQ0KICAgICB0b3Rh
bCArPSB0aGlzOw0KIA0KICAgICBmbl9sZW4gPSBuZXh0X2FkZC1mbl9hZGQ7
DQotICAgIGlmIChmbl9sZW4gJiYgKHRoaXMgfHwgb3B0QWxsKSkNCisgICAg
aWYgKGZuX2xlbiAmJiAodGhpcyB8fCBvcHRBbGwgfHwgb3B0TG9uZykpDQog
ICAgICAgew0KKyAgICBpZihub3R5ZXQpIHByaW50ZigiPT09PT09PT0gJTA4
eFxuIiwgbm90eWV0KjIpOw0KICAgICAgIGlmIChvcHRWZXJib3NlKQ0KIAlw
cmludGYoIiUwOHggJS00MHMgJTZpICU4LjRmXG4iLA0KIAkgICAgICAgZm5f
YWRkLGZuX25hbWUsdGhpcyx0aGlzLyhkb3VibGUpZm5fbGVuKTsNCisgICAg
ICBlbHNlIGlmIChvcHRMb25nKQ0KKwlwcmludGYoIiUwOHgtJTA4eCAlLTQw
cyAlNmkgJTguNGZcbiIsDQorCSAgICAgICBmbl9hZGQsZm5fYWRkK2ZuX2xl
bi0xLGZuX25hbWUsdGhpcyx0aGlzLyhkb3VibGUpZm5fbGVuKTsNCiAgICAg
ICBlbHNlDQogCXByaW50ZigiJTZpICUtNDBzICU4LjRmXG4iLA0KIAkgICAg
ICAgdGhpcyxmbl9uYW1lLHRoaXMvKGRvdWJsZSlmbl9sZW4pOw0KICAgICAg
IH0NCiAgICAgZm5fYWRkPW5leHRfYWRkOyBzdHJjcHkoZm5fbmFtZSxuZXh0
X25hbWUpOw0KKyAgICBub3R5ZXQ9MDsNCiAgICAgfQ0KICAgLyogdHJhaWxl
ciAqLw0KICAgaWYgKG9wdFZlcmJvc2UpDQpAQCAtMjA4LDcgKzIzMiw5IEBA
DQogICBlbHNlDQogICAgIHByaW50ZigiJTZpICUtNDBzICU4LjRmXG4iLA0K
IAkgICB0b3RhbCwidG90YWwiLHRvdGFsLyhkb3VibGUpKGZuX2FkZC1hZGQw
KSk7DQotCQ0KKw0KKyAgcHJpbnRmKCIlaVxuIiwgbGFyZ2VzdCk7DQorCSAg
IAkNCiAgIHBvcGVuTWFwID8gcGNsb3NlKG1hcCkgOiBmY2xvc2UobWFwKTsN
CiAgIGV4aXQoMCk7DQogfQ0K
---559023410-959030623-874439113=:7536--