Re: Performance regression in write() syscall

From: Ingo Molnar
Date: Tue Feb 24 2009 - 11:52:19 EST



* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> > No, but I think it should be in arch code, and the
> > "_nocache" suffix should just be a hint to the architecture
> > that the destination is not so likely to be used.
>
> Yes. Especially since arch code is likely to need various
> arch-specific checks anyway (like the x86 code does about
> aligning the destination).

I'm inclined to do this in two or three phases: first apply the
fix from Salman in the form below.

In practice, if we get a 4K copy request, the likelihood is high
that it is part of a larger write and covers a full pagecache page.
The target is very unlikely to be misaligned; the source might
be, as it comes from user-space.

This portion of __copy_user_nocache() becomes largely
unnecessary:

ENTRY(__copy_user_nocache)
	CFI_STARTPROC
	cmpl $8,%edx
	jb 20f		/* less than 8 bytes, go to byte copy loop */
	ALIGN_DESTINATION
	movl %edx,%ecx
	andl $63,%edx
	shrl $6,%ecx
	jz 17f

And the tail portion becomes unnecessary too. Together that is
over a dozen instructions, so it is probably worth optimizing out.
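
To make the arithmetic concrete, here is a small user-space C sketch
(not kernel code) of what the prologue above computes. The helper
names split_len() and align_head_bytes() are made up for illustration,
and the 8-byte figure assumes that ALIGN_DESTINATION aligns the
destination to an 8-byte boundary, which is how I read the x86 macro.

```c
#include <stdint.h>

/* Hypothetical helpers, for illustration only; not kernel API. */
struct len_split {
	unsigned int blocks;	/* 64-byte blocks: shrl $6,%ecx */
	unsigned int tail;	/* leftover bytes: andl $63,%edx */
};

/* Mirror the movl/andl/shrl sequence on the byte count. */
static struct len_split split_len(unsigned int len)
{
	struct len_split s;

	s.blocks = len >> 6;
	s.tail = len & 63;
	return s;
}

/*
 * Bytes that would have to be copied one at a time to bring dst up
 * to an 8-byte boundary (assumed ALIGN_DESTINATION behaviour).
 */
static unsigned int align_head_bytes(uintptr_t dst)
{
	return (unsigned int)((8 - (dst & 7)) & 7);
}
```

For len = 4096 and a page-aligned destination, split_len() gives 64
blocks with a tail of 0, and align_head_bytes() gives 0, so neither
the alignment head nor the tail loop has any work to do.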

But I'd rather express this in terms of a separate
__copy_user_page_nocache function and keep the generic
implementation too.

I.e. like the second patch below. (not tested)

With this __copy_user_nocache() becomes unused - and once we are
happy with the performance characteristics of 4K non-temporal
copies, we can remove this more generic implementation.

Does this sound reasonable, or do you think we can be smarter
than this?

Ingo

-------------------->