RE: [arjanv@redhat.com: Re: [PATCH] shrink per_cpu_pages to fit 32byte cacheline]

From: Nakajima, Jun
Date: Thu Sep 23 2004 - 21:19:43 EST


>From: Marcelo Tosatti [mailto:marcelo.tosatti@xxxxxxxxxxxx]
>Sent: Thursday, September 23, 2004 7:12 AM
>To: linux-kernel@xxxxxxxxxxxxxxx
>Cc: Nakajima, Jun; akpm@xxxxxxxx; arjanv@xxxxxxxxxx; ak@xxxxxxx
>Subject: [arjanv@xxxxxxxxxx: Re: [PATCH] shrink per_cpu_pages to fit
>32byte cacheline]
>
>
>Forgot to CC linux-kernel, just in case someone else
>can have useful information on this matter.
>
>Andi says any additional overhead will be in the noise
>compared to cacheline saving benefit.
>
>***********
>
>Jun,
>
>We need some assistance here - you can probably help us.
>
>Within the Linux kernel we can benefit from changing some fields
>of commonly accessed data structures to 16 bit instead of 32 bits,
>given that the values for these fields never reach 2 ^ 16.
>
>Arjan warned me, however, that the prefix (in this case "data16") will
>cause an extra cycle in instruction decoding, per message above.

On the Pentium 4 core, this is not a big deal because it runs out of the
trace cache (i.e. instructions are decoded in advance). However, on the
Pentium III/M (aka P6) core (i.e. Pentium III, Banias, Dothan, Yonah,
etc.), the P6 core pays a non-negligible penalty that slows down
decoding, especially when an operand size prefix (0x66) changes the
number of bytes in an instruction (usually by changing the size of an
immediate in the instruction).

Jun
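
As an illustration of the length change described above, here is a
minimal sketch in C. The byte sequences are the standard 32-bit-mode
encodings of two instruction pairs; the helper names (body_len,
prefix_changes_length) are hypothetical and exist only for this example.
It compares encoded lengths and does not measure any decode penalty.

```c
#include <assert.h>
#include <stddef.h>

#define OPSIZE_PREFIX 0x66	/* operand size override prefix */

/* 32-bit code: add $0x12345678, %eax -> opcode 05 + 4-byte immediate */
const unsigned char add_eax_imm32[] = { 0x05, 0x78, 0x56, 0x34, 0x12 };
/* 32-bit code: add $0x1234, %ax -> prefix 66, opcode 05, 2-byte immediate */
const unsigned char add_ax_imm16[] = { 0x66, 0x05, 0x34, 0x12 };

/* 32-bit code: mov %eax, (%ebx) -> 89 03 (no immediate) */
const unsigned char mov_store32[] = { 0x89, 0x03 };
/* 32-bit code: mov %ax, (%ebx) -> 66 89 03 (no immediate) */
const unsigned char mov_store16[] = { 0x66, 0x89, 0x03 };

/* Bytes remaining once a leading 0x66 prefix, if present, is skipped. */
size_t body_len(const unsigned char *insn, size_t len)
{
	assert(len > 0);
	return insn[0] == OPSIZE_PREFIX ? len - 1 : len;
}

/*
 * Nonzero when the 16-bit form, ignoring its prefix byte, has a
 * different length than the 32-bit form of the same instruction.
 */
int prefix_changes_length(const unsigned char *insn16, size_t len16,
			  const unsigned char *insn32, size_t len32)
{
	return body_len(insn16, len16) != body_len(insn32, len32);
}
```

With these arrays, prefix_changes_length() is nonzero for the add pair
(the immediate shrinks from 4 bytes to 2) and zero for the mov pair
(only the prefix byte is added).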

>
>Can you confirm that please? We can't seem to be able to find
>it in Intel's documentation.
>
>By shrinking two fields of "per_cpu_pages" structure we can fit it
>in one 32-byte cacheline (<= Pentium III and probably several other
>embedded/whatnot architectures will benefit from such a change).
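
A minimal sketch of the size argument, in C. The two structures below
are hypothetical and are NOT the kernel's per_cpu_pages layout; the
field names, the helper names, and the 36-byte starting size are
assumptions chosen only to show how narrowing two 32-bit fields to 16
bits can bring a structure down to one 32-byte cacheline. It also
assumes the structure starts on a cacheline boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 32	/* 32-byte line, as on Pentium III and older */

/* Hypothetical layout with every field 32 bits wide: 36 bytes. */
struct hyp_wide {
	uint32_t other[7];	/* fields left unchanged */
	uint32_t small_a;	/* value assumed to stay below 2^16 */
	uint32_t small_b;	/* value assumed to stay below 2^16 */
};

/* Same layout with the two small fields narrowed: 32 bytes. */
struct hyp_narrow {
	uint32_t other[7];
	uint16_t small_a;
	uint16_t small_b;
};

_Static_assert(sizeof(struct hyp_wide) == 36, "unexpected padding");
_Static_assert(sizeof(struct hyp_narrow) == 32, "unexpected padding");

/* Cachelines touched by an object of this size that starts on a line. */
size_t lines_spanned(size_t size)
{
	return (size + CACHELINE - 1) / CACHELINE;
}

/* Narrowing is only safe while the stored value stays below 2^16. */
int fits_in_16_bits(uint32_t value)
{
	return value <= UINT16_MAX;
}
```

Here lines_spanned(sizeof(struct hyp_wide)) is 2 and
lines_spanned(sizeof(struct hyp_narrow)) is 1. A check such as
fits_in_16_bits() guards the assumption that the narrowed fields never
reach 2 ^ 16.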
>
>And we just shrank two fields of "struct pagevec" in a similar way
>in Andrew's -mm tree.
>
>I'm adding linux-kernel just in case someone else can have
>useful comments.
>
>Thanks.
>
>----- Forwarded message from Arjan van de Ven <arjanv@xxxxxxxxxx> -----
>
>From: Arjan van de Ven <arjanv@xxxxxxxxxx>
>Date: Tue, 14 Sep 2004 13:13:29 +0200
>To: Marcelo Tosatti <marcelo.tosatti@xxxxxxxxxxxx>
>Cc: akpm@xxxxxxxx, "Martin J. Bligh" <mbligh@xxxxxxxxxxx>,
> linux-mm@xxxxxxxxx
>In-Reply-To: <20040914093407.GA23935@xxxxxxxxxx>
>Subject: Re: [PATCH] shrink per_cpu_pages to fit 32byte cacheline
>Original-Recipient: rfc822;linux-mm@xxxxxxxxx
>X-Loop: owner-majordomo@xxxxxxxxx
>X-MIMETrack: Itemize by SMTP Server on USMail/Cyclades(Release
>6.5.1|January 21, 2004) at
> 09/14/2004 03:13:02
>
>On Tue, Sep 14, 2004 at 06:34:07AM -0300, Marcelo Tosatti wrote:
>> How come short access can cost 1 extra cycle? Because you need two
>> "read bytes" ?
>
>on an x86, a word (2byte) access will cause a prefix byte to the
>instruction, that particular prefix byte will take an extra cycle
>during execution of the instruction and potentially reduces the
>parallel decodability of instructions....
>
>
>
>----- End forwarded message -----