Re: [patch] epoll use a single inode ...

From: Kyle Moffett
Date: Thu Mar 08 2007 - 20:37:08 EST


On Mar 08, 2007, at 02:24:04, Eric Dumazet wrote:
Kyle Moffett a écrit :
Prefetching is also fairly critical on a Power4 or G5 PowerPC system as they have a long memory latency; an L2-cache miss can cost 200+ cycles. On such systems the "dcbt" prefetch instruction brings in a single 128-byte cacheline and has no serializing effects whatsoever, making it ideal for use in a linked-list- traversal inner loop.

OK, 200 cycles...

But what is the cost of the conditional branch you added in prefetch (x) ?

if (!x) return;

(correctly predicted or not, but do powerPC have a BTB ?)

Well, PowerPC "dcbt" does prefetch() correctly, it doesn't ever raise exceptions, doesn't have any side effects, takes only enough CPU to decode the address, and is ignored if it would have to do anything other than load the cacheline from RAM. Prefetch streams are halted when they reach the end of a page boundary (no trapping to the MMU) and if the TLB entry isn't present then they would asynchronously abort. This is a suitable prefetch implementation for G5:

#define prefetch(x) __asm__ __volatile__ ("dcbt 0, %0"::"r"(x));

On GCC4+ the "__builtin_prefetch" built-in functions is probably mildly better tuned for most architectures:

#define PREFETCH_RD 0
#define PREFETCH_WR 1
#define PREFETCH_LOCALITY_NONE 0
#define PREFETCH_LOCALITY_LOW 1
#define PREFETCH_LOCALITY_MED 2
#define PREFETCH_LOCALITY_HIGH 3
#define prefetch(x, args...) __builtin_prefetch(x ,##args)

The pertinent GCC docs:
This function is used to minimize cache-miss latency by moving data into a cache before it is accessed. You can insert calls to __builtin_prefetch into code for which you know addresses of data in memory that is likely to be accessed soon. If the target supports them, data prefetch instructions will be generated. If the prefetch is done early enough before the access then the data will be in the cache by the time it is accessed.

The value of addr is the address of the memory to prefetch. There are two optional arguments, rw and locality. The value of rw is a compile-time constant one or zero; one means that the prefetch is preparing for a write to the memory address and zero, the default, means that the prefetch is preparing for a read. The value locality must be a compile-time constant integer between zero and three. A value of zero means that the data has no temporal locality, so it need not be left in the cache after the access. A value of three means that the data has a high degree of temporal locality and should be left in all levels of cache possible. Values of one and two mean, respectively, a low or moderate degree of temporal locality. The default is three.
for (i = 0; i < n; i++) {
a[i] = a[i] + b[i];
__builtin_prefetch (&a[i+j], 1, 1);
__builtin_prefetch (&b[i+j], 0, 1);
/* ... */
}

Data prefetch does not generate faults if addr is invalid, but the address expression itself must be valid. For example, a prefetch of p->next will not fault if p->next is not a valid address, but evaluation will fault if p is not a valid address.

If the target does not support data prefetch, the address expression is evaluated if it includes side effects but no other code is generated and GCC does not issue a warning.

Cheers,
Kyle Moffett

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/