AMD dual core opetron optimization

From: kernel coder
Date: Mon Apr 30 2007 - 15:13:37 EST


hi,

I'm doing trying to write some optimized code for AMD dual core
opetron processor.But things are getting no where.I've installed
Fedora 5 with 2.6 series Linux kernel and 4 series GCC

Following are few lines of code which are consuming close to 100
cycles.Yes this is not the forum for such questions but i think people
on linux kernel and GCC are best to answer such type of questions.I'm
realy getting frustated and helpless ,that's why i've put question on
this forum.

/*******************************************/
/* these variables will be used for RDTSC instrucion */
uint64_t before, overhead, clocks;


/*ReadTsc funtion is given below */
before=ReadTsc();
before=ReadTsc();
before=ReadTsc();

overhead=ReadTsc()-before;

printf(" ReadTSC overerhead is %lu ",overhead);


unsigned int test;
unsigned long buffer [128];
buffer[12]=08;
buffer[13]=00;
buffer[23]=06;

/* starting cycles */

before=ReadTsc();

/**************************Start of Targeted code
******************************/

test= buffer[12] | buffer[13] | buffer[23] ;

switch( test )

{

case 12:
asm(" jmp proc_1");

case 13:
asm("jmp proc_2");
case 14:
asm("jmp proc_3");
case 15:
asm("jmp proc_4");
default :
asm("jmp proc_5");

}



asm(" proc_5:");

/**************************End of Targeted code ******************************/
/*current cycles */
clocks=ReadTsc() ;

clocks=clocks - before;
printf("\n cycles consumed %lu \n",clocks - overhead);

/**********************************/

The overhead varies from generally 360 to 395 cycles .Sometimes it
also reduces close to 270 cycles.

Cycles consumed by the targetd code varies from 20 to 100
cycles.Theoratically i thing cycles consumed should be less than
20.Then why so many cycles ? and the output vary from 20 to 100
cycles .Sometimes it crosses 100 cycles as well.

Sometimes the cycles consumed by targetted code become far less that
the RDTSC instrucion overhead.

Is there better way to write above code.I even used the prefetch
instruction before the targeted code to make sure that buffer is in
the L1 cache but no success.

The code for ReadTsc() is as follows.Please also tell me if its
correct way to measure cycles .


/*****************************************/

typedef long long __int64;

__int64 ReadTSC() {
int res[2]; // store 64 bit result here
#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
// Inline assembly in AT&T syntax
#if defined (_LP64) // 64 bit mode
__asm__ __volatile__ ( // serialize (save rbx)
"xorl %%eax,%%eax \n push %%rbx \n cpuid \n"
::: "%rax", "%rcx", "%rdx");
__asm__ __volatile__ ( // read TSC, store edx:eax in res
"rdtsc\n"
: "=a" (res[0]), "=d" (res[1]) );
__asm__ __volatile__ ( // serialize again
"xorl %%eax,%%eax \n cpuid \n pop %%rbx \n"
::: "%rax", "%rcx", "%rdx");
#else // 32 bit mode
__asm__ __volatile__ ( // serialize (save ebx)
"xorl %%eax,%%eax \n pushl %%ebx \n cpuid \n"
::: "%eax", "%ecx", "%edx");
__asm__ __volatile__ ( // read TSC, store edx:eax in res
"rdtsc\n"
: "=a" (res[0]), "=d" (res[1]) );
__asm__ __volatile__ ( // serialize again
"xorl %%eax,%%eax \n cpuid \n popl %%ebx \n"
::: "%eax", "%ecx", "%edx");
#endif
#else
// Inline assembly in MASM syntax
__asm {
xor eax, eax
cpuid // serialize
rdtsc // read TSC
mov dword ptr res, eax // store low dword in res[0]
mov dword ptr res+4, edx // store high dword in res[1]
xor eax, eax
cpuid // serialize again
};
#endif // __GNUC__
return *(__int64*)res; // return result
}


/*****************************************************************/


thanks,
shahzad
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/