Hello,
can anyone confirm or deny my conclusions about execution times on
different CPUs?
There's a tiny program which does nothing except increment a
particular variable. There is one such variable per thread. I'm
running this program on different CPUs to observe the
false sharing phenomenon.
computers:
1. single core [ AMD Athlon XP ]
2. single core HT [ Pentium IV ]
3. dual core separate caches [AMD Turion X2]
4. dual core common L2 cache [Core 2 DUO ]
5. 2xdual core separate caches [ 2x Opteron ]
notation: ./a.out N means that N threads are spawned; the amount of work
is proportional to the number of threads, so execution time should
increase with N.
1.
./a.out 1
real 0m1.447s
user 0m1.440s
./a.out 2
real 0m2.758s
user 0m2.744s
./a.out 4
real 0m5.382s
user 0m5.364s
note: linear scaling is what was to be expected
2.
./a.out 1
real 0m0.617s
user 0m0.616s
./a.out 2
real 0m1.164s
user 0m2.244s
./a.out 4
real 0m2.277s
user 0m4.464s
note: same as above
3.
./a.out 1
real 0m1.011s
user 0m1.008s
./a.out 2
real 0m5.108s
user 0m8.817s
./a.out 4
real 0m10.790s
user 0m19.573s
note: non-linear growth - false sharing
4.
./a.out 1
real 0m0.692s
user 0m0.692s
./a.out 2
real 0m1.025s
user 0m1.588s
./a.out 4
real 0m1.993s
user 0m3.524s
note: interesting example. Time grows at a slower rate than in
example 3. Does it mean that in CPUs with a common L2 cache false sharing
takes place only at the L1 level? And is a common cache better than
separate ones? What are the drawbacks of a common cache? (reduced
transfer rate per core?)
5.
./a.out 1
real 0m0.544s
user 0m0.540s
./a.out 2
real 0m1.787s
user 0m2.952s
./a.out 4
real 0m7.460s
user 0m22.817s
note: why is there such a big jump from 2 to 4? Is it because false
sharing occurs between two separate chips? (unfortunately I do not have
access to a 2x single core machine)
*******************************
And now the thing that concerns me most. I realized that I forgot to
set an optimization flag in gcc, namely "-O3". Every architecture but
one responded to this change exactly as it should: the Pentium IV HT.
2'.
./a.out 1
real 0m0.269s
user 0m0.268s
note: time drop due to -O3, everything's ok
./a.out 2
real 0m2.767s
user 0m5.436s
note: what the hell??
./a.out 4
real 0m5.704s
user 0m11.261s
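One experiment that might narrow this down (a sketch only, not something measured here): keep the same number of read-modify-write operations per thread, but do them on a volatile local that lives on each thread's own stack, and store to the shared array just once at the end. If the 2-thread time drops back in line, the per-iteration stores to neighbouring bytes are the likely culprit. The names local_job, local_worker and run_local are made up for this illustration.

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct {
    unsigned char *out;     /* shared byte array, one byte per thread */
    unsigned long pos;      /* index of this thread's byte */
    long iterations;
} local_job;

static void *local_worker(void *ptr)
{
    local_job *job = ptr;
    /* volatile keeps the load/add/store in the loop under -O3, but the
       variable sits on this thread's private stack, not in the shared array */
    volatile unsigned char acc = job->out[job->pos];
    for (long i = 0; i < job->iterations; i++)
        acc += 1;
    job->out[job->pos] = acc;   /* single store to the shared array */
    return NULL;
}

/* Hypothetical helper: returns 0 on success, -1 on allocation or thread errors. */
int run_local(int nthreads, long iterations, unsigned char *out)
{
    pthread_t *threads = malloc(nthreads * sizeof *threads);
    local_job *jobs = malloc(nthreads * sizeof *jobs);
    int started = 0, rc = -1;

    if (threads && jobs) {
        for (; started < nthreads; started++) {
            jobs[started] = (local_job){ out, (unsigned long)started, iterations };
            if (pthread_create(&threads[started], NULL, local_worker, &jobs[started]))
                break;
        }
        for (int i = 0; i < started; i++)
            pthread_join(threads[i], NULL);
        rc = (started == nthreads) ? 0 : -1;
    }
    free(jobs);
    free(threads);
    return rc;
}
```

Calling run_local(2, 10000000, buf) on a small unsigned char buffer and timing it against the original loop would give the comparison; the counters are bytes, so they wrap modulo 256 just as in the original.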
thank you in advance for any help
Wojtek
the code:
#include <stdio.h>
#include <string.h>
#include <pthread.h>
#include <stdlib.h>

typedef unsigned char uchar;

typedef struct
{
    unsigned long long pos;
    uchar *in;
    uchar *out;
} obj;

int NUM=1;
pthread_attr_t attr;

void* cryptThread(void* ptr)
{
    int i;
    obj *cS=(obj*)ptr;
    int s=1;
    /* each thread increments its own byte; the bytes are adjacent in memory */
    for ( i=0; i<10000000; i++)
        cS->out[cS->pos]+=s;
    return (void*)(long)(s%1); //to avoid removal of loop which does nothing
}

void test(int *in,int *out, int len)
{
    int rV,status,i,j;
    obj cs[NUM];
    for (i=0; i<NUM; i++)
    {
        cs[i].in=(uchar*)in;
        cs[i].out=(uchar*)out;
        cs[i].pos=i;
    }
    pthread_t threads[NUM];
    for (i=0; i<NUM; i++)
    {
        rV=pthread_create(&threads[i],&attr,cryptThread,&cs[i]);
        if (rV) exit(-1);
    }
    for(i=0; i<NUM; i++)
    {
        rV = pthread_join(threads[i], NULL);
        if (rV) exit(-1);
    }
}

int main(int argc, char* argv[])
{
    int i;
    if (argc==2) NUM=atoi(argv[1]);
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    int *plaintext=(int*)malloc(NUM*sizeof(int));
    int *ciphertext=(int*)malloc(NUM*sizeof(int));
    for (i=0; i<10; i++)
        test(plaintext,ciphertext,NUM);
    free(plaintext);
    free(ciphertext);
    pthread_attr_destroy(&attr);
    pthread_exit(NULL);
    return 0;
}
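For comparison, here is a minimal sketch of a variant where each thread's counter sits in its own cache line, so the threads should no longer be writing to the same line. It assumes a 64-byte line (CACHE_LINE is a guess, not something checked on the machines above), and padded_counter, padded_worker and run_padded are names invented for this illustration. The counter is volatile so that -O3 keeps one store per iteration.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_LINE 64   /* assumed line size in bytes; verify for your CPU */

/* One counter per thread, padded out to a full (assumed) cache line. */
typedef struct {
    volatile unsigned char value;   /* volatile keeps the per-iteration store */
    char pad[CACHE_LINE - 1];
} padded_counter;

typedef struct {
    padded_counter *slot;
    long iterations;
} padded_job;

static void *padded_worker(void *ptr)
{
    padded_job *job = ptr;
    for (long i = 0; i < job->iterations; i++)
        job->slot->value++;
    return NULL;
}

/* Hypothetical helper: runs nthreads workers and returns the sum of the
   final counter values, or -1 on allocation or thread errors. */
long run_padded(int nthreads, long iterations)
{
    long sum = -1;
    int started = 0;
    pthread_t *threads = malloc(nthreads * sizeof *threads);
    padded_job *jobs = malloc(nthreads * sizeof *jobs);
    /* aligned_alloc (C11) starts the array on a line boundary */
    padded_counter *slots = aligned_alloc(CACHE_LINE, nthreads * sizeof *slots);

    if (threads && jobs && slots) {
        memset(slots, 0, nthreads * sizeof *slots);
        for (; started < nthreads; started++) {
            jobs[started].slot = &slots[started];
            jobs[started].iterations = iterations;
            if (pthread_create(&threads[started], NULL, padded_worker, &jobs[started]))
                break;
        }
        for (int i = 0; i < started; i++)
            pthread_join(threads[i], NULL);
        if (started == nthreads) {
            sum = 0;
            for (int i = 0; i < nthreads; i++)
                sum += slots[i].value;
        }
    }
    free(slots);
    free(jobs);
    free(threads);
    return sum;
}
```

Timing run_padded(N, 10000000) against the original for N = 1, 2, 4 on the same machine would show how much of the slowdown goes away once the counters are separated; the counters are single bytes, so they wrap modulo 256 as in the original.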

