On Oct 2, 2:10 pm, Alexander Chemeris <alexander.cheme...@[EMAIL PROTECTED]>
wrote:
> I think using rdtsc won't introduce too much distortion, even when
> timing small pieces of code. You can look into ffmpeg's timing
> facilities.
> Ffmpeg guys really do care about performance and have developed
> a good set of C macros for performance measurement.
>
> But with current long-pipeline CPUs there is a problem: you can't
> actually say that "this operation takes exactly N cycles". You have to
> clarify whether it is the number of cycles from the start of the first
> instruction to the start of the last instruction, or from the start of
> the first instruction to the end of the last instruction. If I recall
> correctly, using rdtsc to measure every operation will give you the
> latter, while measuring the overall time will give you the former +
> loop overhead + context switch overhead. Context switch overhead is
> probably negligible because of the nature of the experiment, but you
> can't get rid of loop overhead (though it should be just a few cycles).
>
> So, to make a long story short: it would be interesting if you made
> timings on a per-operation basis and compared those results with your
> previous results.
I agree that in a perfect world I would have to make several extremely
precise measurements. But in the real world I really don't have time
for that (I'm not a researcher, and nobody pays me for it).
Nevertheless, I believe my measurements give precise results (though
maybe not very, very precise results). And I have posted the source
code, so everybody is free to make any measurements they want ;)
If we take loop overhead, context switches, etc. into consideration,
well, maybe we will get 25 cycles/op instead of 30 cycles/op. I think
this is inessential. It won't affect the linear scaling of my map or
the super-linear degradation of the lock-based map.
As for rdtsc: when we are talking about things like 30 cycles, I think
rdtsc can introduce considerable distortion. Btw, when I was measuring
exact timings of some things, I used the following scheme:
CPUID [eax=0]
t1 = RDTSC
CPUID [eax=0]
[tested code]
CPUID [eax=0]
t2 = RDTSC
CPUID [eax=0]
t = t2 - t1 - K
K must be determined experimentally so that t = 0 when [tested code]
is empty.
This allows one to measure timings down to individual instructions.
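The scheme above could be sketched in C roughly as follows. This is only an illustration, assuming an x86/x86-64 target and GCC or Clang inline assembly; the names serialize, rdtsc, MEASURE_RAW and calibrate_k are made up for this sketch, and taking K as the minimum over many empty runs is just one possible way to determine it experimentally.

```c
#include <assert.h>
#include <stdint.h>

/* CPUID with eax=0, used here only as a serializing instruction so that
   earlier instructions retire before the next RDTSC is executed. */
static inline void serialize(void) {
    uint32_t eax = 0, ebx, ecx = 0, edx;
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
}

/* Read the 64-bit time stamp counter (returned in edx:eax). */
static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* CPUID / RDTSC / CPUID / [tested code] / CPUID / RDTSC / CPUID.
   Stores the raw difference t2 - t1 (still including K) in `result`.
   A macro is used so that no call overhead is added around `code`. */
#define MEASURE_RAW(result, code)      \
    do {                               \
        serialize();                   \
        uint64_t t1_ = rdtsc();        \
        serialize();                   \
        code;                          \
        serialize();                   \
        uint64_t t2_ = rdtsc();        \
        serialize();                   \
        (result) = t2_ - t1_;          \
    } while (0)

/* Estimate K: the smallest raw reading seen with an empty tested region.
   Assumption of this sketch: the minimum is the least disturbed run. */
static uint64_t calibrate_k(int runs) {
    assert(runs > 0);
    uint64_t k = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t raw;
        MEASURE_RAW(raw, (void)0);
        if (raw < k)
            k = raw;
    }
    return k;
}
```

To time something, one would call calibrate_k once, then use MEASURE_RAW(raw, tested_statement) and take t = raw - K, ideally again keeping the minimum over several repetitions. Under virtualization CPUID is typically intercepted by the hypervisor, so K can be much larger and noisier there than on bare metal.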
Dmitriy V'jukov

