James Harris wrote:
....
>> btw: I don't have a CPUID before the second RDTSC
>> so I get an almost constant 14-cycle count for an empty test.
> If I remove the second cpuid I seem to get a consistent 32 cycles (on
> a Pentium 3). As I understand it, though, the second serialisation is
> needed to ensure any previous instructions (which there would be if
> this was testing a real code sequence) have completed.
This CPUID seems to do a bit more than just wait for the pipes to
complete, and its behaviour may vary a lot between CPUs, even within
one family. OK, the second RDTSC may stall on preceding operations
that use its registers (eax, edx), but this can be covered by a few
NOPs in front of it (which I usually have in my test field anyway).
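For reference, the two variants discussed above (second read with and without a serialising CPUID) could be compared with something like the sketch below. This is only an illustration, not the harness either of us actually uses: it assumes GCC or Clang inline asm on x86, the names tsc_serialised, tsc_plain and min_empty_cycles are made up, and the clock() fallback for non-x86 targets is there only so the file compiles. The minimum over many runs is taken because single readings are noisy.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
/* CPUID (leaf 0) as a serialising instruction, then read the TSC. */
static inline uint64_t tsc_serialised(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__(
        "xorl %%eax, %%eax\n\t"
        "cpuid\n\t"
        "rdtsc"
        : "=a"(lo), "=d"(hi)
        :
        : "ebx", "ecx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* RDTSC alone, with no serialisation in front of it. */
static inline uint64_t tsc_plain(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
}
#else
/* Assumption: non-x86 fallback so the sketch still builds; not cycles. */
static inline uint64_t tsc_serialised(void) { return (uint64_t)clock(); }
static inline uint64_t tsc_plain(void)      { return (uint64_t)clock(); }
#endif

/* Smallest observed count for an empty test over `runs` attempts.
   serialise_end != 0 puts a CPUID before the second RDTSC as well. */
uint64_t min_empty_cycles(int runs, int serialise_end)
{
    assert(runs > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = tsc_serialised();
        /* code under test would go here */
        uint64_t t1 = serialise_end ? tsc_serialised() : tsc_plain();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Calling min_empty_cycles(1000, 0) and min_empty_cycles(1000, 1) gives the overhead to subtract for each variant; the absolute numbers will differ from CPU to CPU.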
> I've tried various offsets both before the three tests and within but
> cannot see a consistent pattern. The CPU reports the following
> instruction cache characteristics.
>
> 08 1st-level instruction cache: 16 KB, 4-way set associative, 32-byte
> line size
You can try to put the first RDTSC at the very end of a (physical)
cache line, so that the code under test always starts on a cache-line
boundary.
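The amount of padding needed for that is simple arithmetic. A small sketch, assuming the line size is a power of two (32 bytes on the P3 quoted above) and that RDTSC encodes as 2 bytes (0F 31); the function name pad_to_line_end is made up for illustration. If the harness saves the counter with extra instructions after RDTSC, their length would have to be added to insn_len.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of filler bytes (e.g. NOPs) to emit at address `addr` so that
   an instruction of `insn_len` bytes placed after them ends exactly on
   a `line`-byte boundary. The code under test then starts on the
   first byte of the next cache line. */
size_t pad_to_line_end(uintptr_t addr, size_t insn_len, size_t line)
{
    assert(line != 0 && (line & (line - 1)) == 0); /* power of two */
    assert(insn_len <= line);

    /* Offset within a line at which the instruction would end if it
       were placed at `addr` with no padding. */
    size_t end_off = (size_t)((addr + insn_len) & (line - 1));

    /* Distance from that offset up to the next line boundary
       (0 when it already ends on a boundary). */
    return (line - end_off) & (line - 1);
}
```

For example, with a 32-byte line and RDTSC about to be assembled at offset 0 of a line, pad_to_line_end(0, 2, 32) is 30, which puts RDTSC at offsets 30 and 31.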
> 01 Instruction TLB: 4 KB Pages, 4-way set associative, 32 entries
> 02 Instruction TLB: 4 MB Pages, fully associative, 2 entries
>
> This isn't a problem as it stands unless the symptoms persist within a
> loop. Something to watch out for, though, for tight code!
Yes, there are many things to consider when optimising ... besides
alignment, cache-line boundaries, dependencies and reg/pipe stalls,
there is the code prefetch with its CPU-dependent size and decode
capabilities. Sometimes a redundant-looking jmp or useless NOPs can
speed up a following tiny loop by a factor of four, but it always
depends ...
__
wolfgang