Robert Spykerman <robert.spykerman@[EMAIL PROTECTED]> writes:
>On May 1, 12:30 pm, brian....@[EMAIL PROTECTED] wrote:
>> I have been dis-assembling pieces of GForth and I don't understand why
>> it is generally faster than other ITC Forths that are written in
>> Assembler like CI-Forth or my old 16 bit HS/Forth.
It is? On what hardware? On what benchmarks? How much?
>1. As you know ciforth is really really small. I am wondering about
>cache issues. As you know a lot of modern pentia are von Neumann
>outside but actually are Harvard internally. I believe from memory the
>separate L1 data/instr caches of a CORE are 32 k bytes or something
>like that. I can't remember sizes of cache lines but as you know data
>and executable IA-32 code are definitely within this 20-30k area...
Yes, cache consistency issues are usually a good candidate for
explaining unexpected slowdowns. However, I would not expect that to
play a significant role with ITC on modern CPUs (it is pretty bad with
traditional ITC on Pentium, Pentium MMX, and K6 series CPUs).
Modern CPUs only suffer from writes in the same consistency region
that code resides in. The size of the consistency region is 64 Bytes
for the K7/K8/K10, AFAIK 32 Bytes on the Pentium Pro ... Pentium 3,
Pentium M, Core, and Core 2 family, and 1KB on the Pentium 4. So one
would have to place frequently-written variables or buffers pretty
close to primitives to get hit by that.
One can measure that by using performance counters and looking at the
I-cache and D-cache misses.
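
To make the consistency-region condition concrete, here is a minimal sketch in C. It assumes that a consistency region is an aligned, power-of-two-sized block of addresses (64 bytes, 32 bytes, or 1KB with the figures above). The function name and its parameters are made up for illustration and are not taken from any Forth system.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when a code address and a data address fall
   into the same aligned region of region_size bytes.
   Assumption: region_size is a power of two and regions are aligned. */
static bool same_consistency_region(uintptr_t code_addr, uintptr_t data_addr,
                                    uintptr_t region_size)
{
    uintptr_t mask = ~(region_size - 1);   /* clears the offset bits */
    return (code_addr & mask) == (data_addr & mask);
}
```

With such a check, a variable at 0x1040 does not share a 64-byte region with a primitive at 0x1000, but it does share a 1KB region with it, so the same memory layout could be harmless on one CPU and costly on another.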
>2. Most of ciforth is actually written in forth. I wonder if it's the
>same in gforth.
Apart from about 300 primitives, yes, it's the same. Hmm, the
additional primitives may be helpful for the prediction accuracy of
the indirect JMP. This can also be checked with performance counters.
>3. I've heard some say that lodsw (ie for NEXT) is slow on modern
>pentia and I've been wondering about whether changing that to an idiom
>like doing a manual load of EAX from [ESI], bumping it up and then jmp
>ing to [EAX] may be better.
Hmm, I thought that this has become better than in the 486 and Pentium
days, but looking in the Athlon Optimization Guide (ok, already 8
years old itself, but the Athlon 64 (K8) is not that different in
these areas from the Athlon (K7)), I find that LODSD has a latency of
4 cycles, more than the equivalent MOV/ADD sequence. It is also a
VectorPath instruction, so it needs its own decode cycle. Still, I
would be surprised if that's it, but that's easy to check by replacing
all occurrences of LODSD with
MOV EAX, [ESI]
ADD ESI, 4
One could then also use a different register than EAX and schedule the
MOV further up, which should be helpful when the JMP mispredicts
(about half of the instructions).
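
For readers who think in C rather than assembler, the three steps of an ITC NEXT (load, increment, indirect jump) can be sketched roughly as follows. This is only an illustration under stated assumptions: all names are hypothetical, a call through a function pointer stands in for the indirect JMP, and a central loop replaces the copy of NEXT that a real ITC system puts at the end of every primitive.

```c
#include <stdint.h>

typedef intptr_t cell;
typedef void (*prim_fn)(void);

/* A word reduced to its code field (hypothetical layout). */
typedef struct { prim_fn code; } word;

static const cell *ip;      /* instruction pointer, the role of ESI */
static cell stack[32];
static cell *sp;            /* data stack pointer, grows downward */
static int running;

static void prim_lit(void)  { *--sp = *ip++; }                 /* push inline literal */
static void prim_plus(void) { cell top = *sp++; *sp += top; }  /* add with one pop */
static void prim_halt(void) { running = 0; }                   /* stop the loop */

static const word w_lit  = { prim_lit };
static const word w_plus = { prim_plus };
static const word w_halt = { prim_halt };

/* Inner interpreter: each iteration is one NEXT. */
static cell run(const cell *thread)
{
    sp = stack + 32;
    ip = thread;
    running = 1;
    while (running) {
        const word *w = (const word *)*ip;  /* MOV EAX, [ESI] */
        ip += 1;                            /* ADD ESI, 4 (one cell) */
        w->code();                          /* JMP [EAX], here a call */
    }
    return *sp;
}

/* Threaded code for: 2 3 + */
static cell demo(void)
{
    cell thread[] = {
        (cell)&w_lit, 2,
        (cell)&w_lit, 3,
        (cell)&w_plus,
        (cell)&w_halt
    };
    return run(thread);
}
```

Calling demo() runs the thread and leaves 5 on top of the stack. Writing the load and the increment as separate statements is what gives a compiler (or an assembler programmer) the freedom to pick another register and to move the load further up.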
Concerning the PUSH and POP instructions, on the Athlon all POPs are
VectorPath (slower decode) with 4 cycles of latency, and the simple PUSHes
are DirectPath (fast decode) instructions with 3 cycles of latency. The
K10 (Phenom) has special hardware that speeds up PUSH and POP, but I
think the K8 (Athlon 64 (X2)) is still pretty similar to the Athlon in
this area, so one probably should avoid them. Certainly something
like @ should be done without PUSH and POP, and + should be done with
at most one POP.
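
As a rough illustration of what that means for the Forth words @ and +, here is a sketch in C. It assumes a downward-growing data stack kept entirely in memory, with sp pointing at the top item. The function names are hypothetical, and a real system would write these as assembler primitives.

```c
#include <stdint.h>

typedef intptr_t cell;

/* @ ( addr -- x ): overwrite the top item in place.
   The stack pointer does not move, so no PUSH or POP is needed. */
static cell *fetch_in_place(cell *sp)
{
    sp[0] = *(cell *)sp[0];
    return sp;
}

/* + ( a b -- sum ): take one item off the stack (the single POP)
   and add it into the item below. */
static cell *plus_one_pop(cell *sp)
{
    cell top = *sp++;
    sp[0] += top;
    return sp;
}
```

In assembler terms, the first corresponds to plain MOVs through [ESP], and the second to one POP followed by an ADD to [ESP], instead of two POPs and a PUSH.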
I don't have the Intel Optimization manual at hand, but these
instructions should be pretty similar to the sequences of simple
instructions on the Pentium 4 with its trace cache, and IIRC the Core
microarchitecture (Core 2 CPUs, not Core CPUs; thank you, Intel
marketing) has special hardware for PUSH and POP, like the K10.
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2008:
http://www.complang.tuwien.ac.at/anton/euroforth/ef08.html

