kuratkull@[EMAIL PROTECTED]
<spamtrap@[EMAIL PROTECTED]
> wrote:
> Anyway, I am trying to build the very-fast-prime-counter i assembly
> and I succeeded when using only the regular instructions, but a 6.5
> secs for a 100k primes is way too slow, so I looked around and
> profiled my code...and of course, a major bottleneck was the idivl.
I am not sure what "prime-counter" mean -- I you want just generate
primes than sieve of Eratosthenes is the standard method (much faster
than method using division).
In case you really want to use division, you may consider storing
recipracals -- you are using the same divisor many times.
> So
> I decided to use the 3DNow! instructions.
>
> ecx = ecx+2 (begins from 3; 2 is already in the stack)
> edx = all found primes less than current ecx, are looped through it.
>
> MOVD %ecx, %MM0 /* divisor */
> PI2FD %MM0, %MM0 /* divisor to float */
> MOVQ %MM0, %MM2 /* copy of divisor */
> PFRCP %MM0, %MM0 /* 1/divisor */
> MOVD %edx, %MM1 /* prime_candidate */
> PI2FD %MM1, %MM1 /* prime. to float */
> MOVQ %MM1, %MM3 /* copy of prime. */
> PFMUL %MM0, %MM1 /* 1/divisor * prime */
> PF2ID %MM1, %MM1 /* answer to int ...*/
> PI2FD %MM1, %MM1 /* ... and back to float */ /* NOTE1 */
> PFMUL %MM1, %MM2 /* roundeddown(1/divisor*prime.)*divisor
> PFCMPEQ %MM2, %MM3
> MOVD %MM3, %eax /* if prime check "good", then != */
>
> It will run perfectly until ecx=9 and edx=3. After "NOTE1", the MM1
> should be 3.0, but it seems to get rounded down to 2 on the toint-
> tofloat round. It seems the previous calculation before the converting
> stays a bit below 3, and then gets rounded down.
>
3DNow! PFRCP intruction give only approximate result (unlike 387 FPU
the result in _not_ correctly rounded). Worse, according to AMD
do***entation PFRCP has only 12 bits of accuracy, so you are limited
to really tiny numbers. So, you probably should add PFRCPIT1 and
PFRCPIT2 instructions, which together should give you 24 bits of
accuracy (however, the do***entation I have is pretty unclear what
accuraccy is guaranteed, so I would not trust in the last 1-2 bits).
Given inaccuracy of reciprocal and rounding of multiplicatin you
may get result which is smaller than true quotient. If the true
result is an integer and rounded result is smaller then rounding
down give you wrong result. To avoid this error you must make
sure that approximate result is _always_ bigger than true result.
Simple way to to multiply approximate a/b by a number slightly
bigger than 1, say 1+2^(-20) (this should be enough if the
approximate a/b has 21 bits of accuracy). Of course, there is
a danger that real a/b is smaller than an integer, but
after biasing up may get bigger and you may get the result
which is bigger than correct result. However, this can not
happen if the numbers are small enough. Namely, if a and b
are positive integer and a/b in not an integer, than distance
from a/b to the nearest integer is at least 1/b. As long as
absolute error of the calculation is smaller than 1/b there
is no risk that after rounding you get number bigger than
true result. Error bounds for floating point computations
are relative, so you really want relative error smaller than
1/a. Which means that you are limited to 18-19 bit a.
If you want bigger numbers than it is hard to beat IDIV.
One may try the same trick used by PFRCP, PFRCPIT1, PFRCPIT2 --
first get rough approximantion (say, via table lookup) and
refine using Newton iteration. But it is not clear if
table lookup and several aritmetic instructions needed
for Newton iteration will be faster than IDIV...
--
Waldek Hebisch
hebisch@[EMAIL PROTECTED]


|