On Apr 4, 5:23 am, Waldek Hebisch <spamt...@[EMAIL PROTECTED]
> wrote:
> kuratk...@[EMAIL PROTECTED]
<spamt...@[EMAIL PROTECTED]
> wrote:
> > Anyway, I am trying to build the very-fast-prime-counter i assembly
> > and I succeeded when using only the regular instructions, but a 6.5
> > secs for a 100k primes is way too slow, so I looked around and
> > profiled my code...and of course, a major bottleneck was the idivl.
>
> I am not sure what "prime-counter" mean -- I you want just generate
> primes than sieve of Eratosthenes is the standard method (much faster
> than method using division).
>
> In case you really want to use division, you may consider storing
> recipracals -- you are using the same divisor many times.
>
>
>
> > So
> > I decided to use the 3DNow! instructions.
>
> > ecx = ecx+2 (begins from 3; 2 is already in the stack)
> > edx = all found primes less than current ecx, are looped through it.
>
> > MOVD %ecx, %MM0 /* divisor */
> > PI2FD %MM0, %MM0 /* divisor to float */
> > MOVQ %MM0, %MM2 /* copy of divisor */
> > PFRCP %MM0, %MM0 /* 1/divisor */
> > MOVD %edx, %MM1 /* prime_candidate */
> > PI2FD %MM1, %MM1 /* prime. to float */
> > MOVQ %MM1, %MM3 /* copy of prime. */
> > PFMUL %MM0, %MM1 /* 1/divisor * prime */
> > PF2ID %MM1, %MM1 /* answer to int ...*/
> > PI2FD %MM1, %MM1 /* ... and back to float */ /* NOTE1 */
> > PFMUL %MM1, %MM2 /* roundeddown(1/divisor*prime.)*divisor
> > PFCMPEQ %MM2, %MM3
> > MOVD %MM3, %eax /* if prime check "good", then != */
>
> > It will run perfectly until ecx=9 and edx=3. After "NOTE1", the MM1
> > should be 3.0, but it seems to get rounded down to 2 on the toint-
> > tofloat round. It seems the previous calculation before the converting
> > stays a bit below 3, and then gets rounded down.
>
> 3DNow! PFRCP intruction give only approximate result (unlike 387 FPU
> the result in _not_ correctly rounded). Worse, according to AMD
> do***entation PFRCP has only 12 bits of accuracy, so you are limited
> to really tiny numbers. So, you probably should add PFRCPIT1 and
> PFRCPIT2 instructions, which together should give you 24 bits of
> accuracy (however, the do***entation I have is pretty unclear what
> accuraccy is guaranteed, so I would not trust in the last 1-2 bits).
>
> Given inaccuracy of reciprocal and rounding of multiplicatin you
> may get result which is smaller than true quotient. If the true
> result is an integer and rounded result is smaller then rounding
> down give you wrong result. To avoid this error you must make
> sure that approximate result is _always_ bigger than true result.
> Simple way to to multiply approximate a/b by a number slightly
> bigger than 1, say 1+2^(-20) (this should be enough if the
> approximate a/b has 21 bits of accuracy). Of course, there is
> a danger that real a/b is smaller than an integer, but
> after biasing up may get bigger and you may get the result
> which is bigger than correct result. However, this can not
> happen if the numbers are small enough. Namely, if a and b
> are positive integer and a/b in not an integer, than distance
> from a/b to the nearest integer is at least 1/b. As long as
> absolute error of the calculation is smaller than 1/b there
> is no risk that after rounding you get number bigger than
> true result. Error bounds for floating point computations
> are relative, so you really want relative error smaller than
> 1/a. Which means that you are limited to 18-19 bit a.
> If you want bigger numbers than it is hard to beat IDIV.
> One may try the same trick used by PFRCP, PFRCPIT1, PFRCPIT2 --
> first get rough approximantion (say, via table lookup) and
> refine using Newton iteration. But it is not clear if
> table lookup and several aritmetic instructions needed
> for Newton iteration will be faster than IDIV...
>
> --
> Waldek Hebisch
> hebi...@[EMAIL PROTECTED]
our posts are falling out of sync with eachother.
Sorry Sebastian, I mixed up: the gas doc. says:
"Currently, as does not sup****t Intel's floating point SIMD, Katmai
(KNI)."
->
SSE works, but the examples I tried didn't work for some reason, after
I had "converted" them from Intel to AT&T. Enough about this :)
Waldek:
By prime-counter I meant trial division with every previously found
prime. (I have looked into number theory as a personal interest ).
Thanks for the detailed info on the workaround, but I am more
staggered to find that the 3DNow! actually IS imprecise :/ So it is
basically useless, because as I said, it generates ~500 primes more
for 1000,000 than it should. (I previously wrote 10 mil, sorry).
The solution you offered is very nice, but I was hoping the solution
to be without limits, though I appreciate your effort. It just looks
like I have to dump 3DNow! altogether. (I'm going to get myself a new
IntC2D laptop, so I will be able to use SSE upto SSSE3).


|