"James Van Buskirk" <spamtrap@[EMAIL PROTECTED]
> schrieb im Newsbeitrag
news:P9adnbQZvO6pK57VnZ2dnUVZ_vOlnZ2d@[EMAIL PROTECTED]
> "Wolfgang Kern" <spamtrap@[EMAIL PROTECTED]
> wrote in message
> news:ftvcf7$umq$1@[EMAIL PROTECTED]
>> Even this algo suffer on many dependency stalls, it reads four bytes
>> and processes 32 bits per iteration, so it reduces the costly memory
>> accesses which you encounter with a byte by byte indexed table lookup.
>> Granted, a 256 byte table could be easy held in cache, so it would be
>> worth to benchmark and compare the performance of the two methods.
> Well, I tried rewriting the 8-bit LUT in assembly to minimize
> memory access, also I tried the 4-bit LUT. In the following,
> popcnt1 is the topical code of this thread, popcnt2 is the
> 8-bit LUT, popcnt3 is popcnt1 without compression as in the AMD
> manual, and popcnt4 is the 4-bit LUT:
I see in your later posts you found already faster solutions ...
[code ..]
when I look at this table ...
> LUT:
> .byte 0,1,1,2,1,2,2,3 00 1.
> .byte 1,2,2,3,2,3,3,4 08 2.
> .byte 1,2,2,3,2,3,3,4 10 =08 [2
> .byte 2,3,3,4,3,4,4,5 18 3.
> .byte 1,2,2,3,2,3,3,4 20 = 08 [2
> .byte 2,3,3,4,3,4,4,5 28 = 18 [3
> .byte 2,3,3,4,3,4,4,5 30 = 18 [3
> .byte 3,4,4,5,4,5,5,6 38 4.
> .byte 1,2,2,3,2,3,3,4 40 = 08 [2
> .byte 2,3,3,4,3,4,4,5 48 = 18 [3
> .byte 2,3,3,4,3,4,4,5 50 = 18 [3
> .byte 3,4,4,5,4,5,5,6 58 = 38 [4
> .byte 2,3,3,4,3,4,4,5 60 = 18 [3
> .byte 3,4,4,5,4,5,5,6 68 = 38 [4
> .byte 3,4,4,5,4,5,5,6 70 = 38 [4
> .byte 4,5,5,6,5,6,6,7 78 5.
> .byte 1,2,2,3,2,3,3,4 80 = 08 [2
> .byte 2,3,3,4,3,4,4,5 88 = 18 [3
> .byte 2,3,3,4,3,4,4,5 90 = 18 [3
> .byte 3,4,4,5,4,5,5,6 98 = 38 [4
> .byte 2,3,3,4,3,4,4,5 a0 = 18 [3
> .byte 3,4,4,5,4,5,5,6 a8 = 38 [4
> .byte 3,4,4,5,4,5,5,6 b0 = 38 [4
> .byte 4,5,5,6,5,6,6,7 b8 = 78 [5
> .byte 2,3,3,4,3,4,4,5 c0 = 18 [3
> .byte 3,4,4,5,4,5,5,6 c8 = 38 [4
> .byte 3,4,4,5,4,5,5,6 d0 = 38 [4
> .byte 4,5,5,6,5,6,6,7 d8 = 78 [5
> .byte 3,4,4,5,4,5,5,6 e0 = 38 [4
> .byte 4,5,5,6,5,6,6,7 e8 = 78 [5
> .byte 4,5,5,6,5,6,6,7 f0 = 78 [5
> .byte 5,6,6,7,6,7,7,8 f8 6.
I wonder if a few calculations wont reduce the LUT to 6*8 bytes
or even smaller with a double '0,1,1,2' index conversion,
which then could be held in registers ...
Haven't checked on the compressed method in detail, so this
may be in use already anyway.
[code..]
> So we got down to about 1 byte per 1.5 clocks for the 8-bit LUT.
> Quite disappointing. The 4-bit LUT is close to the AMD manual
> method, though. For a processor with fast PSHUFB and no
> hardware POPCNT this may be a good thing. PSHUFB unfortunately is
> slow on my Core 2 Duo E6700.
My AMD64 docs show latency/throughput = 2(5)/1 for POPCNT, and I see
this as 0.25 cycles per byte on GPR64 and 0.625 on mem64, .. not bad
and hard to beat with an algo.
__
wolfgang


|