"Wolfgang Kern" <spamtrap@[EMAIL PROTECTED]
> wrote in message
news:fu4v9j$b5g$1@[EMAIL PROTECTED]
> "James Van Buskirk" <spamtrap@[EMAIL PROTECTED]
> schrieb im Newsbeitrag
> news:P9adnbQZvO6pK57VnZ2dnUVZ_vOlnZ2d@[EMAIL PROTECTED]
> when I look at this table ...
>> LUT:
>> .byte 0,1,1,2,1,2,2,3 00 1.
>> .byte 1,2,2,3,2,3,3,4 08 2.
>> .byte 1,2,2,3,2,3,3,4 10 =08 [2
>> .byte 2,3,3,4,3,4,4,5 18 3.
>> .byte 1,2,2,3,2,3,3,4 20 = 08 [2
>> .byte 2,3,3,4,3,4,4,5 28 = 18 [3
>> .byte 2,3,3,4,3,4,4,5 30 = 18 [3
>> .byte 3,4,4,5,4,5,5,6 38 4.
>> .byte 1,2,2,3,2,3,3,4 40 = 08 [2
>> .byte 2,3,3,4,3,4,4,5 48 = 18 [3
>> .byte 2,3,3,4,3,4,4,5 50 = 18 [3
>> .byte 3,4,4,5,4,5,5,6 58 = 38 [4
>> .byte 2,3,3,4,3,4,4,5 60 = 18 [3
>> .byte 3,4,4,5,4,5,5,6 68 = 38 [4
>> .byte 3,4,4,5,4,5,5,6 70 = 38 [4
>> .byte 4,5,5,6,5,6,6,7 78 5.
>> .byte 1,2,2,3,2,3,3,4 80 = 08 [2
>> .byte 2,3,3,4,3,4,4,5 88 = 18 [3
>> .byte 2,3,3,4,3,4,4,5 90 = 18 [3
>> .byte 3,4,4,5,4,5,5,6 98 = 38 [4
>> .byte 2,3,3,4,3,4,4,5 a0 = 18 [3
>> .byte 3,4,4,5,4,5,5,6 a8 = 38 [4
>> .byte 3,4,4,5,4,5,5,6 b0 = 38 [4
>> .byte 4,5,5,6,5,6,6,7 b8 = 78 [5
>> .byte 2,3,3,4,3,4,4,5 c0 = 18 [3
>> .byte 3,4,4,5,4,5,5,6 c8 = 38 [4
>> .byte 3,4,4,5,4,5,5,6 d0 = 38 [4
>> .byte 4,5,5,6,5,6,6,7 d8 = 78 [5
>> .byte 3,4,4,5,4,5,5,6 e0 = 38 [4
>> .byte 4,5,5,6,5,6,6,7 e8 = 78 [5
>> .byte 4,5,5,6,5,6,6,7 f0 = 78 [5
>> .byte 5,6,6,7,6,7,7,8 f8 6.
> I wonder if a few calculations wont reduce the LUT to 6*8 bytes
> or even smaller with a double '0,1,1,2' index conversion,
> which then could be held in registers ...
> Haven't checked on the compressed method in detail, so this
> may be in use already anyway.
I think I see what you are driving at. Actually if you look
closer at the 4-bit LUT code it holds the first 16 bytes of the
LUT in one register, and looks up bytes in two stages. Look at
the _popcnt4 function (latest version to be found at
http://groups.google.com/group/comp.lang.asm.x86/msg/ee1ed0bbd6f35874
) and it may already be doing what you are proposing.
> My AMD64 docs show latency/throughput = 2(5)/1 for POPCNT, and I see
> this as 0.25 cycles per byte on GPR64 and 0.625 on mem64, .. not bad
> and hard to beat with an algo.
This doesn't beat the _popcnt1 version, which is getting 4.7 bytes
per clock, but the latency/throughput numbers quoted above for
Phenom imply almost 8 bytes/clock to me:
popcnt (%rcx), %rbx
addq %rbx, %rax
popcnt 8(%rcx), %rbx
addq %rbx, %rax
.....
But I think it can go even faster because Phenom should be capable
of 3 DirectPath operations per cycle as well as two loads, and the
above sequence only uses one load and two DirectPath operations
per cycle. Thus every third instruction could be devoted to
a compression algorithm like the one seen in _popcnt1, whose
final counting scheme could be accelerated with hardware popcount.
Of course my attempts to get Core 2 Duo do sustain 3 useful
calculational operations per cycle have been in vain, so there may
also be throughput issues with Phenom that prevent this idea from
becoming reality, but if anyone out there has a Phenom running
64-bit Windows I would be eager to run some tests...
--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end


|