"Terence" wrote:
> Someone tell me why 21 instructions and a branch, per word, is somehow
> better/faster than
>
> Initializtion all methods need.
>
> zero summing index (eg ax or eax)
> load address of string to esi
> load string count to ecx
> load table address to ebx
> back: ; 3 instructions per word follow
> add to current sum, from count table indexed by the 8 byte or 16 bit
> word (one instruction)
> increment string pointer
> (esi) (one instruction)
> branch to repeat if not finished string byte/word string length
> (dec-branch instruction)
>
> ?
> I want to know, if I haven't understood what you all seem to be trying
> to do!
> Is it that all those instuctions are all somehow loaded to the cpu
> chip inststruction store and processed on chip? Because there is
> always the memory access to be considered.
The shown code reminds me of what Randy Hyde posted a while back
in ALA for counting set bits.
Even this algo suffer on many dependency stalls, it reads four bytes
and processes 32 bits per iteration, so it reduces the costly memory
accesses which you encounter with a byte by byte indexed table lookup.
Granted, a 256 byte table could be easy held in cache, so it would be
worth to benchmark and compare the performance of the two methods.
I never needed a bit counter yet, but if I ever need one, I'd try
if the known to be awful slow BSF/BSR instructions wont save on
many iterations, ...sure the worst case would be all bits set then,
so your suggestion may perform quite well:
xor eax,eax ;init count
xor ecx,ecx ;used as dwords_size
mov esi,start_of
mov edi, Byte_LUT ;should be prepared already
prefetchnta [edi]
[edi+40h]
[edi+80h]
[edi+C0h]
L0:
mov ebx,[esi+ecx*4]
movzx edx,bl ;four times unrolled
shr ebx,8 ;because a loop branch
add AL,[edi+edx] ;may cost more time
adc eax,0x0100 ;
movzx edx,bl
shr ebx,8
add AL,[edi+edx]
adc eax,0x0100
movzx edx,bl
shr ebx,8
add AL,[edi+edx]
adc eax,0x0100
movzx edx,bl
;shr ebx,8 ;not needed in last
add AL,[edi+edx]
adc eax,0x0100
inc ecx
cmp ecx,dword_size
jc L0
; eax = bit count
but I'm not sure that your short variant:
;ecx=byte_size
L0:
movzx edx,[esi]
inc esi
add AL,[edi+edx]
adc eax,0x0100
dec ecx
jnz L0
could be faster, because it reads 'bytes' more often from memory.
perhaps I type in in one day and post timing results then.
__
wolfgang


|
33 Posts in Topic:
|
"James Van Buskirk&q |
2008-04-12 03:14:45 |
|
Terence <spamtrap@[EM |
2008-04-12 16:55:40 |
|
Terje Mathisen <spamt |
2008-04-13 15:08:36 |
|
"James Van Buskirk&q |
2008-04-13 09:20:49 |
|
"Maarten Kronenburg& |
2008-04-13 21:48:40 |
|
Jake Waskett <spamtra |
2008-04-13 21:43:32 |
|
"Maarten Kronenburg& |
2008-04-14 01:55:14 |
|
Jake Waskett <spamtra |
2008-04-14 11:19:35 |
|
"James Van Buskirk&q |
2008-04-14 02:38:07 |
|
"Maarten Kronenburg& |
2008-04-14 20:53:32 |
|
"Maarten Kronenburg& |
2008-04-15 17:13:38 |
|
"Maarten Kronenburg& |
2008-04-15 21:58:21 |
|
Terence <spamtrap@[EM |
2008-04-13 17:14:55 |
|
"Wolfgang Kern" |
2008-04-14 12:42:35 |
|
"James Van Buskirk&q |
2008-04-14 13:53:21 |
|
"Wolfgang Kern" |
2008-04-16 15:34:09 |
|
"James Van Buskirk&q |
2008-04-16 10:05:48 |
|
Robert Redelmeier <red |
2008-04-14 14:21:05 |
|
"James Van Buskirk&q |
2008-04-14 02:58:34 |
|
"Maarten Kronenburg& |
2008-04-14 18:09:21 |
|
Terje Mathisen <spamt |
2008-04-15 07:28:26 |
|
Terence <spamtrap@[EM |
2008-04-14 05:00:42 |
|
Terence <spamtrap@[EM |
2008-04-14 15:09:35 |
|
Terence <spamtrap@[EM |
2008-04-15 02:29:34 |
|
Gerd Isenberg <spamtr |
2008-04-15 02:56:39 |
|
"James Van Buskirk&q |
2008-04-16 00:33:16 |
|
"Maarten Kronenburg& |
2008-04-16 14:42:37 |
|
"Maarten Kronenburg& |
2008-04-16 19:38:21 |
|
"James Van Buskirk&q |
2008-04-16 12:41:36 |
|
"Maarten Kronenburg& |
2008-04-16 21:39:47 |
|
"Maarten Kronenburg& |
2008-04-17 16:43:31 |
|
Gerd Isenberg <spamtr |
2008-04-16 09:58:11 |
|
"James Van Buskirk&q |
2008-04-16 12:59:38 |
|