

|
 |
| Programming > Assembly x86 > Re: Population ... |
|
| << Topic |
< Post |
Post 9 of 33 Topic 4614 of 4729
|
Post > |
Topic >> |
Re: Population count in SSE2, again
by "James Van Buskirk" <spamtrap@[EMAIL PROTECTED]
>
Apr 14, 2008 at 02:38 AM
| "Maarten Kronenburg" <spamtrap@[EMAIL PROTECTED]
> wrote in message
news:4802639c$0$720$7ade8c0d@[EMAIL PROTECTED]
> For what it's worth below I give my 32-bit assembler for popcnt.
This is a crude version of the well-known code in AMD's optimization
manual, and is not as fast as the AMD code.
> Also I peeked for popcnt in GMP 4.2.2, see:
> http://gmplib.org/
> They have an IA-64 Itanium popcnt with 1 cycle/limb (64-bit), using the
> IA-64 popcnt instruction.
I couldn't find where you got that number from. In
http://www.technovelty.org/code/arch/popcount.html
I see a result of 6 clocks per 8 bytes for Itanium 2.
> Their table lookup popcnt runs 8 cycle/limb (32-bit).
Obviously one should be able to get close to 1 byte per cycle
with a Core 2 Duo and 8-bit LUT. Only crappy code would be
short of this mark by a factor of 2.
> Clearly your xmm popcnt1 is also very fast, about 1 cycle/limb (32-bit)!
Also on a machine with throughput of one 64-bit hardware per clock
cycle it may be possible to get about 10 bytes per cycle, if the
other throughput problem I am experiencing is not also a problem in
the code I envision.
> But because it seems more complicated you should also test it for all
> possible 128-bit bit patterns.
No, the thing to do is to follow the references given and assure
oneself that the algorithm is mathematically sound. The compression
stage only needs a truth table for all combinations of three input
bits (2**3 antries) to be exhaustive and the wrap-up stage can
be thought of as 16 independent bytes in parallel. The only
aspects that I haven't seen in previous code were use PSADBW
for the final sweep stage (which is, if anything, more
transparent than the multiply and ****ft given by AMD, but integer
registers don't have SIMD operations in the x86 ISA) and the
negative logic in the compression stage (which can be extablished
by an 8-entry truth table if you flunked logic circuits.)
After that, testing with extreme conditions such as all bits set
and all bits clear as well as random bits to guard against gross
errors in coding, while not exhaustive of all possibilities, is
at least persuasive to me. Besides, the algorithm mashes
together 1024 bits in each loop trip, distilling out one 128-bit
drop to count up, not to mention that the dregs of previous
iterations is an input to each successive iteration, so even
an unattainable number like 128 bits doesn't even start to
describe the complexity of the algorithm. It's explained fairly
well in the second link I gave in the seminal post of this
thread. Check it out and see what's going on here.
The Wikipedia entry on Hamming weight is pretty good for the
wrap stage of the algorithm, but the code there is almost
identical to your example, so it probably doesn't help you.
--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end


|
33 Posts in Topic:
|
"James Van Buskirk&q |
2008-04-12 03:14:45 |
|
Terence <spamtrap@[EM |
2008-04-12 16:55:40 |
|
Terje Mathisen <spamt |
2008-04-13 15:08:36 |
|
"James Van Buskirk&q |
2008-04-13 09:20:49 |
|
"Maarten Kronenburg& |
2008-04-13 21:48:40 |
|
Jake Waskett <spamtra |
2008-04-13 21:43:32 |
|
"Maarten Kronenburg& |
2008-04-14 01:55:14 |
|
Jake Waskett <spamtra |
2008-04-14 11:19:35 |
|
"James Van Buskirk&q |
2008-04-14 02:38:07 |
|
"Maarten Kronenburg& |
2008-04-14 20:53:32 |
|
"Maarten Kronenburg& |
2008-04-15 17:13:38 |
|
"Maarten Kronenburg& |
2008-04-15 21:58:21 |
|
Terence <spamtrap@[EM |
2008-04-13 17:14:55 |
|
"Wolfgang Kern" |
2008-04-14 12:42:35 |
|
"James Van Buskirk&q |
2008-04-14 13:53:21 |
|
"Wolfgang Kern" |
2008-04-16 15:34:09 |
|
"James Van Buskirk&q |
2008-04-16 10:05:48 |
|
Robert Redelmeier <red |
2008-04-14 14:21:05 |
|
"James Van Buskirk&q |
2008-04-14 02:58:34 |
|
"Maarten Kronenburg& |
2008-04-14 18:09:21 |
|
Terje Mathisen <spamt |
2008-04-15 07:28:26 |
|
Terence <spamtrap@[EM |
2008-04-14 05:00:42 |
|
Terence <spamtrap@[EM |
2008-04-14 15:09:35 |
|
Terence <spamtrap@[EM |
2008-04-15 02:29:34 |
|
Gerd Isenberg <spamtr |
2008-04-15 02:56:39 |
|
"James Van Buskirk&q |
2008-04-16 00:33:16 |
|
"Maarten Kronenburg& |
2008-04-16 14:42:37 |
|
"Maarten Kronenburg& |
2008-04-16 19:38:21 |
|
"James Van Buskirk&q |
2008-04-16 12:41:36 |
|
"Maarten Kronenburg& |
2008-04-16 21:39:47 |
|
"Maarten Kronenburg& |
2008-04-17 16:43:31 |
|
Gerd Isenberg <spamtr |
2008-04-16 09:58:11 |
|
"James Van Buskirk&q |
2008-04-16 12:59:38 |
|
Post A Reply:

|
|
|
|