

|
 |
| Programming > Assembly x86 > Re: Population ... |
|
| << Topic |
< Post |
Post 10 of 33 Topic 4614 of 4710
|
Post > |
Topic >> |
Re: Population count in SSE2, again
by "Maarten Kronenburg" <spamtrap@[EMAIL PROTECTED]
>
Apr 14, 2008 at 08:53 PM
| "James Van Buskirk" wrote in message
> "Maarten Kronenburg" wrote in message
>
> > For what it's worth below I give my 32-bit assembler for popcnt.
>
> This is a crude version of the well-known code in AMD's optimization
> manual, and is not as fast as the AMD code.
>
Yes it's the basic one, I'm now putting it in xmm and see where I get.
I work in 32-bit system so the number of registers is a limitation.
> > Also I peeked for popcnt in GMP 4.2.2, see:
> > http://gmplib.org/
> > They have an IA-64 Itanium popcnt with 1 cycle/limb (64-bit), using
the
> > IA-64 popcnt instruction.
>
> I couldn't find where you got that number from. In
>
Just go to:
http://gmplib.org
and download gmp-4.2.2.tar.gz.
Then look for the files popcount.asm.
I just copied their timings, perhaps they have already improved.
> http://www.technovelty.org/code/arch/popcount.html
>
> I see a result of 6 clocks per 8 bytes for Itanium 2.
>
> > Their table lookup popcnt runs 8 cycle/limb (32-bit).
>
> Obviously one should be able to get close to 1 byte per cycle
> with a Core 2 Duo and 8-bit LUT. Only crappy code would be
> short of this mark by a factor of 2.
>
> > Clearly your xmm popcnt1 is also very fast, about 1 cycle/limb
(32-bit)!
>
> Also on a machine with throughput of one 64-bit hardware per clock
> cycle it may be possible to get about 10 bytes per cycle, if the
> other throughput problem I am experiencing is not also a problem in
> the code I envision.
>
> > But because it seems more complicated you should also test it for all
> > possible 128-bit bit patterns.
>
> No, the thing to do is to follow the references given and assure
> oneself that the algorithm is mathematically sound. The compression
> stage only needs a truth table for all combinations of three input
> bits (2**3 antries) to be exhaustive and the wrap-up stage can
> be thought of as 16 independent bytes in parallel. The only
> aspects that I haven't seen in previous code were use PSADBW
> for the final sweep stage (which is, if anything, more
> transparent than the multiply and ****ft given by AMD, but integer
> registers don't have SIMD operations in the x86 ISA) and the
> negative logic in the compression stage (which can be extablished
> by an 8-entry truth table if you flunked logic circuits.)
>
Yes I'm sure you are right, but a typo in assembler is easily made.
> After that, testing with extreme conditions such as all bits set
> and all bits clear as well as random bits to guard against gross
> errors in coding, while not exhaustive of all possibilities, is
> at least persuasive to me. Besides, the algorithm mashes
> together 1024 bits in each loop trip, distilling out one 128-bit
> drop to count up, not to mention that the dregs of previous
> iterations is an input to each successive iteration, so even
> an unattainable number like 128 bits doesn't even start to
> describe the complexity of the algorithm. It's explained fairly
> well in the second link I gave in the seminal post of this
> thread. Check it out and see what's going on here.
>
> The Wikipedia entry on Hamming weight is pretty good for the
> wrap stage of the algorithm, but the code there is almost
> identical to your example, so it probably doesn't help you.
>
Yes I found:
http://www.hackersdelight.org/HDcode/newCode/pop_arrayHS.c
To put those in assembler as you did is quite some work.
Perhaps you may find and eliminate dependency chains in your assembler
code,
that's the normal way to optimize exisiting assembler, but sometimes it's
rather a puzzle.
My popcnt is part of an integer library, just like GMP, and it's linear so
its performance is not so critical, but for some other special
applications
it may be critical. Amazing that you can get so far with optimizing
tricks.
Regards, Maarten.


|
33 Posts in Topic:
|
"James Van Buskirk&q |
2008-04-12 03:14:45 |
|
Terence <spamtrap@[EM |
2008-04-12 16:55:40 |
|
Terje Mathisen <spamt |
2008-04-13 15:08:36 |
|
"James Van Buskirk&q |
2008-04-13 09:20:49 |
|
"Maarten Kronenburg& |
2008-04-13 21:48:40 |
|
Jake Waskett <spamtra |
2008-04-13 21:43:32 |
|
"Maarten Kronenburg& |
2008-04-14 01:55:14 |
|
Jake Waskett <spamtra |
2008-04-14 11:19:35 |
|
"James Van Buskirk&q |
2008-04-14 02:38:07 |
|
"Maarten Kronenburg& |
2008-04-14 20:53:32 |
|
"Maarten Kronenburg& |
2008-04-15 17:13:38 |
|
"Maarten Kronenburg& |
2008-04-15 21:58:21 |
|
Terence <spamtrap@[EM |
2008-04-13 17:14:55 |
|
"Wolfgang Kern" |
2008-04-14 12:42:35 |
|
"James Van Buskirk&q |
2008-04-14 13:53:21 |
|
"Wolfgang Kern" |
2008-04-16 15:34:09 |
|
"James Van Buskirk&q |
2008-04-16 10:05:48 |
|
Robert Redelmeier <red |
2008-04-14 14:21:05 |
|
"James Van Buskirk&q |
2008-04-14 02:58:34 |
|
"Maarten Kronenburg& |
2008-04-14 18:09:21 |
|
Terje Mathisen <spamt |
2008-04-15 07:28:26 |
|
Terence <spamtrap@[EM |
2008-04-14 05:00:42 |
|
Terence <spamtrap@[EM |
2008-04-14 15:09:35 |
|
Terence <spamtrap@[EM |
2008-04-15 02:29:34 |
|
Gerd Isenberg <spamtr |
2008-04-15 02:56:39 |
|
"James Van Buskirk&q |
2008-04-16 00:33:16 |
|
"Maarten Kronenburg& |
2008-04-16 14:42:37 |
|
"Maarten Kronenburg& |
2008-04-16 19:38:21 |
|
"James Van Buskirk&q |
2008-04-16 12:41:36 |
|
"Maarten Kronenburg& |
2008-04-16 21:39:47 |
|
"Maarten Kronenburg& |
2008-04-17 16:43:31 |
|
Gerd Isenberg <spamtr |
2008-04-16 09:58:11 |
|
"James Van Buskirk&q |
2008-04-16 12:59:38 |
|
Post A Reply:

|
|
|
|