Robert Redelmeier wrote:
> rep_movsd <spamtrap@[EMAIL PROTECTED]
> wrote in part:
>> The thing is Floyd-Steinberg dithering works serially
>> one pixel at a time... Each pixel processed affects the
>> pixel to its right and the pixels on the next scanline.
>
> Hmm ... this sounds a bit like the game of LIFE .
> It has seen some asm optimization that might be reelvant.
Indeed.
I was sure that I had the victory to "Dr Dobbs Second Annual Code
Optimization Challenge" all sewn up:
I had very code data compression, I worked on four cells in parallel,
and I used the minimum possible (i.e. 3) updates to modify the
surrounding cells when anything changed.
David Stafford didn't only beat me, but he did so by a factor of two, by
taking advantage of the fact that the 486 target cpu had an 8 KB L1 cache.
My algorithm did less work, and was faster when a lot of cells were
alive, but after the first 200 generations or so, David's working set
started to fit in L1 cache, and he blew me away from that point on. :-)
Anyway, FS works on the current pixel values, instead of being able to
generate a new output bitmap in parallel for all cells/pixels.
On a real many-core cpu it might still make sense to run the algorithm
in parallel:
Start core 1 on pixel (1,1). As soon as it finishes pixel (2,1) there
will be no further updates to pixel (1,2), so the second core can start
on this pixel at this time, right?
When the second core has finished the first two pixels, the third core
can start on the third line etc.
In the limit, such an approach can reduce the problem from
O(height*width) to O(height+width) as long as you have the resources
needed for (height) scan lines, in the form of separate
cores*number_of_scanlines_that_fit_in_registers.
Terje
--
- <Terje.Mathisen@[EMAIL PROTECTED]
>
"almost all programming can be viewed as an exercise in caching"


|