On May 8, 7:00 pm, Terje Mathisen <spamt...@[EMAIL PROTECTED]
> wrote:
>
> Interesting algorithm, very serially dependent. :-(
>
> How does it handle the corner/edge cases?
>
> I.e. the last pixel on a scan line has no down-right neighbor, the first
> pixel has no down-left and the last scanline has nothing below at all!
>
First of all, I'm honored to have a reply from Terje :)
The edges are handled by ignoring all pixels which do not have the
neighbours, the simplest way I found was to keep my x and y loops
ranging over [1, width-1) and [0, height-1).
>
> It seems to me like the getNearestPalColor() function could easily
> dominate the processing time here!
Well Initially the code used the standard distance matching algorithm
as seen in zillions of color reduction libraries :
For each color in the palette
Compute the RGB colorspace distance between the palette entry and the
target color -> (r - rp)^2 + (g - gp)^2 + (b - bp)^2
Choose the index with the minimum distance.
This was pathetically slow... It has to loop 256 times for each pixel
doing several arithmetic ops!
One optimization is to use abs and get rid of the multiplications.
Then I thought of a simplistic algorithm based on a precomputed
table...
I start with a 3d array 256 * 256 * 256, initialized to -1
I iterate over the 256 colors and each time i "draw" a cube of a
certain size in that array around each color entry. The cube size
grows from size 3 to 33. The cube "color" is the palette index.
So the sequence goes like 1 pixel, 3x3 shell, 5x5 shell etc etc. I
also make sure I do not overwrite an empty slot.
Its kind of like a cubical floodfill.
Now I can use this lookup table to get the nearest color at any given
R, G and B.
Later I changed the box to 32x32x32 and maximal box size to 8 and
still get good results.
The reasoning is that Floyd-Steinberg dithering perturbs the colors
usually by a small amount, so it would be unlikely to look for a color
thats way far away from any given palette entry.
It can happen that the lookup table doesnt have an entry at some given
point, in such cases I fall back on the old linear algorithm.
With some tests I find that about 80% of the cube is usually filled,
and the "cache misses" happen about 5% to 8% of the time.
> I guess a 3D space partition would be a good way to go, using an initial
> lookup table based on something like 3:3:3 bits to start in the proper
> subvolume, or simply use that table to point to lists of possible
> palColor candidates for each starting point.
>
> (Due to truncation effects, several pal entries might belong to multiple
> lists, depending upon the value of the rounded/truncated RGB bits, but
> with a good set of palette colors, the number of values to test against
> should be (on average) pretty low.
> Anyway, to speed up the dithering stage I would use MMX/SSE and work on
> multiple lines in parallel, keeping the intermediate results in
> registers until we're finished with them.
How is it possible to do multiple lines in paralell?
One thing I can think of is that :
At each pixel, 2 pixels have to be updated immediately - the right and
bottom-left. The other two have to be updated twice.. So I could hold
3 error values in registers, ac***ulating them and touching each pixel
only once instead of twice...
My biggest problem now is that the main loop is littered with memory
accesses, and since the getNearestPaletteIndex() function is called on
every pixel, there is some register starvation.
I will next try to avoid looping on x and y indexing and increment the
pointers directly.
But my last measurement shows that this diffusion ( which took more
than half the time before) is now less of the performance hog. So I'll
have to see where to pry out some performance next...
Its been really fun for me to dive into assembler after a period of 6
or 7 years and actually get things working fast!!
Best Regards
Vivek


|