"rep_movsd" wrote in message
> On May 7, 11:35 pm, "Maarten Kronenburg" wrote:
> > "Maarten Kronenburg" wrote in message
> >
> > Now I see the data is in bytes. In that case it seems better to put 16 bytes
> > in a 128-bit xmm register, then put 8 bytes each time into 8 16-bit words
> > by shifting and anding, and do the above with PSLLW/PSRLW and PADDW/PSUBW in
> > the 16-bit words. Then the scaling mentioned is not needed because the upper
> > 8 bits in the 16-bit words should be zero.
> > Maarten.
>
> Thanks...
>
> The thing is Floyd-Steinberg dithering works serially one pixel at a
> time... Each pixel processed affects the pixel to its right and the
> pixels on the next scanline.
> So the best I can hope for is to handle one pixel's R, G and B byte
> values in one go.
>
> Let me clarify with actual code (highly unoptimized C++, for clarity's
> sake)
Yes, it's always good to first write working C++ code and test it before
translating it to assembler.
>
> /////////////////////////////////////////////////
> struct RGBA {unsigned char r, g, b, a; };
>
> int saturateAdd(int a, int b)
> {
> int ret = a + b;
> if(ret < 0) return 0;
> if(ret > 255) return 255;
> return ret;
> }
>
This is a signed addition with unsigned saturation.
Normally a signed byte goes from -2^7 to 2^7-1, so the sum of two signed
bytes can never exceed 2^8-1 anyway (it can still go below 0).
See below for a hint on how to do this.
> void diffuse(RGBA* pImg, int w, int h)
> {
> for(int y = 0; y < h; ++y)
> {
> RGBA* pPix = pImg + (y * w);
Here it seems that w and h are unsigned int offsets, so unsigned int seems
better for w and h.
> for(int x = 0; x < w; ++x, ++pPix)
> {
> RGBA bestMatch = getNearestPalColor(pPix);
> int rDiff = pPix->r - bestMatch.r;
> int gDiff = pPix->g - bestMatch.g;
> int bDiff = pPix->b - bestMatch.b;
>
>
> RGBA* pNext = pPix + 1;
> pNext->r = saturateAdd(pNext->r, (rDiff * 7) >> 16);
> pNext->g = saturateAdd(pNext->g, (gDiff * 7) >> 16);
> pNext->b = saturateAdd(pNext->b, (bDiff * 7) >> 16);
Dividing by 16 means shifting right by 4, so this should be >> 4, not >> 16.
>
> // repeat 3 lines above for pixel below, below left and
> // below right with coefficients 5, 3 and 1
> }
> }
> }
>
> ///////////////////////////////////////////////
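To illustrate the remark about the shift in the quoted code: the difference of two bytes times 7 lies in -1785 .. 1785, so >> 16 leaves only 0 or -1, while >> 4 gives the intended 7/16 share. A minimal sketch, where errorShare7 and errorShare7Wrong are hypothetical helper names and right shift of a negative int is assumed to be arithmetic (true on mainstream compilers, guaranteed since C++20):

```cpp
#include <cassert>

// Hypothetical helper: the 7/16 share of a quantization error.
// Assumes >> on a negative int is an arithmetic shift, so the
// result is rounded toward negative infinity.
int errorShare7(int diff)
{
    return (diff * 7) >> 4;
}

// The shift as written in the quoted code, kept for comparison.
int errorShare7Wrong(int diff)
{
    return (diff * 7) >> 16;
}

// Checks every possible difference of two bytes.
void checkErrorShare()
{
    for (int diff = -255; diff <= 255; ++diff)
    {
        // |diff * 7| <= 1785 < 2^16, so only the sign survives >> 16.
        assert(errorShare7Wrong(diff) == (diff < 0 ? -1 : 0));
    }
    assert(errorShare7(255) == 111);   // 111.5625 rounded down
    assert(errorShare7(-255) == -112); // -111.5625 rounded down
}
```

Note that >> 4 rounds negative values downward, which differs slightly from dividing by 16 in C++, where the result is rounded toward zero.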
>
> Since logical and multiply aren't available for 8-bit operands, here's
> what I'm thinking....
>
> The pixel bytes, let's say, are RGBA
> I do a PUNPCKLBW getting bytes XRXGXBXA in an MMX register
> The X are unwanted values.
> Then I do a PAND with 0x00FF00FF00FF00FF , getting rid of the X's
>
This seems to be OK, because here they are still unsigned.
> I repeat the same process for the new palette pixel
>
> Then I can do a PSUB to get 4 signed differences
> Then PMULLW with the coefficient value like 0x0007000700070007
> and PSRAW to shift
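Yes, this seems to be OK: PSUBW for the differences and PMULLW for the signed
multiplication.
Because the operands fit in 8 bits each, the differences lie in -255 .. 255 and
the products with 7 in -1785 .. 1785. That fits in a signed 16-bit word, so you
need only the low result.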
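The PSUBW / PMULLW / PSRAW sequence can be modelled in plain C++ before writing any assembler. The sketch below is a scalar model, not MMX code: Lanes, widen, errorShare and checkLanes are hypothetical names, one std::array of four 16-bit words stands for one MMX register, and arithmetic right shift of negative values is assumed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stands for one 64-bit MMX register viewed as four signed 16-bit words.
using Lanes = std::array<std::int16_t, 4>;

// Zero-extends four bytes to four words (the effect of PUNPCKLBW + PAND).
Lanes widen(const std::array<std::uint8_t, 4>& px)
{
    Lanes out{};
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = px[i];
    return out;
}

// Per word: PSUBW, then PMULLW by coeff, then PSRAW by 4.
Lanes errorShare(const Lanes& src, const Lanes& pal, std::int16_t coeff)
{
    Lanes out{};
    for (std::size_t i = 0; i < out.size(); ++i)
    {
        int diff = src[i] - pal[i]; // -255 .. 255
        int prod = diff * coeff;    // at most 1785 in magnitude for coeff 7
        out[i] = static_cast<std::int16_t>(prod >> 4);
    }
    return out;
}

// Worked example: differences 255, -255, 8 and 0 with coefficient 7.
void checkLanes()
{
    const std::array<std::uint8_t, 4> srcBytes = {{255, 0, 200, 255}};
    const std::array<std::uint8_t, 4> palBytes = {{0, 255, 192, 255}};
    const Lanes e = errorShare(widen(srcBytes), widen(palBytes), 7);
    assert(e[0] == 111 && e[1] == -112 && e[2] == 3 && e[3] == 0);
}
```

Running the same inputs through the eventual assembler routine and comparing the four words is one way to test it.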
>
> Now i have 4 signed WORDs in some MMX register which are the signed
> differences between the original and palettized pixel colors...
>
> Now how do i add these to the destination pixel with saturated
> addition?
>
> There are instructions for adding signed values with signed saturation
> and unsigned values with unsigned saturation.
> How do i add signed differences to unsigned values with unsigned
> saturation?
>
> Perhaps some sort of tricky bit manipulation can work?
>
The result A is unsigned, so in the range 0 .. 2^8-1.
You need to add to it, with saturation, a signed B in the range -2^7 .. 2^7-1.
The trick is to first normally (that is, without saturation) subtract 2^7
from A, so that it becomes signed.
That means that 0 becomes -2^7 and 2^8-1 becomes 2^7-1, which is exactly the
signed range.
Note that for unsaturated add and subtract, signed or unsigned doesn't
matter, because it always wraps around in the same way.
Then add the signed B with signed saturation.
Then normally add 2^7 to the resulting A again.
This works because the total byte saturation range is 2^8, whether signed or
unsigned.
The unsaturated subtraction of 2^7 may just as well be an unsaturated addition
of 2^7, because as mentioned unsaturated addition and subtraction wrap around,
and 2^7 is exactly half the wrap-around of 2^8.
Perhaps do this in C++ first, as above, with a function
signed char saturate_signed_add( signed char a, signed char b )
then rewrite diffuse with this function and the above trick, test it, and then
put it in assembler.
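The trick above could be prototyped in C++ along these lines. This is only a sketch: saturate_signed_add follows the signature suggested above, add_signed_to_unsigned and checkBiasTrick are hypothetical names, and the bias of 2^7 is applied in int arithmetic instead of relying on byte wrap-around, to keep the C++ portable. In assembler the same steps would presumably be a wrapping byte add (or XOR) with 0x80, a signed saturating add such as PADDSB, and the same bias again.

```cpp
#include <cassert>

// Reference: the saturateAdd from the quoted code.
int saturateAdd(int a, int b)
{
    int ret = a + b;
    if (ret < 0) return 0;
    if (ret > 255) return 255;
    return ret;
}

// Signed byte add with signed saturation (what PADDSB does per byte).
signed char saturate_signed_add(signed char a, signed char b)
{
    int sum = a + b;
    if (sum < -128) return -128;
    if (sum > 127) return 127;
    return static_cast<signed char>(sum);
}

// Hypothetical name: adds signed b to unsigned a with unsigned saturation.
unsigned char add_signed_to_unsigned(unsigned char a, signed char b)
{
    // Subtract 2^7: maps 0 .. 255 onto -128 .. 127.
    signed char biased = static_cast<signed char>(a - 128);
    signed char sum = saturate_signed_add(biased, b);
    // Add 2^7 again: maps -128 .. 127 back onto 0 .. 255.
    return static_cast<unsigned char>(sum + 128);
}

// Compares the trick against saturateAdd for every input pair.
void checkBiasTrick()
{
    for (int a = 0; a <= 255; ++a)
        for (int b = -128; b <= 127; ++b)
            assert(add_signed_to_unsigned(static_cast<unsigned char>(a),
                                          static_cast<signed char>(b))
                   == saturateAdd(a, b));
}
```

The exhaustive check is cheap here (256 x 256 pairs), so it is worth running once before moving the same steps into assembler.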
> Anyhow even if i get this far and have to do the rest normally without
> MMX, it should be much simpler code than the horror that my compiler
> generates for the above C++ code.
>
> Further ideas appreciated...
>
Yes, but assembler code may also be hard to debug.
Maarten.

