Danjel McGougan wrote:
> On 17 Apr, 20:14, Terje Mathisen <spamt...@[EMAIL PROTECTED]
> wrote:
>> pete wrote:
>>>> Was Terje's solution:
>>>> add dl,al ; Carry set if it overflows!
>>>> sbb al,al ; AL = 0xff if overflow
>>>> or dl,al ; Turns any overflow into 0xff
>>>> ...the right one? I see overflow handling but not underflow...
> [...]
>> Anyway, unless clamping is very common, it is hard to beat the branch
>> predictors!
>>
>
> I don't think the 808x CPU has a branch predictor. I guess one could
Ouch, I forgot about the 8088/8086 target!
> just add up the cycles used for each instruction on such an old CPU
> and come up with the best performing code for various statistics of
> the data. It should be pretty much deterministic (does it even use
> cache memory?).
The only cache at all is a small prefetch queue for instruction bytes
(4 bytes on the 8088, 6 on the 8086), but on the 8088 that queue is very
often empty since it has to compete with all memory accesses.
The performance is easy to calculate though:
Without taken branches:
4 cycles per byte touched, code & data.
Taken branches were something like 4-8 cycles afair, might be wrong.
; Input: AL = signed byte, DL = unsigned byte
; Output: AL unsigned clamped byte
xor dx,dx
next: ; bytes
mov dl,[bx+si] ; 2+1
lodsb ; 1+1
cbw ; 1 -128 to 127
add ax,dx ; 2 -128 to 383
js underflow ; 2
test ah,ah ; 2
jnz overflow ; 2
store:
stosb ; 1+1 Store clamped result
jmp next ; 2+t
Adding up the counts above, this is 18 bytes touched (15 code, 3 data),
i.e. about 72 cycles for a nonclamped result, plus the taken jmp.
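As a cross-check, the per-instruction annotations from the listing can be added up mechanically. This is only a sketch of the "4 cycles per byte touched" rule of thumb: the table layout and the names `insn_cost`, `bytes_touched` and `estimate_cycles` are made up for illustration, and the taken-branch penalty ("t") is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* One row per instruction on the in-range path: code bytes fetched
   plus data bytes read or written, copied from the listing's comments. */
struct insn_cost {
    const char *text;
    int code_bytes;
    int data_bytes;
};

static const struct insn_cost in_range_path[] = {
    { "mov dl,[bx+si]", 2, 1 },
    { "lodsb",          1, 1 },
    { "cbw",            1, 0 },
    { "add ax,dx",      2, 0 },
    { "js underflow",   2, 0 },  /* not taken */
    { "test ah,ah",     2, 0 },
    { "jnz overflow",   2, 0 },  /* not taken */
    { "stosb",          1, 1 },
    { "jmp next",       2, 0 },  /* taken, but "t" is not counted here */
};

#define PATH_LEN(p) (sizeof (p) / sizeof (p)[0])

enum { CYCLES_PER_BYTE = 4 };  /* 8088 rule of thumb, an approximation */

/* Total bytes moved over the bus for one pass through a path. */
int bytes_touched(const struct insn_cost *path, size_t n)
{
    assert(path != NULL);
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += path[i].code_bytes + path[i].data_bytes;
    return total;
}

/* Estimated cycles for the path, ignoring taken-branch penalties. */
int estimate_cycles(const struct insn_cost *path, size_t n)
{
    return CYCLES_PER_BYTE * bytes_touched(path, n);
}
```

Calling `bytes_touched(in_range_path, PATH_LEN(in_range_path))` gives 18, and `estimate_cycles` with the same arguments gives 72. Swapping in the rows for the underflow or overflow path gives the matching estimate for those cases.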
underflow:
xor ax,ax ; 2
stosb ; 1+1
jmp next ; 2+t
Underflow adds a taken branch but gets rid of 2 code bytes, so we're
talking about +/- zero extra cost.
overflow:
mov al,255 ; 2
stosb ; 1+1
jmp next ; 2+t
Overflow requires 2 extra code bytes and a taken branch, i.e.
approximately 12+ cycles.
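For anyone who wants to check the three paths without an 8088 at hand, here is a small C model of what the loop body computes for one pair of bytes. It is a sketch, not the code under discussion: the function name `clamp_add` is hypothetical, and each C statement is only meant to mirror the instruction named in its comment.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference model of one loop iteration:
   s is the signed byte loaded by lodsb, u the unsigned byte in DL. */
uint8_t clamp_add(int8_t s, uint8_t u)
{
    int16_t ax = s;              /* cbw: sign-extend AL into AX   */
    ax = (int16_t)(ax + u);      /* add ax,dx (DH is always zero) */
    assert(ax >= -128 && ax <= 383);

    if (ax < 0)                  /* js underflow                  */
        return 0;                /* xor ax,ax / stosb             */
    if ((ax >> 8) != 0)          /* test ah,ah / jnz overflow     */
        return 255;              /* mov al,255 / stosb            */
    return (uint8_t)ax;          /* store: stosb                  */
}
```

For example, `clamp_add(-1, 1)` returns 0 via the in-range path, `clamp_add(-128, 0)` takes the underflow path, and `clamp_add(127, 255)` takes the overflow path.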
If most samples turn out to be in range, then we should simplify the
code to optimize this path even further:
xor dx,dx
next: ; bytes
mov dl,[bx+si] ; 2+1
lodsb ; 1+1
cbw ; 1 -128 to 127
add ax,dx ; 2 -128 to 383
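test ah,0ffh ; ZF clear if AH != 0, i.e. result outside 0..255
jnz clamp ; Rare path
stosb ; Store in-range result
jmp next
clamp:
js underflow ; SF still from the test: AH negative means below 0
mov al,255 ; Overflow: clamp to 255
stosb
jmp next
underflow:
xor ax,ax ; Underflow: clamp to 0
stosb
jmp next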
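The reordered version should give exactly the same results, just with a single test on the common path. A C sketch of that structure, with an exhaustive comparison against plain saturating arithmetic, might look like this. The names `clamp_add_fast` and `count_mismatches` are hypothetical, and the model says nothing about cycle counts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the fast-path variant: one test of AH decides
   whether any clamping is needed, and the sign is checked only after. */
uint8_t clamp_add_fast(int8_t s, uint8_t u)
{
    uint16_t ax = (uint16_t)((int16_t)s + u);  /* cbw / add ax,dx       */
    uint8_t ah = (uint8_t)(ax >> 8);

    /* The sum lies in -128..383, so AH can only be 0x00, 0x01 or 0xFF. */
    assert(ah == 0x00 || ah == 0x01 || ah == 0xFF);

    if (ah == 0)                   /* test ah,0ffh / jnz clamp not taken */
        return (uint8_t)ax;        /* stosb                              */
    return (ah & 0x80) ? 0 : 255;  /* js underflow, else mov al,255      */
}

/* Compare against straightforward saturation for all 65536 input
   pairs and return the number of disagreements. */
int count_mismatches(void)
{
    int bad = 0;
    for (int s = -128; s <= 127; s++) {
        for (int u = 0; u <= 255; u++) {
            int sum = s + u;
            int want = sum < 0 ? 0 : (sum > 255 ? 255 : sum);
            if (clamp_add_fast((int8_t)s, (uint8_t)u) != want)
                bad++;
        }
    }
    return bad;
}
```

If the model matches the listing, `count_mismatches()` returns 0, which is a cheap way to convince yourself that the sign flag left by the test is enough to tell underflow from overflow.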
Terje
--
- <Terje.Mathisen@[EMAIL PROTECTED]>
"almost all programming can be viewed as an exercise in caching"

