Jim Leonard wrote:
> On Apr 18, 1:21 pm, Terje Mathisen <spamt...@[EMAIL PROTECTED]
> wrote:
>> The only cache at all is a 6/8 byte prefetch queue for instruction
>> bytes, but on 8088 that queue is very often empty since it has to
>> compete with all memory accesses.
>
> 4 bytes, actually, on 8088. It's 6 on 8086 and NEC V20/30.
>
>> Taken branches were something like 4-8 cycles afair, might be wrong.
>
> 17, actually. That's where the detective work comes in; to "beat" the
> penalty of a jump, the alternate solution has to be 16 cycles or less
> (including opcode fetch as you noted).
>
> I think, based on what you wrote, that 17 cycles is not that big a
> deal when alternate solutions would most likely be longer/slower.
>
>> If most samples turn out to be in range, then we should simplify the
>> code to optimize this path even further:
>
> "Most" is subjective, but in an ADPCM decoder (what this code is for)
> I would agree that the ratio of needing clamping to not needing is
> about 1:8, so yes, most samples will be in range.
>
>> xor dx,dx
>> next: ; bytes
>> mov dl,[bx+si] ; 2+1
>> lodsb ; 1+1
>> cbw ; 1 -128 to 127
>> add ax,dx ; 2 -128 to 382
>> test ah,0ffh
>> jnz clamp
>> stosb
>> jmp next
>>
>> clamp:
>> js underflow
>> mov al,255
>> stosb
>> jmp next
>>
>> underflow:
>> xor ax,ax
>> stosb
>> jmp next
>>
>> Terje
>
> Looks like a winner; thanks for all your help!
>
Might it be "better" (or at least smaller code) to place the stosb right
after the "next" label, so that all three paths share a single store?
You'd then need a setup JMP to skip it on the first pass
[and a "cx=0" test at some point to end the loop!]:
xor dx,dx
jmp short start
next: ; bytes
stosb
start:
mov dl,[bx+si] ; 2+1
lodsb ; 1+1
cbw ; 1 -128 to 127
add ax,dx ; 2 -128 to 382
test ah,0ffh
jz next
clamp:
js underflow
mov al,255
jmp next
underflow:
xor ax,ax
jmp next
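
For anyone wanting to check the clamping logic away from the target machine, here is a rough C model of what the loop above computes. It is only a sketch: the names clamp_add, decode, base, delta and out are my own, "base" stands in for the bytes read via [bx+si], "delta" for the bytes read by lodsb, and the count n stands in for the missing cx test.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper modelling one pass of the loop body.
   base  : unsigned byte loaded into dl (dh is zero), 0..255
   delta : signed byte loaded by lodsb and widened by cbw, -128..127 */
uint8_t clamp_add(uint8_t base, int8_t delta)
{
    int sum = (int)base + (int)delta;      /* add ax,dx: -128..382 */

    if (((unsigned)sum & 0xFF00u) == 0)    /* test ah,0ffh / jz next */
        return (uint8_t)sum;               /* in range, store as-is */

    return (sum < 0) ? 0 : 255;            /* js underflow, else mov al,255 */
}

/* Hypothetical driver: n plays the role of the cx count that the
   assembly still needs; each result is what stosb would store. */
void decode(const uint8_t *base, const int8_t *delta, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = clamp_add(base[i], delta[i]);
}
```

The high-byte test covers both failure cases at once: a sum above 255 leaves ah = 1 and a negative sum leaves ah = 0FFh, so only the sign then needs checking to pick 0 or 255.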

