On Apr 18, 1:21 pm, Terje Mathisen <spamt...@[EMAIL PROTECTED]> wrote:
> The only cache at all is a 6/8 byte prefetch queue for instruction
> bytes, but on 8088 that queue is very often empty since it has to
> compete with all memory accesses.
4 bytes, actually, on 8088. It's 6 on 8086 and NEC V20/30.
> Taken branches were something like 4-8 cycles afair, might be wrong.
17, actually. That's where the detective work comes in: to "beat" the
penalty of a jump, the alternative has to take 16 cycles or fewer
(including opcode fetch, as you noted).
Based on what you wrote, I think 17 cycles is not that big a deal,
since any alternative would most likely be longer and slower.
> If most samples turn out to be in range, then we should simplify the
> code to optimize this path even further:
"Most" is subjective, but in an ADPCM decoder (what this code is for)
I would agree that the ratio of samples that need clamping to samples
that don't is about 1:8, so yes, most samples will be in range.
> xor dx,dx
> next: ; bytes
> mov dl,[bx+si] ; 2+1
> lodsb ; 1+1
> cbw ; 1 -128 to 127
> add ax,dx ; 2 -128 to 383
> test ah,0ffh
> jnz clamp
> stosb
> jmp next
>
> clamp:
> js underflow
> mov al,255
> stosb
> jmp next
>
> underflow:
> xor ax,ax
> stosb
> jmp next
>
> Terje
Looks like a winner; thanks for all your help!
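
For anyone who wants to check the assembly against a known-good result, here is a rough C model of what I understand the quoted loop to compute: an unsigned byte plus a sign-extended byte, clamped to 0..255. This is only a sketch, not the decoder itself. The names clamp_add, decode_block, base and delta are made up for illustration, and I'm assuming the byte at [bx+si] is the unsigned value and the byte at [si] is the signed one.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model of one pass through the quoted loop.
 * base  models the unsigned byte loaded with "mov dl,[bx+si]".
 * delta models the signed byte loaded with "lodsb" and widened by "cbw". */
static uint8_t clamp_add(uint8_t base, int8_t delta)
{
    int sum = (int)base + (int)delta;   /* -128 .. 382 */

    /* Same idea as "test ah,0ffh": if nothing is set above the low
     * byte, the sum already fits and we take the fast path. */
    if ((sum & ~0xFF) == 0)
        return (uint8_t)sum;

    /* Out of range: a negative sum underflowed, anything else overflowed. */
    return (sum < 0) ? 0 : 255;
}

/* Apply the same operation to n samples (buffer layout is an assumption). */
static void decode_block(const uint8_t *base, const int8_t *delta,
                         uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = clamp_add(base[i], delta[i]);
}
```

Feeding the same input bytes to this and to the assembly loop should produce identical output buffers, which makes it easy to spot a wrong branch in the clamp path.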

