BBC BASIC for Windows: Setting a register to zero if it's < zero

http://bb4w.conforums.com/index.cgi?board=assembler&action=display&num=1317631428

Setting a register to zero if it's < zero
Post by David Williams on Oct 3^rd, 2011, 08:43am

Unless I dreamt it, I'm sure Richard once showed a little 'trick' to test if a 32-bit register (EAX, EBX, etc.) is less than zero, and set it to zero (in a branchless way) if so.

Please remind me!

Re: Setting a register to zero if it's < zero
Post by David Williams on Oct 3^rd, 2011, 09:41am

Actually, an efficient bit of code for clamping a signed 32-bit integer to between 0 and 255 inclusive would be even more useful. I'm currently trawling the web for code snippets...

Re: Setting a register to zero if it's < zero
Post by David Williams on Oct 3^rd, 2011, 09:46am

Found this:

(QUOTE)
The generic way to handle arbitrary clamping values is something like this:

;; if ( x < MIN) x = MIN;
;; else if (x > MAX) x = MAX;

cmp eax,MIN ; Carry set if (eax < MIN)
sbb ebx,ebx ; EBX = -1 if underflow
cmp eax,MAX+1 ; Carry clear if (eax > MAX)
adc ebx,ebx ; Merge carry with previous result

At this point there are 4 possible combinations:

            second 0     second 1
first 0     OVFL         OK
first 1     impossible   UNDFL

These four corresponds to EBX = 0, 1, -2, -1 which means that it is
possible to store the original value into a four-element table, then use
EBX to retrieve either the same value back or the clamped results:

cmp eax,MIN ; Carry set if (eax < MIN)
sbb ebx,ebx ; EBX = -1 if underflow
cmp eax,MAX+1 ; Carry clear if (eax > MAX)
adc ebx,ebx ; Merge carry with previous result
mov clamp_table[4], eax
mov eax,clamp_table[ebx*4+8]

This code will only be faster than the naive approach if clamped values
are both common and non-predictable, if this isn't true then it is much
better to simply use the generic C-style if/else if/else approach!

Terje
(END QUOTE)

That'll probably do.
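
For anyone who wants to check the quoted index arithmetic, here is a minimal C model of the table lookup. It is only a sketch: the function and variable names are made up for illustration and are not Terje's. It uses unsigned values, because the carry flag after CMP reflects an unsigned "below" comparison. Since EBX = 1 means "in range" and the lookup reads offset ebx*4+8 = 12, the sketch puts the original value in that slot (index 3) rather than at offset 4 as the quoted store has it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C model of the cmp/sbb/cmp/adc table lookup.
   Unsigned semantics: the carry flag after CMP is an unsigned "below" test. */
uint32_t clamp_by_table(uint32_t x, uint32_t min_v, uint32_t max_v)
{
    assert(min_v <= max_v && max_v < UINT32_MAX);  /* MAX+1 must not wrap */

    int32_t carry1 = (x < min_v);       /* cmp eax,MIN   : CF = (x < MIN)      */
    int32_t ebx    = -carry1;           /* sbb ebx,ebx   : -1 on underflow     */
    int32_t carry2 = (x < max_v + 1);   /* cmp eax,MAX+1 : CF clear if x > MAX */
    ebx = ebx + ebx + carry2;           /* adc ebx,ebx   : ebx = 2*ebx + CF    */

    /* ebx is now -1 (underflow), 0 (overflow) or 1 (in range); -2 cannot occur. */
    uint32_t table[4] = { 0, min_v, max_v, x };  /* slots for ebx = -2, -1, 0, 1 */
    return table[ebx + 2];              /* same slot as clamp_table[ebx*4+8]   */
}
```

With MIN = 16 and MAX = 235, an input of 5 selects the MIN slot, 250 selects the MAX slot, and 100 comes back unchanged. For signed inputs the comparisons would need adapting, since the carry flag does not track signed order.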

Re: Setting a register to zero if it's < zero
Post by admin on Oct 3^rd, 2011, 10:37am

on Oct 3^rd, 2011, 09:46am, David Williams wrote:

That'll probably do.

But do note Terje's postscript: "This code will only be faster than the naive approach if clamped values are both common and non-predictable, if this isn't true then it is much better to simply use the generic C-style if/else if/else approach!". Specifically, if not many values have to be clamped (and in my experience that's the usual case) you'll probably be better off using conditional jumps.

Richard.

Re: Setting a register to zero if it's < zero
Post by Michael Hutton on Oct 8^th, 2011, 9:38pm

Could you use the CMOV instruction?

push max
push min
cmp eax, max
cmovg eax,[esp+4]
cmp eax, min
cmovl eax,[esp]
add esp,8

Michael

(You may have seen the original post... I didn't realise at first that CMOV doesn't actually do the test itself. In a way I wish it did.)

Edit:
Also, I suppose you could do it all in registers if you have them spare:

mov edi, max
mov esi, min
cmp eax, edi
cmovg eax, edi
cmp eax, esi
cmovl eax, esi

another Edit:

Yes this works:

Code:

      INSTALL @lib$ + "ASMLIB"
      DIM C% 100, L%-1
      FOR pass=8 TO 10 STEP 2
        P% = C%
        ON ERROR LOCAL [OPT FN_asmext
        [
        OPT pass
        .clamp
        ;eax contains value to test
        push 5
        push 0
        cmp eax, 5
        cmovg eax, [esp+4]
        cmp eax, 0
        cmovl eax, [esp]
        add esp,8
        ret
        ]
      NEXT
      
      FOR A% = -5 TO 10
        PRINT A%, USR(clamp)
      NEXT
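
As a cross-check in C (an illustration only, with a made-up function name), the same two-step clamp can be written with conditional expressions. Compilers targeting x86 often turn this shape into CMP/CMOV pairs, but that depends on the compiler and optimisation level, so it is worth inspecting the generated assembly rather than assuming it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring the cmp/cmovg, cmp/cmovl sequence. */
int32_t clamp_select(int32_t x, int32_t lo, int32_t hi)
{
    assert(lo <= hi);
    x = (x > hi) ? hi : x;   /* cmp eax,max : cmovg eax,max */
    x = (x < lo) ? lo : x;   /* cmp eax,min : cmovl eax,min */
    return x;
}
```

For the 0 to 5 range used in the BASIC test loop, inputs below 0 give 0, inputs above 5 give 5, and everything in between passes through unchanged.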

Re: Setting a register to zero if it's < zero
Post by David Williams on Oct 9^th, 2011, 05:03am

on Oct 8^th, 2011, 9:38pm, Michael Hutton wrote:

push max
push min
cmp eax, max
cmovg eax,[esp+4]
cmp eax, min
cmovl eax,[esp]
add esp,8

Looks very elegant to my eyes. And totally branchless.

I think that with the help of your solution, my bitmap colour saturation modifying routine will be fast enough to render dozens of colour-saturation-(de-)enhanced sprites in realtime, offering yet another creative option for the growing er.. army of BB4W game-makers.

Thanks again, Michael.

Rgs,
David.

Re: Setting a register to zero if it's < zero
Post by admin on Oct 9^th, 2011, 09:46am

on Oct 9^th, 2011, 05:03am, David Williams wrote:

Looks very elegant to my eyes. And totally branchless.

The downside, of course, is that (if we're talking about BB4W here) it means using the ASMLIB library, with all that implies in respect of disabling crunching - for example needing to put your source in a separate file which you assemble using CALL.

I assume your application is somewhat untypical, in that clipping is both common and unpredictable. In the more typical case (digital filtering is a classic example) clipping is the exception: the majority of values will be 'in range'. In that case the naive approach using conditional jumps will outperform the CMOV code on any modern CPU.

Incidentally, your original question was how to clip off negative values, i.e. clipping to zero. That's simpler than the general case of course, for example this is one branchless solution (if that's definitely what you want):

Code:

      cmp eax,&80000000
      sbb ebx,ebx
      and eax,ebx

Richard.
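
The three instructions can be modelled in C as follows (a sketch with an illustrative function name, not code from the thread). The CMP against &80000000 sets the carry flag exactly when EAX, viewed as unsigned, is below &80000000, i.e. when the signed value is non-negative. SBB EBX,EBX then turns that carry into a mask of all ones or all zeros, and the final AND keeps or clears the value.

```c
#include <stdint.h>

/* Hypothetical C model of: cmp eax,&80000000 / sbb ebx,ebx / and eax,ebx */
int32_t clamp_negative_to_zero(int32_t x)
{
    uint32_t eax   = (uint32_t)x;
    uint32_t carry = (eax < 0x80000000u);  /* CF = 1 when x >= 0            */
    uint32_t mask  = 0u - carry;           /* sbb ebx,ebx: all ones or zero */
    return (int32_t)(eax & mask);          /* and eax,ebx                   */
}
```

Note that the assembler version uses EBX as scratch, so that register has to be free at the point where the clamp is done.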

Re: Setting a register to zero if it's < zero
Post by David Williams on Oct 9^th, 2011, 10:28am

on Oct 9^th, 2011, 09:46am, Richard Russell wrote:

Yes, I'm aware of this. I'll probably hand-assemble the instructions to get around the problem of not being able to crunch the program.

on Oct 9^th, 2011, 09:46am, Richard Russell wrote:

I assume your application is somewhat untypical, in that clipping is both common and unpredictable. In the more typical case (digital filtering is a classic example) clipping is the exception: the majority of values will be 'in range'. In that case the naive approach using conditional jumps will outperform the CMOV code on any modern CPU.

I'll conduct some timed tests at some point, aware of course of the performance variations resulting from different CPU architectures.

on Oct 9^th, 2011, 09:46am, Richard Russell wrote:

Incidentally, your original question was how to clip off negative values, i.e. clipping to zero. That's simpler than the general case of course, for example this is one branchless solution (if that's definitely what you want):

Code:

      cmp eax,&80000000
      sbb ebx,ebx
      and eax,ebx

Richard.

Thanks for this. I think I might set up another web page at bb4wgames.com -- a reference page of handy code snippets and gems just like that one; various x86 asm tips & tricks, etc.

David.

Re: Setting a register to zero if it's < zero
Post by Michael Hutton on Oct 9^th, 2011, 10:31am

Here are three links discussing the cmov instruction which seem not to be so impressed with it:

http://www.redhat.com/archives/rhl-devel-list/2009-February/msg00422.html
https://mail.mozilla.org/pipermail/tamarin-devel/2008-April/000455.html
http://ondioline.org/mail/cmov-a-bad-idea-on-out-of-order-cpus

So it's horses for courses as per usual.

Michael

Re: Setting a register to zero if it's < zero
Post by Michael Hutton on Oct 9^th, 2011, 11:11am

OK, for what it's worth, here is a not very accurate test, but it seems to favour the cmov instruction with random numbers on a Core i5. You can test with not-so-random numbers as well, I suppose, and it will vary according to processor type.

Code:

         0:               INSTALL @lib$ + "ASMLIB"
         0:               DIM C% 100, L%-1
         0:               FOR pass=8 TO 10 STEP 2
         0:               P% = C%
         0:               ON ERROR LOCAL [OPT FN_asmext
         0:               [
         0:               OPT pass
         0:               .clampcmov
         0:               .M%
         0:               push 255
         0:               push 0
         0:               cmp eax, 255
         0:               cmovg eax, [esp+4]
         0:               cmp eax, 0
         0:               cmovl eax, [esp]
         0:               add esp,8
         0:               ret
         0:               
         0:               .clampbranch
         0:               .N%
         0:               cmp eax,5
         0:               jl testmin
         0:               mov eax,5
         0:               .testmin
         0:               cmp eax,0
         0:               jg end
         0:               xor eax, eax
         0:               .end
         0:               ret
         0:               
         0:               .clampcmov2
         0:               .O%
         0:               mov esi, 255
         0:               mov edi,0
         0:               cmp eax, 255
         0:               cmovg eax, esi
         0:               cmp eax, 0
         0:               cmovl eax, edi
         0:               ret
         0:               
         0:               
         0:               ]
         0:               NEXT
         0:               
        20:     0.08      FOR I%=1 TO 10000000
      1695:     7.13      A% = RND
      5503:    23.15      B% = USR(M%)
       628:     2.64      NEXT
         0:               
        41:     0.17      FOR I%=1 TO 10000000
      1637:     6.89      A% = RND
      5628:    23.67      B% = USR(N%)
       683:     2.87      NEXT
         0:               
        25:     0.11      FOR I%=1 TO 10000000
      1668:     7.02      A% = RND
      5552:    23.35      B% = USR(O%)
       693:     2.91      NEXT
         0:               
         0:               
         0:               END
         0:

:-/

Michael

Re: Setting a register to zero if it's < zero
Post by David Williams on Oct 9^th, 2011, 12:01pm

Interesting stuff, Michael. Thanks for links to the CMOV discussions, and the timing test.

Those discussions seem to suggest, overall, that the jury is still out?

Incidentally, when I do timing tests, I usually raise the process priority of my program to Higher (or Above Normal) -- certainly not Highest, because I think it produces more stable timings.

For my specific application (fast colour saturation enhancement/reduction - basically blending a colour RGB pixel with its own greyscale intensity -- another routine for GFXLIB, of course), I don't think the branches are very predictable, so perhaps CMOV will win for me. It remains to be seen!

===

Now, sorry to chop & change, but...

I'd like to hijack my own thread and ask what you think is the fastest way of getting the separate R, G, B values from a 32-bit xRGB pixel (&xxRrGgBb) into three registers... say EAX, EBX, ECX.

I've found that this seems fast:

Code:

movzx ecx, BYTE [edi + 4*esi]      ; blue
movzx ebx, BYTE [edi + 4*esi + 1]  ; green
movzx eax, BYTE [edi + 4*esi + 2]  ; red

; EDI points to ARGB32 bitmap base address
; ESI is the pixel index

By the way, I found no discernible speed increase by loading (via LEA) the pixel address EDI+4*ESI into some register.

Why am I surprised that the above method is noticeably faster than this one below:

Code:

mov edx, [edi + 4*esi]   ; load 32-bit &xxRrGgBb pixel

mov eax, edx             ; copy EDX
mov ebx, edx             ; copy EDX (again)
mov ecx, edx             ; ...and again

and eax, &FF0000         ; EAX = &00 Rr 00 00 (red component * 2^16)
and ebx, &FF00           ; EBX = &00 00 Gg 00 (green component * 2^8)
and ecx, &FF             ; ECX = &00 00 00 Bb (blue component)
        
shr eax, 16              ; EAX (al) = &Rr (red byte)
shr ebx, 8               ; EBX (bl) = &Gg (green byte)
;                        ; ECX (cl) = &Bb (blue byte)

(My overly verbose ASM comments are for my own benefit - I don't intend to patronise! I nearly always verbosely comment my asm code.)

So, one single main memory access (in the latter case) versus three main memory accesses.

Would you say that the three EDX copying instructions are destroying parallel (U & V pipe) processing opportunities?

In any case, is there a faster way of getting those R, G, B values into EAX, EBX and ECX respectively?

Thanks in advance!

Rgs,
David.

PS. I notice that this forum system sometimes changes the word a_s_s_e_m_b_l_y into disagreembly. Bizarre.

Re: Setting a register to zero if it's < zero
Post by Michael Hutton on Oct 9^th, 2011, 12:35pm

Quote:

In any case, is there a faster way of getting those R, G, B values into EAX, EBX and ECX respectively?

Good god! Why are you asking me? ???

I am an amoebic 'stimulus - respond' man when it comes to asm! Normally, I just play with what you've posted!

But out of interest, I was thinking about that very thing the other day when you sent me the 'Colourdrain' code, and I am in Visual C++ now, typing out the procedure so that I can look at the asm output and see how it changes depending on the C++ code.

I too was surprised you used the :
Code:

movzx ecx, BYTE [edi + 4*esi]      ; blue
movzx ebx, BYTE [edi + 4*esi + 1]  ; green
movzx eax, BYTE [edi + 4*esi + 2]  ; red

but came to the conclusion that it would be cached.

but as alternatives I played around initially with:
Code:

xor ebx,ebx
xor ecx,ecx
mov eax, [edi + 4 * esi]
mov bl, ah     ; green (edited from original)
mov cl, al     ; blue (edited from original)
shr eax,16
; and eax, &FF (if you're going to use eax rather than al)

and if you just want the byte values you could forget about the xor's and just use the al, bl and cl registers..

so I suppose the bare minimal would be:
Code:

mov eax, [edi + 4 * esi]
mov bl, ah ; green (edited from original)
mov cl, al ; blue (edited from original)
shr eax,16

and you'll have the rgb values in al, bl, cl respectively. If you want them 'tidied' to 32 bit values then add

Code:

and eax,&FF
and ebx,&FF
and ecx,&FF

I hope I haven't overlooked something blindingly obvious as per usual.

Michael

Re: Setting a register to zero if it's < zero
Post by David Williams on Oct 9^th, 2011, 1:33pm

on Oct 9^th, 2011, 12:35pm, Michael Hutton wrote:

Good god! Why are you asking me? ???

Yes, sorry if I put you on the spot a little. Actually, the question was meant to be addressed to you, Richard and anyone else who knows x86 assembly language.

on Oct 9^th, 2011, 12:35pm, Michael Hutton wrote:

so I suppose the bare minimal would be:
Code:

mov eax, [edi + 4 * esi]
mov ebx, ah
mov ecx, al
shr eax,16

Sorry Michael, but the BB4W assembler didn't like MOVing an 8-bit register into a 32-bit one like that. It's hard to see why such an operation shouldn't be allowed, but then I'm not an Intel engineer!

Here is a timing test I did (excuse the obsession with data - and even instruction - alignment!):

Code:

      MODE 8 : OFF
      
      HIMEM = LOMEM + 2*&100000
      
      DIM gap1% 4096
      DIM bitmap% 4*(640*512 + 1)
      bitmap% = (bitmap% + 3) AND -4
      DIM  gap2% 4096
      
      REM. These 4 Kb gaps are probably way OTT, but just to be certain!
      
      PROC_asm
      
      REM. Fill bitmap with random values
      FOR I% = bitmap% TO (bitmap% + 4*640*512)-1 STEP 4
        !I% = RND
      NEXT
      
      G% = FNSYS_NameToAddress( "GetTickCount" )
      time0% = 0
      time1% = 0
      
      PRINT '" Conducting test #1, please wait..."'
      
      REM. Test #1 (three MOVZX instructions)
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      SYS G% TO time0%
      FOR I% = 1 TO 5000
        CALL A%
      NEXT
      SYS G% TO time1%
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      PRINT '" Test #1 (three MOVZX instructions) took ";
      PRINT ;(time1% - time0%)/1000; " s."'
      
      PRINT '" Conducting test #2, please wait..."'
      
      REM. Test #2 (one MOV instruction)
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      SYS G% TO time0%
      FOR I% = 1 TO 5000
        CALL B%
      NEXT
      SYS G% TO time1%
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      PRINT '" Test #2 (one MOV instruction) took ";
      PRINT ;(time1% - time0%)/1000; " s."'
      
      PRINT '" Conducting test #3, please wait..."'
      
      REM. Test #3 (one MOV instruction; other instructions re-ordered)
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      SYS G% TO time0%
      FOR I% = 1 TO 5000
        CALL C%
      NEXT
      SYS G% TO time1%
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      PRINT '" Test #3 (one MOV instruction; other instructions re-ordered) took ";
      PRINT ;(time1% - time0%)/1000; " s."'
      
      PRINT''" Finished."
      END
      :
      :
      :
      :
      DEF PROC_asm
      LOCAL I%, P%, code%, loop_A, loop_B, loop_C
      DIM code% 1000
      
      FOR I% = 0 TO 2 STEP 2
        P% = code%
        [OPT I%
        
        ] : P%=(P%+3) AND -4 : [OPT I%
        
        .A%
        mov edi, bitmap%
        xor esi, esi
        .loop_A
        movzx ecx, BYTE [edi + 4*esi]
        movzx ebx, BYTE [edi + 4*esi + 1]
        movzx eax, BYTE [edi + 4*esi + 2]
        add esi, 1
        cmp esi, (640 * 512)
        jl loop_A
        ret
        
        ] : P%=(P%+3) AND -4 : [OPT I%
        
        .B%
        mov edi, bitmap%
        xor esi, esi
        .loop_B
        mov edx, [edi + 4*esi]
        mov eax, edx
        mov ebx, edx
        mov ecx, edx
        and eax, &FF0000
        and ebx, &FF00
        and ecx, &FF
        shr eax, 16
        shr ebx, 8
        add esi, 1
        cmp esi, (640 * 512)
        jl loop_B
        ret
        
        ] : P%=(P%+3) AND -4 : [OPT I%
        
        .C%
        mov edi, bitmap%
        xor esi, esi
        .loop_C
        mov edx, [edi + 4*esi]
        
        mov eax, edx
        and eax, &FF0000
        shr eax, 16
        
        mov ebx, edx
        and ebx, &FF00
        shr ebx, 8
        
        mov ecx, edx
        and ecx, &FF
        
        add esi, 1
        cmp esi, (640 * 512)
        jl loop_C
        ret
        
        ]
        
      NEXT I%
      ENDPROC
      :
      :
      :
      :
      DEF FNSYS_NameToAddress( f$ )
      LOCAL P%
      DIM P% LOCAL 5
      [OPT 0 : call f$ : ]
      =P%!-4+P%

On my Intel Centrino Duo 1.whatever GHz laptop, I get the following results:

Test #1 (three MOVZX instructions): 7.84 seconds.

Test #2 (one MOV instruction): 10.92 seconds.

Test #3 (one MOV instruction; other instructions re-ordered): 10.97 seconds.

By "MOV instruction", I mean of course the main memory/cache pixel data loads.

I'd say that, with a difference of nearly 3 seconds, the three MOVZX method is significantly faster.

I still wonder if there's an even faster way of doing it.

Regards,

David.

Re: Setting a register to zero if it's < zero
Post by Michael Hutton on Oct 9^th, 2011, 1:39pm

Code:

mov eax, [edi + 4 * esi]
mov ebx, ah
mov ecx, al
shr eax,16

well, you must have known I meant: ;)

Code:

mov eax, [edi + 4 * esi]
mov bl, ah
mov cl, al
shr eax,16

I shouldn't edit code in the conforums!

Put that in the timings program... I would do myself but I have only got the demo version on this laptop (Yes, Richard, I can't remember what the Serial Number reg key is actually called and exactly where it is!)

Michael

Re: Setting a register to zero if it's < zero
Post by David Williams on Oct 9^th, 2011, 2:03pm

on Oct 9^th, 2011, 1:39pm, Michael Hutton wrote:

well, you must have known I meant: ;)

I was just being dense. And having been up all night... etc.

on Oct 9^th, 2011, 1:39pm, Michael Hutton wrote:

Code:

mov eax, [edi + 4 * esi]
mov bl, ah
mov cl, al
shr eax,16

Right, the updated timing test code with your solution(s):

Code:

      MODE 8 : OFF
      
      HIMEM = LOMEM + 2*&100000
      
      DIM gap1% 4096
      DIM bitmap% 4*(640*512 + 1)
      bitmap% = (bitmap% + 3) AND -4
      DIM  gap2% 4096
      
      REM. These 4 Kb gaps are probably way OTT, but just to be certain!
      
      PROC_asm
      
      REM. Fill bitmap with random values
      FOR I% = bitmap% TO (bitmap% + 4*640*512)-1 STEP 4
        !I% = RND
      NEXT
      
      G% = FNSYS_NameToAddress( "GetTickCount" )
      time0% = 0
      time1% = 0
      
      PRINT '" Conducting test #1, please wait..."'
      
      REM. Test #1 (three MOVZX instructions)
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      SYS G% TO time0%
      FOR I% = 1 TO 5000
        CALL A%
      NEXT
      SYS G% TO time1%
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      PRINT '" Test #1 (three MOVZX instructions) took ";
      PRINT ;(time1% - time0%)/1000; " s."'
      
      PRINT '" Conducting test #2, please wait..."'
      
      REM. Test #2 (one MOV instruction)
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      SYS G% TO time0%
      FOR I% = 1 TO 5000
        CALL B%
      NEXT
      SYS G% TO time1%
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      PRINT '" Test #2 (one MOV instruction) took ";
      PRINT ;(time1% - time0%)/1000; " s."'
      
      PRINT '" Conducting test #3, please wait..."'
      
      REM. Test #3 (one MOV instruction; other instructions re-ordered)
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      SYS G% TO time0%
      FOR I% = 1 TO 5000
        CALL C%
      NEXT
      SYS G% TO time1%
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      PRINT '" Test #3 (one MOV instruction; other instructions re-ordered) took ";
      PRINT ;(time1% - time0%)/1000; " s."'
      
      PRINT '" Conducting test #4, please wait..."'
      
      REM. Test #4 (one MOV instruction; Michael's solution (no XORs))
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      SYS G% TO time0%
      FOR I% = 1 TO 5000
        CALL D%
      NEXT
      SYS G% TO time1%
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      PRINT '" Test #4 (one MOV instruction; Michael's solution (no XORs)) took ";
      PRINT ;(time1% - time0%)/1000; " s."'
      
      PRINT '" Conducting test #5, please wait..."'
      
      REM. Test #5 (one MOV instruction; Michael's solution (with XORs))
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      SYS G% TO time0%
      FOR I% = 1 TO 5000
        CALL E%
      NEXT
      SYS G% TO time1%
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      PRINT '" Test #5 (one MOV instruction; Michael's solution (with XORs)) took ";
      PRINT ;(time1% - time0%)/1000; " s."'
      
      PRINT''" Finished."
      END
      :
      :
      :
      :
      DEF PROC_asm
      LOCAL I%, P%, code%, loop_A, loop_B, loop_C, loop_D, loop_E
      DIM code% 1000
      
      FOR I% = 0 TO 2 STEP 2
        P% = code%
        [OPT I%
        
        ] : P%=(P%+3) AND -4 : [OPT I%
        
        .A%
        mov edi, bitmap%
        xor esi, esi
        .loop_A
        movzx ecx, BYTE [edi + 4*esi]
        movzx ebx, BYTE [edi + 4*esi + 1]
        movzx eax, BYTE [edi + 4*esi + 2]
        add esi, 1
        cmp esi, (640 * 512)
        jl loop_A
        ret
        
        ] : P%=(P%+3) AND -4 : [OPT I%
        
        .B%
        mov edi, bitmap%
        xor esi, esi
        .loop_B
        mov edx, [edi + 4*esi]
        mov eax, edx
        mov ebx, edx
        mov ecx, edx
        and eax, &FF0000
        and ebx, &FF00
        and ecx, &FF
        shr eax, 16
        shr ebx, 8
        add esi, 1
        cmp esi, (640 * 512)
        jl loop_B
        ret
        
        ] : P%=(P%+3) AND -4 : [OPT I%
        
        .C%
        mov edi, bitmap%
        xor esi, esi
        .loop_C
        mov edx, [edi + 4*esi]
        
        mov eax, edx
        and eax, &FF0000
        shr eax, 16
        
        mov ebx, edx
        and ebx, &FF00
        shr ebx, 8
        
        mov ecx, edx
        and ecx, &FF
        
        add esi, 1
        cmp esi, (640 * 512)
        jl loop_C
        ret
        
        ] : P%=(P%+3) AND -4 : [OPT I%
        
        .D%
        mov edi, bitmap%
        xor esi, esi
        .loop_D
        
        mov eax, [edi + 4 * esi]
        mov bl, ah
        mov cl, al
        shr eax,16
        
        add esi, 1
        cmp esi, (640 * 512)
        jl loop_D
        ret
        
        ] : P%=(P%+3) AND -4 : [OPT I%
        
        .E%
        mov edi, bitmap%
        xor esi, esi
        .loop_E
        
        xor ebx, ebx
        xor ecx, ecx
        mov eax, [edi + 4 * esi]
        mov bl, ah
        mov cl, al
        shr eax,16
        
        add esi, 1
        cmp esi, (640 * 512)
        jl loop_E
        ret
        
        ]
        
      NEXT I%
      ENDPROC
      :
      :
      :
      :
      DEF FNSYS_NameToAddress( f$ )
      LOCAL P%
      DIM P% LOCAL 5
      [OPT 0 : call f$ : ]
      =P%!-4+P%

Results:

Test#1 (three MOVZX's) : 7.81 s
Test#2 (one MOV) : 10.89 s
Test#3 (one MOV; instructions re-ordered): 10.83 s
Test#4 (one MOV; Michael's solution (no XORs)): 6.02 s
Test#5 (one MOV; Michael's solution (with XORs)): 7.45 s

Since I probably need (I can't remember off-hand) to have the 3 higher bytes of EAX, EBX and ECX all clear, your code with the XORs looks like the one to go with, so thanks for that.

(Edit: Hey, I think this is another prime candidate for my proposed BB4W "Handy Assembler Code Snippets" (or whatever) web page!)

Rgs,
David.

Re: Setting a register to zero if it's < zero
Post by Michael Hutton on Oct 9^th, 2011, 2:21pm

Well, I am chuffed it worked.

Michael

Re: Setting a register to zero if it's < zero
Post by admin on Oct 9^th, 2011, 2:27pm

on Oct 9^th, 2011, 11:11am, Michael Hutton wrote:

Ok, for what it is worth, a not so very accurate test

I would rather say 'completely meaningless'.

There are no alignment instructions to ensure that the different subroutines are being run 'on a level playing field', i.e. with each independently aligned to a multiple of 32 bytes.

Also, the only way you'll get any meaningful comparison is by looping in the assembler code, because otherwise the overhead of the USR function and the NEXT statement will be hugely more than the code you are trying to benchmark.

Even having taken care of those factors, when comparing my suggested code with CMOV I still found that if I swapped around the two subroutines it was always the second that ran the faster, so the difference clearly wasn't related to the actual code.

It is well established that using conditional jumps will always win if the jumps are accurately predicted by the CPU.

Richard.

Re: Setting a register to zero if it's < zero
Post by admin on Oct 9^th, 2011, 2:35pm

on Oct 9^th, 2011, 12:01pm, David Williams wrote:

I'd like to hijack my own thread and ask what you think is the fastest way of getting the separate R, G, B values from a 32-bit xRGB pixel (&xxRrGgBb) into three registers... say EAX, EBX, ECX.

I know you didn't ask me, but....

I wouldn't start from there, because constraining yourself to using the general-purpose registers (eax etc.) is going to limit the speed, however optimised the code. The MMX instructions are specifically designed to handle things like 32-bit xRGB pixels efficiently, so I would expect using MMX would be the fastest way.

Of course it depends on what you do with the values next, once you've got them in separate registers, but I'd still be surprised if MMX doesn't win.

Richard.

Re: Setting a register to zero if it's < zero
Post by admin on Oct 9^th, 2011, 2:40pm

on Oct 9^th, 2011, 1:33pm, David Williams wrote:

It is allowed, but you have to use the correct mnemonic:

Code:

mov eax, [edi + 4 * esi]
movzx ebx, ah
movzx ecx, al
shr eax,16

Richard.
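
In C terms, the sequence amounts to one 32-bit load followed by three byte extractions. The following is a rough equivalent with illustrative names, assuming the &xxRrGgBb layout described earlier in the thread. The C version masks the red byte explicitly: after `shr eax,16` the second byte of EAX still holds the unused xx byte, so the assembler version needs either a mask or a read via AL if the full 32-bit register is going to be used.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: split an &xxRrGgBb pixel into zero-extended components. */
void split_xrgb(uint32_t pixel, uint32_t *r, uint32_t *g, uint32_t *b)
{
    assert(r && g && b);
    *b = pixel & 0xFFu;           /* movzx ecx, al                   */
    *g = (pixel >> 8) & 0xFFu;    /* movzx ebx, ah                   */
    *r = (pixel >> 16) & 0xFFu;   /* shr eax,16 (plus a mask for xx) */
}
```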

Re: Setting a register to zero if it's < zero
Post by David Williams on Oct 9^th, 2011, 2:55pm

on Oct 9^th, 2011, 2:35pm, Richard Russell wrote:

I know you didn't ask me, but....

I certainly had you in mind as well!

The trouble with the English language is that, in contrast to French and Arabic, it doesn't have a plural form of the word "you" (as in "you people", "you two gifted assembly language programmers", etc.).

on Oct 9^th, 2011, 2:35pm, Richard Russell wrote:

I wouldn't start from there, because constraining yourself to using the general-purpose registers (eax etc.) is going to limit the speed, however optimised the code. The MMX instructions are specifically designed to handle things like 32-bit xRGB pixels efficiently, so I would expect using MMX would be the fastest way.

I'll certainly look into it. I've got some MMX code of yours (alpha blending) which I might be able to adapt.

on Oct 9^th, 2011, 2:35pm, Richard Russell wrote:

Of course it depends on what you do with the values next, once you've got them in separate registers, but I'd still be surprised if MMX doesn't win.

Once they're in the registers, I compute the intensity using the fixed-point version of the formula:

i = 0.114*R + 0.587*G + 0.299*B

then I 'blend' i (which is in the range 0 to 255) with each of the R, G, B values by an amount specified as a factor ranging from 0.0 to beyond 1.0 (up to 10.0, say). Values less than 1.0 de-saturate the colour, values over 1.0 enhance saturation. (By the way, all the math is fixed-point, so this 'saturation factor' is multiplied by 2^20 (&100000) before it's passed to the routine).

I like the effect it produces (technically accurate or otherwise!), and it will be a nice addition to GFXLIB.

Rgs,
David.
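
To make the arithmetic concrete, here is a small C sketch of the idea. It is not the GFXLIB routine: the names are invented, and it assumes that "blend" means out = i + f*(c - i) for each channel c, followed by a clamp to 0..255. The weights are taken exactly as written in the post (0.114 on R, 0.299 on B), scaled by 2^20; the commonly quoted Rec. 601 luma weights put 0.299 on red and 0.114 on blue, so it is worth checking which assignment is intended. The product of f and the channel difference is computed in 64 bits, because a factor of up to 10 * 2^20 times a difference of up to 255 does not fit in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

#define FIX_ONE (1 << 20)   /* 1.0 in 12.20 fixed point */

static int32_t clamp_byte(int32_t v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

/* Hypothetical sketch: blend each channel with the pixel's intensity.
   f is the saturation factor times 2^20; 0 gives grey, FIX_ONE leaves
   the pixel unchanged, larger values push channels away from grey. */
uint32_t adjust_saturation(uint32_t pixel, int32_t f)
{
    assert(f >= 0);
    int32_t r = (int32_t)((pixel >> 16) & 0xFFu);
    int32_t g = (int32_t)((pixel >> 8) & 0xFFu);
    int32_t b = (int32_t)(pixel & 0xFFu);

    /* Weights as given in the post, scaled by 2^20 (they sum to exactly 2^20). */
    int32_t i = (119538 * r + 615514 * g + 313524 * b) / FIX_ONE;

    /* 64-bit intermediate avoids overflow; division truncates toward zero. */
    int32_t nr = clamp_byte(i + (int32_t)(((int64_t)f * (r - i)) / FIX_ONE));
    int32_t ng = clamp_byte(i + (int32_t)(((int64_t)f * (g - i)) / FIX_ONE));
    int32_t nb = clamp_byte(i + (int32_t)(((int64_t)f * (b - i)) / FIX_ONE));

    /* Keep the unused top byte as it was. */
    return (pixel & 0xFF000000u) | ((uint32_t)nr << 16) | ((uint32_t)ng << 8) | (uint32_t)nb;
}
```

This is where the clamp discussed at the top of the thread comes in: with f above 1.0 the blended channels can leave the 0..255 range in either direction, and whether they do depends on the pixel data.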

Re: Setting a register to zero if it's < zero
Post by David Williams on Oct 9^th, 2011, 2:58pm

on Oct 9^th, 2011, 2:40pm, Richard Russell wrote:

It is allowed, but you have to use the correct mnemonic:

Code:

mov eax, [edi + 4 * esi]
movzx ebx, ah
movzx ecx, al
shr eax,16

Richard.

Excellent. I can do away with those horrid XORs now. :)

Re: Setting a register to zero if it's < zero
Post by David Williams on Oct 9^th, 2011, 3:09pm

Okay, with Richard's correction, the fastest non-MMX/non-SIMD code yet is indeed:

Code:

        mov eax, [edi + 4 * esi]
        movzx ebx, ah
        movzx ecx, al
        shr eax,16

It took 5.93 s. in the timed test, approaching 2 seconds faster than the three MOVZX byte loads (Test #1).

Thanking MH and RTR once more.

Re: Setting a register to zero if it's < zero
Post by Michael Hutton on Oct 9^th, 2011, 3:14pm

Quote:

I would rather say 'completely meaningless'.

Quote:

overhead of the USR function and the NEXT statement

In defence, I wasn't commenting on the timing of the BASIC loops, just the profiler report for the USR() line, but even that only has a resolution of 1 ms.

I did say **seems to favour** over 10 million runs.... I am not sure "completely meaningless" really does apply, but I won't argue the point any further. I know what you are saying.

It obviously needs more rigorous testing.

re mmx:
I've been playing around trying to get Visual C++ Express to spew out SIMD code, but it doesn't want to play ball at the moment.

Its asm output for the ColourDrain routine is very similar to what you have already coded, David. In fact, slower I would say by looking at it, but then I wouldn't trust my timings, especially the ones I've just done in my noodle, lol, and I would say my C++ code is not so hot.

Michael

Re: Setting a register to zero if it's < zero
Post by Michael Hutton on Oct 9^th, 2011, 3:19pm

Quote:

movzx ecx, al

I knew that, but I must admit to thinking that it would just extend al into cx, not into ecx, and was worried about high-word garbage.

Oh well.

Michael

Re: Setting a register to zero if it's < zero
Post by David Williams on Oct 9^th, 2011, 3:27pm

on Oct 9^th, 2011, 3:14pm, Michael Hutton wrote:

Its asm output for the ColourDrain routine ...

Erm... you know it's occurred to me that "ColourDrain" (although I named it) is the kind of name that Fisher-Price might call it if they had written that routine. I'm thinking of calling it "DesaturateColour"! Isn't that more sensible?

Just on the off-chance that anyone wants to see the code (it's in the form of an external GFXLIB module), here it is:

Code:

      DEF PROCInitModule
      PROCInitModule( 0 )
      ENDPROC
      
      DEF PROCInitModule( V% )
      LOCAL codeSize%, I%, L%, P%, _{}, M$
      
      M$ = "ColourDrain"
      
      GFXLIB_CoreCode% += 0
      IF GFXLIB_CoreCode% = 0 THEN ERROR 0, "The GFXLIB core library appears not to have been installed and initialised. The core library must be installed and initialised before attempting to install any external GFXLIB modules."
      
      codeSize% = 170
      DIM GFXLIB_ColourDrain% codeSize%-1, L% -1
      DIM _{fgtzero%, flt2p20%, loop%}
      
      IF V% THEN
        PRINT '" Assembling GFXLIB module " + CHR$34 + M$ + CHR$34 + "..."
      ENDIF
      
      FOR I% = 8 TO 10 STEP 2
        
        P% = GFXLIB_ColourDrain%
        
        [OPT I%
        
        ; REM. SYS GFXLIB_ColourDrain%, pBitmap%, numPixels%, f%
        
        ;
        ; Parameters -- pBitmap%, numPixels%, f%
        ;
        ;               pBitmap% - points to base address of 32-bpp ARGB bitmap
        ;               numPixels% - number of pixels in bitmap
        ;
        ;               f% (''colour-drain'' factor) is 12.20 fixed-point integer; range (0.0 to 1.0)*2^20  (Note 2^20 = &100000)
        ;
        ;               f% is clamped (by this routine) to the range 0 to 2^20 (&100000)
        ;
        
        pushad
        
        ; ESP!36 = pBitmap%
        ; ESP!40 = numPixels%
        ; ESP!44 = f% (= f * 2^20)
        
        sub esp, 4                               ; allocate space for one local variable
        
        ; And now...
        ;
        ; ESP!40 = pBitmap%         
        ; ESP!44 = numPixels%       
        ; ESP!48 = f% (= f * 2^20)  
        
        mov esi, [esp + 40]                      ; ESI = pBitmap%
        
        ; calc. address of final pixel
        
        mov ebp, [esp + 44]                      ; numPixels%
        sub ebp, 1                               ; numPixels% - 1 (because pixel index starts at zero)
        shl ebp, 2                               ; 4 * (numPixels% - 1) (because 4 bytes per pixel)
        add ebp, esi                             ; = addr of final pixel
        mov [esp], ebp                           ; ESP!0 = addr of final pixel
        
        mov edi, [esp + 48]                      ; EDI = f%
        
        ;REM. if f% < 0 then f% = 0
        cmp edi, 0                               ; f% < 0 ?
        jge _.fgtzero%
        xor edi, edi                             ; f% = 0
        ._.fgtzero%
        
        ;REM. if f% > 2^20 (&100000) then f% = 2^20
        cmp edi, 2^20                            ; f% > 2^20 ?
        jle _.flt2p20%
        mov edi, 2^20                            ; f% = 2^20
        ._.flt2p20%
        
        ._.loop%
        
        movzx ecx, BYTE [esi]                    ; ECX (cl) = blue byte (b&)
        movzx ebx, BYTE [esi + 1]                ; EBX (bl) = green byte (g&)
        movzx eax, BYTE [esi + 2]                ; EAX (al) = red byte (r&)
        
        xor ebp, ebp                             ; EBP = cumulative intensity (i) - initially zero
        
        ;REM. i += (0.114 * 2^20) * b&
        mov edx, ecx
        imul edx, (0.114 * 2^20)
        add ebp, edx
        
        ;REM. i += (0.587 * 2^20) * g&
        mov edx, ebx
        imul edx, (0.587 * 2^20)
        add ebp, edx
        
        ;REM. i += (0.299 * 2^20) * r&
        mov edx, eax
        imul edx, (0.299 * 2^20)
        add ebp, edx
        
        shr ebp, 20                              ; EBP (i&) now in the range 0 to 255
        
        ;REM. b`& = b& + (((i& - b&)*f%) >> 20)
        mov edx, ebp                             ; copy EBP (i&)
        sub edx, ecx                             ; i& - b&
        imul edx, edi                            ; (i& - b&)*f%
        shr edx, 20                              ; ((i& - b&)*f%) >> 20
        add BYTE [esi], dl                       ; write blue byte
        
        ;REM. g`& = g& + (((i& - g&)*f%) >> 20)
        mov edx, ebp                             ; copy EBP (i&)
        sub edx, ebx                             ; i& - g&
        imul edx, edi                            ; (i& - g&)*f%
        shr edx, 20                              ; ((i& - g&)*f%) >> 20
        add BYTE [esi + 1], dl                   ; write green byte
        
        ;REM. r`& = r& + (((i& - r&)*f%) >> 20)
        mov edx, ebp                             ; copy EBP (i&)
        sub edx, eax                             ; i& - r&
        imul edx, edi                            ; (i& - r&)*f%
        shr edx, 20                              ; ((i& - r&)*f%) >> 20
        add BYTE [esi + 2], dl                   ; write red byte
        
        add esi, 4                               ; next pixel address
        cmp esi, [esp]
        jle _.loop%
        
        add esp, 4                               ; free local variable space
        
        popad
        ret 12
        
        ]
        
      NEXT I%
      
      IF V% THEN
        PRINT " Assembled code size = "; (P% - GFXLIB_ColourDrain%);" bytes"
        WAIT V%
      ENDIF
      
      ENDPROC

(Edited some mistakes in the assembler code comments)
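
For anyone who wants to check the arithmetic without assembling the module, the per-pixel calculation can be sketched in Python. This is only a reference model under stated assumptions: the names `clamp_f` and `desaturate_pixel` are hypothetical, the integer weights are 0.114, 0.587 and 0.299 times 2^20 truncated (I am assuming the assembler truncates the `imul` immediates rather than rounding them), and Python's arithmetic shift stands in for the `shr` plus byte-wide `add` used in the routine, which should give the same low 8 bits.

```python
# Reference model of the per-pixel desaturation (a sketch, not the GFXLIB code).
W_B, W_G, W_R = 119537, 615514, 313524   # 0.114, 0.587, 0.299 scaled by 2^20, truncated (assumed)
ONE = 1 << 20                            # 1.0 in 12.20 fixed point


def clamp_f(f):
    """Clamp the factor to the range 0 .. 2^20, as the routine does."""
    return max(0, min(f, ONE))


def desaturate_pixel(argb, f):
    """Hypothetical helper: blend one ARGB32 pixel towards its luminance."""
    f = clamp_f(f)
    b = argb & 0xFF
    g = (argb >> 8) & 0xFF
    r = (argb >> 16) & 0xFF
    i = (W_B * b + W_G * g + W_R * r) >> 20   # luminance, 0 .. 255

    def blend(c):
        # c + (((i - c) * f) >> 20), kept to one byte like the byte-wide ADD
        return (c + (((i - c) * f) >> 20)) & 0xFF

    # The top (alpha) byte is passed through untouched: the routine only
    # writes the blue, green and red bytes.
    return (argb & 0xFF000000) | (blend(r) << 16) | (blend(g) << 8) | blend(b)
```

With f = 0 the pixel comes back unchanged, and with f = 2^20 all three colour bytes become the luminance value.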

Rgs,

David.

Re: Setting a register to zero if it's < zero
Post by admin on Oct 9^th, 2011, 3:28pm

on Oct 9^th, 2011, 3:14pm, Michael Hutton wrote:

In defence, I wasn't commenting on the timing of the BASIC loops, just the profiler report of the USR() line, but even that only has a resolution of 1 ms.

Yes, fair point, I shouldn't have included NEXT in my comment, but USR will still take much longer than the code you're trying to benchmark.

Quote:

I am not sure "completely meaningless" really does apply

I was more meaning in respect of the lack of any alignment code. Modern CPUs typically fetch code in chunks of 32 bytes at a time, and execution speed can be affected depending on the alignment of the code with respect to those chunks (for example a short routine may fit entirely in one chunk, or be split between two).

David's got the right idea, by adding code to align P% before each subroutine, but he's only aligning to 4 bytes (which I don't think is very significant in respect of code, although of course it can be for data) rather than the 32 bytes that is necessary to eliminate alignment as a factor.
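
The rounding itself is the usual add-then-mask idiom (the same shape as `P%=(P%+31) AND -32`). A minimal sketch, with the hypothetical name `align_up` and assuming the boundary is a power of two:

```python
def align_up(addr, boundary):
    """Round addr up to the next multiple of boundary (a power of two)."""
    return (addr + boundary - 1) & -boundary
```

An address such as &1004 is already 4-byte aligned, so aligning to 4 leaves it where it is, whereas aligning to 32 moves it up to &1020.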

And as I said before, even when attempting to eliminate all confounding factors I still found that the code which ran the faster was always the routine which was tested second, and that applied both to my old P4 and an AMD Athlon 64.

So I think my 'meaningless' comment was justified, given the number of factors you didn't consider.

Richard.

Re: Setting a register to zero if it's < zero
Post by admin on Oct 9^th, 2011, 3:36pm

on Oct 9^th, 2011, 3:09pm, David Williams wrote:

Okay, with Richard's correction, the fastest non-MMX/non-SIMD code yet is indeed:
Code:

        mov eax, [edi + 4 * esi]
        movzx ebx, ah
        movzx ecx, al
        shr eax,16

I assume you appreciate that this code doesn't zero AH, so if your original pixel was indeed xRGB (rather than 0RGB) then eax will end up containing 'xR' not just 'R'.
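
To make that concrete, here is a small Python model of the two ways of unpacking a pixel. It is only a sketch: the function names are hypothetical and each function simply mimics what the registers would hold afterwards.

```python
def unpack_mov_shr(pixel):
    """Model of: mov eax,[...] / movzx ebx,ah / movzx ecx,al / shr eax,16."""
    eax = pixel & 0xFFFFFFFF
    ebx = (eax >> 8) & 0xFF    # movzx ebx, ah  -> green
    ecx = eax & 0xFF           # movzx ecx, al  -> blue
    eax >>= 16                 # shr eax, 16    -> 'xR': the top byte is still there
    return eax, ebx, ecx


def unpack_three_movzx(pixel):
    """Model of three MOVZX byte loads: every register is zero-extended."""
    return (pixel >> 16) & 0xFF, (pixel >> 8) & 0xFF, pixel & 0xFF
```

For a pixel of &80112233 the first version leaves &8011 in eax, while the second gives &11. One extra instruction after the shift (for example `and eax, 255` or `movzx eax, al`) would be one way to clear the stray byte.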

On my P4, Test#1 (three MOVZX instructions) is the fastest. That doesn't surprise me, because the first MOVZX will load all 4 bytes into the L1 cache so the remaining two can execute as quickly as if they were loading from registers. It ends up faster because of there being no need to do the SHR (and AH is zeroed for free):

Code:

 Test #1 (three MOVZX instructions) took 2.719 s.
 Test #2 (one MOV instruction) took 3.953 s.
 Test #3 (one MOV instruction; other instructions re-ordered) took 3.828 s.
 Test #4 (one MOV instruction; Michael's solution (no XORs)) took 2.953 s.
 Test #5 (one MOV instruction; Michael's solution (with XORs)) took 3.266 s.
 Test #6 (one MOV instruction; Richard's solution) took 2.812 s.

Richard.

Re: Setting a register to zero if it's < zero
Post by David Williams on Oct 9^th, 2011, 4:14pm

on Oct 9^th, 2011, 3:36pm, Richard Russell wrote:

I assume you appreciate that this code doesn't zero AH, so if your original pixel was indeed xRGB (rather than 0RGB) then eax will end up containing 'xR' not just 'R'.

I'd like to make the unsafe assumption that the source ARGB32 bitmap will always have all the MSB bytes clear, but that probably won't be the case.
Having to clear that byte seems to spoil the beauty of the code somewhat! And slows it down, of course. It'll have to be done, though.

on Oct 9^th, 2011, 3:36pm, Richard Russell wrote:

On my P4, Test#1 (three MOVZX instructions) is the fastest.

(Sigh.) Well, that just goes to show, doesn't it!

on Oct 9^th, 2011, 3:36pm, Richard Russell wrote:

That doesn't surprise me, because the first MOVZX will load all 4 bytes into the L1 cache so the remaining two can execute as quickly as if they were loading from registers. It ends up faster because of there being no need to do the SHR (and AH is zeroed for free):

In light of the 'new' code (getting those R, G, B values into registers), I was going to set about modifying a dozen-or-so GFXLIB routines to rid them of those MOVZX byte-loading instructions. Now it looks like I ought not to bother!

on Oct 9^th, 2011, 3:36pm, Richard Russell wrote:

Code:

 Test #1 (three MOVZX instructions) took 2.719 s.
 Test #2 (one MOV instruction) took 3.953 s.
 Test #3 (one MOV instruction; other instructions re-ordered) took 3.828 s.
 Test #4 (one MOV instruction; Michael's solution (no XORs)) took 2.953 s.
 Test #5 (one MOV instruction; Michael's solution (with XORs)) took 3.266 s.
 Test #6 (one MOV instruction; Richard's solution) took 2.812 s.

Richard.

Assuming that those FOR...NEXT loops still had 5000 iterations each (as in the original), I'm curious that your presumably ancient P4 has trounced my 1.83 GHz Intel Centrino Duo-based laptop (in this test). Even if your clock speed was 3 GHz, it's still interesting.

David.

Re: Setting a register to zero if it's < zero
Post by David Williams on Oct 9^th, 2011, 4:20pm

on Oct 9^th, 2011, 3:14pm, Michael Hutton wrote:

Its asm output for the ColourDrain routine is very similar to what you have already coded, David. In fact, slower I would say by looking at it, but then I wouldn't trust my timings, especially the ones I've just done in my noodle, lol, and I would say my C++ code is not so hot.

I'm curious. May I see that code?

I didn't think I could ever beat a modern C++ compiler in terms of code efficiency!

Rgs,

David.

Re: Setting a register to zero if it's < zero
Post by Michael Hutton on Oct 9^th, 2011, 5:13pm

I have to go out now but will post the relevant code for you later..

It is very interesting about the three movzx's being faster in tests. It is good to see these things thrashed out in a thread, to confirm or deny your prejudices/inclinations.

Michael

Re: Setting a register to zero if it's < zero
Post by admin on Oct 9^th, 2011, 5:32pm

on Oct 9^th, 2011, 4:14pm, David Williams wrote:

Assuming that those FOR...NEXT loops still had 5000 iterations each (as in the original)

Yes, the code was unchanged except that I modified it to align each routine on a 32-byte boundary (rather than 4). My CPU is a 2.8 GHz Pentium 4.

Here is a comparison of MMX and GP register versions of your 'luminance matrix' code:

Code:

      MODE 8 : OFF
      
      HIMEM = LOMEM + 2*&100000
      
      DIM gap1% 4096
      DIM bitmap% 4*(640*512 + 2)
      bitmap% = (bitmap% + 7) AND -8
      DIM  gap2% 4096
      
      REM. These 4 Kb gaps are probably way OTT, but just to be certain!
      
      PROC_asm
      
      REM. Fill bitmap with random values
      FOR I% = bitmap% TO (bitmap% + 4*640*512)-1 STEP 4
        !I% = RND
      NEXT
      
      G% = FNSYS_NameToAddress( "GetTickCount" )
      time0% = 0
      time1% = 0
      
      PRINT '" Conducting test #1, please wait..."'
      
      REM. Test #1 (DW's code)
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      SYS G% TO time0%
      FOR I% = 1 TO 5000
        C%=USRA%
      NEXT
      SYS G% TO time1%
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      PRINT '" Test #1 (DW's code) took ";
      PRINT ;(time1% - time0%)/1000; " s. (final result ";C% ")"'
      
      PRINT '" Conducting test #2, please wait..."'
      
      REM. Test #2 (RTR's code)
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      SYS G% TO time0%
      FOR I% = 1 TO 5000
        C%=USRB%
      NEXT
      SYS G% TO time1%
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      PRINT '" Test #2 (RTR's code) took ";
      PRINT ;(time1% - time0%)/1000; " s. (final result ";C% ")"'
      
      PRINT''" Finished."
      END
      ;
      DEF PROC_asm
      LOCAL I%, P%, code%, loop_A, loop_B, loop_C
      DIM code% 1000
      
      FOR I% = 0 TO 2 STEP 2
        P% = code%
        [OPT I%
        
        ] : P%=(P%+31) AND -32 : [OPT I%
        
        .A%
        mov esi, bitmap%
        .loop_A
        movzx ecx, BYTE [esi]                    ; ECX (cl) = blue byte (b&)
        movzx ebx, BYTE [esi + 1]                ; EBX (bl) = green byte (g&)
        movzx eax, BYTE [esi + 2]                ; EAX (al) = red byte (r&)
        
        xor ebp, ebp                             ; EBP = cumulative intensity (i) - initially zero
        
        ;REM. i += (0.114 * 2^20) * b&
        mov edx, ecx
        imul edx, (0.114 * 2^20)
        add ebp, edx
        
        ;REM. i += (0.587 * 2^20) * g&
        mov edx, ebx
        imul edx, (0.587 * 2^20)
        add ebp, edx
        
        ;REM. i += (0.299 * 2^20) * r&
        mov edx, eax
        imul edx, (0.299 * 2^20)
        add ebp, edx
        
        shr ebp, 20                              ; EBP (i&) now in the range 0 to 255
        
        add esi, 4
        cmp esi, (bitmap% + 640 * 512)
        jl loop_A
        mov eax,ebp
        ret
        
        ] : P%=(P%+31) AND -32 : [OPT I%
        
        .matrix
        dw 0.114 * 2^15
        dw 0.587 * 2^15
        dw 0.299 * 2^15
        dw 0
        
        .B%
        mov esi, bitmap%
        movq mm7, [matrix]
        
        .loop_B
        punpcklbw mm0,[esi]
        psrlw mm0,8
        pmaddwd mm0,mm7
        movd ebp,mm0
        pshufw mm0,mm0,%01001110
        movd eax,mm0
        add ebp,eax
        shr ebp,15
        
        add esi, 4
        cmp esi, (bitmap% + 640 * 512)
        jl loop_B
        mov eax,ebp
        ret
        
        ]
        
      NEXT I%
      ENDPROC
      ;
      DEF FNSYS_NameToAddress( f$ )
      LOCAL P%
      DIM P% LOCAL 5
      [OPT 0 : call f$ : ]
      =P%!-4+P%

On my PCs the MMX version doesn't run much, if any, faster but the potential gain comes from being able to use the other MMX registers to process (say) four pixels 'in parallel'.
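
The two versions use different fixed-point scales: the GP register code multiplies by weights scaled by 2^20, while the MMX code uses 2^15 because `pmaddwd` works on signed 16-bit words, so each weight has to stay below 32768. The following Python sketch compares the two scales. The names `weights` and `luminance` are hypothetical, and it assumes the assembler truncates the scaled constants rather than rounding them.

```python
def weights(scale_bits):
    """Blue, green, red luminance weights scaled by 2^scale_bits (truncated, assumed)."""
    return tuple(int(w * (1 << scale_bits)) for w in (0.114, 0.587, 0.299))


def luminance(b, g, r, scale_bits):
    """Fixed-point luminance of one pixel, back in the range 0 .. 255."""
    wb, wg, wr = weights(scale_bits)
    return (wb * b + wg * g + wr * r) >> scale_bits
```

Because the truncated weights sum to slightly less than 1.0 at either scale, pure white comes out as 254 rather than 255 in this model.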

Richard.

Re: Setting a register to zero if it's < zero
Post by Michael Hutton on Oct 9^th, 2011, 8:28pm

Code:

.loop_B
        punpcklbw mm0,[esi]
        psrlw mm0,8
        pmaddwd mm0,mm7
        movd ebp,mm0
        pshufw mm0,mm0,%01001110
        movd eax,mm0
        add ebp,eax
        shr ebp,15

Very nice!

Michael

Re: Setting a register to zero if it's < zero
Post by David Williams on Oct 9^th, 2011, 9:08pm

on Oct 9^th, 2011, 5:32pm, Richard Russell wrote:

Here is a comparison of MMX and GP register versions of your 'luminance matrix' code:

Thanks for the code (and apologies for referring to the luminance as "the intensity"!).

The MMX version is slightly slower on my laptop:

GR version: 3.67 s.
MMX version: 4.19 s

I know that when it comes to the 'blending' (that's my word for it!) part, MMX is likely to be faster because, IIRC, the MMX-powered alphablending code you kindly contributed to GFXLIB is about twice as fast as the non-MMX version.

Anyway, this has been an interesting discussion (it's certainly occupied much of my day). Things have been learnt, and I'm beginning to wonder if time is better spent not worrying too much about optimising code...because seemingly optimised code on one system is not so on another. (I did know this, but today's discussion has highlighted the fact).

Now I must get on with preparing version 2.03 of GFXLIB for release, along with the GFXLIB FAQ.

After Christmas, I may have another crack at producing a GFXLIB-Lite for the Trial version of BB4W, having taken on board the points/ideas raised by you and Michael (in private correspondence).

Rgs,
David.

Re: Setting a register to zero if it's < zero
Post by Michael Hutton on Oct 9^th, 2011, 9:15pm

David,

Here is the assembler output for the following routine in C++. I'm not really sure it adds anything or that we should take anything from it. I don't even know if I have put in the right optimising flags etc.

As an aside, the output bitmap does suffer from some aliasing with this routine.

It might also depend heavily on how I access the bitmap: pData is defined as DWORD*, whereas now that we know three MOVZX byte loads are faster, using unsigned char* could be better.
Code:

void SetRGBs(DWORD* pData, int num, int f)
{
	// Some set up
	unsigned int r,g,b,i,rgb;
	
	// Clamp f
	if (f < 0) {f = 0;} 
	if (f > 0x100000) {f = 0x100000;}	
	for (int x = 0; x < 256; x ++){	
		for (int y = 0; y < 192; y++){
		// get the rgb data	
			rgb = pData[(y * 256) + x];	
			r = (rgb >> 16) & 0xFF;   
			g = (rgb >> 8) & 0xFF;
		    b = rgb & 0xFF;
		// calculate cumulative intensity
			i  = r * 313524;
			i += g * 615514;
			i += b * 119537;
		// Refactor to 0-255
			i = i >> 20;
		// Calculate the new values
			r = r + (((i-r)*f) >> 20);
			g = g + (((i-g)*f) >> 20);
			b = b + (((i-b)*f) >> 20);
		// Write Pixel Back
			pData[(y * 256) + x] = (r << 16) | (g <<8) | b;
		}
	}
}
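
A tentative observation on the C++ above: because r, g, b and i are unsigned, (i-r)*f wraps round whenever i is less than r, and the logical shift then leaves bits set above bit 7. The low byte is still the intended value, but nothing masks it before the channels are packed, so the extra bits can spill into the neighbouring channel. That might account for some of the artefacts mentioned, though I have not verified it on the actual image. The Python sketch below models the 32-bit unsigned arithmetic; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def blend_unsigned(c, i, f):
    """c + (((i - c) * f) >> 20) evaluated in 32-bit unsigned arithmetic."""
    diff = (i - c) & MASK32                  # wraps round when i < c
    return (c + (((diff * f) & MASK32) >> 20)) & MASK32


def blend_masked(c, i, f):
    """The same calculation with the result masked to one byte before packing."""
    return blend_unsigned(c, i, f) & 0xFF
```

For example, with c = 200, i = 100 and f = 0.5 (2^19) the unmasked result is &1096 rather than &96, so shifting it into place would set a bit belonging to the next channel up. Masking each channel with 0xFF before packing would be one possible way round this.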

and the asm code is - take a deep breath!
Code:

_TEXT	SEGMENT
_y$33985 = -80						; size = 4
_x$33981 = -68						; size = 4
_rgb$ = -56						; size = 4
_i$ = -44						; size = 4
_b$ = -32						; size = 4
_g$ = -20						; size = 4
_r$ = -8						; size = 4
_pData$ = 8						; size = 4
_num$ = 12						; size = 4
_f$ = 16						; size = 4
?SetRGBs@@YAXPAKHH@Z PROC				; SetRGBs, COMDAT

; 302  : {

	push	ebp
	mov	ebp, esp
	sub	esp, 276				; 00000114H
	push	ebx
	push	esi
	push	edi
	lea	edi, DWORD PTR [ebp-276]
	mov	ecx, 69					; 00000045H
	mov	eax, -858993460				; ccccccccH
	rep stosd

; 303  : 	// Some set up
; 304  : 	unsigned int r,g,b,i,rgb;
; 305  : 	
; 306  : 	// Clamp f
; 307  : 	if (f < 0) {f = 0;} 

	cmp	DWORD PTR _f$[ebp], 0
	jge	SHORT $LN8@SetRGBs
	mov	DWORD PTR _f$[ebp], 0
$LN8@SetRGBs:

; 308  : 	if (f > 0x100000) {f = 0x100000;}

	cmp	DWORD PTR _f$[ebp], 1048576		; 00100000H
	jle	SHORT $LN7@SetRGBs
	mov	DWORD PTR _f$[ebp], 1048576		; 00100000H
$LN7@SetRGBs:

; 309  : 
; 310  : 	
; 311  : 	for (int x = 0; x < 256; x ++){	

	mov	DWORD PTR _x$33981[ebp], 0
	jmp	SHORT $LN6@SetRGBs
$LN5@SetRGBs:
	mov	eax, DWORD PTR _x$33981[ebp]
	add	eax, 1
	mov	DWORD PTR _x$33981[ebp], eax
$LN6@SetRGBs:
	cmp	DWORD PTR _x$33981[ebp], 256		; 00000100H
	jge	$LN9@SetRGBs

; 312  : 		for (int y = 0; y < 192; y++){

	mov	DWORD PTR _y$33985[ebp], 0
	jmp	SHORT $LN3@SetRGBs
$LN2@SetRGBs:
	mov	eax, DWORD PTR _y$33985[ebp]
	add	eax, 1
	mov	DWORD PTR _y$33985[ebp], eax
$LN3@SetRGBs:
	cmp	DWORD PTR _y$33985[ebp], 192		; 000000c0H
	jge	$LN1@SetRGBs

; 313  : 		// get the rgb data	
; 314  : 			rgb = pData[(y * 256) + x];	

	mov	eax, DWORD PTR _y$33985[ebp]
	shl	eax, 8
	add	eax, DWORD PTR _x$33981[ebp]
	mov	ecx, DWORD PTR _pData$[ebp]
	mov	edx, DWORD PTR [ecx+eax*4]
	mov	DWORD PTR _rgb$[ebp], edx

; 315  : 			r = (rgb >> 16) & 0xFF;   

	mov	eax, DWORD PTR _rgb$[ebp]
	shr	eax, 16					; 00000010H
	and	eax, 255				; 000000ffH
	mov	DWORD PTR _r$[ebp], eax

; 316  : 			g = (rgb >> 8) & 0xFF;

	mov	eax, DWORD PTR _rgb$[ebp]
	shr	eax, 8
	and	eax, 255				; 000000ffH
	mov	DWORD PTR _g$[ebp], eax

; 317  : 		    b = rgb & 0xFF;

	mov	eax, DWORD PTR _rgb$[ebp]
	and	eax, 255				; 000000ffH
	mov	DWORD PTR _b$[ebp], eax

; 318  : 		// calculate cumulative intensity
; 319  : 			i  = r * 313524;

	mov	eax, DWORD PTR _r$[ebp]
	imul	eax, 313524				; 0004c8b4H
	mov	DWORD PTR _i$[ebp], eax

; 320  : 			i += g * 615514;

	mov	eax, DWORD PTR _g$[ebp]
	imul	eax, 615514				; 0009645aH
	add	eax, DWORD PTR _i$[ebp]
	mov	DWORD PTR _i$[ebp], eax

; 321  : 			i += b * 119537;

	mov	eax, DWORD PTR _b$[ebp]
	imul	eax, 119537				; 0001d2f1H
	add	eax, DWORD PTR _i$[ebp]
	mov	DWORD PTR _i$[ebp], eax

; 322  : 		// Refactor to 0-255
; 323  : 			i = i >> 20;

	mov	eax, DWORD PTR _i$[ebp]
	shr	eax, 20					; 00000014H
	mov	DWORD PTR _i$[ebp], eax

; 324  : 		// Calculate the new values
; 325  : 			r = r + (((i-r)*f) >> 20);

	mov	eax, DWORD PTR _i$[ebp]
	sub	eax, DWORD PTR _r$[ebp]
	imul	eax, DWORD PTR _f$[ebp]
	shr	eax, 20					; 00000014H
	add	eax, DWORD PTR _r$[ebp]
	mov	DWORD PTR _r$[ebp], eax

; 326  : 			g = g + (((i-g)*f) >> 20);

	mov	eax, DWORD PTR _i$[ebp]
	sub	eax, DWORD PTR _g$[ebp]
	imul	eax, DWORD PTR _f$[ebp]
	shr	eax, 20					; 00000014H
	add	eax, DWORD PTR _g$[ebp]
	mov	DWORD PTR _g$[ebp], eax

; 327  : 			b = b + (((i-b)*f) >> 20);

	mov	eax, DWORD PTR _i$[ebp]
	sub	eax, DWORD PTR _b$[ebp]
	imul	eax, DWORD PTR _f$[ebp]
	shr	eax, 20					; 00000014H
	add	eax, DWORD PTR _b$[ebp]
	mov	DWORD PTR _b$[ebp], eax

; 328  : 		// Write Pixel Back
; 329  : 			pData[(y * 256) + x] = (r << 16) | (g <<8) | b;

	mov	eax, DWORD PTR _r$[ebp]
	shl	eax, 16					; 00000010H
	mov	ecx, DWORD PTR _g$[ebp]
	shl	ecx, 8
	or	eax, ecx
	or	eax, DWORD PTR _b$[ebp]
	mov	edx, DWORD PTR _y$33985[ebp]
	shl	edx, 8
	add	edx, DWORD PTR _x$33981[ebp]
	mov	ecx, DWORD PTR _pData$[ebp]
	mov	DWORD PTR [ecx+edx*4], eax

; 330  : 		}

	jmp	$LN2@SetRGBs
$LN1@SetRGBs:

; 331  : 	}

	jmp	$LN5@SetRGBs
$LN9@SetRGBs:

; 332  : }

	pop	edi
	pop	esi
	pop	ebx
	mov	esp, ebp
	pop	ebp
	ret	0
?SetRGBs@@YAXPAKHH@Z ENDP				; SetRGBs
_TEXT	ENDS

Re: Setting a register to zero if it's < zero
Post by David Williams on Oct 9^th, 2011, 10:17pm

Michael: Gracias for the C++ output asm code. I'll study it keenly.

For those who've been following the discussion, here's a quick demo of GFXLIB_ColourDesaturate (compiled EXE):

http://www.bb4wgames.com/misc/realtimecolourdesaturation.zip

It struggles to maintain the full 60 fps framerate on my laptop (typically ~50 fps) which disappoints me, but then the routine isn't intended for realtime desaturation of 640x480 images in a gaming situation!

I reckon it's fast enough.

Regards,
David.

Source (but don't try to run it!):

Code:

      HIMEM = LOMEM + 5*&100000
      HIMEM = (HIMEM + 3) AND -4
      
      PROCfixWindowSize
      
      ON ERROR PROCerror( REPORT$, TRUE )
      
      WinW% = 640
      WinH% = 480
      VDU 23, 22, WinW%; WinH%; 8, 16, 16, 0 : OFF
      
      INSTALL @lib$ + "GFXLIB2"
      PROCInitGFXLIB( d{}, 0 )
      
      INSTALL @lib$ + "GFXLIB_modules\ColourDrain"
      PROCInitModule
      
      INSTALL @lib$ + "GFXLIB_modules\PlotShapeHalfIntensity"
      PROCInitModule
      
      GetTickCount% = FNSYS_NameToAddress( "GetTickCount" )
      SetWindowText% = FNSYS_NameToAddress( "SetWindowText" )
      
      flowers% = FNLoadImg( @dir$ + "flowers_640x480.JPG", 0 )
      
      flowers_copy% = FNmalloc( 4 * 640*480 )
      
      ball% = FNLoadImg( @lib$ + "GFXLIB_media\ball1_64x64x8.BMP", 0 )
      
      numFrames% = 0
      
      *REFRESH OFF
      
      SYS GetTickCount% TO time0%
      REPEAT
        
        T% = TIME
        
        f = 0.5 * (1.0 + SIN(T%/100))
        SYS GFXLIB_DWORDCopy%, flowers%, flowers_copy%, 640*480
        SYS GFXLIB_ColourDrain%, flowers_copy%, 640*480, f*&100000
        SYS GFXLIB_BPlot%, d{}, flowers_copy%, 640, 480, 0, 0
        
        FOR I% = 0 TO 11
          X% = (320 - 32) + 220*SINRAD(I%*(360/12) + T%/2)
          Y% = (240 - 32) + 220*COSRAD(I%*(360/12) + T%/2)
          SYS GFXLIB_PlotShapeHalfIntensity%, d{}, ball%, 64, 64, X%-15, Y%-16
          SYS GFXLIB_Plot%, d{}, ball%, 64, 64, X%, Y%
        NEXT I%
        
        PROCdisplay
        
        SYS GetTickCount% TO time1%
        IF (time1% - time0%) >= 1000 THEN
          SYS SetWindowText%, @hwnd%, "Frame rate: " + STR$numFrames% + " fps"
          numFrames% = 0
          SYS GetTickCount% TO time0%
        ELSE
          numFrames% += 1
        ENDIF
        
      UNTIL FALSE
      END
      :
      :
      :
      :
      DEF PROCfixWindowSize
      LOCAL GWL_STYLE, WS_THICKFRAME, WS_MAXIMIZEBOX, ws%
      GWL_STYLE = -16
      WS_THICKFRAME = &40000
      WS_MAXIMIZEBOX = &10000
      SYS "GetWindowLong", @hwnd%, GWL_STYLE TO ws%
      SYS "SetWindowLong", @hwnd%, GWL_STYLE, ws% AND NOT (WS_THICKFRAME+WS_MAXIMIZEBOX)
      ENDPROC
      :
      :
      :
      :
      DEF PROCerror( msg$, L% )
      OSCLI "REFRESH ON" : ON
      COLOUR 1, &FF, &FF, &FF
      COLOUR 1
      PRINT TAB(1,1)msg$;
      IF L% THEN
        PRINT " at line "; ERL;
      ENDIF
      VDU 7
      REPEAT UNTIL INKEY(1)=0
      ENDPROC

Re: Setting a register to zero if it's < zero
Post by admin on Oct 9^th, 2011, 10:23pm

on Oct 9^th, 2011, 3:27pm, David Williams wrote:

I'm thinking of calling it "DesaturateColour"! Isn't that more sensible?

Yes!

If you're looking for ways to tidy up your code, please note that this:

Code:

      sub ebp, 1
      shl ebp, 2
      add ebp, esi

can be replaced by this (just 4 bytes):

Code:

      lea ebp,[esi+ebp*4-4]

There's no significant speed impact, because it's not in a loop, but in terms of elegance there's no contest!
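
The equivalence is easy to check with a quick model of the 32-bit arithmetic. This Python sketch uses hypothetical function names and simply mirrors the two instruction sequences:

```python
MASK32 = 0xFFFFFFFF


def last_pixel_three_instructions(esi, ebp):
    ebp = (ebp - 1) & MASK32       # sub ebp, 1
    ebp = (ebp << 2) & MASK32      # shl ebp, 2
    return (ebp + esi) & MASK32    # add ebp, esi


def last_pixel_lea(esi, ebp):
    return (esi + ebp * 4 - 4) & MASK32   # lea ebp, [esi + ebp*4 - 4]
```

Both give the address of the final pixel, for instance base + 4*640*480 - 4 for a 640x480 bitmap. One difference worth remembering is that LEA leaves the flags alone, whereas SUB, SHL and ADD all modify them.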

Richard.

Re: Setting a register to zero if it's < zero
Post by David Williams on Oct 9^th, 2011, 10:28pm

on Oct 9^th, 2011, 10:23pm, Richard Russell wrote:

Code:

lea ebp,[esi+ebp*4-4]

There's no significant speed impact, because it's not in a loop, but in terms of elegance there's no contest!

There was no excuse for me to miss that one, really, especially since I had read this article not long ago:

http://bb4w.wikispaces.com/Using+the+lea+instruction

Thanks again.

Re: Setting a register to zero if it's < zero
Post by admin on Oct 10^th, 2011, 01:10am

on Oct 9^th, 2011, 10:17pm, David Williams wrote:

For those who've been following the discussion, here's a quick demo of GFXLIB_ColourDesaturate (compiled EXE)

Here's an MMX version of GFXLIB_ColourDrain:

Code:

        ; REM. SYS GFXLIB_ColourDrain%, pBitmap%, numPixels%, f%
        
        ;
        ; Parameters -- pBitmap%, numPixels%, f%
        ;
        ;               pBitmap% - points to base address of 32-bpp ARGB bitmap
        ;               numPixels% - number of pixels in bitmap
        ;
        ;               f% (''colour-drain'' factor) is 12.20 fixed-point integer; range (0.0 to 1.0)*2^20  (Note 2^20 = &100000)
        ;
        ;               f% is clamped (by this routine) to the range 0 to 2^20-1 (&FFFFF)
        ;
        
        pushad
        
        ; ESP!36 = pBitmap%
        ; ESP!40 = numPixels%
        ; ESP!44 = f% (= f * 2^20)
        
        mov esi, [esp + 36]                      ; esi = pBitmap%
        
        mov ebp, [esp + 40]                      ; numPixels%
        lea ebp, [esi + ebp*4]
        
        mov edi, [esp + 44]                      ; edi = f%
        
        ;REM. if f% < 0 then f% = 0
        cmp edi, 0                               ; f% < 0 ?
        jge _.fgtzero%
        xor edi, edi                             ; f% = 0
        ._.fgtzero%
        
        ;REM. if f% >= 2^20 (&100000) then f% = 2^20-1
        cmp edi, 2^20                            ; f% >= 2^20 ?
        jl _.flt2p20%
        mov edi, 2^20-1                          ; f% = 2^20-1
        ._.flt2p20%
        
        shr edi, 5
        movd mm6, edi
        pshufw mm6, mm6, %11000000
        movq mm7, [_.matrix%]
        
        ._.loop%
        
        punpcklbw mm0,[esi]
        punpckhbw mm1,[esi]
        psrlw mm0,8
        psrlw mm1,8
        movq mm2,mm0
        movq mm3,mm1
        pmaddwd mm0,mm7
        pmaddwd mm1,mm7
        pshufw mm4,mm0,%01001110
        pshufw mm5,mm1,%01001110
        paddd mm4,mm0
        paddd mm5,mm1
        pslld mm4,1
        pslld mm5,1
        pshufw mm4,mm4,%01010101
        pshufw mm5,mm5,%01010101
        psubw mm4,mm2
        psubw mm5,mm3
        pmulhw mm4,mm6
        pmulhw mm5,mm6
        psllw mm4,1
        psllw mm5,1
        paddw mm4,mm2
        paddw mm5,mm3
        packuswb mm4,mm5
        movq [esi],mm4
        
        add esi, 8                               ; next pixel address
        cmp esi, ebp
        jb _.loop%
        
        popad
        emms
        ret 12
        
        ._.matrix%
        dw 0.114 * 2^15
        dw 0.587 * 2^15
        dw 0.299 * 2^15
        dw 0

I haven't compared its speed with yours, but I would expect it to be faster. As I'm by no means an MMX expert it may well be that it can be improved.
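
For anyone puzzling over the `shr edi, 5` and the `psllw` by 1: `pmulhw` returns the high 16 bits of a signed 16x16-bit product, so the factor is first reduced from 20 fractional bits to 15 (to fit a signed word) and the result is then doubled to make up the difference. Here is a rough Python model of one colour lane. The function names are hypothetical and this is my reading of the listing rather than a verified emulation.

```python
def pmulhw(a, b):
    """High 16 bits of the signed 16x16-bit product (one lane of PMULHW)."""
    return (a * b) >> 16             # Python's >> is arithmetic, like the signed high word


def blend_lane(c, lum, f):
    """One colour lane of the MMX blend; f is the clamped 12.20 factor."""
    f15 = f >> 5                     # shr edi, 5: 0 .. 32767, fits a signed word
    d = pmulhw(lum - c, f15) << 1    # psllw 1 makes up for scaling by 2^15, not 2^16
    return max(0, min(255, c + d))   # packuswb saturates to 0 .. 255
```

Because the offset is doubled it is always even, so in this model the result can differ by a unit or so from the GP register version; for example a red byte of 255 with luminance 76 and the maximum factor gives 75 here rather than 76.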

Richard.

Re: Setting a register to zero if it's < zero
Post by David Williams on Oct 10^th, 2011, 06:55am

on Oct 10^th, 2011, 01:10am, Richard Russell wrote:

Here's an MMX version of GFXLIB_ColourDrain:

Gratefully received. :)

Okay, I had to make one little correction because I discovered that only one (or two?) pixels were being processed in the image.

The MMX version (MMXDesaturateColour) is nearly twice as fast (on my Centrino Duo laptop) as the non-MMX (GR) version.

1000 full-image operations on a 640x480 ARGB32 bitmap took:

4.84 s. (MMX version)
9.22 s (GR version)

The test (compiled EXE) can be downloaded here:

www.bb4wgames.com/misc/mmxdesaturatecolour_vs_colourdrain.zip

As I mentioned yesterday, I'll be dropping the Fisher-Price routine name (ColourDrain) and calling it DesaturateColour.

I won't just grab your MMX code and learn nothing from it, of that you can be assured.

Thanks for the code.

David.

---

For the sake of completeness only, I'll list the source for the timed test here:

Code:

      HIMEM = LOMEM + 5*&100000
      HIMEM = (HIMEM + 3) AND -4
      
      PROCfixWindowSize
      
      ON ERROR PROCerror( REPORT$, TRUE )
      
      WinW% = 640
      WinH% = 480
      VDU 23, 22, WinW%; WinH%; 8, 16, 16, 0 : OFF
      
      INSTALL @lib$ + "GFXLIB2"
      PROCInitGFXLIB( d{}, 0 )
      
      INSTALL @lib$ + "GFXLIB_modules\ColourDrain"
      PROCInitModule
      
      INSTALL @lib$ + "GFXLIB_modules\MMXDesaturateColour"
      PROCInitModule
      
      GetTickCount% = FNSYS_NameToAddress( "GetTickCount" )
      
      flowers% = FNLoadImg( @dir$ + "flowers_640x480.JPG", 0 )
      flowers_copy% = FNmalloc( 4 * 640*480 )
      
      timeA_0% = 0
      timeA_1% = 0
      timeB_0% = 0
      timeB_1% = 0
      
      PRINT
      
      PRINT " Conducting timed tests (MMXDesaturateColour vs. ColourDrain)"'
      PRINT " (1000 colour desaturations of a 640x480 ARGB32 bitmap)"'
      
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      
      PRINT " Timing MMXDesaturateColour..."
      df = 0.01
      f = 0.0
      G% = GFXLIB_MMXDesaturateColour%
      SYS GetTickCount% TO timeA_0%
      FOR I% = 1 TO 1000
        SYS GFXLIB_DWORDCopy%, flowers%, flowers_copy%, 640*480
        SYS G%, flowers_copy%, 640*480, f*&100000
        f += df
        IF f >= 1.0 THEN f = 0.0
      NEXT I%
      SYS GetTickCount% TO timeA_1%
      
      PRINT " Timing ColourDrain..."
      df = 0.01
      f = 0.0
      G% = GFXLIB_ColourDrain%
      SYS GetTickCount% TO timeB_0%
      FOR I% = 1 TO 1000
        SYS GFXLIB_DWORDCopy%, flowers%, flowers_copy%, 640*480
        SYS G%, flowers_copy%, 640*480, f*&100000
        f += df
        IF f >= 1.0 THEN f = 0.0
      NEXT I%
      SYS GetTickCount% TO timeB_1%
      
      REM Restore normal priority (&20 = NORMAL_PRIORITY_CLASS)
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      REM GetTickCount counts in milliseconds; convert elapsed times to seconds
      timeA = (timeA_1% - timeA_0%) / 1000
      timeB = (timeB_1% - timeB_0%) / 1000
      
      SOUND OFF : SOUND 1, -10, 226, 1
      
      COLOUR 11 : ON
      PRINT '" Results" : PRINT " -------"'
      
      PRINT " MMXDesaturateColour took "; timeA; " s."'
      PRINT " ColourDrain took "; timeB; " s."''
      
      COLOUR 3 : PRINT " Finished!";
      REPEAT UNTIL INKEY(1)=0
      END
      :
      :
      :
      :
      DEF PROCfixWindowSize
      REM Clear the sizing-border and maximise-box styles so the window cannot be resized
      LOCAL GWL_STYLE, WS_THICKFRAME, WS_MAXIMIZEBOX, ws%
      GWL_STYLE = -16
      WS_THICKFRAME = &40000
      WS_MAXIMIZEBOX = &10000
      SYS "GetWindowLong", @hwnd%, GWL_STYLE TO ws%
      SYS "SetWindowLong", @hwnd%, GWL_STYLE, ws% AND NOT (WS_THICKFRAME+WS_MAXIMIZEBOX)
      ENDPROC
      :
      :
      :
      :
      DEF PROCerror( msg$, L% )
      OSCLI "REFRESH ON" : ON
      COLOUR 1, &FF, &FF, &FF
      COLOUR 1
      PRINT TAB(1,1)msg$;
      IF L% THEN
        PRINT " at line "; ERL;
      ENDIF
      VDU 7
      REPEAT UNTIL INKEY(1)=0
      ENDPROC

Re: Setting a register to zero if it's < zero
Post by admin on Oct 10th, 2011, 08:32am

on Oct 10th, 2011, 06:55am, David Williams wrote:

Okay, I had to make one little correction because I discovered that only one (or two?) pixels were being processed in the image.

Ah yes, was that the edi that should have been an esi? Oddly, it worked here despite the error.

There's another change you should really make. The code as listed affects all four bytes of the resulting pixel (including the most-significant 'alpha' byte). Presumably you would prefer it to leave that byte unchanged, in which case you should alter the third line here as shown:

Code:

        shr edi, 5
        movd mm6, edi
        pshufw mm6, mm6, %11000000

Quote:

As I mentioned yesterday, I'll be dropping the Fisher-Price routine name (ColourDrain) and calling it DesaturateColour

I know, but I only had the original version to work from. It seemed safer not to make any unnecessary changes.

Richard.

Re: Setting a register to zero if it's < zero
Post by David Williams on Oct 13th, 2011, 7:40pm

GFXLIB_MMXDesaturateColour & GFXLIB_BoxBlur3x3:

http://www.bb4wgames.com/misc/mmxdesaturatecolour_example2c.zip (EXE; 163 KB)

I can imagine using that kind of effect on the title page of some creepy RPG just before the game begins.

David.

======================================

Code:

      *ESC OFF
      
      REM Make 3 MB available for this program
      M%=3 : HIMEM = LOMEM + M%*&100000
      
      MODE 8 : OFF
      
      INSTALL @lib$ + "GFXLIB2" : PROCInitGFXLIB
      INSTALL @lib$ + "GFXLIB_modules\MMXDesaturateColour" : PROCInitModule
      INSTALL @lib$ + "GFXLIB_modules\BoxBlur3x3" : PROCInitModule
      
      bm% = FNLoadImg( @lib$ + "GFXLIB_media\bg1_640x512x8.bmp", 0 )
      
      *REFRESH OFF
      
      REPEAT
        
        REM Display the image normally for two seconds
        SYS GFXLIB_BPlot%, dispVars{}, bm%, 640, 512, 0, 0
        PROCdisplay
        WAIT 200
        
        FOR I% = 1 TO 280
          SYS GFXLIB_BoxBlur3x3%, dispVars.bmBuffAddr%, dispVars.bmBuffAddr%, 640, 512
          IF I% MOD 2 = 0 THEN SYS GFXLIB_MMXDesaturateColour%, dispVars.bmBuffAddr%, 640*512, 0.01*&100000
          PROCdisplay
        NEXT I%
        
      UNTIL FALSE