BBC BASIC for Windows
« Setting a register to zero if it's < zero »

Topic: Setting a register to zero if it's < zero (Read 1438 times)
David Williams
xx Re: Setting a register to zero if it's < zero
« Reply #5 on: Oct 9th, 2011, 05:03am »

on Oct 8th, 2011, 9:38pm, Michael Hutton wrote:
push max
push min
cmp eax, max
cmovg eax,[esp+4]
cmp eax, min
cmovl eax,[esp]
add esp,8


Looks very elegant to my eyes. And totally branchless.
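
For reference, the behaviour of the quoted CMOV sequence can be modelled in C. This is only a sketch: the function name `clamp_select` is hypothetical, and whether a compiler turns the two selections into CMOV instructions depends on the compiler and its settings.

```c
#include <assert.h>
#include <stdint.h>

/* C model of the quoted clamp: two compare-and-select steps, no explicit jumps.
   clamp_select is a hypothetical name, not something from the thread. */
int32_t clamp_select(int32_t value, int32_t lo, int32_t hi)
{
    assert(lo <= hi);
    value = (value > hi) ? hi : value;   /* cmp eax,max / cmovg eax,[esp+4] */
    value = (value < lo) ? lo : value;   /* cmp eax,min / cmovl eax,[esp]   */
    return value;
}
```

With `lo = 0` and `hi = 255`, an input of 300 comes back as 255, an input of -7 as 0, and anything already in range is returned unchanged.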

I think that with the help of your solution, my bitmap colour saturation modifying routine will be fast enough to render dozens of colour-saturation-(de-)enhanced sprites in realtime, offering yet another creative option for the growing er.. army of BB4W game-makers.

Thanks again, Michael.


Rgs,
David.

admin
xx Re: Setting a register to zero if it's < zero
« Reply #6 on: Oct 9th, 2011, 09:46am »

on Oct 9th, 2011, 05:03am, David Williams wrote:
Looks very elegant to my eyes. And totally branchless.

The downside, of course, is that (if we're talking about BB4W here) it means using the ASMLIB library, with all that implies in respect of disabling crunching - for example needing to put your source in a separate file which you assemble using CALL.

I assume your application is somewhat untypical, in that clipping is both common and unpredictable. In the more typical case (digital filtering is a classic example) clipping is the exception: the majority of values will be 'in range'. In that case the naive approach using conditional jumps will outperform the CMOV code on any modern CPU.

Incidentally, your original question was how to clip off negative values, i.e. clipping to zero. That's simpler than the general case of course, for example this is one branchless solution (if that's definitely what you want):

Code:
      cmp eax,&80000000
      sbb ebx,ebx
      and eax,ebx 
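
To see why those three instructions work, here is a C model of each step (a sketch only; the function names are hypothetical). CMP against &80000000 sets the carry flag exactly when EAX is non-negative, SBB turns that flag into an all-ones or all-zeros mask, and AND applies the mask.

```c
#include <assert.h>
#include <stdint.h>

/* C model of: cmp eax,&80000000 / sbb ebx,ebx / and eax,ebx */
int32_t clip_negative(int32_t value)
{
    uint32_t eax = (uint32_t)value;
    /* cmp: carry is set when eax < &80000000 unsigned, i.e. value >= 0 */
    uint32_t carry = eax < 0x80000000u;
    /* sbb ebx,ebx: ebx = ebx - ebx - carry, so all ones if carry, else zero */
    uint32_t ebx = 0u - carry;
    /* and eax,ebx: keep the value, or force it to zero */
    return (int32_t)(eax & ebx);
}

/* A few checks that can be called from a test program. */
void clip_negative_examples(void)
{
    assert(clip_negative(-1) == 0);
    assert(clip_negative(INT32_MIN) == 0);
    assert(clip_negative(0) == 0);
    assert(clip_negative(123) == 123);
    assert(clip_negative(INT32_MAX) == INT32_MAX);
}
```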

Richard.

David Williams
xx Re: Setting a register to zero if it's < zero
« Reply #7 on: Oct 9th, 2011, 10:28am »

on Oct 9th, 2011, 09:46am, Richard Russell wrote:
The downside, of course, is that (if we're talking about BB4W here) it means using the ASMLIB library, with all that implies in respect of disabling crunching - for example needing to put your source in a separate file which you assemble using CALL.


Yes, I'm aware of this. I'll probably hand-assemble the instructions to get around the problem of not being able to crunch the program.

on Oct 9th, 2011, 09:46am, Richard Russell wrote:
I assume your application is somewhat untypical, in that clipping is both common and unpredictable. In the more typical case (digital filtering is a classic example) clipping is the exception: the majority of values will be 'in range'. In that case the naive approach using conditional jumps will outperform the CMOV code on any modern CPU.


I'll conduct some timed tests at some point, aware of course of the performance variations resulting from different CPU architectures.

on Oct 9th, 2011, 09:46am, Richard Russell wrote:
Incidentally, your original question was how to clip off negative values, i.e. clipping to zero. That's simpler than the general case of course, for example this is one branchless solution (if that's definitely what you want):

Code:
      cmp eax,&80000000
      sbb ebx,ebx
      and eax,ebx 

Richard.


Thanks for this. I think I might set up another web page at bb4wgames.com -- a reference page of handy code snippets and gems just like that one; various x86 asm tips & tricks, etc.

David.
« Last Edit: Oct 9th, 2011, 10:28am by David Williams »

Michael Hutton
xx Re: Setting a register to zero if it's < zero
« Reply #8 on: Oct 9th, 2011, 10:31am »

Here are three links discussing the cmov instruction which seem not to be so impressed with it:

http://www.redhat.com/archives/rhl-devel-list/2009-February/msg00422.html
https://mail.mozilla.org/pipermail/tamarin-devel/2008-April/000455.html
http://ondioline.org/mail/cmov-a-bad-idea-on-out-of-order-cpus


So it's horses for courses as per usual.

Michael


Michael Hutton
xx Re: Setting a register to zero if it's < zero
« Reply #9 on: Oct 9th, 2011, 11:11am »

OK, for what it is worth, here is a not very accurate test, but it seems to favour the CMOV instruction with random numbers on a Core i5. You could test with not-so-random numbers as well, I suppose, and the results will vary according to processor type.
Code:
         0:               INSTALL @lib$ + "ASMLIB"
         0:               DIM C% 100, L%-1
         0:               FOR pass=8 TO 10 STEP 2
         0:               P% = C%
         0:               ON ERROR LOCAL [OPT FN_asmext
         0:               [
         0:               OPT pass
         0:               .clampcmov
         0:               .M%
         0:               push 255
         0:               push 0
         0:               cmp eax, 255
         0:               cmovg eax, [esp+4]
         0:               cmp eax, 0
         0:               cmovl eax, [esp]
         0:               add esp,8
         0:               ret
         0:               
         0:               .clampbranch
         0:               .N%
         0:               cmp eax,5
         0:               jl testmin
         0:               mov eax,5
         0:               .testmin
         0:               cmp eax,0
         0:               jg end
         0:               xor eax, eax
         0:               .end
         0:               ret
         0:               
         0:               .clampcmov2
         0:               .O%
         0:               mov esi, 255
         0:               mov edi,0
         0:               cmp eax, 255
         0:               cmovg eax, esi
         0:               cmp eax, 0
         0:               cmovl eax, edi
         0:               ret
         0:               
         0:               
         0:               ]
         0:               NEXT
         0:               
        20:     0.08      FOR I%=1 TO 10000000
      1695:     7.13      A% = RND
      5503:    23.15      B% = USR(M%)
       628:     2.64      NEXT
         0:               
        41:     0.17      FOR I%=1 TO 10000000
      1637:     6.89      A% = RND
      5628:    23.67      B% = USR(N%)
       683:     2.87      NEXT
         0:               
        25:     0.11      FOR I%=1 TO 10000000
      1668:     7.02      A% = RND
      5552:    23.35      B% = USR(O%)
       693:     2.91      NEXT
         0:               
         0:               
         0:               END
         0:               
 


:-/

Michael

David Williams
xx Re: Setting a register to zero if it's < zero
« Reply #10 on: Oct 9th, 2011, 12:01pm »

Interesting stuff, Michael. Thanks for the links to the CMOV discussions, and for the timing test.

Those discussions seem to suggest, overall, that the jury is still out?

Incidentally, when I do timing tests, I usually raise the process priority of my program to Higher (or Above Normal), because I think it produces more stable timings. I certainly don't raise it to Highest.

For my specific application (fast colour saturation enhancement/reduction - basically blending a colour RGB pixel with its own greyscale intensity -- another routine for GFXLIB, of course), I don't think the branches are very predictable, so perhaps CMOV will win for me. It remains to be seen!

===

Now, sorry to chop & change, but...

I'd like to hijack my own thread and ask what you think is the fastest way of getting the separate R, G, B values from a 32-bit xRGB pixel (&xxRrGgBb) into three registers... say EAX, EBX, ECX.

I've found that this seems fast:

Code:
movzx ecx, BYTE [edi + 4*esi]      ; blue
movzx ebx, BYTE [edi + 4*esi + 1]  ; green
movzx eax, BYTE [edi + 4*esi + 2]  ; red

; EDI points to ARGB32 bitmap base address
; ESI is the pixel index 


By the way, I found no discernible speed increase by loading (via LEA) the pixel address EDI+4*ESI into some register.

Why am I surprised that the above method is noticeably faster than this one below:

Code:
mov edx, [edi + 4*esi]   ; load 32-bit &xxRrGgBb pixel

mov eax, edx             ; copy EDX
mov ebx, edx             ; copy EDX (again)
mov ecx, edx             ; ...and again

and eax, &FF0000         ; EAX = &00 Rr 00 00 (red component * 2^16)
and ebx, &FF00           ; EBX = &00 00 Gg 00 (green component * 2^8)
and ecx, &FF             ; ECX = &00 00 00 Bb (blue component)
        
shr eax, 16              ; EAX (al) = &Rr (red byte)
shr ebx, 8               ; EBX (bl) = &Gg (green byte)
;                        ; ECX (cl) = &Bb (blue byte)
 


(My overly verbose ASM comments are for my own benefit - I don't intend to patronise! I nearly always verbosely comment my asm code.)

So, one single main memory access (in the latter case) versus three main memory accesses.
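
For reference, the two extraction methods compared above should produce identical values, which a C sketch makes easy to check. The type and function names are hypothetical, and the byte-load version assumes the little-endian x86 layout, with blue at the lowest address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t r, g, b; } rgb_t;   /* hypothetical helper type */

/* One 32-bit load, then mask and shift (the MOV/AND/SHR listing). */
rgb_t split_mask_shift(uint32_t pixel)
{
    rgb_t c;
    c.r = (pixel & 0xFF0000u) >> 16;
    c.g = (pixel & 0x00FF00u) >> 8;
    c.b =  pixel & 0x0000FFu;
    return c;
}

/* Three zero-extending byte loads (the MOVZX listing).
   p points at the first byte of the pixel: B, G, R, x in memory order. */
rgb_t split_byte_loads(const uint8_t *p)
{
    rgb_t c;
    assert(p != NULL);
    c.b = p[0];
    c.g = p[1];
    c.r = p[2];
    return c;
}
```

For the pixel &AA112233, stored in memory as the bytes 33, 22, 11, AA, both functions give R = &11, G = &22 and B = &33. The C model says nothing about speed, which is what the timing tests are for.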

Would you say that the three EDX copying instructions are destroying parallel (U & V pipe) processing opportunities?

In any case, is there a faster way of getting those R, G, B values into EAX, EBX and ECX respectively?

Thanks in advance!


Rgs,
David.

PS. I notice that this forum system sometimes changes the word a_s_s_e_m_b_l_y into disagreembly. Bizarre.
« Last Edit: Oct 9th, 2011, 12:04pm by David Williams »

Michael Hutton
xx Re: Setting a register to zero if it's < zero
« Reply #11 on: Oct 9th, 2011, 12:35pm »

Quote:
In any case, is there a faster way of getting those R, G, B values into EAX, EBX and ECX respectively?


Good god! Why are you asking me? ???

I am an amoebic 'stimulus - respond' man when it comes to asm! Normally, I just play with what you've posted!

But out of interest, I was thinking about that very thing the other day when you sent me the 'Colourdrain' code.. and I am in Visual C++ now, typing out the procedure to see if I can look at the asm output and see how it changes depending on the C++ code.

I too was surprised you used the :
Code:
movzx ecx, BYTE [edi + 4*esi]      ; blue
movzx ebx, BYTE [edi + 4*esi + 1]  ; green
movzx eax, BYTE [edi + 4*esi + 2]  ; red
 


but came to the conclusion that it would be cached.

but as alternatives I played around initially with:
Code:
xor ebx,ebx
xor ecx,ecx
mov eax, [edi + 4 * esi]
mov bl, ah     ; green (note: edited from original)
mov cl, al     ; blue (note: edited from original)
shr eax,16     ; AL = red
; and eax, &FF (if you're going to use eax rather than al)
 

and if you just want the byte values you could forget about the xor's and just use the al, bl and cl registers..

so I suppose the bare minimal would be:
Code:
mov eax, [edi + 4 * esi]
mov bl, ah     ; (note: edited from original)
mov cl, al     ; (note: edited from original)
shr eax,16
 

and you'll have the rgb values in al, bl, cl respectively. If you want them 'tidied' to 32 bit values then add

Code:
and eax,&FF
and ebx,&FF
and ecx,&FF
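
The reason for the XORs (or the trailing ANDs) is that `mov bl, ah` writes only the low byte of EBX and leaves the upper 24 bits as they were. The following C model of the register contents shows the difference; it is a sketch with hypothetical names, not a statement about timing.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t eax, ebx, ecx; } regs_t;   /* hypothetical register file */

/* Write only the low 8 bits of a 32-bit register, as `mov bl, ah` does. */
static uint32_t set_low_byte(uint32_t reg, uint8_t byte)
{
    return (reg & 0xFFFFFF00u) | byte;
}

/* Model of the byte-move sequence, with or without the two XORs. */
regs_t split_via_byte_moves(uint32_t pixel, regs_t in, int zero_first)
{
    regs_t r = in;
    if (zero_first) { r.ebx = 0; r.ecx = 0; }            /* xor ebx,ebx / xor ecx,ecx */
    r.eax = pixel;                                       /* mov eax,[edi + 4*esi] */
    r.ebx = set_low_byte(r.ebx, (uint8_t)(r.eax >> 8));  /* mov bl, ah */
    r.ecx = set_low_byte(r.ecx, (uint8_t)r.eax);         /* mov cl, al */
    r.eax >>= 16;                                        /* shr eax,16 */
    return r;
}

void split_via_byte_moves_examples(void)
{
    regs_t stale = { 0u, 0xDEADBEEFu, 0x12345678u };
    regs_t dirty = split_via_byte_moves(0xAA112233u, stale, 0);
    regs_t clean = split_via_byte_moves(0xAA112233u, stale, 1);
    assert(dirty.ebx == 0xDEADBE22u && dirty.ecx == 0x12345633u);
    assert(clean.ebx == 0x22u && clean.ecx == 0x33u);
    assert(clean.eax == 0xAA11u && (clean.eax & 0xFFu) == 0x11u);
}
```

Note that EAX still carries the unused top byte of the pixel in bits 8 to 15 after the shift, which is why the `and eax,&FF` is needed when the full register is used rather than AL.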
 


I hope I haven't overlooked something blindingly obvious as per usual.

Michael

« Last Edit: Oct 9th, 2011, 2:24pm by Michael Hutton »

David Williams
xx Re: Setting a register to zero if it's < zero
« Reply #12 on: Oct 9th, 2011, 1:33pm »

on Oct 9th, 2011, 12:35pm, Michael Hutton wrote:
Good god! Why are you asking me? ???


Yes, sorry if I put you on the spot a little. Actually, the question was meant to be addressed to you, Richard and anyone else who knows x86 assembly language.


on Oct 9th, 2011, 12:35pm, Michael Hutton wrote:
so I suppose the bare minimal would be:
Code:
mov eax, [edi + 4 * esi]
mov ebx, ah
mov ecx, al
shr eax,16
 



Sorry Michael, but the BB4W assembler didn't like MOVing an 8-bit register into a 32-bit one like that. It's hard to see why such an operation shouldn't be allowed, but then I'm not an Intel engineer!


Here is a timing test I did (excuse the obsession with data - and even instruction - alignment!):

Code:
      MODE 8 : OFF
      
      HIMEM = LOMEM + 2*&100000
      
      DIM gap1% 4096
      DIM bitmap% 4*(640*512 + 1)
      bitmap% = (bitmap% + 3) AND -4
      DIM  gap2% 4096
      
      REM. These 4 Kb gaps are probably way OTT, but just to be certain!
      
      PROC_asm
      
      REM. Fill bitmap with random values
      FOR I% = bitmap% TO (bitmap% + 4*640*512)-1 STEP 4
        !I% = RND
      NEXT
      
      G% = FNSYS_NameToAddress( "GetTickCount" )
      time0% = 0
      time1% = 0
      
      PRINT '" Conducting test #1, please wait..."'
      
      REM. Test #1 (three MOVZX instructions)
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      SYS G% TO time0%
      FOR I% = 1 TO 5000
        CALL A%
      NEXT
      SYS G% TO time1%
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      PRINT '" Test #1 (three MOVZX instructions) took ";
      PRINT ;(time1% - time0%)/1000; " s."'
      
      PRINT '" Conducting test #2, please wait..."'
      
      REM. Test #2 (one MOV instruction)
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      SYS G% TO time0%
      FOR I% = 1 TO 5000
        CALL B%
      NEXT
      SYS G% TO time1%
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      PRINT '" Test #2 (one MOV instruction) took ";
      PRINT ;(time1% - time0%)/1000; " s."'
      
      PRINT '" Conducting test #3, please wait..."'
      
      REM. Test #3 (one MOV instruction; other instructions re-ordered)
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      SYS G% TO time0%
      FOR I% = 1 TO 5000
        CALL C%
      NEXT
      SYS G% TO time1%
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      PRINT '" Test #3 (one MOV instruction; other instructions re-ordered) took ";
      PRINT ;(time1% - time0%)/1000; " s."'
      
      PRINT''" Finished."
      END
      :
      :
      :
      :
      DEF PROC_asm
      LOCAL I%, P%, code%, loop_A, loop_B, loop_C
      DIM code% 1000
      
      FOR I% = 0 TO 2 STEP 2
        P% = code%
        [OPT I%
        
        ] : P%=(P%+3) AND -4 : [OPT I%
        
        .A%
        mov edi, bitmap%
        xor esi, esi
        .loop_A
        movzx ecx, BYTE [edi + 4*esi]
        movzx ebx, BYTE [edi + 4*esi + 1]
        movzx eax, BYTE [edi + 4*esi + 2]
        add esi, 1
        cmp esi, (640 * 512)
        jl loop_A
        ret
        
        ] : P%=(P%+3) AND -4 : [OPT I%
        
        .B%
        mov edi, bitmap%
        xor esi, esi
        .loop_B
        mov edx, [edi + 4*esi]
        mov eax, edx
        mov ebx, edx
        mov ecx, edx
        and eax, &FF0000
        and ebx, &FF00
        and ecx, &FF
        shr eax, 16
        shr ebx, 8
        add esi, 1
        cmp esi, (640 * 512)
        jl loop_B
        ret
        
        ] : P%=(P%+3) AND -4 : [OPT I%
        
        .C%
        mov edi, bitmap%
        xor esi, esi
        .loop_C
        mov edx, [edi + 4*esi]
        
        mov eax, edx
        and eax, &FF0000
        shr eax, 16
        
        mov ebx, edx
        and ebx, &FF00
        shr ebx, 8
        
        mov ecx, edx
        and ecx, &FF
        
        add esi, 1
        cmp esi, (640 * 512)
        jl loop_C
        ret
        
        ]
        
      NEXT I%
      ENDPROC
      :
      :
      :
      :
      DEF FNSYS_NameToAddress( f$ )
      LOCAL P%
      DIM P% LOCAL 5
      [OPT 0 : call f$ : ]
      =P%!-4+P%


 


On my Intel Centrino Duo 1.whatever GHz laptop, I get the following results:

Test #1 (three MOVZX instructions): 7.84 seconds.

Test #2 (one MOV instruction): 10.92 seconds.

Test #3 (one MOV instruction; other instructions re-ordered): 10.97 seconds.

By "MOV instruction", I mean of course the main memory/cache pixel data loads.

I'd say that, with a difference of nearly 3 seconds, the three MOVZX method is significantly faster.

I still wonder if there's an even faster way of doing it.


Regards,

David.

Michael Hutton
xx Re: Setting a register to zero if it's < zero
« Reply #13 on: Oct 9th, 2011, 1:39pm »

Code:
mov eax, [edi + 4 * esi]
mov ebx, ah
mov ecx, al
shr eax,16
 


well, you must have known I meant: ;)

Code:
mov eax, [edi + 4 * esi]
mov bl, ah
mov cl, al
shr eax,16
 


I shouldn't edit code in the conforums!

Put that in the timings program... I would do myself but I have only got the demo version on this laptop (Yes, Richard, I can't remember what the Serial Number reg key is actually called and exactly where it is!)

Michael

David Williams
xx Re: Setting a register to zero if it's < zero
« Reply #14 on: Oct 9th, 2011, 2:03pm »

on Oct 9th, 2011, 1:39pm, Michael Hutton wrote:
well, you must have known I meant: ;)


I was just being dense. And having been up all night... etc.



on Oct 9th, 2011, 1:39pm, Michael Hutton wrote:
Code:
mov eax, [edi + 4 * esi]
mov bl, ah
mov cl, al
shr eax,16
 





Right, the updated timing test code with your solution(s):

Code:
      MODE 8 : OFF
      
      HIMEM = LOMEM + 2*&100000
      
      DIM gap1% 4096
      DIM bitmap% 4*(640*512 + 1)
      bitmap% = (bitmap% + 3) AND -4
      DIM  gap2% 4096
      
      REM. These 4 Kb gaps are probably way OTT, but just to be certain!
      
      PROC_asm
      
      REM. Fill bitmap with random values
      FOR I% = bitmap% TO (bitmap% + 4*640*512)-1 STEP 4
        !I% = RND
      NEXT
      
      G% = FNSYS_NameToAddress( "GetTickCount" )
      time0% = 0
      time1% = 0
      
      PRINT '" Conducting test #1, please wait..."'
      
      REM. Test #1 (three MOVZX instructions)
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      SYS G% TO time0%
      FOR I% = 1 TO 5000
        CALL A%
      NEXT
      SYS G% TO time1%
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      PRINT '" Test #1 (three MOVZX instructions) took ";
      PRINT ;(time1% - time0%)/1000; " s."'
      
      PRINT '" Conducting test #2, please wait..."'
      
      REM. Test #2 (one MOV instruction)
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      SYS G% TO time0%
      FOR I% = 1 TO 5000
        CALL B%
      NEXT
      SYS G% TO time1%
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      PRINT '" Test #2 (one MOV instruction) took ";
      PRINT ;(time1% - time0%)/1000; " s."'
      
      PRINT '" Conducting test #3, please wait..."'
      
      REM. Test #3 (one MOV instruction; other instructions re-ordered)
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      SYS G% TO time0%
      FOR I% = 1 TO 5000
        CALL C%
      NEXT
      SYS G% TO time1%
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      PRINT '" Test #3 (one MOV instruction; other instructions re-ordered) took ";
      PRINT ;(time1% - time0%)/1000; " s."'
      
      PRINT '" Conducting test #4, please wait..."'
      
      REM. Test #4 (one MOV instruction; Michael's solution (no XORs))
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      SYS G% TO time0%
      FOR I% = 1 TO 5000
        CALL D%
      NEXT
      SYS G% TO time1%
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      PRINT '" Test #4 (one MOV instruction; Michael's solution (no XORs)) took ";
      PRINT ;(time1% - time0%)/1000; " s."'
      
      PRINT '" Conducting test #5, please wait..."'
      
      REM. Test #5 (one MOV instruction; Michael's solution (with XORs))
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      SYS G% TO time0%
      FOR I% = 1 TO 5000
        CALL E%
      NEXT
      SYS G% TO time1%
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      PRINT '" Test #5 (one MOV instruction; Michael's solution (with XORs)) took ";
      PRINT ;(time1% - time0%)/1000; " s."'
      
      PRINT''" Finished."
      END
      :
      :
      :
      :
      DEF PROC_asm
      LOCAL I%, P%, code%, loop_A, loop_B, loop_C, loop_D, loop_E
      DIM code% 1000
      
      FOR I% = 0 TO 2 STEP 2
        P% = code%
        [OPT I%
        
        ] : P%=(P%+3) AND -4 : [OPT I%
        
        .A%
        mov edi, bitmap%
        xor esi, esi
        .loop_A
        movzx ecx, BYTE [edi + 4*esi]
        movzx ebx, BYTE [edi + 4*esi + 1]
        movzx eax, BYTE [edi + 4*esi + 2]
        add esi, 1
        cmp esi, (640 * 512)
        jl loop_A
        ret
        
        ] : P%=(P%+3) AND -4 : [OPT I%
        
        .B%
        mov edi, bitmap%
        xor esi, esi
        .loop_B
        mov edx, [edi + 4*esi]
        mov eax, edx
        mov ebx, edx
        mov ecx, edx
        and eax, &FF0000
        and ebx, &FF00
        and ecx, &FF
        shr eax, 16
        shr ebx, 8
        add esi, 1
        cmp esi, (640 * 512)
        jl loop_B
        ret
        
        ] : P%=(P%+3) AND -4 : [OPT I%
        
        .C%
        mov edi, bitmap%
        xor esi, esi
        .loop_C
        mov edx, [edi + 4*esi]
        
        mov eax, edx
        and eax, &FF0000
        shr eax, 16
        
        mov ebx, edx
        and ebx, &FF00
        shr ebx, 8
        
        mov ecx, edx
        and ecx, &FF
        
        add esi, 1
        cmp esi, (640 * 512)
        jl loop_C
        ret
        
        ] : P%=(P%+3) AND -4 : [OPT I%
        
        .D%
        mov edi, bitmap%
        xor esi, esi
        .loop_D
        
        mov eax, [edi + 4 * esi]
        mov bl, ah
        mov cl, al
        shr eax,16
        
        add esi, 1
        cmp esi, (640 * 512)
        jl loop_D
        ret
        
        ] : P%=(P%+3) AND -4 : [OPT I%
        
        .E%
        mov edi, bitmap%
        xor esi, esi
        .loop_E
        
        xor ebx, ebx
        xor ecx, ecx
        mov eax, [edi + 4 * esi]
        mov bl, ah
        mov cl, al
        shr eax,16
        
        add esi, 1
        cmp esi, (640 * 512)
        jl loop_E
        ret
        
        ]
        
      NEXT I%
      ENDPROC
      :
      :
      :
      :
      DEF FNSYS_NameToAddress( f$ )
      LOCAL P%
      DIM P% LOCAL 5
      [OPT 0 : call f$ : ]
      =P%!-4+P% 



Results:

Test#1 (three MOVZX's) : 7.81 s
Test#2 (one MOV) : 10.89 s
Test#3 (one MOV; instructions re-ordered): 10.83 s
Test#4 (one MOV; Michael's solution (no XORs)): 6.02 s
Test#5 (one MOV; Michael's solution (with XORs)): 7.45 s

Since I probably need (I can't remember off-hand) to have the 3 higher bytes of EAX, EBX and ECX all clear, your code with the XORs looks like the one to go with, so thanks for that.


(Edit: Hey, I think this is another prime candidate for my proposed BB4W "Handy Assembler Code Snippets" (or whatever) web page!)


Rgs,
David.
« Last Edit: Oct 9th, 2011, 2:26pm by David Williams »

Michael Hutton
xx Re: Setting a register to zero if it's < zero
« Reply #15 on: Oct 9th, 2011, 2:21pm »

Well, I am chuffed it worked.

Michael

admin
xx Re: Setting a register to zero if it's < zero
« Reply #16 on: Oct 9th, 2011, 2:27pm »

on Oct 9th, 2011, 11:11am, Michael Hutton wrote:
Ok, for what it is worth, a not so very accurate test

I would rather say 'completely meaningless'. :-(

There are no alignment instructions to ensure that the different subroutines are being run 'on a level playing field', i.e. with each independently aligned to a multiple of 32 bytes.
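
The alignment arithmetic itself is simple. As a sketch in C (the function name `align_up` is hypothetical), rounding an address up to a power-of-two boundary looks like this:

```c
#include <assert.h>
#include <stdint.h>

/* Round address up to the next multiple of alignment (a power of two). */
uint32_t align_up(uint32_t address, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (address + alignment - 1) & ~(alignment - 1);
}
```

This is the same arithmetic as the `P%=(P%+3) AND -4` lines in the timing programs in this thread, which align to 4 bytes; for a 32-byte boundary the corresponding expression would be `(P%+31) AND -32`.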

Also, the only way you'll get any meaningful comparison is by looping in the assembler code, because otherwise the overhead of the USR function and the NEXT statement will be hugely more than the code you are trying to benchmark.

Even having taken care of those factors, when comparing my suggested code with CMOV I still found that if I swapped around the two subroutines it was always the second that ran the faster, so the difference clearly wasn't related to the actual code.

It is well established that using conditional jumps will always win if the jumps are accurately predicted by the CPU.

Richard.

admin
xx Re: Setting a register to zero if it's < zero
« Reply #17 on: Oct 9th, 2011, 2:35pm »

on Oct 9th, 2011, 12:01pm, David Williams wrote:
I'd like to hijack my own thread and ask what you think is the fastest way of getting the separate R, G, B values from a 32-bit xRGB pixel (&xxRrGgBb) into three registers... say EAX, EBX, ECX.

I know you didn't ask me, but....

I wouldn't start from there, because constraining yourself to using the general-purpose registers (eax etc.) is going to limit the speed, however optimised the code. The MMX instructions are specifically designed to handle things like 32-bit xRGB pixels efficiently, so I would expect using MMX would be the fastest way.

Of course it depends on what you do with the values next, once you've got them in separate registers, but I'd still be surprised if MMX doesn't win.

Richard.

admin
xx Re: Setting a register to zero if it's < zero
« Reply #18 on: Oct 9th, 2011, 2:40pm »

on Oct 9th, 2011, 1:33pm, David Williams wrote:
Sorry Michael, but the BB4W assembler didn't like MOVing an 8-bit register into a 32-bit one like that. It's hard to see why such an operation shouldn't be allowed, but then I'm not an Intel engineer!

It is allowed, but you have to use the correct mnemonic:

Code:
mov eax, [edi + 4 * esi]
movzx ebx, ah
movzx ecx, al
shr eax,16 
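
In C terms, MOVZX corresponds to converting an 8-bit value to a 32-bit unsigned one, so the upper 24 bits of the destination are always cleared. A sketch of the register contents after the sequence above (the names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t eax, ebx, ecx; } regs_t;   /* hypothetical register file */

/* Model of: mov eax,[mem] / movzx ebx,ah / movzx ecx,al / shr eax,16 */
regs_t split_with_movzx(uint32_t pixel)
{
    regs_t r;
    r.eax = pixel;
    r.ebx = (uint32_t)(uint8_t)(r.eax >> 8);   /* movzx ebx, ah: green, zero-extended */
    r.ecx = (uint32_t)(uint8_t)r.eax;          /* movzx ecx, al: blue, zero-extended  */
    r.eax >>= 16;                              /* red is now in AL */
    return r;
}

void split_with_movzx_examples(void)
{
    regs_t r = split_with_movzx(0xAA112233u);
    assert(r.ebx == 0x22u);
    assert(r.ecx == 0x33u);
    assert(r.eax == 0xAA11u);            /* unused top byte remains in bits 8..15 */
    assert((r.eax & 0xFFu) == 0x11u);
}
```

EBX and ECX come out clean without any XOR or AND, but EAX still holds the pixel's unused top byte above AL unless that byte is known to be zero.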

Richard.

David Williams
xx Re: Setting a register to zero if it's < zero
« Reply #19 on: Oct 9th, 2011, 2:55pm »

on Oct 9th, 2011, 2:35pm, Richard Russell wrote:
I know you didn't ask me, but....


I certainly had you in mind as well!

The trouble with the English language is that, in contrast to French and Arabic, it doesn't have a plural form of the word "you" (as in "you people", "you two gifted assembly language programmers", etc.).



on Oct 9th, 2011, 2:35pm, Richard Russell wrote:
I wouldn't start from there, because constraining yourself to using the general-purpose registers (eax etc.) is going to limit the speed, however optimised the code. The MMX instructions are specifically designed to handle things like 32-bit xRGB pixels efficiently, so I would expect using MMX would be the fastest way.


I'll certainly look into it. I've got some MMX code of yours (alpha blending) which I might be able to adapt.

on Oct 9th, 2011, 2:35pm, Richard Russell wrote:
Of course it depends on what you do with the values next, once you've got them in separate registers, but I'd still be surprised if MMX doesn't win.


Once they're in the registers, I compute the intensity using the fixed-point version of the formula:

i = 0.299*R + 0.587*G + 0.114*B

then I 'blend' i (which is in the range 0 to 255) with each of the R, G, B values by an amount specified as a factor ranging from 0.0 to beyond 1.0 (up to 10.0, say). Values less than 1.0 de-saturate the colour, values over 1.0 enhance saturation. (By the way, all the math is fixed-point, so this 'saturation factor' is multiplied by 2^20 (&100000) before it's passed to the routine).
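
As a rough illustration of that description, here is a fixed-point sketch in C. It is not the GFXLIB routine: the function names are hypothetical, the blend formula `grey + (c - grey) * factor` is an assumption that merely matches the behaviour described (0 gives greyscale, 1 leaves the colour unchanged, larger values boost saturation), and the intensity uses the standard Rec. 601 luma weights (0.299, 0.587, 0.114 for R, G, B) scaled by 2^16.

```c
#include <assert.h>
#include <stdint.h>

#define SAT_ONE (1 << 20)   /* saturation factor 1.0 in the 2^20 fixed-point scale */

/* Greyscale intensity with weights scaled by 2^16 (19595 + 38470 + 7471 = 65536). */
int32_t intensity(int32_t r, int32_t g, int32_t b)
{
    assert(r >= 0 && r <= 255 && g >= 0 && g <= 255 && b >= 0 && b <= 255);
    return (19595 * r + 38470 * g + 7471 * b) >> 16;
}

static int32_t clamp255(int64_t v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : (int32_t)v);
}

/* Blend one channel with the grey level; factor is the saturation times 2^20.
   The 64-bit intermediate avoids overflow for factors as large as 10.0. */
int32_t saturate_channel(int32_t c, int32_t grey, int32_t factor)
{
    assert(factor >= 0);
    int64_t delta = (int64_t)(c - grey) * factor / SAT_ONE;
    return clamp255(grey + delta);
}
```

The final clamp to the range 0 to 255 is where the clipping discussed earlier in this thread comes in, since factors above 1.0 can push a channel out of range in either direction.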

I like the effect it produces (technically accurate or otherwise!), and it will be a nice addition to GFXLIB.


Rgs,
David.
« Last Edit: Oct 9th, 2011, 2:55pm by David Williams »
