Setting a register to zero if it's < zero

Michael Hutton
« Reply #13 on: Oct 9th, 2011, 1:39pm »

Code:

mov eax, [edi + 4 * esi]
mov ebx, ah
mov ecx, al
shr eax,16

well, you must have known I meant: ;)

Code:

mov eax, [edi + 4 * esi]
mov bl, ah
mov cl, al
shr eax,16

I shouldn't edit code in the conforums!

Put that in the timings program... I would do it myself, but I have only got the demo version on this laptop (yes, Richard, I can't remember what the Serial Number reg key is actually called or exactly where it is!)

Michael

David Williams
« Reply #14 on: Oct 9th, 2011, 2:03pm »

on Oct 9th, 2011, 1:39pm, Michael Hutton wrote:

well, you must have known I meant: ;)

I was just being dense. And having been up all night... etc.

on Oct 9th, 2011, 1:39pm, Michael Hutton wrote:

Code:

mov eax, [edi + 4 * esi]
mov bl, ah
mov cl, al
shr eax,16

Right, the updated timing test code with your solution(s):

Code:

      MODE 8 : OFF
      
      HIMEM = LOMEM + 2*&100000
      
      DIM gap1% 4096
      DIM bitmap% 4*(640*512 + 1)
      bitmap% = (bitmap% + 3) AND -4
      DIM  gap2% 4096
      
      REM. These 4 Kb gaps are probably way OTT, but just to be certain!
      
      PROC_asm
      
      REM. Fill bitmap with random values
      FOR I% = bitmap% TO (bitmap% + 4*640*512)-1 STEP 4
        !I% = RND
      NEXT
      
      G% = FNSYS_NameToAddress( "GetTickCount" )
      time0% = 0
      time1% = 0
      
      PRINT '" Conducting test #1, please wait..."'
      
      REM. Test #1 (three MOVZX instructions)
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      SYS G% TO time0%
      FOR I% = 1 TO 5000
        CALL A%
      NEXT
      SYS G% TO time1%
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      PRINT '" Test #1 (three MOVZX instructions) took ";
      PRINT ;(time1% - time0%)/1000; " s."'
      
      PRINT '" Conducting test #2, please wait..."'
      
      REM. Test #2 (one MOV instruction)
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      SYS G% TO time0%
      FOR I% = 1 TO 5000
        CALL B%
      NEXT
      SYS G% TO time1%
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      PRINT '" Test #2 (one MOV instruction) took ";
      PRINT ;(time1% - time0%)/1000; " s."'
      
      PRINT '" Conducting test #3, please wait..."'
      
      REM. Test #3 (one MOV instruction; other instructions re-ordered)
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      SYS G% TO time0%
      FOR I% = 1 TO 5000
        CALL C%
      NEXT
      SYS G% TO time1%
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      PRINT '" Test #3 (one MOV instruction; other instructions re-ordered) took ";
      PRINT ;(time1% - time0%)/1000; " s."'
      
      PRINT '" Conducting test #4, please wait..."'
      
      REM. Test #4 (one MOV instruction; Michael's solution (no XORs))
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      SYS G% TO time0%
      FOR I% = 1 TO 5000
        CALL D%
      NEXT
      SYS G% TO time1%
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      PRINT '" Test #4 (one MOV instruction; Michael's solution (no XORs)) took ";
      PRINT ;(time1% - time0%)/1000; " s."'
      
      PRINT '" Conducting test #5, please wait..."'
      
      REM. Test #5 (one MOV instruction; Michael's solution (with XORs))
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &80
      SYS G% TO time0%
      FOR I% = 1 TO 5000
        CALL E%
      NEXT
      SYS G% TO time1%
      SYS "GetCurrentProcess" TO hprocess%
      SYS "SetPriorityClass", hprocess%, &20
      
      PRINT '" Test #5 (one MOV instruction; Michael's solution (with XORs)) took ";
      PRINT ;(time1% - time0%)/1000; " s."'
      
      PRINT''" Finished."
      END
      :
      :
      :
      :
      DEF PROC_asm
      LOCAL I%, P%, code%, loop_A, loop_B, loop_C, loop_D, loop_E
      DIM code% 1000
      
      FOR I% = 0 TO 2 STEP 2
        P% = code%
        [OPT I%
        
        ] : P%=(P%+3) AND -4 : [OPT I%
        
        .A%
        mov edi, bitmap%
        xor esi, esi
        .loop_A
        movzx ecx, BYTE [edi + 4*esi]
        movzx ebx, BYTE [edi + 4*esi + 1]
        movzx eax, BYTE [edi + 4*esi + 2]
        add esi, 1
        cmp esi, (640 * 512)
        jl loop_A
        ret
        
        ] : P%=(P%+3) AND -4 : [OPT I%
        
        .B%
        mov edi, bitmap%
        xor esi, esi
        .loop_B
        mov edx, [edi + 4*esi]
        mov eax, edx
        mov ebx, edx
        mov ecx, edx
        and eax, &FF0000
        and ebx, &FF00
        and ecx, &FF
        shr eax, 16
        shr ebx, 8
        add esi, 1
        cmp esi, (640 * 512)
        jl loop_B
        ret
        
        ] : P%=(P%+3) AND -4 : [OPT I%
        
        .C%
        mov edi, bitmap%
        xor esi, esi
        .loop_C
        mov edx, [edi + 4*esi]
        
        mov eax, edx
        and eax, &FF0000
        shr eax, 16
        
        mov ebx, edx
        and ebx, &FF00
        shr ebx, 8
        
        mov ecx, edx
        and ecx, &FF
        
        add esi, 1
        cmp esi, (640 * 512)
        jl loop_C
        ret
        
        ] : P%=(P%+3) AND -4 : [OPT I%
        
        .D%
        mov edi, bitmap%
        xor esi, esi
        .loop_D
        
        mov eax, [edi + 4 * esi]
        mov bl, ah
        mov cl, al
        shr eax,16
        
        add esi, 1
        cmp esi, (640 * 512)
        jl loop_D
        ret
        
        ] : P%=(P%+3) AND -4 : [OPT I%
        
        .E%
        mov edi, bitmap%
        xor esi, esi
        .loop_E
        
        xor ebx, ebx
        xor ecx, ecx
        mov eax, [edi + 4 * esi]
        mov bl, ah
        mov cl, al
        shr eax,16
        
        add esi, 1
        cmp esi, (640 * 512)
        jl loop_E
        ret
        
        ]
        
      NEXT I%
      ENDPROC
      :
      :
      :
      :
      DEF FNSYS_NameToAddress( f$ )
      LOCAL P%
      DIM P% LOCAL 5
      [OPT 0 : call f$ : ]
      =P%!-4+P%

Results:

Test#1 (three MOVZX's) : 7.81 s
Test#2 (one MOV) : 10.89 s
Test#3 (one MOV; instructions re-ordered): 10.83 s
Test#4 (one MOV; Michael's solution (no XORs)): 6.02 s
Test#5 (one MOV; Michael's solution (with XORs)): 7.45 s

Since I probably need (I can't remember off-hand) to have the 3 higher bytes of EAX, EBX and ECX all clear, your code with the XORs looks like the one to go with, so thanks for that.

(Edit: Hey, I think this is another prime candidate for my proposed BB4W "Handy Assembler Code Snippets" (or whatever) web page!)

Rgs,
David.

« Last Edit: Oct 9th, 2011, 2:26pm by David Williams »

Michael Hutton
« Reply #15 on: Oct 9th, 2011, 2:21pm »

Well, I am chuffed it worked.

Michael

admin
« Reply #16 on: Oct 9th, 2011, 2:27pm »

on Oct 9th, 2011, 11:11am, Michael Hutton wrote:

Ok, for what it is worth, a not so very accurate test

I would rather say 'completely meaningless'. :(

There are no alignment instructions to ensure that the different subroutines are being run 'on a level playing field', i.e. with each independently aligned to a multiple of 32 bytes.

Also, the only way you'll get any meaningful comparison is by looping in the assembler code, because otherwise the overhead of the USR function and the NEXT statement will be hugely more than the code you are trying to benchmark.

Even having taken care of those factors, when comparing my suggested code with CMOV I still found that if I swapped around the two subroutines it was always the second that ran the faster, so the difference clearly wasn't related to the actual code.

It is well established that using conditional jumps will always win if the jumps are accurately predicted by the CPU.

Richard.

admin
« Reply #17 on: Oct 9th, 2011, 2:35pm »

on Oct 9th, 2011, 12:01pm, David Williams wrote:

I'd like to hijack my own thread and ask what you think is the fastest way of getting the separate R, G, B values from a 32-bit xRGB pixel (&xxRrGgBb) into three registers... say EAX, EBX, ECX.

I know you didn't ask me, but....

I wouldn't start from there, because constraining yourself to using the general-purpose registers (eax etc.) is going to limit the speed, however optimised the code. The MMX instructions are specifically designed to handle things like 32-bit xRGB pixels efficiently, so I would expect using MMX would be the fastest way.

Of course it depends on what you do with the values next, once you've got them in separate registers, but I'd still be surprised if MMX doesn't win.

Richard.

admin
« Reply #18 on: Oct 9th, 2011, 2:40pm »

on Oct 9th, 2011, 1:33pm, David Williams wrote:

Sorry Michael, but the BB4W assembler didn't like MOVing an 8-bit register into a 32-bit one like that. It's hard to see why such an operation shouldn't be allowed, but then I'm not an Intel engineer!

It is allowed, but you have to use the correct mnemonic:

Code:

mov eax, [edi + 4 * esi]
movzx ebx, ah
movzx ecx, al
shr eax,16

Richard.

David Williams
« Reply #19 on: Oct 9th, 2011, 2:55pm »

on Oct 9th, 2011, 2:35pm, Richard Russell wrote:

I know you didn't ask me, but....

I certainly had you in mind as well!

The trouble with the English language is that, in contrast to French and Arabic, it doesn't have a plural form of the word "you" (as in "you people", "you two gifted assembly language programmers", etc.).

on Oct 9th, 2011, 2:35pm, Richard Russell wrote:

I wouldn't start from there, because constraining yourself to using the general-purpose registers (eax etc.) is going to limit the speed, however optimised the code. The MMX instructions are specifically designed to handle things like 32-bit xRGB pixels efficiently, so I would expect using MMX would be the fastest way.

I'll certainly look into it. I've got some MMX code of yours (alpha blending) which I might be able to adapt.

on Oct 9th, 2011, 2:35pm, Richard Russell wrote:

Of course it depends on what you do with the values next, once you've got them in separate registers, but I'd still be surprised if MMX doesn't win.

Once they're in the registers, I compute the intensity using the fixed-point version of the formula:

i = 0.299*R + 0.587*G + 0.114*B

then I 'blend' i (which is in the range 0 to 255) with each of the R, G, B values by an amount specified as a factor ranging from 0.0 to beyond 1.0 (up to 10.0, say). Values less than 1.0 de-saturate the colour, values over 1.0 enhance saturation. (By the way, all the math is fixed-point, so this 'saturation factor' is multiplied by 2^20 (&100000) before it's passed to the routine).
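The fixed-point arithmetic described above can be tried out in plain BASIC before committing it to assembler. The following is only a sketch, not the GFXLIB routine: the sample channel values, the variable names and FNclamp are made up for illustration, and the blend formula c' = i + (c - i)*f is an assumption chosen to match the description (f below 1.0 de-saturates, f above 1.0 enhances saturation). The weights use the conventional assignment of 0.299, 0.587 and 0.114 to R, G and B.

```bbcbasic
REM Sketch only: sample values, names and the blend formula are assumptions.
R% = 200 : G% = 100 : B% = 50
f% = 1.5 * 2^20 : REM saturation factor 1.5, scaled by 2^20 (&100000)

REM Intensity: weights scaled by 2^20, so the sum stays below 255 * 2^20.
i% = (INT(0.299 * 2^20) * R% + INT(0.587 * 2^20) * G% + INT(0.114 * 2^20) * B%) >>> 20

REM Push each channel away from (or towards) the intensity by the factor f.
REM >> is an arithmetic shift, so negative differences keep their sign.
R% = FNclamp(i% + (((R% - i%) * f%) >> 20))
G% = FNclamp(i% + (((G% - i%) * f%) >> 20))
B% = FNclamp(i% + (((B% - i%) * f%) >> 20))

PRINT "i = "; i%
PRINT "R, G, B = "; R%; ", "; G%; ", "; B%
END

REM Hypothetical helper: limit a value to the byte range 0 to 255.
DEF FNclamp(v%)
IF v% < 0 THEN = 0
IF v% > 255 THEN = 255
= v%
```

One thing the sketch makes visible: with a 2^20 scale, a factor of 10.0 multiplied by a channel difference of 255 gives 255 * 10 * 2^20 = 2,673,868,800, which is larger than the biggest signed 32-bit value, so a real routine would presumably need to limit the factor or use a smaller scale for that multiplication.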

I like the effect it produces (technically accurate or otherwise!), and it will be a nice addition to GFXLIB.

Rgs,
David.

« Last Edit: Oct 9th, 2011, 2:55pm by David Williams »

David Williams
« Reply #20 on: Oct 9th, 2011, 2:58pm »

on Oct 9th, 2011, 2:40pm, Richard Russell wrote:

It is allowed, but you have to use the correct mnemonic:

Code:

mov eax, [edi + 4 * esi]
movzx ebx, ah
movzx ecx, al
shr eax,16

Richard.

Excellent. I can do away with those horrid XORs now. :)

David Williams
« Reply #21 on: Oct 9th, 2011, 3:09pm »

Okay, with Richard's correction, the fastest non-MMX/non-SIMD code yet is indeed:

Code:

        mov eax, [edi + 4 * esi]
        movzx ebx, ah
        movzx ecx, al
        shr eax,16

It took 5.93 s. in the timed test, approaching 2 seconds faster than the three MOVZX byte loads (Test #1).

Thanking MH and RTR once more.

Michael Hutton
« Reply #22 on: Oct 9th, 2011, 3:14pm »

Quote:

I would rather say 'completely meaningless'.

Quote:

overhead of the USR function and the NEXT statement

In defence, I wasn't commenting on the timing of the BASIC loops, just the profiler report of the USR() line, but even that only has a resolution of 1 ms.

I did say **seems to favour** over 10 million runs.... I am not sure "completely meaningless" really does apply, but I won't argue the point any further. I know what you are saying.

It obviously needs more rigorous testing.

re mmx:
I've been playing around trying to get Visual C++ Express to spew out SIMD code, but it doesn't want to play dice at the moment.

Its asm output for the ColourDrain routine is very similar to what you have already coded, David. In fact, slower I would say by looking at it, but then I wouldn't trust my timings, especially the ones I've just done in my noodle, lol, and I would say my C++ code is not so hot.

Michael

Michael Hutton
« Reply #23 on: Oct 9th, 2011, 3:19pm »

Quote:

movzx ecx, al

I knew that, but I must admit to thinking that it would just extend al into cx, not into ecx, and was worried about high-word garbage.

Oh well.

Michael

David Williams
« Reply #24 on: Oct 9th, 2011, 3:27pm »

on Oct 9th, 2011, 3:14pm, Michael Hutton wrote:

Its asm output for the ColourDrain routine ...

Erm... you know it's occurred to me that "ColourDrain" (although I named it) is the kind of name that Fisher-Price might call it if they had written that routine. I'm thinking of calling it "DesaturateColour"! Isn't that more sensible?

Just on the off-chance that anyone wants to see the code (it's in the form of an external GFXLIB module), here it is:

Code:

      DEF PROCInitModule
      PROCInitModule( 0 )
      ENDPROC
      
      DEF PROCInitModule( V% )
      LOCAL codeSize%, I%, L%, P%, _{}, M$
      
      M$ = "ColourDrain"
      
      GFXLIB_CoreCode% += 0
      IF GFXLIB_CoreCode% = 0 THEN ERROR 0, "The GFXLIB core library appears not to have been installed and initialised. The core library must be installed and initialised before attempting to install any external GFXLIB modules."
      
      codeSize% = 170
      DIM GFXLIB_ColourDrain% codeSize%-1, L% -1
      DIM _{fgtzero%, flt2p20%, loop%}
      
      IF V% THEN
        PRINT '" Assembling GFXLIB module " + CHR$34 + M$ + CHR$34 + "..."
      ENDIF
      
      FOR I% = 8 TO 10 STEP 2
        
        P% = GFXLIB_ColourDrain%
        
        [OPT I%
        
        ; REM. SYS GFXLIB_ColourDrain%, pBitmap%, numPixels%, f%
        
        ;
        ; Parameters -- pBitmap%, numPixels%, f%
        ;
        ;               pBitmap% - points to base address of 32-bpp ARGB bitmap
        ;               numPixels% - number of pixels in bitmap
        ;
        ;               f% (''colour-drain'' factor) is 12.20 fixed-point integer; range (0.0 to 1.0)*2^20  (Note 2^20 = &100000)
        ;
        ;               f% is clamped (by this routine) to the range 0 to 2^20 (&100000)
        ;
        
        pushad
        
        ; ESP!36 = pBitmap%
        ; ESP!40 = numPixels%
        ; ESP!44 = f% (= f * 2^20)
        
        sub esp, 4                               ; allocate space for one local variable
        
        ; And now...
        ;
        ; ESP!40 = pBitmap%         
        ; ESP!44 = numPixels%       
        ; ESP!48 = f% (= f * 2^20)  
        
        mov esi, [esp + 40]                      ; ESI = pBitmap%
        
        ; calc. address of final pixel
        
        mov ebp, [esp + 44]                      ; numPixels%
        sub ebp, 1                               ; numPixels% - 1 (because pixel index starts at zero)
        shl ebp, 2                               ; 4 * (numPixels% - 1) (because 4 bytes per pixel)
        add ebp, esi                             ; = addr of final pixel
        mov [esp], ebp                           ; ESP!0 = addr of final pixel
        
        mov edi, [esp + 48]                      ; EDI = f%
        
        ;REM. if f% < 0 then f% = 0
        cmp edi, 0                               ; f% < 0 ?
        jge _.fgtzero%
        xor edi, edi                             ; f% = 0
        ._.fgtzero%
        
        ;REM. if f% > 2^20 (&100000) then f% = 2^20
        cmp edi, 2^20                            ; f% > 2^20 ?
        jle _.flt2p20%
        mov edi, 2^20                            ; f% = 2^20
        ._.flt2p20%
        
        ._.loop%
        
        movzx ecx, BYTE [esi]                    ; ECX (cl) = blue byte (b&)
        movzx ebx, BYTE [esi + 1]                ; EBX (bl) = green byte (g&)
        movzx eax, BYTE [esi + 2]                ; EAX (al) = red byte (r&)
        
        xor ebp, ebp                             ; EBP = cumulative intensity (i) - initially zero
        
        ;REM. i += (0.114 * 2^20) * b&
        mov edx, ecx
        imul edx, (0.114 * 2^20)
        add ebp, edx
        
        ;REM. i += (0.587 * 2^20) * g&
        mov edx, ebx
        imul edx, (0.587 * 2^20)
        add ebp, edx
        
        ;REM. i += (0.299 * 2^20) * r&
        mov edx, eax
        imul edx, (0.299 * 2^20)
        add ebp, edx
        
        shr ebp, 20                              ; EBP (i&) now in the range 0 to 255
        
        ;REM. b`& = b& + (((i& - b&)*f%) >> 20)
        mov edx, ebp                             ; copy EBP (i&)
        sub edx, ecx                             ; i& - b&
        imul edx, edi                            ; (i& - b&)*f%
        shr edx, 20                              ; ((i& - b&)*f%) >> 20
        add BYTE [esi], dl                       ; write blue byte
        
        ;REM. g`& = g& + (((i& - g&)*f%) >> 20)
        mov edx, ebp                             ; copy EBP (i&)
        sub edx, ebx                             ; i& - g&
        imul edx, edi                            ; (i& - g&)*f%
        shr edx, 20                              ; ((i& - g&)*f%) >> 20
        add BYTE [esi + 1], dl                   ; write green byte
        
        ;REM. r`& = r& + (((i& - r&)*f%) >> 20)
        mov edx, ebp                             ; copy EBP (i&)
        sub edx, eax                             ; i& - r&
        imul edx, edi                            ; (i& - r&)*f%
        shr edx, 20                              ; ((i& - r&)*f%) >> 20
        add BYTE [esi + 2], dl                   ; write red byte
        
        add esi, 4                               ; next pixel address
        cmp esi, [esp]
        jle _.loop%
        
        add esp, 4                               ; free local variable space
        
        popad
        ret 12
        
        ]
        
      NEXT I%
      
      IF V% THEN
        PRINT " Assembled code size = "; (P% - GFXLIB_ColourDrain%);" bytes"
        WAIT V%
      ENDIF
      
      ENDPROC

(Edited some mistakes in the assembler code comments)

Rgs,

David.

« Last Edit: Oct 9th, 2011, 3:58pm by David Williams »

admin
« Reply #25 on: Oct 9th, 2011, 3:28pm »

on Oct 9th, 2011, 3:14pm, Michael Hutton wrote:

In defence, I wasn't commenting on the timing of the BASIC loops, just the profiler report of the USR() line, but even that only has a resolution of 1 ms.

Yes, fair point, I shouldn't have included NEXT in my comment, but USR will still take much longer than the code you're trying to benchmark.

Quote:

I am not sure "completely meaningless" really does apply

I was more meaning in respect of the lack of any alignment code. Modern CPUs typically fetch code in chunks of 32 bytes at a time, and execution speed can be affected depending on the alignment of the code with respect to those chunks (for example a short routine may fit entirely in one chunk, or be split between two).

David's got the right idea, by adding code to align P% before each subroutine, but he's only aligning to 4 bytes (which I don't think is very significant in respect of code, although of course it can be for data) rather than the 32 bytes that is necessary to eliminate alignment as a factor.
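As a rough illustration of the point about alignment, the same P% adjustment idiom used in the timing program can be widened from 4 to 32 bytes. This is a sketch only: the labels first% and second% and the trivial routine bodies are placeholders, and the buffer size is simply chosen to be large enough for the padding.

```bbcbasic
REM Sketch only: routine bodies and label names are placeholders.
DIM code% 255
FOR I% = 0 TO 2 STEP 2
  P% = code%
  [OPT I%
  ] : P% = (P% + 31) AND -32 : [OPT I%
  .first%
  xor eax, eax
  ret
  ] : P% = (P% + 31) AND -32 : [OPT I%
  .second%
  mov eax, 1
  ret
  ]
NEXT I%
REM Both values should print as 0 if each routine starts on a 32-byte boundary.
PRINT first% MOD 32, second% MOD 32
```

Rounding up with (P% + 31) AND -32 can waste up to 31 bytes before each routine, so the code buffer needs to allow for that.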

And as I said before, even when attempting to eliminate all confounding factors I still found that the code which ran the faster was always the routine which was tested second, and that applied both to my old P4 and an AMD Athlon 64.
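One way to look for this kind of order effect is to time two byte-for-byte identical routines in both orders: any consistent difference must then come from the test conditions rather than from the code. The following is a minimal sketch under that assumption. FNtime, rtnA%, rtnB% and the loop counts are hypothetical and are not part of the timing program posted earlier in the thread.

```bbcbasic
REM Sketch only: names and loop counts are assumptions.
DIM code% 255
FOR I% = 0 TO 2 STEP 2
  P% = code%
  [OPT I%
  ] : P% = (P% + 31) AND -32 : [OPT I%
  .rtnA%
  mov ecx, 1000000
  .loopA%
  sub ecx, 1
  jnz loopA%
  ret
  ] : P% = (P% + 31) AND -32 : [OPT I%
  .rtnB%
  mov ecx, 1000000
  .loopB%
  sub ecx, 1
  jnz loopB%
  ret
  ]
NEXT I%

REM Time the identical routines in both orders.
PRINT "A then B: "; FNtime(rtnA%, 500); " s, "; FNtime(rtnB%, 500); " s"
PRINT "B then A: "; FNtime(rtnB%, 500); " s, "; FNtime(rtnA%, 500); " s"
END

REM Hypothetical helper: seconds taken to CALL addr% reps% times.
DEF FNtime(addr%, reps%)
LOCAL I%, t0%, t1%
SYS "GetTickCount" TO t0%
FOR I% = 1 TO reps%
  CALL addr%
NEXT
SYS "GetTickCount" TO t1%
= (t1% - t0%) / 1000
```

Since GetTickCount only ticks every several milliseconds, the loop counts are chosen so that each measurement should run for a noticeable fraction of a second; they may need adjusting on other machines.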

So I think my 'meaningless' comment was justified, given the number of factors you didn't consider.

Richard.

admin
« Reply #26 on: Oct 9th, 2011, 3:36pm »

on Oct 9th, 2011, 3:09pm, David Williams wrote:

Okay, with Richard's correction, the fastest non-MMX/non-SIMD code yet is indeed:
Code:

        mov eax, [edi + 4 * esi]
        movzx ebx, ah
        movzx ecx, al
        shr eax,16

I assume you appreciate that this code doesn't zero AH, so if your original pixel was indeed xRGB (rather than 0RGB) then eax will end up containing 'xR' not just 'R'.
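The 'xR' effect can be seen without any assembler by doing the same shifts in BASIC. This is only an illustration: the pixel value is a made-up example with a non-zero top byte, and the variable names are arbitrary. In the assembler version, something like an extra and eax, &FF after the shift would be one way to do the equivalent masking.

```bbcbasic
REM Sketch only: the pixel value is a made-up example with a non-zero top byte.
pixel% = &80C86432 : REM x = &80, R = &C8, G = &64, B = &32

B% = pixel% AND &FF
G% = (pixel% >>> 8) AND &FF
unmasked% = pixel% >>> 16 : REM what a 16-bit right shift alone leaves: 'xR'
R% = unmasked% AND &FF : REM masking removes the 'x' byte

PRINT "Unmasked: &"; ~unmasked%
PRINT "R, G, B : &"; ~R%; ", &"; ~G%; ", &"; ~B%
```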

On my P4, Test#1 (three MOVZX instructions) is the fastest. That doesn't surprise me, because the first MOVZX will load all 4 bytes into the L1 cache so the remaining two can execute as quickly as if they were loading from registers. It ends up faster because of there being no need to do the SHR (and AH is zeroed for free):

Code:

 Test #1 (three MOVZX instructions) took 2.719 s.
 Test #2 (one MOV instruction) took 3.953 s.
 Test #3 (one MOV instruction; other instructions re-ordered) took 3.828 s.
 Test #4 (one MOV instruction; Michael's solution (no XORs)) took 2.953 s.
 Test #5 (one MOV instruction; Michael's solution (with XORs)) took 3.266 s.
 Test #6 (one MOV instruction; Richard's solution) took 2.812 s.

Richard.

« Last Edit: Oct 9th, 2011, 3:38pm by admin »

David Williams
« Reply #27 on: Oct 9th, 2011, 4:14pm »

on Oct 9th, 2011, 3:36pm, Richard Russell wrote:

I assume you appreciate that this code doesn't zero AH, so if your original pixel was indeed xRGB (rather than 0RGB) then eax will end up containing 'xR' not just 'R'.

I'd like to make the unsafe assumption that the source ARGB32 bitmap will always have all the MSB bytes clear, but that probably won't be the case.
Having to clear that byte seems to spoil the beauty of the code somewhat! And slows it down, of course. It'll have to be done, though.

on Oct 9th, 2011, 3:36pm, Richard Russell wrote:

On my P4, Test#1 (three MOVZX instructions) is the fastest.

(Sigh.) Well, that just goes to show, doesn't it!

on Oct 9th, 2011, 3:36pm, Richard Russell wrote:

That doesn't surprise me, because the first MOVZX will load all 4 bytes into the L1 cache so the remaining two can execute as quickly as if they were loading from registers. It ends up faster because of there being no need to do the SHR (and AH is zeroed for free):

In light of the 'new' code (getting those R, G, B values into registers), I was going to set about modifying a dozen-or-so GFXLIB routines to rid them of those MOVZX byte-loading instructions. Now it looks like I ought not to bother!

on Oct 9th, 2011, 3:36pm, Richard Russell wrote:

Code:

 Test #1 (three MOVZX instructions) took 2.719 s.
 Test #2 (one MOV instruction) took 3.953 s.
 Test #3 (one MOV instruction; other instructions re-ordered) took 3.828 s.
 Test #4 (one MOV instruction; Michael's solution (no XORs)) took 2.953 s.
 Test #5 (one MOV instruction; Michael's solution (with XORs)) took 3.266 s.
 Test #6 (one MOV instruction; Richard's solution) took 2.812 s.

Richard.

Assuming that those FOR...NEXT loops still had 5000 iterations each (as in the original), I'm curious that your presumably ancient P4 has trounced my 1.83 GHz Intel Centrino Duo-based laptop (in this test). Even if your clock speed was 3 GHz, that's still interesting.

David.
