BBC BASIC for Windows

BBC BASIC for Windows

General

General Board (Moderator: admin)

GFXLIB

« Previous Topic | Next Topic »

Pages: 1 ... 4 5 6 7 8 9

Author

Topic: GFXLIB (Read 2270 times)

David Williams
Developer

member is offline
Avatar

meh

Gender:

Posts: 452

Re: GFXLIB
« Reply #76 on: May 25^th, 2009, 12:42am »

A quick demo of a new routine called PlotBMColumn (plots a single 1-pixel-wide column of pixels from a bitmap):

http://www.bb4w-games.com/138519651/gfxlib_vplot_demo.zip (500 Kb)

If it seems a bit sluggish then bear in mind that the BB4W interpreter is doing a lot of work!

This routine will form the basis of several other routines.

« Last Edit: Jan 20^th, 2012, 11:39pm by David Williams »

Logged

David Williams
Developer

member is offline
Avatar

meh

Gender:

Posts: 452

Re: GFXLIB
« Reply #77 on: May 28^th, 2009, 02:48am »

GFXLIB's alpha blending routines are set to become a hell of a lot faster thanks to this sweet bit of
code I recently discovered on Avery Lee's VirtualDub site (http://tinyurl.com/obqpyt):

Code:

unsigned blend2(unsigned src, unsigned dst) {
unsigned alpha = src >> 24;
alpha += (alpha > 0);

unsigned srb = src & 0xff00ff;
unsigned sg = src & 0x00ff00;
unsigned drb = dst & 0xff00ff;
unsigned dg = dst & 0x00ff00;

unsigned orb = (drb + (((srb - drb) * alpha + 0x800080) >> 8)) & 0xff00ff;
unsigned og = (dg + (((sg - dg ) * alpha + 0x008000) >> 8)) & 0x00ff00;

return orb+og;
}

It works very well, and my ASM implementation of the above code is a big improvement over how
GFXLIB's routines currently perform the task (although here the code is modified to work with
a constant alpha value, rather than a per-pixel one as is done in the original code):

Code:

        .blend
        
        ;REM. ESP+4  -> src pxl (RGB32)
        ;REM. ESP+8  -> dst pxl (RGB32)
        ;REM. ESP+12 -> alpha (0-255)
        
        mov ebp, [esp + 12]          ; alpha value (0 to 255)
        mov esi, [esp + 4]           ; src pxl &xxRRGGBB
        mov edi, [esp + 8]           ; dest pxl &xxRRGGBB
        
        mov eax, esi                 ; copy ESI
        and eax, &FF00FF             ; EAX = srb
        and esi, &00FF00             ; ESI = sg
        
        mov edx, edi                 ; copy EDI
        and edi, &FF00FF             ; EDI = drb
        and edx, &00FF00             ; EDX = dg
        
        ;REM. EAX = srb
        ;REM. ESI = sg
        ;REM. EDI = drb
        ;REM. EDX = dg
        
        sub eax, edi                 ; srb - drb
        sub esi, edx                 ; sg - dg
        
        imul eax, ebp                ; (srb - drb)*alpha
        imul esi, ebp                ; (sg - dg)*alpha
        
        add eax, &800080             ; (srb - drb)*alpha + &800080
        add esi, &008000             ; (sg - dg)*alpha + &008000
        
        shr eax, 8                   ; ((srb - drb)*alpha + &800080) >> 8
        shr esi, 8                   ; ((sg - dg)*alpha + &008000) >> 8
        
        add eax, edi                 ; drb + ((srb - drb)*alpha + &800080) >> 8
        add esi, edx                 ; dg + ((sg - dg)*alpha + &008000) >> 8
        
        and eax, &FF00FF             ; (drb + ((srb - drb)*alpha + &800080) >> 8) AND &FF00FF
        and esi, &00FF00             ; (dg + ((sg - dg)*alpha + &008000) >> 8) AND &00FF00
        
        add eax, esi
        
        ret 12

If anyone can spot any optimisations that can be made, perhaps shaving off an instruction or two,
then please let me know!

Regards,

David.

« Last Edit: May 28^th, 2009, 02:50am by David Williams »

Logged

David Williams
Developer

member is offline
Avatar

meh

Gender:

Posts: 452

Re: GFXLIB
« Reply #78 on: Jun 2^nd, 2009, 11:16pm »

More trivial nonsense:

~~http://www.bb4w-games.com/138519651/example48.zip~~ (529Kb)

« Last Edit: Jan 20^th, 2012, 11:40pm by David Williams »

Logged

admin
Administrator

member is offline

Posts: 1145

Re: GFXLIB
« Reply #79 on: Jun 3^rd, 2009, 08:21am »

Quote:

More trivial nonsense:
http://www.bb4w-games.com/138519651/example48.zip (529Kb)

Very nice.

Richard.

Logged

David Williams
Developer

member is offline
Avatar

meh

Gender:

Posts: 452

Re: GFXLIB
« Reply #80 on: Jun 3^rd, 2009, 5:45pm »

Example 49:

~~http://www.bb4w-games.com/138519651/example49.zip~~ (557Kb)

Just demo'ing a few new routines, namely:

GFXLIB_BrushBlur
GFXLIB_BPlotBMRowList
GFXLIB_PlotBMColumnList
GFXLIB_Line

David.

« Last Edit: Jan 20^th, 2012, 11:40pm by David Williams »

Logged

David Williams
Developer

member is offline
Avatar

meh

Gender:

Posts: 452

Re: GFXLIB
« Reply #81 on: Jun 6^th, 2009, 8:01pm »

Another day, another new routine for an already bloated GFXLIB. This one's called GFXLIB_RotateScaleTile:

~~http://www.bb4w-games.com/138519651/rotatescaletile.zip~~

It uses a quite fast method of rotating a bitmap, although my code is far from optimal. Still, this
demo averages only 16% CPU load on my laptop which is rather good in my opinion.

David.

« Last Edit: Jan 20^th, 2012, 11:41pm by David Williams »

Logged

Michael Hutton
Developer

member is offline
Avatar

Gender:

Posts: 248

Re: GFXLIB
« Reply #82 on: Jun 7^th, 2009, 6:41pm »

Another great routine! When is the next version of GFXLIB coming out?

Michael

Logged

David Williams
Developer

member is offline
Avatar

meh

Gender:

Posts: 452

Re: GFXLIB
« Reply #83 on: Jun 7^th, 2009, 8:34pm »

on Jun 7^th, 2009, 6:41pm, Michael Hutton wrote:

Another great routine! When is the next version of GFXLIB coming out?

Thanks, Michael.

The next version (i.e., the second publicly released version) will probably be out
in a few months; plenty of work to do until then.

Currently trying to get an MMX-powered 'alpha blending' routine up and running.

David.

Logged

admin
Administrator

member is offline

Posts: 1145

Re: GFXLIB
« Reply #84 on: Jun 7^th, 2009, 9:58pm »

Quote:

Currently trying to get an MMX-powered 'alpha blending' routine up and running.

I presume you are aware of (and are probably using) this document:

ftp://download.intel.com/ids/mmx/MMX_App_Alpha_Blending.pdf

Richard.

Logged

David Williams
Developer

member is offline
Avatar

meh

Gender:

Posts: 452

Re: GFXLIB
« Reply #85 on: Jun 7^th, 2009, 10:31pm »

on Jun 7^th, 2009, 9:58pm, Richard Russell wrote:

I presume you are aware of (and are probably using) this document:

ftp://download.intel.com/ids/mmx/MMX_App_Alpha_Blending.pdf

Yes, thanks, I've downloaded that document twice now... first time was some months ago,
and then again a few days ago. I was put off by the fact that it 'outputs' 16-bit 5:5:5 colour
values, rather than 32-bit 8:8:8:8.

I'm trying to adapt the alpha-blending code I posted here a few days ago -- it should be easy,
or at least it will be when I've worked out how to do MMX multiplies.

David.

Logged

admin
Administrator

member is offline

Posts: 1145

Re: GFXLIB
« Reply #86 on: Jun 8^th, 2009, 09:10am »

Quote:

I was put off by the fact that it 'outputs' 16-bit 5:5:5 colour values, rather than 32-bit 8:8:8:8.

To be precise, both the 'background' input and the output are 16-bpp, whereas the 'foreground' input is 32-bpp ARGB. No doubt this is because, when that article was written, 16-bpp was a common setting for PC displays.

It's not too difficult to strip out the code that unpacks and packs the 16-bpp pixels and replace it with code for 32-bpp RGB (with the 'alpha' byte unused). The resulting code is simpler, too.

Later: I've modified the Intel code to work with 32-bpp input and output. On my PC it's taking 2.2ms for a 640x480 image. Are you interested in it, or have you got your own working?

Richard.

« Last Edit: Jun 8^th, 2009, 1:46pm by admin »

Logged

David Williams
Developer

member is offline
Avatar

meh

Gender:

Posts: 452

Re: GFXLIB
« Reply #87 on: Jun 8^th, 2009, 4:05pm »

on Jun 8^th, 2009, 09:10am, Richard Russell wrote:

Later: I've modified the Intel code to work with 32-bpp input and output. On my PC it's taking 2.2ms for a 640x480 image. Are you interested in it... ?

In one word: Yes !

I am definately interested in it, thank you.

Now...

I recently knocked-up an MMX version of a supposedly optimised '50%' alpha blender (simply averages
the RGB32 colour values of corresponding foreground and background pixels). Here's the inner loop
from the non-MMX version:

Code:

        mov edx, [edi + 4*esi]                  ; load RGB32 pixel from source bitmap
        mov ebx, [ecx + 4*esi]                  ; load RGB32 pixel from dest addr
        and edx, &FEFEFE
        and ebx, &FEFEFE
        shr edx, 1
        shr ebx, 1
        add edx, ebx
        mov [ecx + 4*esi], edx                  ; write RGB32 pixel to destination bitmap buffer

Here's my MMX version, operating on four pixels per iteration of the inner X-loop:

Code:

        .GFXLIB_MMXBPlotAvgNC__xloop
        
        mov ebx, ecx
        shl ebx, 4
        
        movq mm1, [edi + ebx + 0]             ; load 2 pxls from bg      \
        movq mm2, [esi + ebx + 0]             ; load 2 pxls from srcBm    \
        ;                                     ;                            > 4 pixels
        movq mm3, [edi + ebx + 8]             ; load 2 pxls from bg       /
        movq mm4, [esi + ebx + 8]             ; load 2 pxls from srcBm   /
        
        pand mm1, mm0
        pand mm2, mm0
        pand mm3, mm0
        pand mm4, mm0
        
        psrld mm1, 1
        psrld mm2, 1
        psrld mm3, 1
        psrld mm4, 1
        
        paddd mm1, mm2
        paddd mm3, mm4
        
        movq [edi + ebx + 0], mm1
        movq [edi + ebx + 8], mm3
        
        dec ecx
        jge GFXLIB_MMXBPlotAvgNC__xloop

To my amazement, it's really no faster (or just marginally so) than the non-MMX version.

I was expecting something approaching a 2x speed improvement. :-(

David.

Logged

admin
Administrator

member is offline

Posts: 1145

Re: GFXLIB
« Reply #88 on: Jun 8^th, 2009, 5:35pm »

Quote:

To my amazement, it's really no faster (or just marginally so) than the non-MMX version.

I'm amazed that you were amazed! Your code really takes no advantage of the MMX features at all (e.g. parallel operation on up to 4 independent values, automatic clipping of results etc.). Really all you're doing is using 64-bit wide registers rather than 32-bit wide registers.

In addition to this, the operations you are performing (basically just an AND and a SHIFT) are so simple the likelihood is that the speed is determined largely by the bottleneck of getting the data out of and into memory, and not by the processing. Since exactly the same amount of data is transferred in each case, it's not surprising the speed is similar.

(Incidentally I assume you ensured your data was always QWORD-aligned; if it isn't the MMX method will be substantially slower than it otherwise would be).

Quote:

I am definately interested in it, thank you.

Here's my alphablending code. I would hope and expect it to be substantially faster than a non-MMX version:

Code:

        ;
        ; eax = pointer to array of 32-bit ARGB pixels ('foreground')
        ; ebx = pointer to array of 32-bit xRGB pixels ('background')
        ; ecx = pixel count DIV 2
        ;
        .roundf  dd &00800080 : dd &00000080
        ;
        .blend
        mov esi, eax       ; foreground pointer
        mov edi, ebx       ; background pointer
        
        movq mm4, [roundf] ; mm4 = 0000 0080 0080 0080 (rounding factor)
        pxor mm5, mm5      ; mm5 = 0000 0000 0000 0000
        
        .blendloop
        movq mm6, [esi]    ; mm6 = a2r2 g2b2 a1r1 g1b1 (foreground)
        movq mm7, [edi]    ; mm7 = xxR2 G2B2 xxR1 G1B1 (background)
        
        movq mm0, mm6      ; mm0 = xxxx xxxx a1r1 g1b1
        movq mm2, mm7      ; mm2 = xxxx xxxx xxR1 G1B1
        
        punpcklbw mm0, mm5 ; mm0 = 00a1 00r1 00g1 00b1 (p1)
        punpcklbw mm2, mm5 ; mm2 = 00xx 00R1 00G1 00B1 (q1)
        
        movq mm1, mm0      ; mm1 = 00a1 xxxx xxxx xxxx
        punpckhwd mm1, mm1 ; mm1 = 00a1 00a1 xxxx xxxx
        punpckhdq mm1, mm1 ; mm1 = 00a1 00a1 00a1 00a1
        
        psubw mm0, mm2     ; mm0 = p1 - q1
        psllw mm2, 8       ; mm2 = q1 * 256
        paddw mm2, mm4     ; mm2 = q1 * 256 + 128
        
        pmullw mm0, mm1    ; mm0 = (p1 - q1) * a1
        paddw mm2, mm0     ; mm2 = (p1 - q1) * a1 + q1 * 256 + 128
        psrlw mm2, 8       ; mm2 = xxxx 00R1 00G1 00B1
        
        psrlq mm6, 32      ; mm6 >>= 32 for second pixel
        psrlq mm7, 32      ; mm7 >>= 32 for second pixel
        
        movq mm0, mm6      ; mm0 = xxxx xxxx a2r2 g2b2
        movq mm3, mm7      ; mm3 = xxxx xxxx xxR2 G2B2
        
        punpcklbw mm0, mm5 ; mm0 = 00a2 00r2 00g2 00b2 (p2)
        punpcklbw mm3, mm5 ; mm3 = 00xx 00R2 00G2 00B2 (q2)
        
        movq mm1, mm0      ; mm1 = 00a2 xxxx xxxx xxxx
        punpckhwd mm1, mm1 ; mm1 = 00a2 00a2 xxxx xxxx
        punpckhdq mm1, mm1 ; mm1 = 00a2 00a2 00a2 00a2
        
        psubw mm0, mm3     ; mm0 = p2 - q2
        psllw mm3, 8       ; mm3 = q2 * 256
        paddw mm3, mm4     ; mm3 = q2 * 256 + 128
        
        pmullw mm0, mm1    ; mm0 = (p2 - q2) * a2
        paddw mm3, mm0     ; mm3 = (p2 - q2) * a2 + q2 * 256 + 128
        psrlw mm3, 8       ; mm3 = xxxx 00R2 00G2 00B2
        
        packuswb mm2, mm3  ; mm2 = xxR2 G2B2 xxR1 G1B1 (result)
        movq [edi], mm2    ; save result
        mov byte [edi+3],0 ; zero 'alpha' of pixel 1 (if necessary)
        mov byte [edi+7],0 ; zero 'alpha' of pixel 2 (if necessary)
        
        add esi, 8         ; next pixel-pair (foreground)
        add edi, 8         ; next pixel-pair (background)
        loop blendloop
        
        emms
        ret

I haven't done any extensive 'pairing' optimisation so there's a possibility it could be made more efficient.

Richard.

Logged

David Williams
Developer

member is offline
Avatar

meh

Gender:

Posts: 452

Re: GFXLIB
« Reply #89 on: Jun 9^th, 2009, 06:24am »

on Jun 8^th, 2009, 5:35pm, Richard Russell wrote:

I thought I was operating on two pixels in parallel (four pixels per iteration of the inner loop); the operations on the second pixel of each pair being almost 'free'.

on Jun 8^th, 2009, 5:35pm, Richard Russell wrote:

(Incidentally I assume you ensured your data was always QWORD-aligned; if it isn't the MMX method will be substantially slower than it otherwise would be).

No, I didn't! But I sure as heck do now. I've modified GFXLIB's bitmap loading routine to ensure that the start address of the data is divisible by 8. I notice that the DIB section ('screen memory') base address always seems to be QWORD-aligned, too. So, all data is now QWORD-aligned.

on Jun 8^th, 2009, 5:35pm, Richard Russell wrote:

Here's my alphablending code. I would hope and expect it to be substantially faster than a non-MMX version:

Beautiful... thanks. I've done some timing and frame rate tests involving your MMX routine, a more recent non-MMX routine that I had believed was fast and not far from being optimal, and a what-I-thought-was the stinky old GFXLIB alphablend routine (GFXLIB_BPlotAlphaBlend2). The results are surprising and a bit puzzling.

Here are the timing results of alphablending two 640x512 32bpp bitmaps, 1000 blends (times given in seconds):

Your MMX routine: 2.4
New supposedly fast non-MMX routine: 5.6
'Old' GFXLIB_BPlotAlphaBlend2 routine: 3.0

So, MMX wins. But... it doesn't seem to be that much faster than GFXLIB_BPlotAlphaBlend2.

Compare the code from the inner loop of the new non-MMX alphablend routine with that of the old GFXLIB_BPlotAlphaBlend2:

New non-MMX

Code:

        ._esp dd 0         ; will be temporarily stashing ESP in here
        
        .blend2
        
        ; EAX = fg
        ; EBX = bg
        ; ECX = pixel count
        
        mov [_esp], esp               ; naughty ;-)
        
        mov esp, eax                  ; ESP = fg
        
        .blend2_lp
        mov esi, [esp + 4*ecx - 4]    ; fg pxl (ARGB)
        mov edi, [ebx + 4*ecx - 4]    ; bg pxl (xRGB)
        
        mov ebp, esi                  ; extract alpha value from src ARGB pxl
        ; EBP is shifted (>> 24) later (aids pipelining ?)

        mov eax, esi                 ; copy ESI
        and eax, &FF00FF             ; EAX = srb
        and esi, &00FF00             ; ESI = sg
        
        mov edx, edi                 ; copy EDI
        and edi, &FF00FF             ; EDI = drb
        and edx, &00FF00             ; EDX = dg
        
        shr ebp, 24
        adc ebp, 0
        
        ;REM. EAX = srb
        ;REM. ESI = sg
        ;REM. EDI = drb
        ;REM. EDX = dg
        
        sub eax, edi                 ; srb - drb
        sub esi, edx                 ; sg - dg
        
        imul eax, ebp                ; (srb - drb)*alpha
        imul esi, ebp                ; (sg - dg)*alpha
        
        add eax, &800080             ; (srb - drb)*alpha + &800080
        add esi, &008000             ; (sg - dg)*alpha + &008000
        
        shr eax, 8                   ; ((srb - drb)*alpha + &800080) >> 8
        shr esi, 8                   ; ((sg - dg)*alpha + &008000) >> 8
        
        add eax, edi                 ; drb + ((srb - drb)*alpha + &800080) >> 8
        add esi, edx                 ; dg + ((sg - dg)*alpha + &008000) >> 8
        
        and eax, &FF00FF             ; (drb + ((srb - drb)*alpha + &800080) >> 8) AND &FF00FF
        and esi, &00FF00             ; (dg + ((sg - dg)*alpha + &008000) >> 8) AND &00FF00
        
        add eax, esi
        
        mov [ebx + 4*ecx - 4], eax   ; write alpha-blended pixel
        
        loop blend2_lp
        
        mov esp, [_esp]
        
        ret

Old GFXLIB_BPlotAlphaBlend2

Code:

        .GFXLIB_BPlotAlphaBlend2__xloop

        movzx edx, BYTE [edi + 4*esi + 3]       ; load alpha mask byte
        neg edx
        add edx, 255
        imul edx, (1.0/255.0)*(2^20)            ; = mulfac
        
        movzx eax, BYTE [edi + 4*esi + 2]       ; load src bmp Red byte
        movzx ebx, BYTE [ecx + 4*esi + 2]       ; load dst bmp Red byte
        sub ebx, eax                            ; (dst - src)
        imul ebx, edx                           ; mulfac*(dst - src)
        shr ebx, 20                             ; (mulfac*(dst - src)) >> 20
        add eax, ebx
        mov BYTE [ecx + 4*esi + 2], al
        
        movzx eax, BYTE [edi + 4*esi + 1]       ; load src bmp Green byte
        movzx ebx, BYTE [ecx + 4*esi + 1]       ; load dst bmp Green byte
        sub ebx, eax                            ; (dst - src)
        imul ebx, edx                           ; mulfac*(dst - src)
        shr ebx, 20                             ; (mulfac*(dst - src)) >> 20
        add eax, ebx
        mov BYTE [ecx + 4*esi + 1], al
        
        movzx eax, BYTE [edi + 4*esi + 0]       ; load src bmp Blue byte
        movzx ebx, BYTE [ecx + 4*esi + 0]       ; load dst bmp Blue byte
        sub ebx, eax                            ; (dst - src)
        imul ebx, edx                           ; mulfac*(dst - src)
        shr ebx, 20                             ; (mulfac*(dst - src)) >> 20
        add eax, ebx
        mov BYTE [ecx + 4*esi + 0], al
        
        dec esi                                 ; X -= 1
        jge GFXLIB_BPlotAlphaBlend2__xloop      ; loop if X >= 0

How could GFXLIB_BPlotAlphaBlend2 possibly be faster than the new non-MMX routine? I'd like to know where the bottleneck lies with the new non-MMX routine.

BPlotAlphaBlend2 has 10 memory accesses (7 reads, 3 writes) per pixel.

New non-MMX has just 3 memory accesses (2 reads, 1 write) per pixel.

BPlotAlphaBlend2 has 4 multiply instructions (IMUL) per pixel. New non-MMX has 2.

Both routines have roughly the same number of instructions (26) in the inner loop.

ADC

The ADC instruction in the non-MMX routine is quite an execution speed killer, it would seem. Removing the ADC ebp,0 instruction (which is probably largely superfluous) lops off an incredible 2 seconds! So the new league table is as follows:

Your MMX routine: 2.4
'Old' GFXLIB_BPlotAlphaBlend2 routine: 3.0
New non-MMX routine (without ADC instruction): 3.5
New non-MMX routine: 5.6

Still, GFXLIB_BPlotAlphaBlend2 is the faster non-MMX routine.

Timing and frame rate test programs (executables)

The frame rate test programs are largely identical, except a different alphablending routine is used in each case.

Richard's MMX routine (averages 226 fps on my PC):
~~http://www.bb4w-games.com/138519651/alphablend_mmx.zip~~ (1 MB)

'Old' GFXLIB_BPlotAlphaBlend2 routine (averages 206 fps on my PC):
~~http://www.bb4w-games.com/138519651/alphablend_bplotalphablend2.zip~~ (1 MB)

New non-MMX routine (averages 127 fps on my PC):
~~http://www.bb4w-games.com/138519651/alphablend_non-mmx.zip~~ (1 MB)

New non-MMX routine without ADC instruction (averages 174 fps on my PC):
~~http://www.bb4w-games.com/138519651/alphablend_non-mmx_no-adc.zip~~ (1 MB)

Timing tests:

~~http://www.bb4w-games.com/138519651/alphablend_timingtest_2.zip~~ (1 MB)

By the way, I've been repeatedly referring to the 'New' non-MMX routine -- I don't intend to replace BPlotAlphaBlend2 with it until I can figure out (or someone else can tell me) where the bottleneck lies with it. Just to reiterate, New non-MMX *should* be faster than BPlotAlphaBlend2, but it isn't.

Anyway, Richard, your MMX-powered alphablending routine would be a fine and very welcome addition to GFXLIB (if I may !?).

Thanks again.

David.

« Last Edit: Jan 18^th, 2012, 5:08pm by David Williams »

Logged

admin
Administrator

member is offline

Posts: 1145

Re: GFXLIB
« Reply #90 on: Jun 9^th, 2009, 08:52am »

Quote:

I thought I was operating on two pixels in parallel

Because of the processor's superscalar architecture, it's entirely possible your non-MMX version was processing "two pixels in parallel" too. Anyway, as I said, they may both have been memory-bandwidth bound, in which case the speed would be much the same.

Quote:

The ADC instruction in the non-MMX routine is quite an execution speed killer, it would seem.

As I expect you appreciate, it's not that ADC is particularly slow but it's because of the dependencies it introduces. You have to think of the effect of the instruction not on its own, but in the context of the instructions which surround it.

In this particular case the main significance is that since adc ebp,0 depends on the state of the carry flag, and since the preceding shr ebp,24 affects the carry flag, the two instructions are forced to be serialised and run on the same execution unit.

I expect that, when you remove the ADC, it gives the processor more opportunity to take advantage of out-of-order execution, and of scheduling instructions on the different execution units. Therefore the speed improvement is disproportionate to the time taken by the ADC in isolation.

It's for this kind of reason that modern compilers can sometimes beat 'hand assembly' for speed. They understand (better than the average human programmer!) the internal architecture of the CPU, and when it can be of benefit to change the sequence of instructions to improve performance even if the clarity of the code suffers.

Whether a similar issue explains the 'anomalous' speed difference between your two non-MMX versions I can't say, but it may do.

Quote:

Anyway, Richard, your MMX-powered alphablending routine would be a fine and very welcome addition to GFXLIB (if I may !?).

Of course you may, with the appropriate grovelling acknowledgement!!

Richard.

Logged

Pages: 1 ... 4 5 6 7 8 9


« Previous Topic \| Next Topic »