BBC BASIC for Windows - Letting GCC do the hard work

BBC BASIC for Windows

Programming

Assembly Language Programming (Moderator: admin)

Letting GCC do the hard work

« Previous Topic | Next Topic »

Pages: 1

Author

Topic: Letting GCC do the hard work (Read 1773 times)

David Williams
Developer

member is offline
Avatar

meh

Gender:

Posts: 452

Re: Letting GCC do the hard work
« Reply #2 on: Aug 2^nd, 2014, 10:36pm »

on Aug 2^nd, 2014, 9:11pm, Richard Russell wrote:

Code:

      fucom st1 : push eax : fstsw ax : sahf : pop eax ;REM fucomi st0, st1
      fucomp st1 : push eax : fstsw ax : sahf : pop eax ;REM fucomip st0, st1

Thanks, noted. I can perhaps see why fucomi & fucomip are employed in the fsgn() function (returns as an integer the sign of a double), but they crop up again outside that function, in the inner loop, where no comparisons (in the source code) are performed (loop iterators excepted). GCC knows best, I suppose. It's quite brain-melting, I find, trying to understand GCC's generated assembler code.

I haven't yet substituted your equivalent encodings for fucomi & fucomip, but I'm a bit curious to see the likely time penalty (not that it matters because the code is not intended for real-time use; I'm generating frames for a YouTube video).

For completeness, for those that might be curious, I'll include my 'wet-behind-the-ears' C source below.

I think I'll be doing a fair amount of hybrid C/BB4W stuff over the next year or so, time permitting.

Thanks again.

David.

Code:

/*
// A beginner's attempt at some C
//
// Not intended to be run 'standalone' (doesn't do anything)
// Intended that the resulting assembler code be exported to BB4W
*/ 

/* might be worth implementing fsgn() inline */
int fsgn(double n){
	if (n > 0.0) return 1;
	if (n < 0.0) return -1;
	return 0;
}

void Wave(int W, int H, double *hMap, double *hMap2, double *vMap,
          double *scale, double *hDamp){
	
	/* W and H are the grid/map width and height respectively */
	
	int x, y, yp1w, ym1w, yw, ywpx, xm1, xp1;
	double hSum, sumHDiff, h, v;

	hSum = 0.0;
	
	for (y=1; y<H-2; y++){

	    /* pre-calc. some frequently accessed values (array indices) */
		yp1w = (y+1)*W;
		ym1w = (y-1)*W;
		yw = y*W;

		for (x=1; x<W-2; x++){

		    /* a few more pre-calculated array indices */
			xm1 = x-1;
			xp1 = x+1;
			ywpx = yw + x;

			/* 
			// calculate the sum of the 8 height values
			// surrounding the current x,y position in the height map (hMap)
			*/
			hSum = hMap[ yp1w + xm1 ]
			     + hMap[ yp1w + x   ]		 			
			     + hMap[ yp1w + xp1 ]		 			
			     + hMap[ yw   + xm1 ]		 			
			     + hMap[ yw   + xp1 ]	 			
			     + hMap[ ym1w + xm1 ]		 			
			     + hMap[ ym1w + x   ]		 			
			     + hMap[ ym1w + xp1 ];

			/* get height value at current x,y position */	 
			h = hMap[ ywpx ];

			/* calculate the sum of the height differences */ 			
			sumHDiff = hSum - 8*h; 

			/* retrieve and update the 'velocity' (i.e. change in height) */
			v = vMap[ ywpx ] + (*scale)*sumHDiff;

			/* store updated 'velocity' */
			vMap[ ywpx ] = v;			

			/* update the height, apply damping, and store it in hMap2 */	
			hMap2[ ywpx ] = h+v - fsgn(h+v)*(*hDamp);
			
		}
	}

	return;
}

main(){
	return;
}

Logged

rtr
Guest

Re: Letting GCC do the hard work
« Reply #3 on: Aug 3^rd, 2014, 10:19am »

on Aug 2^nd, 2014, 10:36pm, David Williams wrote:

I can perhaps see why fucomi & fucomip are employed in the fsgn() function (returns as an integer the sign of a double), but they crop up again outside that function, in the inner loop

I think you're misreading the generated assembler code. The fsgn function is never called and can be deleted without affecting the operation of your program! The use of fucomi and fucomip in the 'inner loop' are where GCC has automatically inlined the fsgn code, for performance reasons.

Incidentally this is a common C implementation of the signum function:

Code:

int fsgn(double val) {return (val > 0.0) - (val < 0.0);}

It relies on the fact that comparisons return 1 for true and 0 for false, which is guaranteed. I don't know whether the generated assembler code will be any simpler than for your version.

Quote:

I haven't yet substituted your equivalent encodings for fucomi & fucomip, but I'm a bit curious to see the likely time penalty

In human-written assembler code one would try to avoid the need to save and restore eax, which makes the overhead that much greater.

There is an argument that ASMLIB should have included fucomi and the other comparison instructions that were added to the Pentium Pro, but nobody has ever commented on the omission, or asked me to correct it.

Richard.

Logged

David Williams
Developer

member is offline
Avatar

meh

Gender:

Posts: 452

Re: Letting GCC do the hard work
« Reply #4 on: Aug 10^th, 2014, 8:49pm »

I'm pleased to say that, with the WavePlasma6 program, when the C source code (containing the subroutine that does the calculations - 'Wave') is compiled to a DLL, the performance really doesn't take much of a hit (both versions give around 80 to 90 fps on my laptop).

The only explicit GCC optimisation switch I'm specifying is -O2, but if anyone can suggest any others then please go ahead. (Without the -O2 switch, it runs at around 40 fps on my laptop - so nearly half the speed.)

For interested parties, the "DLL version" of WavePlasma6 is listed below, and the DLL itself (wave1d.dll) can be downloaded from here:

www.bb4wgames.com/temp/wave1d_dll.zip

Code:

      REM WavePlasma6b // 10-08-2014
      REM
      REM Requires wave1d.dll

      *ESC OFF
      *FLOAT 64

      ON CLOSE PROC_clean_up : QUIT
      ON ERROR PROC_clean_up : PROCError(REPORT$ + " at line " + STR$ERL)

      HIMEM = PAGE + 10*&100000

      SYS "LoadLibrary", "wave1d" TO wave1d_dll%
      IF wave1d_dll% = 0 PROCError("Can't load wave1d.dll (LoadLibrary returned 0)")
      SYS "GetProcAddress", wave1d_dll%, "Wave" TO Wave%
      IF Wave% = 0 PROC_clean_up : PROCError("Couldn't import the DLL function 'Wave' (GetProcAddress returned 0)")

      R% = RND(-50681821) : REM Seed BB4W's PRNG

      Delay% = TRUE

      ScrW% = 512
      ScrH% = 512
      PROCFixWndSz
      VDU 23,22,ScrW%;ScrH%;8,16,16,0 : OFF
      dibs% = FNCreateDIBSection

      GetTickCount% = FNSYS_NameToAddress("GetTickCount")
      InvalidateRect% = FNSYS_NameToAddress("InvalidateRect")
      Sleep% = FNSYS_NameToAddress("Sleep")

      gridW% = ScrW%
      gridH% = ScrH%
      gridSz% = 8*gridW%*gridH% : REM Grid size in bytes
      DIM hMap%  gridSz%+8 : hMap% =(hMap% +7) AND -8
      DIM hMap2% gridSz%+8 : hMap2%=(hMap2%+7) AND -8
      DIM vMap%  gridSz%+8 : vMap% =(vMap% +7) AND -8

      colTabSz% = 10000
      DIM colTab% 4*(colTabSz% + 1)
      colTab%=(colTab%+3) AND -4

      MaxExciters% = 8
      DIM exciter{( MaxExciters%-1 ) active%, x%, y%, \
      \ amp, theta, dtheta, life%, dying%}

      PROCFillColourTable

      PROC_asm

      REM Clear the grids
      FOR I%=hMap%  TO hMap%+gridSz%-1  STEP 8:|I% = 0.0:NEXT
      FOR I%=hMap2% TO hMap2%+gridSz%-1 STEP 8:|I% = 0.0:NEXT
      FOR I%=vMap%  TO vMap%+gridSz%-1  STEP 8:|I% = 0.0:NEXT

      dampTheta# = 0.0
      dampDTheta# = 0.0001
      dampAmp# = 0.05
      scale# = 0.05
      hDamp# = 0.1
      first% = TRUE
      frame% = 0
      *REFRESH OFF
      SYS GetTickCount% TO time0%
      REPEAT
        FOR I% = 0 TO MaxExciters%-1
          IF exciter{(I%)}.active% THEN
            X% = exciter{(I%)}.x%
            Y% = exciter{(I%)}.y%
            |(hMap% + 8*(Y%*gridW% + X%)) = \
            \ 1.0# * exciter{(I%)}.amp * SIN(exciter{(I%)}.theta)
            exciter{(I%)}.theta += exciter{(I%)}.dtheta
            IF exciter{(I%)}.life% > 0 THEN
              exciter{(I%)}.life%-=1
            ELSE
              exciter{(I%)}.dying% = TRUE
            ENDIF
            IF exciter{(I%)}.dying% THEN
              exciter{(I%)}.amp -= 1
              IF exciter{(I%)}.amp <= 0 THEN
                exciter{(I%)}.active% = FALSE
                |(vMap% + 8*(Y%*gridW% + X%)) = 0.0
              ENDIF
            ENDIF
          ELSE
            IF first%=TRUE OR RND(2500)=1 THEN
              IF first% THEN first% = FALSE
              exciter{(I%)}.active% = TRUE
              exciter{(I%)}.x% = RND(gridW%)-2
              exciter{(I%)}.y% = RND(gridH%)-2
              exciter{(I%)}.amp = 100+RND(800)
              exciter{(I%)}.theta = 0
              IF RND(10) > 1 THEN
                exciter{(I%)}.dtheta = 0.01 * RND(1)
              ELSE
                exciter{(I%)}.dtheta = 0.005 * RND(1)
              ENDIF
              exciter{(I%)}.life% = 500+RND(1000)
              exciter{(I%)}.dying% = FALSE
            ENDIF
          ENDIF
        NEXT I%
  
        hDamp# = 1.0#*(0.01+dampAmp#*ABSSIN(dampTheta#))
        dampTheta# += dampDTheta#
  
        SYS Wave%, gridW%, gridH%, hMap%, hMap2%, vMap%, ^scale#, ^hDamp#
        SYS DWordCopy, hMap2%, hMap%, 2*gridW%*gridH%
        CALL DrawHMap
  
        SYS InvalidateRect%, @hwnd%, 0, 0
        *REFRESH
        IF Delay% THEN SYS Sleep%, 2
  
        frame% += 1
        SYS GetTickCount% TO time1%
        IF time1%-time0% >= 1000 THEN
          SYS "SetWindowText", @hwnd%, STR$frame% + " fps"
          SYS GetTickCount% TO time0%
          frame% = 0
        ENDIF
      UNTIL FALSE
      END

      DEF PROCFillColourTable
      LOCAL I%,r%,g%,b%
      LOCAL t1,t2,t3,t4,t5,t6
      LOCAL dt1, dt2, dt3, dt4, dt5, dt6
      t1=2*PI*RND(1):dt1=0.1*RND(1)
      t2=2*PI*RND(1):dt2=0.1*RND(1)
      t3=2*PI*RND(1):dt3=0.1*RND(1)
      t4=2*PI*RND(1):dt4=0.1*RND(1)
      t5=2*PI*RND(1):dt5=0.1*RND(1)
      t6=2*PI*RND(1):dt6=0.1*RND(1)
      FOR I% = 0 TO colTabSz%-1
        r% = 128+127*SIN(t1)*SIN(t2)
        g% = 128+127*SIN(t3)*SIN(t4)
        b% = 128+127*SIN(t5)*SIN(t6)
        colTab%!(4*I%) = r%*&10000 + g%*&100 + b%
        t1+=dt1
        t2+=dt2
        t3+=dt3
        t4+=dt4
        t5+=dt5
        t6+=dt6
      NEXT I%
      ENDPROC

      DEF PROC_asm
      LOCAL P%, code%, pass%, gap1%, gap2%
      LOCAL xlp, ylp
      DIM gap1% 4095, code% 1023, gap2% 4095
      FOR pass%=0 TO 2 STEP 2
        P%=code%
        [OPT pass%
  
        .DrawHMap
        pushad
        sub esp, 16
        mov ebp, dibs%
        mov esi, colTab%
        finit
        xor edx, edx         ; Y loop index
        .ylp
        xor ecx, ecx         ; X loop index
        .xlp
        mov ebx, edx         ; copy Y
        imul ebx, gridW%     ; Y*gridW
        add ebx, ecx         ; Y*gridW + X
        shl ebx, 3           ; 8*(Y*gridW + X)
        add ebx, hMap%       ; hMap%+8*(Y*gridW + X)
        fld QWORD [ebx]      ; = h
        fistp DWORD [esp]    ; = h%
        mov edi, [esp]       ; EDI = h%
        add edi, colTabSz%DIV2 ; colTabSz%DIV2 + h%
        mov eax, [esi + 4*edi] ; get colour
        mov [ebp], eax
        add ebp, 4
        add ecx, 1
        cmp ecx, gridW%
        jl xlp
        add edx, 1
        cmp edx, gridH%
        jl ylp
        add esp, 16
        popad
        ret
  
        .DWordCopy
        ; srcAddr, destAddr, numDWORDs
        pushad
        ; ESP+36 = srcAddr
        ; ESP+40 = destAddr
        ; ESP+44 = numDWORDs
        mov esi, [esp + 36]
        mov edi, [esp + 40]
        mov ecx, [esp + 44]
        cld
        rep movsd
        popad
        ret 12
  
        ]
      NEXT pass%
      ENDPROC

      DEFFNCreateDIBSection
      LOCAL A%,B%,H%,O%
      DIM B% 19:!B%=44:B%!4=@vdu%!208:B%!8=@vdu%!212:B%!12=&200001
      SYS"CreateDIBSection",@memhdc%,B%,0,^A%,0,0TOH%
      IF H%=0 PROCError("Create DIBSection failed")
      SYS"SelectObject",@memhdc%,H%TOO%
      SYS"DeleteObject",O%
      CLS
      =A%

      DEF FNSYS_NameToAddress(f$)
      LOCALP%:DIMP%LOCAL5:[OPT 0:call f$:]:=P%!-4+P%

      DEF PROCFixWndSz
      LOCAL W%
      SYS"GetWindowLong",@hwnd%,-16 TO W%
      SYS"SetWindowLong",@hwnd%,-16,W% ANDNOT&40000 ANDNOT&10000
      ENDPROC

      DEF PROC_clean_up
      wave1d_dll% += 0
      IF wave1d_dll% <> 0 THEN SYS "FreeLibrary", wave1d_dll%
      ENDPROC

      DEF PROCError(s$)
      OSCLI "REFRESH ON"
      CLS : ON : VDU 7
      PRINT '" " + s$;
      REPEAT UNTIL INKEY(1)=0
      ENDPROC

David.
--

Logged

David Williams
Developer

member is offline
Avatar

meh

Gender:

Posts: 452

Re: Letting GCC do the hard work
« Reply #5 on: Aug 11^th, 2014, 9:30pm »

As part of my familiarization with C (with a view to writing hybrid C and BB4W programs), I revisited an eccentric old past-time of mine: searching for circular alignments (and other geometric shapes) amongst sets of randomly positioned points. Ten years ago when I was into this stuff, it would take BBC BASIC several minutes to search a 100-or-so random 'targets'. Now it takes around 20 seconds on my laptop (in BASIC). Admittedly, the circle-finding algorithm is not robust and nor is it the most efficient way of looking for circles (that would probably be some variant of the Hough Transform - a linear-time search).

Anyway, the compiled C version (after conversion to BB4W assembler code) takes ~0.2 seconds to search 100 random points for circles, some 80 to 100 times faster than the BASIC version. The DLL version takes around 1 second, some 4-5 times slower than the ASM version, but still significantly faster than the BASIC version.

The following link to a Zip folder contains different versions of the circle finder (BASIC, assembly language, DLL, C source):

http://www.bb4wgames.com/temp/circfinder.zip

The assembly language version makes (minor) use of ASMLIB (because GCC compiler emitted a CMOVxx instruction).

Parameters used (the most important ones):

Number of random (x,y) points: 100
Grid size: 20km x 16km
Minimum 'hits' required per circle: 8
Error tolerance: 60 metres
Minimum allowed radius: 2000 metres
Maximum allowed radius: 6000 metres

Note that none of this is based on final, 'production quality' code. I am, after all, merely learning C.

I think a "GCC assembler dump to BB4W assembler code converter" would save a lot of time, so perhaps that's a new project for me at some point.

David.
--

Logged

yee
New Member

member is offline

Posts: 4

Re: Letting GCC do the hard work
« Reply #6 on: Aug 12^th, 2014, 8:17pm »

Hi, everyone,

Just a note to say, - some years ago, I came across an article "Easy C " by pete Orlin and John Heath in may 1985 issue of Byte magazine (p 137-148)

it describes their use of the C's preprocesor to make their C code scripts - read a bit like BASIC code which one may find easier to "understand" for the beginner and infrequent C user ?

(I can e-mail a copy if you can't search/find a copy )

Logged

David Williams
Developer

member is offline
Avatar

meh

Gender:

Posts: 452

Re: Letting GCC do the hard work
« Reply #7 on: Aug 17^th, 2014, 08:13am »

Probably my last post for a while...

1000 depth-sorted 'vector balls' based on graphics routines written in C (including the Shell Sort code I borrowed from Rosetta Code). I get 60 fps on my laptop, which is quite impressive, considering (well, considering that I don't yet really know what I'm doing with C):

www.bb4wgames.com/temp/vector_balls.zip [EXE; 142 Kb]

Update: Two versions of glib.dll compiled using different GCC optimisation settings (-O2 and -O3 plus -ffast-math). Also edited this post to include the image link below.

Screenshot:
www.bb4wgames.com/temp/vecballs.jpg

This kind of performance is convincing me that hybridizing BB4W and C/C++ code is the way to go (for me personally).

I'll include the BB4W part of the source below for curious people.

David.
---

Code:

      *FLOAT 64
      *ESC OFF

      ON ERROR PROCerror

      HIMEM = PAGE + 2*&100000

      PROCFixWndSz : MODE 8 : OFF

      INSTALL @lib$ + "GLIB"
      PROCInitGLIB( @lib$ + "glib.dll", g{} )

      ON ERROR PROCCleanup : PROCerror
      ON CLOSE PROCCleanup : QUIT

      GetTickCount = FN`s("GetTickCount")

      LerpClr        = FNImport("LerpClr")
      Plot           = FNImport("Plot")
      InitExPoints   = FNImport("InitExPointList")
      ShellSort      = FNImport("ShellSortExPointListZValues")
      Rotate         = FNImport("RotateExPoints")
      MakeBallBitmap = FNImport("MakeBallBitmap1")

      REM Create a 48x48 ball bitmap:
      DIM ball% 4*(48*48 + 1)
      ball%=ball% + 3 AND -4
      SYS MakeBallBitmap, ball%, 48, &40AA20, &FF0020, 0.99*&10000, &10000

      N% = 1000

      DIM list{(N%-1) x#, y#, z#, x2#, y2#, z2#, key%, d0%, d1%, d2% }

      listBaseAddr% = ^list{(0)}.x#

      IF (listBaseAddr% AND 3) <> 0 THEN
        PRINT '" The coordinates list base address is not DWORD-aligned!"
        PRINT '" This may affect performance. Continuing in 5 seconds..."
        WAIT 500
      ENDIF

      REM Define objects (balls) 3D (x,y,z) coordinates:
      FOR I% = 0 TO N%-1
        list{(I%)}.x# = (RND(1)-0.5) * 800.0
        list{(I%)}.y# = (RND(1)-0.5) * 800.0
        list{(I%)}.z# = (RND(1)-0.5) * 800.0
        list{(I%)}.x2# = 0.0
        list{(I%)}.y2# = 0.0
        list{(I%)}.z2# = 0.0
      NEXT

      REM Initialise rotation angles:
      t1 = 2*PI*RND(1)
      t2 = 2*PI*RND(1)
      t3 = 2*PI*RND(1)

      F% = &10000 : REM Frequently used constant

      frames% = 0

      *REFRESH OFF

      SYS GetTickCount TO time0%
      REPEAT
  
        REM Draw background:
        SYS LerpClr, g{}, &102030, &805090
  
        REM Init key values for Z-sort
        REM (this could and perhaps should be moved into the rotation routine):
        SYS InitExPoints, listBaseAddr%, N%
  
        REM Rotate coordinates:
        SYS Rotate, listBaseAddr%, 1+2+4, N%, F%*t1, F%*t2, F%*t3, \
        \ F%*-100, F%*-200, F%*50, \
        \ F%*320, F%*256, F%*0, \
        \ 1, F%*300, F%*800
  
        REM Shell sort code (in C) courtesy of Rosetta Code, many thanks:
        SYS ShellSort, listBaseAddr%, N%, -1
  
        REM Draw the depth-sorted balls:
        FOR I% = 0 TO N%-1
          J% = list{(I%)}.key%
          SYS Plot, g{}, ball%, 48, 48, list{(J%)}.x2#-16, list{(J%)}.y2#-16
        NEXT I%
  
        PROCDisplay( TRUE )
  
        frames% += 1
  
        SYS GetTickCount TO time1%
        IF time1%-time0%>=1000 THEN
          SYS GetTickCount TO time0%
          SYS "SetWindowText", @hwnd%, STR$frames% + " fps"
          frames% = 0
        ENDIF
  
        REM Bump rotation angles:
        t1 += 0.02001
        t2 += 0.01905
        t3 += 0.00598
      UNTIL FALSE

      DEF PROCFixWndSz
      LOCAL W%
      SYS"GetWindowLong",@hwnd%,-16 TO W%
      SYS"SetWindowLong",@hwnd%,-16,W% ANDNOT&40000 ANDNOT&10000
      ENDPROC

      DEF PROCerror
      OSCLI "REFRESH ON" : CLS : ON : PRINT '" ";
      REPORT : PRINT " at line "; ERL;
      REPEAT UNTIL INKEY(1)=0
      ENDPROC

« Last Edit: Aug 17^th, 2014, 9:09pm by David Williams »

Logged

rtr
Guest

Re: Letting GCC do the hard work
« Reply #8 on: Aug 17^th, 2014, 10:32am »

on Aug 17^th, 2014, 08:13am, David Williams wrote:

This kind of performance is convincing me that hybridizing BB4W and C/C++ code is the way to go (for me personally).

Performance (if by that you mean speed) is not a good reason to go down the GCC route. Even though the code generators in modern compilers are very good, you will almost always be able to do better with hand-crafted assembler.

That is especially true when you are not targeting a particular CPU architecture, but want the code to run on a wide range of machines. In that case much of the clever code-optimising for a specific architecture, that GCC can do very well, will benefit some machines at the expense of others.

If you are compiling with the -march=native switch and then testing your code on the same machine you are getting a misleading impression of performance (unless of course you want to go down the route of including machine code for a range of different architectures and choosing the best one at run time).

Where using C does admittedly have advantages is in speed and ease of coding, and especially in time taken debugging. If those are the issues that most concern you, then fine.

Richard.

Logged

David Williams
Developer

member is offline
Avatar

meh

Gender:

Posts: 452

Re: Letting GCC do the hard work
« Reply #9 on: Aug 17^th, 2014, 11:38am »

on Aug 17^th, 2014, 10:32am, Richard Russell wrote:

If you are compiling with the -march=native switch and then testing your code on the same machine you are getting a misleading impression of performance [...]

Yes, I did use that switch and when after uploading the EXE for public consumption, I discovered that it crashed my 32-bit XP-based laptop (which I hardly use now!). I suspect my use of -march=native caused GCC to generate 64-bit code since the laptop the code was compiled on is a 64-bit machine. Lesson learned.

Re-compiling the vector balls demo without the aforementioned switch results in the code working on the 32-bit laptop, although the frame rate isn't as high as on the compilation machine (which is a little faster anyway, I think).

Quote:

Where using C does admittedly have advantages is in speed and ease of coding, and especially in time taken debugging. If those are the issues that most concern you, then fine.

For the vast majority of my 'applications', the speed of GCC's generated ASM code suffices (and in some cases, has exceeded the speed of my hand-written ASM code, which isn't too surprising!). I won't be touching -- or rather, writing -- assembler code again unless my life depends on it.

David.
--

Logged

Pages: 1


« Previous Topic \| Next Topic »