How to optimize C code for NGPC?
#1
What are some good general guidelines for optimizing C code for NGPC?

Given that there isn't a big NGPC development scene, that also means there's not a whole lot of documentation/tutorials/guidelines available.

Gameboy, on the other hand, has some, and I found this page that has some advice for optimizing C code for GB:
https://gbdev.io/to_c_or_not_to_c.html#asm-help

Per the Gameboy optimization recommendations:
1. Use global variables instead of local if possible, because GB is slow at using the stack.
2. Inline instead of using functions if best performance is desired.

I have refactored much of my NGPC code to use a lot more global variables and at least avoid nested function calls, but I don't think it is making a perceivable difference.

In fact, I am getting what I see as very good performance, even with likely inefficient C code.

The main issue I am running into is with hblank effects bogging performance down noticeably when used on real hardware.  I can cut those out and get what feels like really good performance.

Do recommendations from GB coding carry over into NGPC and/or are there other recommendations for optimizing C code for NGPC?
#2
-O3

:)
#3
A bit of asm in your C code?

Personally, I use asm for hblank code, sometimes also using register bank switching (incf, decf or ldf).

Do you know ldir(w) and lddr(w)? (Check the asm doc: e_900h_chap3_cpu_4.pdf)

Something like:

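void moveRam(void *dst, void *src, u32 count)
{
    /* Arguments are read from the stack: dst at xsp+4, src at xsp+8, count at xsp+12 */
    __ASM(" ld xbc,(xsp+12) ");      /* count (ldirw uses BC as its 16-bit word counter) */
    __ASM(" ld xhl,(xsp+8) ");       /* source address */
    __ASM(" ld xde,(xsp+4) ");       /* destination address */
    __ASM(" ldirw (xde+),(xhl+) ");  /* copy one word, advance both pointers, repeat until BC is 0 */
}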


This can be improved by dropping the function call and initializing dst and src directly from global variables (maybe you can also use DMA for asynchronous memory transfers).

Most of the time, -O3 does better than my own asm "optimizations", which is why I often add -S to check how the code is compiled.
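
For readers who do not read TLCS-900 assembly, here is a rough host-side C model of what the ldirw loop in moveRam does: copy 16-bit words from ascending source addresses to ascending destination addresses while a 16-bit counter runs down. The name move_words is hypothetical and used only for illustration. Treating a count of 0 as "copy nothing" is a choice made for this model; check the CPU manual for how the hardware instruction behaves with a zero count.

```c
#include <stdint.h>

/* Host-side model of an ascending block copy of 16-bit words.
   `words` is a number of 16-bit words, not a number of bytes.
   Assumption: a count of 0 copies nothing in this model. */
void move_words(uint16_t *dst, const uint16_t *src, uint16_t words)
{
    while (words != 0) {
        *dst++ = *src++;   /* copy one word and advance both pointers */
        words--;           /* 16-bit counter, like BC in the asm version */
    }
}
```

Comparing the -S output for a plain C loop like this against the inline-asm version is one way to see whether the hand-written copy actually wins.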
#4
(02-11-2022, 08:49 PM)sodthor Wrote: A bit of asm in your C code?

It appears my makefile has always had -O3.  I can barely stumble my way through ASM, but I am attempting to learn.

In my case, everything needed by hblank is known during vblank.

Here is my hblank code that references some global variables.  I added notes where the pain point is for performance.

Code:
void __interrupt myHBL()
{
    u8 y = RAS_Y;

    myHBCounter++;

    if (y == 8) {
        if (showHUD) {
            SCR1_Y = plane1.planeY;
            SCR1_X = plane1.planeX;
            SPR_Y = 0;
            if (currentLevel.isBossRoom) SCR2_Y = plane2.planeY;
        }
    }
    // everything above works fine
    // everything below kills performance

    // performance suffers if screenSplit is true
    else if (screenSplit) {
        if (y == 60)       SCR2_X = split1;
        else if (y == 100) SCR2_X = split2;
    }
}


What is the best way to shift this to ASM?
#5
First, do you execute the hblank every line? If yes, maybe try only every 4 lines, as 8, 60 and 100 are multiples of 4. You can also adapt your gfx to multiples of 8 (64 & 96 or 104).

Then compile with the -S flag to check the compiled code of the hblank, and change the order of the conditions to see whether the output gets simpler.

What is the hb counter used for, if you have RAS_Y?

Maybe to avoid conditions, you can do a specific hblank for each situation: showHUD, boss room and/or screenSplit...

I'll try on my side tomorrow to see if asm can be useful because as I said, most of the time, the compiler is better than me, and modifying my C code to help the compiler is often the best solution.
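
A minimal sketch of the "specific hblank for each situation" idea, written so that it runs on a host machine. Everything here is a stand-in: scr2_x replaces the SCR2_X register, the handlers take y as a parameter instead of reading RAS_Y, and hbl_idle, hbl_split, select_hbl and current_hbl are hypothetical names. On hardware you would more likely install the chosen __interrupt function as the hblank vector during vblank rather than call through a pointer.

```c
#include <stdint.h>

/* Stand-ins for the hardware register and the game's globals (assumptions). */
uint8_t scr2_x;                 /* plays the role of SCR2_X */
uint8_t split1 = 10;
uint8_t split2 = 20;

typedef void (*hbl_fn)(uint8_t y);

/* Handler for frames with no raster effects: nothing to test, nothing to do. */
void hbl_idle(uint8_t y)
{
    (void)y;
}

/* Handler used only while the plane 2 split is active,
   so it never has to test a screenSplit flag. */
void hbl_split(uint8_t y)
{
    if (y == 60)       scr2_x = split1;
    else if (y == 100) scr2_x = split2;
}

hbl_fn current_hbl = hbl_idle;

/* Called once per frame (for example during vblank), when the
   state for the coming frame is already known. */
void select_hbl(int screen_split)
{
    current_hbl = screen_split ? hbl_split : hbl_idle;
}
```

The point of the design is that the per-frame decision is made once, outside the interrupt, so each handler carries only the comparisons its own scenario needs.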
#6
(02-14-2022, 05:33 AM)sodthor Wrote: First, do you execute the hblank every line? If yes, maybe try only every 4 lines, as 8, 60 and 100 are multiples of 4. You can also adapt your gfx to multiples of 8 (64 & 96 or 104).

I very likely am running hblank every line. Which part of the setup controls the line interval?

I am using myHBCounter as a measure of how many hblanks there have been since the game loop started. I set it to 0 at the beginning of the loop and nop at the end until it reaches something like 20 as a means of running the game loop asynchronous to vblank. I could probably just check against ras_y and eliminate that.
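
If the counter is dropped in favor of the raster position, the elapsed line count has to handle wrap-around at the end of a frame. Here is a small host-runnable sketch, assuming the raster counter runs from 0 to LINES_PER_FRAME - 1 and that less than one full frame passes between the two readings. The value 199 and the name lines_elapsed are assumptions for illustration; check the hardware documentation for the real range of RAS_Y.

```c
#include <stdint.h>

/* Assumption: total raster lines per frame, including the blanking period. */
#define LINES_PER_FRAME 199

/* Lines elapsed between two raster readings taken less than one frame apart.
   Both arguments are expected to be below LINES_PER_FRAME. */
uint8_t lines_elapsed(uint8_t start, uint8_t now)
{
    int delta = (int)now - (int)start;
    if (delta < 0) {
        delta += LINES_PER_FRAME;   /* the counter wrapped into the next frame */
    }
    return (uint8_t)delta;
}
```

A conditional add is used instead of a modulo so that the check stays cheap on a CPU where division is comparatively slow.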

Ah, I see, I don't have to have a one size fits all hblank. I could have the game loop initialize a specific hblank function for each scenario. That will probably help.

Thanks for the pointers!
#7
For the line interval, check 8Bit.pdf from the SNK docs, page 16; it gives you the code :). Put 4 instead of 1 in TREG0.
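
The arithmetic behind the every-4-lines suggestion can be checked on a host machine. This sketch assumes the handler fires on line 0 and then on every interval-th line after it, which depends on when the timer is started, and it assumes 152 visible lines per frame. The function names are hypothetical.

```c
#include <stdint.h>

/* Returns 1 if a handler that fires every `interval` lines, starting at
   line 0 (assumption), runs on line y; returns 0 otherwise. */
int fires_on_line(uint8_t y, uint8_t interval)
{
    return interval != 0 && (y % interval) == 0;
}

/* Number of handler invocations over `lines` consecutive lines starting at
   line 0, i.e. the ceiling of lines / interval. */
unsigned calls_per_span(unsigned lines, unsigned interval)
{
    if (interval == 0) {
        return 0;
    }
    return (lines + interval - 1) / interval;
}
```

Under those assumptions, lines 8, 60 and 100 are all still reached with an interval of 4, while the number of invocations over 152 lines drops from 152 to 38.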
#8
Example of optimization:
This is the result of compiling the code above (more or less):

Code:
_myHBL2:
    push    XBC
    push    WA
    ld    A,(0x8009)
    incb    0x1,(_myHBCounter)
    cp    A,0x8
    j    ne,L9
    cpb    (_showHUD),0x0
    j    eq,L12
    lda    XBC,_plane1
    ld    A,(XBC+0x1)
    ld    (0x8033),A
    ld    A,(XBC)
    ld    (0x8032),A
    ldb    (0x8021),0x0
    cpb    (_currentLevel),0x0
    j    eq,L12
    ld    A,(_plane2 + 0x1)
    ld    (0x8035),A
    j    L12
L9:  ;1
    cp    A,0x3c                        ;    '<' 60
    j    ne,L13
    cpb    (_screenSplit),0x0
    j    eq,L12
    ld    A,(_split1)
    ld    (0x8034),A
    j    L12
L13:  ;1
    cp    A,0x64                        ;    'd' 100
    j    ne,L12
    cpb    (_screenSplit),0x0
    j    eq,L12
    ld    A,(_split2)
    ld    (0x8034),A
L12:  ;8
    pop    WA
    pop    XBC
    reti


push xbc and pop xbc are executed on every branch, but xbc is only used in one of them, so they can be moved into that branch:


Code:
_myHBL2:
    push    WA
    ld    A,(0x8009)
    incb    0x1,(_myHBCounter)
    cp    A,0x8
    j    ne,L9
    cpb    (_showHUD),0x0
    j    eq,L12
    push    XBC
    lda    XBC,_plane1
    ld    A,(XBC+0x1)
    ld    (0x8033),A
    ld    A,(XBC)
    pop    XBC
    ld    (0x8032),A
    ldb    (0x8021),0x0
    cpb    (_currentLevel),0x0
    j    eq,L12
    ld    A,(_plane2 + 0x1)
    ld    (0x8035),A
    j    L12
L9:  ;1
    cp    A,0x3c                        ;    '<' 60
    j    ne,L13
    cpb    (_screenSplit),0x0
    j    eq,L12
    ld    A,(_split1)
    ld    (0x8034),A
    j    L12
L13:  ;1
    cp    A,0x64                        ;    'd' 100
    j    ne,L12
    cpb    (_screenSplit),0x0
    j    eq,L12
    ld    A,(_split2)
    ld    (0x8034),A
L12:  ;8
    pop    WA
    reti
#9
(02-14-2022, 06:29 PM)sodthor Wrote: For the line interval, check 8Bit.pdf from the SNK docs, page 16; it gives you the code :). Put 4 instead of 1 in TREG0.

In attempting to change TREG0 to some other interval, I discovered I need it set to 1 for udmadac, which uses timer 0 and timer 2.

At any rate, breaking up into different hblank code for each scenario seems to speed it up in the case where I am splitting plane2.  I am still trying to determine if there is a perceivable difference between doing the plane splitting and not.  I might be at a point where I can't notice the difference when testing on real hardware.
#10
(02-14-2022, 09:16 PM)sodthor Wrote: Example of optimization:

Thanks for this example!