Thursday, March 26, 2009

Profiling on the DS

Fast, small, or easy. On this project, I've learnt that for PDF rendering you can have only two of those.

Since the hardware doesn't really give you more than 4 MB of memory for rendering, small is a requirement. Slowly, I've been moving from easy on the code side toward fast on the user side, which is good. A difficulty I've faced is that mupdf is a large library, spread over roughly three modules in many files. This makes assessing what is taking the most time non-trivial.

Luckily there is a very good function-level profiler available for the DS, written by simonjhall, the guy who ported Quake I and Quake II to the DS.

Function Level Profiling

This section talks about what function-level profiling does, and why it is cool. If you don't care, or already know all of it, feel free to skip ahead; the more instructional stuff is in the next section.

In case you aren't familiar with the concept, gcc and g++ offer this great feature where they insert a call to a specially designated function on every function entry, and a call to another special function on every function return. Both functions are given the address of the called function, as well as the address it is being called from.
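(For reference, the two hooks gcc actually calls are named __cyg_profile_func_enter and __cyg_profile_func_exit. Their prototypes look like this; they are conventionally marked no_instrument_function themselves so the profiler doesn't try to profile itself:)

void __cyg_profile_func_enter(void* thisFn, void* callSite)
    __attribute__ ((no_instrument_function));
void __cyg_profile_func_exit(void* thisFn, void* callSite)
    __attribute__ ((no_instrument_function));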

So let's say we have a simple program:

void c() { /* something */ }
void b() { c(); }
void a() { b(); c(); }

int main(int argc, char* argv[]) {
    a();
    return 0;
}
If we just ran this program, execution would go something like:

  • main()
  • a()
  • b()
  • c()
  • c returns
  • b returns
  • c()
  • c returns
  • a returns
  • main returns
And now, although this program runs instantly, we want to figure out how much time the code spends in each function. With the ability to insert calls before and after each function, we can have the special entry function grab the time on entry and record it; when the exit function is called, we take the difference, and that is the amount of time the function ran. So with special _entry(void* calledFunction) and _exit(void* calledFunction) functions defined and enabled for our program, the execution goes something like:
  • main()
  • _entry(*a())
  • a()
  • _entry(*b())
  • b()
  • _entry(*c())
  • c()
  • c returns
  • _exit(*c())
  • b returns
  • _exit(*b())
  • _entry(*c())
  • c()
  • c returns
  • _exit(*c())
  • a returns
  • _exit(*a())
  • main returns
We won't be too concerned here with the implementation of these functions -- we provide files later that handle this for us (note: if you want to do this on non-ARM processors, use the -pg flag). It is worth noting a couple of things, though. Primarily, as this is a debug tool, the overhead it adds is very significant. Memory can also be an issue: the special functions need to maintain some form of stack of called functions, so they can keep the entry data for functions that call other functions. Most likely, though, what you will notice is that the profiler adds a lot of time to your program, so the flags for it should only be enabled when needed.
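Just to make the idea concrete, here is a toy sketch of what such hooks might look like -- not the real cyg-profile.c, which does considerably more. It assumes a single-threaded program and uses the hblank counter described later as its clock:

#define MAX_DEPTH 64

extern volatile int hblanks;       /* incremented by the hblank interrupt */

static int depth;                  /* current call nesting */
static int entryTime[MAX_DEPTH];   /* hblank count at each entry */

void __attribute__ ((no_instrument_function))
__cyg_profile_func_enter(void* thisFn, void* callSite)
{
    (void)thisFn; (void)callSite;
    if (depth < MAX_DEPTH)
        entryTime[depth] = hblanks;   /* remember when we came in */
    depth++;
}

void __attribute__ ((no_instrument_function))
__cyg_profile_func_exit(void* thisFn, void* callSite)
{
    (void)callSite;
    depth--;
    if (depth >= 0 && depth < MAX_DEPTH) {
        int elapsed = hblanks - entryTime[depth];
        /* a real profiler would accumulate (thisFn, elapsed) in a table */
        (void)thisFn; (void)elapsed;
    }
}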

On a more technical note: in general, you can only rely on this kind of profiling so much. There are many places online detailing why, but given the simplicity of the DS (i.e., no caching to worry about), most of them don't apply here (note: if you can think of a way this kind of profiling can lead one astray on the DS, please comment :)).

How to profile on the DS

So download these three files: cyg-profile.c, cyg-profile.h, and resolve.pl (that last one is just 'resolve' on the server; rename it to 'resolve.pl'). Add the c file and the h file to your project. Make sure, though, if the rest of your project is C++, to '#define __cplusplus' before you include the h file.

This code won't actually profile everything; it relies on you enabling the profiler within your code. This lets you focus on profiling only where you know the slowdown to be. If you want to profile everything, of course, just turn it on right at the start :).

Initialize the profiler with cygprofile_begin().
Enable the profiler with cygprofile_enable().
Temporarily disable the profiler with cygprofile_disable().
Permanently disable the profiler and write the result file out with cygprofile_end(). (Nothing will be written unless you call this function at the end.)
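Here is a minimal sketch of how those calls fit together. renderPage() is a hypothetical stand-in for whatever code you suspect is slow, and fatInitDefault() is the usual libfat call, since (as noted below) enabling the FAT filesystem is up to you:

#include <nds.h>
#include <fat.h>
#include "cyg-profile.h"

int main(void)
{
    fatInitDefault();     /* FAT must be up or cyglog.txt can't be written */

    cygprofile_begin();   /* initialize the profiler */
    cygprofile_enable();  /* start recording function entries/exits */

    renderPage();         /* hypothetical: the code you want to measure */

    cygprofile_end();     /* permanently stop and write cyglog.txt */
    return 0;
}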

Now you need to change your build process a little. Append these two flags to your C flags: '-finstrument-functions -mpoke-function-name'. The first tells the compiler to insert calls to the special functions in cyg-profile.c; the second tells it to poke each function's name into the binary so we can get a readable version later on.
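With the standard devkitARM template makefile, for example, that is just:

CFLAGS += -finstrument-functions -mpoke-function-name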

One more thing is necessary. There is no really easy way to get a good reading of the current time on the DS, so this code instead relies on counting hblanks to provide accurate time measurements. It does not do the counting itself; that is something you need to do, by adding an hblank interrupt handler that simply increments a global integer named 'hblanks'. If you are already handling this interrupt, go ahead and add the global definition anywhere, and at the top of your hblank handler throw in a 'hblanks++'. If you aren't, just call this code in your initialization stuff:

irqSet(IRQ_HBLANK, hblankCount);
irqEnable(IRQ_HBLANK);

And then add this code above it somewhere in the file:

int hblanks = 0;

/* keep the compiler from instrumenting the handler itself */
static void hblankCount(void)
    __attribute__ ((no_instrument_function));

static void hblankCount(void)
{
    hblanks++;
}

If you already have an hblank handler, like I said, use it, but modify its declaration to have the '__attribute__ ((no_instrument_function))' modifier. This tells the compiler not to hijack this function's entry/exit, which could be hazardous in something that runs as an interrupt. I'd actually recommend doing this to all your interrupt handlers, unless you really want to profile them specifically.

Have it all set up? Run it! If everything goes right, you'll generate a file called cyglog.txt (it is up to you to enable the FAT filesystem), which is generally unreadable until we run it through resolve.pl. We do this next.

This next part requires you to have perl installed. If you are like me, you don't have it yet. Go get it, come back.

Copy this file from your DS to your hard drive. Then open up a terminal/command prompt, navigate to wherever the cyglog.txt file is, and make sure that perl and the ARM toolchain are on your path. You'll also need a copy of the elf file for whichever processor generated the cyglog file. That file provides the resolver with all the nice names for the ugly function pointers the profiler gives you.

Moment of truth: run 'perl resolve.pl {program.elf} cyglog.txt'. It should dump a large table straight to standard out. Once you get it working, you'll want to redirect it to another file to look at. The next section is going to pretend you named that file 'result.txt' ('perl resolve.pl {program.elf} cyglog.txt > result.txt'), but you, of course, don't have to.

Interpreting the Results

The table output will look like this:

address      calls   ms         ms/call   ms/frame   lowest stack   highest stack   function name

0x2004284      153     0.0636     0.000      0.000          15384           15424   bgGetGfxPtr
0x200422c      153     0.0000     0.000      0.000          15360           15400   bgGetMapBase
0x2016718     9269   185.4962     0.020      0.309          14048           14904   fz_concat

It's a really good idea right now to throw this output into some kind of spreadsheet app, so you can sort the whole thing by individual columns.

The first column is the function's actual memory address. Not too useful.
The second column is calls: the straight number of times the function was called.
Third we have ms. This gives us, predictably, the total time spent in that function, in milliseconds.
Fourth, we have ms/call: the average time spent in the function per call. Pay special attention here; the amount of time a function takes can vary wildly from call to call.
Fifth, we have ms/frame. This number is a bit of a red herring, and not really meaningful unless you are profiling code that needs to finish within one vblank. It is simply the total milliseconds divided by the number of frames processed, as measured by the hblanks you are counting.
Sixth: lowest/highest stack. This, presumably, tells you the minimum and maximum location of the function on the call stack -- i.e., how 'nested' your function is. I didn't know when this information would be useful; a reader notes below that it lets you see how close you are to overflowing the stack.
And of course, last but not least, the function name. If you are coding in C++ and using classes heavily, get used to its name mangling. A function named repairFrame in the class PdfApp that takes a single fz_irect_s argument, for instance, comes out as '_ZN6PdfApp11repairFrameE10fz_irect_s'.
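(If you have binutils handy, you can feed those names through c++filt to decode them; 'c++filt _ZN6PdfApp11repairFrameE10fz_irect_s' prints 'PdfApp::repairFrame(fz_irect_s)'.)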

The first thing you want to do is sort by total time taken. You'll likely find your best optimizations in the functions at the top of that list. If a function is called a lot, small savings inside it can go a long way. If it isn't called that much, check whether you are doing something inherently costly in it (like... floating point math) or whether it is calling expensive children.

Understand that the MOST time you can save by modifying any of these functions is the total time it takes to execute; prioritize based on this. Optimizing your code is an iterative process: try to optimize only one thing at a time, and keep track of what each optimization gained you. It's a good habit to save all the result files along the way to consult.

Selective Profiling

Elhobbs had a useful tip for an easy way to specify which files to profile and which not to. Add the following to your makefile after you include the normal ds_rules:

%.clk.o: %.clk.c 
   @echo $(notdir $<) 
   $(CC) -MMD -MP -MF $(DEPSDIR)/$*.d $(CFLAGS) -finstrument-functions -mpoke-function-name -c $< -o $@ 


Then whenever you want to profile a file, change its name from *.c to *.clk.c. You will still have to enable the profiler, but there is virtually no overhead if you initialize the profiler and never use it. When you view your report later on, you should only see the functions from that file.

Twentyseven points out below that it can be dangerous to profile code that runs in interrupts; the extra profiling overhead could cause it to do really goofy things. You can exclude functions from profiling with the '__attribute__ ((no_instrument_function))' modifier I mention above. If your code starts acting up only when the profiler is on, it's a good indication that you should cut back on what you are profiling within interrupts.

And I guess that's all I have to say about profiling. Questions? Concerns? Did I fuck something up really bad and you want to tell me? Leave comments :).


2 comments:

  1. Nice Article.

    One thing to be aware of is the DS is a very interrupt heavy piece of kit.

    Adding significantly to the execution time of time-sensitive things like a vblank or hblank irq handler can break things, and you need to keep an eye on it.

    ReplyDelete
  2. The highest and lowest stack information is so you can see how close you are to overflowing the DS's tiny 16k stack.

    ReplyDelete