In my previous post I was talking about steps one needs to undertake and data they get when profiling an application. Now let’s use this knowledge in practice.
I will be profiling Blaze, a 2D physics engine based on Box2D. It was actually Mason’s newsgroup post which inspired me to blog about profiling.
Prerequisites
To repeat my experiments you will need:
- DMD compiler with Tango. The simplest is to get it from the Tango official download site. The “current release (including DMD 1.033)” is what I’m actually using. Here is a direct link to download it.
- Blaze source code. In this article I’m hacking trunk revision 409 so I encourage you to get the same version. You can do so using either a Subversion client:
svn checkout -r 409 http://svn.dsource.org/projects/blaze/trunk blaze
or the Blaze repository browser, the “Download in other formats: Zip Archive” link at the bottom of the page. Here’s a direct link if you wish to save some browsing.
- Derelict. I’m using trunk revision 336, but I think any latest version will suffice. Get it either via SVN
svn checkout http://svn.dsource.org/projects/derelict/trunk derelict
or from the repository browser, direct link included.
- SDL.dll. You can get a pre-built Windows version from the official downloads page.
That’s it. You won’t be needing anything else. Make sure you put DMD’s bin on the path, the rest of this article assumes that.
Build
Blaze comes with a testbed application and a set of DSSS configuration files. I won’t use DSSS though. I’ll use Jake instead—a minimalistic, one-step build utility similar to Bud and Rebuild but more robust because it relies upon DMD’s built-in dependency resolution system. And it’s simple. And it’s bundled with the Tango distribution mentioned above so you should already have it. In fact I’m using a similar tool of my own which works not only with Tango but with Phobos as well, but that’s another story. Jake is perfectly sufficient in the scope of this article.
I use the following command to build the testbed application:
jake blazeDemos.d -O -release -inline -profile -I..\blaze -I\home\snake\src\derelict\DerelictGL -I\home\snake\src\derelict\DerelictUtil -I\home\snake\src\derelict\DerelictGLU -I\home\snake\src\derelict\DerelictSDL
Of course you should change path to Derelict according to your setup. The build takes around 12 seconds on my 1.8 GHz Core2 Duo laptop. Don’t forget to copy the sdl.dll into the same directory where you build the testbed app. Now we are ready to
Profile!
Run the blazeDemos.exe. The app starts with the default domino demo. It runs slowly, much slower than without profiling enabled.
The profiling process only measures performance of code actually executed, so it is essential to let the application run for some time. I suggest you wait for everything stops moving except for the pendulum, then close the application. This way many code paths are executed and therefore measured, and the results are easy to reproduce. It takes around 30 seconds on my laptop for the scene to completely stabilize.
Let’s see what we’ve got. I want generic performance information now so I skip right to the overal timings section of trace.log starting with a row of equal signs. Here’s what I’ve got in several first lines:
======== Timer Is 3579545 Ticks/Sec, Times are in Microsecs ========
Num Tree Func Per
Calls Time Time Call
1 9718996 2105879 2105879 __Dmain
859 1767855 1100816 1281 _D5blaze9collision5nbody12sortAndSweep12SortAndSweep6searchMFZv
859 741745 618140 719 _D10blazeDemos9drawSceneFZv
53649 654964 571310 10 _D5blaze9collision8pairwise11collidePoly14edgeSeparationFC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormiC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormZf
2789 628572 542171 194 _D5blaze8dynamics7contact13contactSolver13ContactSolver5_ctorMFS5blaze5world8TimeStepAC5blaze8dynamics7contact7contact7ContactiZC5blaze8dynamics7contact13contactSolver13ContactSolver
1 361446 361442 361442 _D10blazeDemos14createGLWindowFAaiiibZv
859 322915 279146 324 _D5blaze9collision5nbody12sortAndSweep12SortAndSweep9shellSortMFZv
15712 914581 240098 15 _D5blaze9collision8pairwise11collidePoly17findMaxSeparationFKiC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormZf
Let’s see. First comes __Dmain, that is, main(). It’s Tree Time is 9718996 microseconds which is almost 10 seconds. main() is the application starting point, so everything else is called from it—the testbed application is single-threaded. The main‘s tree time must be the total program execution time. But the measured program run time was 30 seconds. Why’s the difference? This is because profiling comes with huge overhead. That’s profiling code which runs 2/3 of the time. Without profiling the app would run 3 times faster. This is no surprise.
What actually surprises is the Func Time column. Func time is the total time spent in the body of this function but not in any functions it calls. As I explained earlier, this time is used to sort the function table to put most time consuming functions at the top of the list. And the top one is main! Even more, according to the measurements main consumes more than 20% of execution time:
2.1 / 9.72 * 100 = 21.6%
This is really unexpected. main() contains the most outer loop, body of which executes once per frame, and it only calls a small number of other functions. It’s easy to see the main loop activity looking at main‘s fan out:
__Dmain 0 34789587 7538091 1 _D10blazeDemos14createGLWindowFAaiiibZv 1 _D10blazeDemos10keyPressedFiZv 859 _D10blazeDemos11processTimeFZv 859 _D5tango4text7convert7Integer13__T6decodeTaZ6decodeFAaKaJaJiZv 859 _D5tango4text7convert7Integer16__T9formatterTaZ9formatterFAalaaiZAa 859 _D5tango4stdc7stringz9toStringzFAaAaZPa 859 _D5blaze5world5World4stepMFfiiZv 859 _D7dominos7Dominos6updateMFZv 859 _D10blazeDemos13processEventsFZv 859 _D10blazeDemos9drawSceneFZv 859 _D10blazeDemos8limitFPSFkZv
main() only calls 9 other functions, less than 900 times each. This shouldn’t take more than complicated collision detection. But it does. Here’s why.
Profiler takes into account only functions compiled with
-profileswitch. It is blind towards any other functions. If an instrumented function calls a non-instrumented one, the call time is added to the caller’s “func time” as if the called function were inlined.
Looking at main(), it is easy to come with a list of potentially offending functions:
glClear()- A direct call to OpenGL dynamic library
SDL_WM_SetCaption()- A direct call to SDL dynamic library
SDL_GL_SwapBuffers()- Again, a direct call to SDL dynamic library
Let’s wrap each into a separate D function, like mainClean, mainCaption, and mainSwap. Don’t forget to disable inlining for this trick to work.
Important! Every run of an instrumented application updates trace.log and trace.def to include new results. But the application doesn’t check whether trace.log and trace.def are from the same application. If they’re not, or if they’re from another build of the same application, they get corrupted. I think there should be some sort of a check sum at the start of these files so that app can compare against its own check sum and overwrite incompatible files completely. For now though, always erase trace.log and trace.def after re-building your test application.
The trick works. main() takes 93rd place, consuming only modest 21 milliseconds. And the first place is taken by mainCaption. That’s it. Setting window caption takes more than 1/5 of execution time of an optimized application.
Finding out why setting caption is so slow is probably out of scope of this article. I’ll just assume there is a reason and hack around it. There is a variable, frames, which is set to zero every time a frame rate counter, fps_, is refreshed. So I’ll add a check and will only update caption when this happens. Here’s the final code fragment:
if (frames == 0) {
// Create fps caption
char[] title = "Blaze Demo - Press Keys 0-9, q-e To Switch Demos - FPS: ";
title ~= Integer.format (new char[32], fps_);
void mainCaption() {
SDL_WM_SetCaption(toStringz(title), null);
}
mainCaption();
}
Rebuild, erase trace.*, run. Now mainCaption is at 28th place with only 1/4 second consumed. The first place is rightly taken by solveVelocityConstraints which is probably computation-intensive.
Final test, with inlining. Compile, erase, run:
======== Timer Is 3579545 Ticks/Sec, Times are in Microsecs ========
Num Tree Func Per
Calls Time Time Call
829 1691967 1065306 1285 _D5blaze9collision5nbody12sortAndSweep12SortAndSweep6searchMFZv
829 715255 600852 724 _D10blazeDemos9drawSceneFZv
52839 647945 565433 10 _D5blaze9collision8pairwise11collidePoly14edgeSeparationFC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormiC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormZf
2729 619133 535539 196 _D5blaze8dynamics7contact13contactSolver13ContactSolver5_ctorMFS5blaze5world8TimeStepAC5blaze8dynamics7contact7contact7ContactiZC5blaze8dynamics7contact13contactSolver13ContactSolver
1 345129 345126 345126 _D10blazeDemos14createGLWindowFAaiiibZv
829 305158 264505 319 _D5blaze9collision5nbody12sortAndSweep12SortAndSweep9shellSortMFZv
15472 902121 235004 15 _D5blaze9collision8pairwise11collidePoly17findMaxSeparationFKiC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormZf
2729 355048 223354 81 _D5blaze8dynamics7contact13contactSolver13ContactSolver24solvePositionConstraintsMFfZb
8187 251732 218908 26 _D5blaze8dynamics7contact13contactSolver13ContactSolver24solveVelocityConstraintsMFZv
16368 251190 217956 13 _D5blaze9collision8pairwise8distance8inPointsFS5blaze6common4math5bVec2AS5blaze6common4math5bVec2iZb
4945 467027 170846 34 _D5blaze9collision8pairwise8distance103__T15distanceGenericTC5blaze9collision6shapes7polygon7PolygonTC5blaze9collision6shapes7polygon7PolygonZ15distanceGenericFKS5blaze6common4math5bVec2KS5blaze6common4math5bVec2C5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormZf
2726 1784432 167115 61 _D5blaze8dynamics6island6Island5solveMFS5blaze5world8TimeStepS5blaze6common4math5bVec2bZv
1 7668760 165340 165340 __Dmain
main() is at 13th place, consuming only 165 milliseconds in total—all title updates included because of inlining. Clean measured run time is now 7.67 seconds instead of 9.72. That’s more than 20% speed up. Not bad at all.
Afterword
You’ve probably heard of premature optimization. It’s when you waste time optimizing something just to find out later that it didn’t matter, but you could have had a 20% speed gain by writing a simple if in a completely different place you couldn’t even think of. Or when you realize that you must optimize your heavily optimized algorithm differently, essentially re-implementing it.
Don’t ever make this mistake. Use profiler.
Sergey,
Many thanks for this analysis and intro to profiling! I never expected main()’s OpenGL and SDL function calls to consume so much overhead! This entire time I’ve narrowly focused on the physics engine itself rather than the demo framework! In any case, I’m making your suggested changes… much appreciated!
I’ve also taken steps to optimize Blaze’s broadphase collision detection by making it more compliant with Box2d. The speedup is phenomenal!!! Check out the new demo (includes your optimizations) here:
http://svn.dsource.org/projects/blaze/downloads/blazeDemos.zip
The source trunk is updated.
Thanks,
Mason
You’re welcome! I’m glad that my posts actually help you.
I’ve checked out the latest trunk of Blaze, rev. 423, and it was really fast. It’s nice to see such a progress. Good job!
[...] second of Sergey’s Profiling with DMD on Windows articles is up. This time, he walks you step-by-step through profiling the testbed app of the Blaze [...]
Sergey
Thanks for post these two article. Some really usefully stuff for profilling in D. I will be reading these carefully.
regards
Nick B
Nice article, thanks.
Here is an idea of further improvement of Descent for Eclipse: parse that textfile trace.log for better readability, like demangle names via demangle.d and maybe, convert time units to something more human-readable form (like 1s/1000msec/..) , to show those data in grid with sortable columns in separate tab in Descent, to help pinpoint that nasty exact function. Or maybe, Dunit-like unittests for Descent, too.
Sketch of that feature could be written pretty fast in python & pyD /deelight/so on, as a separate script app (to be converted in plain D further).