Profiling with DMD on Windows: Getting Hands Dirty

In my previous post I was talking about steps one needs to undertake and data they get when profiling an application. Now let’s use this knowledge in practice.

I will be profiling Blaze, a 2D physics engine based on Box2D. It was actually Mason’s newsgroup post which inspired me to blog about profiling.

Prerequisites

To repeat my experiments you will need:

  • DMD compiler with Tango. The simplest is to get it from the Tango official download site. The “current release (including DMD 1.033)” is what I’m actually using. Here is a direct link to download it.
  • Blaze source code. In this article I’m hacking trunk revision 409 so I encourage you to get the same version. You can do so using either a Subversion client:
    svn checkout -r 409 http://svn.dsource.org/projects/blaze/trunk blaze

    or the Blaze repository browser, the “Download in other formats: Zip Archive” link at the bottom of the page. Here’s a direct link if you wish to save some browsing.

  • Derelict. I’m using trunk revision 336, but I think any latest version will suffice. Get it either via SVN
    svn checkout http://svn.dsource.org/projects/derelict/trunk derelict

    or from the repository browser, direct link included.

  • SDL.dll. You can get a pre-built Windows version from the official downloads page.

That’s it. You won’t be needing anything else. Make sure you put DMD’s bin on the path, the rest of this article assumes that.

Build

Blaze comes with a testbed application and a set of DSSS configuration files. I won’t use DSSS though. I’ll use Jake instead—a minimalistic, one-step build utility similar to Bud and Rebuild but more robust because it relies upon DMD’s built-in dependency resolution system. And it’s simple. And it’s bundled with the Tango distribution mentioned above so you should already have it. In fact I’m using a similar tool of my own which works not only with Tango but with Phobos as well, but that’s another story. Jake is perfectly sufficient in the scope of this article.

I use the following command to build the testbed application:

jake blazeDemos.d -O -release -inline -profile -I..\blaze -I\home\snake\src\derelict\DerelictGL -I\home\snake\src\derelict\DerelictUtil -I\home\snake\src\derelict\DerelictGLU -I\home\snake\src\derelict\DerelictSDL

Of course you should change path to Derelict according to your setup. The build takes around 12 seconds on my 1.8 GHz Core2 Duo laptop. Don’t forget to copy the sdl.dll into the same directory where you build the testbed app. Now we are ready to

Profile!

Run the blazeDemos.exe. The app starts with the default domino demo. It runs slowly, much slower than without profiling enabled.

The profiling process only measures performance of code actually executed, so it is essential to let the application run for some time. I suggest you wait for everything stops moving except for the pendulum, then close the application. This way many code paths are executed and therefore measured, and the results are easy to reproduce. It takes around 30 seconds on my laptop for the scene to completely stabilize.

Let’s see what we’ve got. I want generic performance information now so I skip right to the overal timings section of trace.log starting with a row of equal signs. Here’s what I’ve got in several first lines:

======== Timer Is 3579545 Ticks/Sec, Times are in Microsecs ========

  Num          Tree        Func        Per
  Calls        Time        Time        Call

      1     9718996     2105879     2105879     __Dmain
    859     1767855     1100816        1281     _D5blaze9collision5nbody12sortAndSweep12SortAndSweep6searchMFZv
    859      741745      618140         719     _D10blazeDemos9drawSceneFZv
  53649      654964      571310          10     _D5blaze9collision8pairwise11collidePoly14edgeSeparationFC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormiC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormZf
   2789      628572      542171         194     _D5blaze8dynamics7contact13contactSolver13ContactSolver5_ctorMFS5blaze5world8TimeStepAC5blaze8dynamics7contact7contact7ContactiZC5blaze8dynamics7contact13contactSolver13ContactSolver
      1      361446      361442      361442     _D10blazeDemos14createGLWindowFAaiiibZv
    859      322915      279146         324     _D5blaze9collision5nbody12sortAndSweep12SortAndSweep9shellSortMFZv
  15712      914581      240098          15     _D5blaze9collision8pairwise11collidePoly17findMaxSeparationFKiC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormZf

Let’s see. First comes __Dmain, that is, main(). It’s Tree Time is 9718996 microseconds which is almost 10 seconds. main() is the application starting point, so everything else is called from it—the testbed application is single-threaded. The main‘s tree time must be the total program execution time. But the measured program run time was 30 seconds. Why’s the difference? This is because profiling comes with huge overhead. That’s profiling code which runs 2/3 of the time. Without profiling the app would run 3 times faster. This is no surprise.

What actually surprises is the Func Time column. Func time is the total time spent in the body of this function but not in any functions it calls. As I explained earlier, this time is used to sort the function table to put most time consuming functions at the top of the list. And the top one is main! Even more, according to the measurements main consumes more than 20% of execution time:

2.1 / 9.72 * 100 = 21.6%

This is really unexpected. main() contains the most outer loop, body of which executes once per frame, and it only calls a small number of other functions. It’s easy to see the main loop activity looking at main‘s fan out:

__Dmain	0	34789587	7538091
	    1	_D10blazeDemos14createGLWindowFAaiiibZv
	    1	_D10blazeDemos10keyPressedFiZv
	  859	_D10blazeDemos11processTimeFZv
	  859	_D5tango4text7convert7Integer13__T6decodeTaZ6decodeFAaKaJaJiZv
	  859	_D5tango4text7convert7Integer16__T9formatterTaZ9formatterFAalaaiZAa
	  859	_D5tango4stdc7stringz9toStringzFAaAaZPa
	  859	_D5blaze5world5World4stepMFfiiZv
	  859	_D7dominos7Dominos6updateMFZv
	  859	_D10blazeDemos13processEventsFZv
	  859	_D10blazeDemos9drawSceneFZv
	  859	_D10blazeDemos8limitFPSFkZv

main() only calls 9 other functions, less than 900 times each. This shouldn’t take more than complicated collision detection. But it does. Here’s why.

Profiler takes into account only functions compiled with -profile switch. It is blind towards any other functions. If an instrumented function calls a non-instrumented one, the call time is added to the caller’s “func time” as if the called function were inlined.

Looking at main(), it is easy to come with a list of potentially offending functions:

glClear()
A direct call to OpenGL dynamic library
SDL_WM_SetCaption()
A direct call to SDL dynamic library
SDL_GL_SwapBuffers()
Again, a direct call to SDL dynamic library

Let’s wrap each into a separate D function, like mainClean, mainCaption, and mainSwap. Don’t forget to disable inlining for this trick to work.

Important! Every run of an instrumented application updates trace.log and trace.def to include new results. But the application doesn’t check whether trace.log and trace.def are from the same application. If they’re not, or if they’re from another build of the same application, they get corrupted. I think there should be some sort of a check sum at the start of these files so that app can compare against its own check sum and overwrite incompatible files completely. For now though, always erase trace.log and trace.def after re-building your test application.

The trick works. main() takes 93rd place, consuming only modest 21 milliseconds. And the first place is taken by mainCaption. That’s it. Setting window caption takes more than 1/5 of execution time of an optimized application.

Finding out why setting caption is so slow is probably out of scope of this article. I’ll just assume there is a reason and hack around it. There is a variable, frames, which is set to zero every time a frame rate counter, fps_, is refreshed. So I’ll add a check and will only update caption when this happens. Here’s the final code fragment:

        if (frames == 0) {
            // Create fps caption
            char[] title = "Blaze Demo - Press Keys 0-9, q-e To Switch Demos - FPS: ";
            title  ~= Integer.format (new char[32], fps_);
            void mainCaption() {
                SDL_WM_SetCaption(toStringz(title), null);
            }
            mainCaption();
        }

Rebuild, erase trace.*, run. Now mainCaption is at 28th place with only 1/4 second consumed. The first place is rightly taken by solveVelocityConstraints which is probably computation-intensive.

Final test, with inlining. Compile, erase, run:

======== Timer Is 3579545 Ticks/Sec, Times are in Microsecs ========

  Num          Tree        Func        Per
  Calls        Time        Time        Call

    829     1691967     1065306        1285     _D5blaze9collision5nbody12sortAndSweep12SortAndSweep6searchMFZv
    829      715255      600852         724     _D10blazeDemos9drawSceneFZv
  52839      647945      565433          10     _D5blaze9collision8pairwise11collidePoly14edgeSeparationFC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormiC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormZf
   2729      619133      535539         196     _D5blaze8dynamics7contact13contactSolver13ContactSolver5_ctorMFS5blaze5world8TimeStepAC5blaze8dynamics7contact7contact7ContactiZC5blaze8dynamics7contact13contactSolver13ContactSolver
      1      345129      345126      345126     _D10blazeDemos14createGLWindowFAaiiibZv
    829      305158      264505         319     _D5blaze9collision5nbody12sortAndSweep12SortAndSweep9shellSortMFZv
  15472      902121      235004          15     _D5blaze9collision8pairwise11collidePoly17findMaxSeparationFKiC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormZf
   2729      355048      223354          81     _D5blaze8dynamics7contact13contactSolver13ContactSolver24solvePositionConstraintsMFfZb
   8187      251732      218908          26     _D5blaze8dynamics7contact13contactSolver13ContactSolver24solveVelocityConstraintsMFZv
  16368      251190      217956          13     _D5blaze9collision8pairwise8distance8inPointsFS5blaze6common4math5bVec2AS5blaze6common4math5bVec2iZb
   4945      467027      170846          34     _D5blaze9collision8pairwise8distance103__T15distanceGenericTC5blaze9collision6shapes7polygon7PolygonTC5blaze9collision6shapes7polygon7PolygonZ15distanceGenericFKS5blaze6common4math5bVec2KS5blaze6common4math5bVec2C5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormZf
   2726     1784432      167115          61     _D5blaze8dynamics6island6Island5solveMFS5blaze5world8TimeStepS5blaze6common4math5bVec2bZv
      1     7668760      165340      165340     __Dmain

main() is at 13th place, consuming only 165 milliseconds in total—all title updates included because of inlining. Clean measured run time is now 7.67 seconds instead of 9.72. That’s more than 20% speed up. Not bad at all.

Afterword

You’ve probably heard of premature optimization. It’s when you waste time optimizing something just to find out later that it didn’t matter, but you could have had a 20% speed gain by writing a simple if in a completely different place you couldn’t even think of. Or when you realize that you must optimize your heavily optimized algorithm differently, essentially re-implementing it.

Don’t ever make this mistake. Use profiler.

5 Responses

  1. Sergey,

    Many thanks for this analysis and intro to profiling! I never expected main()’s OpenGL and SDL function calls to consume so much overhead! This entire time I’ve narrowly focused on the physics engine itself rather than the demo framework! In any case, I’m making your suggested changes… much appreciated!

    I’ve also taken steps to optimize Blaze’s broadphase collision detection by making it more compliant with Box2d. The speedup is phenomenal!!! Check out the new demo (includes your optimizations) here:

    http://svn.dsource.org/projects/blaze/downloads/blazeDemos.zip

    The source trunk is updated.

    Thanks,
    Mason

    • You’re welcome! I’m glad that my posts actually help you.

      I’ve checked out the latest trunk of Blaze, rev. 423, and it was really fast. It’s nice to see such a progress. Good job!

  2. [...] second of Sergey’s Profiling with DMD on Windows articles is up. This time, he walks you step-by-step through profiling the testbed app of the Blaze [...]

  3. Sergey

    Thanks for post these two article. Some really usefully stuff for profilling in D. I will be reading these carefully.

    regards
    Nick B

  4. Nice article, thanks.

    Here is an idea of further improvement of Descent for Eclipse: parse that textfile trace.log for better readability, like demangle names via demangle.d and maybe, convert time units to something more human-readable form (like 1s/1000msec/..) , to show those data in grid with sortable columns in separate tab in Descent, to help pinpoint that nasty exact function. Or maybe, Dunit-like unittests for Descent, too.
    Sketch of that feature could be written pretty fast in python & pyD /deelight/so on, as a separate script app (to be converted in plain D further).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.