Profiling in Haskell (SDL game series part 3)
In this post, I’ll show how I profiled the SDL project. The code shown is a continuation of the efforts in part one and part two. The starting point will be this version of the project: strat-0.0.1.tar.gz.
With version 0.0.1 from part two, I noticed that whenever I did a right mouse button drag to scroll the map, the redraw would always lag behind my mouse movement. Since I’m new to Haskell, I don’t yet have an instinctual feel for what code actually costs me at run time. I figured now is as good of a time as any to take a peek. I’m not interested in getting it as fast as possible, but I feel uncomfortable proceeding without at least checking out some numbers.
First, I would like to reference this blog post by Nathan Sanders. That page, and the profiling chapter in Real World Haskell, got me up and running. For the official version, check out GHC’s user guide on profiling.
The first step in profiling is rebuilding everything with profiling support if it has not been done already. You can execute cabal commands to do this on a package-by-package basis:
The ‘-p’ flag tells cabal to build the library with profiling support.
Doing this for everything on my system quickly got annoying. In the end, I deleted my ~/.ghc and ~/.cabal folder and started from scratch. After the initial ‘cabal update’, edit the ~/.cabal/config file and set library-profiling to True.
Once the dependencies were reinstalled with profiling enabled, I executed a new cabal configure command to enable profiling for the SDL game:
cabal configure --enable-executable-profiling
cabal build
When GHC generates code that can be profiled the binaries come out thicker. Unstripped, my non-profile version of strat is 2.4 MB. The profiled version is 4.0 MB.
To generate a profiling report you must pass a command line flag to the run-time system of the executable. Here’s the command to run from the source directory of the project:
This will start up the SDL game. Move the mouse around a bit and exit. Once the program stops, you’ll notice a file called strat.prof. Here’s what was generated for me with the current settings:
strat +RTS -p -RTS
total time = 2.28 secs (114 ticks @ 20 ms)
total alloc = 772,177,168 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
MAIN MAIN 100.0 100.0
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 1 0 100.0 100.0 100.0 100.0
CAF Main 348 13 0.0 0.0 0.0 0.0
CAF GHC.Float 325 7 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 286 2 0.0 0.0 0.0 0.0
CAF System.Posix.Internals 285 1 0.0 0.0 0.0 0.0
CAF GHC.Conc 269 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.Internals 260 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 254 5 0.0 0.0 0.0 0.0
CAF UIState 251 1 0.0 0.0 0.0 0.0
CAF Graphics.UI.SDL.General 227 1 0.0 0.0 0.0 0.0
CAF Graphics.UI.SDL.Events 223 1 0.0 0.0 0.0 0.0
CAF Graphics.UI.SDL.Video 215 3 0.0 0.0 0.0 0.0
CAF System.Random 210 1 0.0 0.0 0.0 0.0
CAF Data.Time.Clock.POSIX 203 2 0.0 0.0 0.0 0.0
CAF Data.Time.Clock.CTimeval 202 1 0.0 0.0 0.0 0.0
Not very helpful is it? It basically only profiled main and some CAFs (Constant Applicative Function – fancy name for functions without arguments).
What we need to do is pass some more build flags to tell GHC to profile more than just that.
The ‘-caf-all’ option tells GHC to profile all top-level CAFs separately instead of as a group (the default behavior, as shown above). The ‘-auto-all’ option tells GHC to profile all top-level functions. Lastly, the ‘-fforce-recomp’ option forces the recompilation of the files, thereby making sure we have the profiling code we want. With the new executable, try this:
Which generates this report:
strat +RTS -p -RTS
total time = 3.42 secs (171 ticks @ 20 ms)
total alloc = 1,303,852,488 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
drawTile Main 72.5 35.8
redrawScreen Main 22.2 57.9
getHexmapOffset Main 2.3 4.8
tileInViewPort Main 1.8 0.0
makeRandomMap Main 0.6 1.3
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 1 0 0.0 0.0 100.0 100.0
main Main 432 383 0.6 0.0 100.0 100.0
vpY UIState 465 320 0.0 0.0 0.0 0.0
vpX UIState 464 320 0.0 0.0 0.0 0.0
uiViewPort UIState 463 320 0.0 0.0 0.0 0.0
uiMouseButtonsDown UIState 462 3 0.0 0.0 0.0 0.0
redrawScreen Main 448 393 22.2 57.9 98.8 98.5
drawTile Main 449 3930000 72.5 35.8 76.6 40.6
==_aNo UIState 458 75680 0.0 0.0 0.0 0.0
tileInViewPort Main 452 3930000 1.8 0.0 1.8 0.0
getHexmapOffset Main 450 0 2.3 4.8 2.3 4.8
makeRandomMap Main 445 102 0.6 1.3 0.6 1.5
getRandomTerrain Main 446 100 0.0 0.3 0.0 0.3
toEnum_aOi UIState 461 155 0.0 0.0 0.0 0.0
loadArt Main 436 1 0.0 0.0 0.0 0.0
CAF:artFilePaths Main 426 1 0.0 0.0 0.0 0.0
artFilePaths Main 437 1 0.0 0.0 0.0 0.0
CAF:main18 Main 425 1 0.0 0.0 0.0 0.0
artFilePaths Main 439 0 0.0 0.0 0.0 0.0
CAF:main19 Main 424 1 0.0 0.0 0.0 0.0
artFilePaths Main 441 0 0.0 0.0 0.0 0.0
CAF:main20 Main 423 1 0.0 0.0 0.0 0.0
artFilePaths Main 443 0 0.0 0.0 0.0 0.0
CAF:tileWidth_r3OL Main 422 1 0.0 0.0 0.0 0.0
tileWidth Main 451 1 0.0 0.0 0.0 0.0
CAF:tileHeight_r3ON Main 421 1 0.0 0.0 0.0 0.0
tileHeight Main 454 1 0.0 0.0 0.0 0.0
CAF:windowWidth Main 420 1 0.0 0.0 0.0 0.0
windowWidth Main 453 1 0.0 0.0 0.0 0.0
CAF:windowHeight Main 419 1 0.0 0.0 0.0 0.0
windowHeight Main 455 1 0.0 0.0 0.0 0.0
CAF:mapColumns Main 418 1 0.0 0.0 0.0 0.0
mapColumns Main 447 1 0.0 0.0 0.0 0.0
CAF:main24 Main 417 1 0.0 0.0 0.0 0.0
artFilePaths Main 438 0 0.0 0.0 0.0 0.0
CAF:main23 Main 416 1 0.0 0.0 0.0 0.0
artFilePaths Main 440 0 0.0 0.0 0.0 0.0
CAF:main22 Main 415 1 0.0 0.0 0.0 0.0
artFilePaths Main 442 0 0.0 0.0 0.0 0.0
CAF:main21 Main 414 1 0.0 0.0 0.0 0.0
artFilePaths Main 444 0 0.0 0.0 0.0 0.0
CAF:main27 Main 413 1 0.0 0.0 0.0 0.0
main Main 434 0 0.0 0.0 0.0 0.0
CAF:main26 Main 412 1 0.0 0.0 0.0 0.0
main Main 435 0 0.0 0.0 0.0 0.0
CAF:main4 Main 411 1 0.0 0.0 0.0 0.0
main Main 466 0 0.0 0.0 0.0 0.0
CAF:main29 Main 410 1 0.0 0.0 0.0 0.0
CAF GHC.Float 387 7 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 348 2 0.0 0.0 0.0 0.0
CAF System.Posix.Internals 347 1 0.0 0.0 0.0 0.0
CAF GHC.Conc 331 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.Internals 322 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 316 5 0.0 0.0 0.0 0.0
CAF:terrainMaxBound UIState 307 1 0.0 0.0 0.0 0.0
terrainMaxBound UIState 459 1 0.0 0.0 0.0 0.0
fromEnum_aOn UIState 460 1 0.0 0.0 0.0 0.0
CAF:terrainTypes UIState 306 1 0.0 0.0 0.0 0.0
terrainTypes UIState 456 1 0.0 0.0 0.0 0.0
enumFrom_aOx UIState 457 1 0.0 0.0 0.0 0.0
CAF Graphics.UI.SDL.General 266 1 0.0 0.0 0.0 0.0
CAF Graphics.UI.SDL.Events 262 1 0.0 0.0 0.0 0.0
CAF Graphics.UI.SDL.Video 254 3 0.0 0.0 0.0 0.0
CAF System.Random 249 1 0.0 0.0 0.0 0.0
CAF Data.Time.Clock.POSIX 242 2 0.0 0.0 0.0 0.0
CAF Data.Time.Clock.CTimeval 241 1 0.0 0.0 0.0 0.0
Much better! Near the top of the report there is a short list of the most expensive functions. We can see that drawTile takes about 72% of execution time and redrawScreen takes about 22%. The next few functions after that barely cost anything by comparison.
If we wanted to see some memory profiling data on the garbage collector we can execute the binary with these options:
Which outputs this to stderr:
done
2,722,681,960 bytes allocated in the heap
644,419,088 bytes copied during GC
4,732,608 bytes maximum residency (78 sample(s))
149,912 bytes maximum slop
14 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 5106 collections, 0 parallel, 0.50s, 0.52s elapsed
Generation 1: 78 collections, 0 parallel, 0.42s, 0.42s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 4.59s ( 6.27s elapsed)
GC time 0.92s ( 0.94s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 5.51s ( 7.21s elapsed)
%GC time 16.7% (13.1% elapsed)
Alloc rate 593,646,382 bytes per MUT second
Productivity 83.3% of total user, 63.6% of total elapsed
There are ways to profile the heap as well. For this program, the results from that were not enlightening.
So what to do now? It does not appear that we are having major GC or memory errors. Lets change the two most expensive functions and manually place ‘call centers’ around most expressions. This way we can see the cost of almost everything in the functions.
> drawTile (UIState vp _ mainSurf terrainSurfs tm) (x,y) = do
> let sr = {-# SCC "dT-sr" #-} Just (SDL.Rect 0 0 tileWidth tileHeight)
> (tX, tY) = {-# SCC "dT-gP2V" #-} gamePoint2View vp $ {-# SCC "dT-gHO" #-} getHexmapOffset tileWidth tileHeight x y
> dr = {-# SCC "dT-dr" #-} Just $ SDL.Rect tX tY 0 0
> tt = DM.fromJust $ {-# SCC "dT-tmLookup" #-} DMap.lookup (x,y) tm
> terrainSurf = DM.fromJust $ {-# SCC "dT-tsLookup" #-} lookup tt terrainSurfs
> if {-# SCC "dT-tIVP" #-} tileInViewPort vp tileWidth tileHeight (tX,tY)
> then do
> {-# SCC "dT-blit" #-} SDL.blitSurface terrainSurf sr mainSurf dr
> return ()
> else
> return ()
> redrawScreen ui@(UIState vp _ mainSurf terrainSurfs terrainMap) = do
> {-# SCC "rS-fillRect" #-} SDL.fillRect mainSurf Nothing (SDL.Pixel 0)
> {-# SCC "rS-mDrawTile" #-} mapM_ (drawTile ui) $ {-# SCC "rS-tmKeys" #-} DMap.keys terrainMap
> {-# SCC "rS-flip" #-} SDL.flip mainSurf
> return ()
The “{-# SCC “LABEL” #-} set a cost center for the expression that follows it. With this, rebuild the project and run the binary with profiling enabled.
dist/build/strat/strat +RTS -p
This will generate a report that looks like this (only part of the report is shown):
strat +RTS -p -RTS
total time = 12.72 secs (636 ticks @ 20 ms)
total alloc = 4,299,436,864 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
dT-blit Main 58.2 0.8
rS-tmKeys Main 10.8 44.9
dT-gHO Main 7.2 24.7
gamePoint2View Main 6.1 24.7
rS-mDrawTile Main 4.1 0.0
dT-gP2V Main 3.9 0.0
drawTile Main 3.0 0.2
getHexmapOffset Main 1.7 3.7
rS-fillRect Main 1.6 0.0
tileInViewPort Main 1.1 0.0
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 1 0 0.0 0.0 100.0 100.0
main Main 452 987 0.5 0.0 100.0 100.0
vpY UIState 513 443 0.0 0.0 0.0 0.0
vpX UIState 512 443 0.0 0.0 0.0 0.0
uiViewPort UIState 511 443 0.0 0.0 0.0 0.0
uiMouseButtonsDown UIState 510 6 0.0 0.0 0.0 0.0
redrawScreen Main 472 1006 0.0 0.0 99.1 99.5
rS-flip Main 509 1006 0.0 0.0 0.0 0.0
rS-mDrawTile Main 474 10062012 4.1 0.0 97.5 99.5
drawTile Main 476 10060000 3.0 0.2 82.5 54.6
dT-tmLookup Main 505 109188 0.2 0.1 0.2 0.1
dT-tsLookup Main 501 109188 0.2 0.0 0.3 0.0
==_aO1 UIState 504 385235 0.2 0.0 0.2 0.0
dT-blit Main 488 109188 58.2 0.8 58.2 0.8
dT-dr Main 487 109188 0.0 0.2 0.0 0.2
dT-tIVP Main 482 10060000 0.8 0.0 1.9 0.0
tileInViewPort Main 483 10060000 1.1 0.0 1.1 0.0
dT-gP2V Main 477 10060000 3.9 0.0 19.0 53.2
gamePoint2View Main 479 10060000 6.1 24.7 6.1 24.7
dT-gHO Main 478 10060000 7.2 24.7 9.0 28.5
getHexmapOffset Main 480 0 1.7 3.7 1.7 3.7
rS-tmKeys Main 475 1006 10.8 44.9 10.8 44.9
rS-fillRect Main 473 1006 1.6 0.0 1.6 0.0
makeRandomMap Main 469 102 0.3 0.4 0.5 0.5
getRandomTerrain Main 470 100 0.2 0.1 0.2 0.1
toEnum_aOV UIState 508 294 0.0 0.0 0.0 0.0
loadArt Main 456 1 0.0 0.0 0.0 0.0
Now we have a decent idea what expressions are really costing us in execution.
The call to SDL.blitSurface is by far the most costly. This is natural as this is the function doing all the actual rendering IO work behind the scenes.
What caught me off guard was the hit taken by calling Data.Map.keys. Almost half of my memory allocations are coming from this call.
Lets add a list comprehension to generate the coordinates just once, since that’s all the terrainMap’s keys were anyway.
> mapCoordinates = [(x,y) | x <- [1..mapRows]
> , y <- [1..mapColumns]]
> redrawScreen :: UIState -> IO ()
> redrawScreen ui@(UIState vp _ mainSurf terrainSurfs terrainMap) = do
> {-# SCC "rS-fillRect" #-} SDL.fillRect mainSurf Nothing (SDL.Pixel 0)
> {-# SCC "rS-mDrawTile" #-} mapM_ (drawTile ui) $ {-# SCC "rS-mapCoords" #-} mapCoordinates
> {-# SCC "rS-flip" #-} SDL.flip mainSurf
> return ()
A recompile and another run with ‘+RTS -p’ yeilds a profile report with these as the top call centers:
strat +RTS -p -RTS
total time = 8.10 secs (405 ticks @ 20 ms)
total alloc = 1,868,126,656 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
dT-blit Main 64.7 1.4
dT-gHO Main 9.1 44.8
drawTile Main 4.9 0.4
gamePoint2View Main 4.2 44.8
dT-gP2V Main 3.7 0.0
rS-mDrawTile Main 3.5 0.0
rS-fillRect Main 3.0 0.0
getHexmapOffset Main 2.5 6.8
dT-tIVP Main 2.2 0.0
A run with ‘+RTS -sstderr’ gives this output:
2,082,792,480 bytes allocated in the heap
33,971,320 bytes copied during GC
4,853,488 bytes maximum residency (6 sample(s))
95,640 bytes maximum slop
13 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 3955 collections, 0 parallel, 0.07s, 0.06s elapsed
Generation 1: 6 collections, 0 parallel, 0.02s, 0.03s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 5.05s ( 6.54s elapsed)
GC time 0.09s ( 0.09s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 5.14s ( 6.63s elapsed)
%GC time 1.8% (1.3% elapsed)
Alloc rate 412,461,270 bytes per MUT second
Productivity 98.2% of total user, 76.2% of total elapsed
That change alone eliminated a ton of GC copies and drastically reduced the amount of time spent in the GC. The program felt snappier as well.
However, the big payoff comes from realizing that I need to cut down on my blits. Our main’s inner function eventLoop looks like this:
> redrawScreen ui
> e <- SDL.waitEvent
> checkEvent ui e
This means that for every event sent by SDL the screen is redrawn. Mouse movement generates lots of events! Recall how fast they printed out to console when we were writing tracing output at the beginning of part two?
A better solution is to process all of the events in the queue before redrawing the screen. After the screen is redrawn, we can block on waiting for an event. Replace the code above with this:
> e <- SDL.pollEvent
> checkEvent ui e
> checkEvent ui (SDL.NoEvent) = do
> redrawScreen ui
> e <- SDL.waitEvent
> checkEvent ui e
It adds another pattern match for checkEvent that will run when no events are available. This made an enormous difference in the scroll speed.
With that change, I’m happy with the performance of the game for now.
Here’s a preview of the new programmer art that will be used in the next post:
