Allinea DDT: Enriching the Poor Man's Profiler
Have you heard of the Poor Man's Profiler? Let me paraphrase for you:
"Poor man have no time to learn complicated new tools. Poor man need simulation results."
-- poormansprofiler.org, paraphrased by Allinea.
The idea behind the Poor Man's Profiler is that we often don't need to learn how to use a new, complicated tool to tell us exactly why our program is running so slowly.
If our code is running slowly enough to notice, we can usually just do this:
- Run the program under a debugger
- Stop after a while and look at the code that's being executed
- Repeat the previous step a dozen times or so
This is essentially what most high-performance profilers do, only at a much higher resolution, and then they drown us in the resulting data. Mostly, we don't need to know about every microsecond of a five-hour job. If performance has become a noticeable problem, a simple technique like this is usually enough to find the cause.
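The tallying that normally happens in your head after a dozen pauses can be sketched in a few lines of C++. This is only an illustration, assuming each pause has been reduced to the name of the function found at the top of the stack. The helper `tally_samples` is hypothetical and is not part of any debugger.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: count how often each function was on top of the
// stack across a handful of manual pauses, most frequent first.
std::vector<std::pair<std::string, int>>
tally_samples(const std::vector<std::string>& samples) {
    std::map<std::string, int> counts;
    for (const std::string& frame : samples) {
        ++counts[frame];
    }

    std::vector<std::pair<std::string, int>> ranked(counts.begin(), counts.end());
    // Sort by descending count; ties keep the map's alphabetical order.
    std::stable_sort(ranked.begin(), ranked.end(),
                     [](const std::pair<std::string, int>& a,
                        const std::pair<std::string, int>& b) {
                         return a.second > b.second;
                     });
    return ranked;
}
```

If most of the pauses land in the same function, the first entry of the result is the place to start looking. A dozen samples is a crude measure, but a hotspot that dominates the runtime will show up in it.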
But there's a problem with the Poor Man's Profiler: it's a real pain to use on parallel programs.
Allinea DDT: Enriching the Poor Man's Profiler
The Poor Man's philosophy is about using the tools we know to take us as far as they can. If you're running code on a cluster, you probably already have access to DDT - try "ddt" or "module load ddt" to see. If you're out of luck, you can just download a free version with a 30-day evaluation licence.
Here are the three steps to using DDT as an enriched, parallel version of the Poor Man's Profiler:
- Type: ddt -n 64 -start my-dir/slow-program.exe
- Press pause from time to time and look at the parallel stack view
- Enlightenment, joy and a deep sense of inner peace
One advantage of using an interactive, graphical debugger is that once we've seen where our code is slow, it can help us to understand why.
As an example, I tried this on a Stack Overflow question from a while back:
"The problem is the more processors I use the more time it takes. I thought it might be a problem with it taking more time to send the data to each slave than it takes to to just find the max" -- user1422751 on StackOverflow
First I compiled the code:
mpicc -g -O3 test/mpi-find-max.cpp -o test/mpi-find-max
And ran it under DDT:
ddt -n 4 -start test/mpi-find-max
Every time I paused the program, it stopped here:

Instantly, we see that process 0 is spending a lot of time initializing the array, on the call to rand(), and processes 1-3 are all waiting in MPI_Recv.
It seems like they're waiting for process 0; DDT's message queues can tell us for sure - just click on View->Message Queues:

So the reason the program doesn't scale is Amdahl's law - the serial portion already dominates the computation. You could argue that the author is parallelizing the wrong thing!
You may have noticed that the parallelization used isn't ideal either - the blocking MPI_Send/MPI_Recv calls mean that the data is sent out to one process at a time. MPI_Scatter is a better way to do this. But in this case, that obvious implementation flaw wasn't what was making the program run slowly. Trying to optimize it would have been extremely frustrating, as it wouldn't have measurably improved wall-clock performance at all!
Avoiding this frustration is why everybody, from students to experts, should run their code through a profiler before making guesses about why it's running slowly. The beauty of using DDT for this is that we can do just that, without instrumenting our code or finding, installing and getting to grips with complex new software.
Like all good things, it just works: ddt -start.
Tell us how you get on.