Most sites we visit rely on their expert-level HPC support staff to analyze the efficiency of MPI programs. When we speak to the experts we often hear the same stories - users making common mistakes again and again - and worse, users with long-running, easily-fixed performance issues that they didn't even know they had.
Enter Allinea MAP...
With Allinea MAP, any MPI programmer can now take a quick look at the performance of their code - without modifying it or learning new syntax.
Here's how to do it with Allinea MAP:
The program runs with low overhead (1-5%) and produces a small data file for analysis. The initial views are intuitive enough for everyone to understand, but deep enough to help experts pinpoint complex problems:
What do these graphs mean?
On all charts the horizontal axis is time. The charts at the top show how the distribution of each performance metric across processes varies at each point in time, from the minimum at the bottom to the maximum at the top - so large areas of colour show imbalance across processes. It's clear that a lot of MPI time is spent in a regular pattern in the first 2/3 of the run. A quick click and drag zooms in for a closer look:
In the zoomed-in source code graphs we can see how the number of processes running each line of code changes over time as they move through the program.
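To make the idea behind the top charts concrete, here is a minimal sketch of the summary they plot: the minimum, mean and maximum of a metric across ranks at each time step, with the spread as a rough imbalance measure. The function names and the sample numbers are made up for illustration; this is not how Allinea MAP computes its charts.

```python
from statistics import mean

def summarize(samples):
    """samples[t][r] is the metric for rank r at time step t.
    Returns one (minimum, mean, maximum) tuple per time step."""
    return [(min(row), mean(row), max(row)) for row in samples]

def imbalance(samples):
    """Spread between the highest and lowest rank at each time step."""
    return [hi - lo for lo, _, hi in summarize(samples)]

# Hypothetical data: percentage of time spent in MPI calls, 4 ranks, 3 time steps.
mpi_time = [
    [10, 12, 11, 10],  # balanced
    [5, 80, 78, 82],   # rank 0 computes while the others wait
    [20, 21, 19, 20],  # balanced again
]

for step, spread in enumerate(imbalance(mpi_time)):
    print(f"time step {step}: spread across ranks = {spread}")
```

A wide spread at a time step corresponds to a large coloured area on the chart: some ranks are doing something very different from the rest.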
Check it out for yourself
Can you work out why some ranks are waiting inside the MPI call here? Can your colleagues?
Find out if you're right in our 5-minute video, which also looks at deeper CPU performance metrics such as time spent on memory accesses and vectorization. Like what you see? Sign up for a free download on release!

Memory leaks are a killer of long running applications - memory usage keeps growing until finally the memory supply is exhausted and it's "game over". If you’re lucky the system recognizes your application is at fault and terminates it. If you’re unlucky, the system itself will crash.
How do memory leaks come to exist and what are they?
Memory leaks happen when processes allocate memory and forget to release it. C and C++ developers know this type of memory as heap memory, obtained by calls to malloc or new, for example. Fortran90 coders know this as allocatable memory - and use the allocate primitive.
Allocatable memory is great but the infinite memory machine does not exist so you must take care to release the memory when you are done with it - allowing your application to reuse it the next time it needs some memory.
When you forget to deallocate, you’ve made a leak. If memory is allocated repeatedly without being deallocated - your application will fail! C and C++ users know particularly well that allocations happen everywhere - so finding the leak can be a challenge.
How can we fix a memory leak?
One of my applications is dying - I know it’s due to memory exhaustion because “top” shows consumption growing, the machine is starting to whirr and grind at the disk as the swap memory kicks in, and finally the machine drops off the planet.
I fire up my favourite debugger - Allinea DDT - it has a great built-in memory leak detector. This finds the most prominent leaks, and lets you fix them. I simply tick the “Memory Debugging” option when starting to debug the application.
After a few minutes, I pull up a chart to see how my memory is doing. This is a multi-process application using the MPI library - and one of the processes looks to be using more memory than the others.
Something big is stirring on Process 0
Allinea DDT remembers the stack trace for each allocation, so that you can pinpoint which part of your code is wasting memory. I click on the largest block in the bar chart (Packet::allocate on Process 0) to bring up the details.
Pinpointed!
That was handy - I know how large the allocation is and where it came from. I click again and Allinea DDT leaps to the source code of my application!
The line of allocation
Things are clear straight away.
Packet object “p” is allocating memory - and the memory is used but when “p” goes out of scope it never calls deallocate. That was clumsy but it is easy to fix now, thanks to Allinea DDT!
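The bookkeeping behind a report like this can be sketched in a few lines. What follows is a toy model, not Allinea DDT's implementation: it records a size and a call site for every live block and totals the outstanding bytes per site. The class, its method names and the "Solver::scratch" site are hypothetical; only Packet::allocate comes from the story above.

```python
from collections import defaultdict

class AllocationTracker:
    """Toy leak bookkeeping: remember size and call site for each live block."""

    def __init__(self):
        self._live = {}    # block id -> (size in bytes, call site)
        self._next_id = 0

    def allocate(self, size, site):
        block = self._next_id
        self._next_id += 1
        self._live[block] = (size, site)
        return block

    def deallocate(self, block):
        # Raises KeyError on a double free or an unknown block.
        del self._live[block]

    def live_bytes_by_site(self):
        totals = defaultdict(int)
        for size, site in self._live.values():
            totals[site] += size
        # Largest first, like picking the biggest bar in the chart.
        return sorted(totals.items(), key=lambda item: item[1], reverse=True)

tracker = AllocationTracker()
for _ in range(100):
    tracker.allocate(4096, "Packet::allocate")         # never released: the leak
    scratch = tracker.allocate(256, "Solver::scratch")
    tracker.deallocate(scratch)                        # released correctly

for site, size in tracker.live_bytes_by_site():
    print(f"{size:>8} bytes still allocated from {site}")
```

Anything that is released drops out of the table, so the sites left at the top are the ones worth clicking on first.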
Have you heard of the Poor Man's Profiler? Let me paraphrase for you:
"Poor man have no time to learn complicated new tools. Poor man need simulation results."
-- poormansprofiler.org, paraphrased by Allinea.
The idea behind the Poor Man's Profiler is that we often don't need to learn how to use a new, complicated tool to tell us exactly why our program is running so slowly.
If our code is running slowly enough to notice, we can usually just do this:
- Run the program under a debugger
- Stop after a while and look at the code that's being executed
- Repeat step 2 a dozen times or so
This is what most high-performance profilers do, but at a much higher resolution. And then they drown us in that data. Mostly, we don't need to know about every microsecond in a five hour job. If performance has become an issue, simple techniques like this can find it.
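The three steps above can be sketched without a debugger at all. This is a single-process illustration in Python, assuming CPython (it relies on sys._current_frames, which is an interpreter-specific helper); slow_function, sample and profile_demo are hypothetical names, and the busy loop simply stands in for whatever your program is really doing.

```python
import collections
import sys
import threading
import time

def slow_function(seconds):
    """Stand-in hot spot: spin for the given wall-clock time."""
    deadline = time.monotonic() + seconds
    total = 0
    while time.monotonic() < deadline:
        total += 1
    return total

def sample(thread, samples=12, interval=0.02):
    """Pause-and-look: note which function the thread is in, a dozen times."""
    seen = collections.Counter()
    for _ in range(samples):
        time.sleep(interval)
        frame = sys._current_frames().get(thread.ident)
        if frame is not None:  # the thread may already have finished
            seen[frame.f_code.co_name] += 1
    return seen

def profile_demo():
    worker = threading.Thread(target=slow_function, args=(1.0,))
    worker.start()
    counts = sample(worker)
    worker.join()
    return counts

if __name__ == "__main__":
    for name, hits in profile_demo().most_common():
        print(f"{hits:>3} samples in {name}")
```

A dozen samples is crude, but if one function shows up in nearly all of them, that is where the time is going.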
But there's a problem with the Poor Man's Profiler: it's a real pain to use it on parallel programs.
Allinea DDT: Enriching the Poor Man's Profiler
The Poor Man's philosophy is about using the tools we know to take us as far as they can. If you're running code on a cluster, you probably already have access to DDT - try "ddt" or "module load ddt" to see. If you're not in luck, you can just download a free version with a 30 day evaluation licence.
Here are the three steps to using DDT as an enriched, parallel version of the Poor Man's Profiler:
- Type: ddt -n 64 -start my-dir/slow-program.exe
- Press pause from time to time and look at the parallel stack view
- Enlightenment, joy and a deep sense of inner peace
One advantage in using an interactive, graphical debugger is that once we've seen where our code is slow, it can help us to understand why.
As an example, I tried this on a Stack Overflow question from a while back:
"The problem is the more processors I use the more time it takes. I thought it might be a problem with it taking more time to send the data to each slave than it takes to to just find the max" -- user1422751 on StackOverflow
First I compiled the code:
mpicc -g -O3 test/mpi-find-max.cpp -o mpi-find-max
And ran it under DDT:
ddt -n 4 -start test/mpi-find-max
Every time I paused the program, it stopped here:

Instantly, we see that process 0 is spending a lot of time initializing the array, on the call to rand(), and processes 1-3 are all waiting in MPI_Recv.
It seems like they're waiting for process 0; DDT's message queues can tell us for sure - just click on View->Message Queues:

So the reason the program doesn't scale is Amdahl's law - the serial portion, here the array initialization on process 0, already dominates the computation. You could argue that the author is parallelizing the wrong thing!
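For anyone who hasn't met Amdahl's law, it fits in one line: if a fraction s of the run time cannot be parallelized, the best speedup on N processes is 1 / (s + (1 - s) / N). A quick sketch follows; the 90% figure is a made-up illustration, not a measurement of the program above.

```python
def amdahl_speedup(serial_fraction, processes):
    """Best-case speedup when serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processes)

def speedup_limit(serial_fraction):
    """Upper bound on speedup, however many processes are used."""
    return 1.0 / serial_fraction

# Hypothetical case: 90% of the time goes on serial initialization.
for n in (1, 4, 64, 1024):
    print(f"{n:>5} processes: at most {amdahl_speedup(0.9, n):.3f}x faster")
print(f"limit: {speedup_limit(0.9):.3f}x")
```

Note that the law only gives an upper bound. It ignores communication cost entirely, which is why a real program like this one can actually get slower as processes are added.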
You may have noticed that the parallelization used isn't ideal either - the synchronous MPI_Send/MPI_Recv means that the data will be sent out to one process at a time. MPI_Scatter is a better way to do this. But in this case, that obvious implementation flaw wasn't what was making the program run slowly. Trying to optimize that would have been extremely frustrating, as it wouldn't have measurably improved wall-clock performance at all!
Avoiding this frustration is why everybody, from students to experts, should run their code through a profiler before making guesses about why it's running slowly. The beauty of using DDT for this is that we can do just that, without instrumenting our code or finding, installing and getting to grips with complex new software.
Like all good things, it just works: ddt -start.
Tell us how you get on.
One of the great things about working at Allinea Software is meeting developers with real problems and improving their lives. When you have a tool that transforms the daily report to the boss from “still fixing that bug” to “developing new code” - you’ve just made someone happy. In fact, you probably made at least two people happy.
Not every developer knows that debugging tools exist - and some of us may be pretty experienced at debugging, but still won’t know every trick in the box. There are those who say they do not need to use debuggers, but you and I know differently. There’s nothing macho about printf!
Today I’m going to highlight four features of Allinea DDT to deal with crashing applications - where the debugger detects the problem instantly.
The Crash
Your application dies unexpectedly, every time. Fire up Allinea DDT:
ddt -start ./myapp {arguments}
This starts the debugger and your code. Press the play button in the GUI and it will run your code until the crash.

Oops - ‘y’ has a crazy value because the loop variable is wrong.
Instant fix #1.
You get extra points for using Allinea DDT’s offline mode - that’s a one-liner to use and summarizes the important bits of the crash in a nice, browsable HTML log file:
ddt -offline log.html ./myapp {arguments}
firefox log.html

You can read more about offline mode in our previous blog.
How Does That Happen
Is the only explanation for a crash that something has changed when it should not? Constant variables changing a bit too much? Watchpoints are your answer: An instant stop as soon as a memory location changes contents.
Load the debugger and start your application. Run to a known good point by searching for the function or source file and click “run to here” in the source. Right click on the variable you think is changing - select “Add to Watches” - and press the run button.

It’s right - the value is changing there. My array index is wrong - would I ever have guessed that?
Instant fix #2.
The Vanishing Bug?
Ever seen a bug that bites for only certain problem sizes, or on certain machines? That just might be a bug with your heap memory usage. When you read or write beyond the end of an allocation of memory, bad things happen - occasionally and unpredictably. It’s occasional bugs that are harder to reproduce, and so harder to fix.
Imagine a tool that made the occasional bug happen every time? That stopped your program as soon as it happened? Allinea DDT’s advanced memory debugging capabilities can do just this! All you need to do is enable the feature, a simple checkbox, and let the default settings (“fast”) do the rest. The magic “Guard Pages” will pick up the problem as soon as your application reads beyond an allocation.
Instant fix #3.
The Thousand Process Segmentation Fault?
Is your application crashing regularly, but unpredictably, when running a large job? Simple - fire up Allinea DDT - as before - but with more processes! Debugging tens of thousands of processes is quick to start - quicker than brewing a nice cup of tea in most cases (see, you can reduce your caffeine intake with Allinea DDT).
Run your code and wait for the crash - it’s as quick and responsive as running on a laptop!

There it is - a null pointer! Beat that, printf!
Instant fix #4.
...
not just because Allinea DDT makes it so (we're too modest to claim that) but because random things tend to happen more often when you have highly parallel or multithreaded applications. The more parallelism you have, the more chance that something apparently random and bad will happen whilst you're watching it.
When bad things happen, you can fix them: you have a reproducer, pure gold for a programmer. When bad things don't happen, "works for me" is a lousy response to give to a user - the user is not "me" and has no instructions on how to become "me".
I'd like to recollect a bug story from whilst we were developing the scalability in Allinea DDT. The machine was booted, and we debugged 100,000 cores: it worked. It worked every time. Awesome - no-one had done that before!
Then, a curious thing happened: it just stopped working. The machine had been "up" for a few days and we couldn't debug those jobs any more.
A Schroedinbug is something that works until a bug is first observed and thereafter never works again. This machine was used by Quantum Physicists, so perhaps this was their joke?
The problem was, everything seemed to work every time at 1,000 cores. There was a bug and we couldn't reproduce it except at large scale. We were lucky - we had unlimited access to our scalable debugger and ran at the scale of the problem. Allinea DDT debugged itself. Debugging a debugger is not as implausible as it sounds - and eating your own dog food is easier when it tastes good (dog owners, do not try this at home).
Our favourite parallel stack view and variable comparisons in Allinea DDT led us to the problem: we were looking for the wrong processes. A system library that we relied on had a fatal flaw.
The process ID counter is independent on each physical machine in a supercomputer, increasing every time a process starts. It wraps around to 1 (actually the first available positive integer) when it reaches 32767. However, whenever this wrap-around happened, a bug in a system library was triggered: that library produced the wrong answers, which Allinea DDT then used.
With a bug as unlikely as this - what's the chance that the counter is wrapping round for any host in a job? I'll leave you to calculate how many debugging sessions at only 32 processes would be needed to witness the bug with even 50% probability. Run at 100,000 cores and it's not just likely - it's inevitable - that you see the problem.
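If you'd rather not do that sum by hand, here is one way to sketch it. It rests on a deliberately crude assumption that is mine, not a measurement: every process start independently has a 1 in 32767 chance of being the one that hits the wrap-around. Real systems won't behave this neatly, so treat the output as an order of magnitude.

```python
import math

PID_MAX = 32767  # wrap point quoted in the text

def p_wrap(process_starts, pid_max=PID_MAX):
    """Chance that at least one process start lands on the wrap-around.
    Assumes each start independently hits it with probability 1/pid_max."""
    return 1.0 - (1.0 - 1.0 / pid_max) ** process_starts

def sessions_for(probability, processes_per_session, pid_max=PID_MAX):
    """Sessions needed before the chance of seeing a wrap reaches probability."""
    miss_per_session = (1.0 - 1.0 / pid_max) ** processes_per_session
    return math.ceil(math.log(1.0 - probability) / math.log(miss_per_session))

print("sessions of 32 processes for a 50% chance:", sessions_for(0.5, 32))
print("chance in one 100,000 process session:", round(p_wrap(100_000), 3))
```

Under this model you would need roughly 700 small sessions to have even odds of seeing the bug, while a single session at 100,000 processes is more likely than not to trip over it.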
We could be forgiven for being a little smug - it's system library 0, Allinea DDT 1 - but there's a sting in the tail of this story.
One of our test cases failed recently - a test case for a function identifying the descendant processes of a process. It failed rarely, less than 1 in 1,000 runs. With Allinea DDT, it was simple enough to reproduce - we ran the test case many times - all in parallel - in a great big debug session. Sure enough the test case failed in seconds.
Under the debugger, an assertion in the test stopped the one process showing the issue at exactly the problem. The test case and its assertion were priceless. It pinpointed the problem - and narrowed down the search for a cause. Now we could use the debugger to explain why the problem had happened, not just where.
Two minutes later. The Eureka moment.
Guess what: Process ID wrap-round wasn't handled correctly. This time, it was our fault!
The moral of the story? We are all capable of creating bugs, however embarrassing, but solving them quickly is what matters. Having good test cases and a great debugger helps!

Power. In computing, with great power comes great cost. Whether you're a smartphone user trying to last the day on one charge, or the power-socket vampire at the airport for that laptop, you know that computers eat power. In high-performance computing, machines don't snack on power, they eat it, day in, day out, and the cost of that feast runs to millions of dollars per year for the largest systems.
It's this power challenge that drove the creation of multi-core processors, and the arrival of graphics cards (GPUs) for non-graphical computing in scientific and compute intensive applications. Slower clock speeds coupled with more parallelism are a way to get more flops (floating point operations per second) per watt - more battery life for the smartphone, and fewer dollars on the electric bill for your scientific application.
This sounds like a good thing, so where's the catch?
It's software. You can't just take a regular piece of software, put it on a multi-core processor - or a GPU-computing enabled workstation or supercomputer, and expect it to fly. If anything, an unaltered application would run slower if you just threw more, slower, cores at it.
Software has to be adapted to take advantage of this concurrency, this parallelism, in the computer hardware. Applications often need some pretty fundamental changes in order to use the extra cores or the GPU. Programming for one core can be challenging enough, but if you are coding for dozens, or thousands, or millions - it can be truly daunting, as the interactions between threads or processes are pretty hard to imagine.
A number of programming languages and models exist to manage the problem of developing software - from OpenACC (for GPUs), through OpenMP to MPI for supercomputers - but how do we fix the inevitable bugs that software development always brings?
Understanding bugs in an application with many threads and processes sounds quite hard, harder than a simple single thread, doesn't it?
Thankfully there is a solution! Allinea DDT is designed for debugging this kind of complexity. It's trusted by developers of the most complex parallel software and some pretty extreme systems too. It simplifies debugging - intelligently handling concurrency, letting you focus on the differences or the commonality in threads - so that you can really understand what is happening.
Allinea has created a series of webinars to explore how Allinea DDT can help you get to grips with your software development challenges for multi-core, GPU and distributed computing applications. You can register to attend the next webinars and access previous webinars at www.allinea.com/webinars.

Unfamiliar with Allinea DDT? Here’s an overview:
Allinea DDT is a comprehensive graphical debugger designed to simplify the complex task of debugging scalar and parallel code. Software developers worldwide find that there are numerous benefits in using our products. Some of these are:
- Easy to use and intuitive allowing you to quickly start using our software tools

- Unique and advanced features allowing you to debug software at a very fast speed
- Ability to debug scalar and parallel code on a single workstation as well as on supercomputer environments
- Supports latest CUDA environments
- GUI and command line interface
- Free 30 day trial

It's January 2012 and I'm sitting on a cross-Atlantic flight. Sweat is beading on my brow and it's nothing to do with the cabin temperature. I am not a happy bunny. I'm a very unhappy bunny and somebody is going to pay.
On this fateful day I'm on my way to Chicago to run an Allinea DDT training seminar at NCSA. Having exhausted the list of second-rate in-flight movies, I've started working through our training material in preparation.
It's all been going swimmingly, when suddenly I spot a problem - a major problem. It looks like there's a bug in our just-hit-the-website 3.1 release. The one I'm going to demo at a hands-on workshop. Tomorrow!
The problem is this: in 3.1 we added a fancy new feature called sparklines, which draws a tiny graph next to each variable in the interface, comparing its value across all the processes, instantly. Normally this is really useful, but today it looked... wrong:

The graphs are all corrupt! The graph next to my_rank should be a nice diagonal line, showing that process 0 has a rank of 0 and process 9 has a rank of 9! And p is the size of the job, that should be the same across all processes, but there's some kind of peak in it!
Somebody has broken the build. And tomorrow I'm going to be running a hands-on training session with it. Definitely. Not. Happy.
My first instinct is to raise a positively incandescent bug report. I draft one that starts with "WHO BROKE MY #$@! SPARKLINES?!?!!11", but there's no in-flight WiFi so submitting it has to wait. Instead, I anxiously poke around in the interface to find out how bad the damage is.
The first thing I do is hover my mouse over the sparkline to see the range of values reported:

Ok, so there's clearly some junk in there. 1126236160 is definitely not a valid process rank.
That raises the question as to what the values all actually are, so I click once on the sparkline, which brings up a quick cross-process comparison dialog that shows me the actual values across every process:

That's odd, why would just three processes have the same random value? Suddenly, this doesn't feel quite like a problem with Allinea DDT any more. I right-click and make a group out of the three processes with the incorrect value and it all drops into place:

I'm not looking at a bug in Allinea DDT at all - I'm looking at a bug in the training program. All three of these processes are merrily looping around and around overwriting memory. The type of the tables array is shown underneath the variables list - it's just a 12 by 12. Yet these processes are already writing to tables[0][112623621] and beyond! They've trashed the stack, including my_rank, p and a whole lot of other variables. It's a small miracle the program hasn't crashed yet!
I look back at the training material. Oh, yes, there we are. Exercise 1: why does the program crash or loop indefinitely when run with 10 processes?
Glancing around to see if anybody has noticed, I delete the outraged bug report from my drafts folder and insert a note into the training material:
"An excellent use of sparklines is spotting memory corruption, even with data on the stack or when memory debugging is turned off."
I glance back at the screen and somewhat grudgingly accept that it's actually pretty cool. The relief is palpable, but I still need a drink. Stewardess!
In June 2009, Allinea announced a collaboration with the French organization CEA. We agreed to scale Allinea DDT up to debug 32,000 simultaneous cores - at the time, this covered 98% of the systems in the TOP500.
Up until then Allinea DDT had used a simple, flat architecture similar to many traditional web servers of the time. Our GUI ran on a frontend node and received connections from each of the compute nodes, processed and visualized the data and sent out new commands.
To reach 32,000 processes it was clear that a change in Allinea DDT's architecture was required - after all, even high-traffic websites were using load-balancers to spread requests amongst a network of servers for processing. We had to do this in reverse - instead of having thousands of users sending requests to a single service, we had thousands of services sending data to a single user.
Allinea DDT already performed a lot of data aggregation before displaying this information to the user - clearly, even at moderate scales of hundreds of processes, you can't look at each one in turn. Way back in Allinea DDT 1.10 we started addressing this by introducing a parallel stack view, which shows you the broad picture of where your processes are - without overwhelming you with the specifics.
To implement this, Allinea DDT collects the stacks from every parallel process, but reduces them down into a manageable amount of information by merging common branches and tracking interesting metrics associated with each. This kind of reduce operation is a classic candidate for parallelization and conveniently enough Allinea DDT always finds itself running on a supercomputer powerful enough to do it in real time...a plan was formed.
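The merging step is easier to picture with a small sketch. This is an illustration of the general idea of merging common branches and counting processes on each, not Allinea DDT's actual data structures; merge_stacks, render and the init_array frame are hypothetical.

```python
def merge_stacks(stacks):
    """Merge per-process call stacks (outermost frame first) into one tree.
    Each node is {"count": processes passing through it, "children": {...}}."""
    root = {"count": 0, "children": {}}
    for stack in stacks:
        root["count"] += 1
        node = root
        for frame in stack:
            node = node["children"].setdefault(frame, {"count": 0, "children": {}})
            node["count"] += 1
    return root

def render(node, indent=0):
    """Flatten the tree into indented 'count  function' lines."""
    lines = []
    for name, child in sorted(node["children"].items()):
        lines.append(f"{' ' * indent}{child['count']:>4}  {name}")
        lines.extend(render(child, indent + 2))
    return lines

# One process busy in rand(), three waiting in MPI_Recv.
stacks = [["main", "init_array", "rand"]] + [["main", "MPI_Recv"]] * 3
tree = merge_stacks(stacks)
print("\n".join(render(tree)))
```

However many processes there are, the tree only grows with the number of distinct code paths, which is what keeps the view readable.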
Our solution was to make Allinea DDT's daemons running on the compute nodes assemble themselves into a tree, with the GUI talking only to the root node. All data from the nodes is distributed across the tree and aggregated through reduction operations at each level all the way up to the top. By parallelising data processing in this way, Allinea DDT is able to scale as O(log n) with the number of processes being debugged.
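To see where the log n comes from, here is a small model of a tree reduction. It runs sequentially and only counts the levels that data has to climb; the fan-out values and the use of simple counters are assumptions for illustration, not details of Allinea DDT's daemons.

```python
from collections import Counter

def tree_reduce(per_node, fan_out=2):
    """Combine per-node Counters level by level. Returns (total, levels climbed)."""
    level = list(per_node)
    if not level:
        return Counter(), 0
    levels = 0
    while len(level) > 1:
        # Each parent merges the results of up to fan_out children.
        level = [
            sum(level[i:i + fan_out], Counter())
            for i in range(0, len(level), fan_out)
        ]
        levels += 1
    return level[0], levels

# Hypothetical job: 1024 nodes, one stuck in rand, the rest in MPI_Recv.
nodes = [Counter({"rand": 1})] + [Counter({"MPI_Recv": 1}) for _ in range(1023)]
for fan_out in (2, 32):
    total, levels = tree_reduce(nodes, fan_out)
    print(f"fan-out {fan_out}: {levels} levels, totals {dict(total)}")
```

In a flat design the frontend handles every one of the 1024 results itself. In the tree, no node merges more than fan_out results and the answer arrives after a handful of levels.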
That was the theory, but could we make it work in practice?
Yes.
The first version with the new architecture was released after just six months, in December 2009 as Allinea DDT 2.5 and it exceeded all expectations.
Two years later, we released Allinea DDT 3.0 - the result of an intense collaboration with the US DOE's Oak Ridge National Laboratory that took Allinea DDT even further - to 225,000 simultaneous processes: the limit of the largest machine in the world at that time.
Now, with scale-by-default baked into our systems and development process, the sky is the limit. By 2013 we expect to see Allinea DDT running on systems with over 1,000,000 simultaneous processes. Now that will be something special!
We humans can survive in almost every environment on our planet and are beginning to step off it. We command fire hotter than the core of a star and freeze atoms at temperatures cooler than the depths of interstellar space. Not bad for squishy sacks of mostly water...

We can do all this because we love tools - we make them, we use them; they complete us.
At Allinea we spend all day, every day, crafting ever-better tools - in particular tools for working in the virtual environment inside computers. Our favourite ones are Allinea DDT, for finding and fixing problems on everything from a mobile ARM processor up to a 225,000 core supercomputer, and Allinea OPT, for effortlessly measuring and improving performance.
This week, we released Allinea DDT 3.1, the culmination of ten years of learning about and creating tools for parallel debugging.
Those ten years have taught us that not all tools are equal; the hallmark of a good tool is how it feels in your hands. Reliable. Solid. Well-balanced. Intuitive. The right tool for the job doesn’t try to engage you in a conversation, it fits into your grasp and extends your capabilities effortlessly.
We’re not all the way there yet, but each release takes us another few steps towards this holy grail. To see what we mean, take three of the changes we’ve brought in with Allinea DDT 3.1 to make it a better, more intuitive tool:
1. Effortless Offline Debugging: Improving on Print Statements
The best interface is one you never have to use - instead of this:
$ mpiexec -np 1024 my-program arg1 arg2
Just write this:
$ ddt -offline output.html -np 1024 my-program arg1 arg2
Starting in Allinea DDT 3.1, we’ve added an offline mode that bypasses our interactive, graphical interface entirely. Any errors will be automatically collected, aggregated and presented in a beautiful HTML report when the job is done. All of Allinea DDT’s features are present - memory debugging, parallel crash stack traces, parallel variable comparisons - without ever needing to use them directly.

We didn’t stop there; we also took on the task of parallelizing and scaling print statements - instead of editing your source code to print the values of, say, i and xarr(4,0) and recompiling, just run your program like this:
$ ddt -offline log.html -trace-at 'hello.f90:49,tag,someints(tag)' -np 10 ./hello
Allinea DDT adds a virtual print statement at line 49 of hello.f90 and gathers the values of tag and someints(tag) every time it is hit. Better than that, it does this across all the processes in the program and compares their values for you:

For many classes of bug, this is exactly what you need - especially if the alternative is waiting a week for a full-cluster interactive debugging slot!
2. Zero-click Comparison Across Processes
Those little comparison charts - sparklines - are now automatically calculated for every variable you see in Allinea DDT’s graphical interface, too:

Of course, you can still click on any variable to pull up a list of by-value groupings and drill down to individual processes and threads.
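A by-value grouping is a simple idea, and a sketch shows why it is so effective at exposing the odd ones out. The function name and the sample values below are hypothetical; this is the general technique, not Allinea DDT's code.

```python
from collections import defaultdict

def group_by_value(values):
    """values[rank] is the variable's value on that rank.
    Returns {value: [ranks]} with the largest group first."""
    groups = defaultdict(list)
    for rank, value in enumerate(values):
        groups[value].append(rank)
    return dict(sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True))

# Hypothetical 10-process job where three ranks hold a corrupted copy of p.
p = [10, 10, 10, 1126236160, 10, 10, 1126236160, 10, 1126236160, 10]

for value, ranks in group_by_value(p).items():
    print(f"p = {value}: ranks {ranks}")
```

With thousands of processes the list of ranks gets long, but the number of groups usually stays small, and it is the small, unexpected groups that deserve a closer look.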
3. Always-On Static Analysis
This idea that a tool should constantly provide subtle, unobtrusive feedback was also the driver behind adding static code analysis to Allinea DDT:

Any code you look at in Allinea DDT is automatically checked with an appropriate static analysis tool and the warnings or errors seamlessly integrated into your view of the source code. There’s nothing to configure or learn, it just works - always and everywhere.
And much more
The best tools can be used in almost any environment, and we’ve been busy building out Allinea DDT’s support for more and more programming paradigms including UPC and co-array Fortran.
We’ll unpack these and many more features over the coming weeks, including the new GPU activity displays and more details on getting the most out of Allinea DDT’s new offline mode. Stay tuned and tool up!
Useful links:
Download Allinea DDT
Get a free trial of Allinea DDT