Allinea MAP helps iVEC achieve its mission: increasing computational-science literacy
Dr. Rebecca Hartman-Baker of iVEC in Australia has blogged about her mission to increase computational-science literacy, and how Allinea MAP has become her secret weapon of choice. Read Rebecca's blog below.
I count myself among the luckiest people in the world, because my day job involves introducing people to some of the coolest things ever. I’m a computational scientist working at iVEC, an up-and-coming supercomputing center in Australia, with the mission to increase computational-science literacy and create a field of highly qualified users of high-performance computing resources.
My center recruited outstanding people from around the world to build up their capabilities. We have a mandate to develop the local talent, which is plentiful but not yet experienced with HPC. To support our mission, I decided to organize a team of undergraduates to compete in the Student Cluster Competition (SCC), to be held in November at SC13.
The SCC is an annual event in which teams of six undergraduates design, build, and run applications on a small supercomputer in a 48-hour non-stop competition. Teams are matched with hardware vendors, who supply them with the hardware to build a cluster, subject to a power limit of 3,120 watts. The teams train to build and run real scientific applications, such as this year's codes, GraphLab, NEMO5, and WRF, on problem sets provided at the start of the event. A twist this year is the introduction of a mystery application, which will be revealed only at the competition.
The competition is open to teams worldwide, but we are the first-ever team from Down Under. We started training in March, at the beginning of the Australian school year, for the competition, which will take place during the students' second-semester exam week.
Given the limited amount of time that I had to hone the team’s skills, what could I do to expose them to what they would need for the competition, while eliminating as much unnecessary pain and suffering as possible?
To get good throughput in the competition, the team would need to develop an understanding of the codes and speed them up if possible. Recalling my own experience as a student learning to use a supercomputer and to program in parallel, I soon determined that they would need a profiling tool to develop a deeper understanding of the codes. But what tool would be most suitable? It needed to be simple to use, but powerful enough to supply useful information. It needed to be non-invasive, both in terms of not requiring substantial changes to the make process or explicit insertion of instrumentation, and also having a low overhead. And it needed to be something that was not platform-specific, but could run on any cluster. The only profiler that met all the requirements? Allinea MAP, of course!
So I pitched this opportunity to my treasured colleagues at Allinea Software, and they were kind enough to provide a unified Allinea MAP / Allinea DDT license for the team. We’ve been making good use of it ever since.
We’ve used Allinea MAP to profile all our codes, and come up with some very useful insights, but since I don’t want to give away our team secrets before the competition, I’ll show you some interesting results from running the High-Performance Computing Challenge (HPCC) benchmark suite, which includes HPL, the benchmark used to rank the Top500 list. The technique is the same, only the details are different.
First, a little about HPCC. It consists of seven benchmark tests (HPL, DGEMM, STREAM, PTRANS, RandomAccess, FFT, and b_eff) that measure various properties of the machine. We have to run this code as part of the competition, so understanding it well enough to potentially optimize it is crucial.
The most notable of the benchmarks, High-Performance Linpack (HPL), is used to determine the FLOP/s rating of a machine. It’s the last benchmark in the suite, and it takes more than half the total runtime. Figure 1 shows the default opening screen from Allinea MAP; you can see intuitively where each benchmark begins and ends, and the last half of the timeline is the HPL part of the run.
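As a rough aside (my own sketch, not from the competition materials), the FLOP/s figure HPL reports comes from dividing a fixed operation count for the LU solve by the measured runtime. A minimal Python version, assuming the standard 2/3·n³ + 2·n² operation count that HPL uses:

```python
def hpl_gflops(n, seconds):
    """Approximate HPL Gflop/s for an n-by-n problem solved in `seconds`,
    using the operation count HPL reports: 2/3*n^3 + 2*n^2."""
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return flops / seconds / 1e9

# e.g. a 40,000 x 40,000 system solved in 100 s:
print(round(hpl_gflops(40_000, 100.0), 1))  # -> 426.7
```

Because the operation count is fixed by the problem size, any runtime you shave off the solve translates directly into a higher reported FLOP/s score.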
Figure 1: Allinea MAP opening screen.
We wondered how the code was behaving in the HPL benchmark, so we focused on the latter part of the timeline in Figure 2.
Figure 2: Focusing on the HPL portion of the HPCC benchmark.
From the italicized numbers we can see that MPI calls are minimal in this region of the code, so here the focus must be on the CPU performance statistics.
Figure 3: CPU metrics on the HPL portion of the HPCC benchmark.
We swapped from the default metrics to the CPU metrics to get a better feel for what’s going on with the CPU. In Figure 3, it looks like a large portion of the instructions are floating-point vector operations, averaging about 60%, but there is a fairly substantial spread of performance across processes. This is something we could probably improve by using a more optimized BLAS (in this particular instance we were using ATLAS instead of MKL).
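For readers unfamiliar with what the BLAS is actually doing here: DGEMM computes C ← αAB + βC, and the difference between an unoptimized and an optimized BLAS is essentially the difference between a textbook triple loop and cache-blocked, vectorized kernels performing the same operation. A naive pure-Python sketch of the operation itself (my own illustration, not HPCC code):

```python
def naive_dgemm(alpha, A, B, beta, C):
    """Textbook C <- alpha*A*B + beta*C, the operation DGEMM performs.
    Optimized BLAS libraries (e.g. MKL) implement exactly this semantics
    with blocking and SIMD, which is where the speedup comes from."""
    n, k = len(A), len(B)
    m = len(B[0])
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            C[i][j] = alpha * acc + beta * C[i][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[0.0, 0.0], [0.0, 0.0]]
print(naive_dgemm(1.0, A, B, 0.0, C))  # -> [[19.0, 22.0], [43.0, 50.0]]
```

Since the result is identical regardless of the BLAS used, swapping in a faster library is a pure win for the floating-point throughput that MAP's CPU metrics measure.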
We were also intrigued by the large spread in MPI call durations beginning about one-third of the way through, as highlighted in Figure 4. What could that possibly be? Looking at the MPI communication metrics in Figure 5, it still looks peculiar: there are hardly any calls and almost no data being passed. What in the world is going on?
Figure 4: A suspicious-looking MPI call duration.
The parallel stack view to the rescue! Expanding the parallel stack view of HPCC_SingleDGEMM in Figure 6 reveals a very lengthy MPI_Bcast, along with the section of code that is the culprit. In HPCC_SingleDGEMM, one MPI process is (pseudo)randomly chosen to perform the DGEMM (lines 62-67 of onecpu.c). The identity of the chosen process is broadcast to all MPI processes, and the chosen process then performs the DGEMM test (HPCC_TestDGEMM). Meanwhile, all the other MPI processes move on to the MPI_Bcast on the next line of code and await the DGEMM result. As time progresses, we can see the time spent waiting rise in that very interesting wedge shape.
In a scientific application code, this pattern would suggest a load imbalance. Indeed, there is a load imbalance in this code: one process is doing all the work. But for a deliberately serial test, not much can be done about that!
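The wedge shape can be reasoned about with a toy model (my own sketch, not HPCC code): every non-root rank enters the MPI_Bcast at roughly the same moment, so at any later sample time its accumulated time inside MPI grows linearly until the chosen rank finishes its DGEMM and the broadcast completes:

```python
def mpi_wait_at(sample_t, enter_t, dgemm_seconds):
    """Time a waiting rank has spent inside MPI_Bcast at wall-clock time
    `sample_t`, in the toy model: the rank enters the call at `enter_t`
    and is released once the single working rank finishes its DGEMM
    after `dgemm_seconds`."""
    release_t = enter_t + dgemm_seconds
    return max(0.0, min(sample_t, release_t) - enter_t)

# Sampled over time, per-rank MPI time ramps up linearly -- the wedge
# shape seen in the profiler timeline -- then plateaus once the
# broadcast completes.
samples = [mpi_wait_at(t, enter_t=1.0, dgemm_seconds=5.0) for t in range(9)]
print(samples)  # -> [0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 5.0]
```

This is also why the pattern looks alarming at first glance: the profiler is correctly reporting ever-growing MPI time, even though no data is moving and no real communication problem exists.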
Figure 5: MPI metrics mode highlighting the suspicious-looking MPI call duration.
Figure 6: Finding the culprit -- a serial test for which the uninvolved processes wait in an MPI_Bcast call.
We found this exercise enlightening – fascinating enough that I even demonstrated the results to some of my colleagues.
The team has more Allinea MAP results on the other applications, but those are top secret…at least for now. Suffice it to say we’re glad to have Allinea MAP, a lightweight tool for heavy problems, as a weapon in our arsenal.
Rebecca recently spoke to InsideHPC about iVEC's use of Allinea MAP. Listen here.