Characterization of HPC codes and problems

We’re going to look at four different runs of CP2K and HPL at 256 processes, both with and without problems.

Each case has its own unique pattern visible in the Allinea Performance Reports.

We’ll study them in turn to understand how these reports can be used to characterize applications and diagnose abnormal behavior.

This Allinea Performance Report is for CP2K, a molecular dynamics simulation code.

Allinea Performance Reports - cp2k 256p

The proportion of time spent waiting for memory accesses is rather high; improving this is one optimization target, and a source-level profiler (naturally, Allinea MAP) can help.

Equally, the 50 Mb/s point-to-point MPI communication rate suggests either lots of small inefficient messages or significant load imbalance.

More important than either of these, though, is the memory breakdown.

This run only used 7% of the peak node memory!

By far the easiest way to improve efficiency and use fewer CPU hours is to simply run such jobs at a lower scale, which will also reduce the amount of time spent in MPI communications. An 8x reduction in MPI processes down to just 32 would be a good start.
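The scale-down arithmetic can be sketched in a few lines. This is a rough estimate, not something the report computes: it assumes the job's total memory footprint stays constant, so halving the process count doubles the fraction of node memory in use. The function name and the 60% headroom target are hypothetical choices for illustration.

```python
def suggested_process_count(current_procs, peak_mem_fraction, target_fraction=0.6):
    """Estimate how far a job can be scaled down before memory becomes tight.

    Assumption: total memory use is constant, so the fraction of node memory
    used doubles each time the process count is halved.
    Returns (new process count, reduction factor, estimated memory fraction).
    """
    factor = 1
    # Keep halving while the estimated usage stays within the headroom target.
    while (peak_mem_fraction * factor * 2 <= target_fraction
           and current_procs // (factor * 2) >= 1):
        factor *= 2
    return current_procs // factor, factor, peak_mem_fraction * factor


# The CP2K run above: 256 processes using 7% of peak node memory.
procs, factor, estimate = suggested_process_count(256, 0.07)
print(procs, factor, round(estimate, 2))
```

Under these assumptions the 256-process run comes out at 32 processes, an 8x reduction, with an estimated 56% of node memory in use. A real job should still be checked at the smaller scale, since memory use rarely divides perfectly evenly.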

This Linpack benchmark run illustrates why increasing numbers of HPC sites believe it no longer represents a useful real-world measure of system performance.

Allinea Performance Reports - hpl_256p

Unlike the complicated real world of molecular dynamics, it’s possible to achieve extremely high speedup factors with linear algebra.

Here, we see an excellent 92% of time being spent in application code.

So what can we learn from this code?

Well, despite being highly optimized, this code still spends almost half its time waiting for main memory and only a quarter of its time using the maximum-bandwidth SSE/SSE2/AVX instructions.

Vendor-specific implementations of HPL often achieve higher values here, and they’ve done so by carefully examining the key loops and reordering or in some cases replacing parts of them with hand-written assembly.

Compiler auto-vectorization isn’t nearly as good as people generally like to believe!

What happened here? The results have fallen off a cliff!

Allinea Performance Reports - hpl_misconfigured

Yet the same code was run on the same machine!

Let’s dig into the details!

In the MPI section we see that all the time is being spent in collective calls, instead of the point-to-point calls seen in the previous report. And the effective transfer rate is 0 bytes/s!

This suggests many processes are waiting for long periods in collective operations such as barriers. Why could that be? A severe imbalance in the workload?

The memory section holds another clue.

The mean per-process memory usage is almost four times lower than the peak per-process usage!

This suggests the majority of the processes are working on substantially smaller data sets than others. Did we make a mistake configuring the input file for the run? A review of the input settings (helpfully captured in the “Notes” field) shows the block sizes are set up for 64 processes. HPL has happily computed on only 64 of our 256 allocated processes, silently leaving the others idling in MPI_Finalize!
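The mean-versus-peak reasoning above can be turned into a quick sanity check. The sketch below is hypothetical: it assumes idle processes hold almost no memory, so the ratio of mean to peak per-process memory approximates the fraction of processes doing real work. The memory figures in the example are placeholders, not values from the report. The second helper assumes the mismatch comes from HPL's P x Q process grid (the Ps and Qs entries in HPL.dat), which is one plausible reading of the misconfiguration described here.

```python
def estimate_active_processes(total_procs, mean_mem, peak_mem):
    """Estimate how many processes are doing real work.

    Assumption: idle processes use close to zero memory, so
    mean / peak is roughly the fraction of active processes.
    Returns (estimated active process count, active fraction).
    """
    if peak_mem <= 0:
        raise ValueError("peak_mem must be positive")
    fraction = mean_mem / peak_mem
    return round(total_procs * fraction), fraction


def grid_matches_allocation(grid_rows, grid_cols, allocated_procs):
    """Check that a P x Q process grid uses every allocated process."""
    return grid_rows * grid_cols == allocated_procs


# Placeholder figures: mean usage four times lower than peak, on 256 processes.
active, fraction = estimate_active_processes(256, mean_mem=100.0, peak_mem=400.0)
print(active, fraction)

# An 8 x 8 grid covers only 64 of 256 allocated processes.
print(grid_matches_allocation(8, 8, 256))
```

Because idle processes do use a little memory, a real report shows a ratio slightly above the ideal one, which matches the "almost four times lower" figure. Running the grid check before submitting the job would catch this class of mistake without spending any CPU hours.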

Simply looking at the time of the run and the output from HPL could have led us to believe that we’d reached the scaling limits or to blame this on “system networking issues”.

By looking inside the application with an Allinea Performance Report, we can rapidly deduce the true failure mode - user error!

The only thing more complex than a mature HPC code is its build system.

Allinea Performance Reports - cp2k_miscompiled

There are a lot of opportunities for poorly-optimized code to make its way into a production executable, from miscompiled libraries to runtime bounds-checking. In this example, we see how one poorly-compiled module affects a performance report.

The summary instantly shows something has changed, and in the CPU breakdown we can see the effect of the compiler flags directly.

Although only one module of the code was compiled incorrectly, the amount of time spent in vectorized instructions has dropped by two thirds and the amount of time spent in memory accesses has increased to 69% of the total.

As Allinea Performance Reports are generated in both text and HTML formats, many people choose to run them as part of their regular regression tests during code development.

It’s very easy for a code change to accidentally make a key loop unvectorizable or to have unforeseen implications for cache behavior, but the metrics in an Allinea Performance Report flag up and identify any changes immediately.
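A regression check of this kind might look like the sketch below. It makes no assumptions about the report file format: it takes metrics that have already been extracted into dictionaries, and the metric names, values and 5-point tolerance are all hypothetical. Extracting the numbers from the text report is left to whatever parsing suits your setup.

```python
def find_regressions(baseline, current, tolerance=5.0):
    """Return metrics that moved by more than `tolerance` percentage points.

    Both arguments map a metric name to a percentage of run time.
    Metrics missing from `current` are skipped rather than flagged.
    """
    changes = {}
    for name, base_value in baseline.items():
        if name not in current:
            continue
        delta = current[name] - base_value
        if abs(delta) > tolerance:
            changes[name] = delta
    return changes


# Hypothetical metrics from a known-good build and a new build.
baseline = {"cpu_vectorized_pct": 30.0, "cpu_memory_pct": 50.0, "mpi_pct": 20.0}
current = {"cpu_vectorized_pct": 10.0, "cpu_memory_pct": 69.0, "mpi_pct": 21.0}

for metric, delta in find_regressions(baseline, current).items():
    print(f"{metric}: {delta:+.1f} points")
```

With these placeholder values, the vectorization and memory-access metrics are flagged while the small change in MPI time stays within tolerance. A test harness could fail the build whenever the returned dictionary is non-empty.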

How do your applications match the hardware they’re running on? Are they configured optimally? Generate an Allinea Performance Report and find out!