
Understanding I/O behavior: Many HPC workloads have an increasingly significant I/O component

However, the performance of I/O systems varies wildly, both between clusters and over time. Rapidly identifying and diagnosing I/O bottlenecks is no longer something that can be done once, when the system is installed. We’re going to look at how Allinea Performance Reports shows the I/O behavior of the MADbench2 benchmark, which is based on real-world cosmic microwave background (CMB) out-of-core processing workloads.

We start with the server under a small amount of load: 4 processes reading and writing on a well-specified 12-core system.

[Performance report: server_4p]

In the I/O breakdown section, we see that one third of the time is spent reading, at around 1 Gb/s. That’s a pretty high I/O figure for a single server!

A glance at the Memory section suggests why - only 14% of the total memory is in use by the application, and each process is only using a few hundred megabytes, which means there is more than enough disk cache to hold all the data being read.

Two thirds of the time is spent in write operations, which run at a fairly average 340 Mb/s. For a single node that’s pretty good performance, but many HPC clusters with networked filesystems could do better.

Let’s turn up the heat and see what happens when things get more challenging.

The most obvious difference at 9 processes is that now a substantial chunk of the time is spent in MPI communication. Why is that?

[Performance report: server_9p]

A glance at the MPI breakdown shows that all that time is being spent in collective calls, with a truly minuscule transfer rate of around 40 bytes per second. Such low rates usually mean that MPI_Barrier calls are leaving many processes waiting - a clear sign of workload imbalance.
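To make that failure pattern concrete, here is a minimal sketch - our own illustration, not MADbench2 source - of how uneven per-rank work turns into time apparently spent in a collective: the fast ranks simply sit in MPI_Barrier until the slowest one catches up.

```c
/* imbalance.c - minimal sketch (not MADbench2 code) of how uneven
 * per-rank work shows up as time spent in a collective call. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Simulate imbalanced I/O: higher ranks "write" for longer. */
    sleep((unsigned)rank);        /* stand-in for a slow, contended write */

    double t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);  /* fast ranks burn their time here */
    double waited = MPI_Wtime() - t0;

    printf("rank %d of %d waited %.2f s in MPI_Barrier\n", rank, size, waited);
    MPI_Finalize();
    return 0;
}
```

A profiler attributes those wasted seconds to MPI, even though the real culprit is the slow I/O happening before the barrier.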

Looking at the I/O breakdown, we can guess where that imbalance has come from. The amount of time spent in writes has increased, because the effective write rate has dropped by a factor of 6 to just 70 Mb/s. Given that the same code is running, this can best be explained by heavy contention for the filesystem. Contention is a problem that affects many I/O-heavy HPC workloads, and it isn’t always clear in advance which communication patterns will cause it. A performance report will always flag up such problems quickly.

To improve the code on this system we should nominate a smaller group of the MPI processes to actually perform disk writes, avoiding this contention.
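Here is one hedged sketch of that idea, assuming each rank holds a fixed-size buffer to dump: split the ranks into small groups, gather each group’s data onto a single nominated writer, and let only those writers touch the filesystem. (The group size and file layout below are illustrative choices, not MADbench2’s; MPI-IO collective buffering can achieve a similar effect.)

```c
/* aggregate_write.c - sketch of "nominate a few writer ranks": every
 * GROUP ranks funnel their data to one aggregator, and only the
 * aggregators perform disk writes, reducing filesystem contention. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (1 << 20)   /* 1 MiB per rank - an assumed buffer size */
#define GROUP 4           /* ranks per nominated writer - tunable    */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *chunk = malloc(CHUNK);
    memset(chunk, 'a' + rank % 26, CHUNK);

    /* Split COMM_WORLD into groups of GROUP ranks; local rank 0 of
     * each group becomes that group's nominated writer. */
    MPI_Comm group;
    MPI_Comm_split(MPI_COMM_WORLD, rank / GROUP, rank, &group);

    int grank, gsize;
    MPI_Comm_rank(group, &grank);
    MPI_Comm_size(group, &gsize);

    char *gathered = (grank == 0) ? malloc((size_t)gsize * CHUNK) : NULL;

    /* Funnel the group's chunks to its writer instead of having
     * every process hit the disk at once. */
    MPI_Gather(chunk, CHUNK, MPI_CHAR,
               gathered, CHUNK, MPI_CHAR, 0, group);

    if (grank == 0) {
        char name[64];
        snprintf(name, sizeof name, "out.%d.dat", rank / GROUP);
        FILE *f = fopen(name, "wb");
        if (f) {
            fwrite(gathered, 1, (size_t)gsize * CHUNK, f);
            fclose(f);
        }
        free(gathered);
    }

    free(chunk);
    MPI_Comm_free(&group);
    MPI_Finalize();
    return 0;
}
```

With 16 processes and GROUP set to 4, only 4 processes compete for the disk at any time; the rest pay for a single fast MPI_Gather instead of a slow, contended write.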

Adding just 7 more processes, for a total of 16 on this 12-core (24 with hyperthreading) system, causes real havoc.

[Performance report: server_16p]

Now less than 5% of the total time is spent computing; the rest is spent writing to disk or waiting for other processes to finish writing to disk!

As in the previous example, the collective MPI transfer rate indicates load imbalance, and the I/O breakdown shows that catastrophic write performance now dominates the runtime.

The write rate has dropped from 340 Mb/s with 4 processes to just 7 Mb/s with 16.

Interestingly, the read performance hasn’t suffered nearly as much - this is again typical during high contention, as data reads can trivially be served from cache, but all data writes must eventually hit the disk and may do so with inefficient orderings.
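A hedged illustration of that asymmetry, assuming a Linux-like page cache (our own example, not MADbench2 code): a buffered write() typically returns as soon as the data lands in memory, while fsync() has to wait for the physical disk.

```c
/* cache_vs_disk.c - a buffered write returns once the data is in the
 * page cache; fsync() must wait for it to actually reach the disk. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    const size_t n = 64u << 20;         /* 64 MiB test buffer */
    char *buf = malloc(n);
    memset(buf, 'x', n);

    int fd = open("probe.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    double t0 = now();
    if (write(fd, buf, n) < 0)          /* usually absorbed by the cache */
        perror("write");
    double t1 = now();
    fsync(fd);                          /* must actually reach the disk  */
    double t2 = now();

    printf("write(): %.3f s   fsync(): %.3f s\n", t1 - t0, t2 - t1);
    close(fd);
    free(buf);
    return 0;
}
```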

Note also that the per-process read rate has dropped from 1 Gb/s with 4 processes to 270 Mb/s with 16.

This suggests that the system is maxed out at an aggregate read rate of around 4 Gb/s from the disk cache (16 × 270 Mb/s ≈ 4.3 Gb/s), which seems reasonable.

Different systems can have dramatically different behavior.

[Performance report: laptop_4p]

Here we run the same code with the same configuration as the server_4p example, this time on a 4-core laptop equipped with an SSD.

Unlike in the high-contention server examples, which used HDDs, the MPI breakdown shows that load imbalance is much less of a problem on this system.

That’s expected when writing to an SSD, as high-contention writing to a spinning drive is dominated by access times and the associated delays.

With an SSD the same failure mode doesn’t apply, as all writes have the same low access time.

The consumer-grade SSD is clearly being saturated, however, with a per-process write rate of 88 Mb/s - across the 4 processes, an aggregate of around 350 Mb/s for the device.

The best way to further improve performance on this system is to add more nodes and spread the I/O out.

We can also see that money would be better spent on high-bandwidth I/O and not on a faster CPU!

How do your applications match the hardware they’re running on? Are they configured optimally? Generate an Allinea Performance Report and find out!