Software developers often work unusual hours – sometimes out of choice, but in HPC there can be a different reason: the machine.
Getting exclusive access, at a reasonable time of day, to a substantial portion of an HPC system to fix a bug that occurs in production-sized jobs is often a waiting game, or an exercise in persuading the machine owner.
For that reason, some developers think a debugger is not for them – they don’t want to be there when their application is finally scheduled. Instead, they fill their code with print statements, because they are guaranteed that whenever their job runs, it will output the results, and they’ll be able to figure out what happened later. This can work – if the bug is predictable, or if the developer has near-precise intuition about the problem – but as core counts increase, the volume of data generated soon becomes overwhelming.
In practice, most time-consuming bugs are far harder to find than this anyway. Memory corruption, race conditions, MPI deadlocks and real crashes are hard to manage without being there to watch what happens. Print statements just don’t help: more context is needed when errors occur, such as all the relevant variables, or the whole set of stack traces across the entire job.
Thankfully, a new feature in Allinea DDT 3.1 solves the problem: there is now an easy way to run jobs under the control of a debugger without you being there to drive it. This brings the kind of exact problem identification that only a debugger can provide, with the flexibility of letting the application run whilst you sleep.
Read more by downloading this white paper using the form opposite.