Debugging and Optimizing CUDA and OpenACC
Allinea Forge is a tool suite for developing, debugging and optimizing CUDA and OpenACC codes on NVIDIA GPUs, from GeForce to Tesla and the Kepler K80. Forge includes Allinea DDT, a parallel, multi-process CUDA debugger, and Allinea MAP, a profiler.
The Allinea DDT debugger enables:
- Setting breakpoints in CUDA threads at specific lines of CUDA or OpenACC code.
- Mixed CPU and GPU debugging, even for multi-process codes, all in the same debugging session.
- Viewing CPU and GPU threads, with a thread-consolidating parallel stack view that simplifies what you see and highlights the differences.
- Dynamic parallelism (often used for recursive CUDA kernels).
- Visibility of all memory types, including register, shared (block) and global memory, as well as unified virtual addressing and GPUDirect.
- Stepping warps, blocks and entire kernels.
- Debugging CUDA core dumps with CUDA 7 and above.
- Debugging multiple GPUs simultaneously.
- Memory debugging for access errors, and memory-leak reporting for global memory.
The Allinea MAP profiler enables:
- Viewing memory transfers and the global memory in use.
- Viewing GPU temperature as your job progresses.
- Line-level profiling of CPU (host) code (line-level GPU profiling is not supported).
- Viewing and analyzing the time your CPU threads spend waiting for CUDA kernels to complete.
Allinea DDT and Allinea MAP have support for the combinations that matter to you:
- A large range of supported compilers:
  - CUDA C and C++ from the NVIDIA compilers.
  - CUDA Fortran, F90 and OpenACC from PGI.
  - The Cray OpenACC compiler.
  - Inline PTX.
- The latest CUDA toolkits: CUDA 6.5, 7, 7.5 and 8.
- HPC clusters with CUDA (e.g. MPI applications).
CUDA C, C++ and Fortran, and OpenACC are fully supported by Allinea DDT. The world's biggest users of CUDA and OpenACC debug their applications with Allinea DDT, including on Oak Ridge National Laboratory's Titan and CSCS's Piz Daint, two of the largest GPU systems in the world.
Let’s start with some tips on how to use Allinea DDT for CUDA - or read the list of CUDA features.
Set a breakpoint
Just like in a CPU debugger, you can set a breakpoint at any line of CUDA source code. Whenever a block of CUDA threads reaches that line, the debugger pauses the whole application.
GPUs follow a massively parallel SIMT model, which means you can have thousands of threads active at any point in time. Use the debugger to select a CUDA thread by its index, or to select a thread that is on a particular line of code.
Stepping a thread is a great way to watch how a kernel progresses. CUDA GPUs differ slightly from CPUs in that they execute threads in groups: other GPU threads in the same "warp" (usually 32 threads) will progress at the same time. Did the thread move through the code as you expected?
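A minimal kernel like the following gives you something concrete to break into and step through; the names are illustrative and error checking is omitted for brevity. Set a breakpoint on the line inside the `if` and you can watch a whole warp of threads arrive there together.

```cuda
#include <cstdio>

// Each thread squares one element of the input array.
__global__ void square(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        out[i] = in[i] * in[i];                     // breakpoint here
}

int main()
{
    const int n = 1000;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));     // managed memory, CUDA 6+
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    // 256 threads per block; round the grid size up to cover all n elements.
    square<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[3] = %g\n", out[3]);  // expect 9
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Selecting thread index 3 in the debugger and stepping it shows its 31 warp-mates moving in lockstep with it.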
Each CUDA thread has its own register variables but shares other memory with threads in the same block, the whole device or even the host.
Whatever type of memory your data resides in, you need to check that it holds what you expect. Single values are easily seen, but the really neat trick is to visualize array data, or to filter it for unusual values.
Perhaps you want to bring up a second visualization to compare the GPU data with its CPU copy as a sanity check?
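A kernel that touches all three memory classes makes the distinction easy to explore in the debugger. This sketch is illustrative (names are my own, error checking omitted): each thread has a register variable, the block shares a shared-memory tile, and the result lands in global memory.

```cuda
#include <cstdio>

#define BLOCK 128

// Per-block sum via a shared-memory tree reduction.
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK];              // shared (block) memory
    int i = blockIdx.x * BLOCK + threadIdx.x;  // register variable
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Halve the active range each step; inspect `stride` and the
    // contents of `tile` in the debugger while stepping.
    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];             // global memory
}

int main()
{
    const int n = 256, blocks = n / BLOCK;     // 2 blocks
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    blockSum<<<blocks, BLOCK>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("block sums: %g %g\n", out[0], out[1]);  // 128 128
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Pausing inside the reduction loop, you can view `i` (a register), `tile` (shared) and `out` (global) side by side.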
Verify memory usage
Before you start your application inside Allinea DDT, tick the option to debug CUDA memory usage. It's easy to write CUDA code that reads beyond the end of an array: not all arrays are multiples of the warp size, but a frequent error is to assume they are.
Those errors are not always fatal, but they cause non-deterministic behavior, which can lead to failures at unexpected times. With memory debugging enabled, the debugger spots them, so you can fix them before trouble happens.
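The classic version of this bug is launching more threads than there are elements and forgetting the bounds guard. A hedged sketch (illustrative names, error checking omitted) of the correct pattern:

```cuda
#include <cstdio>

__global__ void scale(float *a, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // n = 1000 is not a multiple of the block size (256), so the last
    // block contains 24 threads whose index is >= n. Without this guard
    // they would read and write past the end of the array, the kind of
    // access error CUDA memory debugging is designed to catch.
    if (i < n)
        a[i] *= s;
}

int main()
{
    const int n = 1000;
    float *a;
    cudaMallocManaged(&a, n * sizeof(float));
    for (int i = 0; i < n; ++i) a[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(a, n, 2.0f);  // 4 blocks = 1024 threads
    cudaDeviceSynchronize();

    printf("a[999] = %g\n", a[999]);  // expect 2
    cudaFree(a);
    return 0;
}
```

Delete the `if (i < n)` guard and rerun under memory debugging to see the out-of-bounds accesses reported.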
Whenever you have a compute-intensive code, you should profile it to get the most performance. Allinea MAP is an application performance profiler: it shows the lines of host (CPU) code where the most execution time is spent.
Step 1: Profile the initial code
Use the Allinea MAP profiler to discover which parts of your code consume the most CPU time and what they do. If scalar or floating-point operations dominate, you have a good candidate for GPU offload; but if I/O, branching or communication dominates, you need to fix those issues first.
Step 2: Profile the results
Once you have a working CUDA or OpenACC code, check whether performance has improved and identify the next target for optimization. If you can reduce the number of times data is transferred between the CPU and GPU, for example by combining sequences of CPU operations into one larger GPU computation, performance should improve.
Note: MAP is not able to give profile information for source lines or functions for CUDA or OpenACC kernels executed on the GPU.
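One way to combine operations, sketched below with illustrative kernel names and no error checking, is to run consecutive kernels on data that stays resident on the device, so the intermediate result is never copied back to the host.

```cuda
#include <cstdio>

__global__ void addK(float *a, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += k;
}

__global__ void mulK(float *a, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= k;
}

int main()
{
    const int n = 1000;
    const size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes), *d;
    cudaMalloc(&d, bytes);
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    // One transfer in, two kernels back to back, one transfer out,
    // instead of copying to and from the host between the two steps.
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    addK<<<(n + 255) / 256, 256>>>(d, n, 1.0f);
    mulK<<<(n + 255) / 256, 256>>>(d, n, 2.0f);   // intermediate stays on the GPU
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);

    printf("h[0] = %g\n", h[0]);  // (0 + 1) * 2 = 2
    cudaFree(d);
    free(h);
    return 0;
}
```

In MAP, the reduction in transfer time shows up as less CPU time spent waiting in the `cudaMemcpy` calls.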
Advanced CUDA: Overlap data transfer
Although we can't see the individual source lines when profiling CUDA, MAP still profiles how the CPU and GPU work together.
One optimization is to overlap GPU and CPU computation. CUDA makes this easy with streams, but you still need to take a look at how much time is spent in data transfer. Too little time waiting at the synchronization? Your GPU may have finished sooner than the CPU; try giving it more work. Too much time at the synchronization? Your CPU is wasting cycles you could use!
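A common streams pattern, shown here as a minimal sketch (illustrative names, error checking omitted), is to split the data into chunks and give each chunk its own stream, so the copy for one chunk overlaps the compute for another:

```cuda
#include <cstdio>

__global__ void work(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 20, chunks = 4, chunk = n / chunks;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));  // pinned host memory, needed for async copies
    cudaMalloc(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t s[chunks];
    for (int c = 0; c < chunks; ++c) cudaStreamCreate(&s[c]);

    // Each chunk's copy-in, kernel and copy-out are queued in its own
    // stream; work in different streams may run concurrently.
    for (int c = 0; c < chunks; ++c) {
        size_t off = (size_t)c * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[c]);
        work<<<(chunk + 255) / 256, 256, 0, s[c]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();

    printf("h[0] = %g\n", h[0]);  // 1 * 2 + 1 = 3
    for (int c = 0; c < chunks; ++c) cudaStreamDestroy(s[c]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```

Profiling a run like this shows whether the CPU spends its time usefully between queuing the work and the final synchronization.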
More on CUDA
- Our blog on debugging CUDA dynamic parallelism
- CUDA energy and power profiling and optimization
- CUDA resources at NVIDIA
- The OpenACC.org group
- ORNL Titan case study
- CSCS Piz Daint case study
- Watch a Video on Debugging and Profiling CUDA and OpenACC