allinea - Now part of ARM

Deep Learning


Deep learning needs high-speed computation like rockets need fuel. The faster you can compute, the more accurate your solutions will be. Fields such as computer vision and speech recognition are now producing solutions that are, in many cases, faster and more accurate than humans.

Allinea's tools reduce the cost, time and energy of exploring and training networks in deep learning. We understand performance - our tools let you understand it too.

Key problems

Efficient exploration: Finding a good data representation, model and hyperparameters is the first step. The need to explore many different network configurations and hyperparameters means extensive computation and slow feedback loops - costly in both time and money.
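To see why exploration is so costly, consider a plain grid search. The search space and per-run time below are assumptions chosen for illustration, not figures from any real project:

```python
from itertools import product

# Hypothetical hyperparameter grid - names and values are illustrative.
search_space = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size":    [32, 64, 128, 256],
    "hidden_units":  [128, 256, 512],
    "dropout":       [0.0, 0.25, 0.5],
}

# Every combination is one complete training run.
configs = list(product(*search_space.values()))
hours_per_run = 2.0  # assumed single-GPU time to train one configuration

total_gpu_hours = len(configs) * hours_per_run
print(len(configs), "configurations =", total_gpu_hours, "GPU-hours")
# 108 configurations = 216.0 GPU-hours
```

Even this modest four-parameter grid costs over 200 GPU-hours per sweep - which is why every percentage of hardware efficiency matters.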

During training, large data sets are run through the chosen network repeatedly to train it. This means long computation times per epoch - leading to hours or days for convergence, even on powerful multi-GPU servers.
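A back-of-envelope calculation shows how epoch times add up. All the numbers here are assumed for illustration (an ImageNet-scale data set and a plausible multi-GPU throughput), not measurements:

```python
# Rough training-time estimate - every figure below is an assumption.
images            = 1_200_000   # ImageNet-scale data set
images_per_second = 1_500       # assumed sustained multi-GPU server throughput
epochs            = 90          # assumed passes over the data for convergence

seconds_per_epoch = images / images_per_second          # 800 s per epoch
total_hours = seconds_per_epoch * epochs / 3600         # ~20 hours end to end
print(f"{seconds_per_epoch:.0f} s/epoch, {total_hours:.0f} h total")
# 800 s/epoch, 20 h total
```

Halving the throughput doubles the wall-clock time to a result - so underused hardware translates directly into lost days.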

Because of these computation demands, projects often move from initial exploration on a GPU workstation to multiple GPUs, and then to multiple servers - sometimes scaling to hundreds of servers or more.

Doing deep learning more efficiently means better results, faster. That means getting software to exploit the hardware more efficiently - but the performance of software, whether on one server or thousands, is often hard to understand.

How Allinea solves the problem

Tuning performance: Our Performance Reports tool shows how your network is using processor, memory, GPU and communication: Are CPU cores idle? Are fast vectorization instructions being used by the CPU? Is the GPU fully used, or is it mostly idle? Is memory bandwidth stalling the processor?

Deep learning scientists and developers can tune their chosen framework's use of available cores and libraries, or make a better selection of hardware (cloud or on-premise) to deliver higher throughput - as we show by analyzing Torch and achieving a 3x speed-up over the default configuration.
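A common first step in this kind of tuning is pinning the framework's CPU thread pools to the machine's core count before the framework is loaded. This is a minimal sketch, not the specific change behind the Torch result above; whether it helps depends on your framework build, so measure before and after:

```python
import os

# Set thread-pool sizes BEFORE importing the deep learning framework:
# OMP_NUM_THREADS is read by OpenMP-backed math libraries, and
# MKL_NUM_THREADS by MKL-backed builds. Defaults are often wrong for
# the machine, leaving cores idle or oversubscribed.
cores = os.cpu_count() or 1
os.environ["OMP_NUM_THREADS"] = str(cores)
os.environ["MKL_NUM_THREADS"] = str(cores)

print(f"thread pools pinned to {cores} cores")
```

Profiling then confirms whether the cores are actually busy doing useful work - which is exactly the question Performance Reports answers.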

Optimizing code: Developers working with native code - such as C++ extensions to TensorFlow, or C++ frameworks such as Caffe - can use our development tool suite, Allinea Forge, to profile, optimize and debug code.

Scaling up and parallel computing: As networks move from a single node to multiple nodes, our tools continue to provide the answers. Allinea are the leaders in high performance computing (HPC) and supercomputing: our tools tackle parallel applications running on thousands of servers. Deep learning experts must address scalability too. We show this with one example of TensorFlow using the MPI communication library, finding a >2x speed-up - then scaling out to 1,500 cores of a Cray XC30 as we train a system to win the classic game Pong, tackling problems of processor performance and workload balance in our two-part blog, Supercomputer vs Pong.
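The pattern behind MPI-based scaling of training is data parallelism: each worker computes gradients on its own shard of the data, then an allreduce averages the gradients so every worker applies the same update. The sketch below simulates the ranks in a single process for clarity; a real run would use an MPI binding such as mpi4py and a collective like `MPI.COMM_WORLD.Allreduce`:

```python
# Data-parallel gradient averaging, with MPI ranks simulated in-process.

def local_gradients(rank, n_params=3):
    # Stand-in for a backward pass on this rank's data shard
    # (deterministic toy values, one per model parameter).
    return [float(rank + i) for i in range(n_params)]

def allreduce_mean(grads_per_rank):
    # What MPI_Allreduce (sum) followed by a divide achieves:
    # every rank ends up holding the same averaged gradient.
    n_ranks = len(grads_per_rank)
    return [sum(g[i] for g in grads_per_rank) / n_ranks
            for i in range(len(grads_per_rank[0]))]

ranks = 4
grads = [local_gradients(r) for r in range(ranks)]
avg = allreduce_mean(grads)
print(avg)  # [1.5, 2.5, 3.5] - identical update applied on every rank
```

At scale, the cost of this allreduce and any imbalance in per-rank work become the bottlenecks - exactly the communication and workload-balance problems a parallel profiler exposes.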