Data Layout Considerations in Deep Learning

When you build an application that uses the generated CUDA® C++ code, you must provide a CUDA C++ main function that calls the generated code. By default, when you use the codegen command to generate source code, static libraries, dynamic libraries, or executables, GPU Coder™ generates example CUDA C++ main files (the main.cu source file and the main.h header file) in the examples subfolder of the build folder. This example main file is a template that helps you incorporate the generated CUDA code into your application. The example main function declares and initializes data, including dynamically allocated data. It calls the entry-point functions but does not use the values that they return.

When generating code for deep convolutional neural networks (CNN), the code generator takes advantage of NVIDIA® cuDNN or TensorRT for NVIDIA GPUs, or the ARM® Compute Library for ARM Mali GPUs. These libraries have specific data layout requirements for the input tensor that holds images, video, and other data. When you author a custom main function for your application, you must create input buffers that provide data to the generated entry-point functions in the format that these libraries expect.

Data Layout Format for CNN

For deep convolutional neural networks (CNN), a 4-D tensor descriptor defines the format for batches of 2-D images, using the following letters:

N – the batch size
C – the number of feature maps (number of channels)
H – the height
W – the width

The most commonly used 4-D tensor formats are listed here. In each format name, the letters are sorted in decreasing order of the strides, so the rightmost letter names the dimension that varies fastest in memory.

NCHW
NHWC
CHWN
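
To make the stride ordering concrete, the following is a minimal C++ sketch that derives the per-dimension strides (in elements) from a format name. The names Dims4, Strides4, and stridesFor are hypothetical helpers introduced for illustration; they are not part of GPU Coder, cuDNN, or TensorRT, and the sketch assumes a densely packed tensor with no padding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Sizes of the four dimensions of a 4-D tensor (hypothetical helper type).
struct Dims4 {
    std::size_t n, c, h, w;
};

// Stride, in elements, of each dimension (hypothetical helper type).
struct Strides4 {
    std::size_t n, c, h, w;
};

// Compute the strides implied by a format name such as "NCHW".
// The letters run from the largest stride to the smallest, so the
// rightmost letter gets stride 1 and each letter to its left gets the
// stride of its right neighbor multiplied by that neighbor's size.
// Assumes a densely packed tensor (no padding between elements).
inline Strides4 stridesFor(const std::string& format, const Dims4& d) {
    assert(format.size() == 4);
    Strides4 s{0, 0, 0, 0};
    std::size_t running = 1;
    for (int i = 3; i >= 0; --i) {
        switch (format[static_cast<std::size_t>(i)]) {
            case 'N': s.n = running; running *= d.n; break;
            case 'C': s.c = running; running *= d.c; break;
            case 'H': s.h = running; running *= d.h; break;
            case 'W': s.w = running; running *= d.w; break;
            default: assert(false && "unknown dimension letter");
        }
    }
    return s;
}
```

For a tensor with N=1, C=3, H=5, W=4, calling stridesFor("NCHW", dims) gives strides of 60, 20, 4, and 1 for N, C, H, and W respectively, while stridesFor("NHWC", dims) gives 60, 12, 3, and 1 for N, H, W, and C.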

Of these, GPU Coder uses the NCHW format (column-major layout by default). To use row-major layout, pass the -rowmajor option to the codegen command. Alternatively, configure your code for row-major layout by setting the cfg.RowMajor parameter in the code generation configuration object.

For example, consider a batch of images with the following dimensions: N=1, C=3, H=5, W=4. If the image pixel elements are represented by a sequence of integers, the input images can be pictorially represented as follows.

When you create the input buffer in the main function, lay out the 4-D image in memory in the NCHW format as follows:

Beginning with the first channel (C=0), arrange the elements contiguously in row-major order.
Continue with the second and subsequent channels until the elements of all the channels are laid out.
Proceed to the next image in the batch (if N > 1).
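
The layout steps above can be sketched in C++ as follows. This is an illustrative sketch, not code that GPU Coder generates: nchwOffset and makeNchwBuffer are hypothetical helper names, the float element type is an assumption (match whatever type your generated entry-point function expects), and getPixel stands in for however your application reads a pixel value.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear offset of element (n, c, h, w) in a contiguous NCHW buffer.
// W varies fastest, then H, then C, then N.
inline std::size_t nchwOffset(std::size_t n, std::size_t c,
                              std::size_t h, std::size_t w,
                              std::size_t C, std::size_t H, std::size_t W) {
    assert(c < C && h < H && w < W);
    return ((n * C + c) * H + h) * W + w;
}

// Build an input buffer of N*C*H*W elements in NCHW order.
// getPixel(n, c, h, w) is a caller-supplied (hypothetical) accessor that
// returns the value of one pixel element; float is an assumed data type.
template <typename PixelFn>
std::vector<float> makeNchwBuffer(std::size_t N, std::size_t C,
                                  std::size_t H, std::size_t W,
                                  PixelFn getPixel) {
    std::vector<float> buf(N * C * H * W);
    for (std::size_t n = 0; n < N; ++n) {              // each image in the batch
        for (std::size_t c = 0; c < C; ++c) {          // each channel plane
            for (std::size_t h = 0; h < H; ++h) {      // each row of the plane
                for (std::size_t w = 0; w < W; ++w) {  // each element of the row
                    buf[nchwOffset(n, c, h, w, C, H, W)] =
                        static_cast<float>(getPixel(n, c, h, w));
                }
            }
        }
    }
    return buf;
}
```

With N=1, C=3, H=5, W=4, the buffer holds 60 elements: offsets 0 through 19 contain the first channel row by row, offsets 20 through 39 the second channel, and offsets 40 through 59 the third. You can then pass buf.data() to your entry-point call, or copy it into whatever buffer type that function expects.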

Data Layout Format for LSTM

A long short-term memory (LSTM) network is a type of recurrent neural network (RNN) that can learn long-term dependencies between time steps of sequence data. For LSTM, the data layout format can be described with the following letters:

N – the batch size
S – the sequence length (number of time steps)
d – the number of units in one input sequence

For LSTM, GPU Coder uses the SNd format by default.
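
As an illustration, the following C++ sketch computes the offset of one element in an SNd buffer. It assumes that the letters in SNd follow the same convention as the CNN formats, that is, sorted in decreasing order of the strides, so that the d values of one time step of one sequence are contiguous. The text above does not state this explicitly for LSTM, so treat it as an assumption and verify it against your generated code. The name sndOffset is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>

// Linear offset of feature i of batch item n at time step s in a
// contiguous SNd buffer. ASSUMPTION: letters are ordered from largest
// stride to smallest, so d varies fastest, then N, then S.
inline std::size_t sndOffset(std::size_t s, std::size_t n, std::size_t i,
                             std::size_t N, std::size_t d) {
    assert(n < N && i < d);
    return (s * N + n) * d + i;
}
```

Under this assumption, with N=2 and d=3, the three features of time step 0 for the first sequence occupy offsets 0 through 2, the same time step for the second sequence occupies offsets 3 through 5, and time step 1 begins at offset 6.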