Kernels from Element-Wise Loops
The simplest case of CUDA® kernel creation is from MATLAB® functions that contain scalarized, element-wise math operations. When element-wise operations are enclosed within a for-loop body, concurrent CUDA threads can compute each loop iteration in parallel. Because CUDA threads execute in no particular order and are independent of each other, it is essential that no iteration of your for-loop depends on the results of other iterations.
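For example, the first loop below is safe to parallelize because each element of y depends only on the corresponding element of x, while the second loop carries a dependence between iterations and is not a candidate for kernel creation. This is a minimal illustrative sketch; the function name is hypothetical.

function [y, z] = loopExamples(x) %#codegen
y = zeros(size(x));
z = zeros(size(x));
% Safe: each iteration reads and writes only its own element
for k = 1:numel(x)
    y(k) = 2*x(k) + 1;
end
% Not safe: iteration k reads the result of iteration k-1,
% so the iterations cannot run concurrently
z(1) = x(1);
for k = 2:numel(x)
    z(k) = z(k-1) + x(k);
end
end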
Element-Wise Math Example
This example shows how to create CUDA kernels from functions that contain element-wise math operations. Suppose that you want to square each element of a matrix x and scale it by a factor of 1/(i+j), where i and j are the row and column indices.
You can implement this example as a MATLAB function.
function [y] = myFun(x)
y = zeros(size(x));
for i = 1:size(x,1)
    for j = 1:size(x,2)
        y(i,j) = (x(i,j)^2)/(i+j);
    end
end
end
Preparing myFun for Code Generation
The first statement, zeros(size(x)), in the myFun function initializes the result vector y to zeros. For CUDA code generation, pre-allocate memory for y without incurring the overhead of initializing the memory to zeros. Replace this line with coder.nullcopy(zeros(size(x))).
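In isolation, the change is a minimal before-and-after sketch:

% Before: allocates y and zero-fills it
y = zeros(size(x));
% After: allocates y without the zero-fill; safe here because every
% element of y is written inside the loop
y = coder.nullcopy(zeros(size(x)));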
To create CUDA kernels from loops, GPU Coder™ provides another pragma, coder.gpu.kernel. Specifying this kernel pragma overrides all parallel-loop analysis. If you do not specify any parameters, GPU Coder determines the kernel bounds based on the loop bounds and input size. The pragma also provides a way for you to specify kernel launch parameters such as thread and block sizes. However, use it only when you know that the loop is safe to parallelize. Because the myFun example is simple and does not require specification of kernel launch parameters, you can use the coder.gpu.kernelfun pragma to generate CUDA kernels.
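If a loop does require explicit control, a variant of myFun could place coder.gpu.kernel immediately before the loop nest. The launch configuration below, a 512-by-1-by-1 grid of blocks with 512 threads per block, is only an illustrative choice for a 512-by-512 input, not a recommendation:

function [y] = myFunExplicit(x) %#codegen
y = coder.nullcopy(zeros(size(x)));
% Hypothetical launch parameters: grid dimensions [512 1 1] and
% block dimensions [512 1 1], applied to the loop that follows
coder.gpu.kernel([512 1 1], [512 1 1]);
for i = 1:size(x,1)
    for j = 1:size(x,2)
        y(i,j) = (x(i,j)^2)/(i+j);
    end
end
end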
With these modifications, the original myFun function is suitable for code generation.
function [y] = myFun(x) %#codegen
y = coder.nullcopy(zeros(size(x)));
coder.gpu.kernelfun();
for i = 1:size(x,1)
    for j = 1:size(x,2)
        y(i,j) = (x(i,j)^2)/(i+j);
    end
end
end
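One way to generate CUDA MEX code for this function from the command line, assuming a 512-by-512 double input (the size used in the snippets that follow), is:

cfg = coder.gpuConfig('mex');
codegen -config cfg -args {ones(512,512)} myFun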
Generated CUDA Code
When you generate CUDA code by using the GPU Coder app or from the command line, GPU Coder creates a single kernel that performs the squaring and scaling operations. The following is a snippet of the myFun_kernel1 kernel code. Because the generated code uses zero-based indices, the scale factor 1/(i+j) from the one-based MATLAB code appears as division by (i + j + 2.0).
static __global__ __launch_bounds__(512, 1) void myFun_kernel1(const real_T *x,
  real_T *y)
{
  ...
  /* Compute a unique linear index for this thread from the grid and block IDs */
  threadId = ((((gridDim.x * gridDim.y * blockIdx.z + gridDim.x * blockIdx.y)
                + blockIdx.x) * (blockDim.x * blockDim.y * blockDim.z)
               + threadIdx.z * blockDim.x * blockDim.y)
              + threadIdx.y * blockDim.x) + threadIdx.x;
  /* Map the linear index to zero-based row (i) and column (j) indices */
  i = (int32_T)(threadId / 512U);
  j = (int32_T)(threadId - (uint32_T)i * 512U);
  /* Execute only for threads that fall within the 512-by-512 input */
  if ((!(j >= 512)) && (!(i >= 512))) {
    y[i + (j << 9)] = x[i + (j << 9)] * x[i + (j << 9)] /
      ((real_T)(i + j) + 2.0);
  }
}
The following is a snippet of the main myFun function. Before calling myFun_kernel1, there is a single cudaMemcpy call that transfers the matrix x from the host (x) to the device (gpu_x). The kernel has 512 blocks containing 512 threads per block, consistent with the size of the 512-by-512 input matrix. The 2097152 bytes transferred correspond to 512 × 512 double-precision elements at 8 bytes each. A second cudaMemcpy call copies the result of the computation back to the host.
/* Copy the 512-by-512 input matrix to the device */
cudaMemcpy((void *)gpu_x, (void *)x, 2097152ULL, cudaMemcpyHostToDevice);
myFun_kernel1<<<dim3(512U, 1U, 1U), dim3(512U, 1U, 1U)>>>(gpu_x, gpu_y);
/* Copy the result back to the host */
cudaMemcpy((void *)y, (void *)gpu_y, 2097152ULL, cudaMemcpyDeviceToHost);
Limitations
If the loop bounds are of an unsigned data type, the code generator may add conditional checks to determine whether the loop bounds are valid. These conditional checks can limit the optimizations that the software performs and can introduce reduction kernels that reduce performance.
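As a hypothetical illustration, a loop whose bound comes from an unsigned argument can trigger these guards. The function and variable names below are illustrative only:

function y = scaleFirstN(x, n) %#codegen
% n is a uint32 input. Because unsigned arithmetic can wrap around,
% the code generator may emit extra conditional checks to verify
% that the loop bound is valid.
y = zeros(size(x));
coder.gpu.kernelfun();
for k = 1:n
    y(k) = 2*x(k);
end
end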
See Also
coder.gpu.kernel | coder.gpu.kernelfun | gpucoder.matrixMatrixKernel | coder.gpu.constantMemory | gpucoder.stencilKernel