Kernels from Element-Wise Loops
The simplest case of CUDA® kernel creation is from MATLAB® functions that contain scalarized, element-wise math operations. When
      element-wise operations are enclosed within a for-loop body, concurrent CUDA threads can be invoked to compute each loop iteration in parallel. Because
        CUDA threads do not execute in particular order, and are independent of each other,
      it is essential that the iteration in your for-loop does not depend on the
      results of other iterations.
Element-Wise Math Example
This example shows how to create CUDA kernels from functions that contain element-wise math operations. Suppose that
        you want to square each element of a matrix x and scale by a factor of
          1/(i+j), where i,j are the row and column indexes.
        You can implement this example as a MATLAB function.
function [y] = myFun(x) y = zeros(size(x)); for i = 1:size(x,1) for j = 1:size(x,2) y(i,j)=(x(i,j)^2)/(i+j); end end end
Preparing myFun for Code Generation
The first statement zeros(size(x)) in the myFun
        function is to initialize result vector y to zeros. To pre-allocate
        memory for y  in the generated code without incurring the overhead of
        initializing the memory to zeros, replace this line with
          coder.nullcopy(zeros(size(y))).
To create CUDA kernels from loops, GPU Coder™ provides another pragma coder.gpu.kernel. Specifying this kernel pragma overrides
        all parallel-loop analysis. If you do not specify a
        parameter, GPU Coder determines the kernel bounds based on the loop bounds and input size. It
        provides a way for you to specify kernel launch parameters such as
          thread and block sizes. However, use it only
        when you know that the loop is safe to parallelize. Because
        the myFun example is simple and does not require specification of the
        kernel launch parameters, you can utilize the coder.gpu.kernelfun pragma
        to generate CUDA kernels. 
With these modifications, the original myFun function is suitable for
        code generation.
function [y] = myFun(x) %#codegen y = coder.nullcopy(zeros(size(x))); coder.gpu.kernelfun(); for i = 1:size(x,1) for j = 1:size(x,2) y(i,j)=(x(i,j)^2)/(i+j); end end end
Generated CUDA Code
When you generate CUDA code by using the GPU Coder app or from the command line, GPU Coder creates a single kernel that performs squaring and scaling operation. The
        following is a snippet of the myFun_kernel1 kernel code.
static __global__ __launch_bounds__(512, 1) void myFun_kernel1(const real_T *x,
  real_T *y)
{
...
threadId = ((((gridDim.x * gridDim.y * blockIdx.z + gridDim.x * blockIdx.y) +
                blockIdx.x) * (blockDim.x * blockDim.y * blockDim.z) +
               threadIdx.z * blockDim.x * blockDim.y) + threadIdx.y * blockDim.x)
    + threadIdx.x;
  i = (int32_T)(threadId / 512U);
  j = (int32_T)(threadId - (uint32_T)i * 512U);
  if ((!(j <= 512)) && (!(i <= 512))) {
    y[i + (j << 9)] = x[i + (j << 9)] * x[i + (j << 9)] / ((real_T)(i + j) + 2.0);
  }
}The following is a snippet of the main myFun function. Before calling
          myFun_kernel1, there is a single cudaMemcpy call
        that transfers the matrix x from the host (x) to the
        device (gpu_x). The kernel has 512 blocks containing 512 threads per
        block, consistent with the size of the input vector. A second cudaMemcpy
        call copies the result of the computation back to the host.
cudaMemcpy((void *)gpu_x, (void *)x, 2097152ULL, cudaMemcpyHostToDevice); myFun_kernel1<<<dim3(512U, 1U, 1U), dim3(512U, 1U, 1U)>>>(gpu_x, gpu_y); cudaMemcpy((void *)y, (void *)gpu_y, 2097152ULL, cudaMemcpyDeviceToHost);
Limitations
- If the loop bounds are of the unsigned data type, the code generator may add conditional checks to determine if the loop bounds are valid. These conditional checks may limit optimizations that are performed by the software and introduce reduction kernels that can cause performance change. 
See Also
coder.gpu.kernel | coder.gpu.kernelfun | gpucoder.matrixMatrixKernel | coder.gpu.constantMemory | stencilfun