cuda

CUDA Python with Numba

Numba is a just-in-time, type-specializing, function compiler for accelerating numerically-focused Python for either a CPU or GPU

CPU CUDA compile

always check the speed-up performance before implementing jit

from numba import jit
from numpy import testing

@jit
def example_func(x,y):
    pass

#measure execuation time in jupyter
%timeit example_func(3,4)

#using testing modul to check result of jit compiled and uncompiled function
testing.assert_almost_equal(example_func(3, 4), example_func.py_func(3,4), decimal=2)

#njit means jit(nonpython=True), it does not go back to normal python if the type does not support jit
from numba import njit
@njit

ufuncs

ufuncs have broadcasting features Note that for the CUDA target, we need to use the scalar functions

compile ufuncs for CPU

from numba import vectorize

@vectorize
def add_ten(num):
    return num + 10

compile ufuncs for GPU

to enable ufuncs for GPU explicit data type signiture is needed

@vectorize(['int64(int64, int64)'], target='cuda')
def add_ufunc(x, y):
    return x + y

processes happened: Compiled a CUDA kernel to execute the ufunc operation in parallel over all the input elements. Allocated GPU memory for the inputs and the output. Copied the input data to the GPU. Executed the CUDA kernel (GPU function) with the correct kernel dimensions given the input sizes. Copied the result back from the GPU to the CPU. Returned the result as a NumPy array on the host.

CUDA Device Functions

To compile functions for the GPU that are not element wise, vectorized functions, we use numba.cuda.jit The argument device=True indicates that the decorated function can only be called from a function running on the GPU, and not from CPU host code numba.cuda.jit is not to be confused with the numba.jit decorator which optimizes functions for the CPU

from numba import cuda

@cuda.jit(device=True)
def polar_to_cartesian(rho, theta):
    x = rho * math.cos(theta)
    y = rho * math.sin(theta)
    return x, y

@vectorize(['float32(float32, float32, float32, float32)'], target='cuda')
def polar_distance(rho1, theta1, rho2, theta2):
    x1, y1 = polar_to_cartesian(rho1, theta1)
    x2, y2 = polar_to_cartesian(rho2, theta2)
    
    return ((x1 - x2)**2 + (y1 - y2)**2)**0.5

managing GPU memory

Minimize data transfer between the host and the device

#to keep data on GPU for operation is much efficienter
from numba import cuda

x_device = cuda.to_device(x)
y_device = cuda.to_device(y)
out_device = cuda.device_array(shape=(n,), dtype=np.float32)
%timeit add_ufunc(x_device, y_device, out=out_device)
out_host = out_device.copy_to_host()

Custom CUDA Kernels in Python with Numba

functions for the GPU called kernels GPU can run thousands of threads in parallel, a collection of threads is called block, a collection of blocks associated with a given kernel launch is a grid. Every block in teh grid contains teh same number of threads.

gridDim.x: number of blocks in the grid blockIdx.x. index of the current block within the grid blockDim.x: number of threads in a block threadIdx.x:index the thread within a block cuda.grid(dimension_grid): will return a thread unique index

from numba import cuda
@cuda.jit
def add_kernel(x, y, out):
    idx = cuda.grid(1)
    out[idx] = x[idx] + y[idx]

#the kernel is written so that each thread works on exactly one data element, it is essential for the number of threads in the grid equal the number of data elements
add_kernel[blocks_per_grid, threads_per_block](d_x, d_y, d_out)
#CPU will work asynchronously while the GPU is processing
cuda.synchronize()
print(d_out.copy_to_host()) ### Configuration Choices

The size of a block should be a multiple of 32 threads (the size of a warp), with typical block sizes between 128 and 512 threads per block. The size of the grid should ensure the full GPU is utilized where possible. Launching a grid where the number of blocks is 2x-4x the number of SMs on the GPU is a good starting place. Something in the range of 20 - 100 blocks is usually a good starting point. The CUDA kernel launch overhead does increase with the number of blocks, so when the input size is very large we find it best not to launch a grid where the number of threads equals the number of input elements, which would result in a tremendous number of blocks. Instead we use a pattern to which we will now turn our attention for dealing with large inputs.

Grid Stride Loops

it create flexible kernels where each thread is able to work on more than one data element, an essential technique for large datasets cuda.gridsize(): returning the number of threads in teh grid

from numba import cuda
@cuda.jit
def add_kernel(x, y, out):
    start = cuda.grid(1)   
    stride = cuda.gridsize(1)
    for i in range(start, x.shape[0], stride):
        # Assuming x and y inputs are same length
        out[i] = x[i] + y[i]

Atomic Operations and Avoiding Race Conditions

A common strategy to avoid both of these hazards is to organize your CUDA kernel algorithm such that each thread has exclusive responsibility for unique subsets of output array elements, and/or to never use the same array for both input and output in a single kernel call.

@cuda.jit
def thread_counter_race_condition(global_counter):
    global_counter[0] += 1  # This is bad
    
@cuda.jit
def thread_counter_safe(global_counter):
    cuda.atomic.add(global_counter, 0, 1)  # Safely add 1 to offset 0 in global_counter array

Multidimensional Grids and Shared Memory

2 and 3 Dimensional Blocks and Grids

import numpy as np
from numba import cuda
A = np.zeros(16).reshape(4, 4).astype(np.int32)
d_A = cuda.to_device(A)

@cuda.jit
def add_2D_coordinates(A):
    # By passing `2`, we get the thread's unique x and y coordinates in the 2D grid
    x, y = cuda.grid(2)
    
    A[y][x] = x + y

@cuda.jit
def mm(a, b, c):
    row, column = cuda.grid(2)
    sum = 0
    
    for i in range(a.shape[0]):
        sum += a[row][i] * b[i][column]
        
    c[row][column] = sum

Striding in Multiple Dimensions

cuda.gridsize(dims) to obtain the total number of threads in a grid refactoring it with multiple stiding makes that it can work on data sets of an arbitrary size, otherwise it only works when passed matrices are of the same size as the grid

import numpy as np
from numba import cuda
@cuda.jit
def add_2D_coordinates_stride(A):

    grid_y, grid_x = cuda.grid(2)
    # By passing `2`, we get the grid size in both the x an y dimensions
    stride_y, stride_x = cuda.gridsize(2)
    
    for data_i in range(grid_x, A.shape[0], stride_x):
        for data_j in range(grid_y, A.shape[1], stride_y):
            A[data_i][data_j] = grid_x + grid_y

import numpy as np
from numba import cuda
@cuda.jit
def mm_stride(A, B, C):

    grid_column, grid_row = cuda.grid(2)
    stride_column, stride_row= cuda.gridsize(2)
    
    for data_row in range(grid_row, A.shape[0], stride_row): # TODO: replace 0 with values that will correctly set data_row
        for data_column in range(grid_column, A.shape[1], stride_column): # TODO: replace 0 with values that will correctly set data_column
            sum = 0
            for i in range(A.shape[1]): # B.shape[0] would also be okay here
                sum += A[data_row][i] * B[i][data_column]
                
            C[data_row][data_column] = sum

shared memory

Shared memory is a programmer defined cache of limited size that depends on the GPU being used and is shared between all threads in a block user cases:

Caching memory read from global memory that will need to be read multiple times within a block.
Buffering output from threads so it can be coalesced before writing it back to global memory.
Staging data for scatter/gather operations within a block.

import numpy as np
from numba import types, cuda
@cuda.jit
def swap_with_shared(x, y):
    # Allocate a 4 element vector containing int32 values in shared memory.
    temp = cuda.shared.array(4, dtype=types.int32)        
    idx = cuda.grid(1)
   
    # Move an element from global memory into shared memory
    temp[idx] = x[idx]
    
    # cuda.syncthreads will force all threads in the block to synchronize here, which is necessary because...
    cuda.syncthreads()
    #...the following operation is reading an element written to shared memory by another thread.
    
    # Move an element from shared memory back into global memory
    y[idx] = temp[cuda.blockDim.x - cuda.threadIdx.x - 1] # swap elements

#an example to transpose matrix
import numba.types

@cuda.jit
def tile_transpose(a_in, a_out):
    # `tile_transpose` assumes it is launched with a 32x32 block dimension,
    # and that `a_in` is a multiple of these dimensions.
    
    # 1) Create 32x32 shared memory array.
    tile = cuda.shared.array((32, 32), numba.types.int32)

    # Compute offsets into global input array.
    row = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    col = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
    
    # 2) Make coalesced read from global memory into shared memory array.
    # Note the use of local thread indices for the shared memory write,
    # and global offsets for global memory read.
    tile[cuda.threadIdx.y, cuda.threadIdx.x] = a_in[col, row]

    # 3) Wait for all threads in the block to finish updating shared memory.
    cuda.syncthreads()
    
    # 4a) Calculate transposed location for the shared memory array tile
    # to be written back to global memory...
    row = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.x
    col = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.y

    # 4b) ...Write back to global memory,
    # transposing each element within the shared memory array.
    a_out[col, row] = tile[cuda.threadIdx.x, cuda.threadIdx.y]

python package for Cross-Vendor Graphics

Kompute

frameworks for cross platform and cross vendor graphics card acculeration