Streams

Overview

Tutorial: 10 min

Learn how to use GPU streams with Numba.

Understand how to create and use CUDA streams in Numba.

A stream is a sequence of operations that the GPU executes in the order they were issued. Operations in different streams may run concurrently, so CUDA streams in Numba let you keep several independent tasks in flight at once. This improves utilization of GPU resources, for example by overlapping computation with data transfers.

from numba import cuda
import numpy as np

# Define a simple kernel function
@cuda.jit
def add_kernel(a, b, c):
    # Global 1-D thread index: threadIdx.x + blockIdx.x * blockDim.x
    pos = cuda.grid(1)

    if pos < a.size:
        c[pos] = a[pos] + b[pos]

# Create two streams
stream1 = cuda.stream()
stream2 = cuda.stream()

# Initialize data
size = 1000000
a_cpu = np.arange(size, dtype=np.float32)
b_cpu = np.arange(size, dtype=np.float32) * 2
c_cpu = np.zeros(size, dtype=np.float32)

# Copy the inputs to the device and allocate the output array
a_gpu = cuda.to_device(a_cpu)
b_gpu = cuda.to_device(b_cpu)
c_gpu = cuda.device_array(size, dtype=np.float32)

# Define block and grid dimensions
threads_per_block = 256
blocks_per_grid = (size + (threads_per_block - 1)) // threads_per_block

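# Launch kernels in different streams. Each kernel only reads the inputs and
# writes to its own output array, so the two launches are independent and
# can safely run concurrently.
d_gpu = cuda.device_array(size, dtype=np.float32)
add_kernel[blocks_per_grid, threads_per_block, stream1](a_gpu, b_gpu, c_gpu)  # c = a + b
add_kernel[blocks_per_grid, threads_per_block, stream2](b_gpu, b_gpu, d_gpu)  # d = b + b

# Wait for the streams to complete
stream1.synchronize()
stream2.synchronize()

# Copy results back to host
c_cpu = c_gpu.copy_to_host()
d_cpu = d_gpu.copy_to_host()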

Key Points

Streams can be used to run operations concurrently on a GPU.

Numba allows you to create and manage CUDA streams for parallel execution.