Streams ------- .. admonition:: Overview :class: Overview * **Tutorial:** 10 min #. Learn how target GPUs streams using Numba. #. Understand how to create and use CUDA streams in Numba. Streams are sequences of operations that are executed in order on the GPU. Operations in different streams can run concurrently, allowing for parallel execution and better utilization of GPU resources. CUDA streams in Numba allow you to manage and execute multiple tasks concurrently on a GPU, enhancing performance by overlapping computation and data transfer operations. .. code-block:: python :linenos: from numba import cuda import numpy as np # Define a simple kernel function @cuda.jit def add_kernel(a, b, c): tx = cuda.threadIdx.x ty = cuda.blockIdx.x bw = cuda.blockDim.x pos = tx + ty * bw if pos < a.size: c[pos] = a[pos] + b[pos] # Create two streams stream1 = cuda.stream() stream2 = cuda.stream() # Initialize data size = 1000000 a_cpu = np.arange(size, dtype=np.float32) b_cpu = np.arange(size, dtype=np.float32) * 2 c_cpu = np.zeros(size, dtype=np.float32) # Allocate device memory a_gpu = cuda.to_device(a_cpu) b_gpu = cuda.to_device(b_cpu) c_gpu = cuda.device_array(size, dtype=np.float32) # Define block and grid dimensions threads_per_block = 256 blocks_per_grid = (size + (threads_per_block - 1)) // threads_per_block # Launch kernels in different streams add_kernel[blocks_per_grid, threads_per_block, stream1](a_gpu, b_gpu, c_gpu) add_kernel[blocks_per_grid, threads_per_block, stream2](b_gpu, c_gpu, a_gpu) # Wait for the streams to complete stream1.synchronize() stream2.synchronize() # Copy result back to host c_cpu = c_gpu.copy_to_host() .. admonition:: Key Points :class: hint #. Streams can be used to run concurrent operations in GPUs. #. Numba allows you to create and manage CUDA streams for parallel execution.