GPU Parallelism in HPC

Overview

  • Tutorial: 30 min

  1. Learn how to target GPUs using Numba.

  2. Understand the difference between kernel functions and device functions.

We will use Numba's CUDA support to run our code on the GPU.

Kernel Function

A kernel function is a GPU function that is called from CPU code. It cannot return a value, so it has to write its results into an array passed as an argument. The kernel launch also defines how the GPU thread hierarchy (threads, blocks and grids) is used.

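import math
from numba import cuda

@cuda.jit
def polar_to_cartesian(rho, theta, out):
    # A kernel cannot return x and y, so it writes them into the output array.
    out[0] = rho * math.cos(theta)
    out[1] = rho * math.sin(theta)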

Device Functions

Device functions perform computations on the GPU and can be invoked from within other device functions or kernels. Unlike a kernel function, a device function can return a value, just like a normal Python function.

@cuda.jit(device=True)
def polar_to_cartesian(rho, theta):
    x = rho * math.cos(theta)
    y = rho * math.sin(theta)
    return x, y

@vectorize can also target the GPU. A function compiled with target='cuda' can call device functions such as polar_to_cartesian.

import math
from numba import cuda, vectorize

@cuda.jit(device=True)
def polar_to_cartesian(rho, theta):
    x = rho * math.cos(theta)
    y = rho * math.sin(theta)
    return x, y

# The signature lists the return type first, then the four argument types.
@vectorize(['float32(float32, float32, float32, float32)'], target='cuda')
def polar_distance(rho1, theta1, rho2, theta2):
    x1, y1 = polar_to_cartesian(rho1, theta1)
    x2, y2 = polar_to_cartesian(rho2, theta2)

    return ((x1 - x2)**2 + (y1 - y2)**2)**0.5
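One way to check the GPU result is to compare it against a plain-Python reference that does the same arithmetic on the CPU. The sketch below is only an illustration: the names `polar_to_cartesian_reference` and `polar_distance_reference` are hypothetical and are not part of Numba.

```python
import math


def polar_to_cartesian_reference(rho, theta):
    """CPU version of the device function: polar to Cartesian coordinates."""
    return rho * math.cos(theta), rho * math.sin(theta)


def polar_distance_reference(rho1, theta1, rho2, theta2):
    """Euclidean distance between two points given in polar coordinates."""
    x1, y1 = polar_to_cartesian_reference(rho1, theta1)
    x2, y2 = polar_to_cartesian_reference(rho2, theta2)
    return math.hypot(x1 - x2, y1 - y2)


# Two points on opposite sides of the unit circle: the distance is the diameter.
d = polar_distance_reference(1.0, 0.0, 1.0, math.pi)
print(round(d, 6))  # 2.0
```

Because the GPU version works in float32, a comparison against this reference should allow a small tolerance rather than test for exact equality.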

Thread Indexing

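When launching a kernel, you must also specify the thread arrangement: the number of blocks in the grid and the number of threads in each block. Inside the kernel, cuda.grid(2) returns the absolute (x, y) position of the current thread within the whole grid.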

@cuda.jit
def increment_a_2D_array(an_array):
    x, y = cuda.grid(2)  # absolute (x, y) position of this thread
    # The grid may be larger than the array, so skip out-of-range threads.
    if x < an_array.shape[0] and y < an_array.shape[1]:
        an_array[x, y] += 1

# an_array is assumed to be an existing 2D array.
threadsperblock = (16, 16)
# Round up so that the grid covers the whole array.
blockspergrid_x = math.ceil(an_array.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(an_array.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
increment_a_2D_array[blockspergrid, threadsperblock](an_array)
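The launch arithmetic can be worked through on the CPU. The sketch below is plain Python, not Numba: the helper `grid_2d` is a hypothetical stand-in that mimics what cuda.grid(2) computes, and the array shape (100, 50) is an assumed example. It shows why the grid is rounded up and why the kernel needs the bounds check.

```python
import math


def grid_2d(block_idx, block_dim, thread_idx):
    """Mimic cuda.grid(2): absolute (x, y) position of one thread in the grid."""
    x = block_idx[0] * block_dim[0] + thread_idx[0]
    y = block_idx[1] * block_dim[1] + thread_idx[1]
    return x, y


shape = (100, 50)  # assumed array shape, for illustration only
threadsperblock = (16, 16)
blockspergrid = (
    math.ceil(shape[0] / threadsperblock[0]),
    math.ceil(shape[1] / threadsperblock[1]),
)

launched = 0   # every thread in the grid
in_bounds = 0  # threads that pass the kernel's bounds check
for bx in range(blockspergrid[0]):
    for by in range(blockspergrid[1]):
        for tx in range(threadsperblock[0]):
            for ty in range(threadsperblock[1]):
                x, y = grid_2d((bx, by), threadsperblock, (tx, ty))
                launched += 1
                if x < shape[0] and y < shape[1]:
                    in_bounds += 1

print(blockspergrid)        # (7, 4)
print(launched, in_bounds)  # 7168 5000
```

The grid launches 7168 threads for 5000 array elements; the surplus threads are the ones the `if` statement in the kernel filters out.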

You can learn more about thread indexing in the tutorial Introduction to Parallel Programming Using Python.

Key Points

  1. @vectorize can target GPUs.

  2. Device functions can only be invoked from other device functions or from kernel functions, not from CPU code.