
My journey to GPU realm (Part I): An introduction to CUDA

February 8, 2026
Note

These are basically my notes on the Programming Massively Parallel Processors (PMPP) book and the CUDA Programming Guide.

Why CUDA?

So the first question is: what is CUDA, and why is it so important right now? Maybe you come from an AI background like me and have seen everything from training to inference squeeze CUDA to the last drop. CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA in 2006 1.

The Heterogeneous Model: Host and Device

Much of the work in CUDA is remembering and applying the CUDA programming model efficiently and correctly. The CUDA programming model assumes a heterogeneous computing system, that is, a system that includes both CPUs and GPUs 2 (if you have a GPU only, you can't use CUDA).

The CPU is called the host and the GPU is called the device, so whenever you see "device" something, it relates to the GPU, and likewise "host" relates to the CPU. For example, host memory refers to the RAM attached to the CPU, whereas device memory refers to the GPU's own memory (often DRAM or HBM - High Bandwidth Memory).

CUDA Execution Model: Threads, Blocks, and Grids

The Hierarchy

When a program's host code calls a kernel, CUDA launches a grid of blocks. All blocks in a grid have the same size, and each block can contain up to 1024 threads. In the picture below, we can see a visualization of the grid, block, and thread hierarchy. Each thread is represented by a curly arrow stemming from a box labeled with the thread's index within its block.

  • The total number of threads in each block can be specified by the host code when a kernel is called.
  • The same kernel can be called with different numbers of threads at different parts of the host code.
  • The number of threads in each dimension of a block is available from the blockDim struct. If we think of a block as a cube, we have blockDim.x, blockDim.y, and blockDim.z; a block can also be a rectangle with only blockDim.x and blockDim.y, or even a 1-dimensional array of threads. Three dimensions is the maximum, for both a block and a grid.
  • This makes sense because the threads are created to process data in parallel, so it is only natural that the organization of the threads reflects the organization of the data.
[Image] Visualization of the CUDA Grid/Block/Thread system. Source: https://siboehm.com/articles/22/CUDA-MMM
[Image] A grid of 1-dimensional blocks, each with 256 threads, performing vector addition.
In the picture above, each block organizes its threads in a 1-dimensional array, with 256 threads per block. Each thread executes C[i] = A[i] + B[i]. Note that the number of threads in each dimension should be a multiple of 32 (more on this later).
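
To make these block shapes concrete, below is a minimal sketch (my own illustration, not from the book) of how 1D, 2D, and 3D block configurations could be declared on the host with CUDA's dim3 type; unspecified dimensions default to 1, and the kernel launch syntax itself is covered later in this post.

#include <cuda_runtime.h>

int main() {
    // Each dim3 below describes a block of 256 threads, just shaped differently.
    dim3 block1D(256);      // seen on the device as blockDim = (256, 1, 1)
    dim3 block2D(16, 16);   // seen on the device as blockDim = (16, 16, 1)
    dim3 block3D(8, 8, 4);  // seen on the device as blockDim = (8, 8, 4)
    // A kernel launched with one of these, e.g. kernel<<<grid, block2D>>>(...),
    // reads the shape through blockDim.x, blockDim.y, and blockDim.z.
    return 0;
}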

Your First Kernel: Vector Addition

Let’s see our first kernel, vector addition, the hello world of GPU programming. Don’t worry if you don’t have a GPU - there are online platforms to practice.

A kernel function (code that runs on the device) is executed by every thread. Each thread works on its own data but executes the same function, so we have the SPMD (Single Program, Multiple Data) paradigm.

/*
 * Compute vector sum C = A + B
 * Each thread performs one pair-wise addition
 * __global__ marks this function as a kernel; it lives in a .cu file
 */
__global__ void vecAddKernel(float* A, float* B, float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}

Compare this to the host (CPU) version:

void vecAdd(float* A, float* B, float* C, int n) {
    for (int i = 0; i < n; ++i) {
        C[i] = A[i] + B[i];
    }
}

As we can see, compared with the host function, the kernel has no loop: the loop has been replaced by the grid of threads, and each thread corresponds to one loop iteration.

Thread Indexing

The threadIdx struct gives each thread a unique coordinate within a block.

  • For example, threadIdx.x = 2 and threadIdx.y = 3 means the current thread is at position (2, 3) within its block (a local coordinate).
  • To get a global coordinate, we also need the block's position in the grid from blockIdx; for example, the block at position (1, 0) has blockIdx.x = 1 and blockIdx.y = 0.
  • Combining the two, the global coordinate of the thread is x = blockIdx.x * blockDim.x + threadIdx.x and y = blockIdx.y * blockDim.y + threadIdx.y (see the sketch after this list).
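
As a minimal sketch (my own toy example, not from the book), a 2D element-wise kernel could compute and use these global coordinates like this, assuming row-major matrices of size height x width:

// Hypothetical element-wise matrix addition using 2D thread indexing.
// A, B, C are row-major arrays with height * width elements.
__global__ void matAddKernel(float* A, float* B, float* C, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // global column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // global row index
    if (x < width && y < height) {                  // guard against out-of-range threads
        int idx = y * width + x;                    // flatten (x, y) into a 1D offset
        C[idx] = A[idx] + B[idx];
    }
}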
Note (SPMD vs SIMD)

In an SPMD system, the parallel processing units execute the same program on multiple parts of the data. However, these processing units do not need to be executing the same instruction at the same time. In an SIMD system, all processing units are executing the same instruction at any instant.

Putting It Together: A Complete CUDA Program

But before diving into more theory, let's learn how to do vector addition in CUDA with a complete example (so far we have only written the kernel). To run CUDA code, we need the host (CPU) to call it, and the host code has the following structure:

  1. Create (or allocate) device memory (for arrays, tensors, matrices, etc.). Typically we allocate memory for both inputs and outputs. Inputs usually live in host memory and need to be copied to device memory; outputs are simply allocated directly in device memory (there are dedicated functions for these copy/allocate/free operations).
  2. Call the kernel on these inputs/outputs allocated in device memory. We can think of the kernel as device code (separate from the host code).
  3. Finally, copy the outputs from device memory back to host memory and free the device memory (never forget it LOL).
/*
 * --- Vector Addition in CUDA ---
 * This is the template of the host code
 * that calls the kernel (device code)
 * to do vector addition.
 */
void vec_add(float* A, float* B, float* C, int n) {
    int nbytes = n * sizeof(float); // size of each array in bytes
    float *d_A, *d_B, *d_C;         // copies of A, B, C on the device

    // Part 1: Allocate device memory for d_A, d_B, and d_C
    // and copy A, B, C (host memory) to d_A, d_B, d_C (device memory)
    ...

    // Part 2: Call the kernel to launch a grid of threads
    // that perform the actual vector addition
    ...

    // Part 3: Copy C back from device memory
    // and free the device vectors (d_A, d_B, d_C)
}
Warning

But this "transparent" model is inefficient because of the data transfers (copy/move/etc.) between host and device. In practice, one would often keep large and important data structures on the device and simply invoke device code on that data, without moving it back and forth between host and device.

Below are the special functions used for these allocations and transfers:

cudaMalloc():

  • Allocates an object in global device memory
  • Two params:
    • Address of a pointer to the allocated object; it has to be cast to void** (i.e., a pointer to a void* pointer).
    • Size of allocated object (in bytes)

cudaFree():

  • Frees an object in global device memory
  • One param: Pointer to object we want to free

cudaMemcpy():

  • Transfers memory (copies data from host to device or vice versa)
  • Four params:
    • Pointer to destination
    • Pointer to source
    • Number of bytes to transfer
    • Type/Direction of transfer (host -> device or device -> host)

Filling in the template, we get the complete example:

void vec_add(float* A, float* B, float* C, int n) {
    int nbytes = n * sizeof(float); // size of each array in bytes
    float *d_A, *d_B, *d_C;         // copies of A, B, C on the device

    // Part 1: Allocate device memory for A, B, and C
    cudaMalloc((void**)&d_A, nbytes);
    cudaMalloc((void**)&d_B, nbytes);
    cudaMalloc((void**)&d_C, nbytes);
    // Copy A and B to device memory
    cudaMemcpy(d_A, A, nbytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, nbytes, cudaMemcpyHostToDevice);

    // Part 2: Call kernel – to launch a grid of threads
    // to perform the actual vector addition
    ...

    // Part 3: Copy C from the device memory
    cudaMemcpy(C, d_C, nbytes, cudaMemcpyDeviceToHost);
    // Free device vectors
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

To call the kernel, we do:

void vec_add(float* A, float* B, float* C, int n) {
    // ...
    // Part 1: ...
    // Part 2:
    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
    // Part 3: ...
}
  • We call a kernel function with the launch configuration <<<no_blocks, no_threads>>>: the number of blocks in the grid and the number of threads per block.
  • To ensure that we have enough threads in the grid to cover all the vector elements, we set the number of blocks to the ceiling division (rounding the quotient up to the next integer) of the desired number of threads (n in this case) by the thread block size (256 in this case); see the sketch after this list for an integer-only way to write this.
  • Note that all the thread blocks operate on different parts of the vectors. They can be executed in any arbitrary order. The programmer must not make any assumptions regarding execution order.
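
As a small variation of my own (not from the book), the same launch can be written with integer-only arithmetic, which avoids the floating-point ceil call; it is the same rounding trick that appears in the exercise below:

// Part 2 (alternative): integer ceiling division, (n + 255) / 256 rounds up.
int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
vecAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);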
Warning

The code above can actually run slower than the sequential CPU version because of the overhead of transferring data back and forth between host and device. The kernel itself executes very fast, but we have to wait for the data transfers. Note that minimizing data movement is also one of the main motivations behind FlashAttention.

Now let's do a little exercise before advancing to the next section:

Exercise (Exercise 9 in PMPP)

Consider the following CUDA kernel and the corresponding host function that calls it:

01 __global__ void foo_kernel(float* a, float* b, unsigned int N) {
02     unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
03     if (i < N) {
04         b[i] = 2.7f * a[i] - 4.3f;
05     }
06 }
07 void foo(float* a_d, float* b_d) {
08     unsigned int N = 200000;
09     foo_kernel<<<(N + 128 - 1)/128, 128>>>(a_d, b_d, N);
10 }

a. What is the number of threads per block?
b. What is the number of threads in the grid?
c. What is the number of blocks in the grid?
d. What is the number of threads that execute the code on line 02?
e. What is the number of threads that execute the code on line 04?

Answer (don't peek at it)

a. The number of threads per block is 128.
b. The number of threads in the grid = number of blocks per grid * number of threads per block = (200000 + 128 - 1)/128 * 128 = 1563 * 128 = 200064.
c. The number of blocks in the grid is (200000 + 128 - 1)/128 = 1563.
d. All threads in the grid execute line 02, i.e. 200064 threads.
e. The number of threads that execute line 04 is N = 200000.

A Glimpse Under the Hood

So far we’ve treated the GPU as a black box that magically runs thousands of threads. But why does CUDA have this specific hierarchy of grids, blocks, and threads? The answer lies in how the hardware is actually built.

Blocks Map to Streaming Multiprocessors

A GPU is composed of multiple Streaming Multiprocessors (SMs). When you launch a kernel, CUDA assigns each block to an SM. An SM can run multiple blocks simultaneously (if it has enough resources), but a single block never spans across multiple SMs. This is why blocks are independent - they might run on completely different hardware units.

[Image] The execution model of CUDA: each block is assigned to an SM. Source: https://docs.nvidia.com/cuda/cuda-programming-guide/01-introduction/programming-model.html

For example, an NVIDIA A100 has 108 SMs. If you launch a kernel with 216 blocks, each SM gets roughly 2 blocks to execute.
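
If you want to check these numbers for your own GPU, a minimal sketch using the cudaGetDeviceProperties call from the CUDA runtime could look like this (error handling omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("SMs:                   %d\n", prop.multiProcessorCount);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Warp size:             %d\n", prop.warpSize);
    return 0;
}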

Threads Execute in Warps

Here’s a crucial detail: threads within a block don’t execute individually. Instead, the SM groups them into warps of 32 threads. All 32 threads in a warp execute the same instruction at the same time - this is called SIMT (Single Instruction, Multiple Threads).

This is why you see “32” everywhere in CUDA:

  • Block dimensions should be multiples of 32 for efficiency
  • Memory access patterns are optimized for 32-thread alignment
  • The “1024 threads per block” limit = 32 warps maximum

What Happens When Threads Diverge?

If threads in a warp take different branches (e.g., some execute if, others execute else), the warp must execute both paths sequentially, with threads disabled for the path they didn’t take. This is called branch divergence and it kills performance.
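
As a minimal illustration (my own toy example), the kernel below diverges inside every warp, because even and odd lanes take different branches and each warp has to execute both paths one after the other:

// Toy kernel with guaranteed branch divergence inside every warp.
__global__ void divergentKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (threadIdx.x % 2 == 0) {
            data[i] = data[i] * 2.0f;  // taken by even lanes only
        } else {
            data[i] = data[i] + 1.0f;  // taken by odd lanes only
        }
    }
}

By contrast, the boundary check if (i < n) in vecAddKernel diverges in at most one warp (the last one), which is why that pattern is generally harmless.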

Coming Up in Part 2

We’ve only scratched the surface. In Part 2, we’ll dive deeper into:

  • Warp scheduling and how the GPU hides memory latency.
  • Memory hierarchy: registers, shared memory, L1/L2 cache, global memory.
  • Memory coalescing: why access patterns matter so much.
  • Arithmetic intensity: are you compute-bound or memory-bound?
  • Occupancy: keeping those SMs busy.

Footnotes

  1. https://blogs.nvidia.com/blog/what-is-cuda-2/

  2. https://docs.nvidia.com/cuda/cuda-programming-guide/01-introduction/programming-model.html#programming-model