Note
- Source code: Available on GitHub.
- This is my journey to optimize a single-precision floating-point (FP32) matrix multiplication (SGEMM) kernel on my Apple M2 laptop. Since I don’t have an NVIDIA card, I’m using Apple’s Metal API instead of CUDA. The core ideas are the same: making GEMM as fast as possible.
The theoretical FP32 peak of an 8-core M2 is ~2.84 TFLOPS (2,840 GFLOPS). Can we even get close? Let’s find out.
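Where does that number come from? Here is a back-of-the-envelope sketch, assuming the commonly reported (not Apple-published) figures of 128 FP32 ALUs per GPU core and a roughly 1.4 GHz GPU clock:

```cpp
#include <cstdio>

int main() {
    // Assumed M2 GPU figures (commonly reported, not official Apple specs):
    const double gpu_cores     = 8;      // 8-core GPU variant
    const double alus_per_core = 128;    // FP32 lanes per core (assumption)
    const double clock_hz      = 1.4e9;  // ~1.4 GHz GPU clock (assumption)
    const double flops_per_fma = 2;      // one fused multiply-add = 2 FLOPs

    const double peak = gpu_cores * alus_per_core * clock_hz * flops_per_fma;
    std::printf("Theoretical FP32 peak: ~%.2f TFLOPS\n", peak / 1e12); // ~2.87 TFLOPS
    return 0;
}
```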
TL;DR
How fast did we get? Here’s the summary of the journey.
| Kernel | Best performance (GFLOPS) | Percent of peak performance |
|---|---|---|
| naive | | |
| tile_16 | | |
| tile_32 | | |
| tile_threads | | |
| tile_simdgroup | | |
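For reference, every GFLOPS number in this post comes from the standard GEMM FLOP count, 2·M·N·K, divided by the measured kernel time. The actual benchmark harness lives in the repo; the sketch below just shows the formula with a made-up problem size and timing:

```cpp
#include <cstdio>

// SGEMM does one multiply and one add for every (m, n, k) triple,
// so the total work is 2 * M * N * K FLOPs.
double gemm_gflops(double M, double N, double K, double seconds) {
    return (2.0 * M * N * K) / seconds / 1.0e9;
}

int main() {
    // Hypothetical example: a 4096x4096x4096 SGEMM finishing in 0.25 s.
    std::printf("%.1f GFLOPS\n", gemm_gflops(4096, 4096, 4096, 0.25)); // ~549.8
    return 0;
}
```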
The Optimization Journey
This project was structured as a series of optimizations, with each new kernel building on the lessons of the last.
1. The Naive Kernel (naive.metal)
This kernel is the most straightforward implementation: one thread computes one element of the final matrix. It’s simple and easy to verify, but it hammers global device memory (DRAM) with no data reuse.
```metal
kernel void matmul_naive(device const float * A [[buffer(0)]],
                         device const float * B [[buffer(1)]],
                         device float * C [[buffer(2)]],
                         device const MatmulParams& params [[buffer(3)]],
                         uint2 block_pos [[ threadgroup_position_in_grid ]],
                         uint2 thread_pos [[ thread_position_in_threadgroup ]])
{
    // Thread index
    const uint thread_x = thread_pos.x; // CUDA: threadIdx.x
    const uint thread_y = thread_pos.y; // CUDA: threadIdx.y

    // Calculate global row and col
    const uint j = block_pos.x * params.BLOCK_SIZE_X + thread_x; // col
    const uint i = block_pos.y * params.BLOCK_SIZE_Y + thread_y; // row

    const uint M = params.M;
    const uint N = params.N;
    const uint K = params.K;

    if (i < M && j < N) {
        float sum = 0.f;
        for (uint p = 0; p < K; ++p) {
            // Read one row of A and one col of B from DRAM
            sum += A[i * K + p] * B[p * N + j];
        }
        C[i * N + j] = sum;
    }
}
```

Performance: GFLOPS. Not great, but it’s a start.
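Since every later kernel has to produce the same output, it helps to have a plain CPU reference to check against. The repo has its own harness; the sketch below is only an illustrative stand-in (the function names and the 1e-3 tolerance are my assumptions, not from the project):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Plain triple-loop reference: C = A * B, row-major, A is MxK, B is KxN.
void sgemm_reference(const std::vector<float>& A, const std::vector<float>& B,
                     std::vector<float>& C, int M, int N, int K) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float sum = 0.f;
            for (int p = 0; p < K; ++p)
                sum += A[i * K + p] * B[p * N + j];
            C[i * N + j] = sum;
        }
}

// Compare GPU output against the reference with a loose relative FP32 tolerance.
bool matches(const std::vector<float>& gpu, const std::vector<float>& ref,
             float tol = 1e-3f) {
    for (size_t i = 0; i < ref.size(); ++i)
        if (std::fabs(gpu[i] - ref[i]) > tol * std::max(1.0f, std::fabs(ref[i])))
            return false;
    return true;
}

int main() {
    const int M = 2, N = 2, K = 2;
    std::vector<float> A = {1, 2, 3, 4}, B = {5, 6, 7, 8}, C(M * N);
    sgemm_reference(A, B, C, M, N, K);            // C = {19, 22, 43, 50}
    return matches(C, {19, 22, 43, 50}) ? 0 : 1;
}
```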
2. Tiling (tile_16.metal & tile_32.metal)
This is the first real optimization. Instead of hitting DRAM for every multiplication, we load a “tile” of matrix A and a “tile” of matrix B into the fast, on-chip threadgroup memory (the equivalent of CUDA’s __shared__ memory).
Each thread in the threadgroup helps load a piece of the tiles, we synchronize with threadgroup_barrier (the equivalent of CUDA’s __syncthreads()), and then all threads in the group compute their part of the output using data from that fast memory.
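Before looking at the kernel, a rough sense of why this pays off: each value staged in threadgroup memory gets reused TILE_SIZE times, so DRAM traffic drops by roughly that factor. A sketch of the arithmetic (the 4096-sized problem is just an example, and this ignores the write of C and cache effects):

```cpp
#include <cstdio>

// Approximate DRAM reads (in floats) for C = A * B.
// Naive: every multiply-add pulls one element of A and one of B from DRAM.
// Tiled: each element staged in threadgroup memory is reused TILE times.
double naive_reads(double M, double N, double K)           { return 2.0 * M * N * K; }
double tiled_reads(double M, double N, double K, double T) { return 2.0 * M * N * K / T; }

int main() {
    const double M = 4096, N = 4096, K = 4096; // hypothetical problem size
    std::printf("naive  : %.1f GB read\n", naive_reads(M, N, K) * 4 / 1e9);     // ~549.8 GB
    std::printf("tile_16: %.1f GB read\n", tiled_reads(M, N, K, 16) * 4 / 1e9); // ~34.4 GB
    return 0;
}
```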
```metal
kernel void matmul_tile_16(device const float * A [[buffer(0)]],
                           /* ... buffers ... */
                           uint2 thread_pos [[ thread_position_in_threadgroup ]])
{
    // create tile block
    constexpr uint TILE_SIZE = 16;
    threadgroup float tileA[TILE_SIZE][TILE_SIZE];
    threadgroup float tileB[TILE_SIZE][TILE_SIZE];

    // ... calculate row, col, block_x, block_y ...

    float sum = 0.0f;
    for (uint t = 0; t < (params.K + TILE_SIZE - 1) / TILE_SIZE; ++t) {
        uint tiledColA = t * TILE_SIZE + thread_x;
        uint tiledRowB = t * TILE_SIZE + thread_y;

        // Load tile A from global to threadgroup memory
        if (row < params.M && tiledColA < params.K)
            tileA[thread_y][thread_x] = A[row * params.K + tiledColA];
        else
            tileA[thread_y][thread_x] = 0.0f;

        // Load tile B from global to threadgroup memory
        if (tiledRowB < params.K && col < params.N)
            tileB[thread_y][thread_x] = B[tiledRowB * params.N + col];
        else
            tileB[thread_y][thread_x] = 0.0f;

        // Wait for all threads to finish loading
        threadgroup_barrier(mem_flags::mem_threadgroup);

        // fast matmul on tile (from fast memory)
        #pragma clang loop unroll(full)
        for (uint k = 0; k < TILE_SIZE; ++k) {
            sum += tileA[thread_y][k] * tileB[k][thread_x];
        }

        // Wait for all threads to finish computing
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }

    if (row < params.M && col < params.N) {
        C[row * params.N + col] = sum;
    }
}
```

Performance: tile_16 ( GFLOPS) was a big jump! But… tile_32 ( GFLOPS) was slower. Why?
Note (Why is tile_16 faster than tile_32?)
The answer is Occupancy.
- **What is Occupancy?** Occupancy is the ratio of active threadgroups to the maximum number of threadgroups that can run on a single GPU compute unit (CU, or an SM in CUDA). High occupancy is critical for hiding memory latency: when one group of threads is stalled waiting for data from DRAM, the GPU scheduler can switch to another resident group and keep the compute units busy.
- **Resource Limits:** A CU has a fixed amount of resources, including threadgroup memory.
  - tile_16 kernel (16x16 = 256 threads): threadgroup memory = (16*16 + 16*16) * 4 bytes = 2048 bytes.
  - tile_32 kernel (32x32 = 1024 threads): threadgroup memory = (32*32 + 32*32) * 4 bytes = 8192 bytes.
- **The Bottleneck:** The M2 GPU’s CUs have a limited amount of threadgroup memory (e.g., 32 KB). The tile_32 kernel’s 8 KB footprint is significant: if a single threadgroup consumes too large a chunk of the CU’s memory, the scheduler cannot fit as many concurrent threadgroups onto that CU.
With tile_32, we fit fewer groups per CU, leading to low occupancy. If those few groups stall on a memory read, the expensive ALUs sit idle. The tile_16 kernel, with its smaller 2 KB footprint, allows many more threadgroups to be resident, effectively hiding memory latency.
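To make the trade-off concrete, here is a toy occupancy calculation that considers only threadgroup memory, with an assumed 32 KB per-CU limit (real occupancy also depends on register usage and the maximum number of resident threads, which this ignores):

```cpp
#include <cstdio>

// How many threadgroups fit on one compute unit if threadgroup memory were
// the only limit? (32 KB per CU is an assumption; registers and the resident
// thread limit also matter on real hardware.)
int groups_per_cu(int tile, int cu_tg_bytes = 32 * 1024) {
    const int bytes_per_group = 2 * tile * tile * (int)sizeof(float); // tileA + tileB
    return cu_tg_bytes / bytes_per_group;
}

int main() {
    std::printf("tile_16: %2d groups per CU (2048 bytes each)\n", groups_per_cu(16)); // 16
    std::printf("tile_32: %2d groups per CU (8192 bytes each)\n", groups_per_cu(32)); // 4
    return 0;
}
```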
3. More Work Per Thread (tile_threads.metal)
The next step was to reduce synchronization overhead and increase register reuse. In the previous kernel, each thread computed only one output value. Here, we use a smaller threadgroup (8x8) but make each thread compute a 4x4 block of the output tile.
These 16 accumulator values (C_reg[4][4]) are stored in the thread’s private registers, which are even faster than threadgroup memory.
```metal
kernel void matmul_tile_threads(device const float * A [[buffer(0)]],
                                /* ... buffers ... */
                                uint2 thread_pos [[thread_position_in_threadgroup]])
{
    // ... TILE_M=32, TILE_N=32, TILE_K=16 ...
    // ... TG_M=8, TG_N=8 ...

    // Work per thread
    constexpr uint WPT_M = TILE_M / TG_M; // 4 rows per thread
    constexpr uint WPT_N = TILE_N / TG_N; // 4 cols per thread

    const uint thread_m = thread_pos.y; // 0..7
    const uint thread_n = thread_pos.x; // 0..7

    // ...

    // 16 accumulator values stored in private registers
    float C_reg[WPT_M][WPT_N] = {{0.0f}};

    for (uint t = 0; t < params.K; t += TILE_K) {
        // ... complicated loading logic to fill tileA/tileB ...

        threadgroup_barrier(mem_flags::mem_threadgroup);

        // Compute on registers
        #pragma clang loop unroll(full)
        for (uint k = 0; k < TILE_K; ++k) {
            #pragma clang loop unroll(full)
            for (uint m = 0; m < WPT_M; ++m) {
                float a_val = tileA[thread_m * WPT_M + m][k];
                #pragma clang loop unroll(full)
                for (uint n = 0; n < WPT_N; ++n) {
                    C_reg[m][n] += a_val * tileB[k][thread_n * WPT_N + n];
                }
            }
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }

    // Write 16 results from registers to global memory
    for (uint m = 0; m < WPT_M; ++m) {
        for (uint n = 0; n < WPT_N; ++n) {
            // ... calculate c_row, c_col ...
            if (c_row < params.M && c_col < params.N) {
                C[c_row * params.N + c_col] = C_reg[m][n];
            }
        }
    }
}
```

Performance: GFLOPS. Another solid jump! We’re doing more compute per memory load, and each threadgroup_barrier is now amortized over 16 outputs per thread instead of one.
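Why does a 4x4 register block help so much? Per k-step, each thread now reads 4 values of A and 4 values of B from threadgroup memory and performs 16 fused multiply-adds, versus 2 reads for a single FMA in the 1x1 case. A sketch of that ratio (assuming the compiler keeps the per-step values in registers, which the unrolled loops encourage):

```cpp
#include <cstdio>

// FLOPs per threadgroup-memory read for an RxC register block per thread,
// assuming the R A-values and C B-values of each k-step stay in registers.
double flops_per_read(double R, double C) {
    const double flops = 2.0 * R * C; // R*C fused multiply-adds = 2*R*C FLOPs
    const double reads = R + C;       // R values of A + C values of B
    return flops / reads;
}

int main() {
    std::printf("1x1 per thread: %.1f FLOPs per read\n", flops_per_read(1, 1)); // 1.0
    std::printf("4x4 per thread: %.1f FLOPs per read\n", flops_per_read(4, 4)); // 4.0
    return 0;
}
```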
4. Hardware Acceleration (tile_simdgroup.metal)
This is the game-changer. Modern GPUs have specialized hardware for matrix math (analogous to Tensor Cores on NVIDIA GPUs). Metal exposes this through simdgroup_matrix intrinsics.
Instead of writing for loops, we tell the hardware: “load an 8x8 matrix tile”, “load another 8x8 tile”, and “multiply-accumulate them”. The compiler maps this to the ultra-fast hardware units. The code becomes much simpler and much faster.
```metal
#include <metal_simdgroup_matrix>

kernel void matmul_tile_simdgroup(device const float* A [[buffer(0)]],
                                  device const float* B [[buffer(1)]],
                                  device float* C [[buffer(2)]],
                                  device const MatmulParams& params [[buffer(3)]],
                                  uint2 block_pos [[threadgroup_position_in_grid]],
                                  uint simd_id [[simdgroup_index_in_threadgroup]])
{
    const uint TILE_DIM = 8;

    // ... calculate c_row, c_col for this SIMD-group ...

    if (c_row >= params.M || c_col >= params.N) {
        return;
    }

    // Create an 8x8 accumulator matrix in SIMD-group registers
    simdgroup_float8x8 acc = make_filled_simdgroup_matrix<float, 8, 8>(0.0f);

    for (uint k = 0; k < params.K; k += TILE_DIM) {
        device const float* a_ptr = A + c_row * params.K + k;
        device const float* b_ptr = B + k * params.N + c_col;

        simdgroup_float8x8 a_tile;
        simdgroup_float8x8 b_tile;

        // Load 8x8 tiles from global memory
        simdgroup_load(a_tile, a_ptr, params.K);
        simdgroup_load(b_tile, b_ptr, params.N);

        // THE MAGIC: D = A * B + C
        simdgroup_multiply_accumulate(acc, a_tile, b_tile, acc);
    }

    // Store the 8x8 result tile to global memory
    simdgroup_store(acc, C + c_row * params.N + c_col, params.N);
}
```

Performance: GFLOPS. This is the fastest kernel yet, and the code is the cleanest. This shows that the best optimization is often to use the hardware as it was designed.
This was a fantastic journey into the guts of Apple Silicon. While 17% of peak may not sound high, it’s a 2.3x speedup over the naive baseline, and the process taught me an incredible amount about occupancy, memory hierarchies, and hardware-specific intrinsics.