# Compilers and More: Optimizing GPU Kernels

My last column discussed some of the complexities of programming GPUs today, focusing on how to interface the host program with the GPU. Here we focus on programming the GPU itself. As with last time, we’ll look at a simple single-precision matrix multiplication, equivalent to the BLAS SGEMM routine.

Matmul is a highly parallel algorithm, but let me emphasize that parallelism does not equate to performance. We need to carefully sculpt our algorithm to match the parallelism available in the architecture in order to reap the benefits. This is true whether we are targeting a GPU, a multicore x64, or even a single core with packed SSE operations. As an example, I took the simple matmul loop (in C, but with the matrices stored column-major):

for( int j = 0; j < m; ++j )

for( int k = 0; k < p; ++k )

for( int i = 0; i < n; ++i )

a[i+pitch_a*j] += b[i+pitch_b*k] * c[k+pitch_c*j];

modified it several ways and ran it on an Intel Xeon (3GHz, 6MB cache, 16GB memory, Penryn) using 4096×4096 matrices (to compare with results we’ll see below). With the loop in the order shown (stride-1 inner loop), the program ran at 1.7 GFLOPs; this is compiled C performance (using `pgcc -fast`). We can improve that by *tiling* or *blocking* the loops, organizing the matmul as a a bunch of submatrix multiplications, sized so each submatrix matmul fits in the processor cache. This improves performance to 5.7 GFLOPs, and it jumps to over 22 GFLOPs when we use OpenMP directives and run on all four cores. Advanced compilers help by automatically managing the vectorization, unrolling, memory alignments, adding prefetch instructions, and so forth.

We’re going to see several matmul GPU kernels, with performance on our GPU development system, with an NVIDIA GeForce GTX 280 (1GB memory, 30 multiprocessors), using NVIDIA’s CUDA language. The host is a Linux (OpenSUSE 11.0) triple-core AMD Phenom (2.1 GHz, 500KB cache, 4GB memory), though the host hardly matters; the performance for these experiments is entirely dominated by the GPU code.

As on the CPU, performance on a GPU can be fragile; small changes to the program can make large differences in performance. It’s easy to write a slow program. This was a characteristic of High Performance Fortran, one that (my opinion) was a major cause of its downfall; while HPF made it easier to write parallel programs, it didn’t make parallel programs fast. That is the job of the HPC programmer; the same will be true for accelerators, GPUs, and even multicore CPUs.

GPUs deliver their dramatic high performance through a well-balanced, carefully managed, highly parallel architecture. Algorithms running on the GPU must be parallelized and balanced as well; this does not come for free. Program development may cost extra time and effort to understand and use the appropriate programming model, a model that may not match the simple scalar processor with cache model we are comfortable with on x64 hosts. However, the analysis and programming techniques used to develop GPU algorithms will probably help you develop multicore programs as well. A good programming model with good compilers and tools can relieve you of much busywork, but you still have to think, and you still have to understand algorithms and architecture, and you should expect no less.

From here on below, I show many versions of matmul; if you’re not a programmer or want to skip over the details, look for the __performance__ tags below, until the __Summary__; you don’t want to miss the conclusions. If you are a programmer and want to see all the code, you’ll find all the sources in a kernels tarfile at the PGI Web site.

In my last column, I proposed a simple matmul kernel for the GPU and focused on the host code to drive the kernel. We’ll use that simple kernel to start the discussion. What I had done is taken the matmul loop (as shown above), strip-mined the stride-1 `i` loop to the CUDA SIMD width of 32:

for( int is = 0; is < n; is += 32 )

for( int i = is; i < is+32; ++i )

for( int j = 0; j < m; ++j )

for( int k = 0; k < p; ++k )

a[i+pitch_a*j] += b[i+pitch_b*k] * c[k+pitch_c*j];

run the `i` element loop as a thread block, and run the `is` strip loop and `j` loop in parallel:

parfor( int is = 0; is < n; is += 32 ) /* K1 */

parfor( int j = 0; j < m; ++j )

SIMDfor( int i = is; i < is+32; ++i )

for( int k = 0; k < p; ++k )

a[i+pitch_a*j] += b[i+pitch_b*k] * c[k+pitch_c*j];

then optimized by hand just a little. The parallel (grid) loops and the SIMD (thread block) loop are handled implicitly by the GPU hardware and firmware, so they don’t appear in the kernel code. All that’s left is the body, the `k` loop. The final kernel in all its glory, cut-and-pasted from my CUDA source file, is:

extern "C" __global__ void /* K1 */

mmkernel( float* a, float* b, float* c,

int pitch_a, int pitch_b, int pitch_c,

int n, int m, int p )

{

int i = blockIdx.x*32 + threadIdx.x;

int j = blockIdx.y;

float sum = 0.0;

for( int k = 0; k < p; ++k )

sum += b[i+pitch_b*k] * c[k+pitch_c*j];

a[i+pitch_a*j] = sum;

}

**Performance: 28 GFLOPs** – SIMD width 32 *next*

This version runs at 28 GFLOPs on our system (on 4096×4096 matrices). In the interest of full disclosure, I compiled the kernels discussed here with NVIDIA’s NVCC compiler version 2.0 with the `-O` option, and compiled the driver routine with `pgcc -fast`; I ran each program three times and report the middle performance score, rounding the GFLOPs down to an integer value. I will generally show a GFLOPs number most directly comparable to a host matmul, including the overhead of transmitting the operand matrices to the GPU memory and the result matrix back. I will sometimes give the performance of just the matmul kernel on the GPU; while the two numbers are often quite close, the kernel-only number is useful to expose more clearly the effect of changes to the kernel program (since the overhead stays the same). I show results for 4096×4096 matrices, which is close to the peak performance for each kernel. For version K1, the host-to-host and kernel-only performance were 28 and 29 GFLOPs, respectively.

That may sound like good performance, but we’re not nearly taking full advantage of the available parallelism. Recall the NVIDIA architecture description; the card I’m using has 30 multiprocessors, each with eight thread processors, quad-clocked to get a SIMD width of 32. The kernel above is a scalar program, but the card runs 32 copies of it in SIMD mode (or SIMT mode, to use NVIDIA’s term); the 32 copies comprise a *warp*. Each multiprocessor uses multithreading to support up to 32 warps (1024 scalar threads). The 32 warps can come from different *thread blocks* (different iterations of `is` or `j`) or from wider thread blocks (more than 32 scalar threads). There are limits in this generation of the card: each multiprocessor can support up to eight simultaneous thread blocks, and a thread block can support up to 16 warps.

**Performance: 36 GFLOPs** – SIMD width 64 *next*

The K1 kernel has only one warp per thread block, so at most eight thread blocks will be active on each multiprocessor, out of the possible 32. I can improve that by strip-mining the `i` loop to a width of 64, changing the 32 to a 64 in kernel K1, and running with 64 threads in each thread block. With this version, if eight thread blocks are scheduled on each multiprocessor, we get up to 16 warps, so the multithreading is more effective. And we see a performance increase, to 36 GFLOPs (38 kernel-only).

**Performance: 35 GFLOPs** – SIMD width 128 *next*

So what happens if I try this trick again, doubling or quadrupling the strip size to 128 or 256? This increases the maximum number of warps per multiprocessor to 32 (which is the limit), so we might expect another bump in performance from improved multithreading. Unfortunately, we don’t; the performance drops slightly to 35 GFLOPs (36 kernel-only) in both cases. This I can’t quite explain.

**Performance: 1.7 GFLOPs** – SIMD width 32, non-stride-1 array accesses *next*

Even with this simple version, I made some assumptions and optimizations, knowing something about the machine. I know that stride-1 accesses in a thread block are important, so I ran the stride-1 `i` loop along the thread index. Just how important is that? Suppose we switch the `i` and `j` indices, so the SIMD memory accesses are along a column; the performance drops from 28 GFLOPs (K1) to 1.7. We can call this kernel Ks (s for stupid, or slow).

**Performance: 5.7 GFLOPs** – SIMD width 32, inverted parallel loop indices *next*

Still, we’ve only just started. If we inspect the code for kernel K1, we note that the inner loop contains two memory fetches, for `b` and `c`; both fetches are from the device memory, which has a very high latency. In particular, the fetch for `c` loads the same element for all the threads in the thread block. The memory system is designed for high bandwidth when all the threads access consecutive elements, such as with the `b` access. This used to be called *superword* access in classical vector machines, where the memory returns 64-bytes (or more) at a time. Kernel K1 doesn’t take advantage of this memory design for the `c` access, but we can fix that. Let’s strip-mine the `k` loop, and load a strip of `c` into the multiprocessor local memory. The pseudo code is:

parfor( int is = 0; is < n; is += 32 ) /* K2 */

parfor( int j = 0; j < m; ++j )

SIMDfor( int i = is; i < is+32; ++i )

for( int ks = 0; ks < p; ks += 32 )

cb[ks:ks+31] = c[ks+pitch_c*j:ks+31+pitch_c*j];

for( int k = ks; k < ks+32; ++k )

a[i+pitch_a*j] += b[i+pitch_b*k] * cb[k-ks];

**Performance: 33 GFLOPs** – cached access to `c`, SIMD width 32**Performance: 55 GFLOPs** – cached access to `c`, SIMD width 64**Performance: 63 GFLOPs** – cached access to `c`, SIMD width 128 *next*

Note the vector fetch of `c` into the temporary array `cb`. This is handled in kernel K2 by letting each thread fetch one element and storing into the multiprocessor local memory, so the inner loop only has one device memory fetch; the fetch of `cb` from the local memory is almost as fast as a register access; we see the performance improves to 33 GFLOPS, up from 28. We can again increase the number of threads per block from 32 to 64 and 128, and we see performance improve from 33 to 55 and 63 GFLOPs. As with kernel K1, increasing to 256 threads per block does not improve performance. An implementation detail: with more than one warp per thread block, we need to synchronize the warps after loading the temp array `cb`, and before reloading it the next time around the `ks` loop; see the CUDA source code for this detail.

**Performance: 63 GFLOPs** – cached access to `c`, SIMD width 64, unroll inner loop *next*

But we’re not done yet. What if we unroll the inner loop, to reduce the loop overhead? We might unroll to a factor of 2 or 4 or even 16. Unrolling the inner loop once in the 64-wide K2 kernel does improve performance, getting 63 GFLOPs, but more unrolling doesn’t help, and it doesn’t help the 128-wide kernel.

So far we’ve got two kernel versions, with variations in the thread-block (vector) size and unrolling. And we’ve only just begun. We tried unrolling the inner `k` loop; what if we try unrolling one of the outer loops? We could let each kernel instance compute two values of the `i` loop. The pseudo-code looks like:

parfor( int is = 0; is < n; is += 64 ) /* K3 */

parfor( int j = 0; j < m; ++j )

SIMDfor( int i = is; i < is+32; ++i )

for( int ks = 0; ks < p; ks += 32 )

cb[ks:ks+31] = c[ks+pitch_c*j:ks+31+pitch_c*j];

for( int k = ks; k < ks+32; ++k )

a[i+pitch_a*j] += b[i+pitch_b*k] * cb[k-ks];

a[i+32+pitch_a*j] += b[i+32+pitch_b*k] * cb[k-ks];

**Performance: 53 GFLOPs** – cached access to `c`, SIMD width 32, unroll `i` loop**Performance: 63 GFLOPs** – cached access to `c`, SIMD width 32, unroll `i` loop 3x *next*

Each iteration of the `i` loop now computes values for `i` and `i+32`. We don’t expect much advantage here, since the only values shared between the two `i` iterations are loaded from the local memory, which is already pretty fast. But even this kernel improves upon K2, with 53 GFLOPs. We can improve this to 63 GFLOPs by unrolling more or increasing the SIMD width to 64.

Next, we can try unrolling the `j` loop, so each kernel computes values for `j` and `j+1`. The pseudo-code is:

parfor( int is = 0; is < n; is += 32 ) /* K4 */

parfor( int j = 0; j < m; j += 2 )

SIMDfor( int i = is; i < is+32; ++i )

for( int ks = 0; ks < p; ks += 32 )

cb0[ks:ks+31] = c[ks+pitch_c*j:ks+31+pitch_c*j];

cb1[ks:ks+31] = c[ks+pitch_c*(j+1):ks+31+pitch_c*(j+1)];

for( int k = ks; k < ks+64; ++k )

a[i+pitch_a*j] += b[i+pitch_b*k] * cb0[k-ks];

a[i+pitch_a*(j+1)] += b[i+pitch_b*k] * cb1[k-ks];

**Performance: 59 GFLOPs** – cached access to `c`, SIMD width 32, unroll `j` loop**Performance: 98 GFLOPs** – cached access to `c`, SIMD width 64, unroll `j` loop**Performance: 117 GFLOPs** – cached access to `c`, SIMD width 128, unroll `j` loop *next*

Here, we note the two assignments in the `k` loop fetch the same value of `b` from the device memory. This version gets 59 GFLOPs; it jumps to 98 GFLOPs when we increase the SIMD width to 64, and again to 117 GFLOPs with a SIMD width of 128. Now we’re starting to see real performance, over 100 GFLOPs, host-to-host.

**Performance: 176 GFLOPs** – cached access to `c`, SIMD width 128, unroll `j` loop 3x *next*

But we’re not done yet. What if we unroll the `j` loop by four iterations instead of just two? This involves keeping four partial sums. Now the performance with SIMD width 128 is 176 GFLOPs host-to-host, and over 210 GFLOPs on the device.

**Performance: 202 GFLOPs** – cached access to `c`, SIMD width 128, unroll `j` loop 3x and `k` loop 1x **Performance: 208 GFLOPs** – cached access to `c`, SIMD width 128, unroll `j` and `k` loops 3x *next*

More unrolling of the `j` loop doesn’t improve performance, but what if we combine this with unrolling the `k` loop? If we unroll the `j` loop 3 times and the `k` loop once, with SIMD width of 128, we get 202 GFLOPs; unrolling the `k` 3 times gives us 208 GFLOPs (host-to-host), and 265 GFLOPs (kernel-only).

Our peak performance so far looks pretty good. It took some experimentation, but we have a version that uses only 128 threads and 2KB local memory per thread block, allowing up to 8 thread blocks on each multiprocessor, so taking great advantage of the multithreading properties of the machine. We haven’t even fully explored all the combinations. What if we combine `i` loop unrolling with the `j` and `k` loop unrolling? Should we explore other unroll factors as we combine unrolling multiple loops? What if we use pointer arithmetic instead of array references (this really questions whether NVCC optimizes the array references, but it seems to do a good job there)? I desperately wanted to break the 200 GFLOP barrier, and reached it. The version of `sgemm` that comes with CUDA BLAS gets about 260 GFLOPs (host-to-host) on a 4096×4096 matrix; I’ve still got some work to do to get that extra 25%.

When we optimize a matmul for a general purpose CPU with a cache, we’ve learned that we need a *tiled* algorithm. We can do the same thing on the GPU, where we fit the submatrices in the local memory. The pseudo code is:

parfor( int is = 0; is < n; is += 16 ) /* K5 */

parfor( int js = 0; js < m; js += 16 )

SIMDfor( int i = 0; i < 16; ++i )

SIMDfor( int j = 0; j < 16; ++j )

at[is:is+15][js:js+15] = 0.0; /* init A tile */

for( int ks = 0; ks < p; ks += 16 )

/*load B tile*/ bt[i][ks:ks+15] = b[i+pitch_b*ks:i+pitch_b*(ks+15):pitch_b];

/*load C tile*/ ct[ks:ks+15][j] = c[ks+pitch_c*(js+j):(ks+15)+pitch_c*(js+j)];

for( int k = ks; k < ks+64; ++k )

/*tile MM */ at[i][j] += bt[i][k]*ct[k][j];

a[i+pitch_a*j] = a[i][j]; /* store A tile */

**Performance: 164 GFLOPs** – tiled loops, SIMD width 16×16, cached access to `b` and `c`

We have to choose a tile size, and square tiles seem to make as much sense as any other shape, at least to start with. We choose 16×16 tiles and run 256 threads in a thread group, so each thread will compute one element of the `a` tile; this lets us keep that element in a register. The actual kernel code is slightly more complex than the previous kernels. It’s important to recall that this scalar kernel is one of a thread group or cohort of 256 cooperating instances, and it only works in that domain. This version gives us 164 GFLOPs, not quite as good as we’ve already seen. Why not? One reason is the thread group is 256 threads, so we hit the 1024 threads/multiprocessor limit with only four thread groups. We can address that as well, but I still haven’t quite reached the peak performance shown on kernel K4. The CUDA blas (260 GFLOPs) sgemm is similar to this tiled version. It’s based on work by Vasily Volkov, a Computer Science PhD student at Cal; Vasily’s code uses a 16×4 thread block with the `i` loop unrolled by 16, the `j` loop by 4, and the `k` loop by 16 (if I read it right). The code might be hard to follow, but it sure beats trying to code a matmul in DirectX.

So let’s suppose we’ve decided on the K4 algorithm. It assumes that the matrix sizes are multiples of 32 (or 64 or 128) in all dimensions, though it doesn’t require the matrices to be square. One way to satisfy this is to pad all your matrices, filling in zeroes in the extra rows and columns. Matrix addition and multiplication will preserve these zeroes and will not pollute the actual values; this may be your best option. Another solution is to add conditionals so as to not run off the ends of the matrices. This complicates the code and can affect performance. The simplest method to test for array limits is to put conditionals around the device memory fetch and store operations; if we fill in zeros to the `b` and `c` tiles, the innermost loop won’t need any tests. I reproduce the body of the kernel here:

float sum0 = 0.0, sum1 = 0.0;

for( int ks = 0; ks < p; ks += 32 ){

if( ks+tx < p && j < m )

cb0[tx] = c[ks+tx+pitch_c*j];

else

cb0[tx] = 0.0;

if( ks+tx < p && j+1 < m )

cb1[tx] = c[ks+tx+pitch_c*(j+1)];

else

cb1[tx] = 0.0;

__syncthreads();

if( i < n ){

for( int k = ks; k < ((ks+32 < m) ? ks+32 : m); ++k ){

float rb = b[i+pitch_b*k];

sum0 += rb * cb0[k-ks];

sum1 += rb * cb1[k-ks];

}

}

__syncthreads();

}

if( i < n && j < m )

a[i+pitch_a*j] = sum0;

if( i < n && j+1 < m )

a[i+pitch_a*(j+1)] = sum1;

Even if `i` and `j` are outside the matrix bounds, we can’t just skip the body of the loop for two reasons. First, each thread is part of a thread group, and as such it loads part of the data into the local temporary arrays `cb0` and `cb1`; even if this thread has nothing to compute, it has to do its part of the shared work. Second, we have those pesky barrier synchronizations; all threads in a thread group must participate in the barrier, so even if this thread has no work to do, it had better reach those barriers.

These tests cost about 5% in performance, in the simplest version of K4. It’s less costly in the more complex versions, but the code gets messy when mixed with some of the unrolling. But it will work with any matrix size, whereas K4 requires the size to be a multiple of 32.

Of course, if you need to deliver a library that works regardless of the matrix sizes, you have another option. You can create two versions of your routine, a faster one that works when the matrix sizes are appropriate multiples of 16, and a slower, general purpose one that works for other matrix sizes, with a conditional test to execute the right one. Then you get your good benchmark numbers (all benchmarks use large powers of two, right?), and you get right answers, too.

The point I’ve tried to make is how sensitive the performance of the GPU is to the formulation of your kernel, and how much and what kind of experimentation you’ll need to do to optimize your performance. How much of the optimization process will carry over from one GPU to another, or from one generation to the next from the same vendor? Many programmers like this low level of control, and it certainly could be appropriate when developing a numerical library, in the same way assembly language is appropriate.

To be fair, the same is true on your CPU as well; you need to optimize your matmul for (packed) vector operations, memory strides, and cache locality. A bad program will run several times slower than a good one; ordering the matmul loops so the inner loop is non-stride-1 reduces the performance on large matrices (on our Penryn) by more than a factor of 10. But compilers and tools are far more mature and helpful when compiling for an x64, IBM POWER, Sun SPARC, or other CPU.

I’m sure many readers would like to tell me (again) that I should be using the prebuilt library version of `sgemm` for matmul, not writing my own. Save your breath. Matmul is just one simple example here, three loops, three matrices, lots of parallelism, and yet I put in several days of work to get this seven line loop optimized for the GPU.

We can compare the evolution of GPU programming to the evolution of shared-memory parallel programming. There were many new languages designed to take advantage of parallelism (Id, SISAL, many others). Many low-level libraries were written to create and manage threads, eventually standardizing on Posix threads (aka pthreads). Much work was done on automatic parallelization, dating back to the 1960s and 1970s. When successful commercial shared-memory multiprocessors became more widely available in the 1980s, an effort began to standardize a less intrusive programming interface for multiple processors, driven by multiprocessor workstations, eventually resulting in the OpenMP API, which defines directives and a runtime interface to a shared-memory parallel programming model.

GPUs have their own set of domain-specific languages, including GLSL (OpenGL Shading Language), HLSL (high level shader language) from Microsoft for DirectX, and Cg (C for graphics) from NVIDIA. We’re now in a period with development of low-level libraries and interfaces to create and manage GPU threads; the OpenCL effort aims to standardize this. The cost to port a nontrivial application to this model is high, though the potential performance is alluring.

Luckily for me, my application (the compiler) runs on the host, and I don’t have to port that. But what about the real application programmer, who has thousands (or hundreds of thousands) of lines of code? Is it feasible to take GPU or accelerator programming concepts, abstract them into a predictable and useful programming model, and present them using a portable programming interface, in the same way that OpenMP abstracts and presents multiprocessor and multicore systems? That’s a topic for my next column.

—–

*Michael Wolfe has developed compilers for over 30 years in both academia and industry, and is now a senior compiler engineer at The Portland Group, Inc. (**www.pgroup.com**), a wholly-owned subsidiary of STMicroelectronics, Inc. The opinions stated here are those of the author, and do not represent opinions of The Portland Group, Inc. or STMicroelectronics, Inc.*