Compilers and More: Optimizing GPU Kernels

By Michael Wolfe

October 30, 2008

My last column discussed some of the complexities of programming GPUs today, focusing on how to interface the host program with the GPU. Here we focus on programming the GPU itself. As with last time, we’ll look at a simple single-precision matrix multiplication, equivalent to the BLAS SGEMM routine.

Matmul is a highly parallel algorithm, but let me emphasize that parallelism does not equate to performance. We need to carefully sculpt our algorithm to match the parallelism available in the architecture in order to reap the benefits. This is true whether we are targeting a GPU, a multicore x64, or even a single core with packed SSE operations. As an example, I took the simple matmul loop (in C, but with the matrices stored column-major):

     for( int j = 0; j < m; ++j )
       for( int k = 0; k < p; ++k )
         for( int i = 0; i < n; ++i )
           a[i+pitch_a*j] += b[i+pitch_b*k] * c[k+pitch_c*j];

modified it several ways and ran it on an Intel Xeon (3GHz, 6MB cache, 16GB memory, Penryn) using 4096×4096 matrices (to compare with results we’ll see below). With the loop in the order shown (stride-1 inner loop), the program ran at 1.7 GFLOPs; this is compiled C performance (using pgcc -fast). We can improve that by tiling or blocking the loops, organizing the matmul as a series of submatrix multiplications, sized so each submatrix matmul fits in the processor cache. This improves performance to 5.7 GFLOPs, and it jumps to over 22 GFLOPs when we use OpenMP directives and run on all four cores. Advanced compilers help by automatically managing vectorization, unrolling, and memory alignment, adding prefetch instructions, and so forth.
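
As a rough illustration of the tiling idea, the C sketch below blocks all three loops so that each submatrix multiply touches a bounded amount of data. The function name matmul_tiled and the tile edge TILE (64 here) are assumptions made for this example; they are not the code or the tile size behind the numbers quoted above, and a real tile size would be tuned to the cache.

```c
#include <assert.h>

/* Hypothetical tile edge; in practice it is tuned so the tiles fit in cache. */
enum { TILE = 64 };

static int min_int(int x, int y) { return x < y ? x : y; }

/* a (n x m) += b (n x p) * c (p x m), all stored column-major. */
void matmul_tiled(float *a, const float *b, const float *c,
                  int pitch_a, int pitch_b, int pitch_c,
                  int n, int m, int p)
{
    assert(pitch_a >= n && pitch_b >= n && pitch_c >= p);
    for (int js = 0; js < m; js += TILE)
        for (int ks = 0; ks < p; ks += TILE)
            for (int is = 0; is < n; is += TILE) {
                int jmax = min_int(js + TILE, m);
                int kmax = min_int(ks + TILE, p);
                int imax = min_int(is + TILE, n);
                /* One submatrix multiply; the inner i loop stays stride-1. */
                for (int j = js; j < jmax; ++j)
                    for (int k = ks; k < kmax; ++k) {
                        float ckj = c[k + pitch_c * j];
                        for (int i = is; i < imax; ++i)
                            a[i + pitch_a * j] += b[i + pitch_b * k] * ckj;
                    }
            }
}
```

The min_int clamps let the sketch handle sizes that are not multiples of the tile edge; adding the OpenMP directives mentioned above is left out here.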

We’re going to see several matmul GPU kernels, with performance on our GPU development system, with an NVIDIA GeForce GTX 280 (1GB memory, 30 multiprocessors), using NVIDIA’s CUDA language. The host is a Linux (OpenSUSE 11.0) triple-core AMD Phenom (2.1 GHz, 500KB cache, 4GB memory), though the host hardly matters; the performance for these experiments is entirely dominated by the GPU code.

As on the CPU, performance on a GPU can be fragile; small changes to the program can make large differences in performance. It’s easy to write a slow program. This was a characteristic of High Performance Fortran, one that (my opinion) was a major cause of its downfall; while HPF made it easier to write parallel programs, it didn’t make parallel programs fast. That is the job of the HPC programmer; the same will be true for accelerators, GPUs, and even multicore CPUs.

GPUs deliver their dramatic high performance through a well-balanced, carefully managed, highly parallel architecture. Algorithms running on the GPU must be parallelized and balanced as well; this does not come for free. Program development may cost extra time and effort to understand and use the appropriate programming model, a model that may not match the simple scalar processor with cache model we are comfortable with on x64 hosts. However, the analysis and programming techniques used to develop GPU algorithms will probably help you develop multicore programs as well. A good programming model with good compilers and tools can relieve you of much busywork, but you still have to think, and you still have to understand algorithms and architecture, and you should expect no less.

From here on, I show many versions of matmul; if you’re not a programmer or want to skip the details, look for the performance tags until you reach the Summary; you don’t want to miss the conclusions. If you are a programmer and want to see all the code, you’ll find all the sources in a kernels tarfile at the PGI Web site.

In my last column, I proposed a simple matmul kernel for the GPU and focused on the host code to drive the kernel. We’ll use that simple kernel to start the discussion. What I had done is taken the matmul loop (as shown above), strip-mined the stride-1 i loop to the CUDA SIMD width of 32:

     for( int is = 0; is < n; is += 32 )
       for( int i = is; i < is+32; ++i )
         for( int j = 0; j < m; ++j )
           for( int k = 0; k < p; ++k )
             a[i+pitch_a*j] += b[i+pitch_b*k] * c[k+pitch_c*j];

run the i element loop as a thread block, and run the is strip loop and j loop in parallel:

     parfor( int is = 0; is < n; is += 32 )    /* K1 */
       parfor( int j = 0; j < m; ++j )
         SIMDfor( int i = is; i < is+32; ++i )
           for( int k = 0; k < p; ++k )
             a[i+pitch_a*j] += b[i+pitch_b*k] * c[k+pitch_c*j];

then optimized by hand just a little. The parallel (grid) loops and the SIMD (thread block) loop are handled implicitly by the GPU hardware and firmware, so they don’t appear in the kernel code. All that’s left is the body, the k loop. The final kernel in all its glory, cut-and-pasted from my CUDA source file, is:

    extern "C" __global__ void    /* K1 */
    mmkernel( float* a, float* b, float* c,
      int pitch_a, int pitch_b, int pitch_c,
      int n, int m, int p )
    {
        int i = blockIdx.x*32 + threadIdx.x;
        int j = blockIdx.y;
        float sum = 0.0;
        for( int k = 0; k < p; ++k )
          sum += b[i+pitch_b*k] * c[k+pitch_c*j];
        a[i+pitch_a*j] = sum;
    }
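
To make the implicit loops concrete, here is a host-side C sketch that emulates how the grid and thread-block indices drive the K1 body. It is only an emulation run serially on the CPU: the names k1_body and k1_emulate are hypothetical, and the CUDA built-ins blockIdx.x, blockIdx.y and threadIdx.x are replaced by ordinary function arguments.

```c
#include <assert.h>

enum { BLOCK = 32 };   /* threads per thread block, as in kernel K1 */

/* Body of K1 for one (block, thread) combination. */
static void k1_body(float *a, const float *b, const float *c,
                    int pitch_a, int pitch_b, int pitch_c, int p,
                    int block_x, int block_y, int thread_x)
{
    int i = block_x * BLOCK + thread_x;
    int j = block_y;
    float sum = 0.0f;
    for (int k = 0; k < p; ++k)
        sum += b[i + pitch_b * k] * c[k + pitch_c * j];
    a[i + pitch_a * j] = sum;
}

/* Serial stand-in for the launch over an (n/BLOCK) x m grid of blocks. */
void k1_emulate(float *a, const float *b, const float *c,
                int pitch_a, int pitch_b, int pitch_c,
                int n, int m, int p)
{
    assert(n % BLOCK == 0);                        /* K1 has no bounds tests */
    for (int bx = 0; bx < n / BLOCK; ++bx)         /* parfor is  */
        for (int by = 0; by < m; ++by)             /* parfor j   */
            for (int tx = 0; tx < BLOCK; ++tx)     /* SIMDfor i  */
                k1_body(a, b, c, pitch_a, pitch_b, pitch_c, p, bx, by, tx);
}
```

On the GPU the three outer loops disappear: the hardware supplies the index values, and every combination may run concurrently.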

Performance: 28 GFLOPs – SIMD width 32
This version runs at 28 GFLOPs on our system (on 4096×4096 matrices). In the interest of full disclosure, I compiled the kernels discussed here with NVIDIA’s NVCC compiler version 2.0 with the -O option, and compiled the driver routine with pgcc -fast; I ran each program three times and report the middle performance score, rounding the GFLOPs down to an integer value. I will generally show a GFLOPs number most directly comparable to a host matmul, including the overhead of transmitting the operand matrices to the GPU memory and the result matrix back. I will sometimes give the performance of just the matmul kernel on the GPU; while the two numbers are often quite close, the kernel-only number is useful to expose more clearly the effect of changes to the kernel program (since the overhead stays the same). I show results for 4096×4096 matrices, which is close to the peak performance for each kernel. For version K1, the host-to-host and kernel-only performance were 28 and 29 GFLOPs, respectively.
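
For readers who want to turn their own timings into comparable numbers, the following C sketch shows the arithmetic. It assumes the conventional operation count for matmul (one multiply and one add per inner-loop iteration, so 2·n·m·p operations); the function names are hypothetical. Under that assumption a 4096×4096×4096 matmul is about 137.4 billion operations, so 28 GFLOPs would correspond to roughly 4.9 seconds.

```c
#include <assert.h>

/* Rate of an n x m x p matmul in GFLOPs, counting 2*n*m*p operations. */
double matmul_gflops(int n, int m, int p, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * (double)n * (double)m * (double)p / seconds / 1.0e9;
}

/* Middle value of three measurements. */
double median3(double x, double y, double z)
{
    if ((x <= y && y <= z) || (z <= y && y <= x)) return y;
    if ((y <= x && x <= z) || (z <= x && x <= y)) return x;
    return z;
}

/* Score as described in the text: middle of three runs, rounded down. */
int reported_gflops(int n, int m, int p, double t0, double t1, double t2)
{
    double g0 = matmul_gflops(n, m, p, t0);
    double g1 = matmul_gflops(n, m, p, t1);
    double g2 = matmul_gflops(n, m, p, t2);
    return (int)median3(g0, g1, g2);   /* truncation rounds a positive rate down */
}
```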

That may sound like good performance, but we’re not nearly taking full advantage of the available parallelism. Recall the NVIDIA architecture description; the card I’m using has 30 multiprocessors, each with eight thread processors, quad-clocked to get a SIMD width of 32. The kernel above is a scalar program, but the card runs 32 copies of it in SIMD mode (or SIMT mode, to use NVIDIA’s term); the 32 copies comprise a warp. Each multiprocessor uses multithreading to support up to 32 warps (1024 scalar threads). The 32 warps can come from different thread blocks (different iterations of is or j) or from wider thread blocks (more than 32 scalar threads). There are limits in this generation of the card: each multiprocessor can support up to eight simultaneous thread blocks, and a thread block can support up to 16 warps.
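
The limits in the previous paragraph determine how many warps can be resident on a multiprocessor for a given thread-block size. The C sketch below works through that arithmetic using only the figures quoted above; it deliberately ignores register and local-memory limits, which can reduce the count further, and the name active_warps is hypothetical.

```c
#include <assert.h>

/* Limits quoted in the text for this generation of the card. */
enum {
    WARP_SIZE           = 32,
    MAX_WARPS_PER_MP    = 32,
    MAX_BLOCKS_PER_MP   = 8,
    MAX_WARPS_PER_BLOCK = 16
};

/* Warps resident on one multiprocessor for a given thread-block size,
   or -1 if the thread block exceeds the per-block warp limit. */
int active_warps(int threads_per_block)
{
    assert(threads_per_block > 0);
    int warps_per_block = (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
    if (warps_per_block > MAX_WARPS_PER_BLOCK)
        return -1;
    int blocks = MAX_WARPS_PER_MP / warps_per_block;
    if (blocks > MAX_BLOCKS_PER_MP)
        blocks = MAX_BLOCKS_PER_MP;
    return blocks * warps_per_block;
}
```

With these limits, 32 threads per block yields 8 resident warps, 64 yields 16, and both 128 and 256 reach the maximum of 32.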

Performance: 36 GFLOPs – SIMD width 64
The K1 kernel has only one warp per thread block, so with at most eight thread blocks active on each multiprocessor, only eight warps will be active out of the possible 32. I can improve that by strip-mining the i loop to a width of 64, changing the 32 to a 64 in kernel K1, and running with 64 threads in each thread block. With this version, if eight thread blocks are scheduled on each multiprocessor, we get up to 16 warps, so the multithreading is more effective. And we see a performance increase, to 36 GFLOPs (38 kernel-only).

Performance: 35 GFLOPs – SIMD width 128
So what happens if I try this trick again, doubling or quadrupling the strip size to 128 or 256? This increases the maximum number of warps per multiprocessor to 32 (which is the limit), so we might expect another bump in performance from improved multithreading. Unfortunately, we don’t; the performance drops slightly to 35 GFLOPs (36 kernel-only) in both cases. This I can’t quite explain.

Performance: 1.7 GFLOPs – SIMD width 32, non-stride-1 array accesses
Even with this simple version, I made some assumptions and optimizations, knowing something about the machine. I know that stride-1 accesses in a thread block are important, so I ran the stride-1 i loop along the thread index. Just how important is that? Suppose we switch the i and j indices, so the SIMD memory accesses run along a row, with neighboring threads a full pitch apart in memory; the performance drops from 28 GFLOPs (K1) to 1.7. We can call this kernel Ks (s for stupid, or slow).
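
One way to picture the difference is to count how many separate chunks of memory a group of 32 threads touches in one load. The C sketch below does this under stated assumptions: memory is delivered in fixed-size, aligned segments (64 bytes is used in the example purely for illustration), and the first thread's address is segment-aligned. It is a simplified model, not the card's actual coalescing rules, and the function name is hypothetical.

```c
#include <assert.h>

/* Number of distinct memory segments touched when nthreads consecutive
   threads each load one float, with `stride` elements between the
   addresses of neighboring threads. */
int segments_touched(long stride, int nthreads, long segment_bytes)
{
    assert(stride >= 0 && nthreads > 0 && segment_bytes > 0);
    int count = 0;
    long prev = -1;
    for (int t = 0; t < nthreads; ++t) {
        /* Addresses never decrease, so a new segment index means a new segment. */
        long seg = ((long)t * stride * (long)sizeof(float)) / segment_bytes;
        if (seg != prev) {
            ++count;
            prev = seg;
        }
    }
    return count;
}
```

In this model, segments_touched(1, 32, 64) is 2 for the stride-1 case, while a stride equal to a 4096-element pitch gives 32, one segment per thread.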

Performance: 5.7 GFLOPs – SIMD width 32, inverted parallel loop indices
Still, we’ve only just started. If we inspect the code for kernel K1, we note that the inner loop contains two memory fetches, for b and c; both fetches are from the device memory, which has a very high latency. In particular, the fetch for c loads the same element for all the threads in the thread block. The memory system is designed for high bandwidth when all the threads access consecutive elements, such as with the b access. This used to be called superword access in classical vector machines, where the memory returns 64 bytes (or more) at a time. Kernel K1 doesn’t take advantage of this memory design for the c access, but we can fix that. Let’s strip-mine the k loop, and load a strip of c into the multiprocessor local memory. The pseudo-code is:

     parfor( int is = 0; is < n; is += 32 )    /* K2 */
        parfor( int j = 0; j < m; ++j )
          SIMDfor( int i = is; i < is+32; ++i )
            for( int ks = 0; ks < p; ks += 32 )
              cb[ks:ks+31] = c[ks+pitch_c*j:ks+31+pitch_c*j];
              for( int k = ks; k < ks+32; ++k )
                a[i+pitch_a*j] += b[i+pitch_b*k] * cb[k-ks];

Performance: 33 GFLOPs – cached access to c, SIMD width 32
Performance: 55 GFLOPs – cached access to c, SIMD width 64
Performance: 63 GFLOPs – cached access to c, SIMD width 128
Note the vector fetch of c into the temporary array cb. Kernel K2 handles this by letting each thread fetch one element and store it into the multiprocessor local memory, so the inner loop has only one device memory fetch; the fetch of cb from the local memory is almost as fast as a register access. Performance improves to 33 GFLOPs, up from 28. We can again increase the number of threads per block from 32 to 64 and 128, and we see performance improve from 33 to 55 and 63 GFLOPs. As with kernel K1, increasing to 256 threads per block does not improve performance. An implementation detail: with more than one warp per thread block, we need to synchronize the warps after loading the temp array cb, and before reloading it the next time around the ks loop; see the CUDA source code for this detail.
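
The cooperative load can be hard to see in pseudo-code, so here is a serial C emulation of the K2 scheme at width 32. It is a sketch under stated assumptions, not the CUDA source: the name k2_emulate is hypothetical, the local memory is modeled as an ordinary array, each thread's running sum is one element of a small array, and the comments mark where a multi-warp thread block would need its barriers.

```c
#include <assert.h>

enum { K2_STRIP = 32 };   /* strip length of the k loop and thread-block width */

void k2_emulate(float *a, const float *b, const float *c,
                int pitch_a, int pitch_b, int pitch_c,
                int n, int m, int p)
{
    assert(n % K2_STRIP == 0 && p % K2_STRIP == 0);
    for (int is = 0; is < n; is += K2_STRIP)
        for (int j = 0; j < m; ++j) {              /* one thread block per (is, j) */
            float cb[K2_STRIP];                    /* stand-in for local memory */
            float sum[K2_STRIP] = {0.0f};          /* one accumulator per thread */
            for (int ks = 0; ks < p; ks += K2_STRIP) {
                /* Each thread fetches one element of the strip of c. */
                for (int tx = 0; tx < K2_STRIP; ++tx)
                    cb[tx] = c[ks + tx + pitch_c * j];
                /* GPU: barrier here if the block has more than one warp. */
                for (int tx = 0; tx < K2_STRIP; ++tx)
                    for (int k = ks; k < ks + K2_STRIP; ++k)
                        sum[tx] += b[is + tx + pitch_b * k] * cb[k - ks];
                /* GPU: barrier again before cb is reloaded. */
            }
            for (int tx = 0; tx < K2_STRIP; ++tx)
                a[is + tx + pitch_a * j] = sum[tx];
        }
}
```

Only the b fetch in the innermost loop would go to device memory on the GPU; every read of cb is served from the per-block copy.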

Performance: 63 GFLOPs – cached access to c, SIMD width 64, unroll inner loop
But we’re not done yet. What if we unroll the inner loop, to reduce the loop overhead? We might unroll to a factor of 2 or 4 or even 16. Unrolling the inner loop once in the 64-wide K2 kernel does improve performance, getting 63 GFLOPs, but more unrolling doesn’t help, and it doesn’t help the 128-wide kernel.
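
For readers unfamiliar with the transformation, unrolling the inner loop once means doing two iterations' worth of work per trip, halving the number of loop tests and branches. A minimal C sketch follows, assuming an even trip count; the function name is hypothetical, and b is taken to point at the first element of the strip for one thread.

```c
#include <assert.h>

/* Inner product of a strip of b (elements pitch_b apart) with the cached
   strip cb, with the loop unrolled once: two multiply-adds per trip. */
float strip_dot_unrolled(const float *b, int pitch_b, const float *cb, int len)
{
    assert(len % 2 == 0);
    float sum = 0.0f;
    for (int k = 0; k < len; k += 2) {
        sum += b[pitch_b * k] * cb[k];
        sum += b[pitch_b * (k + 1)] * cb[k + 1];
    }
    return sum;
}
```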

So far we’ve got two kernel versions, with variations in the thread-block (vector) size and unrolling. And we’ve only just begun. We tried unrolling the inner k loop; what if we try unrolling one of the outer loops? We could let each kernel instance compute two values of the i loop. The pseudo-code looks like:

      parfor( int is = 0; is < n; is += 64 )    /* K3 */
        parfor( int j = 0; j < m; ++j )
          SIMDfor( int i = is; i < is+32; ++i )
            for( int ks = 0; ks < p; ks += 32 )
              cb[ks:ks+31] = c[ks+pitch_c*j:ks+31+pitch_c*j];
              for( int k = ks; k < ks+32; ++k )
                a[i+pitch_a*j] += b[i+pitch_b*k] * cb[k-ks];
                a[i+32+pitch_a*j] += b[i+32+pitch_b*k] * cb[k-ks];

Performance: 53 GFLOPs – cached access to c, SIMD width 32, unroll i loop
Performance: 63 GFLOPs – cached access to c, SIMD width 32, unroll i loop 3x
Each iteration of the i loop now computes values for i and i+32. We don’t expect much advantage here, since the only values shared between the two i iterations are loaded from the local memory, which is already pretty fast. But even this kernel improves upon K2, with 53 GFLOPs. We can improve this to 63 GFLOPs by unrolling more or increasing the SIMD width to 64.

Next, we can try unrolling the j loop, so each kernel computes values for j and j+1. The pseudo-code is:

      parfor( int is = 0; is < n; is += 32 )    /* K4 */
        parfor( int j = 0; j < m; j += 2 )
          SIMDfor( int i = is; i < is+32; ++i )
            for( int ks = 0; ks < p; ks += 32 )
              cb0[ks:ks+31] = c[ks+pitch_c*j:ks+31+pitch_c*j];
              cb1[ks:ks+31] = c[ks+pitch_c*(j+1):ks+31+pitch_c*(j+1)];
              for( int k = ks; k < ks+32; ++k )
                a[i+pitch_a*j] += b[i+pitch_b*k] * cb0[k-ks];
                a[i+pitch_a*(j+1)] += b[i+pitch_b*k] * cb1[k-ks];

Performance: 59 GFLOPs – cached access to c, SIMD width 32, unroll j loop
Performance: 98 GFLOPs – cached access to c, SIMD width 64, unroll j loop
Performance: 117 GFLOPs – cached access to c, SIMD width 128, unroll j loop
Here, we note the two assignments in the k loop fetch the same value of b from the device memory. This version gets 59 GFLOPs; it jumps to 98 GFLOPs when we increase the SIMD width to 64, and again to 117 GFLOPs with a SIMD width of 128. Now we’re starting to see real performance, over 100 GFLOPs, host-to-host.

Performance: 176 GFLOPs – cached access to c, SIMD width 128, unroll j loop 3x
But we’re not done yet. What if we unroll the j loop by four iterations instead of just two? This involves keeping four partial sums. Now the performance with SIMD width 128 is 176 GFLOPs host-to-host, and over 210 GFLOPs on the device.
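
The benefit of the four partial sums is that each value of b fetched from device memory is reused four times. The serial C emulation below sketches that structure at width 32; it is an illustration with hypothetical names (k4x4_emulate, rb), not the CUDA kernel that produced these numbers.

```c
#include <assert.h>

enum { K4_STRIP = 32, K4_JUNROLL = 4 };

void k4x4_emulate(float *a, const float *b, const float *c,
                  int pitch_a, int pitch_b, int pitch_c,
                  int n, int m, int p)
{
    assert(n % K4_STRIP == 0 && p % K4_STRIP == 0 && m % K4_JUNROLL == 0);
    for (int is = 0; is < n; is += K4_STRIP)
        for (int j = 0; j < m; j += K4_JUNROLL) {
            float cb[K4_JUNROLL][K4_STRIP];                 /* four cached strips of c */
            float sum[K4_STRIP][K4_JUNROLL] = {{0.0f}};     /* four sums per thread */
            for (int ks = 0; ks < p; ks += K4_STRIP) {
                for (int u = 0; u < K4_JUNROLL; ++u)
                    for (int tx = 0; tx < K4_STRIP; ++tx)
                        cb[u][tx] = c[ks + tx + pitch_c * (j + u)];
                for (int tx = 0; tx < K4_STRIP; ++tx)
                    for (int k = ks; k < ks + K4_STRIP; ++k) {
                        float rb = b[is + tx + pitch_b * k];  /* fetched once */
                        sum[tx][0] += rb * cb[0][k - ks];     /* used four times */
                        sum[tx][1] += rb * cb[1][k - ks];
                        sum[tx][2] += rb * cb[2][k - ks];
                        sum[tx][3] += rb * cb[3][k - ks];
                    }
            }
            for (int tx = 0; tx < K4_STRIP; ++tx)
                for (int u = 0; u < K4_JUNROLL; ++u)
                    a[is + tx + pitch_a * (j + u)] = sum[tx][u];
        }
}
```

In this form the inner loop performs eight floating-point operations per device-memory fetch, compared with two when only one column of the result is computed at a time.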

Performance: 202 GFLOPs – cached access to c, SIMD width 128, unroll j loop 3x and k loop 1x
Performance: 208 GFLOPs – cached access to c, SIMD width 128, unroll j and k loops 3x
More unrolling of the j loop doesn’t improve performance, but what if we combine this with unrolling the k loop? If we unroll the j loop 3 times and the k loop once, with SIMD width of 128, we get 202 GFLOPs; unrolling the k loop 3 times gives us 208 GFLOPs (host-to-host), and 265 GFLOPs (kernel-only).

Our peak performance so far looks pretty good. It took some experimentation, but we have a version that uses only 128 threads and 2KB local memory per thread block, allowing up to 8 thread blocks on each multiprocessor, so taking great advantage of the multithreading properties of the machine. We haven’t even fully explored all the combinations. What if we combine i loop unrolling with the j and k loop unrolling? Should we explore other unroll factors as we combine unrolling multiple loops? What if we use pointer arithmetic instead of array references (this really questions whether NVCC optimizes the array references, but it seems to do a good job there)? I desperately wanted to break the 200 GFLOP barrier, and reached it. The version of sgemm that comes with CUDA BLAS gets about 260 GFLOPs (host-to-host) on a 4096×4096 matrix; I’ve still got some work to do to get that extra 25%.

When we optimize a matmul for a general purpose CPU with a cache, we’ve learned that we need a tiled algorithm. We can do the same thing on the GPU, where we fit the submatrices in the local memory. The pseudo code is:

      parfor( int is = 0; is < n; is += 16 )    /* K5 */
        parfor( int js = 0; js < m; js += 16 )
          SIMDfor( int i = 0; i < 16; ++i )
            SIMDfor( int j = 0; j < 16; ++j )
              at[i][j] = 0.0;                /* init A tile */
              for( int ks = 0; ks < p; ks += 16 )
/*load B tile*/ bt[i][ks:ks+15] = b[is+i+pitch_b*ks:is+i+pitch_b*(ks+15):pitch_b];
/*load C tile*/ ct[ks:ks+15][j] = c[ks+pitch_c*(js+j):(ks+15)+pitch_c*(js+j)];
                for( int k = ks; k < ks+16; ++k )
/*tile MM */      at[i][j] += bt[i][k]*ct[k][j];
              a[is+i+pitch_a*(js+j)] = at[i][j];  /* store A tile */
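
The tiled scheme can be emulated serially in C, as sketched below. This is an illustration under stated assumptions rather than the actual kernel: the name k5_emulate is hypothetical, the tiles bt and ct are indexed from zero within each tile, the local memory is modeled with ordinary arrays, and all three matrix dimensions are assumed to be multiples of 16.

```c
#include <assert.h>

enum { K5_TILE = 16 };

void k5_emulate(float *a, const float *b, const float *c,
                int pitch_a, int pitch_b, int pitch_c,
                int n, int m, int p)
{
    assert(n % K5_TILE == 0 && m % K5_TILE == 0 && p % K5_TILE == 0);
    for (int is = 0; is < n; is += K5_TILE)
        for (int js = 0; js < m; js += K5_TILE) {   /* one thread block per tile */
            float at[K5_TILE][K5_TILE] = {{0.0f}};  /* one accumulator per thread */
            float bt[K5_TILE][K5_TILE];             /* stand-ins for local memory */
            float ct[K5_TILE][K5_TILE];
            for (int ks = 0; ks < p; ks += K5_TILE) {
                /* Thread (i, j) loads one element of each operand tile. */
                for (int i = 0; i < K5_TILE; ++i)
                    for (int j = 0; j < K5_TILE; ++j) {
                        bt[i][j] = b[(is + i) + pitch_b * (ks + j)];
                        ct[i][j] = c[(ks + i) + pitch_c * (js + j)];
                    }
                /* GPU: barrier here, then the tile multiply. */
                for (int i = 0; i < K5_TILE; ++i)
                    for (int j = 0; j < K5_TILE; ++j)
                        for (int k = 0; k < K5_TILE; ++k)
                            at[i][j] += bt[i][k] * ct[k][j];
                /* GPU: barrier again before the tiles are reloaded. */
            }
            for (int i = 0; i < K5_TILE; ++i)
                for (int j = 0; j < K5_TILE; ++j)
                    a[(is + i) + pitch_a * (js + j)] = at[i][j];
        }
}
```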

Performance: 164 GFLOPs – tiled loops, SIMD width 16×16, cached access to b and c
We have to choose a tile size, and square tiles seem to make as much sense as any other shape, at least to start with. We choose 16×16 tiles and run 256 threads in a thread group, so each thread will compute one element of the a tile; this lets us keep that element in a register. The actual kernel code is slightly more complex than the previous kernels. It’s important to recall that this scalar kernel is one of a thread group or cohort of 256 cooperating instances, and it only works in that domain. This version gives us 164 GFLOPs, not quite as good as we’ve already seen. Why not? One reason is the thread group is 256 threads, so we hit the 1024 threads/multiprocessor limit with only four thread groups. We can address that as well, but I still haven’t quite reached the peak performance shown on kernel K4. The CUDA BLAS (260 GFLOPs) sgemm is similar to this tiled version. It’s based on work by Vasily Volkov, a Computer Science PhD student at Cal; Vasily’s code uses a 16×4 thread block with the i loop unrolled by 16, the j loop by 4, and the k loop by 16 (if I read it right). The code might be hard to follow, but it sure beats trying to code a matmul in DirectX.

So let’s suppose we’ve decided on the K4 algorithm. It assumes that the matrix sizes are multiples of 32 (or 64 or 128) in all dimensions, though it doesn’t require the matrices to be square. One way to satisfy this is to pad all your matrices, filling in zeros in the extra rows and columns. Matrix addition and multiplication will preserve these zeros and will not pollute the actual values; this may be your best option. Another solution is to add conditionals so as to not run off the ends of the matrices. This complicates the code and can affect performance. The simplest method to test for array limits is to put conditionals around the device memory fetch and store operations; if we fill in zeros to the b and c tiles, the innermost loop won’t need any tests. I reproduce the body of the kernel here:

        float sum0 = 0.0, sum1 = 0.0;
        for( int ks = 0; ks < p; ks += 32 ){
          if( ks+tx < p && j < m )
            cb0[tx] = c[ks+tx+pitch_c*j];
          else
            cb0[tx] = 0.0;
          if( ks+tx < p && j+1 < m )
            cb1[tx] = c[ks+tx+pitch_c*(j+1)];
          else
            cb1[tx] = 0.0;
          __syncthreads();
          if( i < n ){
            for( int k = ks; k < ((ks+32 < p) ? ks+32 : p); ++k ){
              float rb = b[i+pitch_b*k];
              sum0 += rb * cb0[k-ks];
              sum1 += rb * cb1[k-ks];
            }
          }
          __syncthreads();
        }
        if( i < n && j < m )
          a[i+pitch_a*j] = sum0;
        if( i < n && j+1 < m )
          a[i+pitch_a*(j+1)] = sum1;

Even if i and j are outside the matrix bounds, we can’t just skip the body of the loop for two reasons. First, each thread is part of a thread group, and as such it loads part of the data into the local temporary arrays cb0 and cb1; even if this thread has nothing to compute, it has to do its part of the shared work. Second, we have those pesky barrier synchronizations; all threads in a thread group must participate in the barrier, so even if this thread has no work to do, it had better reach those barriers.

These tests cost about 5% in performance, in the simplest version of K4. It’s less costly in the more complex versions, but the code gets messy when mixed with some of the unrolling. But it will work with any matrix size, whereas K4 requires the size to be a multiple of 32.
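
For the padding alternative mentioned earlier, the host-side work amounts to rounding each dimension up and copying into a zero-filled buffer. The C sketch below shows one way to do that for a column-major matrix; the names padded_dim and pad_matrix are hypothetical, and the sketch assumes, as is true on IEEE systems, that the all-zero bytes returned by calloc represent 0.0f.

```c
#include <assert.h>
#include <stdlib.h>

/* Smallest multiple of `multiple` that is greater than or equal to n. */
int padded_dim(int n, int multiple)
{
    assert(n >= 0 && multiple > 0);
    return (n + multiple - 1) / multiple * multiple;
}

/* Copies a rows x cols column-major matrix into a newly allocated,
   zero-filled matrix with both dimensions rounded up to `multiple`.
   The padded row count, which is the new pitch, is stored in *new_pitch.
   Returns NULL if allocation fails; the caller frees the result. */
float *pad_matrix(const float *src, int pitch, int rows, int cols,
                  int multiple, int *new_pitch)
{
    int prows = padded_dim(rows, multiple);
    int pcols = padded_dim(cols, multiple);
    float *dst = calloc((size_t)prows * (size_t)pcols, sizeof *dst);
    if (dst == NULL)
        return NULL;
    for (int j = 0; j < cols; ++j)
        for (int i = 0; i < rows; ++i)
            dst[i + prows * j] = src[i + pitch * j];
    *new_pitch = prows;
    return dst;
}
```

For example, padded_dim(100, 32) is 128, so a 100×100 matrix would be copied into a 128×128 buffer with a pitch of 128 before being sent to the GPU.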

Of course, if you need to deliver a library that works regardless of the matrix sizes, you have another option. You can create two versions of your routine, a faster one that works when the matrix sizes are appropriate multiples of 16, and a slower, general purpose one that works for other matrix sizes, with a conditional test to execute the right one. Then you get your good benchmark numbers (all benchmarks use large powers of two, right?), and you get right answers, too.

Summary

The point I’ve tried to make is how sensitive the performance of the GPU is to the formulation of your kernel, and how much and what kind of experimentation you’ll need to do to optimize your performance. How much of the optimization process will carry over from one GPU to another, or from one generation to the next from the same vendor? Many programmers like this low level of control, and it certainly could be appropriate when developing a numerical library, in the same way assembly language is appropriate.

To be fair, the same is true on your CPU as well; you need to optimize your matmul for (packed) vector operations, memory strides, and cache locality. A bad program will run several times slower than a good one; ordering the matmul loops so the inner loop is non-stride-1 reduces the performance on large matrices (on our Penryn) by more than a factor of 10. But compilers and tools are far more mature and helpful when compiling for an x64, IBM POWER, Sun SPARC, or other CPU.

I’m sure many readers would like to tell me (again) that I should be using the prebuilt library version of sgemm for matmul, not writing my own. Save your breath. Matmul is just one simple example here, three loops, three matrices, lots of parallelism, and yet I put in several days of work to get this seven line loop optimized for the GPU.

We can compare the evolution of GPU programming to the evolution of shared-memory parallel programming. There were many new languages designed to take advantage of parallelism (Id, SISAL, many others). Many low-level libraries were written to create and manage threads, eventually standardizing on Posix threads (aka pthreads). Much work was done on automatic parallelization, dating back to the 1960s and 1970s. When successful commercial shared-memory multiprocessors became more widely available in the 1980s, an effort began to standardize a less intrusive programming interface for multiple processors, driven by multiprocessor workstations, eventually resulting in the OpenMP API, which defines directives and a runtime interface to a shared-memory parallel programming model.

GPUs have their own set of domain-specific languages, including GLSL (OpenGL Shading Language), HLSL (high level shader language) from Microsoft for DirectX, and Cg (C for graphics) from NVIDIA. We’re now in a period with development of low-level libraries and interfaces to create and manage GPU threads; the OpenCL effort aims to standardize this. The cost to port a nontrivial application to this model is high, though the potential performance is alluring.

Luckily for me, my application (the compiler) runs on the host, and I don’t have to port that. But what about the real application programmer, who has thousands (or hundreds of thousands) of lines of code? Is it feasible to take GPU or accelerator programming concepts, abstract them into a predictable and useful programming model, and present them using a portable programming interface, in the same way that OpenMP abstracts and presents multiprocessor and multicore systems? That’s a topic for my next column.

—–

Michael Wolfe has developed compilers for over 30 years in both academia and industry, and is now a senior compiler engineer at The Portland Group, Inc. (www.pgroup.com), a wholly-owned subsidiary of STMicroelectronics, Inc. The opinions stated here are those of the author, and do not represent opinions of The Portland Group, Inc. or STMicroelectronics, Inc.
