Compilers and More: Optimizing GPU Kernels

By Michael Wolfe

October 30, 2008

My last column discussed some of the complexities of programming GPUs today, focusing on how to interface the host program with the GPU. Here we focus on programming the GPU itself. As with last time, we’ll look at a simple single-precision matrix multiplication, equivalent to the BLAS SGEMM routine.

Matmul is a highly parallel algorithm, but let me emphasize that parallelism does not equate to performance. We need to carefully sculpt our algorithm to match the parallelism available in the architecture in order to reap the benefits. This is true whether we are targeting a GPU, a multicore x64, or even a single core with packed SSE operations. As an example, I took the simple matmul loop (in C, but with the matrices stored column-major):

     for( int j = 0; j < m; ++j )
         for( int k = 0; k < p; ++k )
             for( int i = 0; i < n; ++i )
                 a[i+pitch_a*j] += b[i+pitch_b*k] * c[k+pitch_c*j];

modified it several ways and ran it on an Intel Xeon (3GHz, 6MB cache, 16GB memory, Penryn) using 4096×4096 matrices (to compare with results we’ll see below). With the loop in the order shown (stride-1 inner loop), the program ran at 1.7 GFLOPs; this is compiled C performance (using pgcc -fast). We can improve that by tiling or blocking the loops, organizing the matmul as a bunch of submatrix multiplications, sized so each submatrix matmul fits in the processor cache. This improves performance to 5.7 GFLOPs, and it jumps to over 22 GFLOPs when we use OpenMP directives and run on all four cores. Advanced compilers help by automatically managing vectorization, unrolling, memory alignment, prefetch instructions, and so forth.
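
For concreteness, here is a minimal sketch of that kind of tiled, OpenMP-parallel loop nest; the 128×128 tile size and the function name are illustrative assumptions for the sketch, not the settings used for the numbers quoted above.

     /* Illustrative tiled matmul (column-major, as above) with OpenMP.
        Tile size is an assumption, not the measured configuration. */
     #define TILE 128
     void matmul_tiled( float* a, float* b, float* c,
                        int pitch_a, int pitch_b, int pitch_c,
                        int n, int m, int p )
     {
         #pragma omp parallel for
         for( int js = 0; js < m; js += TILE )
             for( int ks = 0; ks < p; ks += TILE )
                 for( int j = js; j < (js+TILE < m ? js+TILE : m); ++j )
                     for( int k = ks; k < (ks+TILE < p ? ks+TILE : p); ++k )
                         for( int i = 0; i < n; ++i )    /* stride-1 inner loop */
                             a[i+pitch_a*j] += b[i+pitch_b*k] * c[k+pitch_c*j];
     }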

We’re going to see several matmul GPU kernels, with performance on our GPU development system, with an NVIDIA GeForce GTX 280 (1GB memory, 30 multiprocessors), using NVIDIA’s CUDA language. The host is a Linux (OpenSUSE 11.0) triple-core AMD Phenom (2.1 GHz, 500KB cache, 4GB memory), though the host hardly matters; the performance for these experiments is entirely dominated by the GPU code.

As on the CPU, performance on a GPU can be fragile; small changes to the program can make large differences in performance. It’s easy to write a slow program. This was a characteristic of High Performance Fortran, one that (my opinion) was a major cause of its downfall; while HPF made it easier to write parallel programs, it didn’t make parallel programs fast. That is the job of the HPC programmer; the same will be true for accelerators, GPUs, and even multicore CPUs.

GPUs deliver their dramatic high performance through a well-balanced, carefully managed, highly parallel architecture. Algorithms running on the GPU must be parallelized and balanced as well; this does not come for free. Program development may cost extra time and effort to understand and use the appropriate programming model, a model that may not match the simple scalar processor with cache model we are comfortable with on x64 hosts. However, the analysis and programming techniques used to develop GPU algorithms will probably help you develop multicore programs as well. A good programming model with good compilers and tools can relieve you of much busywork, but you still have to think, and you still have to understand algorithms and architecture, and you should expect no less.

From here on below, I show many versions of matmul; if you’re not a programmer or want to skip over the details, look for the performance tags below, until the Summary; you don’t want to miss the conclusions. If you are a programmer and want to see all the code, you’ll find all the sources in a kernels tarfile at the PGI Web site.

In my last column, I proposed a simple matmul kernel for the GPU and focused on the host code to drive the kernel. We’ll use that simple kernel to start the discussion. What I had done was take the matmul loop (as shown above) and strip-mine the stride-1 i loop to the CUDA SIMD width of 32:

     for( int is = 0; is < n; is += 32 )
         for( int i = is; i < is+32; ++i )
             for( int j = 0; j < m; ++j )
                 for( int k = 0; k < p; ++k )
                     a[i+pitch_a*j] += b[i+pitch_b*k] * c[k+pitch_c*j];

run the i element loop as a thread block, and run the is strip loop and j loop in parallel:

     parfor( int is = 0; is < n; is += 32 )    /* K1 */
         parfor( int j = 0; j < m; ++j )
             SIMDfor( int i = is; i < is+32; ++i )
                 for( int k = 0; k < p; ++k )
                     a[i+pitch_a*j] += b[i+pitch_b*k] * c[k+pitch_c*j];

then optimized by hand just a little. The parallel (grid) loops and the SIMD (thread block) loop are handled implicitly by the GPU hardware and firmware, so they don’t appear in the kernel code. All that’s left is the body, the k loop. The final kernel in all its glory, cut-and-pasted from my CUDA source file, is:

    extern "C" __global__ void    /* K1 */
mmkernel( float* a, float* b, float* c,
int pitch_a, int pitch_b, int pitch_c,
int n, int m, int p )
{
int i = blockIdx.x*32 + threadIdx.x;
int j = blockIdx.y;
float sum = 0.0;
for( int k = 0; k < p; ++k )
sum += b[i+pitch_b*k] * c[k+pitch_c*j];
a[i+pitch_a*j] = sum;
}

Performance: 28 GFLOPs – SIMD width 32
This version runs at 28 GFLOPs on our system (on 4096×4096 matrices). In the interest of full disclosure, I compiled the kernels discussed here with NVIDIA’s NVCC compiler version 2.0 with the -O option, and compiled the driver routine with pgcc -fast; I ran each program three times and report the middle performance score, rounding the GFLOPs down to an integer value. I will generally show a GFLOPs number most directly comparable to a host matmul, including the overhead of transmitting the operand matrices to the GPU memory and the result matrix back. I will sometimes give the performance of just the matmul kernel on the GPU; while the two numbers are often quite close, the kernel-only number is useful to expose more clearly the effect of changes to the kernel program (since the overhead stays the same). I show results for 4096×4096 matrices, which is close to the peak performance for each kernel. For version K1, the host-to-host and kernel-only performance were 28 and 29 GFLOPs, respectively.
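
For reference, the GFLOPs figures count two floating-point operations (a multiply and an add) per inner-loop iteration; the formula below is my reconstruction of that arithmetic, not code from the article’s test harness.

     /* 2*n*m*p flops for an n x p times p x m matmul, divided by elapsed time.
        "seconds" is the host-to-host or kernel-only time, as noted in the text. */
     double gflops = 2.0 * (double)n * (double)m * (double)p / ( seconds * 1.0e9 );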

That may sound like good performance, but we’re not nearly taking full advantage of the available parallelism. Recall the NVIDIA architecture description; the card I’m using has 30 multiprocessors, each with eight thread processors, quad-clocked to get a SIMD width of 32. The kernel above is a scalar program, but the card runs 32 copies of it in SIMD mode (or SIMT mode, to use NVIDIA’s term); the 32 copies comprise a warp. Each multiprocessor uses multithreading to support up to 32 warps (1024 scalar threads). The 32 warps can come from different thread blocks (different iterations of is or j) or from wider thread blocks (more than 32 scalar threads). There are limits in this generation of the card: each multiprocessor can support up to eight simultaneous thread blocks, and a thread block can support up to 16 warps.

Performance: 36 GFLOPs – SIMD width 64
The K1 kernel has only one warp per thread block, so at most eight thread blocks will be active on each multiprocessor, out of the possible 32. I can improve that by strip-mining the i loop to a width of 64, changing the 32 to a 64 in kernel K1, and running with 64 threads in each thread block. With this version, if eight thread blocks are scheduled on each multiprocessor, we get up to 16 warps, so the multithreading is more effective. And we see a performance increase, to 36 GFLOPs (38 kernel-only).
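
The change is mechanical; a sketch of the 64-wide variant and its launch follows. The kernel name and the host-side snippet are my reconstruction, not code from the article.

     extern "C" __global__ void    /* K1, 64-wide variant (reconstruction) */
     mmkernel64( float* a, float* b, float* c,
                 int pitch_a, int pitch_b, int pitch_c,
                 int n, int m, int p )
     {
         int i = blockIdx.x*64 + threadIdx.x;   /* 64 threads per block now */
         int j = blockIdx.y;
         float sum = 0.0f;
         for( int k = 0; k < p; ++k )
             sum += b[i+pitch_b*k] * c[k+pitch_c*j];
         a[i+pitch_a*j] = sum;
     }

     /* host-side launch: one 64-thread block per 64-wide strip of i, one per column j */
     // dim3 threads( 64 );
     // dim3 grid( n/64, m );
     // mmkernel64<<< grid, threads >>>( da, db, dc, pitch_a, pitch_b, pitch_c, n, m, p );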

Performance: 35 GFLOPs – SIMD width 128
So what happens if I try this trick again, doubling or quadrupling the strip size to 128 or 256? This increases the maximum number of warps per multiprocessor to 32 (which is the limit), so we might expect another bump in performance from improved multithreading. Unfortunately, we don’t; the performance drops slightly to 35 GFLOPs (36 kernel-only) in both cases. This I can’t quite explain.

Performance: 1.7 GFLOPs – SIMD width 32, non-stride-1 array accesses
Even with this simple version, I made some assumptions and optimizations, knowing something about the machine. I know that stride-1 accesses in a thread block are important, so I ran the stride-1 i loop along the thread index. Just how important is that? Suppose we switch the i and j indices, so the SIMD memory accesses within a warp are no longer stride-1; the performance drops from 28 GFLOPs (K1) to 1.7. We can call this kernel Ks (s for stupid, or slow).
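
For the curious, Ks is just K1 with the roles of the two parallel indices swapped, so that consecutive threads in a warp touch addresses a full column (one pitch) apart; a sketch, again my reconstruction:

     extern "C" __global__ void    /* Ks -- deliberately slow: non-stride-1 warp accesses */
     mmkernel_s( float* a, float* b, float* c,
                 int pitch_a, int pitch_b, int pitch_c,
                 int n, int m, int p )
     {
         int j = blockIdx.x*32 + threadIdx.x;   /* threads now walk along j ...        */
         int i = blockIdx.y;                    /* ... so accesses to c and a are strided */
         float sum = 0.0f;
         for( int k = 0; k < p; ++k )
             sum += b[i+pitch_b*k] * c[k+pitch_c*j];
         a[i+pitch_a*j] = sum;
     }
     /* launched as mmkernel_s<<< dim3(m/32,n), 32 >>>( ... ) */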

Performance: 5.7 GFLOPs – SIMD width 32, inverted parallel loop indices
Still, we’ve only just started. If we inspect the code for kernel K1, we note that the inner loop contains two memory fetches, for b and c; both fetches are from the device memory, which has a very high latency. In particular, the fetch for c loads the same element for all the threads in the thread block. The memory system is designed for high bandwidth when all the threads access consecutive elements, such as with the b access. This used to be called superword access in classical vector machines, where the memory returns 64 bytes (or more) at a time. Kernel K1 doesn’t take advantage of this memory design for the c access, but we can fix that. Let’s strip-mine the k loop, and load a strip of c into the multiprocessor local memory. The pseudo code is:

     parfor( int is = 0; is < n; is += 32 )    /* K2 */
         parfor( int j = 0; j < m; ++j )
             SIMDfor( int i = is; i < is+32; ++i )
                 for( int ks = 0; ks < p; ks += 32 )
                     cb[ks:ks+31] = c[ks+pitch_c*j:ks+31+pitch_c*j];
                     for( int k = ks; k < ks+32; ++k )
                         a[i+pitch_a*j] += b[i+pitch_b*k] * cb[k-ks];

Performance: 33 GFLOPs – cached access to c, SIMD width 32
Performance: 55 GFLOPs – cached access to c, SIMD width 64
Performance: 63 GFLOPs – cached access to c, SIMD width 128
Note the vector fetch of c into the temporary array cb. This is handled in kernel K2 by letting each thread fetch one element and store it into the multiprocessor local memory, so the inner loop only has one device memory fetch; the fetch of cb from the local memory is almost as fast as a register access. The performance improves to 33 GFLOPs, up from 28. We can again increase the number of threads per block from 32 to 64 and 128, and we see performance improve from 33 to 55 and 63 GFLOPs. As with kernel K1, increasing to 256 threads per block does not improve performance. An implementation detail: with more than one warp per thread block, we need to synchronize the warps after loading the temp array cb, and before reloading it the next time around the ks loop; see the CUDA source code for this detail.
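
Written out as CUDA, the 32-wide version of K2 might look like the sketch below; this is my reconstruction, and as noted the __syncthreads() calls only become necessary once the block is wider than a single warp.

     extern "C" __global__ void    /* K2, 32-wide (reconstruction of the kernel described above) */
     mmkernel2( float* a, float* b, float* c,
                int pitch_a, int pitch_b, int pitch_c,
                int n, int m, int p )
     {
         __shared__ float cb[32];               /* strip of c in multiprocessor local memory */
         int tx = threadIdx.x;
         int i = blockIdx.x*32 + tx;
         int j = blockIdx.y;
         float sum = 0.0f;
         for( int ks = 0; ks < p; ks += 32 ){
             cb[tx] = c[ks+tx+pitch_c*j];       /* each thread loads one element of the strip */
             __syncthreads();                   /* strip must be complete before anyone uses it */
             for( int k = ks; k < ks+32; ++k )
                 sum += b[i+pitch_b*k] * cb[k-ks];
             __syncthreads();                   /* don't overwrite cb while others still read it */
         }
         a[i+pitch_a*j] = sum;
     }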

Performance: 63 GFLOPs – cached access to c, SIMD width 64, unroll inner loop
But we’re not done yet. What if we unroll the inner loop, to reduce the loop overhead? We might unroll to a factor of 2 or 4 or even 16. Unrolling the inner loop once in the 64-wide K2 kernel does improve performance, getting 63 GFLOPs, but more unrolling doesn’t help, and it doesn’t help the 128-wide kernel.
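
As a concrete illustration, unrolling the k loop once (a factor of 2) in the K2 body amounts to something like this sketch, assuming p is a multiple of 32:

     /* inner loops of K2 with the k loop unrolled once (factor of 2) */
     for( int ks = 0; ks < p; ks += 32 ){
         cb[tx] = c[ks+tx+pitch_c*j];
         __syncthreads();
         for( int k = ks; k < ks+32; k += 2 ){
             sum += b[i+pitch_b*k]     * cb[k-ks];
             sum += b[i+pitch_b*(k+1)] * cb[k-ks+1];
         }
         __syncthreads();
     }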

So far we’ve got two kernel versions, with variations in the thread-block (vector) size and unrolling. And we’ve only just begun. We tried unrolling the inner k loop; what if we try unrolling one of the outer loops? We could let each kernel instance compute two values of the i loop. The pseudo-code looks like:

     parfor( int is = 0; is < n; is += 64 )    /* K3 */
         parfor( int j = 0; j < m; ++j )
             SIMDfor( int i = is; i < is+32; ++i )
                 for( int ks = 0; ks < p; ks += 32 )
                     cb[ks:ks+31] = c[ks+pitch_c*j:ks+31+pitch_c*j];
                     for( int k = ks; k < ks+32; ++k )
                         a[i+pitch_a*j] += b[i+pitch_b*k] * cb[k-ks];
                         a[i+32+pitch_a*j] += b[i+32+pitch_b*k] * cb[k-ks];

Performance: 53 GFLOPs – cached access to c, SIMD width 32, unroll i loop
Performance: 63 GFLOPs – cached access to c, SIMD width 32, unroll i loop 3x
Each iteration of the i loop now computes values for i and i+32. We don’t expect much advantage here, since the only values shared between the two i iterations are loaded from the local memory, which is already pretty fast. But even this kernel improves upon K2, with 53 GFLOPs. We can improve this to 63 GFLOPs by unrolling more or increasing the SIMD width to 64.
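
As CUDA, the K3 idea might be coded roughly as follows; again this is my sketch rather than the article’s source.

     extern "C" __global__ void    /* K3, i loop unrolled by 2 (reconstruction) */
     mmkernel3( float* a, float* b, float* c,
                int pitch_a, int pitch_b, int pitch_c,
                int n, int m, int p )
     {
         __shared__ float cb[32];
         int tx = threadIdx.x;
         int i = blockIdx.x*64 + tx;            /* each block now covers a 64-wide strip of i */
         int j = blockIdx.y;
         float sum0 = 0.0f, sum1 = 0.0f;
         for( int ks = 0; ks < p; ks += 32 ){
             cb[tx] = c[ks+tx+pitch_c*j];
             __syncthreads();
             for( int k = ks; k < ks+32; ++k ){
                 sum0 += b[i+pitch_b*k]    * cb[k-ks];
                 sum1 += b[i+32+pitch_b*k] * cb[k-ks];
             }
             __syncthreads();
         }
         a[i+pitch_a*j]    = sum0;
         a[i+32+pitch_a*j] = sum1;
     }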

Next, we can try unrolling the j loop, so each kernel computes values for j and j+1. The pseudo-code is:

     parfor( int is = 0; is < n; is += 32 )    /* K4 */
         parfor( int j = 0; j < m; j += 2 )
             SIMDfor( int i = is; i < is+32; ++i )
                 for( int ks = 0; ks < p; ks += 32 )
                     cb0[ks:ks+31] = c[ks+pitch_c*j:ks+31+pitch_c*j];
                     cb1[ks:ks+31] = c[ks+pitch_c*(j+1):ks+31+pitch_c*(j+1)];
                     for( int k = ks; k < ks+32; ++k )
                         a[i+pitch_a*j] += b[i+pitch_b*k] * cb0[k-ks];
                         a[i+pitch_a*(j+1)] += b[i+pitch_b*k] * cb1[k-ks];

Performance: 59 GFLOPs – cached access to c, SIMD width 32, unroll j loop
Performance: 98 GFLOPs – cached access to c, SIMD width 64, unroll j loop
Performance: 117 GFLOPs – cached access to c, SIMD width 128, unroll j loop
Here, we note the two assignments in the k loop fetch the same value of b from the device memory. This version gets 59 GFLOPs; it jumps to 98 GFLOPs when we increase the SIMD width to 64, and again to 117 GFLOPs with a SIMD width of 128. Now we’re starting to see real performance, over 100 GFLOPs, host-to-host.
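
A sketch of the 32-wide K4 kernel in CUDA, reconstructed from the pseudo-code above (the actual source in the tarfile may differ in detail):

     extern "C" __global__ void    /* K4, j loop unrolled by 2 (reconstruction) */
     mmkernel4( float* a, float* b, float* c,
                int pitch_a, int pitch_b, int pitch_c,
                int n, int m, int p )
     {
         __shared__ float cb0[32], cb1[32];     /* strips of two columns of c */
         int tx = threadIdx.x;
         int i = blockIdx.x*32 + tx;
         int j = blockIdx.y*2;                  /* each block covers columns j and j+1 */
         float sum0 = 0.0f, sum1 = 0.0f;
         for( int ks = 0; ks < p; ks += 32 ){
             cb0[tx] = c[ks+tx+pitch_c*j];
             cb1[tx] = c[ks+tx+pitch_c*(j+1)];
             __syncthreads();
             for( int k = ks; k < ks+32; ++k ){
                 float rb = b[i+pitch_b*k];     /* one device fetch of b feeds both sums */
                 sum0 += rb * cb0[k-ks];
                 sum1 += rb * cb1[k-ks];
             }
             __syncthreads();
         }
         a[i+pitch_a*j]     = sum0;
         a[i+pitch_a*(j+1)] = sum1;
     }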

Performance: 176 GFLOPs – cached access to c, SIMD width 128, unroll j loop 3x
But we’re not done yet. What if we unroll the j loop by four iterations instead of just two? This involves keeping four partial sums. Now the performance with SIMD width 128 is 176 GFLOPs host-to-host, and over 210 GFLOPs on the device.
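
The inner loop then carries four partial sums, along the lines of this fragment (same assumptions as the sketch above, with hypothetical strips cb2 and cb3 added):

     /* j loop unrolled by 4: four c strips, four accumulators, one b fetch */
     for( int k = ks; k < ks+32; ++k ){
         float rb = b[i+pitch_b*k];
         sum0 += rb * cb0[k-ks];
         sum1 += rb * cb1[k-ks];
         sum2 += rb * cb2[k-ks];
         sum3 += rb * cb3[k-ks];
     }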

Performance: 202 GFLOPs – cached access to c, SIMD width 128, unroll j loop 3x and k loop 1x
Performance: 208 GFLOPs – cached access to c, SIMD width 128, unroll j and k loops 3x
More unrolling of the j loop doesn’t improve performance, but what if we combine this with unrolling the k loop? If we unroll the j loop 3 times and the k loop once, with SIMD width of 128, we get 202 GFLOPs; unrolling the k loop 3 times gives us 208 GFLOPs (host-to-host), and 265 GFLOPs (kernel-only).

Our peak performance so far looks pretty good. It took some experimentation, but we have a version that uses only 128 threads and 2KB local memory per thread block, allowing up to 8 thread blocks on each multiprocessor, so taking great advantage of the multithreading properties of the machine. We haven’t even fully explored all the combinations. What if we combine i loop unrolling with the j and k loop unrolling? Should we explore other unroll factors as we combine unrolling multiple loops? What if we use pointer arithmetic instead of array references (this really questions whether NVCC optimizes the array references, but it seems to do a good job there)? I desperately wanted to break the 200 GFLOP barrier, and reached it. The version of sgemm that comes with CUDA BLAS gets about 260 GFLOPs (host-to-host) on a 4096×4096 matrix; I’ve still got some work to do to get that extra 25%.

When we optimize a matmul for a general purpose CPU with a cache, we’ve learned that we need a tiled algorithm. We can do the same thing on the GPU, where we fit the submatrices in the local memory. The pseudo code is:

     parfor( int is = 0; is < n; is += 16 )    /* K5 */
         parfor( int js = 0; js < m; js += 16 )
             SIMDfor( int i = 0; i < 16; ++i )
                 SIMDfor( int j = 0; j < 16; ++j )
                     at[i][j] = 0.0;    /* init this thread's element of the A tile */
                     for( int ks = 0; ks < p; ks += 16 )
                         /* load B tile */ bt[i][ks:ks+15] = b[(is+i)+pitch_b*ks:(is+i)+pitch_b*(ks+15):pitch_b];
                         /* load C tile */ ct[ks:ks+15][j] = c[ks+pitch_c*(js+j):(ks+15)+pitch_c*(js+j)];
                         for( int k = ks; k < ks+16; ++k )
                             /* tile MM */ at[i][j] += bt[i][k]*ct[k][j];
                     a[(is+i)+pitch_a*(js+j)] = at[i][j];    /* store A tile */

Performance: 164 GFLOPs – tiled loops, SIMD width 16×16, cached access to b and c
We have to choose a tile size, and square tiles seem to make as much sense as any other shape, at least to start with. We choose 16×16 tiles and run 256 threads in a thread group, so each thread will compute one element of the a tile; this lets us keep that element in a register. The actual kernel code is slightly more complex than the previous kernels. It’s important to recall that this scalar kernel is one of a thread group or cohort of 256 cooperating instances, and it only works in that domain. This version gives us 164 GFLOPs, not quite as good as we’ve already seen. Why not? One reason is the thread group is 256 threads, so we hit the 1024 threads/multiprocessor limit with only four thread groups. We can address that as well, but I still haven’t quite reached the peak performance of kernel K4. The CUDA BLAS (260 GFLOPs) sgemm is similar to this tiled version. It’s based on work by Vasily Volkov, a Computer Science PhD student at Cal; Vasily’s code uses a 16×4 thread block with the i loop unrolled by 16, the j loop by 4, and the k loop by 16 (if I read it right). The code might be hard to follow, but it sure beats trying to code a matmul in DirectX.

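Filled out as CUDA, a 16×16 tiled kernel along these lines might look like the sketch below. This is my own reconstruction, assuming n, m, and p are multiples of 16; the article’s actual kernel (and certainly Volkov’s) differs in detail.

     extern "C" __global__ void    /* K5, 16x16 tiled (reconstruction) */
     mmkernel5( float* a, float* b, float* c,
                int pitch_a, int pitch_b, int pitch_c,
                int n, int m, int p )
     {
         __shared__ float bt[16][16];           /* B tile: bt[ii][kk] = b[is+ii][ks+kk] */
         __shared__ float ct[16][16];           /* C tile: ct[kk][jj] = c[ks+kk][js+jj] */
         int tx = threadIdx.x, ty = threadIdx.y;
         int i = blockIdx.x*16 + tx;            /* row of the A tile this thread owns */
         int j = blockIdx.y*16 + ty;            /* column of the A tile this thread owns */
         float at = 0.0f;                       /* the A tile element, kept in a register */
         for( int ks = 0; ks < p; ks += 16 ){
             bt[tx][ty] = b[i+pitch_b*(ks+ty)]; /* each thread loads one element of each tile */
             ct[tx][ty] = c[(ks+tx)+pitch_c*j];
             __syncthreads();
             for( int k = 0; k < 16; ++k )
                 at += bt[tx][k] * ct[k][ty];
             __syncthreads();
         }
         a[i+pitch_a*j] = at;
     }

     /* host-side launch (assumes n, m, p are multiples of 16): */
     // dim3 threads( 16, 16 );
     // dim3 grid( n/16, m/16 );
     // mmkernel5<<< grid, threads >>>( da, db, dc, pitch_a, pitch_b, pitch_c, n, m, p );
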
So let’s suppose we’ve decided on the K4 algorithm. It assumes that the matrix sizes are multiples of 32 (or 64 or 128) in all dimensions, though it doesn’t require the matrices to be square. One way to satisfy this is to pad all your matrices, filling in zeroes in the extra rows and columns. Matrix addition and multiplication will preserve these zeroes and will not pollute the actual values; this may be your best option. Another solution is to add conditionals so as to not run off the ends of the matrices. This complicates the code and can affect performance. The simplest method to test for array limits is to put conditionals around the device memory fetch and store operations; if we fill in zeros to the b and c tiles, the innermost loop won’t need any tests. I reproduce the body of the kernel here:

     float sum0 = 0.0, sum1 = 0.0;
     for( int ks = 0; ks < p; ks += 32 ){
         if( ks+tx < p && j < m )
             cb0[tx] = c[ks+tx+pitch_c*j];
         else
             cb0[tx] = 0.0;
         if( ks+tx < p && j+1 < m )
             cb1[tx] = c[ks+tx+pitch_c*(j+1)];
         else
             cb1[tx] = 0.0;
         __syncthreads();
         if( i < n ){
             for( int k = ks; k < ((ks+32 < p) ? ks+32 : p); ++k ){
                 float rb = b[i+pitch_b*k];
                 sum0 += rb * cb0[k-ks];
                 sum1 += rb * cb1[k-ks];
             }
         }
         __syncthreads();
     }
     if( i < n && j < m )
         a[i+pitch_a*j] = sum0;
     if( i < n && j+1 < m )
         a[i+pitch_a*(j+1)] = sum1;

Even if i and j are outside the matrix bounds, we can’t just skip the body of the loop for two reasons. First, each thread is part of a thread group, and as such it loads part of the data into the local temporary arrays cb0 and cb1; even if this thread has nothing to compute, it has to do its part of the shared work. Second, we have those pesky barrier synchronizations; all threads in a thread group must participate in the barrier, so even if this thread has no work to do, it had better reach those barriers.

These tests cost about 5% in performance, in the simplest version of K4. It’s less costly in the more complex versions, but the code gets messy when mixed with some of the unrolling. But it will work with any matrix size, whereas K4 requires the size to be a multiple of 32.

Of course, if you need to deliver a library that works regardless of the matrix sizes, you have another option. You can create two versions of your routine, a faster one that works when the matrix sizes are appropriate multiples of 16, and a slower, general purpose one that works for other matrix sizes, with a conditional test to execute the right one. Then you get your good benchmark numbers (all benchmarks use large powers of two, right?), and you get right answers, too.
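
A sketch of that dispatch follows; the kernel names and the exact divisibility tests are hypothetical placeholders for whichever fast and general kernels you settle on.

     /* host-side dispatch: use the fast kernel when the sizes allow, else the guarded one */
     void gpu_sgemm( float* da, float* db, float* dc,
                     int pitch_a, int pitch_b, int pitch_c,
                     int n, int m, int p )
     {
         dim3 threads( 128 );
         if( n % 128 == 0 && m % 2 == 0 && p % 32 == 0 ){
             dim3 grid( n/128, m/2 );
             mmkernel_fast<<< grid, threads >>>( da, db, dc, pitch_a, pitch_b, pitch_c, n, m, p );
         }else{
             dim3 grid( (n+127)/128, (m+1)/2 );
             mmkernel_general<<< grid, threads >>>( da, db, dc, pitch_a, pitch_b, pitch_c, n, m, p );
         }
     }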

Summary

The point I’ve tried to make is how sensitive the performance of the GPU is to the formulation of your kernel, and how much and what kind of experimentation you’ll need to do to optimize your performance. How much of the optimization process will carry over from one GPU to another, or from one generation to the next from the same vendor? Many programmers like this low level of control, and it certainly could be appropriate when developing a numerical library, in the same way assembly language is appropriate.

To be fair, the same is true on your CPU as well; you need to optimize your matmul for (packed) vector operations, memory strides, and cache locality. A bad program will run several times slower than a good one; ordering the matmul loops so the inner loop is non-stride-1 reduces the performance on large matrices (on our Penryn) by more than a factor of 10. But compilers and tools are far more mature and helpful when compiling for an x64, IBM POWER, Sun SPARC, or other CPU.

I’m sure many readers would like to tell me (again) that I should be using the prebuilt library version of sgemm for matmul, not writing my own. Save your breath. Matmul is just one simple example here, three loops, three matrices, lots of parallelism, and yet I put in several days of work to get this seven-line loop optimized for the GPU.

We can compare the evolution of GPU programming to the evolution of shared-memory parallel programming. There were many new languages designed to take advantage of parallelism (Id, SISAL, many others). Many low-level libraries were written to create and manage threads, eventually standardizing on Posix threads (aka pthreads). Much work was done on automatic parallelization, dating back to the 1960s and 1970s. When successful commercial shared-memory multiprocessors became more widely available in the 1980s, an effort began to standardize a less intrusive programming interface for multiple processors, driven by multiprocessor workstations, eventually resulting in the OpenMP API, which defines directives and a runtime interface to a shared-memory parallel programming model.

GPUs have their own set of domain-specific languages, including GLSL (OpenGL Shading Language), HLSL (High Level Shading Language) from Microsoft for DirectX, and Cg (C for graphics) from NVIDIA. We’re now in a period with development of low-level libraries and interfaces to create and manage GPU threads; the OpenCL effort aims to standardize this. The cost to port a nontrivial application to this model is high, though the potential performance is alluring.

Luckily for me, my application (the compiler) runs on the host, and I don’t have to port that. But what about the real application programmer, who has thousands (or hundreds of thousands) of lines of code? Is it feasible to take GPU or accelerator programming concepts, abstract them into a predictable and useful programming model, and present them using a portable programming interface, in the same way that OpenMP abstracts and presents multiprocessor and multicore systems? That’s a topic for my next column.

—–

Michael Wolfe has developed compilers for over 30 years in both academia and industry, and is now a senior compiler engineer at The Portland Group, Inc. (www.pgroup.com), a wholly-owned subsidiary of STMicroelectronics, Inc. The opinions stated here are those of the author, and do not represent opinions of The Portland Group, Inc. or STMicroelectronics, Inc.
