Compilers and More: Optimizing GPU Kernels

By Michael Wolfe

October 30, 2008

My last column discussed some of the complexities of programming GPUs today, focusing on how to interface the host program with the GPU. Here we focus on programming the GPU itself. As with last time, we’ll look at a simple single-precision matrix multiplication, equivalent to the BLAS SGEMM routine.

Matmul is a highly parallel algorithm, but let me emphasize that parallelism does not equate to performance. We need to carefully sculpt our algorithm to match the parallelism available in the architecture in order to reap the benefits. This is true whether we are targeting a GPU, a multicore x64, or even a single core with packed SSE operations. As an example, I took the simple matmul loop (in C, but with the matrices stored column-major):

     for( int j = 0; j < m; ++j )
for( int k = 0; k < p; ++k )
for( int i = 0; i < n; ++i )
a[i+pitch_a*j] += b[i+pitch_b*k] * c[k+pitch_c*j];

modified it several ways and ran it on an Intel Xeon (3GHz, 6MB cache, 16GB memory, Penryn) using 4096×4096 matrices (to compare with results we’ll see below). With the loop in the order shown (stride-1 inner loop), the program ran at 1.7 GFLOPs; this is compiled C performance (using pgcc -fast). We can improve that by tiling or blocking the loops, organizing the matmul as a a bunch of submatrix multiplications, sized so each submatrix matmul fits in the processor cache. This improves performance to 5.7 GFLOPs, and it jumps to over 22 GFLOPs when we use OpenMP directives and run on all four cores. Advanced compilers help by automatically managing the vectorization, unrolling, memory alignments, adding prefetch instructions, and so forth.

We’re going to see several matmul GPU kernels, with performance on our GPU development system, with an NVIDIA GeForce GTX 280 (1GB memory, 30 multiprocessors), using NVIDIA’s CUDA language. The host is a Linux (OpenSUSE 11.0) triple-core AMD Phenom (2.1 GHz, 500KB cache, 4GB memory), though the host hardly matters; the performance for these experiments is entirely dominated by the GPU code.

As on the CPU, performance on a GPU can be fragile; small changes to the program can make large differences in performance. It’s easy to write a slow program. This was a characteristic of High Performance Fortran, one that (my opinion) was a major cause of its downfall; while HPF made it easier to write parallel programs, it didn’t make parallel programs fast. That is the job of the HPC programmer; the same will be true for accelerators, GPUs, and even multicore CPUs.

GPUs deliver their dramatic high performance through a well-balanced, carefully managed, highly parallel architecture. Algorithms running on the GPU must be parallelized and balanced as well; this does not come for free. Program development may cost extra time and effort to understand and use the appropriate programming model, a model that may not match the simple scalar processor with cache model we are comfortable with on x64 hosts. However, the analysis and programming techniques used to develop GPU algorithms will probably help you develop multicore programs as well. A good programming model with good compilers and tools can relieve you of much busywork, but you still have to think, and you still have to understand algorithms and architecture, and you should expect no less.

From here on below, I show many versions of matmul; if you’re not a programmer or want to skip over the details, look for the performance tags below, until the Summary; you don’t want to miss the conclusions. If you are a programmer and want to see all the code, you’ll find all the sources in a kernels tarfile at the PGI Web site.

In my last column, I proposed a simple matmul kernel for the GPU and focused on the host code to drive the kernel. We’ll use that simple kernel to start the discussion. What I had done is taken the matmul loop (as shown above), strip-mined the stride-1 i loop to the CUDA SIMD width of 32:

     for( int is = 0; is < n; is += 32 )
for( int i = is; i < is+32; ++i )
for( int j = 0; j < m; ++j )
for( int k = 0; k < p; ++k )
a[i+pitch_a*j] += b[i+pitch_b*k] * c[k+pitch_c*j];

run the i element loop as a thread block, and run the is strip loop and j loop in parallel:

     parfor( int is = 0; is < n; is += 32 )    /* K1 */
parfor( int j = 0; j < m; ++j )
SIMDfor( int i = is; i < is+32; ++i )
for( int k = 0; k < p; ++k )
a[i+pitch_a*j] += b[i+pitch_b*k] * c[k+pitch_c*j];

then optimized by hand just a little. The parallel (grid) loops and the SIMD (thread block) loop are handled implicitly by the GPU hardware and firmware, so they don’t appear in the kernel code. All that’s left is the body, the k loop. The final kernel in all its glory, cut-and-pasted from my CUDA source file, is:

    extern "C" __global__ void    /* K1 */
mmkernel( float* a, float* b, float* c,
int pitch_a, int pitch_b, int pitch_c,
int n, int m, int p )
{
int i = blockIdx.x*32 + threadIdx.x;
int j = blockIdx.y;
float sum = 0.0;
for( int k = 0; k < p; ++k )
sum += b[i+pitch_b*k] * c[k+pitch_c*j];
a[i+pitch_a*j] = sum;
}

Performance: 28 GFLOPs – SIMD width 32 next
This version runs at 28 GFLOPs on our system (on 4096×4096 matrices). In the interest of full disclosure, I compiled the kernels discussed here with NVIDIA’s NVCC compiler version 2.0 with the -O option, and compiled the driver routine with pgcc -fast; I ran each program three times and report the middle performance score, rounding the GFLOPs down to an integer value. I will generally show a GFLOPs number most directly comparable to a host matmul, including the overhead of transmitting the operand matrices to the GPU memory and the result matrix back. I will sometimes give the performance of just the matmul kernel on the GPU; while the two numbers are often quite close, the kernel-only number is useful to expose more clearly the effect of changes to the kernel program (since the overhead stays the same). I show results for 4096×4096 matrices, which is close to the peak performance for each kernel. For version K1, the host-to-host and kernel-only performance were 28 and 29 GFLOPs, respectively.

That may sound like good performance, but we’re not nearly taking full advantage of the available parallelism. Recall the NVIDIA architecture description; the card I’m using has 30 multiprocessors, each with eight thread processors, quad-clocked to get a SIMD width of 32. The kernel above is a scalar program, but the card runs 32 copies of it in SIMD mode (or SIMT mode, to use NVIDIA’s term); the 32 copies comprise a warp. Each multiprocessor uses multithreading to support up to 32 warps (1024 scalar threads). The 32 warps can come from different thread blocks (different iterations of is or j) or from wider thread blocks (more than 32 scalar threads). There are limits in this generation of the card: each multiprocessor can support up to eight simultaneous thread blocks, and a thread block can support up to 16 warps.

Performance: 36 GFLOPs – SIMD width 64 next
The K1 kernel has only one warp per thread block, so at most eight thread blocks will be active on each multiprocessor, out of the possible 32. I can improve that by strip-mining the i loop to a width of 64, changing the 32 to a 64 in kernel K1, and running with 64 threads in each thread block. With this version, if eight thread blocks are scheduled on each multiprocessor, we get up to 16 warps, so the multithreading is more effective. And we see a performance increase, to 36 GFLOPs (38 kernel-only).

Performance: 35 GFLOPs – SIMD width 128 next
So what happens if I try this trick again, doubling or quadrupling the strip size to 128 or 256? This increases the maximum number of warps per multiprocessor to 32 (which is the limit), so we might expect another bump in performance from improved multithreading. Unfortunately, we don’t; the performance drops slightly to 35 GFLOPs (36 kernel-only) in both cases. This I can’t quite explain.

Performance: 1.7 GFLOPs – SIMD width 32, non-stride-1 array accesses next
Even with this simple version, I made some assumptions and optimizations, knowing something about the machine. I know that stride-1 accesses in a thread block are important, so I ran the stride-1 i loop along the thread index. Just how important is that? Suppose we switch the i and j indices, so the SIMD memory accesses are along a column; the performance drops from 28 GFLOPs (K1) to 1.7. We can call this kernel Ks (s for stupid, or slow).

Performance: 5.7 GFLOPs – SIMD width 32, inverted parallel loop indices next
Still, we’ve only just started. If we inspect the code for kernel K1, we note that the inner loop contains two memory fetches, for b and c; both fetches are from the device memory, which has a very high latency. In particular, the fetch for c loads the same element for all the threads in the thread block. The memory system is designed for high bandwidth when all the threads access consecutive elements, such as with the b access. This used to be called superword access in classical vector machines, where the memory returns 64-bytes (or more) at a time. Kernel K1 doesn’t take advantage of this memory design for the c access, but we can fix that. Let’s strip-mine the k loop, and load a strip of c into the multiprocessor local memory. The pseudo code is:

     parfor( int is = 0; is < n; is += 32 )    /* K2 */
parfor( int j = 0; j < m; ++j )
SIMDfor( int i = is; i < is+32; ++i )
for( int ks = 0; ks < p; ks += 32 )
cb[ks:ks+31] = c[ks+pitch_c*j:ks+31+pitch_c*j];
for( int k = ks; k < ks+32; ++k )
a[i+pitch_a*j] += b[i+pitch_b*k] * cb[k-ks];

Performance: 33 GFLOPs – cached access to c, SIMD width 32
Performance: 55 GFLOPs – cached access to c, SIMD width 64
Performance: 63 GFLOPs – cached access to c, SIMD width 128 next
Note the vector fetch of c into the temporary array cb. This is handled in kernel K2 by letting each thread fetch one element and storing into the multiprocessor local memory, so the inner loop only has one device memory fetch; the fetch of cb from the local memory is almost as fast as a register access; we see the performance improves to 33 GFLOPS, up from 28. We can again increase the number of threads per block from 32 to 64 and 128, and we see performance improve from 33 to 55 and 63 GFLOPs. As with kernel K1, increasing to 256 threads per block does not improve performance. An implementation detail: with more than one warp per thread block, we need to synchronize the warps after loading the temp array cb, and before reloading it the next time around the ks loop; see the CUDA source code for this detail.

Performance: 63 GFLOPs – cached access to c, SIMD width 64, unroll inner loop next
But we’re not done yet. What if we unroll the inner loop, to reduce the loop overhead? We might unroll to a factor of 2 or 4 or even 16. Unrolling the inner loop once in the 64-wide K2 kernel does improve performance, getting 63 GFLOPs, but more unrolling doesn’t help, and it doesn’t help the 128-wide kernel.

So far we’ve got two kernel versions, with variations in the thread-block (vector) size and unrolling. And we’ve only just begun. We tried unrolling the inner k loop; what if we try unrolling one of the outer loops? We could let each kernel instance compute two values of the i loop. The pseudo-code looks like:

      parfor( int is = 0; is < n; is += 64 )    /* K3 */
parfor( int j = 0; j < m; ++j )
SIMDfor( int i = is; i < is+32; ++i )
for( int ks = 0; ks < p; ks += 32 )
cb[ks:ks+31] = c[ks+pitch_c*j:ks+31+pitch_c*j];
for( int k = ks; k < ks+32; ++k )
a[i+pitch_a*j] += b[i+pitch_b*k] * cb[k-ks];
a[i+32+pitch_a*j] += b[i+32+pitch_b*k] * cb[k-ks];

Performance: 53 GFLOPs – cached access to c, SIMD width 32, unroll i loop
Performance: 63 GFLOPs – cached access to c, SIMD width 32, unroll i loop 3x next
Each iteration of the i loop now computes values for i and i+32. We don’t expect much advantage here, since the only values shared between the two i iterations are loaded from the local memory, which is already pretty fast. But even this kernel improves upon K2, with 53 GFLOPs. We can improve this to 63 GFLOPs by unrolling more or increasing the SIMD width to 64.

Next, we can try unrolling the j loop, so each kernel computes values for j and j+1. The pseudo-code is:

      parfor( int is = 0; is < n; is += 32 )    /* K4 */
parfor( int j = 0; j < m; j += 2 )
SIMDfor( int i = is; i < is+32; ++i )
for( int ks = 0; ks < p; ks += 32 )
cb0[ks:ks+31] = c[ks+pitch_c*j:ks+31+pitch_c*j];
cb1[ks:ks+31] = c[ks+pitch_c*(j+1):ks+31+pitch_c*(j+1)];
for( int k = ks; k < ks+64; ++k )
a[i+pitch_a*j] += b[i+pitch_b*k] * cb0[k-ks];
a[i+pitch_a*(j+1)] += b[i+pitch_b*k] * cb1[k-ks];

Performance: 59 GFLOPs – cached access to c, SIMD width 32, unroll j loop
Performance: 98 GFLOPs – cached access to c, SIMD width 64, unroll j loop
Performance: 117 GFLOPs – cached access to c, SIMD width 128, unroll j loop next
Here, we note the two assignments in the k loop fetch the same value of b from the device memory. This version gets 59 GFLOPs; it jumps to 98 GFLOPs when we increase the SIMD width to 64, and again to 117 GFLOPs with a SIMD width of 128. Now we’re starting to see real performance, over 100 GFLOPs, host-to-host.

Performance: 176 GFLOPs – cached access to c, SIMD width 128, unroll j loop 3x next
But we’re not done yet. What if we unroll the j loop by four iterations instead of just two? This involves keeping four partial sums. Now the performance with SIMD width 128 is 176 GFLOPs host-to-host, and over 210 GFLOPs on the device.

Performance: 202 GFLOPs – cached access to c, SIMD width 128, unroll j loop 3x and k loop 1x
Performance: 208 GFLOPs – cached access to c, SIMD width 128, unroll j and k loops 3x next
More unrolling of the j loop doesn’t improve performance, but what if we combine this with unrolling the k loop? If we unroll the j loop 3 times and the k loop once, with SIMD width of 128, we get 202 GFLOPs; unrolling the k 3 times gives us 208 GFLOPs (host-to-host), and 265 GFLOPs (kernel-only).

Our peak performance so far looks pretty good. It took some experimentation, but we have a version that uses only 128 threads and 2KB local memory per thread block, allowing up to 8 thread blocks on each multiprocessor, so taking great advantage of the multithreading properties of the machine. We haven’t even fully explored all the combinations. What if we combine i loop unrolling with the j and k loop unrolling? Should we explore other unroll factors as we combine unrolling multiple loops? What if we use pointer arithmetic instead of array references (this really questions whether NVCC optimizes the array references, but it seems to do a good job there)? I desperately wanted to break the 200 GFLOP barrier, and reached it. The version of sgemm that comes with CUDA BLAS gets about 260 GFLOPs (host-to-host) on a 4096×4096 matrix; I’ve still got some work to do to get that extra 25%.

When we optimize a matmul for a general purpose CPU with a cache, we’ve learned that we need a tiled algorithm. We can do the same thing on the GPU, where we fit the submatrices in the local memory. The pseudo code is:

      parfor( int is = 0; is < n; is += 16 )    /* K5 */
parfor( int js = 0; js < m; js += 16 )
SIMDfor( int i = 0; i < 16; ++i )
SIMDfor( int j = 0; j < 16; ++j )
at[is:is+15][js:js+15] = 0.0; /* init A tile */
for( int ks = 0; ks < p; ks += 16 )
/*load B tile*/ bt[i][ks:ks+15] = b[i+pitch_b*ks:i+pitch_b*(ks+15):pitch_b];
/*load C tile*/ ct[ks:ks+15][j] = c[ks+pitch_c*(js+j):(ks+15)+pitch_c*(js+j)];
for( int k = ks; k < ks+64; ++k )
/*tile MM */ at[i][j] += bt[i][k]*ct[k][j];
a[i+pitch_a*j] = a[i][j]; /* store A tile */

Performance: 164 GFLOPs – tiled loops, SIMD width 16×16, cached access to b and c
We have to choose a tile size, and square tiles seem to make as much sense as any other shape, at least to start with. We choose 16×16 tiles and run 256 threads in a thread group, so each thread will compute one element of the a tile; this lets us keep that element in a register. The actual kernel code is slightly more complex than the previous kernels. It’s important to recall that this scalar kernel is one of a thread group or cohort of 256 cooperating instances, and it only works in that domain. This version gives us 164 GFLOPs, not quite as good as we’ve already seen. Why not? One reason is the thread group is 256 threads, so we hit the 1024 threads/multiprocessor limit with only four thread groups. We can address that as well, but I still haven’t quite reached the peak performance shown on kernel K4. The CUDA blas (260 GFLOPs) sgemm is similar to this tiled version. It’s based on work by Vasily Volkov, a Computer Science PhD student at Cal; Vasily’s code uses a 16×4 thread block with the i loop unrolled by 16, the j loop by 4, and the k loop by 16 (if I read it right). The code might be hard to follow, but it sure beats trying to code a matmul in DirectX. PGI GPU Matmul Performance Chart
So let’s suppose we’ve decided on the K4 algorithm. It assumes that the matrix sizes are multiples of 32 (or 64 or 128) in all dimensions, though it doesn’t require the matrices to be square. One way to satisfy this is to pad all your matrices, filling in zeroes in the extra rows and columns. Matrix addition and multiplication will preserve these zeroes and will not pollute the actual values; this may be your best option. Another solution is to add conditionals so as to not run off the ends of the matrices. This complicates the code and can affect performance. The simplest method to test for array limits is to put conditionals around the device memory fetch and store operations; if we fill in zeros to the b and c tiles, the innermost loop won’t need any tests. I reproduce the body of the kernel here:

        float sum0 = 0.0, sum1 = 0.0;
for( int ks = 0; ks < p; ks += 32 ){
if( ks+tx < p && j < m )
cb0[tx] = c[ks+tx+pitch_c*j];
else
cb0[tx] = 0.0;
if( ks+tx < p && j+1 < m )
cb1[tx] = c[ks+tx+pitch_c*(j+1)];
else
cb1[tx] = 0.0;
__syncthreads();
if( i < n ){
for( int k = ks; k < ((ks+32 < m) ? ks+32 : m); ++k ){
float rb = b[i+pitch_b*k];
sum0 += rb * cb0[k-ks];
sum1 += rb * cb1[k-ks];
}
}
__syncthreads();
}
if( i < n && j < m )
a[i+pitch_a*j] = sum0;
if( i < n && j+1 < m )
a[i+pitch_a*(j+1)] = sum1;

Even if i and j are outside the matrix bounds, we can’t just skip the body of the loop for two reasons. First, each thread is part of a thread group, and as such it loads part of the data into the local temporary arrays cb0 and cb1; even if this thread has nothing to compute, it has to do its part of the shared work. Second, we have those pesky barrier synchronizations; all threads in a thread group must participate in the barrier, so even if this thread has no work to do, it had better reach those barriers.

These tests cost about 5% in performance, in the simplest version of K4. It’s less costly in the more complex versions, but the code gets messy when mixed with some of the unrolling. But it will work with any matrix size, whereas K4 requires the size to be a multiple of 32.

Of course, if you need to deliver a library that works regardless of the matrix sizes, you have another option. You can create two versions of your routine, a faster one that works when the matrix sizes are appropriate multiples of 16, and a slower, general purpose one that works for other matrix sizes, with a conditional test to execute the right one. Then you get your good benchmark numbers (all benchmarks use large powers of two, right?), and you get right answers, too.

Summary

The point I’ve tried to make is how sensitive the performance of the GPU is to the formulation of your kernel, and how much and what kind of experimentation you’ll need to do to optimize your performance. How much of the optimization process will carry over from one GPU to another, or from one generation to the next from the same vendor? Many programmers like this low level of control, and it certainly could be appropriate when developing a numerical library, in the same way assembly language is appropriate.

To be fair, the same is true on your CPU as well; you need to optimize your matmul for (packed) vector operations, memory strides, and cache locality. A bad program will run several times slower than a good one; ordering the matmul loops so the inner loop is non-stride-1 reduces the performance on large matrices (on our Penryn) by more than a factor of 10. But compilers and tools are far more mature and helpful when compiling for an x64, IBM POWER, Sun SPARC, or other CPU.

I’m sure many readers would like to tell me (again) that I should be using the prebuilt library version of sgemm for matmul, not writing my own. Save your breath. Matmul is just one simple example here, three loops, three matrices, lots of parallelism, and yet I put in several days of work to get this seven line loop optimized for the GPU.

We can compare the evolution of GPU programming to the evolution of shared-memory parallel programming. There were many new languages designed to take advantage of parallelism (Id, SISAL, many others). Many low-level libraries were written to create and manage threads, eventually standardizing on Posix threads (aka pthreads). Much work was done on automatic parallelization, dating back to the 1960s and 1970s. When successful commercial shared-memory multiprocessors became more widely available in the 1980s, an effort began to standardize a less intrusive programming interface for multiple processors, driven by multiprocessor workstations, eventually resulting in the OpenMP API, which defines directives and a runtime interface to a shared-memory parallel programming model.

GPUs have their own set of domain-specific languages, including GLSL (OpenGL Shading Language), HLSL (high level shader language) from Microsoft for DirectX, and Cg (C for graphics) from NVIDIA. We’re now in a period with development of low-level libraries and interfaces to create and manage GPU threads; the OpenCL effort aims to standardize this. The cost to port a nontrivial application to this model is high, though the potential performance is alluring.

Luckily for me, my application (the compiler) runs on the host, and I don’t have to port that. But what about the real application programmer, who has thousands (or hundreds of thousands) of lines of code? Is it feasible to take GPU or accelerator programming concepts, abstract them into a predictable and useful programming model, and present them using a portable programming interface, in the same way that OpenMP abstracts and presents multiprocessor and multicore systems? That’s a topic for my next column.

—–

Michael Wolfe has developed compilers for over 30 years in both academia and industry, and is now a senior compiler engineer at The Portland Group, Inc. (www.pgroup.com), a wholly-owned subsidiary of STMicroelectronics, Inc. The opinions stated here are those of the author, and do not represent opinions of The Portland Group, Inc. or STMicroelectronics, Inc.

Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industy updates delivered to you every week!

UCSD, AIST Forge Tighter Alliance with AI-Focused MOU

January 18, 2018

The rich history of collaboration between UC San Diego and AIST in Japan is getting richer. The organizations entered into a five-year memorandum of understanding on January 10. The MOU represents the continuation of a 1 Read more…

By Tiffany Trader

New Blueprint for Converging HPC, Big Data

January 18, 2018

After five annual workshops on Big Data and Extreme-Scale Computing (BDEC), a group of international HPC heavyweights including Jack Dongarra (University of Tennessee), Satoshi Matsuoka (Tokyo Institute of Technology), Read more…

By John Russell

Researchers Measure Impact of ‘Meltdown’ and ‘Spectre’ Patches on HPC Workloads

January 17, 2018

Computer scientists from the Center for Computational Research, State University of New York (SUNY), University at Buffalo have examined the effect of Meltdown and Spectre security updates on the performance of popular H Read more…

By Tiffany Trader

HPE Extreme Performance Solutions

HPE and NREL Take Steps to Create a Sustainable, Energy-Efficient Data Center with an H2 Fuel Cell

As enterprises attempt to manage rising volumes of data, unplanned data center outages are becoming more common and more expensive. As the cost of downtime rises, enterprises lose out on productivity and valuable competitive advantage without access to their critical data. Read more…

Fostering Lustre Advancement Through Development and Contributions

January 17, 2018

Six months after organizational changes at Intel's High Performance Data (HPDD) division, most in the Lustre community have shed any initial apprehension around the potential changes that could affect or disrupt Lustre Read more…

By Carlos Aoki Thomaz

UCSD, AIST Forge Tighter Alliance with AI-Focused MOU

January 18, 2018

The rich history of collaboration between UC San Diego and AIST in Japan is getting richer. The organizations entered into a five-year memorandum of understandi Read more…

By Tiffany Trader

New Blueprint for Converging HPC, Big Data

January 18, 2018

After five annual workshops on Big Data and Extreme-Scale Computing (BDEC), a group of international HPC heavyweights including Jack Dongarra (University of Te Read more…

By John Russell

Researchers Measure Impact of ‘Meltdown’ and ‘Spectre’ Patches on HPC Workloads

January 17, 2018

Computer scientists from the Center for Computational Research, State University of New York (SUNY), University at Buffalo have examined the effect of Meltdown Read more…

By Tiffany Trader

Fostering Lustre Advancement Through Development and Contributions

January 17, 2018

Six months after organizational changes at Intel's High Performance Data (HPDD) division, most in the Lustre community have shed any initial apprehension aroun Read more…

By Carlos Aoki Thomaz

When the Chips Are Down

January 11, 2018

In the last article, "The High Stakes Semiconductor Game that Drives HPC Diversity," I alluded to the challenges facing the semiconductor industry and how that may impact the evolution of HPC systems over the next few years. I thought I’d lift the covers a little and look at some of the commercial challenges that impact the component technology we use in HPC. Read more…

By Dairsie Latimer

How Meltdown and Spectre Patches Will Affect HPC Workloads

January 10, 2018

There have been claims that the fixes for the Meltdown and Spectre security vulnerabilities, named the KPTI (aka KAISER) patches, are going to affect applicatio Read more…

By Rosemary Francis

Momentum Builds for US Exascale

January 9, 2018

2018 looks to be a great year for the U.S. exascale program. The last several months of 2017 revealed a number of important developments that help put the U.S. Read more…

By Alex R. Larzelere

ANL’s Rick Stevens on CANDLE, ARM, Quantum, and More

January 8, 2018

Late last year HPCwire caught up with Rick Stevens, associate laboratory director for computing, environment and life Sciences at Argonne National Laboratory, f Read more…

By John Russell

Inventor Claims to Have Solved Floating Point Error Problem

January 17, 2018

"The decades-old floating point error problem has been solved," proclaims a press release from inventor Alan Jorgensen. The computer scientist has filed for and Read more…

By Tiffany Trader

US Coalesces Plans for First Exascale Supercomputer: Aurora in 2021

September 27, 2017

At the Advanced Scientific Computing Advisory Committee (ASCAC) meeting, in Arlington, Va., yesterday (Sept. 26), it was revealed that the "Aurora" supercompute Read more…

By Tiffany Trader

Japan Unveils Quantum Neural Network

November 22, 2017

The U.S. and China are leading the race toward productive quantum computing, but it's early enough that ultimate leadership is still something of an open questi Read more…

By Tiffany Trader

AMD Showcases Growing Portfolio of EPYC and Radeon-based Systems at SC17

November 13, 2017

AMD’s charge back into HPC and the datacenter is on full display at SC17. Having launched the EPYC processor line in June along with its MI25 GPU the focus he Read more…

By John Russell

Nvidia Responds to Google TPU Benchmarking

April 10, 2017

Nvidia highlights strengths of its newest GPU silicon in response to Google's report on the performance and energy advantages of its custom tensor processor. Read more…

By Tiffany Trader

IBM Begins Power9 Rollout with Backing from DOE, Google

December 6, 2017

After over a year of buildup, IBM is unveiling its first Power9 system based on the same architecture as the Department of Energy CORAL supercomputers, Summit a Read more…

By Tiffany Trader

Fast Forward: Five HPC Predictions for 2018

December 21, 2017

What’s on your list of high (and low) lights for 2017? Volta 100’s arrival on the heels of the P100? Appearance, albeit late in the year, of IBM’s Power9? Read more…

By John Russell

GlobalFoundries Puts Wind in AMD’s Sails with 12nm FinFET

September 24, 2017

From its annual tech conference last week (Sept. 20), where GlobalFoundries welcomed more than 600 semiconductor professionals (reaching the Santa Clara venue Read more…

By Tiffany Trader

Leading Solution Providers

Chip Flaws ‘Meltdown’ and ‘Spectre’ Loom Large

January 4, 2018

The HPC and wider tech community have been abuzz this week over the discovery of critical design flaws that impact virtually all contemporary microprocessors. T Read more…

By Tiffany Trader

Perspective: What Really Happened at SC17?

November 22, 2017

SC is over. Now comes the myriad of follow-ups. Inboxes are filled with templated emails from vendors and other exhibitors hoping to win a place in the post-SC thinking of booth visitors. Attendees of tutorials, workshops and other technical sessions will be inundated with requests for feedback. Read more…

By Andrew Jones

Tensors Come of Age: Why the AI Revolution Will Help HPC

November 13, 2017

Thirty years ago, parallel computing was coming of age. A bitter battle began between stalwart vector computing supporters and advocates of various approaches to parallel computing. IBM skeptic Alan Karp, reacting to announcements of nCUBE’s 1024-microprocessor system and Thinking Machines’ 65,536-element array, made a public $100 wager that no one could get a parallel speedup of over 200 on real HPC workloads. Read more…

By John Gustafson & Lenore Mullin

Delays, Smoke, Records & Markets – A Candid Conversation with Cray CEO Peter Ungaro

October 5, 2017

Earlier this month, Tom Tabor, publisher of HPCwire and I had a very personal conversation with Cray CEO Peter Ungaro. Cray has been on something of a Cinderell Read more…

By Tiffany Trader & Tom Tabor

Flipping the Flops and Reading the Top500 Tea Leaves

November 13, 2017

The 50th edition of the Top500 list, the biannual publication of the world’s fastest supercomputers based on public Linpack benchmarking results, was released Read more…

By Tiffany Trader

GlobalFoundries, Ayar Labs Team Up to Commercialize Optical I/O

December 4, 2017

GlobalFoundries (GF) and Ayar Labs, a startup focused on using light, instead of electricity, to transfer data between chips, today announced they've entered in Read more…

By Tiffany Trader

How Meltdown and Spectre Patches Will Affect HPC Workloads

January 10, 2018

There have been claims that the fixes for the Meltdown and Spectre security vulnerabilities, named the KPTI (aka KAISER) patches, are going to affect applicatio Read more…

By Rosemary Francis

HPC Chips – A Veritable Smorgasbord?

October 10, 2017

For the first time since AMD's ill-fated launch of Bulldozer the answer to the question, 'Which CPU will be in my next HPC system?' doesn't have to be 'Whichever variety of Intel Xeon E5 they are selling when we procure'. Read more…

By Dairsie Latimer

  • arrow
  • Click Here for More Headlines
  • arrow
Share This