Compilers and More: Optimizing GPU Kernels

By Michael Wolfe

October 30, 2008

My last column discussed some of the complexities of programming GPUs today, focusing on how to interface the host program with the GPU. Here we focus on programming the GPU itself. As with last time, we’ll look at a simple single-precision matrix multiplication, equivalent to the BLAS SGEMM routine.

Matmul is a highly parallel algorithm, but let me emphasize that parallelism does not equate to performance. We need to carefully sculpt our algorithm to match the parallelism available in the architecture in order to reap the benefits. This is true whether we are targeting a GPU, a multicore x64, or even a single core with packed SSE operations. As an example, I took the simple matmul loop (in C, but with the matrices stored column-major):

     for( int j = 0; j < m; ++j )
for( int k = 0; k < p; ++k )
for( int i = 0; i < n; ++i )
a[i+pitch_a*j] += b[i+pitch_b*k] * c[k+pitch_c*j];

modified it several ways and ran it on an Intel Xeon (3GHz, 6MB cache, 16GB memory, Penryn) using 4096×4096 matrices (to compare with results we’ll see below). With the loop in the order shown (stride-1 inner loop), the program ran at 1.7 GFLOPs; this is compiled C performance (using pgcc -fast). We can improve that by tiling or blocking the loops, organizing the matmul as a a bunch of submatrix multiplications, sized so each submatrix matmul fits in the processor cache. This improves performance to 5.7 GFLOPs, and it jumps to over 22 GFLOPs when we use OpenMP directives and run on all four cores. Advanced compilers help by automatically managing the vectorization, unrolling, memory alignments, adding prefetch instructions, and so forth.

We’re going to see several matmul GPU kernels, with performance on our GPU development system, with an NVIDIA GeForce GTX 280 (1GB memory, 30 multiprocessors), using NVIDIA’s CUDA language. The host is a Linux (OpenSUSE 11.0) triple-core AMD Phenom (2.1 GHz, 500KB cache, 4GB memory), though the host hardly matters; the performance for these experiments is entirely dominated by the GPU code.

As on the CPU, performance on a GPU can be fragile; small changes to the program can make large differences in performance. It’s easy to write a slow program. This was a characteristic of High Performance Fortran, one that (my opinion) was a major cause of its downfall; while HPF made it easier to write parallel programs, it didn’t make parallel programs fast. That is the job of the HPC programmer; the same will be true for accelerators, GPUs, and even multicore CPUs.

GPUs deliver their dramatic high performance through a well-balanced, carefully managed, highly parallel architecture. Algorithms running on the GPU must be parallelized and balanced as well; this does not come for free. Program development may cost extra time and effort to understand and use the appropriate programming model, a model that may not match the simple scalar processor with cache model we are comfortable with on x64 hosts. However, the analysis and programming techniques used to develop GPU algorithms will probably help you develop multicore programs as well. A good programming model with good compilers and tools can relieve you of much busywork, but you still have to think, and you still have to understand algorithms and architecture, and you should expect no less.

From here on below, I show many versions of matmul; if you’re not a programmer or want to skip over the details, look for the performance tags below, until the Summary; you don’t want to miss the conclusions. If you are a programmer and want to see all the code, you’ll find all the sources in a kernels tarfile at the PGI Web site.

In my last column, I proposed a simple matmul kernel for the GPU and focused on the host code to drive the kernel. We’ll use that simple kernel to start the discussion. What I had done is taken the matmul loop (as shown above), strip-mined the stride-1 i loop to the CUDA SIMD width of 32:

     for( int is = 0; is < n; is += 32 )
for( int i = is; i < is+32; ++i )
for( int j = 0; j < m; ++j )
for( int k = 0; k < p; ++k )
a[i+pitch_a*j] += b[i+pitch_b*k] * c[k+pitch_c*j];

run the i element loop as a thread block, and run the is strip loop and j loop in parallel:

     parfor( int is = 0; is < n; is += 32 )    /* K1 */
parfor( int j = 0; j < m; ++j )
SIMDfor( int i = is; i < is+32; ++i )
for( int k = 0; k < p; ++k )
a[i+pitch_a*j] += b[i+pitch_b*k] * c[k+pitch_c*j];

then optimized by hand just a little. The parallel (grid) loops and the SIMD (thread block) loop are handled implicitly by the GPU hardware and firmware, so they don’t appear in the kernel code. All that’s left is the body, the k loop. The final kernel in all its glory, cut-and-pasted from my CUDA source file, is:

    extern "C" __global__ void    /* K1 */
mmkernel( float* a, float* b, float* c,
int pitch_a, int pitch_b, int pitch_c,
int n, int m, int p )
int i = blockIdx.x*32 + threadIdx.x;
int j = blockIdx.y;
float sum = 0.0;
for( int k = 0; k < p; ++k )
sum += b[i+pitch_b*k] * c[k+pitch_c*j];
a[i+pitch_a*j] = sum;

Performance: 28 GFLOPs – SIMD width 32 next
This version runs at 28 GFLOPs on our system (on 4096×4096 matrices). In the interest of full disclosure, I compiled the kernels discussed here with NVIDIA’s NVCC compiler version 2.0 with the -O option, and compiled the driver routine with pgcc -fast; I ran each program three times and report the middle performance score, rounding the GFLOPs down to an integer value. I will generally show a GFLOPs number most directly comparable to a host matmul, including the overhead of transmitting the operand matrices to the GPU memory and the result matrix back. I will sometimes give the performance of just the matmul kernel on the GPU; while the two numbers are often quite close, the kernel-only number is useful to expose more clearly the effect of changes to the kernel program (since the overhead stays the same). I show results for 4096×4096 matrices, which is close to the peak performance for each kernel. For version K1, the host-to-host and kernel-only performance were 28 and 29 GFLOPs, respectively.

That may sound like good performance, but we’re not nearly taking full advantage of the available parallelism. Recall the NVIDIA architecture description; the card I’m using has 30 multiprocessors, each with eight thread processors, quad-clocked to get a SIMD width of 32. The kernel above is a scalar program, but the card runs 32 copies of it in SIMD mode (or SIMT mode, to use NVIDIA’s term); the 32 copies comprise a warp. Each multiprocessor uses multithreading to support up to 32 warps (1024 scalar threads). The 32 warps can come from different thread blocks (different iterations of is or j) or from wider thread blocks (more than 32 scalar threads). There are limits in this generation of the card: each multiprocessor can support up to eight simultaneous thread blocks, and a thread block can support up to 16 warps.

Performance: 36 GFLOPs – SIMD width 64 next
The K1 kernel has only one warp per thread block, so at most eight thread blocks will be active on each multiprocessor, out of the possible 32. I can improve that by strip-mining the i loop to a width of 64, changing the 32 to a 64 in kernel K1, and running with 64 threads in each thread block. With this version, if eight thread blocks are scheduled on each multiprocessor, we get up to 16 warps, so the multithreading is more effective. And we see a performance increase, to 36 GFLOPs (38 kernel-only).

Performance: 35 GFLOPs – SIMD width 128 next
So what happens if I try this trick again, doubling or quadrupling the strip size to 128 or 256? This increases the maximum number of warps per multiprocessor to 32 (which is the limit), so we might expect another bump in performance from improved multithreading. Unfortunately, we don’t; the performance drops slightly to 35 GFLOPs (36 kernel-only) in both cases. This I can’t quite explain.

Performance: 1.7 GFLOPs – SIMD width 32, non-stride-1 array accesses next
Even with this simple version, I made some assumptions and optimizations, knowing something about the machine. I know that stride-1 accesses in a thread block are important, so I ran the stride-1 i loop along the thread index. Just how important is that? Suppose we switch the i and j indices, so the SIMD memory accesses are along a column; the performance drops from 28 GFLOPs (K1) to 1.7. We can call this kernel Ks (s for stupid, or slow).

Performance: 5.7 GFLOPs – SIMD width 32, inverted parallel loop indices next
Still, we’ve only just started. If we inspect the code for kernel K1, we note that the inner loop contains two memory fetches, for b and c; both fetches are from the device memory, which has a very high latency. In particular, the fetch for c loads the same element for all the threads in the thread block. The memory system is designed for high bandwidth when all the threads access consecutive elements, such as with the b access. This used to be called superword access in classical vector machines, where the memory returns 64-bytes (or more) at a time. Kernel K1 doesn’t take advantage of this memory design for the c access, but we can fix that. Let’s strip-mine the k loop, and load a strip of c into the multiprocessor local memory. The pseudo code is:

     parfor( int is = 0; is < n; is += 32 )    /* K2 */
parfor( int j = 0; j < m; ++j )
SIMDfor( int i = is; i < is+32; ++i )
for( int ks = 0; ks < p; ks += 32 )
cb[ks:ks+31] = c[ks+pitch_c*j:ks+31+pitch_c*j];
for( int k = ks; k < ks+32; ++k )
a[i+pitch_a*j] += b[i+pitch_b*k] * cb[k-ks];

Performance: 33 GFLOPs – cached access to c, SIMD width 32
Performance: 55 GFLOPs – cached access to c, SIMD width 64
Performance: 63 GFLOPs – cached access to c, SIMD width 128 next
Note the vector fetch of c into the temporary array cb. This is handled in kernel K2 by letting each thread fetch one element and storing into the multiprocessor local memory, so the inner loop only has one device memory fetch; the fetch of cb from the local memory is almost as fast as a register access; we see the performance improves to 33 GFLOPS, up from 28. We can again increase the number of threads per block from 32 to 64 and 128, and we see performance improve from 33 to 55 and 63 GFLOPs. As with kernel K1, increasing to 256 threads per block does not improve performance. An implementation detail: with more than one warp per thread block, we need to synchronize the warps after loading the temp array cb, and before reloading it the next time around the ks loop; see the CUDA source code for this detail.

Performance: 63 GFLOPs – cached access to c, SIMD width 64, unroll inner loop next
But we’re not done yet. What if we unroll the inner loop, to reduce the loop overhead? We might unroll to a factor of 2 or 4 or even 16. Unrolling the inner loop once in the 64-wide K2 kernel does improve performance, getting 63 GFLOPs, but more unrolling doesn’t help, and it doesn’t help the 128-wide kernel.

So far we’ve got two kernel versions, with variations in the thread-block (vector) size and unrolling. And we’ve only just begun. We tried unrolling the inner k loop; what if we try unrolling one of the outer loops? We could let each kernel instance compute two values of the i loop. The pseudo-code looks like:

      parfor( int is = 0; is < n; is += 64 )    /* K3 */
parfor( int j = 0; j < m; ++j )
SIMDfor( int i = is; i < is+32; ++i )
for( int ks = 0; ks < p; ks += 32 )
cb[ks:ks+31] = c[ks+pitch_c*j:ks+31+pitch_c*j];
for( int k = ks; k < ks+32; ++k )
a[i+pitch_a*j] += b[i+pitch_b*k] * cb[k-ks];
a[i+32+pitch_a*j] += b[i+32+pitch_b*k] * cb[k-ks];

Performance: 53 GFLOPs – cached access to c, SIMD width 32, unroll i loop
Performance: 63 GFLOPs – cached access to c, SIMD width 32, unroll i loop 3x next
Each iteration of the i loop now computes values for i and i+32. We don’t expect much advantage here, since the only values shared between the two i iterations are loaded from the local memory, which is already pretty fast. But even this kernel improves upon K2, with 53 GFLOPs. We can improve this to 63 GFLOPs by unrolling more or increasing the SIMD width to 64.

Next, we can try unrolling the j loop, so each kernel computes values for j and j+1. The pseudo-code is:

      parfor( int is = 0; is < n; is += 32 )    /* K4 */
parfor( int j = 0; j < m; j += 2 )
SIMDfor( int i = is; i < is+32; ++i )
for( int ks = 0; ks < p; ks += 32 )
cb0[ks:ks+31] = c[ks+pitch_c*j:ks+31+pitch_c*j];
cb1[ks:ks+31] = c[ks+pitch_c*(j+1):ks+31+pitch_c*(j+1)];
for( int k = ks; k < ks+64; ++k )
a[i+pitch_a*j] += b[i+pitch_b*k] * cb0[k-ks];
a[i+pitch_a*(j+1)] += b[i+pitch_b*k] * cb1[k-ks];

Performance: 59 GFLOPs – cached access to c, SIMD width 32, unroll j loop
Performance: 98 GFLOPs – cached access to c, SIMD width 64, unroll j loop
Performance: 117 GFLOPs – cached access to c, SIMD width 128, unroll j loop next
Here, we note the two assignments in the k loop fetch the same value of b from the device memory. This version gets 59 GFLOPs; it jumps to 98 GFLOPs when we increase the SIMD width to 64, and again to 117 GFLOPs with a SIMD width of 128. Now we’re starting to see real performance, over 100 GFLOPs, host-to-host.

Performance: 176 GFLOPs – cached access to c, SIMD width 128, unroll j loop 3x next
But we’re not done yet. What if we unroll the j loop by four iterations instead of just two? This involves keeping four partial sums. Now the performance with SIMD width 128 is 176 GFLOPs host-to-host, and over 210 GFLOPs on the device.

Performance: 202 GFLOPs – cached access to c, SIMD width 128, unroll j loop 3x and k loop 1x
Performance: 208 GFLOPs – cached access to c, SIMD width 128, unroll j and k loops 3x next
More unrolling of the j loop doesn’t improve performance, but what if we combine this with unrolling the k loop? If we unroll the j loop 3 times and the k loop once, with SIMD width of 128, we get 202 GFLOPs; unrolling the k 3 times gives us 208 GFLOPs (host-to-host), and 265 GFLOPs (kernel-only).

Our peak performance so far looks pretty good. It took some experimentation, but we have a version that uses only 128 threads and 2KB local memory per thread block, allowing up to 8 thread blocks on each multiprocessor, so taking great advantage of the multithreading properties of the machine. We haven’t even fully explored all the combinations. What if we combine i loop unrolling with the j and k loop unrolling? Should we explore other unroll factors as we combine unrolling multiple loops? What if we use pointer arithmetic instead of array references (this really questions whether NVCC optimizes the array references, but it seems to do a good job there)? I desperately wanted to break the 200 GFLOP barrier, and reached it. The version of sgemm that comes with CUDA BLAS gets about 260 GFLOPs (host-to-host) on a 4096×4096 matrix; I’ve still got some work to do to get that extra 25%.

When we optimize a matmul for a general purpose CPU with a cache, we’ve learned that we need a tiled algorithm. We can do the same thing on the GPU, where we fit the submatrices in the local memory. The pseudo code is:

      parfor( int is = 0; is < n; is += 16 )    /* K5 */
parfor( int js = 0; js < m; js += 16 )
SIMDfor( int i = 0; i < 16; ++i )
SIMDfor( int j = 0; j < 16; ++j )
at[is:is+15][js:js+15] = 0.0; /* init A tile */
for( int ks = 0; ks < p; ks += 16 )
/*load B tile*/ bt[i][ks:ks+15] = b[i+pitch_b*ks:i+pitch_b*(ks+15):pitch_b];
/*load C tile*/ ct[ks:ks+15][j] = c[ks+pitch_c*(js+j):(ks+15)+pitch_c*(js+j)];
for( int k = ks; k < ks+64; ++k )
/*tile MM */ at[i][j] += bt[i][k]*ct[k][j];
a[i+pitch_a*j] = a[i][j]; /* store A tile */

Performance: 164 GFLOPs – tiled loops, SIMD width 16×16, cached access to b and c
We have to choose a tile size, and square tiles seem to make as much sense as any other shape, at least to start with. We choose 16×16 tiles and run 256 threads in a thread group, so each thread will compute one element of the a tile; this lets us keep that element in a register. The actual kernel code is slightly more complex than the previous kernels. It’s important to recall that this scalar kernel is one of a thread group or cohort of 256 cooperating instances, and it only works in that domain. This version gives us 164 GFLOPs, not quite as good as we’ve already seen. Why not? One reason is the thread group is 256 threads, so we hit the 1024 threads/multiprocessor limit with only four thread groups. We can address that as well, but I still haven’t quite reached the peak performance shown on kernel K4. The CUDA blas (260 GFLOPs) sgemm is similar to this tiled version. It’s based on work by Vasily Volkov, a Computer Science PhD student at Cal; Vasily’s code uses a 16×4 thread block with the i loop unrolled by 16, the j loop by 4, and the k loop by 16 (if I read it right). The code might be hard to follow, but it sure beats trying to code a matmul in DirectX. PGI GPU Matmul Performance Chart
So let’s suppose we’ve decided on the K4 algorithm. It assumes that the matrix sizes are multiples of 32 (or 64 or 128) in all dimensions, though it doesn’t require the matrices to be square. One way to satisfy this is to pad all your matrices, filling in zeroes in the extra rows and columns. Matrix addition and multiplication will preserve these zeroes and will not pollute the actual values; this may be your best option. Another solution is to add conditionals so as to not run off the ends of the matrices. This complicates the code and can affect performance. The simplest method to test for array limits is to put conditionals around the device memory fetch and store operations; if we fill in zeros to the b and c tiles, the innermost loop won’t need any tests. I reproduce the body of the kernel here:

        float sum0 = 0.0, sum1 = 0.0;
for( int ks = 0; ks < p; ks += 32 ){
if( ks+tx < p && j < m )
cb0[tx] = c[ks+tx+pitch_c*j];
cb0[tx] = 0.0;
if( ks+tx < p && j+1 < m )
cb1[tx] = c[ks+tx+pitch_c*(j+1)];
cb1[tx] = 0.0;
if( i < n ){
for( int k = ks; k < ((ks+32 < m) ? ks+32 : m); ++k ){
float rb = b[i+pitch_b*k];
sum0 += rb * cb0[k-ks];
sum1 += rb * cb1[k-ks];
if( i < n && j < m )
a[i+pitch_a*j] = sum0;
if( i < n && j+1 < m )
a[i+pitch_a*(j+1)] = sum1;

Even if i and j are outside the matrix bounds, we can’t just skip the body of the loop for two reasons. First, each thread is part of a thread group, and as such it loads part of the data into the local temporary arrays cb0 and cb1; even if this thread has nothing to compute, it has to do its part of the shared work. Second, we have those pesky barrier synchronizations; all threads in a thread group must participate in the barrier, so even if this thread has no work to do, it had better reach those barriers.

These tests cost about 5% in performance, in the simplest version of K4. It’s less costly in the more complex versions, but the code gets messy when mixed with some of the unrolling. But it will work with any matrix size, whereas K4 requires the size to be a multiple of 32.

Of course, if you need to deliver a library that works regardless of the matrix sizes, you have another option. You can create two versions of your routine, a faster one that works when the matrix sizes are appropriate multiples of 16, and a slower, general purpose one that works for other matrix sizes, with a conditional test to execute the right one. Then you get your good benchmark numbers (all benchmarks use large powers of two, right?), and you get right answers, too.


The point I’ve tried to make is how sensitive the performance of the GPU is to the formulation of your kernel, and how much and what kind of experimentation you’ll need to do to optimize your performance. How much of the optimization process will carry over from one GPU to another, or from one generation to the next from the same vendor? Many programmers like this low level of control, and it certainly could be appropriate when developing a numerical library, in the same way assembly language is appropriate.

To be fair, the same is true on your CPU as well; you need to optimize your matmul for (packed) vector operations, memory strides, and cache locality. A bad program will run several times slower than a good one; ordering the matmul loops so the inner loop is non-stride-1 reduces the performance on large matrices (on our Penryn) by more than a factor of 10. But compilers and tools are far more mature and helpful when compiling for an x64, IBM POWER, Sun SPARC, or other CPU.

I’m sure many readers would like to tell me (again) that I should be using the prebuilt library version of sgemm for matmul, not writing my own. Save your breath. Matmul is just one simple example here, three loops, three matrices, lots of parallelism, and yet I put in several days of work to get this seven line loop optimized for the GPU.

We can compare the evolution of GPU programming to the evolution of shared-memory parallel programming. There were many new languages designed to take advantage of parallelism (Id, SISAL, many others). Many low-level libraries were written to create and manage threads, eventually standardizing on Posix threads (aka pthreads). Much work was done on automatic parallelization, dating back to the 1960s and 1970s. When successful commercial shared-memory multiprocessors became more widely available in the 1980s, an effort began to standardize a less intrusive programming interface for multiple processors, driven by multiprocessor workstations, eventually resulting in the OpenMP API, which defines directives and a runtime interface to a shared-memory parallel programming model.

GPUs have their own set of domain-specific languages, including GLSL (OpenGL Shading Language), HLSL (high level shader language) from Microsoft for DirectX, and Cg (C for graphics) from NVIDIA. We’re now in a period with development of low-level libraries and interfaces to create and manage GPU threads; the OpenCL effort aims to standardize this. The cost to port a nontrivial application to this model is high, though the potential performance is alluring.

Luckily for me, my application (the compiler) runs on the host, and I don’t have to port that. But what about the real application programmer, who has thousands (or hundreds of thousands) of lines of code? Is it feasible to take GPU or accelerator programming concepts, abstract them into a predictable and useful programming model, and present them using a portable programming interface, in the same way that OpenMP abstracts and presents multiprocessor and multicore systems? That’s a topic for my next column.


Michael Wolfe has developed compilers for over 30 years in both academia and industry, and is now a senior compiler engineer at The Portland Group, Inc. (, a wholly-owned subsidiary of STMicroelectronics, Inc. The opinions stated here are those of the author, and do not represent opinions of The Portland Group, Inc. or STMicroelectronics, Inc.

Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industy updates delivered to you every week!

Machine Learning at HPC User Forum: Drilling into Specific Use Cases

September 22, 2017

The 66th HPC User Forum held September 5-7, in Milwaukee, Wisconsin, at the elegant and historic Pfister Hotel, highlighting the 1893 Victorian décor and art of “The Grand Hotel Of The West,” contrasted nicely with Read more…

By Arno Kolster

Google Cloud Makes Good on Promise to Add Nvidia P100 GPUs

September 21, 2017

Google has taken down the notice on its cloud platform website that says Nvidia Tesla P100s are “coming soon.” That's because the search giant has announced the beta launch of the high-end P100 Nvidia Tesla GPUs on t Read more…

By George Leopold

Cray Wins $48M Supercomputer Contract from KISTI

September 21, 2017

It was a good day for Cray which won a $48 million contract from the Korea Institute of Science and Technology Information (KISTI) for a 128-rack CS500 cluster supercomputer. The new system, equipped with Intel Xeon Scal Read more…

By John Russell

HPE Extreme Performance Solutions

HPE Prepares Customers for Success with the HPC Software Portfolio

High performance computing (HPC) software is key to harnessing the full power of HPC environments. Development and management tools enable IT departments to streamline installation and maintenance of their systems as well as create, optimize, and run their HPC applications. Read more…

Adolfy Hoisie to Lead Brookhaven’s Computing for National Security Effort

September 21, 2017

Brookhaven National Laboratory announced today that Adolfy Hoisie will chair its newly formed Computing for National Security department, which is part of Brookhaven’s new Computational Science Initiative (CSI). Read more…

By John Russell

Machine Learning at HPC User Forum: Drilling into Specific Use Cases

September 22, 2017

The 66th HPC User Forum held September 5-7, in Milwaukee, Wisconsin, at the elegant and historic Pfister Hotel, highlighting the 1893 Victorian décor and art o Read more…

By Arno Kolster

Stanford University and UberCloud Achieve Breakthrough in Living Heart Simulations

September 21, 2017

Cardiac arrhythmia can be an undesirable and potentially lethal side effect of drugs. During this condition, the electrical activity of the heart turns chaotic, Read more…

By Wolfgang Gentzsch, UberCloud, and Francisco Sahli, Stanford University

PNNL’s Center for Advanced Tech Evaluation Seeks Wider HPC Community Ties

September 21, 2017

Two years ago the Department of Energy established the Center for Advanced Technology Evaluation (CENATE) at Pacific Northwest National Laboratory (PNNL). CENAT Read more…

By John Russell

Exascale Computing Project Names Doug Kothe as Director

September 20, 2017

The Department of Energy’s Exascale Computing Project (ECP) has named Doug Kothe as its new director effective October 1. He replaces Paul Messina, who is stepping down after two years to return to Argonne National Laboratory. Kothe is a 32-year veteran of DOE’s National Laboratory System. Read more…

Takeaways from the Milwaukee HPC User Forum

September 19, 2017

Milwaukee’s elegant Pfister Hotel hosted approximately 100 attendees for the 66th HPC User Forum (September 5-7, 2017). In the original home city of Pabst Blu Read more…

By Merle Giles

Kathy Yelick Charts the Promise and Progress of Exascale Science

September 15, 2017

On Friday, Sept. 8, Kathy Yelick of Lawrence Berkeley National Laboratory and the University of California, Berkeley, delivered the keynote address on “Breakthrough Science at the Exascale” at the ACM Europe Conference in Barcelona. In conjunction with her presentation, Yelick agreed to a short Q&A discussion with HPCwire. Read more…

By Tiffany Trader

DARPA Pledges Another $300 Million for Post-Moore’s Readiness

September 14, 2017

The Defense Advanced Research Projects Agency (DARPA) launched a giant funding effort to ensure the United States can sustain the pace of electronic innovation vital to both a flourishing economy and a secure military. Under the banner of the Electronics Resurgence Initiative (ERI), some $500-$800 million will be invested in post-Moore’s Law technologies. Read more…

By Tiffany Trader

IBM Breaks Ground for Complex Quantum Chemistry

September 14, 2017

IBM has reported the use of a novel algorithm to simulate BeH2 (beryllium-hydride) on a quantum computer. This is the largest molecule so far simulated on a quantum computer. The technique, which used six qubits of a seven-qubit system, is an important step forward and may suggest an approach to simulating ever larger molecules. Read more…

By John Russell

How ‘Knights Mill’ Gets Its Deep Learning Flops

June 22, 2017

Intel, the subject of much speculation regarding the delayed, rewritten or potentially canceled “Aurora” contract (the Argonne Lab part of the CORAL “ Read more…

By Tiffany Trader

Reinders: “AVX-512 May Be a Hidden Gem” in Intel Xeon Scalable Processors

June 29, 2017

Imagine if we could use vector processing on something other than just floating point problems.  Today, GPUs and CPUs work tirelessly to accelerate algorithms Read more…

By James Reinders

NERSC Scales Scientific Deep Learning to 15 Petaflops

August 28, 2017

A collaborative effort between Intel, NERSC and Stanford has delivered the first 15-petaflops deep learning software running on HPC platforms and is, according Read more…

By Rob Farber

Oracle Layoffs Reportedly Hit SPARC and Solaris Hard

September 7, 2017

Oracle’s latest layoffs have many wondering if this is the end of the line for the SPARC processor and Solaris OS development. As reported by multiple sources Read more…

By John Russell

Six Exascale PathForward Vendors Selected; DoE Providing $258M

June 15, 2017

The much-anticipated PathForward awards for hardware R&D in support of the Exascale Computing Project were announced today with six vendors selected – AMD Read more…

By John Russell

Russian Researchers Claim First Quantum-Safe Blockchain

May 25, 2017

The Russian Quantum Center today announced it has overcome the threat of quantum cryptography by creating the first quantum-safe blockchain, securing cryptocurrencies like Bitcoin, along with classified government communications and other sensitive digital transfers. Read more…

By Doug Black

Top500 Results: Latest List Trends and What’s in Store

June 19, 2017

Greetings from Frankfurt and the 2017 International Supercomputing Conference where the latest Top500 list has just been revealed. Although there were no major Read more…

By Tiffany Trader

IBM Clears Path to 5nm with Silicon Nanosheets

June 5, 2017

Two years since announcing the industry’s first 7nm node test chip, IBM and its research alliance partners GlobalFoundries and Samsung have developed a proces Read more…

By Tiffany Trader

Leading Solution Providers

Nvidia Responds to Google TPU Benchmarking

April 10, 2017

Nvidia highlights strengths of its newest GPU silicon in response to Google's report on the performance and energy advantages of its custom tensor processor. Read more…

By Tiffany Trader

Graphcore Readies Launch of 16nm Colossus-IPU Chip

July 20, 2017

A second $30 million funding round for U.K. AI chip developer Graphcore sets up the company to go to market with its “intelligent processing unit” (IPU) in Read more…

By Tiffany Trader

Google Debuts TPU v2 and will Add to Google Cloud

May 25, 2017

Not long after stirring attention in the deep learning/AI community by revealing the details of its Tensor Processing Unit (TPU), Google last week announced the Read more…

By John Russell

Google Releases Deeplearn.js to Further Democratize Machine Learning

August 17, 2017

Spreading the use of machine learning tools is one of the goals of Google’s PAIR (People + AI Research) initiative, which was introduced in early July. Last w Read more…

By John Russell

EU Funds 20 Million Euro ARM+FPGA Exascale Project

September 7, 2017

At the Barcelona Supercomputer Centre on Wednesday (Sept. 6), 16 partners gathered to launch the EuroEXA project, which invests €20 million over three-and-a-half years into exascale-focused research and development. Led by the Horizon 2020 program, EuroEXA picks up the banner of a triad of partner projects — ExaNeSt, EcoScale and ExaNoDe — building on their work... Read more…

By Tiffany Trader

Amazon Debuts New AMD-based GPU Instances for Graphics Acceleration

September 12, 2017

Last week Amazon Web Services (AWS) streaming service, AppStream 2.0, introduced a new GPU instance called Graphics Design intended to accelerate graphics. The Read more…

By John Russell

Cray Moves to Acquire the Seagate ClusterStor Line

July 28, 2017

This week Cray announced that it is picking up Seagate's ClusterStor HPC storage array business for an undisclosed sum. "In short we're effectively transitioning the bulk of the ClusterStor product line to Cray," said CEO Peter Ungaro. Read more…

By Tiffany Trader

GlobalFoundries: 7nm Chips Coming in 2018, EUV in 2019

June 13, 2017

GlobalFoundries has formally announced that its 7nm technology is ready for customer engagement with product tape outs expected for the first half of 2018. The Read more…

By Tiffany Trader

  • arrow
  • Click Here for More Headlines
  • arrow
Share This