The Heterogeneous Programming Jungle

By Michael Wolfe

March 19, 2012

There’s a lot of talk now about heterogeneous computing, but here I want to focus on heterogeneous programming. There are several, perhaps many, approaches being developed to program heterogeneous systems, but I would argue that none of them have proven that they successfully address the real goal. This article will discuss a range of potentially interesting heterogeneous systems for HPC, why programming them is hard, and why developing a high level programming model is even harder.

I’ve always said that parallel programming is intrinsically hard and will remain so. Parallel programming is all about performance. If you’re not interested in performance, save yourself the headache and just use a single thread. To get performance, your program has to create parallel activities, insert synchronization between them, and manage data locality.

The heterogeneous systems of interest to HPC use an attached coprocessor or accelerator that is optimized for certain types of computation. These devices typically exhibit internal parallelism, and execute asynchronously and concurrently with the host processor. Programming a heterogeneous system is then even more complex than “traditional” parallel programming (if any parallel programming can be called traditional), because in addition to the complexity of parallel programming on the attached device, the program must manage the concurrent activities between the host and device, and manage data locality between the host and device.

A Heterogeneous System Zoo

Before we embark on a discussion of the jungle of programming options, it’s instructive to explore the range of heterogeneous systems that are currently in use, being designed, or being considered, from GPUs to Intel MIC, DSPs and beyond.

  • Intel/AMD X86 host + NVIDIA GPUs (x86+GPU). This is the most common heterogeneous HPC system we see today, including 35 of the Top 500 supercomputers in the November 2011 list. The GPU has its own device memory, which is connected to the host via PCIe. Data must be allocated on and moved to the GPU memory, and parallel kernels are launched asynchronously on the GPU by the host.
  • AMD Opteron + AMD GPUs. This is essentially another x86+GPU option, using AMD Firestream instead of NVIDIA. The general structure is the same as above.
  • AMD Opteron + AMD APU. The AMD Heterogeneous System Architecture (formerly AMD Fusion) could easily become a player in this market, given that the APU (Accelerated Processor Unit) is integrated on the chip with the Opteron cores. Current offerings in this line are programmed much like x86+GPU. The host and APU share physical memory, but the memory is partitioned. It’s very like the x86+GPU case, except that a data copy between the two memory spaces can run at memory speeds, instead of at PCI speed. Again, parallel kernels are launched asynchronously on the APU by the host. AMD has announced aggressive plans for this product line in the future.
  • Intel Core + Intel Ivy Bridge integrated GPU. The on-chip GPU on the next generation Ivy Bridge processor is reported to be OpenCL programmable, allowing heterogeneous programming models as well. Intel’s Sandy Bridge has an integrated GPU, but it seems not to support OpenCL.
  • Intel Core + Intel MIC. Reportedly, the MIC is the highly parallel technical computing accelerator architecture for Intel. Currently, the MIC is on a PCIe card like a GPU, so it has the same data model as x86+GPU. The MIC could support parallel kernels like a GPU, but it also can support OpenMP parallelism or dynamic task parallelism among the cores on the chip as well. An early Intel Larrabee article (also published in PLDI 2009) described a programming model that supported virtual shared memory between the host and Larrabee, but I haven’t heard whether the MIC product includes support for that model.
  • NVIDIA Denver: ARM + NVIDIA GPU. As far as I know, this is not yet a product, and could look like the AMD APU, except for the host CPU and accelerator instruction sets.
  • Texas Instruments: ARM + TI DSPs. A recent HPCwire article described TI’s potential move (return, actually) to HPC using multiple 10GHz DSP chips. It’s pretty early to talk about system architecture or programming strategy for these.
  • Convey Intel x86 + FPGA-implemented reconfigurable vector unit. I’ve always thought of the Convey machine as a coprocessor or accelerator system with a really interesting implementation. One of the key advantages of the system is that the coprocessor memory controllers live in the same virtual address space as the host memory. It’s separate physical memory, but the host can access the coprocessor memory, and vice versa, though with a performance cost. Moreover, the programming model looks more like vector programming than like coprocessor offloading.
  • GP core + some other multicore or stream accelerators. Possibilities include a Tilera multicore or the Chinese FeiTeng FT64. Again, it’s a little early to talk about system architecture or programming strategy.
  • GP core + FPGA fabric. Convey uses FPGAs to implement the reconfigurable vector unit, but you could consider a more customizable unit as well. There were plenty of FPGA products being displayed at SC11 in Seattle, but none were as well integrated as Convey’s, nor did they have as strong a programming environment. The potential for fully custom functional units with application-specific, variable-length datatypes is attractive in the abstract, but except for a few success stories in financial and bioinformatics, this approach has a long way to go to gain much traction in HPC.
  • IBM Power + Cell. This was more common a couple years ago. While there are still a few Cell-equipped supercomputers on the Top 500 list, IBM has pulled the plug on the PowerXCell 8i product, so expect these to vanish.

In the HPC space, there seem to be only two lonely vendors still dedicated to CPU-only solutions: IBM and the Blue Gene family, and future Fujitsu SPARC-based systems, follow-ons to the Fujitsu K computer.


Despite the wide variety of heterogeneous systems, there is surprising similarity among most or all of the various designs. All the systems allow the attached device to execute asynchronously with the host. You would expect this, especially since most of the devices are programmable devices themselves, but even the tightly-connected Convey coprocessor unit executes asynchronously.

All the systems exhibit several levels of parallelism within the coprocessor.

  • Typically, the coprocessor has several, and sometimes many, execution units (I’ll avoid over-using the word processor). NVIDIA Fermi GPUs have 16 Streaming Multiprocessors (SMPs); AMD GPUs have 20 or more SIMD units; Intel MIC will have 50+ x86 cores. If your program doesn’t take advantage of at least that much parallelism, you won’t be getting anywhere near the benefit of the coprocessor.
  • Each execution unit typically has SIMD or vector execution. NVIDIA GPUs execute threads in SIMD-like groups of 32 (what NVIDIA calls warps); AMD GPUs execute in wavefronts that are 64-threads wide; the Intel MIC, like its predecessor Larrabee, has 512-bit wide SIMD instructions (16 floats or 8 doubles). Again, if your application doesn’t take advantage of that much SIMD parallelism, your performance is going to suffer by the same factor.

Since these devices are used for processing large blocks of data, memory latency is a problem. Cache memory doesn’t help when the dataset is larger than the cache. One method to address the problem is to use a high bandwidth memory, such as Convey’s Scatter-Gather memory. The other is to add multithreading, where the execution unit saves the state of two or more threads, and can swap execution between threads in a single cycle. This strategy can range from swapping between threads at a cache miss, or alternating between threadson every cycle. While one thread is waiting for memory, the execution unit keeps busy by switching to a different thread. To be effective, the program needs to expose even more parallelism, which will be exploited via multithreading. Intel Larrabee / Knights Ferry has four-way multithreading per core, whereas GPUs have a multithreading factor of 20 or 30. Your program will need to provide a factor of somewhere between four and twenty times more parallel activities for the best performance. Moreover, the programming model may want to expose the difference between multithreading and multiprocessing. When tuning for a multithreaded multiprocessor, the programmer may be able to share data among threads that share the same execution unit, since they will share cache resources and can synchronize efficiently, and distribute data across threads that use different execution units.

But most importantly, the attached device has its own path to memory, usually to a separate memory unit. These designs fall into three categories:

  • Separate physical memory: Current discrete GPUs and the upcoming Intel MIC have a separate physical memory connected to the attached device, not directly connected to the host processor, and often even not accessible from the host. The device is implemented as a separate card with its own memory. There may be support for the device accessing host memory directly, or the host accessing device memory, but at a significant performance cost.
  • Partitioned physical memory: Today’s AMD Fusion processor chips fall into this category, as do low-end systems (such as laptops) using a motherboard-integrated GPU. There is one physical memory, but some fraction of that memory is dedicated to the APU or GPU. Logically, it looks like a separate physical memory, except copying data between the two spaces is faster, because both spaces are on the same memory bus, and moving data to the APU or GPU is slower, because the CPU main memory doesn’t have the bandwidth of, say, a good graphics memory system typical for a GPU.
  • Separate physical memory, one virtual memory: The Convey system fits this category. The coprocessor has its own memory controllers to its own memory subsystem, but both the CPU and coprocessor physical memories are mapped to a single virtual address space. It becomes part of application optimization to make sure to allocate data in the coprocessor physical memory, for instance, if it will be mostly accessed by coprocessor instructions.

Programming Model Goals

Given the similarities among system designs, one might think it should be obvious how to come up with a programming strategy that would preserve portability and performance across all these devices. What we want is a method that allows the application writer to write a program once, and let the compiler or runtime optimize for each target. Is that too much to ask?

Let me reflect momentarily on the two gold standards in this arena. The first is high level programming languages in general. After 50 years of programming using Algol, Pascal, Fortran, C, C++, Java, and many, many other languages, we tend to forget how wonderful and important it is that we can write a single program, compile it, run it, and get the same results on any number of different processors and operating systems.

Second, let’s look back at vector computing. The HPC space 30 years ago was dominated by vector computers: Cray, NEC, Fujitsu, IBM, Convex, and more. Building on the compiler work from very early supercomputers (such as TI’s ASC) and some very good academic research (at Illinois, Rice, and others), these vendors produced vectorizing compilers that could generate pretty good vector code from loops in your program. It was not the only way to get vector performance from your code. You could always use a vector library or add intrinsics to your program. But vectorizing compilers were the dominant vector programming method.

Vectorizing compilers were successful for three very important reasons. The compilers not only attempted to vectorize your loops, they gave some very specific user feedback when they failed. How often does your optimizing compiler tell you when it failed to optimize your code? Never, most likely. In fact, you would find it annoying if it printed a message every time it couldn’t float a computation out of a loop, say. However, the difference between vector performance and non-vector performance was a factor of 5 or 10, or more in some cases. That performance should not be left on the table. If a programmer is depending on the compiler to generate vector code, he or she really wants to know if the compiler was successful. And the feedback could be quite specific. Not just failed to vectorize the loop at line 25, but an unknown variable N in the second subscript of the array reference to A on line 27 prevents vectorization. So the first reason vectorizing compilers were successful is that the programmer used this feedback to rewrite that portion of the code and eventually reach vector performance. The second reason is that this feedback slowly trained the programmer how to write his or her next program so it would vectorize automatically.

The third reason, and ultimately the most important, is that the style of programming that vectorizing compilers promoted gave good performance across a wide range of machines. Programmers learned to make the inner loops be stride-1, avoid conditionals, inline functions (or sink loops into the functions), pull I/O out of the inner loops, and identify data dependences that prevent vectorization. Programs written in this style would then vectorize with the Cray compiler, as well as the IBM, NEC, Fujitsu, Convex, and others. Without ever collaborating on a specification, the vendors trained their users on how to write performance-portable vector programs.

I claim that what we want is a programming strategy, model or language that will promote programming in a style that will give good performance across a wide range of heterogeneous systems. It doesn’t necessarily have to be a new language. As with the vectorizing compilers lesson, if we can create a set of coding rules that will allow compilers and tools to exploit the parallelism effectively, that’s probably good enough. But there are several factors that make this hard.

Why It’s Hard

Parallel programming is hard enough to begin with. Now we have to deal not only with the parallelism we’re accustomed to on our multicore CPUs, we have to deal with an attached asynchronous device, as well as with the parallelism on that device.

To get high performance parallel code, you have to optimize locality and synchronization as well. For these devices, locality optimization mostly boils down to managing the distinct host and device memory spaces. Someone has to decide what data gets allocated in which memory, and whether or when to move that data to the other memory.

Many systems will have multiple coprocessors at each node, each with its own memory. Users are going to want to exploit all the resources available, and that may mean managing not just one coprocessor, but two or more. Suddenly you have not just a data movement problem, but a data distribution problem, and perhaps load balancing issues, too. These are issues that were addressed partly by High Performance Fortran in the 1990s. Some data has to be distributed among the memories, some has to be replicated, and some has to be shared or partially shared. And remember that these coprocessors are typically connected to some pretty high performance CPUs. Let’s not just leave the CPU idle while the coprocessor is busy, let’s distribute the work (and data) across the CPU cores as well as the coprocessor(s).

Most of the complexity comes from the heterogeneity itself. The coprocessor has a different instruction set, different performance characteristics, is optimized for different algorithms and applications than the host, and is optimized to work from its own memory. The goal of HPC is the HP part, the high performance part, and we need to be able to take advantage of the features of the coprocessor to get this performance.

The challenge of designing a higher level programming model or strategy is deciding what to virtualize and what to expose. Successful virtualization preserves productivity, performance, and portability. Vectorizing compilers were successful at virtualizing the vector instruction set; although they exposed the presence of vector instructions, they virtualized the details of the instruction set, such as the vector register length, instruction spellings, and so on. Vectorizing compilers are still in use today, generating code for the x86 SIMD (SSE and AVX) and Power Altivec instructions. There are other ways to generate these instructions, such as the Intel SSE Intrinsics, but these can hardly be said to preserve productivity, and certainly do not promote portability. You might also use a set of vector library routines, such as the BLAS library, or C++ vector operations in the STL, but these don’t compose well, and can easily become memory-bandwidth bound.

Another alternative is vector or array extensions to the language, such as Fortran array assignments or Intel’s Array Notation for C. However, while these allow the compiler to more easily generate vector code, it doesn’t mean it’s better code than an explicit loop that gets vectorized. For example, compiling and vectorizing the following loop for SSE:

    do i = 1, n       x = a(i) + b(i)       c(i) = exp(x) + 1/x     enddo 

can be done by loading four elements of a and b into SSE registers, adding them, dividing into one, calling a vector exp routine, adding that result to the divide result, and storing those four elements into c. The intermediate result xnever gets stored. The equivalent array code would be:

    x(:) = a(:) + b(:)     c(:) = exp(x(:)) + 1/x(:) 

The array assignments simplify the analysis to determine whether vector code can be generated, but the compiler has to do effectively the same amount of work to generate efficient code. In Fortran and Intel C, these array assignments are defined as computing the whole right hand side, then doing all the stores. The effect is as if the code were written as:

    forall(i=1:n) temp(i) = a(i) + b(i)     forall(i=1:n) x(i) = temp(i)     forall(i=1:n) temp(i) = exp(x(i)) + 1/x(i)     forall(i=1:n) c(i) = temp(i) 

where the temp array is allocated and managed by the compiler. The first analysis is to determine whether the temp array can be discarded. In some cases it cannot (such as a(2:n-1) = a(1:n-2) + a(3:n)), and the analysis to determine this is effectively the same dependence analysis as to vectorize the loop. If successful, the compiler is left with:

    forall(i=1:n) x(i) = a(i) + b(i)     forall(i=1:n) c(i) = exp(x(i)) + 1/x(i) 

Then the compiler needs to determine whether it can fuse these two loops. The advantage of fusing is avoiding the reload of the SSE register holding the intermediate value x. This analysis is essentially the same as for discarding the temparray above. Assuming that’s successful, we get:

    forall(i=1:n)       x(i) = a(i) + b(i)       c(i) = exp(x(i)) + 1/x(i)     endforall 

Now we want the compiler to determine whether the array x is needed at all. In the original loop, xwas a scalar, and compiler lifetime analysis for scalars is precise. For arrays, it’s much more difficult, and sometimes intractable. At best, the code generated from the array assignments is as good as that from the vectorized loop. More likely, it will generate more memory accesses, and for large datasets, more cache misses.

At the minimum, the programming model should virtualize those aspects that are different among target systems. For instance, high level languages virtualize instruction sets and registers. The compilers virtualize instruction-level parallelism by scheduling instructions automatically. Operating systems virtualize the fixed physical memory size of the system with virtual memory, and virtualize the effect of multiple users by time slicing.

The model has to strike a balance between virtualizing a feature and perhaps losing performance or losing the ability to tune for that feature, versus exposing that feature and improving potential performance at the cost of productivity. For instance, one way to manage separate physical memories is to essentially emulate shared memory by utilizing virtual memory hardware, using demand paging to move data as needed from one physical memory to another. Where it works, it completely hides the separate memories, but it also makes it hard to optimize your program for separate memories, in part because the memory sharing is done at a hardware-defined granularity (virtual memory page) instead of an application-defined granularity.

Grab your Machete and Pith Helmet

If parallel programming is hard, heterogeneous programming is that hard, squared. Defining and building a productive, performance-portable heterogeneous programming system is hard. There are several current programming strategies that attempt to solve this problem, including OpenCL, Microsoft C++AMP, Google Renderscript, Intel’s proposed offload directives (see slide 24), and the recent OpenACC specification. We might also learn something from embedded system programming, which has had to deal with heterogeneous systems for many years. My next article will whack through the underbrush to expose each of these programming strategies in turn, presenting advantages and disadvantages relative to the goal.

About the Author

Michael Wolfe has developed compilers for over 30 years in both academia and industry, and is now a senior compiler engineer at The Portland Group, Inc. (, a wholly-owned subsidiary of STMicroelectronics, Inc. The opinions stated here are those of the author, and do not represent opinions of The Portland Group, Inc. or STMicroelectronics, Inc.

Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industy updates delivered to you every week!

Exascale Escapes 2018 Budget Axe; Rest of Science Suffers

May 23, 2017

President Trump's proposed $4.1 trillion FY 2018 budget is good for U.S. exascale computing development, but grim for the rest of science and technology spend Read more…

By Tiffany Trader

Hedge Funds (with Supercomputing help) Rank First Among Investors

May 22, 2017

In case you didn’t know, The Quants Run Wall Street Now, or so says a headline in today’s Wall Street Journal. Quant-run hedge funds now control the largest Read more…

By John Russell

IBM, D-Wave Report Quantum Computing Advances

May 18, 2017

IBM said this week it has built and tested a pair of quantum computing processors, including a prototype of a commercial version. That progress follows an an Read more…

By George Leopold

PRACEdays 2017 Wraps Up in Barcelona

May 18, 2017

Barcelona has been absolutely lovely; the weather, the food, the people. I am, sadly, finishing my last day at PRACEdays 2017 with two sessions: an in-depth loo Read more…

By Kim McMahon

HPE Extreme Performance Solutions

Exploring the Three Models of Remote Visualization

The explosion of data and advancement of digital technologies are dramatically changing the way many companies do business. With the help of high performance computing (HPC) solutions and data analytics platforms, manufacturers are developing products faster, healthcare providers are improving patient care, and energy companies are improving planning, exploration, and production. Read more…

US, Europe, Japan Deepen Research Computing Partnership

May 18, 2017

On May 17, 2017, a ceremony was held during the PRACEdays 2017 conference in Barcelona to announce the memorandum of understanding (MOU) between PRACE in Europe Read more…

By Tiffany Trader

NSF, IARPA, and SRC Push into “Semiconductor Synthetic Biology” Computing

May 18, 2017

Research into how biological systems might be fashioned into computational technology has a long history with various DNA-based computing approaches explored. N Read more…

By John Russell

DOE’s HPC4Mfg Leads to Paper Manufacturing Improvement

May 17, 2017

Papermaking ranks third behind only petroleum refining and chemical production in terms of energy consumption. Recently, simulations made possible by the U.S. D Read more…

By John Russell

PRACEdays 2017: The start of a beautiful week in Barcelona

May 17, 2017

Touching down in Barcelona on Saturday afternoon, it was warm, sunny, and oh so Spanish. I was greeted at my hotel with a glass of Cava to sip and treated to a Read more…

By Kim McMahon

Exascale Escapes 2018 Budget Axe; Rest of Science Suffers

May 23, 2017

President Trump's proposed $4.1 trillion FY 2018 budget is good for U.S. exascale computing development, but grim for the rest of science and technology spend Read more…

By Tiffany Trader

Cray Offers Supercomputing as a Service, Targets Biotechs First

May 16, 2017

Leading supercomputer vendor Cray and datacenter/cloud provider the Markley Group today announced plans to jointly deliver supercomputing as a service. The init Read more…

By John Russell

HPE’s Memory-centric The Machine Coming into View, Opens ARMs to 3rd-party Developers

May 16, 2017

Announced three years ago, HPE’s The Machine is said to be the largest R&D program in the venerable company’s history, one that could be progressing tow Read more…

By Doug Black

What’s Up with Hyperion as It Transitions From IDC?

May 15, 2017

If you’re wondering what’s happening with Hyperion Research – formerly the IDC HPC group – apparently you are not alone, says Steve Conway, now senior V Read more…

By John Russell

Nvidia’s Mammoth Volta GPU Aims High for AI, HPC

May 10, 2017

At Nvidia's GPU Technology Conference (GTC17) in San Jose, Calif., this morning, CEO Jensen Huang announced the company's much-anticipated Volta architecture a Read more…

By Tiffany Trader

HPE Launches Servers, Services, and Collaboration at GTC

May 10, 2017

Hewlett Packard Enterprise (HPE) today launched a new liquid cooled GPU-driven Apollo platform based on SGI ICE architecture, a new collaboration with NVIDIA, a Read more…

By John Russell

IBM PowerAI Tools Aim to Ease Deep Learning Data Prep, Shorten Training 

May 10, 2017

A new set of GPU-powered AI software announced by IBM today brings automation to many of the tedious, time consuming and complex aspects of AI project on-rampin Read more…

By Doug Black

Bright Computing 8.0 Adds Azure, Expands Machine Learning Support

May 9, 2017

Bright Computing, long a prominent provider of cluster management tools for HPC, today released version 8.0 of Bright Cluster Manager and Bright OpenStack. The Read more…

By John Russell

Quantum Bits: D-Wave and VW; Google Quantum Lab; IBM Expands Access

March 21, 2017

For a technology that’s usually characterized as far off and in a distant galaxy, quantum computing has been steadily picking up steam. Just how close real-wo Read more…

By John Russell

Trump Budget Targets NIH, DOE, and EPA; No Mention of NSF

March 16, 2017

President Trump’s proposed U.S. fiscal 2018 budget issued today sharply cuts science spending while bolstering military spending as he promised during the cam Read more…

By John Russell

Google Pulls Back the Covers on Its First Machine Learning Chip

April 6, 2017

This week Google released a report detailing the design and performance characteristics of the Tensor Processing Unit (TPU), its custom ASIC for the inference Read more…

By Tiffany Trader

HPC Compiler Company PathScale Seeks Life Raft

March 23, 2017

HPCwire has learned that HPC compiler company PathScale has fallen on difficult times and is asking the community for help or actively seeking a buyer for its a Read more…

By Tiffany Trader

Nvidia Responds to Google TPU Benchmarking

April 10, 2017

Last week, Google reported that its custom ASIC Tensor Processing Unit (TPU) was 15-30x faster for inferencing workloads than Nvidia's K80 GPU (see our coverage Read more…

By Tiffany Trader

CPU-based Visualization Positions for Exascale Supercomputing

March 16, 2017

Since our first formal product releases of OSPRay and OpenSWR libraries in 2016, CPU-based Software Defined Visualization (SDVis) has achieved wide-spread adopt Read more…

By Jim Jeffers, Principal Engineer and Engineering Leader, Intel

TSUBAME3.0 Points to Future HPE Pascal-NVLink-OPA Server

February 17, 2017

Since our initial coverage of the TSUBAME3.0 supercomputer yesterday, more details have come to light on this innovative project. Of particular interest is a ne Read more…

By Tiffany Trader

Nvidia’s Mammoth Volta GPU Aims High for AI, HPC

May 10, 2017

At Nvidia's GPU Technology Conference (GTC17) in San Jose, Calif., this morning, CEO Jensen Huang announced the company's much-anticipated Volta architecture a Read more…

By Tiffany Trader

Leading Solution Providers

Facebook Open Sources Caffe2; Nvidia, Intel Rush to Optimize

April 18, 2017

From its F8 developer conference in San Jose, Calif., today, Facebook announced Caffe2, a new open-source, cross-platform framework for deep learning. Caffe2 is Read more…

By Tiffany Trader

Tokyo Tech’s TSUBAME3.0 Will Be First HPE-SGI Super

February 16, 2017

In a press event Friday afternoon local time in Japan, Tokyo Institute of Technology (Tokyo Tech) announced its plans for the TSUBAME3.0 supercomputer, which w Read more…

By Tiffany Trader

Is Liquid Cooling Ready to Go Mainstream?

February 13, 2017

Lost in the frenzy of SC16 was a substantial rise in the number of vendors showing server oriented liquid cooling technologies. Three decades ago liquid cooling Read more…

By Steve Campbell

MIT Mathematician Spins Up 220,000-Core Google Compute Cluster

April 21, 2017

On Thursday, Google announced that MIT math professor and computational number theorist Andrew V. Sutherland had set a record for the largest Google Compute Eng Read more…

By Tiffany Trader

IBM Wants to be “Red Hat” of Deep Learning

January 26, 2017

IBM today announced the addition of TensorFlow and Chainer deep learning frameworks to its PowerAI suite of deep learning tools, which already includes popular Read more…

By John Russell

US Supercomputing Leaders Tackle the China Question

March 15, 2017

As China continues to prove its supercomputing mettle via the Top500 list and the forward march of its ambitious plans to stand up an exascale machine by 2020, Read more…

By Tiffany Trader

HPC Technique Propels Deep Learning at Scale

February 21, 2017

Researchers from Baidu's Silicon Valley AI Lab (SVAIL) have adapted a well-known HPC communication technique to boost the speed and scale of their neural networ Read more…

By Tiffany Trader

DOE Supercomputer Achieves Record 45-Qubit Quantum Simulation

April 13, 2017

In order to simulate larger and larger quantum systems and usher in an age of "quantum supremacy," researchers are stretching the limits of today's most advance Read more…

By Tiffany Trader

  • arrow
  • Click Here for More Headlines
  • arrow
Share This