The Heterogeneous Programming Jungle

By Michael Wolfe

March 19, 2012

There’s a lot of talk now about heterogeneous computing, but here I want to focus on heterogeneous programming. There are several, perhaps many, approaches being developed to program heterogeneous systems, but I would argue that none of them have proven that they successfully address the real goal. This article will discuss a range of potentially interesting heterogeneous systems for HPC, why programming them is hard, and why developing a high level programming model is even harder.

I’ve always said that parallel programming is intrinsically hard and will remain so. Parallel programming is all about performance. If you’re not interested in performance, save yourself the headache and just use a single thread. To get performance, your program has to create parallel activities, insert synchronization between them, and manage data locality.

The heterogeneous systems of interest to HPC use an attached coprocessor or accelerator that is optimized for certain types of computation. These devices typically exhibit internal parallelism, and execute asynchronously and concurrently with the host processor. Programming a heterogeneous system is then even more complex than “traditional” parallel programming (if any parallel programming can be called traditional), because in addition to the complexity of parallel programming on the attached device, the program must manage the concurrent activities between the host and device, and manage data locality between the host and device.
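
To make that concrete, here is a minimal host-side sketch in CUDA C of the pattern described above: allocate memory on the device, move the data, launch a kernel asynchronously, and copy the result back. It is only an illustration under assumed names; the kernel scale, the sizes, and the launch configuration are hypothetical, and other offload models (OpenCL, directive-based approaches) express the same steps differently.

    // Illustrative CUDA C sketch (hypothetical kernel and sizes), showing the
    // host managing device memory, data movement, and an asynchronous launch.
    __global__ void scale(float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;                     // one array element per thread
    }

    void scale_on_device(float *host_x, int n) {
        float *dev_x;
        size_t bytes = n * sizeof(float);
        cudaMalloc(&dev_x, bytes);                                  // allocate device memory
        cudaMemcpy(dev_x, host_x, bytes, cudaMemcpyHostToDevice);   // host -> device copy
        scale<<<(n + 255) / 256, 256>>>(dev_x, 2.0f, n);            // launch is asynchronous
        // ... the host can do other work here while the kernel runs ...
        cudaMemcpy(host_x, dev_x, bytes, cudaMemcpyDeviceToHost);   // waits, then device -> host
        cudaFree(dev_x);
    }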

A Heterogeneous System Zoo

Before we embark on a discussion of the jungle of programming options, it’s instructive to explore the range of heterogeneous systems that are currently in use, being designed, or being considered, from GPUs to Intel MIC, DSPs and beyond.

  • Intel/AMD X86 host + NVIDIA GPUs (x86+GPU). This is the most common heterogeneous HPC system we see today, including 35 of the Top 500 supercomputers in the November 2011 list. The GPU has its own device memory, which is connected to the host via PCIe. Data must be allocated on and moved to the GPU memory, and parallel kernels are launched asynchronously on the GPU by the host.
  • AMD Opteron + AMD GPUs. This is essentially another x86+GPU option, using AMD Firestream instead of NVIDIA. The general structure is the same as above.
  • AMD Opteron + AMD APU. The AMD Heterogeneous System Architecture (formerly AMD Fusion) could easily become a player in this market, given that the APU (Accelerated Processing Unit) is integrated on the chip with the Opteron cores. Current offerings in this line are programmed much like x86+GPU. The host and APU share physical memory, but the memory is partitioned. It’s very much like the x86+GPU case, except that a data copy between the two memory spaces can run at memory speeds, instead of at PCIe speed. Again, parallel kernels are launched asynchronously on the APU by the host. AMD has announced aggressive plans for this product line in the future.
  • Intel Core + Intel Ivy Bridge integrated GPU. The on-chip GPU on the next generation Ivy Bridge processor is reported to be OpenCL programmable, allowing heterogeneous programming models as well. Intel’s Sandy Bridge has an integrated GPU, but it seems not to support OpenCL.
  • Intel Core + Intel MIC. Reportedly, the MIC is the highly parallel technical computing accelerator architecture for Intel. Currently, the MIC is on a PCIe card like a GPU, so it has the same data model as x86+GPU. The MIC could support parallel kernels like a GPU, but it can also support OpenMP parallelism or dynamic task parallelism among the cores on the chip (a sketch of that offload-plus-OpenMP style follows this list). An early Intel Larrabee article (also published in PLDI 2009) described a programming model that supported virtual shared memory between the host and Larrabee, but I haven’t heard whether the MIC product includes support for that model.
  • NVIDIA Denver: ARM + NVIDIA GPU. As far as I know, this is not yet a product, and could look like the AMD APU, except for the host CPU and accelerator instruction sets.
  • Texas Instruments: ARM + TI DSPs. A recent HPCwire article described TI’s potential move (return, actually) to HPC using its multicore DSP chips. It’s pretty early to talk about system architecture or programming strategy for these.
  • Convey: Intel x86 + FPGA-implemented reconfigurable vector unit. I’ve always thought of the Convey machine as a coprocessor or accelerator system with a really interesting implementation. One of the key advantages of the system is that the coprocessor memory lives in the same virtual address space as the host memory. It’s separate physical memory, but the host can access the coprocessor memory, and vice versa, though with a performance cost. Moreover, the programming model looks more like vector programming than like coprocessor offloading.
  • GP core + some other multicore or stream accelerators. Possibilities include a Tilera multicore or the Chinese FeiTeng FT64. Again, it’s a little early to talk about system architecture or programming strategy.
  • GP core + FPGA fabric. Convey uses FPGAs to implement the reconfigurable vector unit, but you could consider a more customizable unit as well. There were plenty of FPGA products being displayed at SC11 in Seattle, but none were as well integrated as Convey’s, nor did they have as strong a programming environment. The potential for fully custom functional units with application-specific, variable-length datatypes is attractive in the abstract, but except for a few success stories in financial and bioinformatics applications, this approach has a long way to go to gain much traction in HPC.
  • IBM Power + Cell. This was more common a couple years ago. While there are still a few Cell-equipped supercomputers on the Top 500 list, IBM has pulled the plug on the PowerXCell 8i product, so expect these to vanish.
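
To illustrate the offload-plus-OpenMP style mentioned in the Intel MIC entry above, here is a hedged sketch in the spirit of Intel’s proposed offload directives (which come up again near the end of this article). The routine, array names, sizes, and exact clause spellings are assumptions for illustration, not a statement of what the product will support.

    // Hedged sketch: offload a region to the MIC card, then use ordinary OpenMP
    // parallelism among the cores on the device. Clause details are illustrative.
    void vector_add(const float *a, const float *b, float *c, int n)
    {
        #pragma offload target(mic) in(a : length(n)) in(b : length(n)) out(c : length(n))
        {
            #pragma omp parallel for
            for (int i = 0; i < n; i++)
                c[i] = a[i] + b[i];
        }
    }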

In the HPC space, there seem to be only two lonely vendors still dedicated to CPU-only solutions: IBM, with the Blue Gene family, and Fujitsu, with its planned SPARC-based follow-ons to the K computer.

Similarities

Despite the wide variety of heterogeneous systems, there is surprising similarity among most or all of the various designs. All the systems allow the attached device to execute asynchronously with the host. You would expect this, especially since most of the devices are programmable devices themselves, but even the tightly-connected Convey coprocessor unit executes asynchronously.

All the systems exhibit several levels of parallelism within the coprocessor.

  • Typically, the coprocessor has several, and sometimes many, execution units (I’ll avoid over-using the word processor). NVIDIA Fermi GPUs have 16 Streaming Multiprocessors (SMs); AMD GPUs have 20 or more SIMD units; Intel MIC will have 50+ x86 cores. If your program doesn’t take advantage of at least that much parallelism, you won’t be getting anywhere near the benefit of the coprocessor.
  • Each execution unit typically has SIMD or vector execution. NVIDIA GPUs execute threads in SIMD-like groups of 32 (what NVIDIA calls warps); AMD GPUs execute in wavefronts that are 64 threads wide; the Intel MIC, like its predecessor Larrabee, has 512-bit wide SIMD instructions (16 floats or 8 doubles). Again, if your application doesn’t take advantage of that much SIMD parallelism, your performance is going to suffer by the same factor.

Since these devices are used for processing large blocks of data, memory latency is a problem. Cache memory doesn’t help when the dataset is larger than the cache. One method to address the problem is to use a high bandwidth memory, such as Convey’s Scatter-Gather memory. The other is to add multithreading, where the execution unit saves the state of two or more threads, and can swap execution between threads in a single cycle. This strategy can range from swapping between threads on a cache miss to alternating between threads on every cycle. While one thread is waiting for memory, the execution unit keeps busy by switching to a different thread. To be effective, the program needs to expose even more parallelism, which will be exploited via multithreading. Intel Larrabee / Knights Ferry has four-way multithreading per core, whereas GPUs have a multithreading factor of 20 or 30. Your program will need to provide a factor of somewhere between four and twenty times more parallel activities for the best performance; with 16 execution units, 32-wide SIMD execution, and a multithreading factor of 20, that works out to something like 10,000 parallel operations in flight just to keep one device busy. Moreover, the programming model may want to expose the difference between multithreading and multiprocessing. When tuning for a multithreaded multiprocessor, the programmer may be able to share data among threads that share the same execution unit, since they will share cache resources and can synchronize efficiently, and distribute data across threads that use different execution units.

But most importantly, the attached device has its own path to memory, usually to a separate memory unit. These designs fall into three categories:

  • Separate physical memory: Current discrete GPUs and the upcoming Intel MIC have a separate physical memory connected to the attached device, not directly connected to the host processor, and often not even accessible from the host. The device is implemented as a separate card with its own memory. There may be support for the device accessing host memory directly, or the host accessing device memory, but at a significant performance cost.
  • Partitioned physical memory: Today’s AMD Fusion processor chips fall into this category, as do low-end systems (such as laptops) using a motherboard-integrated GPU. There is one physical memory, but some fraction of that memory is dedicated to the APU or GPU. Logically, it looks like a separate physical memory, except that copying data between the two spaces is faster, because both spaces are on the same memory bus; on the other hand, the memory bandwidth seen by the APU or GPU is lower, because the CPU main memory doesn’t have the bandwidth of, say, a good graphics memory system typical for a discrete GPU.
  • Separate physical memory, one virtual memory: The Convey system fits this category. The coprocessor has its own memory controllers to its own memory subsystem, but both the CPU and coprocessor physical memories are mapped to a single virtual address space. It becomes part of application optimization to make sure to allocate data in the coprocessor physical memory, for instance, if it will be mostly accessed by coprocessor instructions.

Programming Model Goals

Given the similarities among system designs, one might think it should be obvious how to come up with a programming strategy that would preserve portability and performance across all these devices. What we want is a method that allows the application writer to write a program once, and let the compiler or runtime optimize for each target. Is that too much to ask?

Let me reflect momentarily on the two gold standards in this arena. The first is high level programming languages in general. After 50 years of programming using Algol, Pascal, Fortran, C, C++, Java, and many, many other languages, we tend to forget how wonderful and important it is that we can write a single program, compile it, run it, and get the same results on any number of different processors and operating systems.

Second, let’s look back at vector computing. The HPC space 30 years ago was dominated by vector computers: Cray, NEC, Fujitsu, IBM, Convex, and more. Building on the compiler work from very early supercomputers (such as TI’s ASC) and some very good academic research (at Illinois, Rice, and others), these vendors produced vectorizing compilers that could generate pretty good vector code from loops in your program. It was not the only way to get vector performance from your code. You could always use a vector library or add intrinsics to your program. But vectorizing compilers were the dominant vector programming method.

Vectorizing compilers were successful for three very important reasons. The compilers not only attempted to vectorize your loops, they gave some very specific user feedback when they failed. How often does your optimizing compiler tell you when it failed to optimize your code? Never, most likely. In fact, you would find it annoying if it printed a message every time it couldn’t float a computation out of a loop, say. However, the difference between vector performance and non-vector performance was a factor of 5 or 10, or more in some cases. That performance should not be left on the table. If a programmer is depending on the compiler to generate vector code, he or she really wants to know if the compiler was successful. And the feedback could be quite specific. Not just failed to vectorize the loop at line 25, but an unknown variable N in the second subscript of the array reference to A on line 27 prevents vectorization. So the first reason vectorizing compilers were successful is that the programmer used this feedback to rewrite that portion of the code and eventually reach vector performance. The second reason is that this feedback slowly trained the programmer how to write his or her next program so it would vectorize automatically.

The third reason, and ultimately the most important, is that the style of programming that vectorizing compilers promoted gave good performance across a wide range of machines. Programmers learned to make the inner loops be stride-1, avoid conditionals, inline functions (or sink loops into the functions), pull I/O out of the inner loops, and identify data dependences that prevent vectorization. Programs written in this style would then vectorize with the Cray compiler, as well as the IBM, NEC, Fujitsu, Convex, and others. Without ever collaborating on a specification, the vendors trained their users on how to write performance-portable vector programs.
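
As a small, hypothetical C illustration of that style (in Fortran the subscript roles are reversed, since Fortran arrays are stored column-major): the first version strides through memory in its inner loop, while the second keeps the inner loop stride-1, the form vectorizing compilers were trained to reward and programmers were trained to write.

    #define N 1024   /* hypothetical fixed column count */

    /* Strided inner loop: consecutive iterations touch elements N apart,
       so the compiler either refuses to vectorize or vectorizes poorly. */
    void scale_strided(float a[][N], float s, int rows) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < rows; i++)
                a[i][j] *= s;
    }

    /* Stride-1 inner loop: the performance-portable style the vectorizing
       compilers promoted, and the style SIMD units still reward today. */
    void scale_stride1(float a[][N], float s, int rows) {
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < N; j++)
                a[i][j] *= s;
    }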

I claim that what we want is a programming strategy, model or language that will promote programming in a style that will give good performance across a wide range of heterogeneous systems. It doesn’t necessarily have to be a new language. As with the vectorizing compilers lesson, if we can create a set of coding rules that will allow compilers and tools to exploit the parallelism effectively, that’s probably good enough. But there are several factors that make this hard.

Why It’s Hard

Parallel programming is hard enough to begin with. Now we have to deal not only with the parallelism we’re accustomed to on our multicore CPUs, but also with an attached asynchronous device, and with the parallelism on that device.

To get high performance parallel code, you have to optimize locality and synchronization as well. For these devices, locality optimization mostly boils down to managing the distinct host and device memory spaces. Someone has to decide what data gets allocated in which memory, and whether or when to move that data to the other memory.
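
One way those decisions get expressed is with directives; here is a minimal OpenACC-flavored sketch (OpenACC comes up again at the end of this article) in which the programmer states which arrays are copied to device memory, which exist only there, and when results come back. The routine, array names, and sizes are hypothetical.

    /* Hedged OpenACC-style sketch: the data construct decides what is allocated
       and moved; the two loops then reuse device-resident data between passes. */
    void smooth(float *restrict a, float *restrict b, int n, int iters)
    {
        #pragma acc data copy(a[0:n]) create(b[0:n])   /* a moved both ways, b device-only */
        {
            for (int it = 0; it < iters; it++) {
                #pragma acc parallel loop
                for (int i = 1; i < n - 1; i++)
                    b[i] = 0.5f * a[i] + 0.25f * (a[i-1] + a[i+1]);
                #pragma acc parallel loop
                for (int i = 1; i < n - 1; i++)
                    a[i] = b[i];
            }
        }
    }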

Many systems will have multiple coprocessors at each node, each with its own memory. Users are going to want to exploit all the resources available, and that may mean managing not just one coprocessor, but two or more. Suddenly you have not just a data movement problem, but a data distribution problem, and perhaps load balancing issues, too. These are issues that were addressed partly by High Performance Fortran in the 1990s. Some data has to be distributed among the memories, some has to be replicated, and some has to be shared or partially shared. And remember that these coprocessors are typically connected to some pretty high performance CPUs. Let’s not just leave the CPU idle while the coprocessor is busy, let’s distribute the work (and data) across the CPU cores as well as the coprocessor(s).
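
A hedged sketch of what that starts to look like with just two GPUs under CUDA: the programmer now owns the data distribution (which half of each array lives on which device) as well as the movement. The kernel work, the even split, and the sizes are hypothetical, and a real code would give the CPU cores a share of the data too.

    __global__ void work(const float *x, float *y, int n) {   // hypothetical kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = 2.0f * x[i];
    }

    // Distribute one array across two GPUs, launch on each, collect the results.
    void run_on_two_gpus(const float *host_x, float *host_y, int n) {
        float *dx[2], *dy[2];
        int half = n / 2;
        for (int d = 0; d < 2; d++) {
            int count = (d == 0) ? half : n - half;
            cudaSetDevice(d);                                  // subsequent calls target GPU d
            cudaMalloc(&dx[d], count * sizeof(float));
            cudaMalloc(&dy[d], count * sizeof(float));
            cudaMemcpy(dx[d], host_x + d * half, count * sizeof(float), cudaMemcpyHostToDevice);
            work<<<(count + 255) / 256, 256>>>(dx[d], dy[d], count);   // both GPUs run concurrently
        }
        for (int d = 0; d < 2; d++) {
            int count = (d == 0) ? half : n - half;
            cudaSetDevice(d);
            cudaMemcpy(host_y + d * half, dy[d], count * sizeof(float), cudaMemcpyDeviceToHost);
            cudaFree(dx[d]);  cudaFree(dy[d]);
        }
    }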

Most of the complexity comes from the heterogeneity itself. The coprocessor has a different instruction set, different performance characteristics, is optimized for different algorithms and applications than the host, and is optimized to work from its own memory. The goal of HPC is the HP part, the high performance part, and we need to be able to take advantage of the features of the coprocessor to get this performance.

The challenge of designing a higher level programming model or strategy is deciding what to virtualize and what to expose. Successful virtualization preserves productivity, performance, and portability. Vectorizing compilers were successful at virtualizing the vector instruction set; although they exposed the presence of vector instructions, they virtualized the details of the instruction set, such as the vector register length, instruction spellings, and so on. Vectorizing compilers are still in use today, generating code for the x86 SIMD (SSE and AVX) and Power AltiVec instructions. There are other ways to generate these instructions, such as the Intel SSE intrinsics, but these can hardly be said to preserve productivity, and certainly do not promote portability. You might also use a set of vector library routines, such as the BLAS library, or C++ vector operations in the STL, but these don’t compose well, and can easily become memory-bandwidth bound.
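
To see the productivity and portability difference, compare a loop left to the vectorizing compiler with the same computation written in SSE intrinsics. This is only an illustrative sketch; the intrinsic version is also wired to one instruction set and one vector length, and silently assumes aligned data and a multiple-of-four trip count.

    #include <xmmintrin.h>

    /* Portable: a vectorizing compiler can map this to SSE, AVX, AltiVec, ... */
    void add_loop(float *c, const float *a, const float *b, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* Hand-written SSE intrinsics: 4-wide only, assumes 16-byte alignment and
       n divisible by 4; neither productive to write nor portable to keep. */
    void add_sse(float *c, const float *a, const float *b, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(&a[i]);
            __m128 vb = _mm_load_ps(&b[i]);
            _mm_store_ps(&c[i], _mm_add_ps(va, vb));
        }
    }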

Another alternative is vector or array extensions to the language, such as Fortran array assignments or Intel’s Array Notation for C. However, while these allow the compiler to more easily generate vector code, it doesn’t mean it’s better code than an explicit loop that gets vectorized. For example, compiling and vectorizing the following loop for SSE:

    do i = 1, n
      x = a(i) + b(i)
      c(i) = exp(x) + 1/x
    enddo

can be done by loading four elements of a and b into SSE registers, adding them, dividing into one, calling a vector exp routine, adding that result to the divide result, and storing those four elements into c. The intermediate result x never gets stored. The equivalent array code would be:

    x(:) = a(:) + b(:)
    c(:) = exp(x(:)) + 1/x(:)

The array assignments simplify the analysis to determine whether vector code can be generated, but the compiler has to do effectively the same amount of work to generate efficient code. In Fortran and Intel C, these array assignments are defined as computing the whole right hand side, then doing all the stores. The effect is as if the code were written as:

    forall(i=1:n) temp(i) = a(i) + b(i)
    forall(i=1:n) x(i) = temp(i)
    forall(i=1:n) temp(i) = exp(x(i)) + 1/x(i)
    forall(i=1:n) c(i) = temp(i)

where the temp array is allocated and managed by the compiler. The first analysis is to determine whether the temp array can be discarded. In some cases it cannot (such as a(2:n-1) = a(1:n-2) + a(3:n)), and the analysis to determine this is effectively the same dependence analysis as to vectorize the loop. If successful, the compiler is left with:

    forall(i=1:n) x(i) = a(i) + b(i)
    forall(i=1:n) c(i) = exp(x(i)) + 1/x(i)

Then the compiler needs to determine whether it can fuse these two loops. The advantage of fusing is avoiding the reload of the SSE register holding the intermediate value x. This analysis is essentially the same as for discarding the temp array above. Assuming that’s successful, we get:

    forall(i=1:n)
      x(i) = a(i) + b(i)
      c(i) = exp(x(i)) + 1/x(i)
    endforall

Now we want the compiler to determine whether the array x is needed at all. In the original loop, x was a scalar, and compiler lifetime analysis for scalars is precise. For arrays, it’s much more difficult, and sometimes intractable. At best, the code generated from the array assignments is as good as that from the vectorized loop. More likely, it will generate more memory accesses, and for large datasets, more cache misses.

At the minimum, the programming model should virtualize those aspects that are different among target systems. For instance, high level languages virtualize instruction sets and registers. The compilers virtualize instruction-level parallelism by scheduling instructions automatically. Operating systems virtualize the fixed physical memory size of the system with virtual memory, and virtualize the effect of multiple users by time slicing.

The model has to strike a balance between virtualizing a feature and perhaps losing performance or losing the ability to tune for that feature, versus exposing that feature and improving potential performance at the cost of productivity. For instance, one way to manage separate physical memories is to essentially emulate shared memory by utilizing virtual memory hardware, using demand paging to move data as needed from one physical memory to another. Where it works, it completely hides the separate memories, but it also makes it hard to optimize your program for separate memories, in part because the memory sharing is done at a hardware-defined granularity (virtual memory page) instead of an application-defined granularity.

Grab your Machete and Pith Helmet

If parallel programming is hard, heterogeneous programming is that hard, squared. Defining and building a productive, performance-portable heterogeneous programming system is hard. There are several current programming strategies that attempt to solve this problem, including OpenCL, Microsoft C++ AMP, Google Renderscript, Intel’s proposed offload directives (see slide 24), and the recent OpenACC specification. We might also learn something from embedded system programming, which has had to deal with heterogeneous systems for many years. My next article will whack through the underbrush to expose each of these programming strategies in turn, presenting advantages and disadvantages relative to the goal.

About the Author

Michael Wolfe has developed compilers for over 30 years in both academia and industry, and is now a senior compiler engineer at The Portland Group, Inc. (www.pgroup.com), a wholly-owned subsidiary of STMicroelectronics, Inc. The opinions stated here are those of the author, and do not represent opinions of The Portland Group, Inc. or STMicroelectronics, Inc.
