Compilers and More: MPI+X

By Michael Wolfe

July 16, 2014

At ISC’14, there was intense and continuing interest in the choice of a standard approach for programming the next generation HPC systems. While not guaranteed, many of these systems are likely to be large clusters of nodes with multicore CPUs and some sort of attached accelerators. A standard programming approach is necessary to convince developers, and particularly ISVs, to start adoption now in preparation for this coming generation of systems. John Barr raised the same question in an article at Scientific Computing World from a more philosophical point of view. Here I address this question from a deeper technical perspective.

HPC programming is currently dominated by either a flat model with MPI across nodes as well as cores within a node, or a hybrid model with MPI across the nodes and OpenMP shared memory parallelism across the cores in a node. The advantage of flat MPI is a simpler programming model, only one level of parallelism and only one API. The disadvantage is it doesn’t take advantage of the shared data across the ranks on the same node, requiring message and buffer management across all ranks. MPI+OpenMP roughly inverts those advantages and disadvantages.

The reason MPI and MPI+OpenMP have worked so well over the past 20 years now is that most HPC systems are roughly isomorphic, with some differences in instruction set, node topology, and performance profiles. The system is a network of nodes, the nodes have one or more processors, the processors have one or more cores. The cores on a node share virtual and physical memory, with hardware cache coherence to make shared memory programming relatively safe. There are some outliers, like the big SGI shared memory systems, which have some programming model and performance advantages for certain applications.

John Barr suggests that the options for programming next-generation HPC systems will be MPI+X, where X is one of: OpenMP (if accelerators don’t persist), OpenACC, OpenMP 4+ with device constructs, OpenCL, or CUDA. John is probably right, but I want to add a caveat. We don’t know what HPC systems will look like in a decade. We don’t even know what the HPC systems of the near future will look like. The familiar (I refrain from “simple”) MPI+OpenMP programming model that has served us so well is unlikely to apply to the newest designs five or ten years from now, because the network of homogeneous nodes of homogeneous cores with uniform shared memory is going to change. HPC is at another inflection point, where the drivers are both economic and technological.

Technology is driving designs towards slower cores with more parallelism to make up the performance slack. Parallelism will come in three forms: classical multicore or multiprocessor parallel processes, threads, or tasks; classical vector operations, sometimes recast as SIMD operations; and multithreading used to overlap long latency operations, mostly to tolerate long memory latencies. If your system has enough native bandwidth, multithreading can allow an appropriately organized application to achieve close to peak performance as long as there is another thread ready to execute when one thread stalls on a cache miss. Bandwidth increase is easier to achieve than latency reduction, though it’s still not free.

The economics of HPC leads us to use the best, lowest-cost commodity parts whenever possible. The economic drivers in computing will be mobile and embedded, such as smart phones and tablets. High end workstations will still be important, but vendors are learning that growth is in mobile, and we can expect the biggest investments for new computing parts there. Luckily for us, those parts need increasing performance and continue to require lower power and low cost, so even if painful, a shift from workstation parts to mobile parts may be beneficial.

But the important point is that we don’t know exactly what these systems will look like, and there’s likely to be quite a bit of variability. Some will look like today’s systems, though with higher core counts on each processor chip. Some will have low power chips designed for or inspired by the embedded space; think Blue Gene L/P/Q, Intel Atom/Quark, or ARM. The ratios of parallelism on a node vs. across nodes will differ.

Many will have accelerators of some sort: GPUs (NVIDIA, AMD), manycore (Intel Xeon Phi), or other (TI DSP). The amount and style of parallelism for each kind of accelerator differs.

Memory on today’s accelerators is physically and logically separate from the host memory; that will not always remain so. There is movement to share the same address space between accelerator and host, which allows a kind of shared memory. However, there will still be a performance penalty for accessing the more distant memory, either using remote access or software managed distributed shared memory, like the classical Treadmarks system. In particular, today’s accelerators are designed with a high bandwidth memory; if the host memory doesn’t provide the same bandwidth, the performance will suffer. The same can be true if the host and accelerator share physical memory. Today’s laptops have an integrated graphics unit, often on the processor chip itself; the IGU uses the system memory for the graphics buffer. This is sufficient for most laptop uses (email, presentations), but for interactive games and other challenging graphical applications, you want a discrete GPU with its own memory. Upcoming memory subsystems, such as the Hybrid Memory Cube, may change the balance allowing high bandwidth shared memory, if we’re willing to pay the cost.

So the challenge proposed by John Barr isn’t just what language we should use to write programs, but how will we write programs that will give us satisfactory performance across the variety of systems available today, tomorrow, and over the next decade. Will any of the options he presented suffice? Let’s discuss each of them.

MPI + OpenMP 3.1

OpenMP 3.1 was a minor update to OpenMP 3.0, released in 2011 after three years of development. For nodes with small numbers of homogeneous cores and uniform memory access, OpenMP serves quite well. It’s hard to argue that it needs to be replaced, unless you are using a higher level language than C, C++ or Fortran. However, even those languages are starting to adopt native parallel constructs. Fortran has long had array assignments and a forall construct, expressing vector-style or SIMD parallelism, and now has a do concurrent construct, subsuming much of the behavior of the OpenMP parallel do. There is ongoing discussion in the C and C++ language committees about adding OpenMP-style threading or Cilk-style tasking to those languages. Language standards committees move slowly, thankfully, but depending on how these discussions play out, the need for OpenMP directives may diminish in the future.

Advantages: OpenMP 3.1 has the advantage of being relatively high level, and is quite mature. It is supported across essentially all multicore and multiprocessor systems, giving good functional and performance portability, though not necessarily scalability.

Disadvantages: Anyone just starting to port an existing program to a multicore will either really like the simplicity of OpenMP parallel loops, or will really dislike the necessity of annotating each and every parallel loop in the program. The OpenMP 3.1 model of a system is somewhat dated; systems being built now and in the future are no longer so homogeneous. OpenMP is really dependent on long-lived parallel threads, which pervade the execution model and are exposed in the programming model. While OpenMP now has tasks, the tasking model was grafted (quite elegantly, in my opinion) onto the existing OpenMP threading model.
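To make the "annotate each and every parallel loop" point concrete, here is a minimal sketch of two loops parallelized in the OpenMP 3.1 style. The routine names `saxpy_omp` and `dot_omp` are hypothetical, chosen only for illustration. If the compiler is not run with OpenMP enabled, the pragmas are ignored and the loops run serially with the same result.

```c
#include <assert.h>

/* Hypothetical example: y[i] = a * x[i] + y[i] for 0 <= i < n. */
void saxpy_omp(int n, float a, const float *x, float *y)
{
    assert(n >= 0);
    /* Work-share the loop iterations across the threads of one team. */
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Hypothetical example: a second loop needs its own directive;
   nothing carries over from the loop above. */
float dot_omp(int n, const float *x, const float *y)
{
    float sum = 0.0f;
    assert(n >= 0);
    /* Each thread accumulates a private partial sum, combined at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

Each directive is short, but a large application may contain thousands of such loops, and each one has to be examined and annotated individually.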

Portability: As mentioned, OpenMP is supported on essentially all multiprocessor and multicore systems used in HPC today, and provides pretty good performance portability across the homogeneous shared memory (low core-count) multicore environment for which it was designed. However, OpenMP 3.1 has no support for heterogeneous or accelerated systems. There is an impressive research project at Purdue University to implement OpenMP on NVIDIA GPUs with some significant compiler and runtime effort. They have produced a prototype compiler and have some impressive results on benchmark programs. However, it’s still in the research stage, and it suffers from trying to hide or virtualize some of the most performance sensitive issues of accelerator programming, such as data movement between host and accelerator memories and the performance of the synchronizations required by OpenMP.

Future: OpenMP 4 adds many new features, including some support for heterogeneous systems and more control to take advantage of data and compute locality. The language is becoming more complex, to adapt to and allow control over a more complex world of parallel systems. Eventually, meaning a decade or more from now, the base languages may support enough native parallelism to obviate the need for OpenMP directives.


MPI + CUDA

CUDA is an explicit programming model for GPU-style accelerators. It exposes many of the architectural features of a GPU, such as the 3D grid and 3D thread block, the software-managed cache (CUDA shared memory), the SIMD-style execution of adjacent threads in the same thread block, the separate data memory on the GPU, and so on. Used the right way, this is a strength of the model and language, since it gives a dedicated developer access to all the power of the GPU. However, it does require significant investment when porting to the GPU. In addition, when moving from one generation of GPU to the next, highly tuned CUDA kernels will often have to be retuned for the newer hardware, increasing code maintenance cost.

Moreover, CUDA today is a single-vendor (NVIDIA) solution. PGI has developed a CUDA-X86 solution, but its performance is not always as good as the corresponding native code. One can envision a CUDA implementation for other GPUs or other devices, but there are no hints of such a development.

Advantages: The CUDA model is relatively mature, over five years old now. It gives control to the programmer over all the relevant features of the GPU. It comes with debugging and performance analysis tools, and is well-supported by NVIDIA. The language and programming model evolve as the capabilities of the GPU evolve. Recently, CUDA added Unified Memory, which makes porting easier by simplifying the memory management and allowing automatic data movement between the host and device memories. CUDA Fortran provides the same programming model for Fortran programmers, with the advantages of using the higher level language features available in Fortran.

Disadvantages: It’s an explicit programming model. Converting an existing application in C, C++ or Fortran to CUDA is a significant rewrite, although such rewrites allow and encourage one to replace inappropriate algorithms and data structures with others more suited to the parallel world.

Portability: CUDA is currently limited to NVIDIA GPUs. There have been efforts to port CUDA programs to parallel execution on CPUs; the CUDA-X86 product uses some of the same techniques. These efforts have some successes, but highly parallel CUDA programs don’t always map all that well to the limited multicore and SIMD parallelism available on CPUs. Moreover, a kernel tuned for one generation of NVIDIA GPU (experimenting with thread block size and shape, unrolling loops) will often have to be retuned for the next generation.

Future: NVIDIA will continue to support and develop CUDA. CUDA will continue to play an important role in GPU programming, especially where the investment pays off handsomely, such as highly tuned libraries or very performance sensitive application kernels.

MPI + OpenCL

OpenCL is essentially the same programming model for GPU-style accelerators as CUDA, but with a lower level platform layer. It exposes the same features of the GPU, but is not NVIDIA-specific. It is also supported on many targets, including DSPs and even x86 and ARM processors. Like CUDA, it gives complete control to the programmer. Unlike CUDA, there is no OpenCL compiler for the host code; the platform API is completely library-based, making it look very verbose, especially for small programs. The difference becomes clear when looking at how a kernel is launched in CUDA and OpenCL. In CUDA C, a kernel launch looks like a procedure call to the kernel, with additional syntax to specify the launch configuration (grid and block sizes):

      kernel<<< gridsize, blocksize >>>( arg0, arg1, ... );

The CUDA compiler translates this into the series of runtime calls to manage the arguments and the actual launch. Since OpenCL doesn’t have such a compiler, a kernel launch becomes a series of explicit runtime calls: one for each argument, and one for the launch:

      clSetKernelArg( kernel_handle, 0, sizeof(arg0), &arg0 );
      clSetKernelArg( kernel_handle, 1, sizeof(arg1), &arg1 );
      clEnqueueNDRangeKernel( queue_handle, kernel_handle, 1, NULL,
                              &globalsize, &localsize, 0, NULL, NULL );

Other than that, OpenCL programming is potentially just as powerful as CUDA programming.
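One detail that is easy to miss when moving between the two launch styles is that the size arguments mean different things. In CUDA the grid size counts thread blocks, while the OpenCL global size counts total work-items and, in OpenCL 1.x, must be a multiple of the local size. The following is a minimal sketch of that arithmetic in plain C; the helper names `cuda_grid_size` and `opencl_global_size` are hypothetical and are not part of either API.

```c
#include <assert.h>
#include <stddef.h>

/* CUDA style: number of thread blocks needed to cover n elements
   when each block holds blocksize threads (round up). */
size_t cuda_grid_size(size_t n, size_t blocksize)
{
    assert(blocksize > 0);
    return (n + blocksize - 1) / blocksize;
}

/* OpenCL style: total number of work-items, rounded up to the next
   multiple of localsize so that it divides evenly into work-groups. */
size_t opencl_global_size(size_t n, size_t localsize)
{
    return cuda_grid_size(n, localsize) * localsize;
}
```

For 1000 elements and a block (or work-group) size of 256, this gives a CUDA grid size of 4 and an OpenCL global size of 1024. In both models the kernel then needs a bounds check so that the extra 24 threads do nothing.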

Given the wider range of target devices implementing OpenCL, there has been quite a discussion about OpenCL and performance portability for the past several years. Use your favorite search engine for the phrase “OpenCL performance portability” for a sample. A recent article on HPCwire advocates quite well in favor of OpenCL for heterogeneous computing, in part because it accommodates all the available architectures, separately or together. I have to agree with almost everything the article says. As for performance portability, this article states: “True, we may need to write a new version of our kernel to get the best performance on Architecture A, but isn’t this what we actually want?” The most telling word in that sentence is “kernel,” singular. In the embedded world, one developer is quite likely to be focused on a single kernel, such as some video processing, or audio encoding, or the like. In that scenario, with only one kernel, you can afford to tune and retune for each architecture. However, if you have tens of thousands or millions of lines of code to support over the multi-decade lifetime of your application, this is not a productive path.

Advantages: The kernels are written in the same explicit programming model as CUDA, and OpenCL is supported across a wide range of devices. As with CUDA, the OpenCL API continues to evolve, adding features aimed at simplifying programming and supporting larger applications.

Disadvantages: It doesn’t have a standard set of tools across all devices, so debugging and profiling support is sometimes spotty. The language is evolving, as mentioned, but evolves slowly, as any committee-designed language will.

Portability: Since the programming model is lower-level, your kernels have to explicitly take advantage of the features of the target architecture. That means your kernels have to be retuned or rewritten for different targets.

Future: The OpenCL model is more aimed at the embedded or mobile market, where the application writer doesn’t necessarily have details about the target device. Features like dynamic, just-in-time compiling are important in that arena. In HPC, having to recompile every kernel on every node of your cluster every time you run your program quickly sounds like overhead you’d rather avoid. However, I predict that like CUDA, OpenCL can play an important role in programming for GPUs and other accelerators, especially where the investment will pay off handsomely.


MPI + OpenACC

OpenACC is a directive-based programming model targeting a CPU+accelerator system, intentionally similar in many ways to OpenMP. It is designed to do for accelerator programming systems what OpenMP does for multicore systems. It hides or virtualizes those features of the system that can be managed automatically by the system without performance penalty, and exposes those features that must be managed by the programmer. For instance, the programming and execution models expose the presence of separate memories on a host+GPU system, and require the user to manage data movement between the host and device memories for device-resident data. However, OpenACC virtualizes the variable names, using the same name for the CPU and device copies of the data, resolving them depending on where the name is used, and will even work with no overhead when the device shares memory with the host. The original intent was to eventually merge OpenACC into OpenMP, but the two groups decided on different approaches. In full disclosure, this author is a key member of the OpenACC language committee, and formerly participated in the OpenMP language committee.

OpenACC was designed to focus on parallel algorithms that perform well on the accelerators of interest. Accelerators today need a high degree of multicore parallelism as well as vector or SIMD parallelism. They don’t work very efficiently on scalar code, or on scalar tasks; scalar code will always be more efficient on a high-speed host CPU. This is the primary motivation for a programming model that supports closely coupled CPU+Accelerator systems, which will likely be most efficient at maximizing performance across complete applications.
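As an illustration of the directive style described above, here is a minimal sketch of a single loop offloaded with OpenACC. The routine name `saxpy_acc` is hypothetical. The data clauses state what must move between host and device memory, while the loop body keeps using the ordinary names `x` and `y`. A compiler without OpenACC support ignores the pragma and runs the loop on the host.

```c
#include <assert.h>

/* Hypothetical example: y[i] = a * x[i] + y[i] for 0 <= i < n. */
void saxpy_acc(int n, float a, const float *restrict x, float *restrict y)
{
    assert(n >= 0);
    /* copyin(x): x is only read, so it is copied to the device at entry.
       copy(y):   y is read and written, so it is copied in at entry and
                  back to the host at exit.
       Inside the loop, x and y refer to whichever copy is local. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The directive says that the loop is parallel and which data it touches. How the iterations map onto the multicore and vector parallelism of a given device is left to the compiler unless the programmer adds further clauses.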

Advantages: OpenACC has demonstrated support for multiple devices, and there is some initial evidence for performance portability across device types.

Disadvantages: OpenACC is still relatively young, and there are as yet no available open-source implementations of the full language. As yet, there are no implementations that target the host multicore as a device, so it currently must be combined with OpenMP. Some critical features are still under development, and there are some differences in how features are implemented by different compilers, making portability across vendors an issue.

Portability: OpenACC can virtualize many aspects of the architecture, so programs should work well across divergent targets. At the NVIDIA GPU Technology Conference in March, I presented results showing the performance of the SPEC ACCEL OpenACC benchmarks running on NVIDIA Kepler and AMD Radeon GPUs, providing initial evidence of performance portability across devices.

Future: OpenACC version 2.0 includes some important new features, such as defined support for separate compilation, unstructured data lifetimes, and more asynchronous operations; work proceeds on additional critical and important features. There are quite a few large applications being ported to GPUs with OpenACC, and this experience feeds back to the language design and implementations. Expect a great deal of exciting work on OpenACC in the next few years, including commercial and open-source implementations across all HPC target accelerators, as well as host CPUs.

MPI + OpenMP 4.0 Target Directives

One may ask why OpenACC looks so different from OpenMP. Why not just implement OpenMP on GPUs? This is a good question and an important one. I tried to answer this with an article several years ago. Current GPUs, in particular, have some limitations that we should avoid and some capabilities that we should exploit when programming for the best parallel performance.

The OpenMP ARB released version 4.0 of the specification last July, a major revision just two years after releasing 3.1. Among many new features are the device constructs. Some of these constructs mirror those in OpenACC, such as data management with the OpenMP target data and target update constructs. The OpenMP committee chose to add a new level of parallelism to the classical OpenMP threads: teams. In OpenMP 3.1, a parallel region was executed by a team of threads, and parallel loop iterations or tasks were work-shared across the threads in a team. With the OpenMP 4.0 device constructs, the user can now create a league of teams of threads. The addition of teams was to accommodate devices like GPUs which do not support an efficient global barrier synchronization across all the threads in the system. This addition has two unfortunate side effects. First, porting an existing OpenMP program to a device like a GPU is not as easy as simply adding a target directive around the parallel loops. As with OpenACC, the programmer will have to manage the data traffic to the device, but now the programmer has to add new types of parallelism as well. Second, some devices, such as the manycore Intel Xeon Phi Coprocessor, don’t need and probably don’t want this extra level of parallelism; OpenMP can be implemented natively on that target. This means that tuning for one target may differ significantly from tuning for another target, making performance portability a real challenge. I fear that the OpenMP committee has made some unfortunate decisions that will be hard to fix.
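To show what the extra level of parallelism looks like in source form, here is a minimal sketch of one loop written with the OpenMP 4.0 device constructs. The routine name `saxpy_target` is hypothetical. On a classical host, `parallel for` alone would be enough; for a device, the loop also names the target, the data mapping, and the league of teams across which the iterations are first distributed.

```c
#include <assert.h>

/* Hypothetical example: y[i] = a * x[i] + y[i] for 0 <= i < n. */
void saxpy_target(int n, float a, const float *x, float *y)
{
    assert(n >= 0);
    /* target:       run the region on the default device
       map:          x is copied to the device, y is copied both ways
       teams:        create a league of thread teams
       distribute:   split the iterations across the teams
       parallel for: work-share each team's share across its threads */
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Whether the `teams distribute` part helps or hurts depends on the target, which is the tuning difference described above. Without OpenMP enabled, or with no device available, the loop simply runs on the host.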

Intel supports many of the OpenMP 4 device constructs, and presumably will have a conformant implementation for the Xeon Phi Coprocessor. Texas Instruments has also announced plans to support OpenMP 4 on some ARM+DSP heterogeneous SoCs. The new Convey MX computer has support for hybrid OpenMP that matches OpenMP 4 features. Cray has announced plans to deliver OpenMP 4 for their systems as well, including Xeon Phi and Kepler accelerators, though they don’t promise performance portability. The other OpenMP vendor members have yet to announce support for the device constructs.

Advantages: For those targets that can implement full OpenMP, the OpenMP 4 device constructs look like a simpler method to port an existing OpenMP program. There is still the requirement to manage data transfers, but that’s the case on any of these programming models.

Disadvantages: For targets that aren’t designed to implement OpenMP, it’s not clear whether good performance can be delivered. No prototype implementations of the teams construct were available. Performance portability is going to be a challenge, though, as with OpenCL, language portability is still a big step in the right direction. Parts of the current OpenMP 4 device constructs are not well specified and need more careful definition.

Portability: It’s too early to say for certain, but I fear the OpenMP device constructs are too prescriptive to support performance portability across the wide variety of accelerator architectures being used now and designed for the future. If a programmer has to write programs differently for each target, then the language loses many of its advantages.

Future: OpenMP 4.0 is still relatively new and has many new features beyond the device constructs. It will be some time before we have several mature complete implementations. The OpenMP ARB seems committed to more rapid exploration and adoption of new features across the spectrum of parallel programming. The device constructs will be expanded or contracted as necessary to address the targets of interest. Only if the OpenMP device constructs result in good performance on the host as well as DSPs, manycores, and GPUs, will it become the language of choice.

Any Others?

John Barr left off two other heterogeneous programming languages that have been bandied about, important if only because of the large company promoting each. Google Renderscript uses a kernels computation model and managed memory allocations (like OpenCL kernels and buffers). Unlike OpenCL, Renderscript is optimized for kernels that access graphics-like data structures: rectangular arrays of structs. It’s pretty hard to get the kind of random indexed addressing that is available in OpenCL or CUDA. The runtime can launch kernels on any appropriate available device, such as a CPU or GPU, moving data as required to make it accessible on that device.

Microsoft C++AMP (Accelerated Massive Parallelism) is a pretty sophisticated C++-specific solution, aiming at the sophisticated C++ programmer (or perhaps all C++ programmers are sophisticated), using C++ templates and lambda expressions for data and compute. It introduces a very rich parallel_for_each function that takes a computation domain and kernel function as arguments, and runs that function across the parallel domain. Like Renderscript and OpenCL, the runtime manages the memory movement; it can also decide whether to run a kernel on the host or the GPU. Currently it is available only from Microsoft, though there are one or two experimental implementations from outside Redmond.


As John Barr argues, the tools and programming methodologies for upcoming HPC systems will change as the architectures evolve. Resolving to a high level portable heterogeneous programming strategy will attract more users early in this evolution. Before MPI, there were several message passing libraries, some of which were vendor-specific. Settling on MPI allowed applications to port across cluster architectures without rewrites, and allowed vendors to focus on innovation at the architecture and implementation level. It also limited innovation on the message passing libraries. Before OpenMP, there were several sets of high level directives and low-level threading libraries for programming multiprocessor workstations and cluster nodes. Settling on OpenMP similarly allowed applications to port across multiprocessor and eventually multicore architectures without rewrites, allowing vendors to focus on innovation at the architecture and implementation level. It also limited innovation at the programming language level.

To program heterogeneous systems, we want to settle on a high level programming strategy. It should be as target agnostic as possible. In particular, it must not require a user to write a different program for different targets, and it must not require a user to write a different program for heterogeneous systems from homogeneous systems. We’re going to always require low-level or explicit programming as well, and CUDA and OpenCL will fit this bill nicely for appropriate devices. At the high level, the obvious possibilities for X are OpenACC or OpenMP; the eventual choice will be made by HPC users. If the OpenMP device constructs mature to better support performance portable programming across the wide variety of devices, it may become the best option. If OpenACC demonstrates portability and more generality, it should be the language of choice. There’s still hope that the two specifications will converge. Users will eventually decide based on how well each language supports the variety of targets in terms of ease of use and features, but most importantly, performance and performance portability.

Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industy updates delivered to you every week!

Weekly Twitter Roundup (Jan. 12, 2017)

January 12, 2017

Here at HPCwire, we aim to keep the HPC community apprised of the most relevant and interesting news items that get tweeted throughout the week. Read more…

By Thomas Ayres

NSF Seeks Input on Cyberinfrastructure Advances Needed

January 12, 2017

In cased you missed it, the National Science Foundation posted a “Dear Colleague Letter” (DCL) late last week seeking input on needs for the next generation of cyberinfrastructure to support science and engineering. Read more…

By John Russell

NSF Approves Bridges Phase 2 Upgrade for Broader Research Use

January 12, 2017

The recently completed phase 2 upgrade of the Bridges supercomputer at the Pittsburgh Supercomputing Center (PSC) has been approved by the National Science Foundation (NSF) making it now available for research allocations to the national scientific community, according to an announcement posted this week on the XSEDE web site. Read more…

By John Russell

Clemson Software Optimizes Big Data Transfers

January 11, 2017

Data-intensive science is not a new phenomenon as the high-energy physics and astrophysics communities can certainly attest, but today more and more scientists are facing steep data and throughput challenges fueled by soaring data volumes and the demands of global-scale collaboration. Read more…

By Tiffany Trader

HPE Extreme Performance Solutions

Remote Visualization: An Integral Technology for Upstream Oil & Gas

As the exploration and production (E&P) of natural resources evolves into an even more complex and vital task, visualization technology has become integral for the upstream oil and gas industry. Read more…

For IBM/OpenPOWER: Success in 2017 = (Volume) Sales

January 11, 2017

To a large degree IBM and the OpenPOWER Foundation have done what they said they would – assembling a substantial and growing ecosystem and bringing Power-based products to market, all in about three years. Read more…

By John Russell

UberCloud Cites Progress in HPC Cloud Computing

January 10, 2017

200 HPC cloud experiments, 80 case studies, and a ton of hands-on experience gained, that’s the harvest of four years of UberCloud HPC Experiments. Read more…

By Wolfgang Gentzsch and Burak Yenier

A Conversation with Women in HPC Director Toni Collis

January 6, 2017

In this SC16 video interview, HPCwire Managing Editor Tiffany Trader sits down with Toni Collis, the director and founder of the Women in HPC (WHPC) network, to discuss the strides made since the organization’s debut in 2014. Read more…

By Tiffany Trader

FPGA-Based Genome Processor Bundles Storage

January 6, 2017

Bio-processor developer Edico Genome is collaborating with storage specialist Dell EMC to bundle computing and storage for analyzing gene-sequencing data. Read more…

By George Leopold

For IBM/OpenPOWER: Success in 2017 = (Volume) Sales

January 11, 2017

To a large degree IBM and the OpenPOWER Foundation have done what they said they would – assembling a substantial and growing ecosystem and bringing Power-based products to market, all in about three years. Read more…

By John Russell

UberCloud Cites Progress in HPC Cloud Computing

January 10, 2017

200 HPC cloud experiments, 80 case studies, and a ton of hands-on experience gained, that’s the harvest of four years of UberCloud HPC Experiments. Read more…

By Wolfgang Gentzsch and Burak Yenier

A Conversation with Women in HPC Director Toni Collis

January 6, 2017

In this SC16 video interview, HPCwire Managing Editor Tiffany Trader sits down with Toni Collis, the director and founder of the Women in HPC (WHPC) network, to discuss the strides made since the organization’s debut in 2014. Read more…

By Tiffany Trader

BioTeam’s Berman Charts 2017 HPC Trends in Life Sciences

January 4, 2017

Twenty years ago high performance computing was nearly absent from life sciences. Today it’s used throughout life sciences and biomedical research. Genomics and the data deluge from modern lab instruments are the main drivers, but so is the longer-term desire to perform predictive simulation in support of Precision Medicine (PM). There’s even a specialized life sciences supercomputer, ‘Anton’ from D.E. Shaw Research, and the Pittsburgh Supercomputing Center is standing up its second Anton 2 and actively soliciting project proposals. There’s a lot going on. Read more…

By John Russell

Fast Rewind: 2016 Was a Wild Ride for HPC

December 23, 2016

Some years quietly sneak by – 2016 not so much. It’s safe to say there are always forces reshaping the HPC landscape but this year’s bunch seemed like a noisy lot. Among the noisemakers: TaihuLight, DGX-1/Pascal, Dell EMC & HPE-SGI et al., KNL to market, OPA-IB chest thumping, Fujitsu-ARM, new U.S. President-elect, BREXIT, JR’s Intel Exit, Exascale (whatever that means now), NCSA@30, whither NSCI, Deep Learning mania, HPC identity crisis…You get the picture. Read more…

By John Russell

AWI Uses New Cray Cluster for Earth Sciences and Bioinformatics

December 22, 2016

The Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research (AWI), headquartered in Bremerhaven, Germany, is one of the country's premier research institutes within the Helmholtz Association of German Research Centres, and is an internationally respected center of expertise for polar and marine research. In November 2015, AWI awarded Cray a contract to install a cluster supercomputer that would help the institute accelerate time to discovery. Now the effort is starting to pay off. Read more…

By Linda Barney

Addison Snell: The ‘Wild West’ of HPC Disaggregation

December 16, 2016

We caught up with Addison Snell, CEO of HPC industry watcher Intersect360, at SC16 last month, and Snell had his expected, extensive list of insights into trends driving advanced-scale technology in both the commercial and research sectors. Read more…

By Doug Black

KNUPATH Hermosa-based Commercial Boards Expected in Q1 2017

December 15, 2016

Last June tech start-up KnuEdge emerged from stealth mode to begin spreading the word about its new processor and fabric technology that’s been roughly a decade in the making. Read more…

By John Russell

AWS Beats Azure to K80 General Availability

September 30, 2016

Amazon Web Services has seeded its cloud with Nvidia Tesla K80 GPUs to meet the growing demand for accelerated computing across an increasingly diverse range of workloads. The P2 instance family is a welcome addition for compute- and data-focused users who were growing frustrated with the performance limitations of Amazon's G2 instances, which are backed by three-year-old Nvidia GRID K520 graphics cards. Read more…

By Tiffany Trader

US, China Vie for Supercomputing Supremacy

November 14, 2016

The 48th edition of the TOP500 list is fresh off the presses and while there is no new number one system, as previously teased by China, there are a number of notable entrants from the US and around the world and significant trends to report on. Read more…

By Tiffany Trader

Vectors: How the Old Became New Again in Supercomputing

September 26, 2016

Vector instructions, once a powerful performance innovation of supercomputing in the 1970s and 1980s, became an obsolete technology in the 1990s. But like the mythical phoenix bird, vector instructions have arisen from the ashes. Here is the history of a technology that went from new to old then back to new. Read more…

By Lynd Stringer

Container App ‘Singularity’ Eases Scientific Computing

October 20, 2016

HPC container platform Singularity is just six months out from its 1.0 release but already is making inroads across the HPC research landscape. It's in use at Lawrence Berkeley National Laboratory (LBNL), where Singularity founder Gregory Kurtzer has worked in the High Performance Computing Services (HPCS) group for 16 years. Read more…

By Tiffany Trader

Dell EMC Engineers Strategy to Democratize HPC

September 29, 2016

The freshly minted Dell EMC division of Dell Technologies is on a mission to take HPC mainstream with a strategy that hinges on engineered solutions, beginning with a focus on three industry verticals: manufacturing, research and life sciences. "Unlike traditional HPC where everybody bought parts, assembled parts and ran the workloads and did iterative engineering, we want folks to focus on time to innovation and let us worry about the infrastructure," said Jim Ganthier, senior vice president, validated solutions organization at Dell EMC Converged Platforms Solution Division. Read more…

By Tiffany Trader

For IBM/OpenPOWER: Success in 2017 = (Volume) Sales

January 11, 2017

To a large degree IBM and the OpenPOWER Foundation have done what they said they would – assembling a substantial and growing ecosystem and bringing Power-based products to market, all in about three years. Read more…

By John Russell

Lighting up Aurora: Behind the Scenes at the Creation of the DOE’s Upcoming 200 Petaflops Supercomputer

December 1, 2016

In April 2015, U.S. Department of Energy Undersecretary Franklin Orr announced that Intel would be the prime contractor for Aurora: Read more…

By Jan Rowell

Enlisting Deep Learning in the War on Cancer

December 7, 2016

Sometime in Q2 2017 the first ‘results’ of the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) will become publicly available according to Rick Stevens. He leads one of three JDACS4C pilot projects pressing deep learning (DL) into service in the War on Cancer. Read more…

By John Russell

D-Wave SC16 Update: What’s Bo Ewald Saying These Days

November 18, 2016

Tucked in a back section of the SC16 exhibit hall, quantum computing pioneer D-Wave has been talking up its new 2000-qubit processor announced in September. Forget for a moment the criticism sometimes aimed at D-Wave. This small Canadian company has sold several machines including, for example, ones to Lockheed and NASA, and has worked with Google on mapping machine learning problems to quantum computing. In July Los Alamos National Laboratory took possession of a 1000-qubit D-Wave 2X system that LANL ordered a year ago around the time of SC15. Read more…

By John Russell

CPU Benchmarking: Haswell Versus POWER8

June 2, 2015

With OpenPOWER activity ramping up and IBM’s prominent role in the upcoming DOE machines Summit and Sierra, it’s a good time to look at how the IBM POWER CPU stacks up against the x86 Xeon Haswell CPU from Intel. Read more…

By Tiffany Trader

Nvidia Sees Bright Future for AI Supercomputing

November 23, 2016

Graphics chipmaker Nvidia made a strong showing at SC16 in Salt Lake City last week. Read more…

By Tiffany Trader

New Genomics Pipeline Combines AWS, Local HPC, and Supercomputing

September 22, 2016

Declining DNA sequencing costs and the rush to do whole genome sequencing (WGS) of large cohort populations – think 5000 subjects now, but many more thousands soon – present a formidable computational challenge to researchers attempting to make sense of large cohort datasets. Read more…

By John Russell

Beyond von Neumann, Neuromorphic Computing Steadily Advances

March 21, 2016

Neuromorphic computing – brain inspired computing – has long been a tantalizing goal. The human brain does with around 20 watts what supercomputers do with megawatts. And power consumption isn’t the only difference. Fundamentally, brains ‘think differently’ than the von Neumann architecture-based computers. While neuromorphic computing progress has been intriguing, it has still not proven very practical. Read more…

By John Russell

The Exascale Computing Project Awards $39.8M to 22 Projects

September 7, 2016

The Department of Energy’s Exascale Computing Project (ECP) hit an important milestone today with the announcement of its first round of funding, moving the nation closer to its goal of reaching capable exascale computing by 2023. Read more…

By Tiffany Trader

Dell Knights Landing Machine Sets New STAC Records

November 2, 2016

The Securities Technology Analysis Center, commonly known as STAC, has released a new report characterizing the performance of the Knights Landing-based Dell PowerEdge C6320p server on the STAC-A2 benchmarking suite, widely used by the financial services industry to test and evaluate computing platforms. The Dell machine has set new records for both the baseline Greeks benchmark and the large Greeks benchmark. Read more…

By Tiffany Trader

Deep Learning Paves Way for Better Diagnostics

September 19, 2016

Stanford researchers are leveraging GPU-based machines in the Amazon EC2 cloud to run deep learning workloads with the goal of improving diagnostics for a chronic eye disease called diabetic retinopathy. The disease is a complication of diabetes that can lead to blindness if blood sugar is poorly controlled. It affects about 45 percent of diabetics and 100 million people worldwide, many in developing nations. Read more…

By Tiffany Trader
