Compilers and More: MPI+X

By Michael Wolfe

July 16, 2014

At ISC’14, there was intense and continuing interest in the choice of a standard approach for programming the next generation HPC systems. While not guaranteed, many of these systems are likely to be large clusters of nodes with multicore CPUs and some sort of attached accelerators. A standard programming approach is necessary to convince developers, and particularly ISVs, to start adoption now in preparation for this coming generation of systems. John Barr raised the same question in an article at Scientific Computing World from a more philosophical point of view. Here I address this question from a deeper technical perspective.

HPC programming is currently dominated by either a flat model with MPI across nodes as well as cores within a node, or a hybrid model with MPI across the nodes and OpenMP shared memory parallelism across the cores in a node. The advantage of flat MPI is a simpler programming model, only one level of parallelism and only one API. The disadvantage is it doesn’t take advantage of the shared data across the ranks on the same node, requiring message and buffer management across all ranks. MPI+OpenMP roughly inverts those advantages and disadvantages.

The reason MPI and MPI+OpenMP have worked so well over the past 20 years now is that most HPC systems are roughly isomorphic, with some differences in instruction set, node topology, and performance profiles. The system is a network of nodes, the nodes have one or more processors, the processors have one or more cores. The cores on a node share virtual and physical memory, with hardware cache coherence to make shared memory programming relatively safe. There are some outliers, like the big SGI shared memory systems, which have some programming model and performance advantages for certain applications.

John Barr suggests that the options for programming next-generation HPC systems will be MPI+X, where X is one of: OpenMP (if accelerators don’t persist), OpenACC, OpenMP 4+ with device constructs, OpenCL, or CUDA. John is probably right, but I want to add a caveat. We don’t know what HPC systems will look like in a decade. We don’t even know what the HPC systems of the near future will look like. The familiar (I refrain from “simple”) MPI+OpenMP programming model that has served us so well is unlikely to apply to the newest designs five or ten years from now, because the network of homogeneous nodes of homogeneous cores with uniform shared memory is going to change. HPC is at another inflection point, where the drivers are both economic and technological.

Technology is driving designs towards slower cores with more parallelism to make up the performance slack. Parallelism will come in three forms: classical multicore or multiprocessor parallel processes, threads, or tasks; classical vector operations, sometimes recast as SIMD operations; and multithreading used to overlap long latency operations, mostly to tolerate long memory latencies. If your system has enough native bandwidth, multithreading can allow an appropriately organized application to achieve close to peak performance as long as there is another thread ready to execute when one thread stalls on a cache miss. Bandwidth increase is easier to achieve than latency reduction, though it’s still not free.

The economics of HPC leads us to use the best, lowest-cost commodity parts whenever possible. The economic drivers in computing will be mobile and embedded, such as smart phones and tablets. High end workstations will still be important, but vendors are learning that growth is in mobile, and we can expect the biggest investments for new computing parts there. Luckily for us, those parts need increasing performance and continue to require lower power and low cost, so even if painful, a shift from workstation parts to mobile parts may be beneficial.

But the important point is that we don’t know exactly what these systems will look like, and there’s likely to be quite a bit of variability. Some will look like today’s systems, though with higher core counts on each processor chip. Some will have low power chips designed for or inspired by the embedded space; think Blue Gene L/P/Q, Intel Atom/Quark, or ARM. The ratios of parallelism on a node vs. across nodes will differ.

Many will have accelerators of some sort: GPUs (NVIDIA, AMD), manycore (Intel Xeon Phi), or other (TI DSP). The amount and style of parallelism for each kind of accelerator differs.

Memory on today’s accelerators is physically and logically separate from the host memory; that will not always remain so. There is movement to share the same address space between accelerator and host, which allows a kind of shared memory. However, there will still be a performance penalty for accessing the more distant memory, either using remote access or software managed distributed shared memory, like the classical Treadmarks system. In particular, today’s accelerators are designed with a high bandwidth memory; if the host memory doesn’t provide the same bandwidth, the performance will suffer. The same can be true if the host and accelerator share physical memory. Today’s laptops have an integrated graphics unit, often on the processor chip itself; the IGU uses the system memory for the graphics buffer. This is sufficient for most laptop uses (email, presentations), but for interactive games and other challenging graphical applications, you want a discrete GPU with its own memory. Upcoming memory subsystems, such as the Hybrid Memory Cube, may change the balance allowing high bandwidth shared memory, if we’re willing to pay the cost.

So the challenge proposed by John Barr isn’t just what language we should use to write programs, but how will we write programs that will give us satisfactory performance across the variety of systems available today, tomorrow, and over the next decade. Will any of the options he presented suffice? Let’s discuss each of them.

MPI + OpenMP 3.1

OpenMP 3.1 was a minor update to OpenMP 3.0, released in 2011 after three years of development. For nodes with small numbers of homogeneous cores and uniform memory access, OpenMP serves quite well. It’s hard to argue that it needs to be replaced, unless you are using a higher level language than C, C++ or Fortran. However, even those languages are starting to adopt native parallel constructs. Fortran has long had array assignments and a forall construct, expressing vector-style or SIMD parallelism, and now has a do concurrent construct, subsuming much of the behavior of the OpenMP parallel do There is ongoing discussion in the C and C++ language committees about adding OpenMP-style threading or Cilk-style tasking to those languages. Language standards committees move slowly, thankfully, but depending on how these discussions play out, the need for OpenMP directives may diminish in the future.

Advantages: OpenMP 3.1 has the advantage of being relatively high level, and is quite mature. It is supported across essentially all multicore and multiprocessor systems, giving good functional and performance portability, though not necessarily scalability.

Disadvantages: Anyone just starting to port an existing program to a multicore will either really like the simplicity of OpenMP parallel loops, or will really dislike the necessity of annotating each and every parallel loop in the program. The OpenMP 3.1 model of a system is somewhat dated; systems being built now and in the future are no longer so homogeneous. OpenMP is really dependent on long-lived parallel threads, which pervade the execution model and are exposed in the programming model. While OpenMP now has tasks, the tasking model was grafted (quite elegantly, in my opinion) onto the existing OpenMP threading model.

Portability: As mentioned, OpenMP is supported on essentially all multiprocessor and multicore systems used in HPC today, and provides pretty good performance portability across the homogeneous shared memory (low core-count) multicore environment for which it was designed. However, OpenMP 3.1 has no support for heterogeneous or accelerated systems. There is an impressive research project at Purdue University to implement OpenMP on NVIDIA GPUs with some significant compiler and runtime effort. They have produced a prototype compiler and have some impressive results on benchmark programs. However, it’s still in the research stage, and it suffers from trying to hide or virtualize some of the most performance sensitive issues of accelerator programming, such as data movement between host and accelerator memories and the performance of the synchronizations required by OpenMP.

Future: OpenMP 4 adds many new features, including some support for heterogeneous systems and more control to take advantage of data and compute locality. The language is becoming more complex, to adapt to and allow control over a more complex world of parallel systems. Eventually, meaning a decade or more from now, the base languages may support enough native parallelism to obviate the need for OpenMP directives.

MPI + CUDA

CUDA is an explicit programming model for GPU-style accelerators. It exposes many of the architectural features of a GPU, such as the 3D grid and 3D thread block, the software-managed cache (CUDA shared memory), the SIMD-style execution of adjacent threads in the same thread block, the separate data memory on the GPU, and so on. Used the right way, this is a strength of the model and language, since it gives a dedicated developer access to all the power of the GPU. However, it does require significant investment when porting to the GPU. In addition, when moving from one generation of GPU to the next, highly tuned CUDA kernels will often have to be retuned for the newer hardware, increasing code maintenance cost.

Moreover, CUDA today is a single-vendor (NVIDIA) solution. PGI has developed a CUDA-X86 solution, but its performance is not always as good as the corresponding native code. One can envision a CUDA implementation for other GPUs or other devices, but there are no hints of such a development.

Advantages: The CUDA model is relatively mature, over five years old now. It gives control to the programmer over all the relevant features of the GPU. It comes with debugging and performance analysis tools, and is well-supported by NVIDIA. The language and programming model evolve as the capabilities of the GPU evolve. Recently, CUDA added Unified Memory, which makes porting easier by simplifying the memory management and allowing automatic data movement between the host and device memories. CUDA Fortran provides the same programming model for Fortran programmers, with the advantages of using the higher level language features available in Fortran.

Disadvantages: It’s an explicit programming model. Converting an existing application in C, C++ or Fortran to CUDA is a significant rewrite, although such rewrites allow and encourage one to replace inappropriate algorithms and data structures with others more suited to the parallel world.

Portability: CUDA is currently limited to NVIDIA GPUs. There have been efforts to port CUDA programs to parallel execution on CPUs; the CUDA-X86 product uses some of the same techniques. These efforts have some successes, but highly parallel CUDA programs don’t always map all that well to the limited multicore and SIMD parallelism available on CPUs. Moreover, a kernel tuned for one generation of NVIDIA GPU (experimenting with thread block size and shape, unrolling loops) will often have to be retuned for the next generation.

Future: NVIDIA will continue to support and develop CUDA. CUDA will continue to play an important role in GPU programming, especially where the investment pays off handsomely, such as highly tuned libraries or very performance sensitive application kernels.

MPI + OpenCL

OpenCL is essentially the same programming model for GPU-style accelerators as CUDA, but with a lower level platform layer. It exposes the same features of the GPU, but is not NVIDIA-specific. It is also supported on many targets, including DSPs and even x86 and ARM processors. Like CUDA, it gives complete control to the programmer. Unlike CUDA, there is no OpenCL compiler for the host code; the platform API is completely library-based, making it look very verbose, especially for small programs. The difference becomes clear when looking at how a kernel is launched in CUDA and OpenCL. In CUDA C, a kernel launch looks like a procedure call to the kernel, with additional syntax to specify the launch configuration (grid and block sizes):

      kernel<<< gridsize, blocksize >>>( arg0, arg1, ... );

The CUDA compiler translates this into the series of runtime calls to manage the arguments and the actual launch. Since OpenCL doesn’t have such a compiler, a kernel launch becomes a series of explicit runtime calls: one for each argument, and one for the launch:

      clSetKernelArg( kernel_handle, 0, sizeof(arg0), &arg1 );
      clSetKernelArg( kernel_handle, 1, sizeof(arg1), &arg1 );
      ...
      clEnqueueNDRangeKernel( queue_handle, kernel_handle, 1, NULL,
                              &globalsize, &localsize, 0, NULL, NULL );

Other than that, OpenCL programming is potentially just as powerful as CUDA programming.

Given the wider range of target devices implementing OpenCL, there has been quite a discussion about OpenCL and performance portability for the past several years. Use your favorite search engine for the phrase OpenCL performance portability for a sample. A recent article on HPCwire advocates quite well in favor of OpenCL for heterogeneous computing, in part because it accommodates all the available architectures, separately or together. I have to agree with almost everything the article says. As for performance portability, this article states: True, we may need to write a new version of our kernel to get the best performance on Architecture A, but isn’t this what we actually want? The most telling word in that sentence is kernel, singular. In the embedded world, one developer is quite likely to be focused on a single kernel, such as some video processing, or audio encoding, or the like. In that scenario, with only one kernel, then you can afford to tune and retune for each architecture. However, if you have tens of thousands or millions of lines of code to support over the multi-decade lifetime of your application, this is not a productive path.

Advantages: The kernels are written in the same explicit programming model as CUDA, and OpenCL is supported across a wide range of devices. As with CUDA, the OpenCL API continues to evolve, adding features aimed at simplifying programming and supporting larger applications.

Disadvantages: It doesn’t have a standard set of tools across all devices, so debugging and profiling support is sometimes spotty. The language is evolving, as mentioned, but evolves slowly, as any committee-designed language will.

Portability: Since the programming model is lower-level, your kernels have to explicitly take advantage of the features of the target architecture. That means your kernels have to be retuned or rewritten for different targets.

Future: The OpenCL model is more aimed at the embedded or mobile market, where the application writer doesn’t necessarily have details about the target device. Features like dynamic, just-in-time compiling are important in that arena. In HPC, having to recompile every kernel on every node of your cluster every time you run your program quickly sounds like overhead you’d rather avoid. However, I predict that like CUDA, OpenCL can play an important role in programming for GPUs and other accelerators, especially where the investment will pay off handsomely.

MPI + OpenACC

OpenACC is a directive-based programming model targeting a CPU+accelerator system, intentionally similar in many ways to OpenMP. It is designed to do for accelerator programming systems what OpenMP does for multicore systems. It hides or virtualizes those features of the system that can be managed automatically by the system without performance penalty, and exposes those features that must be managed by the programmer. For instance, the programming and execution models expose the presence of separate memories on a host+GPU system, and requires the user to manage data movement between the host and device memories for device-resident data. However, OpenACC virtualizes the variable names, using the same name for the CPU and device copies of the data, resolving them depending on where the name is used, and will even work with no overhead when the device shares memory with the host. The original intent was to eventually merge OpenACC into OpenMP, but the two groups decided on different approaches. In full disclosure, this author is a key member of the OpenACC language committee, and formerly participated in the OpenMP language committee.

OpenACC was designed to focus on parallel algorithms that perform well on the accelerators of interest. Accelerators today need a high degree of multicore parallelism as well as vector or SIMD parallelism. They don’t work very efficiently on scalar code, or on scalar tasks; scalar code will always be more efficient on a high-speed host CPU. This is the primary motivation for a programming model that supports closely coupled CPU+Accelerator systems, which will likely be most efficient on maximizing performance across complete applications.

Advantages: OpenACC has demonstrated support for multiple devices, and there is some initial evidence for performance portability across device types.

Disadvantages: OpenACC is still relatively young, and there are as yet no available open-source implementations of the full language. As yet, there are no implementations that target the host multicore as a device, so it currently must be combined with OpenMP. Some critical features are still under development, and there are some differences in how features are implemented by different compilers, making portability across vendors an issue.

Portability: OpenACC can virtualize many aspects of the architecture, so programs should work well across divergent targets. At the NVIDIA GPU Technology Conference in March, I presented results showing the performance of the SPEC ACCEL OpenACC benchmarks running on NVIDIA Kepler and AMD Radeon GPUs, providing initial evidence of performance portability across devices.

Future: OpenACC version 2.0 includes some important new features, such as defined support for separate compilation, unstructured data lifetimes, and more asynchronous operations; work proceeds on additional critical and important features. There are quite a few large applications being ported to GPUs with OpenACC, and this experience feeds back to the language design and implementations. Expect a great deal of exciting work on OpenACC in the next few years, including commercial and open-source implementations across all HPC target accelerators, as well as host CPUs.

MPI + OpenMP 4.0 Target Directives

One may ask why OpenACC looks so different from OpenMP? Why not just implement OpenMP on GPUs? This is a good question and an important one. I tried to answer this with an articleseveral years ago. Current GPUs, in particular, have some limitations that we should avoid and some capabilities that we should exploit when programming for the best parallel performance.

The OpenMP ARB released version 4.0 of the specification last July, a major revision just two years after releasing 3.1. Among many new features are the device constructs. Some of these constructs mirror those in OpenACC, such as data management with the OpenMP target data and target update constructs. The OpenMP committee chose to add a new level of parallelism to the classical OpenMP threads: teams. In OpenMP 3.1, a parallel region was executed by a team of threads, and parallel loop iterations or tasks were work-shared across the threads in a team. With the OpenMP 4.0 device constructs, the user can now create a league of teams of threads. The addition of teams was to accommodate devices like GPUs which do not support an efficient global barrier synchronization across all the threads in the system. This addition has two unfortunate side effects. First, porting an existing OpenMP program to a device like a GPU is not as easy as simply adding a target directive around the parallel loops. As with OpenACC, the programmer will have to manage the data traffic to the device, but now the programmer has to add new types of parallelism as well. Second, some devices, such as the manycore Intel Xeon Phi Coprocessor, don’t need and probably don’t want this extra level of parallelism; OpenMP can be implemented natively on that target. This means that tuning for one target may differ significantly from tuning for another target, making performance portability a real challenge. I fear that the OpenMP committee has made some unfortunate decisions that will be hard to fix.

Intel supports many of the OpenMP 4 device constructs, and presumably will have a conformant implementation for the Xeon Phi Coprocessor. Texas Instruments has also announced plans to support OpenMP 4 on some ARM+DSP heterogeneous SoCs. The new Convey MX computer has support for hybrid OpenMP that matches OpenMP 4 features. Cray has announced plans to deliver OpenMP 4 for their systems as well, including Xeon Phi and Kepler accelerators, though they don’t promise performance portability. The other OpenMP vendor members have yet to announce support for the device constructs.

Advantages: For those targets that can implement full OpenMP, the OpenMP 4 device constructs look like a simpler method to port an existing OpenMP program. There is still the requirement to manage data transfers, but that’s the case on any of these programming models.

Disadvantages: For targets that aren’t designed to implement OpenMP, it’s not clear whether good performance can be delivered. No prototype implementations of the teams construct were available. Performance portability is going to be a challenge, though, as with OpenCL, language portability is still a big step in the right direction. Parts of the current OpenMP 4 device constructs are not well specified and need more careful definition.

Portability: It’s too early to say for certain, but I fear the OpenMP device constructs are too prescriptive to support performance portability across the wide variety of accelerator architectures being used now and designed for the future. If a programmer has to write programs differently for each target, then the language loses many of its advantages.

Future: OpenMP 4.0 is still relatively new and has many new features beyond the device constructs. It will be some time before we have several mature complete implementations. The OpenMP ARB seems committed to more rapid exploration and adoption of new features across the spectrum of parallel programming. The device constructs will be expanded or contracted as necessary to address the targets of interest. Only if the OpenMP device constructs result in good performance on the host as well as DSPs, manycores, and GPUs, will it become the language of choice.

Any Others?

John Barr left off two other heterogeneous programming languages that have been bandied about, important if only because of the large company promoting each. Google Renderscript uses a kernels computation model and managed memory allocations (like OpenCL kernels and buffers). Unlike OpenCL, Renderscript is optimized for kernels that access graphics-like data structures: rectangular arrays of structs. It’s pretty hard to get the kind of random indexed addressing that is available in OpenCL or CUDA. The runtime can launch kernels on any appropriate available device, such as a CPU or GPU, moving data as required to make it accessible on that device.

Microsoft C++AMP (Accelerated Massive Parallelism) is a pretty sophisticated C++-specific solution, aiming at the sophisticated C++ programmer (or perhaps all C++ programmers are sophisticated), using C++ templates and lambda expressions for data and compute. It introduces a very rich parallel_for_each function that takes a computation domain and kernel function as arguments, and runs that function across the parallel domain. Like Renderscript and OpenCL, the runtime manages the memory movement; it can also decide whether to run a kernel on the host or the GPU. Currently it is available only from Microsoft, though there are one or two experimental implementations from outside Redmond.

Conclusions

As John Barr argues, the tools and programming methodologies for upcoming HPC systems will change as the architectures evolve. Resolving to a high level portable heterogeneous programming strategy will attract more users early in this evolution. Before MPI, there were several message passing libraries, some of which were vendor-specific. Settling on MPI allowed applications to port across cluster architectures without rewrites, and allowed vendors to focus on innovation at the architecture and implementation level. It also limited innovation on the message passing libraries. Before OpenMP, there were several sets of high level directives and low-level threading libraries for programming multiprocessor workstations and cluster nodes. Settling on OpenMP similarly allowed applications to port across multiprocessor and eventually multicore architectures without rewrites, allowing vendors to focus on innovation at the architecture and implementation level. It also limited innovation at the programming language level.

To program heterogeneous systems, we want to settle on a high level programming strategy. It should be as target agnostic as possible. In particular, it must not require a user to write a different program for different targets, and is must not require a user to write a different program for heterogeneous systems from homogeneous systems. We’re going to always require low-level or explicit programming as well, and CUDA and OpenCL will fit this bill nicely for appropriate devices. At the high level, the obvious possibilities for X are OpenACC or OpenMP; the eventual choice will be made by HPC users. If the OpenMP device constructs mature to better support performance portable programming across the wide variety of devices, it may become the best option. If OpenACC demonstrates portability and more generality, it should be the language of choice. There’s still hope that the two specifications will converge. Users will eventually decide based on how well each language supports the variety of targets in terms of ease of use and features, but most importantly, performance and performance portability.

Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industy updates delivered to you every week!

ISC19 Cluster Competition: Application Results, Finally!

July 15, 2019

Our exhaustive coverage of the ISC19 Student Cluster Competition continues as we discuss the application scores below. While the scores were typically high, some of the apps, like SWIFT and OpenFOAM, really pushed the st Read more…

By Dan Olds

Portugal Launches Its First Supercomputer

July 12, 2019

Portugal has officially inaugurated its first-ever supercomputer. The unassumingly named “Bob” supercomputer is housed in the Minho Advanced Computer Center (MACC) at the University of Minho.  Bob was announced i Read more…

By Oliver Peckham

What’s New in HPC Research: Traffic Simulation, Performance Variations, Scheduling & More

July 11, 2019

In this bimonthly feature, HPCwire highlights newly published research in the high-performance computing community and related domains. From parallel programming to exascale to quantum computing, the details are here. Read more…

By Oliver Peckham

HPE Extreme Performance Solutions

HPE and Intel® Omni-Path Architecture: How to Power a Cloud

Learn how HPE and Intel® Omni-Path Architecture provide critical infrastructure for leading Nordic HPC provider’s HPCFLOW cloud service.

For decades, HPE has been at the forefront of high-performance computing, and we’ve powered some of the fastest and most robust supercomputers in the world. Read more…

IBM Accelerated Insights

How AI Powers Up Data Management and Analytics

Companies are making more decisions based on data. However, the ability to intelligently process the growing volume of data is a bottleneck to extracting actionable insights. Read more…

Nvidia Expands DGX-Ready AI Program to 19 Countries

July 11, 2019

Nvidia’s DGX-Ready Data Center Program, announced in January and designed to provide colo and public cloud-like options to access the company’s GPU-powered servers for AI workloads, has expanded the program beyond th Read more…

By Doug Black

ISC19 Cluster Competition: Application Results, Finally!

July 15, 2019

Our exhaustive coverage of the ISC19 Student Cluster Competition continues as we discuss the application scores below. While the scores were typically high, som Read more…

By Dan Olds

Nvidia Expands DGX-Ready AI Program to 19 Countries

July 11, 2019

Nvidia’s DGX-Ready Data Center Program, announced in January and designed to provide colo and public cloud-like options to access the company’s GPU-powered Read more…

By Doug Black

Argonne Team Makes Record Globus File Transfer

July 10, 2019

A team of scientists at Argonne National Laboratory has broken a data transfer record by moving a staggering 2.9 petabytes of data for a research project.  The data – from three large cosmological simulations – was generated and stored on the Summit supercomputer at the Oak Ridge Leadership Computing Facility (OLCF)... Read more…

By Oliver Peckham

Nvidia, Google Tie in Second MLPerf Training ‘At-Scale’ Round

July 10, 2019

Results for the second round of the AI benchmarking suite known as MLPerf were published today with Google Cloud and Nvidia each picking up three wins in the at Read more…

By Tiffany Trader

Applied Materials Embedding New Memory Technologies in Chips

July 9, 2019

Applied Materials, the $17 billion Santa Clara-based materials engineering company for the semiconductor industry, today announced manufacturing systems enablin Read more…

By Doug Black

ISC19 Cluster Competition: HPCC Deep Dive

July 7, 2019

The biggest benchmark the student warriors tackled during the ISC19 Student Cluster Competition was the colossal HPC Challenge. This is a collection of benchmar Read more…

By Dan Olds

OLCF Bids Farewell to Its Titan Supercomputer

July 4, 2019

After seven years of faithful service, and a long reign as the United States' fastest supercomputer, the Cray XK7-based Titan supercomputer at the Oak Ridge Lea Read more…

By Staff report

Quantum Bits: Neven’s Law (Who Asked for That), D-Wave’s Steady Push, IBM’s Li-O2- Simulation

July 3, 2019

Quantum computing’s (QC) many-faceted R&D train keeps slogging ahead and recently Japan is taking a leading role. Yesterday D-Wave Systems announced it ha Read more…

By John Russell

High Performance (Potato) Chips

May 5, 2006

In this article, we focus on how Procter & Gamble is using high performance computing to create some common, everyday supermarket products. Tom Lange, a 27-year veteran of the company, tells us how P&G models products, processes and production systems for the betterment of consumer package goods. Read more…

By Michael Feldman

Cray, AMD to Extend DOE’s Exascale Frontier

May 7, 2019

Cray and AMD are coming back to Oak Ridge National Laboratory to partner on the world’s largest and most expensive supercomputer. The Department of Energy’s Read more…

By Tiffany Trader

Graphene Surprises Again, This Time for Quantum Computing

May 8, 2019

Graphene is fascinating stuff with promise for use in a seeming endless number of applications. This month researchers from the University of Vienna and Institu Read more…

By John Russell

AMD Verifies Its Largest 7nm Chip Design in Ten Hours

June 5, 2019

AMD announced last week that its engineers had successfully executed the first physical verification of its largest 7nm chip design – in just ten hours. The AMD Radeon Instinct Vega20 – which boasts 13.2 billion transistors – was tested using a TSMC-certified Calibre nmDRC software platform from Mentor. Read more…

By Oliver Peckham

It’s Official: Aurora on Track to Be First US Exascale Computer in 2021

March 18, 2019

The U.S. Department of Energy along with Intel and Cray confirmed today that an Intel/Cray supercomputer, "Aurora," capable of sustained performance of one exaf Read more…

By Tiffany Trader

TSMC and Samsung Moving to 5nm; Whither Moore’s Law?

June 12, 2019

With reports that Taiwan Semiconductor Manufacturing Co. (TMSC) and Samsung are moving quickly to 5nm manufacturing, it’s a good time to again ponder whither goes the venerable Moore’s law. Shrinking feature size has of course been the primary hallmark of achieving Moore’s law... Read more…

By John Russell

Deep Learning Competitors Stalk Nvidia

May 14, 2019

There is no shortage of processing architectures emerging to accelerate deep learning workloads, with two more options emerging this week to challenge GPU leader Nvidia. First, Intel researchers claimed a new deep learning record for image classification on the ResNet-50 convolutional neural network. Separately, Israeli AI chip startup Hailo.ai... Read more…

By George Leopold

Nvidia Embraces Arm, Declares Intent to Accelerate All CPU Architectures

June 17, 2019

As the Top500 list was being announced at ISC in Frankfurt today with an upgraded petascale Arm supercomputer in the top third of the list, Nvidia announced its Read more…

By Tiffany Trader

Leading Solution Providers

ISC 2019 Virtual Booth Video Tour

CRAY
CRAY
DDN
DDN
DELL EMC
DELL EMC
GOOGLE
GOOGLE
ONE STOP SYSTEMS
ONE STOP SYSTEMS
PANASAS
PANASAS
VERNE GLOBAL
VERNE GLOBAL

Why Nvidia Bought Mellanox: ‘Future Datacenters Will Be…Like High Performance Computers’

March 14, 2019

“Future datacenters of all kinds will be built like high performance computers,” said Nvidia CEO Jensen Huang during a phone briefing on Monday after Nvidia revealed scooping up the high performance networking company Mellanox for $6.9 billion. Read more…

By Tiffany Trader

Top500 Purely Petaflops; US Maintains Performance Lead

June 17, 2019

With the kick-off of the International Supercomputing Conference (ISC) in Frankfurt this morning, the 53rd Top500 list made its debut, and this one's for petafl Read more…

By Tiffany Trader

Intel Launches Cascade Lake Xeons with Up to 56 Cores

April 2, 2019

At Intel's Data-Centric Innovation Day in San Francisco (April 2), the company unveiled its second-generation Xeon Scalable (Cascade Lake) family and debuted it Read more…

By Tiffany Trader

Cray – and the Cray Brand – to Be Positioned at Tip of HPE’s HPC Spear

May 22, 2019

More so than with most acquisitions of this kind, HPE’s purchase of Cray for $1.3 billion, announced last week, seems to have elements of that overused, often Read more…

By Doug Black and Tiffany Trader

A Behind-the-Scenes Look at the Hardware That Powered the Black Hole Image

June 24, 2019

Two months ago, the first-ever image of a black hole took the internet by storm. A team of scientists took years to produce and verify the striking image – an Read more…

By Oliver Peckham

Announcing four new HPC capabilities in Google Cloud Platform

April 15, 2019

When you’re running compute-bound or memory-bound applications for high performance computing or large, data-dependent machine learning training workloads on Read more…

By Wyatt Gorman, HPC Specialist, Google Cloud; Brad Calder, VP of Engineering, Google Cloud; Bart Sano, VP of Platforms, Google Cloud

Chinese Company Sugon Placed on US ‘Entity List’ After Strong Showing at International Supercomputing Conference

June 26, 2019

After more than a decade of advancing its supercomputing prowess, operating the world’s most powerful supercomputer from June 2013 to June 2018, China is keep Read more…

By Tiffany Trader

In Wake of Nvidia-Mellanox: Xilinx to Acquire Solarflare

April 25, 2019

With echoes of Nvidia’s recent acquisition of Mellanox, FPGA maker Xilinx has announced a definitive agreement to acquire Solarflare Communications, provider Read more…

By Doug Black

  • arrow
  • Click Here for More Headlines
  • arrow
Do NOT follow this link or you will be banned from the site!
Share This