Compilers and More: Accelerated Programming

By Michael Wolfe, PGI

December 3, 2013

I have just returned from SC13, where one burning issue was the choice of a standard approach for programming the next generation of HPC systems. While not guaranteed, these systems are likely to be large clusters of nodes with multicore CPUs and some sort of attached accelerators. A standard programming approach is necessary to convince developers, and particularly ISVs, to start adoption now in preparation for this coming generation of systems. John Barr raised the same question in a recent article at Scientific Computing World from a more philosophical point of view. Here I address this question from a deeper technical perspective.

HPC programming is currently dominated by either a flat model with MPI across nodes as well as cores within a node, or a hybrid model with MPI across the nodes and OpenMP shared memory parallelism across the cores in a node. The advantage of flat MPI is a simpler programming model, only one level of parallelism and only one API. The disadvantage is it doesn’t take advantage of the shared data across the ranks on the same node, requiring message and buffer management across all ranks. MPI+OpenMP roughly inverts those advantages and disadvantages.

The reason MPI and MPI+OpenMP have worked so well over the past 20 years is that most HPC systems are roughly isomorphic, with some differences in instruction set, node topology, and performance profiles. The system is a network of nodes, the nodes have one or more processors, the processors have one or more cores. The cores on a node share virtual and physical memory, with hardware cache coherence to make shared memory programming relatively safe. There are some outliers, like the big SGI shared memory systems, which have some programming model and performance advantages for certain applications.

John Barr suggests that the options for programming next-generation HPC systems will be MPI+X, where X is one of: OpenMP (if accelerators don’t), OpenACC, OpenMP 4.0+ with device constructs, OpenCL, or CUDA. John is probably right, but I want to add a caveat. We don’t know what HPC systems will look like in a decade. We don’t even know what the HPC systems of the near future will look like. The familiar (I hesitate to say “simple”) MPI+OpenMP programming model that has served us so well is unlikely to apply to the newest designs five or ten years from now, because the network of homogeneous nodes of homogeneous cores with uniform shared memory is going to change. HPC is at another inflection point, where the drivers are both economic and technological.

Technology is driving designs towards slower cores with more parallelism to make up the performance slack. Parallelism will come in three forms: classical multicore or multiprocessor parallel processes, threads, or tasks; classical vector operations, sometimes recast as SIMD operations; and multithreading used to overlap long latency operations, mostly to tolerate long memory latencies. If your system has enough native bandwidth, multithreading can allow an appropriately organized application to achieve close to peak performance as long as there is another thread ready to execute when one thread stalls on a cache miss. Bandwidth increase is easier to achieve than latency reduction, though it’s still not free.
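As a rough illustration of the latency-hiding argument, consider a deliberately simplified model. The model is an assumption of this sketch, not a description of any particular processor: each thread computes for a fixed number of cycles, then stalls for a fixed memory latency, and the core switches threads at no cost. The hypothetical helper `threads_to_hide_latency` estimates how many resident threads are needed to keep one core busy under those assumptions.

```c
/* Simplified latency-hiding model (illustrative assumption only):
 * every thread alternates between compute_cycles of useful work and a
 * stall of latency_cycles on a cache miss; thread switches are free. */
int threads_to_hide_latency(int compute_cycles, int latency_cycles)
{
    if (compute_cycles <= 0 || latency_cycles < 0)
        return -1;                      /* invalid input */
    /* While one thread is stalled, the other threads must supply
     * latency_cycles of work between them: ceil(latency / compute)
     * extra threads, plus the stalled thread itself. */
    int extra = (latency_cycles + compute_cycles - 1) / compute_cycles;
    return 1 + extra;
}
```

With 20 cycles of work between misses and a 400-cycle miss latency, this model calls for 21 threads per core. Real hardware will differ, since the model ignores bandwidth limits and uneven miss rates, but it shows why latency tolerance demands far more parallelism than the core count alone suggests.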

The economics of HPC leads us to use the best, lowest-cost commodity parts whenever possible. The economic drivers in computing will be mobile and embedded, such as smart phones and tablets. High end workstations will still be important, but vendors are learning that growth is in mobile, and we can expect the biggest investments for new computing parts there. Luckily for us, those parts need increasing performance and continue to require lower power and low cost, so even if painful, a shift from workstation parts to mobile parts may be beneficial.

But the important point is that we don’t know exactly what these systems will look like, and there’s likely to be quite a bit of variability. Some will look like today’s systems, though with higher core counts on each processor chip. Some will have low power chips designed for or inspired by the embedded space; think Blue Gene L/P/Q, Intel Atom/Quark, or ARM. The ratios of parallelism on a node vs. across nodes will differ.

Many will have accelerators of some sort: GPUs (NVIDIA, AMD), manycore (Intel Xeon Phi), or other (TI DSP). The amount and style of parallelism for each kind of accelerator differs.

Memory on today’s accelerators is physically and logically separate from the host memory; that will not always remain so. There is movement to share the same address space between accelerator and host, which allows a kind of shared memory. However, there will still be a performance penalty for accessing the more distant memory, either using remote access or software managed distributed shared memory, like the classical Treadmarks system. In particular, today’s accelerators are designed with a high bandwidth memory; if the host memory doesn’t provide the same bandwidth, the performance will suffer. The same can be true if the host and accelerator share physical memory. Today’s laptops have an integrated graphics unit, often on the processor chip itself; the IGU uses the system memory for the graphics buffer. This is sufficient for most laptop uses (email, presentations), but for interactive games and other challenging graphical applications, you want a discrete GPU with its own memory. Future memory subsystems, such as the Hybrid Memory Cube, may change the balance allowing high bandwidth shared memory, if we’re willing to pay the cost.

So the challenge proposed by John Barr isn’t just what language we should use to write programs, but how will we write programs that will give us satisfactory performance across the variety of systems available today, tomorrow, and over the next decade. Will any of the options he presented suffice? Let’s discuss each of them.

MPI + OpenMP 3.1

OpenMP 3.1 was a minor update to OpenMP 3.0, released in 2011 after three years of development. For nodes with small numbers of homogeneous cores and uniform memory access, OpenMP serves quite well. It’s hard to argue that it needs to be replaced, unless you are using a higher level language than C, C++ or Fortran. However, even those languages are starting to adopt native parallel constructs. Fortran has long had array assignments and a forall construct, expressing vector-style or SIMD parallelism, and now has a do concurrent construct, subsuming much of the behavior of the OpenMP parallel do. There is ongoing discussion in the C and C++ language committees about adding OpenMP-style threading or Cilk-style tasking to those languages. Language standards committees move slowly, thankfully, but depending on how these discussions play out, the need for OpenMP directives may diminish in the future.

Advantages: OpenMP 3.1 has the advantage of being relatively high level, and is quite mature. It is supported across essentially all multicore and multiprocessor systems, giving good functional and performance portability, though not necessarily scalability.

Disadvantages: Anyone just starting to port an existing program to a multicore will either really like the simplicity of OpenMP parallel loops, or will really dislike the necessity of annotating each and every parallel loop in the program. The OpenMP 3.1 model of a system is somewhat dated; systems being built now and in the future are no longer so homogeneous. OpenMP is really dependent on long-lived parallel threads, which pervade the execution model and are exposed in the programming model. While OpenMP now has tasks, the tasking model was grafted (quite elegantly, in my opinion) onto the existing OpenMP threading model.
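For readers who have not seen it, here is a minimal sketch of what annotating parallel loops looks like in C. The function names `saxpy_omp` and `scale_omp` are hypothetical, chosen only for illustration. A compiler without OpenMP enabled simply ignores the directives and runs the loops serially.

```c
/* y = a*x + y. The directive asks the OpenMP runtime to divide the
 * loop iterations among the threads of a parallel team. */
void saxpy_omp(int n, float a, const float *x, float *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* A second loop needs its own directive: nothing carries over from
 * the loop above, so each parallel loop is annotated separately. */
void scale_omp(int n, float a, float *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] *= a;
}
```

Both the appeal and the burden are visible here: one line per loop is all it takes, but every parallel loop needs that line.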

Portability: As mentioned, OpenMP is supported on essentially all multiprocessor and multicore systems used in HPC today, and provides pretty good performance portability across the homogeneous shared memory (low core-count) multicore environment for which it was designed. However, OpenMP 3.1 has no support for heterogeneous or accelerated systems. There is an impressive research project at Purdue University to implement OpenMP on NVIDIA GPUs with some significant compiler and runtime effort. They have produced a prototype compiler and have some impressive results on benchmark programs. However, it’s still in the research stage, and it suffers from trying to hide or virtualize some of the most performance sensitive issues of accelerator programming.

Future: OpenMP 4.0 adds many new features, including some support for heterogeneous systems and more control to take advantage of data and compute locality. The language is becoming more complex, to adapt to and allow control over a more complex world of parallel systems. Eventually, meaning a decade or more from now, the base languages may support enough native parallelism to obviate the need for OpenMP directives.

MPI + CUDA

CUDA is a lower-level programming model for GPU-style accelerators. It exposes many of the architectural features of a GPU, such as the 3D grid and 3D thread block, the software-managed cache (CUDA shared memory), the SIMD-style execution of adjacent threads in the same thread block, the separate data memory on the GPU, and so on. Used the right way, this is a strength of the model and language, since it gives a dedicated developer access to all the power of the GPU. However, it is a challenging programming model and does require significant investment when porting to the GPU. In addition, when moving from one generation of GPU to the next, highly tuned CUDA kernels will often have to be retuned for the newer hardware, increasing code maintenance cost.
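To make the grid and thread block hierarchy concrete, here is a host-side C sketch that emulates a one-dimensional launch serially. It is not CUDA code. The names `kernel_fn`, `scale_kernel`, `blocks_for` and `launch_1d_emulated` are hypothetical, and the nested loops merely enumerate the (block, thread) pairs that a GPU would execute in parallel.

```c
/* Host-side emulation of a 1D CUDA-style launch (illustrative only).
 * block_idx, block_dim and thread_idx play the roles of CUDA's
 * blockIdx.x, blockDim.x and threadIdx.x. */
typedef void (*kernel_fn)(int block_idx, int block_dim, int thread_idx,
                          int n, float *data);

/* "Kernel": each logical thread handles at most one element. */
void scale_kernel(int block_idx, int block_dim, int thread_idx,
                  int n, float *data)
{
    int i = block_idx * block_dim + thread_idx;  /* global index */
    if (i < n)            /* guard: the last block may be partly idle */
        data[i] *= 2.0f;
}

/* Number of blocks needed to cover n elements, rounding up. */
int blocks_for(int n, int block_dim)
{
    return (n + block_dim - 1) / block_dim;
}

/* Serial stand-in for a launch with grid_dim blocks of block_dim threads. */
void launch_1d_emulated(kernel_fn kernel, int grid_dim, int block_dim,
                        int n, float *data)
{
    for (int b = 0; b < grid_dim; ++b)
        for (int t = 0; t < block_dim; ++t)
            kernel(b, block_dim, t, n, data);
}
```

A call such as `launch_1d_emulated(scale_kernel, blocks_for(n, 4), 4, n, data)` covers all `n` elements. The block size passed in is exactly the kind of parameter that tends to need retuning from one GPU generation to the next.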

Moreover, CUDA today is a single-vendor (NVIDIA) solution. PGI has developed a CUDA-X86 solution, but its performance is not as good as the corresponding native code. One can envision a CUDA implementation for other GPUs or other devices, but I’ve seen no hints of such a development.

Advantages: The CUDA toolkit is relatively mature, over five years old now. It gives control to the programmer over all the relevant features of the GPU. It comes with debugging and performance analysis tools, and is well-supported by NVIDIA. The language and programming model evolve as the capabilities of the GPU evolve. CUDA Fortran provides the same programming model for Fortran programmers, with the advantages of using the higher level language features available in Fortran.

Disadvantages: It’s a lower-level programming model. Converting an existing application in C, C++ or Fortran to CUDA is a significant rewrite, although such rewrites allow and encourage one to replace inappropriate algorithms and data structures with others more suited to the parallel world.

Portability: CUDA is currently limited to NVIDIA GPUs. There have been efforts to port CUDA programs to parallel execution on CPUs; the CUDA-X86 product uses some of the same techniques. These efforts have some successes, but highly parallel CUDA programs don’t always map all that well to the limited multicore and SIMD parallelism available on CPUs. Moreover, a kernel tuned for one generation of NVIDIA GPU (experimenting with thread block size and shape, unrolling loops) will often have to be retuned for the next generation.

Future: NVIDIA will continue to support and develop CUDA. CUDA will likely play an important role in GPU programming, especially where the investment pays off handsomely, such as highly tuned libraries or very performance sensitive application kernels.

MPI + OpenCL

OpenCL is essentially the same programming model for GPU-style accelerators as CUDA. It exposes the same features of the GPU, but is not NVIDIA-specific. It is also supported on many targets, including DSPs and even x86 and ARM processors. Like CUDA, it gives complete control to the programmer. Unlike CUDA, there is no OpenCL compiler for the host code; the platform API is completely library-based, making it look very verbose, especially for small programs. The difference becomes clear when looking at how a kernel is launched in CUDA and OpenCL. In CUDA C, a kernel launch looks like a procedure call to the kernel, with additional syntax to specify the launch configuration (grid and block sizes):

      kernel<<< gridsize, blocksize >>>( arg0, arg1, ... );

The CUDA compiler translates this into the series of runtime calls to manage the arguments and the actual launch. Since OpenCL doesn’t have such a compiler, a kernel launch becomes a series of explicit runtime calls: one for each argument, and one for the launch:

      clSetKernelArg( kernel_handle, 0, sizeof(arg0), &arg0 );
      clSetKernelArg( kernel_handle, 1, sizeof(arg1), &arg1 );
      ...
      clEnqueueNDRangeKernel( queue_handle, kernel_handle, 1, NULL,
                              &globalsize, &localsize, 0, NULL, NULL );

Other than that, OpenCL programming is potentially just as powerful as CUDA programming.

Given the wider range of target devices implementing OpenCL, there has been quite a discussion about OpenCL and performance portability for the past several years. Use your favorite search engine for the phrase “OpenCL performance portability” for a sample. A recent article on HPCwire advocates quite well in favor of OpenCL for heterogeneous computing, in part because it accommodates all the available architectures, separately or together. I have to agree with almost everything the article says. As for performance portability, this article states: “True, we may need to write a new version of our kernel to get the best performance on Architecture A, but isn’t this what we actually want?” The most telling word in that sentence is “kernel,” singular. In the embedded world, one developer is quite likely to be focused on a single kernel, such as some video processing, or audio encoding, or the like. In that scenario, with only one kernel, you can afford to tune and retune for each architecture. However, if you have tens of thousands or millions of lines of code to support over the multi-decade lifetime of your application, this is not a productive path.

Advantages: The kernels are written in the same explicit programming model as CUDA, and OpenCL is supported across a wide range of devices.

Disadvantages: It doesn’t have a standard set of tools across all devices, so debugging and profiling support is sometimes spotty. The language is evolving, but slowly, as any committee-designed language will.

Portability: Since the programming model is lower-level, your kernels have to explicitly take advantage of the features of the target architecture. That means your kernels have to be retuned or rewritten for different targets.

Future: The OpenCL model is more aimed at the embedded or mobile market, where the application writer doesn’t necessarily have details about the target device. Features like dynamic, just-in-time compiling are important in that arena. In HPC, having to recompile every kernel on every node of your cluster every time you run your program quickly sounds like overhead you’d rather avoid. However, I predict that like CUDA, OpenCL can play an important role in programming for GPUs and other accelerators, especially where the investment will pay off handsomely.

MPI + OpenACC

OpenACC is a directive-based programming model targeting a CPU+accelerator system, intentionally similar in many ways to OpenMP. It is designed to do for accelerator programming systems what OpenMP does for multicore systems. It hides or virtualizes those features of the system that can be managed automatically by the system without performance penalty, and exposes those features that must be managed by the programmer. For instance, the programming and execution models expose the presence of separate memories on a host+GPU system, and require the user to manage data movement between the host and device memories for device-resident data. However, OpenACC virtualizes the variable names, using the same name for the CPU and device copies of the data, resolving them depending on where the name is used, and will even work with no overhead when the device shares memory with the host. The original intent was to eventually merge OpenACC into OpenMP, but the two groups decided on different approaches. In full disclosure, this author is a key member of the OpenACC language committee, and formerly participated in the OpenMP language committee.
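A small sketch may help show what is exposed and what is virtualized. The function name `saxpy_acc` is hypothetical, and the clause choices are one reasonable option rather than the only one. A compiler without OpenACC support ignores the directives and runs the loop on the host.

```c
/* y = a*x + y with OpenACC directives (a sketch, not a tuned kernel). */
void saxpy_acc(int n, float a, const float *x, float *y)
{
    /* Exposed: the data region states the data movement. x is copied
     * to the device on entry; y is copied to the device on entry and
     * back to the host on exit. */
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        /* Virtualized: inside the compute region the same names x and
         * y refer to the device copies of the arrays. */
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }
}
```

The loop body is unchanged from the serial version; only the directives mention the device.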

OpenACC was designed to focus on parallel algorithms that perform well on the accelerators of interest. Accelerators today need a high degree of multicore parallelism as well as vector or SIMD parallelism. They don’t work very efficiently on scalar code, or on scalar tasks; scalar code will always be more efficient on a high-speed host CPU. This is the primary motivation for a programming model that supports closely coupled CPU+Accelerator systems, which will likely be most effective at maximizing performance across complete applications.

Advantages: OpenACC has demonstrated support for multiple devices, and there is some initial evidence for performance portability across device types.

Disadvantages: OpenACC is still relatively young, and there are as yet no available open-source implementations of the full language. As yet, there are no implementations that target the host multicore as a device, so it currently must be combined with OpenMP. Some critical features are still under development, and there are some differences in how features are implemented by different compilers, making portability across vendors an issue.

Portability: OpenACC can virtualize many aspects of the architecture, so programs should work well across divergent targets. There is some initial evidence of performance portability across devices, but no convincing proof as yet.

Future: OpenACC version 2.0 was just released, and work proceeds on additional critical and important features. There are quite a few large applications being ported to GPUs with OpenACC, and this experience feeds back to the language design and implementations. Expect a great deal of exciting work on OpenACC in the next few years, including commercial and open-source implementations across all HPC target accelerators, as well as host CPUs.

MPI + OpenMP 4.0 Target Directives

One may ask: why does OpenACC look so different from OpenMP? Why not just implement OpenMP on GPUs? This is a good question and an important one. I tried to answer this with an article several years ago. GPUs, in particular, have some limitations that we should avoid and some capabilities that we should exploit when programming for the best parallel performance.

The OpenMP ARB released version 4.0 of the specification this past summer, a major revision just two years after 3.1. Among many new features are the device constructs. Some of these constructs mirror those in OpenACC, such as data management with the OpenMP target data and target update constructs. The OpenMP committee chose to add a new level of parallelism to the classical OpenMP threads: teams. In OpenMP 3.1, a parallel region was executed by a team of threads, and parallel loop iterations or tasks were work-shared across the threads in a team. With the OpenMP 4.0 device constructs, the user can now create a league of teams of threads. The addition of teams was to accommodate devices like GPUs which do not support an efficient global barrier synchronization across all the threads in the system. This addition has two unfortunate side effects. First, porting an existing OpenMP program to a device like a GPU is not as easy as simply adding a target directive around the parallel loops. As with OpenACC, the programmer will have to manage the data traffic to the device, but now the programmer has to add new types of parallelism as well. Second, some devices, such as the manycore Intel Xeon Phi Coprocessor, don’t need and probably don’t want this extra level of parallelism; OpenMP can be implemented natively on that target. This means that tuning for one target may differ significantly from tuning for another target, making performance portability a real challenge. I fear that the OpenMP committee has made some unfortunate decisions that will be hard to fix.
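The difference between the two styles can be sketched on a single loop. The function names `saxpy_host` and `saxpy_target` are hypothetical, and the directive combination shown is one plausible way to write the loop for a GPU-like device, not a recommendation. Without OpenMP enabled, both functions run serially on the host.

```c
/* Host-only OpenMP 3.1 style: one team of threads shares the loop. */
void saxpy_host(int n, float a, const float *x, float *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* OpenMP 4.0 device style. The map clauses manage data traffic to and
 * from the device; teams creates a league of teams; distribute spreads
 * the iterations across the teams; parallel for shares each team's
 * chunk among that team's threads. */
void saxpy_target(int n, float a, const float *x, float *y)
{
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The second version adds both the data mapping and the extra teams level. On a target that implements OpenMP natively, the teams and distribute parts may be unnecessary, which is the tuning divergence described above.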

Intel has been a major supporter of the OpenMP 4.0 device constructs, and presumably will have a conformant implementation for the Xeon Phi Coprocessor. Texas Instruments has also announced plans to support OpenMP 4.0 on some ARM+DSP heterogeneous SoCs. The new Convey MX computer has support for hybrid OpenMP that matches OpenMP 4.0 features. Cray has announced plans to deliver OpenMP 4.0 for their systems as well, including Xeon Phi and Kepler accelerators, though they don’t promise performance portability. The other OpenMP vendor members have yet to announce support for the device constructs.

Advantages: For those targets that can implement full OpenMP, the OpenMP 4.0 device constructs look like a simpler method to port an existing OpenMP program. There is still the requirement to manage data transfers, but that’s the case on any of these programming models.

Disadvantages: For targets that aren’t designed to implement OpenMP, it’s not clear whether good performance can be delivered. No prototype implementations of the teams construct were available. Performance portability is going to be a challenge, though, as with OpenCL, language portability is still a big step in the right direction. Parts of the current 4.0 device constructs are not well specified and need more careful definition.

Portability: It’s too early to say for certain, but I fear the OpenMP device constructs are too prescriptive to support performance portability across the wide variety of accelerator architectures being used now and designed for the future. If a programmer has to write programs differently for each target, then the language loses many of its advantages.

Future: OpenMP 4.0 is still new and has many new features beyond the device constructs. It will be some time before we have several mature complete implementations. The OpenMP group seems committed to more rapid exploration and adoption of new features across the spectrum of parallel programming. The device constructs will be expanded or contracted as necessary to address the targets of interest. Only if the OpenMP device constructs result in good performance on the host as well as DSPs, manycores, and GPUs, will it become the language of choice.

Any Others?

John Barr left off two other heterogeneous programming languages that have been bandied about, important if only because of the large company promoting each. Google Renderscript uses a kernels computation model and managed memory allocations (like OpenCL kernels and buffers). Unlike OpenCL, Renderscript is optimized for kernels that access graphics-like data structures: rectangular arrays of structs. It’s pretty hard to get the kind of random indexed addressing that is available in OpenCL or CUDA. The runtime can launch kernels on any appropriate available device, such as a CPU or GPU, moving data as required to make it accessible on that device. Microsoft C++AMP (Accelerated Massive Parallelism) is a pretty sophisticated C++-specific solution, aiming at the sophisticated C++ programmer (perhaps all C++ programmers are sophisticated), using C++ templates and lambda functions for data and compute. It introduces a very rich parallel_for_each function that takes a computation domain and kernel function as arguments, and runs that function across the parallel domain. Like Renderscript and OpenCL, the runtime manages the memory movement; it can also decide whether to run a kernel on the host or the GPU. Currently it is available only from Microsoft, though there is at least one experimental implementation from outside Redmond.

Conclusions

As John Barr argues, the tools and programming methodologies for upcoming HPC systems will change as the architectures evolve. Resolving to a high level portable heterogeneous programming strategy will attract more users early in this evolution. Before MPI, there were several message passing libraries, some of which were vendor-specific. Settling on MPI allowed applications to port across cluster architectures without rewrites, and allowed vendors to focus on innovation at the architecture and implementation level. It also limited innovation on the message passing libraries. Before OpenMP, there were several sets of high level directives and low-level threading libraries for programming multiprocessor workstations and cluster nodes. Settling on OpenMP similarly allowed applications to port across multiprocessor and eventually multicore architectures without rewrites, allowing vendors to focus on innovation at the architecture and implementation level. It also limited innovation at the programming language level.

For heterogeneous systems, we want to settle on a high level programming strategy. It should be as target agnostic as possible. In particular, it must not require a user to write a different program for different targets. We’re going to always require low-level programming as well, and CUDA and OpenCL will fit this bill nicely for appropriate devices. At the high level, the obvious possibilities are OpenACC or OpenMP; the eventual choice will be made by the users. If the OpenMP device constructs mature to better support performance portable programming across the wide variety of devices, it may become a viable option. If OpenACC demonstrates portability and more generality, it should be the language of choice. There’s still hope that the two specifications will converge. Users will eventually decide based on how well each language supports the variety of targets in terms of ease of use and features, but most importantly, performance and performance portability.
