Compilers and More: Accelerating High Performance

By Michael Wolfe

May 7, 2008

My prediction: High performance computing will soon be dominated by accelerator-based systems.

You may ask: Why will accelerators be better than multicore processors? Why now and not ten years ago? Why accelerators and not exciting new processors? Who will produce them? And how will we migrate to accelerators and how will we program them? I’ll answer these questions in order.

Why accelerators? The market for accelerators has always been, and likely always will be, much smaller than the market for commodity processors. This means that they don’t have a cost model that will support legions of designers and billion-dollar fab plants to use the very latest technology; accelerators will always be, technologically, one or two generations behind the big chip vendors. The big vendors must go after the high volume market to generate the revenue to pay for the most aggressive chip technology. Face it, HPC is not a high volume business.

Given that the clock race has ended, we move to multicore. Multicore designs aimed at the general-purpose market look a lot like shared-memory multiprocessors, except with less total memory bandwidth. I’ve heard vendors and researchers point out that we could increase the core count dramatically if we’re willing to simplify the cores by eliminating superscalar instruction issue, speculative and out-of-order execution, register renaming, and so on. This is true, but consider what we give up. Today’s aggressive processors manage many levels of low-level parallelism: multiple instructions issued simultaneously, dozens (perhaps a hundred) instructions in flight, speculative memory loads and branch prediction, all managed and synchronized by hardware at clock granularity. Remove all that and we are left with software-managed parallelism: software thread creation, software speculation (and mis-speculation, including squashing), software synchronization, all because the multiple cores have no hardware support for parallelism.

But accelerators can use proven chip technology at lower cost, which lets them attack smaller markets, even niche markets. Where a major chip vendor makes its bread and butter with binary compatibility, an accelerator can (indeed, must) make up for what it lacks in process technology with architecture. Today’s ClearSpeed accelerators and programmable GPUs use multiple SIMD cores with high-bandwidth memory, and any number of other architectures may be replayed in the accelerator arena. These designs embrace parallelism in a way that multicore designs don’t — or won’t: hardware support for thread creation, synchronization, and so on makes small-grain parallelism feasible.

Why now? Previously, the general-purpose chip vendors could always stay a step ahead of, or only slightly behind, the accelerators simply by riding clock rates. Who would make the considerable investment in an accelerator when the next-generation processor would be just as fast?

Today the equation leans toward accelerators. Chips aren’t getting faster, just fatter. We’re going to have to invest in parallelism to get any performance increase, whether with multicore or with accelerators, so we should invest in the strategy with the best support for parallelism and the biggest upside. Accelerators depend on parallelism and have integral support for it; multicore processors are aimed at a much broader market and only incidentally address HPC issues. Moore’s Law still works, for now, and on-chip density will keep increasing predictably. Since accelerators sit a generation or two back on that curve, they have more of that benefit still ahead of them.

Why not just new processors? We’re back to economics on this question. Trying to develop and market a new processor means migrating a whole software ecosystem. This was done successfully in the RISC revolution of the 1980s, producing today’s SPARC and Power processors, among others. More recently, Intel and HP developed Itanium, which has achieved more limited success. Only a few vendors have the resources to develop a new processor with the full support necessary to make it viable, and those vendors have vested interests in their current processor strategy.

However, a processor with an accelerator can still run standard system software and tools. The migration to such a system can be limited to the HPC applications. Most of the cost of the whole system will be in components that would be necessary anyway; the additional cost of the accelerator is relatively low, but the performance boost is compelling.

Whose accelerators? Accelerators come and go, and some focus on particular applications. The CNAPS chip developed by Adaptive Solutions (founded by a former colleague at the Oregon Graduate Institute) was intended for neural network simulations, and was quite successful at accelerating Photoshop functions. GPUs are, and have always been, accelerators for pixel processing. In HPC today, we have ClearSpeed and NVIDIA GPUs. I’m going to declare myself neutral on this question, though I envision a growing industry here, since the cost of entry is relatively low. It will be interesting to see what develops with the open HyperTransport and QuickPath interfaces, and what the chip vendors may put on their own silicon.

How to program accelerators? This is where it gets interesting. The current programming models — NVIDIA’s CUDA, AMD’s Brook+, RapidMind, and other research into stream programming — require a complete rewrite of the computationally intensive portion of the code. Each model constrains the programmer to the types of solutions that will run well in that model. For example, a stream model requires the programmer to define data streams and the operation that takes place, element-wise, on the stream. If the computation can be cast into that model, good performance is assured.
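
To make the stream idea concrete, here is a rough sketch in standard Fortran: the kernel is written as an elemental function and applied, element-wise, to whole arrays standing in for streams. Brook+ and CUDA express the same pattern with C-like kernel functions; the routine and variable names here are purely illustrative.

    ! A loose stand-in for the stream model in standard Fortran:
    ! the "kernel" is an elemental function, and applying it to
    ! whole arrays corresponds to running it over the streams.
    module stream_sketch
      implicit none
    contains
      elemental function saxpy_kernel(a, x, y) result(z)
        real, intent(in) :: a, x, y
        real :: z
        z = a * x + y            ! the per-element operation
      end function saxpy_kernel
    end module stream_sketch

    program stream_demo
      use stream_sketch
      implicit none
      real :: x(1000), y(1000), z(1000)
      x = 1.0
      y = 2.0
      ! No explicit loop: the kernel is applied to every element.
      z = saxpy_kernel(2.0, x, y)
      print *, z(1), z(1000)
    end program stream_demo

The point is that the programmer, not the compiler, must recast the computation into this kernel-over-streams form before any of these systems can generate accelerator code.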

Coming from a compiler background, we at PGI wanted to know whether it is feasible to present as classical a programming model as possible, adding no more complexity than, say, OpenMP or vectorization pragmas and directives. Such an approach has many advantages: it is incremental, it is compatible with existing code and tools, and it has a long, successful history. Directive-based models are particularly successful when they work efficiently across a range of targets; codes written for vectorizing compilers, for instance, were largely portable across all the vector machines. This doesn’t relieve the programmer of all rewriting, but it does provide a familiar environment for testing and experimentation.
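
OpenMP already shows what this kind of incremental, directive-based annotation looks like in Fortran. In the sketch below, the directive is just a comment to any compiler that doesn’t recognize it, and the loop itself is ordinary Fortran; that is the flavor of model we wanted to preserve.

    subroutine scale_add(n, a, x, y)
      implicit none
      integer, intent(in) :: n
      real, intent(in)    :: a, x(n)
      real, intent(inout) :: y(n)
      integer :: i
    ! The sentinel makes the directive invisible to compilers that
    ! do not support OpenMP; the loop is unchanged either way.
    !$OMP PARALLEL DO
      do i = 1, n
         y(i) = a * x(i) + y(i)
      end do
    !$OMP END PARALLEL DO
    end subroutine scale_add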

Borrowing from Bob Morgan’s book, “Building an Optimizing Compiler,” we used a thin-spike approach to produce an entire toolchain that generates host-plus-accelerator code. We modified our Fortran compiler to accept an “ACCEL/ENDACCEL” directive pair, where a loop between the directives is compiled for the accelerator. Our target was the NVIDIA GPU, and we used parts of NVIDIA’s NVCC toolchain. The compiler identified the data that needed to be moved to and from the accelerator, replaced the loop with runtime calls that send the data over, initiate the work, and bring the results back, and cleaned up afterwards. The result was working object code that we could link and execute. Apart from the compiler, the other parts of the toolchain — linker, make and makefiles, etc. — were unchanged.
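
The exact directive syntax is not the point; what follows is only a hypothetical sketch, assuming a comment-style sentinel like the one OpenMP uses. The sentinel spelling and the example routine are illustrative, not the actual PGI implementation; the idea is simply that the enclosed loop is handed to the accelerator while the surrounding code stays ordinary Fortran.

    subroutine matvec(n, a, x, y)
      implicit none
      integer, intent(in) :: n
      real, intent(in)    :: a(n,n), x(n)
      real, intent(out)   :: y(n)
      integer :: i, j
    ! Hypothetical directive pair: the loop nest between the directives
    ! is compiled for the accelerator; the compiler figures out which
    ! data must be sent over and brought back.
    !$ACCEL
      do i = 1, n
         y(i) = 0.0
         do j = 1, n
            y(i) = y(i) + a(i,j) * x(j)
         end do
      end do
    !$ENDACCEL
    end subroutine matvec

At compile time, the annotated loop is replaced by the runtime calls described above: allocate device memory, copy the operands over, launch the generated code, copy the results back, and release the device storage.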

And we made it work. Now, this is not a product announcement, or even a preannouncement. We developed this as an internal feasibility demonstration — more like a science fair compiler project. The decision to complete it as a product will depend largely on market forces. Other issues include choosing an accelerator target (or targets) and the commercial longevity of the target. GPUs, for instance, are much less restricted by silly things like binary compatibility, and the vendors come up with new versions every 6-12 months. We probably can’t develop and tune new compilers nearly that quickly.

But the basis is sound. We believe we can produce compilers that allow evolutionary migration from today’s processors to accelerators, and that accelerators provide the most promising path to high performance in the future.

-----

Michael Wolfe has developed compilers for over 30 years in both academia and industry, and is now a senior compiler engineer at The Portland Group, Inc. (www.pgroup.com), a wholly-owned subsidiary of STMicroelectronics, Inc. The opinions stated here are those of the author, and do not represent opinions of The Portland Group, Inc. or STMicroelectronics, Inc.
