Compilers and More: Accelerating High Performance

By Michael Wolfe

May 7, 2008

My prediction: High performance computing will soon be dominated by accelerator-based systems.

You may ask: Why will accelerators be better than multicore processors? Why now and not ten years ago? Why accelerators and not exciting new processors? Who will produce them? And how will we migrate to accelerators and how will we program them? I’ll answer these questions in order.

Why accelerators? The market for accelerators has always been, and likely always will be, much smaller than the market for commodity processors. This means that they don’t have a cost model that will support legions of designers and billion-dollar fab plants to use the very latest technology; accelerators will always be, technologically, one or two generations behind the big chip vendors. The big vendors must go after the high volume market to generate the revenue to pay for the most aggressive chip technology. Face it, HPC is not a high volume business.

Given that the clock race has ended, we move to multicore. Multicore designs aimed at the general-purpose market look a lot like shared memory multiprocessors, except with less total memory bandwidth. I’ve heard vendors and researchers point out that we could increase the core count dramatically if we’re willing to simplify the cores by eliminating superscalar instruction issue, speculative and out-of-order instruction execution, register renaming, and so on. This is true, but look at what we’re accomplishing. Today’s aggressive processors manage many levels of low-level parallelism: multiple instructions issued simultaneously, dozens (perhaps a hundred) instructions in flight, speculative memory loads and branch prediction, all managed and synchronized by hardware at clock granularity. We can remove all that and move to software-managed parallelism. Software thread creation, software speculation (and mis-speculation, including squashing), software synchronization, all because the multiple cores have no hardware support for parallelism.

But accelerators can use proven chip technology with lower cost. This allows them to attack smaller markets, even niche markets. Where a major chip vendor makes its bread and butter with binary compatibility, an accelerator can (indeed, must) make up what it lacks in technology using architecture. Today’s ClearSpeed and programmable GPUs use multiple SIMD cores with high bandwidth memory. Any number of possible architectures may be replayed in the accelerator arena. These designs embrace parallelism in a way that multicore designs don’t, or won’t. Hardware support for thread creation, synchronization, and so on makes small-grain parallelism feasible.

Why now? Previously, the general-purpose chip vendors could always stay a step ahead, or only slightly behind, the accelerators just using clock rates. Who would make the considerable investment in an accelerator when the next generation processor would be just as fast?

Today’s equations lean toward accelerators. Chips aren’t getting faster, just fatter. We’re going to have to invest in parallelism to get any performance increase, with multicore or with accelerators. We should invest in a strategy with the best support for parallelism and with the biggest upside. Accelerators depend on parallelism and have integral support for it; multicore processors are aimed at a much broader market, and only incidentally address HPC issues. Moore’s law still works, for now, and on-chip density will increase predictably. Since accelerators are farther behind on that curve, they can enjoy more of that benefit.

Why not just new processors? We’re back to economics on this question. Trying to develop and market a new processor means migrating a whole software ecosystem. This was done successfully in the RISC revolution of the 1980s, producing today’s SPARC and Power processors, among others. More recently, Intel and HP developed Itanium, which has achieved more limited success. Only a few vendors have the resources to develop a new processor with the full support necessary to make it viable, and those vendors have vested interests in their current processor strategy.

However, a processor with an accelerator can still run standard system software and tools. The migration to such a system can be limited to the HPC applications. Most of the cost of the whole system will be in components that would be necessary anyway; the additional cost of the accelerator is relatively low, but the performance boost is compelling.

Whose accelerators? Accelerators come and go. Some focus on particular applications. The CNAPS chip developed by Adaptive Solutions (founded by a former colleague at the Oregon Graduate Institute) was intended for neural network simulations, and was quite successful at accelerating Photoshop functions. GPUs are, and have always been, accelerators for pixel processing. In HPC today, we have ClearSpeed and NVIDIA GPUs. I’m going to declare myself neutral on this question, though I envision a growing industry here, as the cost of entry is relatively low. It will be interesting to see what develops with the open HyperTransport and QuickPath interfaces, and what the chip vendors may put on their own silicon.

How to program accelerators? This is where it gets interesting. The current programming models — NVIDIA’s CUDA, AMD’s Brook+, RapidMind, and other research into stream programming — require a complete rewrite of the computationally intensive portion of the code. Each model constrains the programmer to the types of solutions that will run well in that model. For example, a stream model requires the programmer to define data streams and the operation that takes place, element-wise, on the stream. If the computation can be cast into that model, good performance is assured.
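As a rough illustration of the stream model described above, consider the sketch below. It is plain Python, not CUDA, Brook+, or RapidMind code, and the names `saxpy_kernel` and `run_stream` are hypothetical. The programmer supplies the data streams and an element-wise operation; the runtime (here an ordinary host loop standing in for the accelerator) applies the operation to each element independently.

```python
from typing import Callable, List, Sequence


def saxpy_kernel(a: float, x: float, y: float) -> float:
    """Element-wise operation: sees one element of each input stream."""
    return a * x + y


def run_stream(kernel: Callable[..., float], scalar: float,
               *streams: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a stream runtime.

    Applies `kernel` independently to corresponding elements of the input
    streams. A real accelerator would run these applications in parallel;
    this sketch just loops on the host.
    """
    if not streams:
        raise ValueError("at least one input stream is required")
    length = len(streams[0])
    if any(len(s) != length for s in streams):
        raise ValueError("all input streams must have the same length")
    return [kernel(scalar, *elems) for elems in zip(*streams)]


x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
result = run_stream(saxpy_kernel, 2.0, x, y)
print(result)  # [12.0, 24.0, 36.0]
```

A computation whose elements depend on one another, such as a running sum, does not fit this shape without restructuring, which is the kind of constraint such a model places on the programmer.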

Coming from a compiler background, we at PGI wanted to know whether it is feasible to present as classical a programming model as possible, adding no more complexity than, say, OpenMP, or vectorization pragmas, or directives. Such approaches have many advantages: they are incremental, compatible, and have a long successful history. They are particularly successful if the programming model works efficiently across a range of targets. Codes for vectorizing compilers were largely portable across all vector machines, for instance. A directive-based approach doesn’t relieve the programmer of all rewriting, but it does provide a familiar environment for testing and experimentation.

Borrowing from Bob Morgan’s book, “Building an Optimizing Compiler,” we used a thin spike approach to produce an entire toolchain to generate host+accelerator code. We modified our Fortran compiler to accept an “ACCEL/ENDACCEL” directive pair, where a loop between the directives is compiled for the accelerator. Our target was the NVIDIA GPU, and we used parts of their NVCC toolchain. The compiler identified the data that needed to be sent over and back, replaced the loop by runtime calls to initiate the work, brought the results back, and cleaned up afterwards. The result was working object code that we could link and execute. Apart from the compiler, the other parts of the toolchain — linker, make and makefiles, etc. — were unchanged.
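The steps the compiler generates for an accelerated loop can be sketched roughly as follows. This is an illustrative Python model, not the PGI runtime or the NVIDIA toolchain: `FakeDevice` and its methods `upload`, `launch`, `download`, and `free` are hypothetical names, and the "device" is just a separate dictionary of buffers on the host. The example loop (`c[i] = a[i] + b[i]`) is also an assumption, chosen for brevity.

```python
from typing import Callable, Dict, List


class FakeDevice:
    """Hypothetical accelerator with its own memory, simulated on the host."""

    def __init__(self) -> None:
        self.memory: Dict[str, List[float]] = {}

    def upload(self, name: str, data: List[float]) -> None:
        # Copy host data into device memory (the copy models the transfer).
        self.memory[name] = list(data)

    def launch(self, kernel: Callable[[Dict[str, List[float]], int], None],
               n: int) -> None:
        # Run one kernel instance per loop iteration. Real hardware would
        # run these in parallel; here they run one after another.
        for i in range(n):
            kernel(self.memory, i)

    def download(self, name: str) -> List[float]:
        # Copy results from device memory back to the host.
        return list(self.memory[name])

    def free(self) -> None:
        # Release all device buffers.
        self.memory.clear()


def loop_body(mem: Dict[str, List[float]], i: int) -> None:
    """Body of the original loop, outlined into a kernel."""
    mem["c"][i] = mem["a"][i] + mem["b"][i]


def vector_add_offloaded(a: List[float], b: List[float]) -> List[float]:
    """What the host code might look like after the loop is replaced."""
    n = len(a)
    dev = FakeDevice()
    try:
        dev.upload("a", a)            # inputs the loop reads
        dev.upload("b", b)
        dev.upload("c", [0.0] * n)    # space for the output
        dev.launch(loop_body, n)      # initiate the work
        return dev.download("c")      # bring the results back
    finally:
        dev.free()                    # clean up afterwards


print(vector_add_offloaded([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # [5.0, 7.0, 9.0]
```

The point of the sketch is the division of labor: the loop body is the only part the programmer wrote, while the transfers, launch, and cleanup around it are the kind of code a compiler could generate from a directive.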

And we made it work. Now, this is not a product announcement, or even a preannouncement. We developed this as an internal feasibility demonstration — more like a science fair compiler project. The decision to complete it as a product will depend largely on market forces. Other issues include choosing an accelerator target (or targets) and the commercial longevity of the target. GPUs, for instance, are much less restricted by silly things like binary compatibility, and the vendors come up with new versions every 6-12 months. We probably can’t develop and tune new compilers nearly that quickly.

But the basis is sound. We believe we can produce compilers that allow evolutionary migration from today’s processors to accelerators, and that accelerators provide the most promising path to high performance in the future.


Michael Wolfe has developed compilers for over 30 years in both academia and industry, and is now a senior compiler engineer at The Portland Group, Inc., a wholly-owned subsidiary of STMicroelectronics, Inc. The opinions stated here are those of the author, and do not represent opinions of The Portland Group, Inc. or STMicroelectronics, Inc.
