Compilers and More: Accelerating High Performance

By Michael Wolfe

May 7, 2008

My prediction: High performance computing will soon be dominated by accelerator-based systems.

You may ask: Why will accelerators be better than multicore processors? Why now and not ten years ago? Why accelerators and not exciting new processors? Who will produce them? And how will we migrate to accelerators and how will we program them? I’ll answer these questions in order.

Why accelerators? The market for accelerators has always been, and likely always will be, much smaller than the market for commodity processors. This means that they don’t have a cost model that will support legions of designers and billion-dollar fab plants to use the very latest technology; accelerators will always be, technologically, one or two generations behind the big chip vendors. The big vendors must go after the high volume market to generate the revenue to pay for the most aggressive chip technology. Face it, HPC is not a high volume business.

Given that the clock race has ended, we move to multicore. Multicore designs aimed at the general-purpose market look a lot like shared memory multiprocessors, except with less total memory bandwidth. I’ve heard vendors and researchers point out that we could increase the core count dramatically if we’re willing to simplify the cores by eliminating superscalar instruction issue, speculative and out-of-order instruction execution, register renaming, and so on. This is true, but look at what we’re accomplishing. Today’s aggressive processors manage many levels of low-level parallelism: multiple instructions issued simultaneously, dozens (perhaps a hundred) instructions in flight, speculative memory loads and branch prediction, all managed and synchronized by hardware at clock granularity. We can remove all that and move to software-managed parallelism. Software thread creation, software speculation (and mis-speculation, including squashing), software synchronization, all because the multiple cores have no hardware support for parallelism.

But accelerators can use proven chip technology with lower cost. This allows them to attack smaller markets, even niche markets. Where a major chip vendor makes its bread and butter with binary compatibility, an accelerator can (indeed, must) make up what it lacks in technology using architecture. Today’s ClearSpeed and programmable GPUs use multiple SIMD cores with high bandwidth memory. Any number of possible architectures may be replayed in the accelerator arena. These designs embrace parallelism in a way that multicore designs don’t, or won’t. Hardware support for thread creation, synchronization, and so on makes small-grain parallelism feasible.

Why now? Previously, the general-purpose chip vendors could always stay a step ahead, or only slightly behind, the accelerators just using clock rates. Who would make the considerable investment in an accelerator when the next generation processor would be just as fast?

Today’s equations lean toward accelerators. Chips aren’t getting faster, just fatter. We’re going to have to invest in parallelism to get any performance increase, with multicore or with accelerators. We should invest in a strategy with the best support for parallelism and with the biggest upside. Accelerators depend on parallelism and have integral support for it; multicore processors are aimed at a much broader market, and only incidentally address HPC issues. Moore’s law still works, for now, and on-chip density will increase predictably. Since accelerators are farther behind on that curve, they can enjoy more of that benefit.

Why not just new processors? We’re back to economics on this question. Trying to develop and market a new processor means migrating a whole software ecosystem. This was done successfully in the RISC revolution of the 1980s, producing today’s SPARC and Power processors, among others. More recently, Intel and HP developed Itanium, which has achieved more limited success. Only a few vendors have the resources to develop a new processor with the full support necessary to make it viable, and those vendors have vested interests in their current processor strategy.

However, a processor with an accelerator can still run standard system software and tools. The migration to such a system can be limited to the HPC applications. Most of the cost of the whole system will be in components that would be necessary anyway; the additional cost of the accelerator is relatively low, but the performance boost is compelling.

Whose accelerators? Accelerators come and go. Some focus on particular applications. The CNAPS chip developed by Adaptive Solutions (founded by a former colleague at the Oregon Graduate Institute) was intended for neural network simulations, and was quite successful at accelerating Photoshop functions. GPUs are, and have always been, accelerators for pixel processing. In HPC today, we have ClearSpeed and NVIDIA GPUs. I’m going to declare myself neutral on this question, though I envision a growing industry here, as the cost of entry is relatively low. It will be interesting to see what develops with the open HyperTransport and QuickPath interfaces, and what the chip vendors may put on their own silicon.

How to program accelerators? This is where it gets interesting. The current programming models — NVIDIA’s CUDA, AMD’s Brook+, RapidMind, and other research into stream programming — require a complete rewrite of the computationally intensive portion of the code. Each model constrains the programmer to the types of solutions that will run well in that model. For example, a stream model requires the programmer to define data streams and the operation that takes place, element-wise, on the stream. If the computation can be cast into that model, good performance is assured.
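To make the stream model concrete, here is a minimal sketch in plain Python. It only imitates the shape of the model on a CPU: the programmer supplies a kernel that sees one element from each input stream, and a driver applies it across the streams. The names `saxpy_kernel` and `run_kernel` are hypothetical and are not part of CUDA, Brook+, or RapidMind.

```python
from typing import Callable, List, Sequence


def saxpy_kernel(a: float, x: float, y: float) -> float:
    """Element-wise operation: sees exactly one element of each input stream."""
    return a * x + y


def run_kernel(
    kernel: Callable[[float, float, float], float],
    a: float,
    xs: Sequence[float],
    ys: Sequence[float],
) -> List[float]:
    """Apply the kernel independently to every position of the input streams.

    Each output element depends only on the matching input elements, so no
    iteration depends on another. That independence is what would let an
    accelerator process many elements at once; here it runs sequentially.
    """
    if len(xs) != len(ys):
        raise ValueError("input streams must have the same length")
    return [kernel(a, x, y) for x, y in zip(xs, ys)]


if __name__ == "__main__":
    print(run_kernel(saxpy_kernel, 2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```

A computation that needs neighboring elements or a running total does not fit this shape directly, which is the kind of constraint the paragraph above refers to.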

Coming from a compiler background, we at PGI wanted to know whether it is feasible to present as classical a programming model as possible, adding no more complexity than, say, OpenMP, or vectorization pragmas, or directives. Such approaches have many advantages: they are incremental, compatible, and have a long successful history. They are particularly successful if the programming model works efficiently across a range of targets. Codes for vectorizing compilers were largely portable across all vector machines, for instance. It doesn’t relieve the programmer from all rewriting, but it does use a familiar environment for testing and experimentation.

Borrowing from Bob Morgan’s book, “Building an Optimizing Compiler,” we used a thin spike approach to produce an entire toolchain to generate host+accelerator code. We modified our Fortran compiler to accept an “ACCEL/ENDACCEL” directive pair, where a loop between the directives is compiled for the accelerator. Our target was the NVIDIA GPU, and we used parts of their NVCC toolchain. The compiler identified the data that needed to be sent over and back, replaced the loop by runtime calls to initiate the work, brought the results back, and cleaned up afterwards. The result was working object code that we could link and execute. Apart from the compiler, the other parts of the toolchain — linker, make and makefiles, etc. — were unchanged.
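The sequence the compiler generates (send data over, start the work, bring results back, clean up) can be sketched as follows. This is a plain Python simulation under stated assumptions, not the PGI implementation or the NVIDIA runtime: `SimulatedAccelerator` and all of its methods are hypothetical stand-ins, and the "device" is just a dictionary acting as separate memory.

```python
from typing import Callable, Dict, List, Sequence


class SimulatedAccelerator:
    """Hypothetical stand-in for a device with its own memory space."""

    def __init__(self) -> None:
        self.memory: Dict[str, List[float]] = {}
        self.log: List[str] = []

    def upload(self, name: str, data: Sequence[float]) -> None:
        # Copy host data into device memory.
        self.memory[name] = list(data)
        self.log.append(f"upload {name}")

    def launch(self, body: Callable[..., None], n: int, *names: str) -> None:
        # Run the loop body once per iteration against device-resident arrays.
        self.log.append("launch")
        arrays = [self.memory[name] for name in names]
        for i in range(n):
            body(i, *arrays)

    def download(self, name: str) -> List[float]:
        # Copy results back to the host.
        self.log.append(f"download {name}")
        return list(self.memory[name])

    def free(self, *names: str) -> None:
        # Release device memory.
        for name in names:
            del self.memory[name]
        self.log.append("free")


def loop_body(i: int, a: List[float], b: List[float], c: List[float]) -> None:
    """The body of the loop that sits between the directives."""
    c[i] = a[i] + 2.0 * b[i]


def original_loop(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """What the programmer wrote: an ordinary host loop."""
    c = [0.0] * len(a)
    for i in range(len(a)):
        loop_body(i, list(a), list(b), c)
    return c


def offloaded_loop(dev: SimulatedAccelerator, a: Sequence[float], b: Sequence[float]) -> List[float]:
    """What the loop is replaced by: a sequence of runtime calls."""
    n = len(a)
    dev.upload("a", a)            # inputs the loop reads
    dev.upload("b", b)
    dev.upload("c", [0.0] * n)    # buffer for the output the loop writes
    dev.launch(loop_body, n, "a", "b", "c")
    c = dev.download("c")         # only the result comes back
    dev.free("a", "b", "c")
    return c


if __name__ == "__main__":
    dev = SimulatedAccelerator()
    print(offloaded_loop(dev, [1.0, 2.0, 3.0], [0.5, 1.0, 1.5]))
    print(dev.log)
```

The point of the sketch is that the source loop stays recognizable while the data movement and launch bookkeeping are generated around it, which is why the rest of the toolchain can remain unchanged.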

And we made it work. Now, this is not a product announcement, or even a preannouncement. We developed this as an internal feasibility demonstration — more like a science fair compiler project. The decision to complete it as a product will depend largely on market forces. Other issues include choosing an accelerator target (or targets) and the commercial longevity of the target. GPUs, for instance, are much less restricted by silly things like binary compatibility, and the vendors come up with new versions every 6-12 months. We probably can’t develop and tune new compilers nearly that quickly.

But the basis is sound. We believe we can produce compilers that allow evolutionary migration from today’s processors to accelerators, and that accelerators provide the most promising path to high performance in the future.

-----

Michael Wolfe has developed compilers for over 30 years in both academia and industry, and is now a senior compiler engineer at The Portland Group, Inc. (www.pgroup.com), a wholly-owned subsidiary of STMicroelectronics, Inc. The opinions stated here are those of the author, and do not represent opinions of The Portland Group, Inc. or STMicroelectronics, Inc.
