Compilers and More: Programming at Exascale

By Michael Wolfe

March 8, 2011

Part 1: All that Parallelism

There are at least two ways exascale computing can go, as exemplified by the top two systems on the latest (November 2010) TOP500 list (Tianhe-1A and Jaguar). The Chinese Tianhe-1A uses 14,000 Intel multicore processors with 7,000 NVIDIA Fermi GPUs as compute accelerators, whereas the American Jaguar Cray XT-5 uses 35,000 AMD 6-core processors. Four of the top ten supercomputers in the TOP500 list use accelerators (NVIDIA GPUs or IBM PowerXCell in Roadrunner). However, only one of the five announced 10-petaflop+ systems uses accelerators. The IBM Blue Waters at Illinois will have 40,000 8-core Power7 processors; the Fujitsu Kei will have 80,000 8-core Sparc64 processors; IBM will deliver two Blue Gene/Q systems, Mira to Argonne National Lab with 49,000 16-core Power A2 processors, and Sequoia to Lawrence Livermore National Lab with twice as many as Mira. Only Cray’s Titan system at Oak Ridge is using GPUs as accelerators (to be fair, Titan is not fully approved just yet). Many would say that we will need accelerators to reach exascale, but I think those same people would have said we need accelerators to reach or exceed the 10-petaflop scale as well. We should watch these announcements over the coming year. When I first started this article, there were only four such announced systems, and I’m predicting two or more 10-petaflop+ systems before the year is out. If we can jump from 2-petaflop (2009) to 20-petaflop (2012) in three years, perhaps exascale computing by 2018 is achievable.

So we might have to deal with accelerators in some form, or not, but that’s not the subject of this series of articles. All the systems in the TOP500 use thousands, tens of thousands, or at the high end, hundreds of thousands of cores. Moreover, the cores themselves exhibit a high degree of internal parallelism. At exascale, there will be a lot of parallelism, on many levels, and to reach really high performance, we are going to have to program and tune to take advantage of all those levels of parallelism. And that is the subject of these articles.

Levels of Parallelism

I break down the parallelism that we see in really high end computing into six levels. We tend to focus on the node-level or core-level parallelism, since that’s where the biggest numbers come from, but it’s the product of all the parallelism that gives us the total performance, so we need to understand all the levels. You may split the levels differently, and that would make for an interesting academic discussion in itself.

  • Node Level: How many nodes in the computer. I’ll mostly use the Jaguar numbers as an example; Jaguar has 18,000 nodes. We could debate what constitutes a node, but I’ll use the common definition that a node is a set of processors or cores that share physical memory and a network interface. All (or almost all) large systems use identical nodes, except for a small number dedicated to IO, storage, or interactive interfacing, so we haven’t had to deal with heterogeneity across nodes, and I expect this to continue. I’m going to go out on a limb by predicting that node count will top out at exascale closer to 100,000, rather than the more aggressive predictions of millions of nodes from other experts.
  • Socket Level: How many sockets at each node. Jaguar nodes have two sockets. Systems with accelerators exhibit heterogeneity at this level, with CPU and accelerator sockets. Current systems using GPUs as accelerators have the GPUs plugged into the IO bus (PCI-express) instead of a CPU socket, which seriously affects the bandwidth between the CPU and accelerator. An accelerator like the Convey coprocessor sits on the memory bus, just like another processor, and can be much more tightly integrated. I’m going to ignore other accelerators, such as network interfaces with MPI acceleration capabilities. They can be just as important, but are generally not programmed. We can expect socket parallelism to go up in each node, from 2 or 4 now to 8 or 16 sockets. As mentioned in the introduction, these may include compute accelerators.
  • Core Level: How many cores in each socket. Jaguar uses 6-core AMD processors. Core counting is open to debate. Does an Intel Core i7 hyperthreaded six-core processor count as twelve? The operating system certainly thinks there are twelve, and there can be twelve simultaneous active threads, but there are only six sets of functional units. Does each AMD Bulldozer “2-core module” count as two cores, or are they more like a single dual-issue core? Like the hyperthreaded core, the two “cores” on the 2-core module share many resources, such as the instruction fetch and decode logic, and the floating point functional units, yet there are two complete sets of integer functional units. Does each NVIDIA GPU thread processor (CUDA core) count as a core, or only the multiprocessors? One could argue that if we count each CUDA core as one core, then each lane of an SSE instruction unit should count as one, making that six-core, hyperthreaded Intel Core i7 equivalent to 48 SSE-cores (single precision). With the new AMD Fusion and Intel Sandy Bridge processors, we see heterogeneity at the core level as well, with 64-bit x86 cores and integrated graphics cores on-chip. We should expect core parallelism to increase due to improved silicon densities, but we’re still waiting for through-silicon vias (TSVs) to give enough off-chip bandwidth to feed the beast.
  • Vector Level: How wide are the vector or SIMD instructions in each core. Focusing on 64-bit precision, Jaguar uses x86 SSE with 2-wide vectors. If we count an NVIDIA streaming multiprocessor as a single core, then each core has 8-wide (Tesla-10) or 16-wide (Fermi) vector width in hardware, with software vector lengths of 32. We can argue over whether to count hardware SIMD parallelism or software vector length, but recall that the Cray-1 used pipelining, not SIMD parallelism, to implement its 64-long vector instruction set. Intel has taken the next shot (or two) along the vector level on x86 with AVX (and the Larrabee vector instructions), increasing SIMD widths from 2 to 4 (and 8) in double precision. Lengthening the vectors is a tradeoff: whether to use those gates for longer vectors or for more cores. Increasing SIMD or vector parallelism is easy and low cost relative to adding cores, comparing performance per gate, and I expect this trend to continue. For accelerators, such as GPUs, we can expect very long vector operations. These devices are already optimized for regular parallelism, and there’s no reason to believe that won’t continue to be the case.
  • Pipeline Level: How many instructions are in a partial state of completion at once. This is hard to measure. If we treat multithreading as a type of pipeline parallelism implemented across threads, NVIDIA Fermi GPUs can support pipeline parallelism factors of close to 50. With superscalar instruction issue, out-of-order instruction execution, and aggressive speculative execution, any modern high-powered processor can have dozens of instructions in some partial state of completion. However, look at the CPU design that Intel used in the Knights Ferry vs. the current Intel Core processors. The Knights Ferry packs more cores on a single chip by simplifying the core to a dual-issue, in-order control unit. This is another interesting parallelism-level tradeoff, reducing the control unit (pipeline-level parallelism) to increase core count (core-level parallelism). I expect the pipeline-level parallelism for CPUs to decrease, in order to increase core count. GPUs, on the other hand, are already single-issue in-order cores but use a high degree of multithreading to tolerate memory latency, rather than depend on a cache and memory reference locality; expect this design point to continue. For these chips, I expect pipeline-level parallelism to increase, as memory latencies get relatively larger.
  • Instruction Level: How many instructions get dispatched or executed at one time. Typical x86 processors can issue up to three instructions per cycle. I expect this number may decrease to two, again to simplify the control unit and pack more cores on a chip. Some processors, such as Intel Itanium and AMD GPUs, use a VLIW design for static instruction-level parallelism, trading a simpler control unit for a more complex software stack (compiler). VLIW advocates believe that this truly is the best way to implement instruction-level parallelism, considering all complexity, power and performance tradeoffs. It will be hard for the HPC world to sustain the costs of processor design, so we’re likely to have to live with whatever the commodity world delivers.

So lots of levels of parallelism, and lots of complexity. You might count the levels differently, such as combining the socket and core levels, since cores in different sockets differ mostly only in relative latency. You might insert an explicit level for heterogeneity, or you might combine the pipeline and instruction levels as a single microarchitectural level. But I’m going to live with the levels as given, and the table below summarizes the numbers:

[Figure: Parallelism Chart]
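Because total performance comes from the product of the levels, it can help to multiply the factors out. The sketch below does this for the rounded Jaguar figures quoted in the list above (18,000 nodes, two sockets, six cores, 2-wide double-precision SSE) and for the Core i7 counting argument. The dictionary layout and function name are illustrative assumptions, and the pipeline and instruction levels are left out because the text gives no single number for them.

```python
from math import prod

# Per-level parallelism factors for Jaguar, rounded as in the text above
# (double precision): nodes, sockets per node, cores per socket, SSE lanes.
jaguar_levels = {
    "node": 18_000,
    "socket": 2,
    "core": 6,
    "vector": 2,
}

def total_parallelism(levels):
    """Multiply the per-level factors to get the number of parallel lanes."""
    return prod(levels.values())

jaguar_cores = jaguar_levels["node"] * jaguar_levels["socket"] * jaguar_levels["core"]
jaguar_lanes = total_parallelism(jaguar_levels)

# The Core i7 counting argument from the core-level discussion:
# 6 cores x 2 hyperthreads x 4 single-precision SSE lanes.
i7_sse_cores = prod([6, 2, 4])

print(jaguar_cores)   # 216000
print(jaguar_lanes)   # 432000
print(i7_sse_cores)   # 48
```

Doubling any one factor doubles the total, which is why a change at the vector level matters as much as a change in node count.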

Application Parallelism

Application programming is mostly focused on the higher levels of parallelism, the top four levels of my chart. Large scale parallel programming today is largely dominated by the Single Program-Multiple Data (SPMD) model across distributed memory, using MPI for communication. You can use SPMD (MPI-style) parallelism and map that across nodes, sockets and cores, which has the advantage of a single parallelism model across many levels of parallelism. You can then map your application parallelism seamlessly to a machine with many nodes and low core count, or with fewer nodes and higher core count, but you can’t take advantage of the locality inherent between cores on a single socket or node.
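As a rough illustration of the flat SPMD mapping described above, the sketch below gives every rank the same code and a different block of the data, and maps a flat rank number onto a (node, socket, core) position. It does not use MPI; the loop over ranks stands in for processes running at the same time, and all function names are hypothetical.

```python
def block_range(rank, nranks, n):
    """Return the half-open range [lo, hi) of the n items owned by this rank."""
    base, extra = divmod(n, nranks)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi

def rank_location(rank, sockets_per_node, cores_per_socket):
    """Map a flat rank to (node, socket, core), filling cores first."""
    ranks_per_node = sockets_per_node * cores_per_socket
    node, within = divmod(rank, ranks_per_node)
    socket, core = divmod(within, cores_per_socket)
    return node, socket, core

def spmd_sum(data, nranks):
    """Every rank runs the same code on its own block; partial sums are combined."""
    partials = []
    for rank in range(nranks):          # stands in for nranks concurrent processes
        lo, hi = block_range(rank, nranks, len(data))
        partials.append(sum(data[lo:hi]))
    return sum(partials)                # stands in for a reduction across ranks

# The same rank lands on different hardware depending on the machine shape.
print(rank_location(13, sockets_per_node=2, cores_per_socket=6))  # (1, 0, 1)
print(rank_location(13, sockets_per_node=1, cores_per_socket=4))  # (3, 0, 1)
```

The program itself never looks at the result of `rank_location`, which is the point: the mapping is seamless, but the code cannot tell that ranks 12 and 13 share a socket.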

Some applications use OpenMP within a node or socket and MPI across nodes or sockets, to take advantage of lower data communication cost in shared memory. OpenMP doesn’t manage parallelism across a network, but has most of the features you need for shared-memory parallelism, and is even looking to add features to tune for non-uniform memory access costs. Unfortunately, these two programming models, each around 15 years old now, were and are designed by separate committees that seemingly have no mutual interest in interoperability. This is not too hard to understand, given that OpenMP is a language model, whereas MPI is an opaque library (opaque to the language and compiler). Having to deal with two different models is unfortunate, but each deals only with a subset of the total problem.
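The two-level structure of such a hybrid program can be sketched as follows. This is only a model: Python threads stand in for OpenMP threads sharing a node's memory (they will not give a real speedup), a plain loop stands in for MPI ranks, and the function names are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def split(lo, hi, parts):
    """Split the half-open range [lo, hi) into `parts` nearly equal blocks."""
    base, extra = divmod(hi - lo, parts)
    bounds, start = [], lo
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

def node_sum(data, lo, hi, nthreads):
    """Inner level: threads on one node share `data` and each sums a sub-block."""
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        parts = pool.map(lambda b: sum(data[b[0]:b[1]]), split(lo, hi, nthreads))
        return sum(parts)

def hybrid_sum(data, nnodes, threads_per_node):
    """Outer level: one block per node; only per-node results cross the network."""
    node_results = [node_sum(data, lo, hi, threads_per_node)
                    for lo, hi in split(0, len(data), nnodes)]
    return sum(node_results), len(node_results)

total, messages = hybrid_sum(list(range(1000)), nnodes=4, threads_per_node=12)
print(total, messages)   # 499500 4
```

With four nodes of twelve threads each, only four partial results need to be communicated, where a flat SPMD version with one rank per core would combine 48.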

Finally, many applications are programmed to take advantage of vector instructions, in the form of vectorizable loops. The vectorization technology that served as my personal introduction to advanced compilers (about 35 years ago) is still used today for those SIMD instructions. Moreover, this will be more important in the future. As hardware vector lengths increase, a larger fraction of the total performance will come from vectors. The downside is that there’s no single way to program a parallel operation that can be mapped across vector parallelism, shared memory core or socket parallelism, or node-level parallelism, with the choice made at program compile or execution time.
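To make the effect of vector length concrete, here is a minimal sketch of strip-mining, the step in which a vectorizable loop is cut into strips of the hardware vector width. Each strip models one SIMD instruction. The function name and the operation counter are hypothetical; a real compiler emits machine instructions rather than Python slices.

```python
def saxpy_strip_mined(a, x, y, width):
    """Compute a*x[i] + y[i], issuing one 'vector operation' per `width` elements."""
    n = len(x)
    out = [0.0] * n
    vector_ops = 0
    for start in range(0, n, width):
        stop = min(start + width, n)     # the last strip may be shorter than width
        # One strip models a single SIMD instruction acting on all of its lanes.
        out[start:stop] = [a * xi + yi
                           for xi, yi in zip(x[start:stop], y[start:stop])]
        vector_ops += 1
    return out, vector_ops

x = [float(i) for i in range(1000)]
y = [1.0] * 1000
for width in (2, 4, 8):                  # SSE, AVX, and 8-wide double precision
    _, ops = saxpy_strip_mined(2.0, x, y, width)
    print(width, ops)                    # 2 500, then 4 250, then 8 125
```

The result is the same for every width; only the number of operations issued changes, which is why wider vectors account for a growing share of performance.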

Compilers, on the other hand, mostly focus on the lower levels of parallelism, particularly instruction-level and pipeline parallelism. One could say that the microarchitectural parallelism is implemented in hardware, not in the compiler, but the compiler has to express the parallelism with the right instruction stream for the hardware to exploit it. Compilers also generate vector instructions, as mentioned above. Automatic parallelization as well as efficient implementation of shared-memory programming models like OpenMP and Cilk require integration into the compiler as well.

The interface between the application and the compiler is the programming language. The bandwidth of that interface is the amount of information that a programmer can give to the compiler. Recent linguistic research implies that the language we use to communicate changes or focuses the way we think; this is true for programming languages as well. Parallelism at the language level is becoming more common. Some languages focus on different levels of parallelism.

OpenMP, as mentioned, is used in shared memory environments, across cores or sockets, and continues to evolve, adding features for unstructured (task) parallelism. Cilk and many other parallel languages also target shared memory parallel systems. These languages usually assume uniform memory access cost, but allow for load balancing and dynamic parallelism. They focus entirely on core- and socket-level parallelism, with uniform (homogeneous) cores. The next C++ standard will likely include some sort of parallel construct, probably assuming shared memory as well. Fortran 2008 has a parallel loop construct, the do concurrent, which allows a compiler to execute the iterations in any order, including in parallel. This is a bit different than a parallel loop expressed in OpenMP, which specifies the mapping between loop iterations and threads more explicitly, and different than array assignments and the forall construct, which allow vector-style parallelism, but not necessarily multicore-style parallelism.
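The "any order" property that do concurrent relies on can be illustrated outside Fortran. The sketch below runs a loop body over shuffled iteration orders and compares each result with the sequential one. It is a sampling check under assumed names, not a proof of independence and not how a compiler decides.

```python
import random

def run_in_order(body, order, a):
    """Apply body(i, a) for each index in `order` on a copy of a."""
    a = list(a)
    for i in order:
        body(i, a)
    return a

def independent(i, a):
    a[i] = 2 * a[i] + 1          # touches only element i

def carried(i, a):
    if i > 0:
        a[i] = a[i] + a[i - 1]   # reads the element written by iteration i-1

def order_insensitive(body, a, trials=20, seed=0):
    """Return True if every sampled iteration order matches the sequential result."""
    n = len(a)
    expected = run_in_order(body, range(n), a)
    rng = random.Random(seed)
    for _ in range(trials):
        order = list(range(n))
        rng.shuffle(order)
        if run_in_order(body, order, a) != expected:
            return False
    return True

print(order_insensitive(independent, [1, 2, 3, 4, 5]))         # True
print(order_insensitive(carried, [1, 2, 3, 4, 5, 6, 7, 8]))    # False
```

Only the first body would be a legitimate candidate for a construct that lets the compiler pick the iteration order; the second has a loop-carried dependence.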

MPI is used mostly to manage parallelism across nodes. Coarrays in Fortran, Unified Parallel C (UPC) and other PGAS languages all target the same parallelism structures as MPI. Programming using these primitives leads one to large scale, static parallelism with explicit locality and infrequent communication. As mentioned, it’s difficult to take advantage of locality between cores on a single socket or node with these.

OpenMP, Cilk, coarrays, UPC, and others are implemented and supported by compilers, but require the application programmer to expose the parallelism and express it using specific syntax. MPI is not a language, but it may as well be. It requires the same application structure as a PGAS program, without the advantage of compiler error checking or optimization.

OpenCL and (to some extent) CUDA target the middle parallelism levels. Both are designed for a master (host) / worker (accelerator) execution model, and can be used to program sockets or cores on a socket. OpenCL could even be used to program sockets or cores on different nodes, but it wouldn’t scale; you’d need some other mechanism to generate symmetric parallelism across the nodes.

Then there are myriad low-level programming tricks, such as manual loop unrolling, used to foil the compiler or coax it into generating the right code to take advantage of instruction-level and pipeline-level parallelism; you still see them being used today. These are mostly artifacts of compilers that are not fully optimized, or of cases where the programmer knows more about the program, the data, or the target machine than he or she can express in the language.
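For readers who have not seen it, manual loop unrolling looks roughly like the sketch below: a dot product unrolled by four with separate accumulators, plus a cleanup loop for the leftover iterations. Python is used only to show the shape of the transformation; it gains nothing from it, and the function names are illustrative.

```python
def dot(x, y):
    """Reference loop: one multiply-add per iteration, one accumulator."""
    total = 0
    for i in range(len(x)):
        total += x[i] * y[i]
    return total

def dot_unrolled4(x, y):
    """The same loop unrolled by four with independent accumulators."""
    n = len(x)
    s0 = s1 = s2 = s3 = 0
    main = n - n % 4                      # largest multiple of 4 not exceeding n
    for i in range(0, main, 4):
        # The four updates do not depend on each other, so hardware could
        # overlap them in the pipeline or issue them together.
        s0 += x[i] * y[i]
        s1 += x[i + 1] * y[i + 1]
        s2 += x[i + 2] * y[i + 2]
        s3 += x[i + 3] * y[i + 3]
    total = (s0 + s1) + (s2 + s3)
    for i in range(main, n):              # remainder iterations
        total += x[i] * y[i]
    return total

x = list(range(1, 11))
y = list(range(10, 0, -1))
print(dot(x, y), dot_unrolled4(x, y))     # 220 220
```

With floating-point data the two versions can differ in the last bits, because the unrolled form adds the terms in a different order. That reordering is one reason a compiler may decline to make this transformation by itself unless the programmer permits it.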

To reach exascale computing, we need to productively take advantage of all the levels of parallelism. This has ramifications for applications, languages, and compilers. In my next column, I’ll introduce The Three Ex’s of Exascale, which applications developers and system providers need to consider as we move forward.

About the Author

Michael Wolfe has developed compilers for over 30 years in both academia and industry, and is now a senior compiler engineer at The Portland Group, Inc., a wholly-owned subsidiary of STMicroelectronics, Inc. The opinions stated here are those of the author, and do not represent opinions of The Portland Group, Inc. or STMicroelectronics, Inc.
