Compilers and More: Programming at Exascale

By Michael Wolfe

March 8, 2011

Part 1: All that Parallelism

There are at least two ways exascale computing can go, as exemplified by the top two systems on the latest (November 2010) TOP500 list (Tianhe-1A and Jaguar). The Chinese Tianhe-1A uses 14,000 Intel multicore processors with 7,000 NVIDIA Fermi GPUs as compute accelerators, whereas the American Jaguar, a Cray XT5, uses 35,000 AMD 6-core processors. Four of the top ten supercomputers in the TOP500 list use accelerators (NVIDIA GPUs, or the IBM PowerXCell in Roadrunner). However, only one of the five announced 10-petaflop+ systems uses accelerators. IBM’s Blue Waters at Illinois will have 40,000 8-core Power7 processors; the Fujitsu Kei will have 80,000 8-core Sparc64 processors; IBM will deliver two Blue Gene/Q systems: Mira to Argonne National Lab with 49,000 16-core Power A2 processors, and Sequoia to Lawrence Livermore National Lab with twice as many as Mira. Only Cray’s Titan system at Oak Ridge is using GPUs as accelerators (to be fair, Titan is not fully approved just yet). Many would say that we will need accelerators to reach exascale, but I think those same people would have said we need accelerators to reach or exceed the 10-petaflop scale as well. We should watch these announcements over the coming year. When I first started this article, there were only four such announced systems, and I’m predicting announcements of two or more additional 10-petaflop+ systems before the year is out. If we can jump from 2 petaflops (2009) to 20 petaflops (2012) in three years, perhaps exascale computing by 2018 is achievable.

So we might have to deal with accelerators in some form, or not, but that’s not the subject of this series of articles. All the systems in the TOP500 use thousands, tens of thousands, or at the high end, hundreds of thousands of cores. Moreover, the cores themselves exhibit a high degree of internal parallelism. At exascale, there will be a lot of parallelism, on many levels, and to reach really high performance, we are going to have to program and tune to take advantage of all those levels of parallelism. And that is the subject of these articles.

Levels of Parallelism

I break down the parallelism that we see in really high end computing into six levels. We tend to focus on the node-level or core-level parallelism, since that’s where the biggest numbers come from, but it’s the product of all the parallelism that gives us the total performance, so we need to understand all the levels. You may split the levels differently, and that would make for an interesting academic discussion in itself.

  • Node Level: How many nodes in the computer. I’ll mostly use the Jaguar numbers as an example; Jaguar has 18,000 nodes. We could debate what constitutes a node, but I’ll use the common definition that a node is a set of processors or cores that share physical memory and a network interface. All (or almost all) large systems use identical nodes, except for a small number dedicated to IO, storage, or interactive interfacing, so we haven’t had to deal with heterogeneity across nodes, and I expect this to continue. I’m going to go out on a limb and predict that node count at exascale will top out closer to 100,000, rather than the more aggressive millions of nodes that other experts predict.
     
  • Socket Level: How many sockets at each node. Jaguar nodes have two sockets. Systems with accelerators exhibit heterogeneity at this level, with CPU and accelerator sockets. Current systems using GPUs as accelerators have the GPUs plugged into the IO bus (PCI-express) instead of a CPU socket, which seriously affects the bandwidth between the CPU and accelerator. An accelerator like the Convey coprocessor sits on the memory bus, just like another processor, and can be much more tightly integrated. I’m going to ignore other accelerators, such as network interfaces with MPI acceleration capabilities. They can be just as important, but are generally not programmed. We can expect socket parallelism to go up in each node, from 2 or 4 now to 8 or 16 sockets. As mentioned in the introduction, these may include compute accelerators.
     
  • Core Level: How many cores in each socket. Jaguar uses 6-core AMD processors. Core counting is open to debate. Does an Intel Core i7 hyperthreaded six-core processor count as twelve? The operating system certainly thinks there are twelve, and there can be twelve simultaneously active threads, but there are only six sets of functional units. Does each AMD Bulldozer “2-core module” count as two cores, or is it more like a single dual-issue core? Like the hyperthreaded core, the two “cores” on the 2-core module share many resources, such as the instruction fetch and decode logic and the floating point functional units, yet there are two complete sets of integer functional units. Does each NVIDIA GPU thread processor (CUDA core) count as a core, or only the multiprocessors? One could argue that if we count each CUDA core as one core, then each lane of an SSE instruction unit should count as one, making that six-core, hyperthreaded Intel Core i7 equivalent to 48 SSE-cores (single precision). With the new AMD Fusion and Intel Sandy Bridge processors, we see heterogeneity at the core level as well, with 64-bit x86 cores and integrated graphics cores on-chip. We should expect core parallelism to increase due to improved silicon densities, but we’re still waiting for through-silicon vias (TSVs) to give enough off-chip bandwidth to feed the beast.
     
  • Vector Level: How wide are the vector or SIMD instructions in each core. Focusing on 64-bit precision, Jaguar uses x86 SSE with 2-wide vectors. If we count an NVIDIA streaming multiprocessor as a single core, then each core has 8-wide (Tesla-10) or 16-wide (Fermi) vector width in hardware, with software vector lengths of 32. We can argue over whether to count hardware SIMD parallelism or software vector length, but recall that the Cray-1 used pipelining, not SIMD parallelism, to implement its 64-long vector instruction set. Intel has taken the next shot (or two) along the vector level on x86 with AVX (and the Larrabee vector instructions), increasing SIMD widths from 2 to 4 (and 8) in double precision. Lengthening the vectors involves a tradeoff: whether to use those gates for longer vectors or for more cores. Comparing performance per gate, increasing SIMD or vector parallelism is easy and low cost relative to adding cores, and I expect this trend to continue. For accelerators, such as GPUs, we can expect very long vector operations. These devices are already optimized for regular parallelism, and there’s no reason to believe that won’t continue to be the case.
     
  • Pipeline Level: How many instructions are in a partial state of completion at once. This is hard to measure. If we treat multithreading as a type of pipeline parallelism implemented across threads, NVIDIA Fermi GPUs can support pipeline parallelism factors of close to 50. With superscalar instruction issue, out-of-order instruction execution, and aggressive speculative execution, any modern high powered processor can have dozens of instructions in some partial state of completion. However, look at the CPU design that Intel used in the Knights Ferry vs. the current Intel Core processors. The Knights Ferry packs more cores on a single chip by simplifying the core to a dual-issue, in-order control unit. This is another interesting parallelism-level tradeoff, reducing the control unit (pipeline-level parallelism) to increase core count (core-level parallelism). I expect the pipeline-level parallelism for CPUs to decrease, in order to increase core count. GPUs, on the other hand, are already single-issue in-order cores but use a high degree of multithreading to tolerate memory latency, rather than depend on a cache and memory reference locality; expect this design point to continue. For these chips, I expect pipeline-level parallelism to increase, as memory latencies get relatively larger.
     
  • Instruction Level: How many instructions get dispatched or executed at one time. Typical x86 processors can issue up to three instructions per cycle. I expect this number may decrease to two, again to simplify the control unit and pack more cores on a chip. Some processors, such as Intel Itanium and AMD GPUs, use a VLIW design for static instruction-level parallelism, accepting a more complex software stack (the compiler) in exchange for a simpler control unit. VLIW advocates believe that this truly is the best way to implement instruction-level parallelism, considering all complexity, power and performance tradeoffs. It will be hard for the HPC world to sustain the costs of processor design, so we’re likely to have to live with whatever the commodity world delivers.

So lots of levels of parallelism, and lots of complexity. You might count the levels differently, such as combining the socket and core levels, since cores in different sockets differ mostly only in relative latency. You might insert an explicit level for heterogeneity, or you might combine the pipeline and instruction levels as a single microarchitectural level. But I’m going to live with the levels as given, and the table below summarizes the numbers:

[Table: summary of the parallelism levels and their counts]

Application Parallelism

Application programming is mostly focused on the higher levels of parallelism, the top four levels of my chart. Large scale parallel programming today is largely dominated by the Single Program-Multiple Data (SPMD) model across distributed memory, using MPI for communication. You can use SPMD (MPI-style) parallelism and map that across nodes, sockets and cores, which has the advantage of a single parallelism model across many levels of parallelism. You can then map your application parallelism seamlessly to a machine with many nodes and low core count, or with fewer nodes and higher core count, but you can’t take advantage of the locality inherent between cores on a single socket or node.
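
To make the SPMD style concrete, here is a minimal sketch in C using MPI. The work inside the loop and the final reduction are purely illustrative; they stand in for whatever the application actually computes on its share of the data.

    /* Minimal SPMD sketch with MPI: every rank runs the same program and
     * picks its share of the work based on its rank.  Illustrative only. */
    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000                  /* hypothetical total problem size */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Each rank owns a contiguous block of the iteration space. */
        long lo = (long)N * rank / nprocs;
        long hi = (long)N * (rank + 1) / nprocs;

        double local = 0.0;
        for (long i = lo; i < hi; i++)
            local += 1.0 / (1.0 + (double)i);    /* stand-in for real work */

        /* Combine the per-rank partial results on rank 0. */
        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("result = %f (from %d ranks)\n", global, nprocs);

        MPI_Finalize();
        return 0;
    }

The same source runs on every node, socket and core; only the rank, and hence the slice of data, differs. That is exactly why this model maps so uniformly, and so locality-blindly, across the top levels of parallelism.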

Some applications use OpenMP within a node or socket and MPI across nodes or sockets, to take advantage of lower data communication cost in shared memory. OpenMP doesn’t manage parallelism across a network, but has most of the features you need for shared-memory parallelism, and is even looking to add features to tune for non-uniform memory access costs. Unfortunately, these two programming models, each around 15 years old now, were (and still are) designed by separate committees that seemingly have no mutual interest in interoperability. This is not too hard to understand, given that OpenMP is a language model, whereas MPI is an opaque library (opaque to the language and compiler). Having to deal with two different models is unfortunate, but each deals only with a subset of the total problem.
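
A hedged sketch of that hybrid style: MPI supplies the rank-level (node or socket) parallelism, OpenMP the thread-level (core) parallelism inside each rank. The requested thread-support level and the toy computation are illustrative assumptions, not a recipe.

    /* Hybrid MPI + OpenMP sketch: ranks across nodes, threads within a
     * node's shared memory.  Illustrative only. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nprocs;
        /* FUNNELED: only the main thread will make MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const long n = 1000000;        /* hypothetical per-rank problem size */
        double local = 0.0;

        /* OpenMP handles the core-level parallelism inside each rank. */
        #pragma omp parallel for reduction(+:local)
        for (long i = 0; i < n; i++)
            local += 1.0 / (1.0 + (double)(rank * n + i));

        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("result = %f from %d ranks x %d threads\n",
                   global, nprocs, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }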

Finally, many applications are programmed to take advantage of vector instructions, in the form of vectorizable loops. The vectorization technology that served as my personal introduction to advanced compilers (about 35 years ago) is still used today for those SIMD instructions. Moreover, this will be more important in the future. As hardware vector lengths increase, a larger fraction of the total performance will come from vectors. The downside is that there’s no single way to program a parallel operation that can be mapped across vector parallelism, shared memory core or socket parallelism, or node-level parallelism, with the choice made at program compile or execution time.
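
As a reminder of what vectorizers look for, here is a hedged C example of a loop a compiler can map straight onto SSE or AVX instructions: unit stride, no loop-carried dependences, and (via restrict) a guarantee that the arrays do not overlap.

    /* A straightforwardly vectorizable loop (a daxpy-style update). */
    #include <stddef.h>

    void daxpy(size_t n, double a, const double * restrict x,
               double * restrict y)
    {
        /* restrict asserts x and y do not alias -- the kind of
         * information a vectorizer needs to prove the loop is safe. */
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

On a 2-wide SSE machine like Jaguar’s, the compiler processes two iterations per SIMD instruction; on a 4-wide AVX machine, four. The source does not change, which is precisely the appeal of leaving the vector level to the compiler.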

Compilers, on the other hand, mostly focus on the lower levels of parallelism, particularly instruction-level and pipeline parallelism. One could say that the microarchitectural parallelism is implemented in hardware, not in the compiler, but the compiler has to express the parallelism with the right instruction stream for the hardware to exploit it. Compilers also generate vector instructions, as mentioned above. Automatic parallelization, as well as efficient implementation of shared-memory programming models like OpenMP and Cilk, also requires integration into the compiler.

The interface between the application and the compiler is the programming language. The bandwidth of that interface is the amount of information that a programmer can give to the compiler. Recent linguistic research implies that the language we use to communicate changes or focuses the way we think; this is true for programming languages as well. Parallelism at the language level is becoming more common, and different languages focus on different levels of parallelism.

OpenMP, as mentioned, is used in shared memory environments, across cores or sockets, and continues to evolve, adding features for unstructured (task) parallelism. Cilk and many other parallel languages also target shared memory parallel systems. These languages usually assume uniform memory access cost, but allow for load balancing and dynamic parallelism. They focus entirely on core- and socket-level parallelism, with uniform (homogeneous) cores. The next C++ standard will likely include some sort of parallel construct, probably assuming shared memory as well. Fortran 2008 has a parallel loop construct, the do concurrent, which allows a compiler to execute the iterations in any order, including in parallel. This is a bit different from a parallel loop expressed in OpenMP, which specifies the mapping between loop iterations and threads more explicitly, and different from array assignments and the forall construct, which allow vector-style parallelism, but not necessarily multicore-style parallelism.
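
To make the distinction concrete in C (a hedged sketch; the function itself is illustrative): an OpenMP parallel loop lets the programmer pin down the mapping of iterations to threads with a schedule clause, whereas a construct like do concurrent merely asserts that the iterations are independent and leaves the mapping to the compiler and runtime.

    /* OpenMP parallel loop: the schedule clause makes the mapping of
     * iterations to threads explicit. */
    #include <omp.h>

    void scale(int n, double a, double *x)
    {
        /* static schedule: contiguous chunks of iterations, one per
         * thread, chosen by the programmer rather than the compiler. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++)
            x[i] = a * x[i];
    }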

MPI is used mostly to manage parallelism across nodes. Coarrays in Fortran, Unified Parallel C (UPC) and other PGAS languages all target the same parallelism structures as MPI. Programming using these primitives leads one to large scale, static parallelism with explicit locality and infrequent communication. As mentioned, it’s difficult to take advantage of locality between cores on a single socket or node with these.

OpenMP, Cilk, coarrays, UPC, and others are implemented and supported by compilers, but require the application programmer to expose the parallelism and express it using specific syntax. MPI is not a language, but it may as well be. It requires the same application structure as a PGAS program, without the advantage of compiler error checking or optimization.

OpenCL and (to some extent) CUDA target the middle parallelism levels. They can be used to program sockets or cores on a socket. OpenCL could even be used to program sockets or cores on different nodes, but it wouldn’t scale. You’d need some other mechanism to generate symmetric parallelism across the nodes. They are designed for a master (host) / worker (accelerator) execution model.
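
A hedged sketch of that master/worker model, using the OpenCL host API from C (error checking omitted; the kernel and all names are illustrative): the host discovers a device, compiles and ships a kernel to it, copies data over, launches the work, and reads the result back.

    /* OpenCL host/worker sketch: the CPU is the master, the accelerator
     * the worker.  Illustrative only; no error checking. */
    #include <CL/cl.h>
    #include <stdio.h>

    static const char *src =
        "__kernel void scale(__global float *x, float a) {  \n"
        "    int i = get_global_id(0);                       \n"
        "    x[i] = a * x[i];                                \n"
        "}                                                   \n";

    int main(void)
    {
        enum { N = 1024 };
        float x[N], a = 2.0f;
        for (int i = 0; i < N; i++) x[i] = (float)i;

        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

        /* Build the worker (device) code at run time from source. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "scale", NULL);

        /* Copy data to the device, launch, read the result back. */
        cl_mem dx = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                   sizeof(x), x, NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &dx);
        clSetKernelArg(k, 1, sizeof(float), &a);
        size_t global = N;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, dx, CL_TRUE, 0, sizeof(x), x, 0, NULL, NULL);

        printf("x[10] = %f\n", x[10]);   /* expect 20.0 */

        clReleaseMemObject(dx); clReleaseKernel(k); clReleaseProgram(prog);
        clReleaseCommandQueue(q); clReleaseContext(ctx);
        return 0;
    }

Note how asymmetric this is: the host orchestrates, the device computes. Nothing in the model helps you replicate that structure symmetrically across thousands of nodes, which is why some other mechanism (typically MPI) still has to sit on top.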

Then there are myriad low-level programming tricks to fool or coax the compiler into generating the right code to take advantage of the instruction-level and pipeline-level parallelism, such as manual loop unrolling, which you still see being used today. These are mostly artifacts of compilers that are not fully optimized, or of cases where the programmer knows more about the program, the data, or the target machine than he or she can express in the language.
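
For example (a hedged sketch), manually unrolling a reduction loop creates several independent accumulation chains, exposing the instruction-level and pipeline-level parallelism that the original single-accumulator loop hides.

    /* Manual 4-way unrolling of a dot product: four independent partial
     * sums break the single accumulation chain, so more multiply-adds can
     * be in flight at once in the pipelines. */
    double dot_unrolled(int n, const double *x, const double *y)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            s0 += x[i]     * y[i];
            s1 += x[i + 1] * y[i + 1];
            s2 += x[i + 2] * y[i + 2];
            s3 += x[i + 3] * y[i + 3];
        }
        for (; i < n; i++)               /* remainder iterations */
            s0 += x[i] * y[i];
        return (s0 + s1) + (s2 + s3);
    }

A good compiler will do this transformation itself; when it does not, tricks like this are exactly the hand-tuning described above.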

To reach exascale computing, we need to productively take advantage of all the levels of parallelism. This has ramifications for applications, languages, and compilers. In my next column, I’ll introduce The Three Ex’s of Exascale, which applications developers and system providers need to consider as we move forward.

About the Author

Michael Wolfe has developed compilers for over 30 years in both academia and industry, and is now a senior compiler engineer at The Portland Group, Inc. (www.pgroup.com), a wholly-owned subsidiary of STMicroelectronics, Inc. The opinions stated here are those of the author, and do not represent opinions of The Portland Group, Inc. or STMicroelectronics, Inc.
