Visit additional Tabor Communication Publications
March 08, 2011
Part 1: All that Parallelism
There are at least two ways exascale computing can go, as exemplified by the top two systems on the latest (November 2010) TOP500 list (Tianhe-1A and Jaguar). The Chinese Tianhe-1A uses 14,000 Intel multicore processors with 7,000 NVIDIA Fermi GPUs as compute accelerators, whereas the American Jaguar Cray XT-5 uses 35,000 AMD 6-core processors. Four of the top ten supercomputers in the TOP500 list use accelerators (NVIDIA GPUs or IBM PowerXCell in Roadrunner). However, only one of the five announced 10-petaflop+ systems uses accelerators. The IBM Blue Waters at Illinois will have 40,000 8-core Power7 processors; the Fujitsu Kei will have 80,000 8-core Sparc64 processors; IBM will deliver two Blue Gene/Q systems, Mira to Argonne National Lab with 49,000 16-core Power A2 processors, and Sequoia to Lawrence Livermore National Lab with twice as many as Mira. Only Cray's Titan system at Oak Ridge is using GPUs as accelerators (to be fair, Titan is not fully approved just yet). Many would say that we will need accelerators to reach exascale, but I think those same people would have said we need accelerators to reach or exceed the 10-petaflop scale as well. We should watch these announcements over the coming year. When I first started this article, there were only four such announced systems, and I'm predicting two or more 10-petaflop+ systems before the year is out. If we can jump from 2-petaflop (2009) to 20-petaflop (2012) in three years, perhaps exascale computing by 2018 is achievable.
So we might have to deal with accelerators in some form, or not, but that's not the subject of this series of articles. All the systems in the TOP500 use thousands, tens of thousands, or at the high end, hundreds of thousands of cores. Moreover, the cores themselves exhibit a high degree of internal parallelism. At exascale, there will be a lot of parallelism, on many levels, and to reach really high performance, we are going to have to program and tune to take advantage of all those levels of parallelism. And that is the subject of these articles.
Levels of Parallelism
I break down the parallelism that we see in really high end computing into six levels. We tend to focus on the node-level or core-level parallelism, since that's where the biggest numbers come from, but it's the product of all the parallelism that gives us the total performance, so we need to understand all the levels. You may split the levels differently, and that would make for an interesting academic discussion in itself.
So lots of levels of parallelism, and lots of complexity. You might count the levels differently, such as combining the socket and core levels, since cores in different sockets differ mostly only in relative latency. You might insert an explicit level for heterogeneity, or you might combine the pipeline and instruction levels as a single microarchitectural level. But I'm going to live with the levels as given, and the table below summarizes the numbers:
Application programming is mostly focused on the higher levels of parallelism, the top four levels of my chart. Large scale parallel programming today is largely dominated by the Single Program-Multiple Data (SPMD) model across distributed memory, using MPI for communication. You can use SPMD (MPI-style) parallelism and map that across nodes, sockets and cores, which has the advantage of a single parallelism model across many levels of parallelism. You can then map your application parallelism seamlessly to a machine with many nodes and low core count, or with fewer nodes and higher core count, but you can't take advantage of the locality inherent between cores on a single socket or node.
Some applications use OpenMP within a node or socket and MPI across nodes or sockets, to take advantage of lower data communication cost in shared memory. OpenMP doesn't manage parallelism across a network, but has most of the features you need for shared-memory parallelism, and is even looking to add features to tune for non-uniform memory access costs. Unfortunately, these two programming models, each around 15 years old now, were and are designed by separate committees that seemingly have no mutual interest in interoperability. This is not too hard to understand, given that OpenMP is a language model, whereas MPI is an opaque library (opaque to the language and compiler). Having to deal with two different models is unfortunate, but each deals only with a subset of the total problem.
Finally, many applications are programmed to take advantage of vector instructions, in the form of vectorizable loops. The vectorization technology that served as my personal introduction to advanced compilers (about 35 years ago) is still used today for those SIMD instructions. Moreover, this will be more important in the future. As hardware vector lengths increase, a larger fraction of the total performance will come from vectors. The downside is that there's no single way to program a parallel operation that can be mapped across vector parallelism, shared memory core or socket parallelism, or node-level parallelism, with the choice made at program compile or execution time.
Compilers, on the other hand, mostly focus on the lower levels of parallelism, particularly instruction-level and pipeline parallelism. One could say that the microarchitectural parallelism is implemented in hardware, not in the compiler, but the compiler has to express the parallelism with the right instruction stream for the hardware to exploit it. Compilers also generate vector instructions, as mentioned above. Automatic parallelization as well as efficient implementation of shared-memory programming models like OpenMP and Cilk require integration into the compiler as well.
The interface between the application and the compiler is the programming language. The bandwidth of that interface is the amount of information that a programmer can give to the compiler. Recent linguistic research implies that the language we use to communicate changes or focuses the way we think; this is true for programming languages as well. Parallelism at the language level is becoming more common. Some languages focus on different levels of parallelism.
OpenMP, as mentioned, is used in shared memory environments, across cores or sockets, and continues to evolve, adding features for unstructured (task) parallelism. Cilk and many other parallel languages also target shared memory parallel systems. These languages usually assume uniform memory access cost, but allow for load balancing and dynamic parallelism. They focus entirely on core- and socket-level parallelism, with uniform (homogeneous) cores. The next C++ standard will likely include some sort of parallel construct, probably assuming shared memory as well. Fortran 2008 has a parallel loop construct, the do concurrent, which allows a compiler to execute the iterations in any order, including in parallel. This is a bit different than a parallel loop expressed in OpenMP, which specifies the mapping between loop iterations and threads more explicitly, and different than array assignments and the forall construct, which allow vector-style parallelism, but not necessarily multicore-style parallelism.
MPI is used mostly to manage parallelism across nodes. Coarrays in Fortran, Unified Parallel C (UPC) and other PGAS languages all target the same parallelism structures as MPI. Programming using these primitives leads one to large scale, static parallelism with explicit locality and infrequent communication. As mentioned, it's difficult to take advantage of locality between cores on a single socket or node with these.
OpenMP, Cilk, coarrays, UPC, and others are implemented and supported by compilers, but require the application programmer to expose the parallelism and express it using specific syntax. MPI is not a language, but it may as well be. It requires the same application structure as a PGAS program, without the advantage of compiler error checking or optimization.
OpenCL and (to some extent) CUDA target the middle parallelism levels. They can be used to program sockets or cores on a socket. OpenCL could even be used to program sockets or cores on different nodes, but it wouldn't scale. You'd need some other mechanism to generate symmetric parallelism across the nodes. They are designed for a master (host) / worker (accelerator) execution model.
Then there are myriads of low-level programming tricks to foil or cause the compiler to generate the right code to take advantage of the instruction-level and pipeline-level parallelism, such as manual loop unrolling, which you still see being used today. These are mostly artifacts of compilers that are not fully optimized, or of cases where the programmer knows more about the program, the data, or the target machine than he or she can express in the language.
To reach exascale computing, we need to productively take advantage of all the levels of parallelism. This has ramifications for applications, languages, and compilers. In my next column, I'll introduce The Three Ex's of Exascale, which applications developers and system providers need to consider as we move forward.
About the Author
Michael Wolfe has developed compilers for over 30 years in both academia and industry, and is now a senior compiler engineer at The Portland Group, Inc. (www.pgroup.com), a wholly-owned subsidiary of STMicroelectronics, Inc. The opinions stated here are those of the author, and do not represent opinions of The Portland Group, Inc. or STMicroelectronics, Inc.
Jun 18, 2013 |
The world's largest supercomputers, like Tianhe-2, are great at traditional, compute-intensive HPC workloads, such as simulating atomic decay or modeling tornados. But data-intensive applications--such as mining big data sets for connections--is a different sort of workload, and runs best on a different sort of computer.
Jun 18, 2013 |
Researchers are finding innovative uses for Gordon, the 285 teraflop supercomputer housed at the San Diego Supercomputer Center (SDSC) that has a unique Flash-based storage system. Since going online, researchers have put the incredibly fast I/O to use on a wide variety of workloads, ranging from chemistry to political science.
Jun 17, 2013 |
The advent of low-power mobile processors and cloud delivery models is changing the economics of computing. But just as an economy car is good at different things than a full size truck, an HPC workload still has certain computing demands that neither the fastest smartphone nor the most elastic cloud cluster can fulfill.
Jun 14, 2013 |
For all the progress we've made in IT over the last 50 years, there's one area of life that has steadfastly eluded the grasp of computers: understanding human language. Now, researchers at the Texas Advanced Computing Center (TACC) are utilizing a Hadoop cluster on its Longhorn supercomputer to move the state of the art of language processing a little bit further.
Jun 13, 2013 |
Titan, the Cray XK7 at the Oak Ridge National Lab that debuted last fall as the fastest supercomputer in the world with 17.59 petaflops of sustained computing power, will rely on its previous LINPACK test for the upcoming edition of the Top 500 list.
05/10/2013 | Cleversafe, Cray, DDN, NetApp, & Panasas | From Wall Street to Hollywood, drug discovery to homeland security, companies and organizations of all sizes and stripes are coming face to face with the challenges – and opportunities – afforded by Big Data. Before anyone can utilize these extraordinary data repositories, however, they must first harness and manage their data stores, and do so utilizing technologies that underscore affordability, security, and scalability.
04/15/2013 | Bull | “50% of HPC users say their largest jobs scale to 120 cores or less.” How about yours? Are your codes ready to take advantage of today’s and tomorrow’s ultra-parallel HPC systems? Download this White Paper by Analysts Intersect360 Research to see what Bull and Intel’s Center for Excellence in Parallel Programming can do for your codes.
Join HPCwire Editor Nicole Hemsoth and Dr. David Bader from Georgia Tech as they take center stage on opening night at Atlanta's first Big Data Kick Off Week, filmed in front of a live audience. Nicole and David look at the evolution of HPC, today's big data challenges, discuss real world solutions, and reveal their predictions. Exactly what does the future holds for HPC?
Join our webinar to learn how IT managers can migrate to a more resilient, flexible and scalable solution that grows with the data center. Mellanox VMS is future-proof, efficient and brings significant CAPEX and OPEX savings. The VMS is available today.