Exascale Computing: The View from Argonne
The US Department of Energy (DOE) will be the most likely recipient of the initial crop of exascale supercomputers in the country. That would certainly come as no surprise, since according to the latest TOP500 rankings, the top three US machines all live at DOE labs – Sequoia at Lawrence Livermore, Mira at Argonne, and Jaguar at Oak Ridge.
These exascale machines will be 100 times as powerful as the top systems today, but will have to be something beyond a mere multiplication of today’s technology. While the first exascale supercomputers are still several years away, much thought has already gone into how they are to be designed and used. As a result of the dissolution of DARPA’s UHPC program, the driving force behind exascale research in the US now resides with the Department of Energy, which has embarked upon a program to help develop this technology.
To get a lab-centric view of the path to exascale, HPCwire asked three of the top directors at Argonne National Laboratory (Rick Stevens, Michael Papka, and Marc Snir) to provide some context for the challenges and benefits of developing these extreme scale systems. Rick Stevens is Argonne’s Associate Laboratory Director of the Computing, Environment, and Life Sciences Directorate; Michael Papka is the Deputy Associate Laboratory Director of the Computing, Environment, and Life Sciences Directorate and Director of the Argonne Leadership Computing Facility (ALCF); and Marc Snir is the Director of the Mathematics and Computer Science (MCS) Division at Argonne. Here’s what they had to say:
HPCwire: What does the prospect of having exascale supercomputing mean for Argonne? What kinds of applications, or application fidelity, will it enable that cannot be run with today’s petascale machines?
Rick Stevens: The series of DOE-sponsored workshops on exascale challenges has identified many science problems that need an exascale or beyond computing capability to solve. For example, we want to use first principles to design new materials that will enable a 500-mile electric car battery pack. We want to build end-to-end simulations of advanced nuclear reactors that are modular, safe and affordable. We want to add full atmospheric chemistry and microbial processes to climate models and to increase the resolution of climate models to get at detailed regional impacts. We want to model controls for an electric grid that has 30 percent renewable generation and smart consumers. In basic science we would like to study dark matter and dark energy by building high-resolution cosmological simulations to interpret next generation observations. All of these require machines that have more than a hundred times the processing power of current supercomputers.
Michael Papka: For Argonne, having an exascale machine means the next progression in computing resources at the lab. We have successfully housed and managed a steady succession of first-generation and otherwise groundbreaking resources over the years, and we hope this tradition continues.
As for the kinds of applications exascale would enable, expect to see more multiscale codes and dramatic increases in both the spatial and temporal dimensions. Biologists could model cells and organisms and study their evolution at a meaningful scale. Climate scientists could run highly accurate predictive models of droughts at local and regional scales. Examples like this exist in nearly every scientific field.
HPCwire: The first exascale systems will certainly be expensive to buy and, given the 20 or so megawatts power target, even more expensive to run over the machine’s lifetime – almost certainly more expensive than the petascale systems of today. How is the DOE going to rationalize spending increasing amounts of money to fund the work for essentially a handful of applications? Do you think it will mean there will be fewer top systems across the DOE than there have been in the past?
Marc Snir: There is a clear need to have open science systems as well as NNSA systems. And though power is more expensive and the purchase price may be higher, amortization is spread across more years as Moore’s Law slows down. We already went from doubling processor complexity every two years to doubling it every three. This may also enable better options for mid-life upgrades. A supercomputer is still cheap compared to a major experimental facility, and yields a broader range of scientific discoveries.
Stevens: DOE will need a mix of capability systems, exascale and beyond, as well as many capacity systems to serve the needs of DOE science and engineering. DOE will also need systems to handle increasing amounts of data and more sophisticated data analysis methods under development. The total cost, acquisition and operating, will be bounded by the investments DOE is allowed to make in science and national defense. The push towards exascale systems will make all computers more power efficient and therefore more affordable.
Papka: The outcome of the science is the important component. Research being done on DOE open science supercomputers today could lead to everything from more environmentally-friendly concrete to safer nuclear reactor designs. There is no real way to predict or quantify the advancements that any specific scientific discovery will bring. An algorithm developed today may enable a piece of code that runs a simulation that leads to a cure for cancer. The investment has to be made.
HPCwire: So does anyone at Argonne, or the DOE in general, believe money would be better spent on more petascale systems and fewer exascale systems because of escalating power costs and perhaps an anticipated dearth of applications that can make use of such systems?
Snir: It is always possible to partition a larger machine; however, it is impossible to assemble an exascale machine by hooking together many petascale machines.
The multiple DOE studies on exascale applications in 2008 and 2009 have clearly shown that progress in many application domains depends on the availability of exascale systems. While a jump by a factor of 1,000 in performance may seem huge, it is actually quite modest from the viewpoint of applications. In a 3D mesh code, such as used for representing the atmosphere in a climate simulation, this increase in performance enables refining meshes by a factor of less than 6 ($\sqrt[4]{1000} \approx 5.6$), since the time scale needs to be equally refined. This assumes no other changes. In fact, many other changes are needed when precision increases, for example to better represent clouds or to do ensemble runs in order to quantify uncertainty.
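The arithmetic behind that refinement factor can be checked with a short sketch. It assumes, as a simplification not spelled out in the interview, that the cost of a run grows with the fourth power of a uniform refinement factor (three spatial dimensions plus the time step); the function name and parameters are illustrative only.

```python
def refinement_factor(speedup: float, spatial_dims: int = 3) -> float:
    """Uniform mesh refinement affordable with `speedup` times more compute.

    Assumption: refining every spatial dimension by r also forces the time
    step to be refined by r, so cost scales as r ** (spatial_dims + 1).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return speedup ** (1.0 / (spatial_dims + 1))


# A 1,000x jump in performance for a 3D mesh code with refined time steps.
factor = refinement_factor(1000.0)
print(f"refinement factor: {factor:.2f}")
```

Under these assumptions the factor comes out between 5 and 6, which matches the "less than 6" figure quoted above.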
It is sometimes claimed that many petascale systems may be used more efficiently than one exascale system since ensemble runs are “embarrassingly parallel” and can be executed on distinct systems. However, this is a very inefficient way of running ensembles. One would input all the initialization data many times, and one would not take advantage of more efficient methods for sampling the probability space.
Another common claim heard is that “big data” will replace “big computation.” Nothing could be further from the truth. As we collect increasingly large amounts of data through better telescopes, better satellite imagery, and better experimental facilities, we need increasingly powerful simulation capabilities. You are surely familiar with the aphorism: “All science is either physics or stamp collecting.” What I think Ernest Rutherford meant by that is that scientific progress requires the matching of deductions made from scientific hypotheses to experimental evidence. A scientific pursuit that only involves observation is “stamp collection.”
As we study increasingly complex systems, this matching of hypothesis to evidence requires increasingly complex simulations. Consider, for example, climate evolution. A climate model may include tens of equations and detailed description of initial conditions. We validate the model by matching its predictions to past observations. This match requires detailed simulations.
The complexity of these simulations increases rapidly as we refine our models and increase resolution. More detailed observations are useful only to the extent they enable better calibration of the climate models; this, in turn, requires a more detailed model, hence a more expensive simulation. The same phenomenon occurs in one discipline after another.
It is also important to remember that research on exascale will be hugely beneficial to petascale computing. If an exascale system consumes 20 megawatts, then a petascale system will consume less than 20 kilowatts and become available at the departmental level. If good software solutions for resilience are developed as part of exascale research, then it becomes possible to build petascale computers out of less reliable and much cheaper components.
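As a rough check on those power figures, the sketch below scales power linearly with performance at a fixed energy efficiency. The linear scaling is an assumption made here for illustration; real systems have overheads that do not scale this simply, and the constants and function name are not from the interview.

```python
EXAFLOPS = 1e18   # floating point operations per second
PETAFLOPS = 1e15


def power_at_same_efficiency(ref_flops: float, ref_watts: float,
                             target_flops: float) -> float:
    """Power for `target_flops`, assuming the same flops per watt as the reference."""
    return ref_watts * target_flops / ref_flops


# Reference point from the text: an exascale system at 20 megawatts.
efficiency = EXAFLOPS / 20e6  # flops per watt implied by that target
petascale_watts = power_at_same_efficiency(EXAFLOPS, 20e6, PETAFLOPS)

print(f"implied efficiency: {efficiency / 1e9:.0f} gigaflops per watt")
print(f"petascale power at that efficiency: {petascale_watts / 1e3:.0f} kW")
```

At equal efficiency a petascale system lands at about 20 kilowatts, the upper bound mentioned above.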
Papka: As we transition to the exascale era the hierarchy of systems will largely remain intact, so the advances needed for exascale will influence petascale resources and so on down through the computing space. Exascale resources will be required to tackle the next generation of computational problems.
HPCwire: How is the lab preparing for these future systems? And given the hardware architecture and programming models have not been fully fleshed out, how deeply can this preparation go?
Snir: Exascale systems will be deployed, at best, a decade from now – later if funding is not provided for the required research and development activities. Therefore, exascale is, at this stage, a research problem. The lab is heavily involved in exascale research, from architecture, through operating systems, runtime, storage, languages and libraries, to algorithms and application codes.
This research is focused in Argonne’s Mathematics and Computer Science division, which works closely with technical and research staff at the Argonne Leadership Computing Facility. Both belong to the directorate headed by Rick Stevens. Technology developed in MCS is now being deployed on Mira, our Blue Gene/Q platform. The same will most likely be repeated in the exascale timeframe.
The strong involvement of Argonne in exascale research increases our ability to predict the likely technology evolution and prepare for it. It increases our confidence that exascale is a reachable target a decade from now. Preparations will become more concrete 4 to 6 years from now, as research moves to development, and as exascale becomes the next procurement target.
Stevens: While the precise programming models are yet to be determined, we do know that data motion is the thing we have to reduce to enable lower power consumption, and that data locality (both vertically in the memory hierarchy and horizontally in the internode sense) will need to be carefully managed and improved.
Thus we can start today to think about new algorithms that will be “exascale ready” and we can build co-design teams that bring together computer scientists, mathematicians and scientific domain experts to begin the process of thinking together how to solve these problems. We can also work with existing applications communities to help them make smart choices about rewriting their codes for near term opportunities such that they will not have to throw out their codes and start again for exascale systems.
Papka: We learn from each system we use, and continue to collaborate with our research colleagues in industry. Argonne along with Lawrence Livermore National Laboratory partnered with IBM in the design of the Blue Gene P and Q. Argonne has partnerships with other leading HPC vendors too, and I’m confident that these relationships with industry will grow as we move toward exascale.
The key is to stay connected and move forward with an open mind. The ALCF has developed a suite of micro kernels and mini- and full-science DOE and HPC applications that allow us to study performance on both physical and virtual future-generation hardware.
To address future programming model uncertainty, Argonne is actively involved in defining future standards. We are, of course, very involved in the MPI forum, as well as in the OpenMP forum for CPUs and accelerators. We have been developing benchmarks to study performance and measure characteristics of programming runtime systems and advanced and experimental features of modern HPC architectures.
HPCwire: What type of architecture is Argonne expecting for its first exascale system — a homogeneous Blue Gene-like system, a heterogeneous CPU+accelerator-based machine, or something else entirely?
Snir: It is, of course, hard to predict how a top supercomputer will look ten years from now. There is a general expectation that future high-end systems will use multiple core types that are specialized for different types of computation. One could have, for example, cores that can handle asynchronous events efficiently, such as OS or runtime requests, and cores that are optimized for deep floating point pipelines. One could have more types of cores, with only a subset of the cores active at any time, as proposed by Andrew Chien and others.
There is also a general assumption that these cores will be tightly coupled in one multichip module with shared-memory type communication across cores, rather than having an accelerator on an I/O bus. Intel, AMD and NVIDIA all have or have announced products of this type. Both heterogeneity and tight coupling at the node level seem to be necessary in order to improve power consumption. The tighter integration will facilitate finer grain tasking across heterogeneous cores. Therefore, one will be able to largely handle core heterogeneity at the compiler and runtime level, rather than the application level.
The execution model of an exascale machine should be at a higher level – dynamic tasking across cores and nodes – at a level where the specific architecture of the different cores is largely hidden, in the same way that the specific architecture of a core (for example, x86 versus Power) is now largely hidden from the execution model viewed by programmers and most software layers. Therefore, we expect that the current dichotomy between monolithic systems and CPU-plus-accelerator-based systems will not be meaningful ten years from now.
Stevens: To add to Marc’s comments, we believe there will be additional capabilities that some systems might have in the next ten years. One strategy for reducing power is to move compute elements closer to the memory. This could mean that new memory designs will have programmable logic close to the memory such that many types of operations could be offloaded from the traditional cores to the new “smart memory” systems.
Similar ideas might apply to the storage systems, where operations that now require moving data from disk to RAM to CPU and back again might be carried out in “smart storage.”
Finally, while current large-scale systems have occasionally put logic into the interconnection network to enable things like global reductions to be executed without using the CPU functional units, we could imagine that future systems might have a fair amount of computing capability in the network fabric again to try to reduce the need to move data more than necessary.
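For readers unfamiliar with the term, a global reduction combines one value from every node into a single result. The sketch below simulates a pairwise reduction tree in plain Python, purely as an illustration of the communication pattern; it is not a description of how Blue Gene or any other interconnect implements reductions, and the function name is hypothetical.

```python
from operator import add
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def tree_reduce(values: Sequence[T], op: Callable[[T, T], T]) -> Tuple[T, int]:
    """Combine one value per node pairwise, level by level.

    Returns the reduced value and the number of combining levels, which
    grows with the logarithm of the node count rather than linearly.
    """
    if not values:
        raise ValueError("need at least one value")
    level = list(values)
    levels = 0
    while len(level) > 1:
        # Neighbouring pairs are combined; an unpaired last value is carried up.
        combined = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            combined.append(level[-1])
        level = combined
        levels += 1
    return level[0], levels


# Eight simulated nodes, each contributing its own rank to a global sum.
total, levels = tree_reduce(list(range(8)), add)
print(total, levels)
```

With eight contributors the sum is produced in three combining levels, which is the kind of step that in-network logic can perform without occupying the CPU functional units.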
I think we have learned that tightly integrated systems like Blue Gene have certain advantages: fewer types of parts, the lowest power consumption in their class, and very high metrics such as bisection bandwidth relative to compute performance, which let them perform extremely well on benchmarks like Graph 500 and Green500. They are also highly reliable. The challenge will be to see if in the future we can get systems that combine the strengths needed to be affordable, reliable, programmable, and low in power consumption.
HPCwire: How about the programming model? Will it be MPI+X, something more exotic, or both?
Snir: Both. It will be necessary to run current codes on a future exascale machine – too many lines of code would be wasted, otherwise. Of course, the execution model of MPI+X may be quite different in ten years than it is now: MPI processes could be much lighter-weight and migratable, the MPI library could be compiled and/or accelerated with suitable hardware, etc.
On the other hand, it is not clear that we have an X that can scale to thousands of threads, nor do we know how an MPI process can support such heavy multithreading. It is clear, however, that running many MPI processes on each node is wasteful. It is also still unclear how current programming models provide resilience, and help reduce energy consumption. We do know that using two or three programming models simultaneously is hard.
Research on new programming models, and on mechanisms that facilitate the porting of existing code to new programming models is essential. Such research, if pursued diligently, can have a significant impact ten years from now.
Our research focus in this area is to provide a deeper stack of programming models, from DSLs to low-level programming models, thus enabling different programmers to work at different levels of abstraction; to support automatic translation of code from one level to the next lower level, but ensure that a programmer can interact with the translator, so as to guide its decisions; to provide programming models that largely hide heterogeneity – both the distinction between different types of cores and the distinction between different communication mechanisms, that is, shared memory versus message passing; to provide programming notations that facilitate error isolation and thus enable local recovery from failures; and to provide a runtime that is much more dynamic than what is currently available, in order to cope with hardware that continuously changes, due to power management and to frequent failures.
Stevens: An interesting question in programming models is if we will get an X or perhaps a Y that integrates “data” into the programming model — so we have MPI + X for simulation and MPI + Y for data intensive — such that we can move smoothly to a new set of programming models that, while they retain continuity with existing MPI codes and can treat them as a subset, will provide fundamentally more power to developers targeting future machines.
Ideally, of course, we would have one programming notation that is expressive for the applications, or a good target to compile domain specific languages to, and at the same time can be effectively mapped onto a high-performance execution model and ultimately real hardware. The simpler we can make the X’s or Y’s, the better for the community.
A big concern is that some in the community might be assuming that GPUs are the future and waste considerable time trying to develop GPU-specific codes which might be useful in the near-term but probably not in the long-term for the reasons already articulated. That would suggest that X is probably not something like CUDA or OpenCL.
HPCwire: The DOE exascale effort appears to have settled on co-design as the focus of the development approach. Why was this approach undertaken and what do you think its prospects are for developing workable exascale systems?
Papka: It’s extremely important that the delivered exascale resources meet the needs of the domain scientists and their applications; therefore, effective collaboration with system vendors is crucial. The collaboration between Argonne, Livermore, and IBM that produced the Blue Gene series of machines is a great example of co-design.
In addition to discussing our system needs, we as the end users know the types of DOE-relevant applications that both labs would be running on the resource. Co-design works, but requires lots more communication and continued refinement of ideas among a larger-than-normal group of stakeholders.
Snir: The current structure of the software and hardware stack of supercomputers is more due to historical accidents than to principled design. For example, the use of a full-bodied OS on each node is due to the fact that current supercomputers evolved from server farms and clusters. A clean sheet design would never have mapped tightly coupled applications atop a loosely coupled, distributed OS.
The incremental, ad-hoc evolution of supercomputing technology may have reduced the incremental development cost of each successive generation, but has also created systems that are increasingly inefficient in their use of power and transistor budgets and increasingly complex and error-prone. Many of us believe that “business as usual” is reaching the end of its useful life.
The challenges of exascale will require significant changes both in the underlying hardware architecture and in the many layers of software above it. “Local optimizations,” whereby one layer is changed with no interaction with the other layers, are not likely to lead to a globally optimal solution. This means that one needs to consider jointly the many layers that define the architecture of current supercomputers. This is the essence of co-design.
While current co-design centers are focused on one aspect of co-design, namely the co-evolution of hardware and applications, co-design is likely to become increasingly prevalent at all levels, for example in the co-design of hardware, runtime, and compilers. This is not a new idea: the “RISC revolution” entailed hardware and compiler co-design. Whenever one needs to effect a significant change in the capabilities of a system, it becomes necessary to reconsider the functionality of its components and their relations.
The supercomputer industry is also going through a “co-design” stage, as shown by the sale by Cray to Intel of interconnect technology. The division of labor between various technology providers and integrators ten years from now could be quite different than it is now. Consequently, the definition of the subsystems that compose a supercomputer and of the interfaces across subsystem boundaries could change quite significantly.
Stevens: I believe that we will not reach exascale in the near term without an aggressive co-design process that makes visible to the whole team the costs and benefits of each set of decisions on the architecture, software stack, and algorithms. In the past it was typically the case that architects could use rules of thumb from broad classes of applications or benchmarks to resolve design choices.
However, many of the tradeoffs in exascale design are likely to be so dramatic that they need to be accompanied by an explicit agreement between the parties that they can work within the resulting design space and avoid producing machines that might technically meet some exascale objective but be effectively useless to real applications.