Exascale Computing: The View from Argonne

By Nicole Hemsoth

June 21, 2012

The US Department of Energy (DOE) will be the most likely recipient of the country’s initial crop of exascale supercomputers. That would certainly come as no surprise, since, according to the latest TOP500 rankings, the top three US machines all live at DOE labs – Sequoia at Lawrence Livermore, Mira at Argonne, and Jaguar at Oak Ridge.

These exascale machines will be 100 times as powerful as the top systems today, but will have to be something beyond a mere multiplication of today’s technology. While the first exascale supercomputers are still several years away, much thought has already gone into how they are to be designed and used. As a result of the dissolution of DARPA’s UHPC program, the driving force behind exascale research in the US now resides with the Department of Energy, which has embarked upon a program to help develop this technology.

To get a lab-centric view of the path to exascale, HPCwire asked three of the top directors at Argonne National Laboratory — Rick Stevens, Michael Papka, and Marc Snir — to provide some context for the challenges and benefits of developing these extreme-scale systems. Rick Stevens is Argonne’s Associate Laboratory Director of the Computing, Environment, and Life Sciences Directorate; Michael Papka is the Deputy Associate Laboratory Director of the Computing, Environment, and Life Sciences Directorate and Director of the Argonne Leadership Computing Facility (ALCF); and Marc Snir is the Director of the Mathematics and Computer Science (MCS) Division at Argonne. Here’s what they had to say:

HPCwire: What does the prospect of having exascale supercomputing mean for Argonne? What kinds of applications, or what level of application fidelity, will it enable that cannot be run on today’s petascale machines?

Rick Stevens: The series of DOE-sponsored workshops on exascale challenges has identified many science problems that need an exascale or beyond computing capability to solve. For example, we want to use first principles to design new materials that will enable a 500-mile electric car battery pack. We want to build end-to-end simulations of advanced nuclear reactors that are modular, safe and affordable. We want to add full atmospheric chemistry and microbial processes to climate models and to increase the resolution of climate models to get at detailed regional impacts. We want to model controls for an electric grid that has 30 percent renewable generation and smart consumers. In basic science we would like to study dark matter and dark energy by building high-resolution cosmological simulations to interpret next generation observations. All of these require machines that have more than a hundred times the processing power of current supercomputers.

Michael Papka: For Argonne, having an exascale machine means the next progression in computing resources at the lab. We have successfully housed and managed a steady succession of first-generation and otherwise groundbreaking resources over the years, and we hope this tradition continues.

As for the kinds of applications exascale would enable, expect to see more multiscale codes and dramatic increases in both the spatial and temporal dimensions. Biologists could model cells and organisms and study their evolution at a meaningful scale. Climate scientists could run highly accurate predictive models of droughts at local and regional scales. Examples like this exist in nearly every scientific field.

HPCwire: The first exascale systems will certainly be expensive to buy and, given the power target of 20 or so megawatts, even more expensive to run over the machine’s lifetime – almost certainly more expensive than the petascale systems of today. How is the DOE going to rationalize spending increasing amounts of money to fund the work for essentially a handful of applications? Do you think it will mean there will be fewer top systems across the DOE than there have been in the past?

Marc Snir: There is a clear need to have open science systems as well as NNSA systems. And though power is more expensive and the purchase price may be higher, amortization is spread across more years as Moore’s Law slows down. We already went from doubling processor complexity every two years to doubling it every three. This may also enable better options for mid-life upgrades. A supercomputer is still cheap compared to a major experimental facility, and yields a broader range of scientific discoveries.

Stevens: DOE will need a mix of capability systems — exascale and beyond — as well as many capacity systems to serve the needs of DOE science and engineering. DOE will also need systems to handle increasing amounts of data and more sophisticated data analysis methods under development. The total cost of acquisition and operation will be bounded by the investments DOE is allowed to make in science and national defense. The push towards exascale systems will make all computers more power efficient and therefore more affordable.

Papka: The outcome of the science is the important component. Research being done on DOE open science supercomputers today could lead to everything from more environmentally friendly concrete to safer nuclear reactor designs. There is no real way to predict or quantify the advances that any specific scientific discovery will produce. An algorithm developed today may enable a piece of code that runs a simulation that leads to a cure for cancer. The investment has to be made.

HPCwire: So does anyone at Argonne, or the DOE in general, believe money would be better spent on more petascale systems and fewer exascale systems because of escalating power costs and perhaps an anticipated dearth of applications that can make use of such systems?

Snir: It is always possible to partition a larger machine; however, it is impossible to assemble an exascale machine by hooking together many petascale machines.

The multiple DOE studies on exascale applications in 2008 and 2009 have clearly shown that progress in many application domains depends on the availability of exascale systems. While a jump of a factor of 1,000 in performance may seem huge, it is actually quite modest from the viewpoint of applications. In a 3D mesh code, such as one used for representing the atmosphere in a climate simulation, this increase in performance enables refining the mesh by a factor of less than 6 (the fourth root of 1,000), since the time scale needs to be refined by the same factor. This assumes no other changes. In fact, many other changes are needed as precision increases, for example, to better represent clouds, or to do ensemble runs in order to quantify uncertainty.
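To make Snir’s arithmetic concrete, here is a minimal back-of-the-envelope sketch (an illustration, not from the interview): with three spatial dimensions plus a time step that must shrink in proportion to the grid spacing, the affordable mesh refinement grows only as the fourth root of the increase in compute power.

```c
/* Back-of-the-envelope sketch of the scaling argument (illustrative only):
 * in a 3D mesh code with explicit time stepping, halving the grid spacing
 * multiplies the work by 2^4 = 16 (three spatial dimensions plus a matching
 * refinement of the time step). So a 1,000x jump in compute allows a mesh
 * refinement factor of only 1000^(1/4). */
#include <math.h>
#include <stdio.h>

int main(void) {
    double speedup = 1000.0;                 /* petascale -> exascale */
    double refinement = pow(speedup, 0.25);  /* fourth root: 3 space dims + time */
    printf("1000x more compute -> mesh refinement factor of about %.2f\n",
           refinement);
    return 0;
}
```

The printed factor is roughly 5.6, which is where the “less than 6” figure comes from.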

It is sometimes claimed that many petascale systems may be used more efficiently than one exascale system since ensemble runs are “embarrassingly parallel” and can be executed on distinct systems. However, this is a very inefficient way of running ensembles. One would input all the initialization data many times, and one would not take advantage of more efficient methods for sampling the probability space.

Another common claim heard is that “big data” will replace “big computation.” Nothing could be further from the truth. As we collect increasingly large amounts of data through better telescopes, better satellite imagery, and better experimental facilities, we need increasingly powerful simulation capabilities. You are surely familiar with the aphorism: “All science is either physics or stamp collecting.” What I think Ernest Rutherford meant by that is that scientific progress requires the matching of deductions made from scientific hypotheses to experimental evidence. A scientific pursuit that only involves observation is “stamp collecting.”

As we study increasingly complex systems, this matching of hypothesis to evidence requires increasingly complex simulations. Consider, for example, climate evolution. A climate model may include tens of equations and detailed description of initial conditions. We validate the model by matching its predictions to past observations. This match requires detailed simulations.

The complexity of these simulations increases rapidly as we refine our models and increase resolution. More detailed observations are useful only to the extent they enable better calibration of the climate models; this, in turn, requires a more detailed model, hence a more expensive simulation. The same phenomenon occurs in one discipline after another.

It is also important to remember that research on exascale will be hugely beneficial to petascale computing. If an exascale system consumes 20 megawatts, then a petascale system will consume less than 20 kilowatts and become available at the departmental level. If good software solutions for resilience are developed as part of exascale research, then it becomes possible to build petascale computers out of less reliable and much cheaper components.

Papka: As we transition to the exascale era, the hierarchy of systems will largely remain intact, so the advances needed for exascale will influence petascale resources and so on down through the computing space. Exascale resources will be required to tackle the next generation of computational problems.

HPCwire: How is the lab preparing for these future systems? And given that the hardware architecture and programming models have not been fully fleshed out, how deeply can this preparation go?

Snir: Exascale systems will be deployed, at best, a decade from now – later if funding is not provided for the required research and development activities. Therefore, exascale is, at this stage, a research problem. The lab is heavily involved in exascale research, from architecture, through operating systems, runtime, storage, languages and libraries, to algorithms and application codes.

This research is focused in Argonne’s Mathematics and Computer Science division, which works closely with technical and research staff at the Argonne Leadership Computing Facility. Both belong to the directorate headed by Rick Stevens. Technology developed in MCS is now being deployed on Mira, our Blue Gene/Q platform. The same will most likely be repeated in the exascale timeframe.

The strong involvement of Argonne in exascale research increases our ability to predict the likely technology evolution and prepare for it. It increases our confidence that exascale is a reachable target a decade from now. Preparations will become more concrete 4 to 6 years from now, as research moves to development, and as exascale becomes the next procurement target.

Stevens: While the precise programming models are yet to be determined, we do know that data motion is the thing we have to reduce to enable lower power consumption, and that data locality (both vertically in the memory hierarchy and horizontally in the internode sense) will need to be carefully managed and improved.

Thus we can start today to think about new algorithms that will be “exascale ready” and we can build co-design teams that bring together computer scientists, mathematicians and scientific domain experts to begin the process of thinking together how to solve these problems. We can also work with existing applications communities to help them make smart choices about rewriting their codes for near term opportunities such that they will not have to throw out their codes and start again for exascale systems.

Papka: We learn from each system we use, and continue to collaborate with our research colleagues in industry. Argonne, along with Lawrence Livermore National Laboratory, partnered with IBM in the design of the Blue Gene/P and Blue Gene/Q. Argonne has partnerships with other leading HPC vendors too, and I’m confident that these relationships with industry will grow as we move toward exascale.

The key is to stay connected and move forward with an open mind. The ALCF has developed a suite of micro kernels and mini- and full-science DOE and HPC applications that allow us to study performance on both physical and virtual future-generation hardware.

To address future programming model uncertainty, Argonne is actively involved in defining future standards. We are, of course, very involved in the MPI Forum, as well as in the OpenMP forum for CPUs and accelerators. We have been developing benchmarks to study performance and measure characteristics of programming runtime systems and of advanced and experimental features of modern HPC architectures.
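As one illustration of what such a runtime benchmark can look like, here is a hypothetical micro-benchmark sketch (not part of the ALCF suite) that measures the fork/join overhead of an OpenMP parallel region, one basic characteristic of a threading runtime.

```c
/* Hypothetical micro-benchmark sketch: times the cost of entering and
 * leaving an empty OpenMP parallel region. Illustrative only; it is not
 * part of the ALCF benchmark suite mentioned above. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    const int trials = 100000;

    double start = omp_get_wtime();
    for (int i = 0; i < trials; i++) {
        #pragma omp parallel
        {
            /* Empty region: only the fork/join overhead is measured. */
        }
    }
    double elapsed = omp_get_wtime() - start;

    printf("average parallel-region overhead: %.2f microseconds\n",
           1e6 * elapsed / trials);
    return 0;
}
```

Built with an OpenMP-enabled compiler (for example, `cc -fopenmp`), it gives a rough measure of how expensive it is to spin up threads, the kind of runtime characteristic such benchmarks track.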

HPCwire: What type of architecture is Argonne expecting for its first exascale system — a homogeneous Blue Gene-like system, a heterogeneous CPU+accelerator-based machine, or something else entirely?

Snir: It is, of course, hard to predict how a top supercomputer will look ten years from now. There is a general expectation that future high-end systems will use multiple core types that are specialized for different types of computation. One could have, for example, cores that can handle asynchronous events efficiently, such as OS or runtime requests, and cores that are optimized for deep floating point pipelines. One could have more types of cores, with only a subset of the cores active at any time, as proposed by Andrew Chien and others.

There is also a general assumption that these cores will be tightly coupled in one multichip module, with shared-memory-type communication across cores, rather than having an accelerator on an I/O bus. Intel, AMD and NVIDIA all have or have announced products of this type. Both heterogeneity and tight coupling at the node level seem to be necessary in order to improve power consumption. The tighter integration will facilitate finer-grain tasking across heterogeneous cores. Therefore, one will be able to largely handle core heterogeneity at the compiler and runtime level, rather than at the application level.

The execution model of an exascale machine should be at a higher level – dynamic tasking across cores and nodes – at a level where the specific architecture of the different cores is largely hidden, in the same way that the specific architecture of a core, for example x86 versus Power, is largely hidden from the execution model seen by programmers and most software layers today. Therefore, we expect that the current dichotomy between monolithic systems and CPU-plus-accelerator-based systems will not be meaningful ten years from now.

Stevens: To add to Marc’s comments, we believe there will be additional capabilities that some systems might have in the next ten years. One strategy for reducing power is to move compute elements closer to the memory. This could mean that new memory designs will have programmable logic close to the memory such that many types of operations could be offloaded from the traditional cores to the new “smart memory” systems.

Similar ideas might apply to the storage systems, where operations that now require moving data from disk to RAM to CPU and back again might be carried out in “smart storage.”

Finally, current large-scale systems have occasionally put logic into the interconnection network so that operations like global reductions can be executed without using the CPU functional units. We could imagine that future systems will have a fair amount of computing capability in the network fabric, again to reduce the need to move data more than necessary.
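For readers unfamiliar with the kind of operation Stevens means, the sketch below (illustrative, not from the interview) expresses a global reduction with standard MPI; on machines such as Blue Gene, the combining of partial values for a collective like this can be carried out largely in the network hardware rather than by the compute cores.

```c
/* Illustrative sketch: MPI_Allreduce is the kind of global reduction that
 * capable interconnects can execute in the network fabric, sparing the
 * CPU functional units. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)rank;   /* stand-in for a locally computed value */
    double global = 0.0;

    /* Collective sum across all ranks; the combining step is what in-network
     * logic can offload. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```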

I think we have learned that tightly integrated systems like Blue Gene have certain advantages: fewer types of parts, the lowest power consumption in their class, and very high bisection bandwidth relative to compute performance, which lets them perform extremely well on benchmarks like the Graph 500 and the Green500. They are also highly reliable. The challenge will be to see whether in the future we can get systems that combine the strengths needed: affordability, reliability, programmability, and low power consumption.

HPCwire: How about the programming model?  Will it be MPI+X, something more exotic, or both?

Snir: Both. It will be necessary to run current codes on a future exascale machine – too many lines of code would be wasted, otherwise. Of course, the execution model of MPI+X may be quite different in ten years than it is now: MPI processes could be much lighter-weight and migratable, the MPI library could be compiled and/or accelerated with suitable hardware, etc.

On the other hand, it is not clear that we have an X that can scale to thousands of threads, nor do we know how an MPI process can support such heavy multithreading. It is clear, however, that running many MPI processes on each node is wasteful. It is also still unclear how current programming models can provide resilience and help reduce energy consumption. We do know that using two or three programming models simultaneously is hard.
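As a point of reference for the MPI+X discussion, here is a minimal hybrid sketch, assuming X is OpenMP (one common instantiation, not something prescribed in the interview): a single MPI rank per node drives many shared-memory threads, instead of one MPI process per core.

```c
/* Minimal MPI+X sketch (X = OpenMP here), purely illustrative: one MPI rank
 * per node spawns threads for on-node work, while MPI handles internode
 * communication. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    /* Request thread support so OpenMP threads can coexist with MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0;
    /* Threads share the node's memory and split the loop among themselves. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0 / (1.0 + i);

    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```

Launching this with one rank per node and OMP_NUM_THREADS set to the node’s core count matches the layout Snir describes as preferable to running many MPI processes per node.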

Research on new programming models, and on mechanisms that facilitate the porting of existing code to new programming models, is essential. Such research, if pursued diligently, can have a significant impact ten years from now.

Our research focus in this area is to provide a deeper stack of programming models, from DSLs to low-level programming models, thus enabling different programmers to work at different levels of abstraction; to support automatic translation of code from one level to the next lower level, while ensuring that a programmer can interact with the translator so as to guide its decisions; to provide programming models that largely hide heterogeneity – both the distinction between different types of cores and the distinction between different communication mechanisms, that is, shared memory versus message passing; to provide programming notations that facilitate error isolation and thus enable local recovery from failures; and to provide a runtime that is much more dynamic than those currently available, in order to cope with hardware that continuously changes due to power management and frequent failures.

Stevens: An interesting question in programming models is whether we will get an X, or perhaps a Y that integrates “data” into the programming model — so we have MPI + X for simulation and MPI + Y for data-intensive work — such that we can move smoothly to a new set of programming models that, while they retain continuity with existing MPI codes and can treat them as a subset, will provide fundamentally more power to developers targeting future machines.

Ideally, of course, we would have one programming notation that is expressive for the applications, or a good target to compile domain-specific languages to, and at the same time can be effectively mapped onto a high-performance execution model and ultimately real hardware. The simpler we can make the X’s or Y’s, the better for the community.

A big concern is that some in the community might be assuming that GPUs are the future and waste considerable time trying to develop GPU-specific codes which might be useful in the near-term but probably not in the long-term for the reasons already articulated. That would suggest that X is probably not something like CUDA or OpenCL.

HPCwire: The DOE exascale effort appears to have settled on co-design as the focus of the development approach. Why was this approach undertaken and what do you think its prospects are for developing workable exascale systems?

Papka: It’s extremely important that the delivered exascale resources meet the needs of the domain scientists and their applications; therefore, effective collaboration with system vendors is crucial. The collaboration between Argonne, Livermore, and IBM that produced the Blue Gene series of machines is a great example of co-design.

In addition to discussing our system needs, we as the end users know the types of DOE-relevant applications that both labs would be running on the resource. Co-design works, but requires lots more communication and continued refinement of ideas among a larger-than-normal group of stakeholders.

Snir: The current structure of the software and hardware stack of supercomputers is more due to historical accidents than to principled design. For example, the use of a full-bodied OS on each node is due to the fact that current supercomputers evolved from server farms and clusters. A clean sheet design would never have mapped tightly coupled applications atop a loosely coupled, distributed OS.

The incremental, ad-hoc evolution of supercomputing technology may have reduced the incremental development cost of each successive generation, but has also created systems that are increasingly inefficient in their use of power and transistor budgets and increasingly complex and error-prone. Many of us believe that “business as usual” is reaching the end of its useful life.

The challenges of exascale will require significant changes both in the underlying hardware architecture and in the many layers of software above it. “Local optimizations,” whereby one layer is changed with no interaction with the other layers, are not likely to lead to a globally optimal solution. This means that one needs to consider jointly the many layers that define the architecture of current supercomputers. This is the essence of co-design.

While current co-design centers are focused on one aspect of co-design, namely the co-evolution of hardware and applications, co-design is likely to become increasingly prevalent at all levels. For example, co-design of hardware, runtime, and compilers. This is not a new idea: the “RISC revolution” entailed hardware and compiler co-design. Whenever one needs to effect a significant change in the capabilities of a system, then it becomes necessary to reconsider the functionality of its components and their relations.

The supercomputer industry is also going through a “co-design” stage, as shown by Cray’s sale of its interconnect technology to Intel. The division of labor between the various technology providers and integrators ten years from now could be quite different than it is now. Consequently, the definition of the subsystems that compose a supercomputer, and of the interfaces across subsystem boundaries, could change quite significantly.

Stevens: I believe that we will not reach exascale in the near term without an aggressive co-design process that makes visible to the whole team the costs and benefits of each set of decisions on the architecture, software stack, and algorithms. In the past it was typically the case that architects could use rules of thumb from broad classes of applications or benchmarks to resolve design choices.

However, many of the tradeoffs in exascale design are likely to be so dramatic that they need to be accompanied by an explicit agreement between the parties that they can work within the resulting design space, so as to avoid producing machines that might technically meet some exascale objective but be effectively useless to real applications.
