Why Iterative Innovation is the Only Path to Exascale

By Nicole Hemsoth

April 14, 2014

If we’re out of “magic bullets” that can shoot across supercomputing space, shattering assumptions about how high performance computing operates efficiently at massive scale, we’re left with one option: refine and tweak what exists, while pushing as much funding as possible toward the blue sky above in the hope that another disruptive technology will emerge.

Few have the insight into this paradigm that Buddy Bland possesses. As director of Oak Ridge National Lab’s Leadership Computing Facility and former lead on a number of large-scale system projects, Bland has developed a keen sense of what is required of the supercomputers of the future, including those that will be part of the CORAL triad of pre-exascale systems. He has seen the so-called magic bullets pop off in the past, yielding big gains in performance and power (outfitting Jaguar with GPUs for the Titan refresh, for instance). But from what he sees from on high at this point, the exascale vision needs a long string of constant, cumulative tweaks in the absence of some looming “great disruptor” for HPC.

The CORAL program is a collaborative effort between Oak Ridge, Argonne and Lawrence Livermore labs, which will deliver pre-exascale class computing for Department of Energy and National Nuclear Security Administration needs by the 2017-2018 timeframe. There will be more information about the planned capability, vendor, and architecture within the next month when details are formally released. The decisions around the third site, which will be at Oak Ridge National Lab, have been keeping Bland busy. His team at Oak Ridge is nearing the signing of a contract and expects to be able to share more about the anticipated system by the end of this year.

In addition to weighing whether the vendors that submitted offerings can deliver on capacity and power requirements, Bland has had to look back at a number of successful systems to see why they were solid resources, and why certain approaches to efficient, high performance computing fail to deliver. From Titan, Sequoia, and Mira, and the many systems before these, Bland says he’s seen enough to understand that making exascale computing practical requires serious investment in two key areas: reliability and power. This is not a surprise in itself, but the way Bland ties this into some finer points around the need for more robust hardware and software that can automatically adjust to added complexity is worth sharing.

“Over the years as these machines have grown larger, the complexity of keeping them up and running and usable to use on a single application over a long period of time has become more of a problem,” said Bland. “We see nodes fail every couple of days,” he told us. “We expect that with the CORAL machines, since there will be even more parts, there will be even more failures, so we’re working with the vendors to help us with that and we’re also looking to software that can help us get around those failures. We need to find ways to help applications stay up and run for even longer periods of time without failing.”

As it stands, the process of recovering from node failure at a large supercomputing site hasn’t developed much over the years. A good bit of it is manual, and all of it contributes to expense for both the center and the people who help get the application ship back on course. For even a basic cluster, node failure is a problem, but when the average job running on Titan takes up around 60,000 cores at minimum, having a way to mitigate downtime is essential. Aside from those direct costs, scientists simply want their results, not the burden of smacking around new nodes and reviving from a checkpoint (if they were lucky enough to have one).

“What’s really needed is full automation of the recovery process,” says Bland. He explained that these issues around recovery have been addressed already by a number of scheduling packages, but none of them have managed to mesh together what’s needed into a comprehensive package that allows touch-free recovery.
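To make the idea of touch-free recovery concrete, here is a minimal toy sketch of a supervisor that restarts a job from its last checkpoint without human intervention. It is not how any particular scheduler works; the names `NodeFailure`, `run_job`, `supervise` and `demo`, the JSON checkpoint file, and the simulated failures are all assumptions made for illustration.

```python
import json
import os
import tempfile


class NodeFailure(Exception):
    """Hypothetical stand-in for a hardware fault that kills the running job."""


def load_checkpoint(path):
    # Resume from the saved state if one exists, otherwise start fresh.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    # Write to a temporary file, then rename, so a crash mid-write
    # never leaves a half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_job(path, n_steps, checkpoint_every, fail_at):
    # The "application": sums the step indices, checkpointing periodically.
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        step = state["step"]
        if step in fail_at:
            fail_at.discard(step)  # each simulated fault fires only once
            raise NodeFailure(f"node lost at step {step}")
        state["total"] += step
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]


def supervise(path, n_steps, checkpoint_every, fail_at, max_restarts=5):
    # Restart automatically after each failure; work done since the last
    # checkpoint is lost and recomputed.
    restarts = 0
    while True:
        try:
            return run_job(path, n_steps, checkpoint_every, fail_at), restarts
        except NodeFailure:
            restarts += 1
            if restarts > max_restarts:
                raise


def demo():
    # Two simulated node failures during a 100-step job.
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "job.ckpt")
        return supervise(path, 100, 10, {37, 82})


if __name__ == "__main__":
    total, restarts = demo()
    print(f"result={total} after {restarts} automatic restarts")
```

The trade-off the sketch exposes is the checkpoint interval: checkpointing more often costs I/O time, while checkpointing less often means more recomputation after each failure.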

As an interesting side note, this capability to auto-roll after a rock hits the works is something that the largest datacenters in the world have built into their operations (think more in terms of Google, Facebook and the like rather than large scientific computing hubs) but for HPC sites, this remains a big challenge for the hardware vendors and those making schedulers as well. Ah, but that’s a different world, right? Certainly there could be no relevance to U.S. government lab supercomputing centers….Ahem. So, moving on…

If recovery and power are two of the major issues that HPC centers need to address in this era of pre-exascale systems, there seems to be a burgeoning answer that speaks to both matters: cut down on the movement of data by moving as much as possible onto the same chip. This not only wicks away the big energy drain, which is that very movement, but it also means fewer components, thus a lessened chance of failing parts. Bland says that the model, which has played out in the Blue Gene systems, has proven itself to some degree. However, despite any success there, the future of that line of IBM machines for supercomputing is in question, but that’s another article.

Bland points to another innovation that has improved power consumption specifically: the addition of GPUs. He said that despite a 10-fold improvement in computing power from upgrading Titan with new processors and GPUs from its plain vanilla CPU-only Jaguar roots, the system consumed quite a bit less power (going from around 7 megawatts to 5). This was a remarkable improvement, he said, but it was just a one-time innovation. “You can’t tackle all these problems around reliability and power without looking at every single one of the things that consumes power or leads to failure. We had GPUs and that helped, but that’s not enough. There must be more innovation for all layers of the stack.”
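Taken at face value, the figures Bland cites imply a large jump in performance per watt. The short sketch below works through the arithmetic, assuming the 10-fold speedup and the roughly 7 MW and 5 MW figures are directly comparable; the function name is illustrative.

```python
def perf_per_watt_gain(speedup, power_before_mw, power_after_mw):
    """Ratio of new performance-per-watt to old performance-per-watt."""
    return speedup * power_before_mw / power_after_mw


# Figures as quoted in the article: 10x compute, about 7 MW down to about 5 MW.
gain = perf_per_watt_gain(10, 7.0, 5.0)
print(f"Performance per watt improved roughly {gain:.0f}x")
```

On those numbers the improvement in efficiency is about 14-fold, which helps explain why Bland calls it remarkable and also why it is hard to repeat.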

Innovations in areas that don’t get quite as much attention will be the small developments that add up to more efficient exascale computing. There is no one solution, no magic bullet, says Bland. He pointed to the example of power supplies as representative of the “little things” that can be worked on in the near term. “Right now we have power supplies that are around 92 percent efficient in converting AC to DC. That’s 8 percent we’re leaving on the floor. We need to find one that’s 99 percent efficient. It’s these pieces, these small details in how we’re spending small amounts of energy that are really going to make the difference.”
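To get a feel for what those percentages mean in absolute terms, the sketch below compares conversion losses at 92 and 99 percent efficiency. The 5 MW DC load is an assumption borrowed from the approximate Titan figure mentioned earlier, and the function names are illustrative; the article does not give a load figure for this comparison.

```python
def ac_draw_mw(dc_load_mw, efficiency):
    """AC power pulled from the wall to deliver dc_load_mw of DC power."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return dc_load_mw / efficiency


def conversion_loss_mw(dc_load_mw, efficiency):
    """Power lost in AC-to-DC conversion for a given DC load."""
    return ac_draw_mw(dc_load_mw, efficiency) - dc_load_mw


DC_LOAD_MW = 5.0  # assumption: system-scale DC load, not a figure from Bland

loss_now = conversion_loss_mw(DC_LOAD_MW, 0.92)
loss_goal = conversion_loss_mw(DC_LOAD_MW, 0.99)
print(f"Loss at 92% efficiency: {loss_now * 1000:.0f} kW")
print(f"Loss at 99% efficiency: {loss_goal * 1000:.0f} kW")
print(f"Saved by the better supply: {(loss_now - loss_goal) * 1000:.0f} kW")
```

Under these assumptions the better supply saves a few hundred kilowatts, a small fraction of the total but the kind of incremental gain Bland is describing.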

There are a few other considerations that found their way from experience into the RFP process for the new CORAL systems, including the choices among particular architectures. What’s most surprising, says Bland, is how little these architectural considerations matter against the sheer process of exposing parallelism in the codes set to run on the future fastest systems. “You can’t just throw a code to the compiler—you as a human actually have to go in and expose that parallelism then let the compilers handle the architectural details.” He says that it’s a matter of writing applications in a way that can bring massive parallel capability to light versus expecting the architectural decisions to unfold in a way that automatically yields ultra high performance.

Compute, code and energy issues aren’t the only problems Bland’s team is thinking about for the next generation of large-scale systems. For instance, there’s also a broader concern around I/O that will become far more pressing going forward. It’s hard enough to think about building archives in the current generation of supercomputers, and just as difficult to get enough bandwidth since most of what centers are using is built for capacity. Here’s where another “small innovation” can yield larger gains: burst buffers. We’ve written about these before: a cache layer in front of the archive allows “bursting” of the data into slower devices, which is great for this type of traffic. As Bland said, this smaller but important innovation is good as a “stop gap” for now, but more work is needed to handle streaming traffic at high bandwidth, which will be an even bigger problem as system sizes and the data they generate grow.

Codes aside, he said, at the end of the day, the one thing that will determine the feasibility of exascale computing will be power. And while there is promise in the research being done in the FastForward and DesignForward programs, it’s about refining. He said he’s not expecting that there will be one major disruptive technology that will turn supercomputing on its head in the near term—it’s about innovations across the board that will be small individually but will contribute to a much richer set of capabilities that centers can actually afford to host.
