Doug Kothe on the Race to Build Exascale Applications

By John Russell

May 29, 2017

Ensuring there are applications ready to churn out useful science when the first U.S. exascale computers arrive in the 2021-2023 timeframe is Doug Kothe’s job. No pressure. He’s not alone, of course. The U.S. Exascale Computing Project (ECP) is a complicated effort with many interrelated parts and contributors, all necessary for success. Yet Kothe’s job as director of application development is one of the more visible and daunting and perhaps best described by his boss, Paul Messina, ECP director.

“We think of 50 times [current] performance on applications [as the exascale measure of merit], unfortunately there’s a kink in this,” said Messina. “The kink is people won’t be running today’s jobs in these exascale systems. We want exascale systems to do things we can’t do today and we need to figure out a way to quantify that. In some cases it will be relatively easy – just achieving much greater resolutions – but in many cases it will be enabling additional physics to more faithfully represent the phenomena. We want to focus on measuring every capable exascale system based on full applications tackling real problems compared to what they can do today.”

Doug Kothe, ECP

In this wide-ranging discussion with HPCwire, Kothe touches on ECP application development goals and processes; several technical issues such as efforts to combine data analytics with mod/sim and the need for expanded software frameworks to accommodate exascale applications; and early thoughts for incorporating neuromorphic and quantum computing not currently part of the formal ECP plan. Interestingly, his biggest worry isn’t reaching the goal on schedule – he believes the application teams will get there – but post-ECP staff retention when industry comes calling.

By way of review, ECP is a collaborative effort of two Department of Energy organizations—the Office of Science and the National Nuclear Security Administration. Six applications areas have been singled out: national security; energy security, economic security, scientific discovery; earth science; and health care. In terms of app-dev, that’s translated into 21 Science & Energy application projects, 3 NNSA application projects, and 1 DOE / NIH application project (precision medicine for cancer).

It’s not yet clear what the just released FY2018 U.S. Budget proposed by the Trump Administration portends. Funding for science programs were cut nearly across the board although ECP escaped. Kothe says simply, “It is the beginning of the process for the FY18 budget, and while the overall budget is determined, we will continue working on the applications that are already part of the ECP.”

In keeping with ECP’s broad ambitions, Kothe says, “All of our applications teams are focused on very specific challenge problems and by our definition a challenge problem is one that is intractable today, needs exascale resources, and is a strategic high priority for one of the DOE program offices. We aren’t claiming we are going to solve all the problems but we are claiming is simulation technology that can address the problem. The point is we have the applications vectored in rather specific directions.” (Summary list below, click to enlarge)

 

RISE OF DATA ANALYTICS
One of the more exciting and new-to-HPC areas is incorporation of data analytics into the HPC environment overall and ECP in particular. Indeed, harmonizing or at least integrating the big data and modelling and simulation is a goal specified by the National Strategic Computing Initiative. Data-driven science isn’t new nor is researcher familiarity with underlying statistics. But the sudden rise machine/deep learning techniques and including many that rely on lower precision calculations is somewhat new to the scientific computing community and an area where the commercial world has perhaps taken the lead. Kothe labels the topic “white hot”.

“Not being trained in the data analytics area I’ve been doing a lot of reading and talking [to others]. A large fraction of the area I feel like I know, but I didn’t appreciate the other 20 or 30 percent. The point is by exposing our applications teams to the data analytics community, even just calling libraries, we are going to see some interesting in situ and computational steering use cases. As an example of in situ, think of turbulence. It could be an LES (large eddy simulation) whose parameters could have been tuned a priori by machine learning or chosen on the fly by machine learning. That kind of work is already going on at some universities,” Kothe says.

Climate modeling is a case point. “A big challenge is subgrid models for clouds. Right now and even at exascale we probably cannot do one km or less resolution everywhere. We may be able to do regional coupled simulations that way, but if we try to do five or ten kilometers everywhere – of course it will vary whether over ocean or land ice, sea ice, or atmosphere – you will still have many clouds lost in one cell. You need a subgrid model. Maybe machine learning could be used to select the parameters. Think of a bunch of little LES models running in a 10km x10km cell holding lots of clouds that are then scaled into the higher level physics. I think subgrid models are potentially a poster child for machine learning.”

Steering simulations is another emerging use case. “There’s a couple of labs, Lawrence Livermore in particular, that are already using machine learning to make decisions, to automate decisions about mesh quality for fluid and structure simulations where the mesh is just flowing with the moving material and the mesh may start to contort in a way that will cause the numerical solution to break down or errors to increase. You could do quality checks on the fly and correct the mesh with machine learning.”

One interesting use is being explored as part of the Exascale CANcer Distributed Learning Environment (CANDLE) project (see HPCwire article, Enlisting Deep Learning in the War on Cancer). Part of the project is clarifying the RAS (gene) network activity. The RAS network is implicated very many cancers. “You have machine learning orchestrating ensembles of molecular dynamics simulations [looking at docking scenarios with the RAS protein] and examining factors that are involved in docking,” says Kothe. Machine learning can recognize already known areas and reduce need for computationally intensive simulation in those areas while zeroing in on lesser known areas for intense quantum chemistry simulations. Think of it as zooming in and out as needed.

 

FRAMEWORKS REVISITED
Clearly there’s no shortage of challenges for ECP application development. Kothe cites optimizing node performance and memory management among the especially thorny ones, “We’ve now have many levels of memory exposed to us. We don’t really quite know how best to use it.” Data structure choices can also be problematic and Kothe suggests frameworks may undergo a revival,

One of the application teams (astrophysics), recalls Kothe, came to him and said, “I am afraid to make a choice for a data structure that would be pervasive in my whole code because it might be the wrong one and I’m stuck with it.'” The point is I think what we are seeing with the applications a kind of ‘going back to the future’ in late 80s when you saw lots of heavyweight frameworks where an application would call out to a black box and say register this array for me and hand me back the pointer.

“That’s good and it’s bad. The bad part is you’re losing control and now you have to schlep around this black box and you don’t know if it is going to do what you want it to do. The good part is if you are on a KNL system or an NVIDIA system, you are on different nodes, and that block box memory manager would have been tuned for that hardware. [In] dealing with memory hierarchy risks, I think we are probably seeing applications move more towards frameworks which I find think is a good idea. We’ve learned kind of what I call the big F or little f frameworks. I think we’re learning how to balance the two so applications can be portable and not have to rely on an army of people but still do something that’s more agile than just choose one data structure and hope it works.”

Performance portability is naturally a major consideration. Historically, says Kothe, application developers and he includes himself in the category, “We chose portability over performance because we want to make sure our science can be done anywhere. Performance can’t be an afterthought but it often is. Portability in my mind has several dimensions. So the new system shows up and it is probably not something out of left field, you know something about it, but what’s a reasonable amount of effort that you think should be required to port your code? How much of the code base do you think should change? What is correctness in terms of the problem and getting the answer.

“I would claim that a 64-bit comparison is probably not realistic. I mean it’s probably not even appropriate. What set of problems would you run? You need to run real problems. We’re asking each app team to define what they think portability means and hope that collectively we’ll move towards a good definition and a good target for all the apps but I think it will end up being fairly app specific.”

THE CO-DESIGN IMPERATIVE
The necessity of co-design has become a given throughout HPC as well as with the ECP. Advancing hardware and new systems architectures must be taken into account not merely to push application performance but to get them to run at all. However coupling software too tightly to a specific machine or architecture is limiting. Currently ECP has established six co-design centers to help deal with specific challenges. Kothe believes use of motifs may help.

“Every application team at some level will be doing some vertically integrated co-design and there is probably more software co-design going on – the interplay with the compilers and runtime systems and that kind of thing – than anything else. By having the co-design centers identify a small number of motifs that applications are using, I think we can leverage a deep dive co-design on the motifs as opposed to doing kind of an extensive co-design vertically integrated within every application. This is new and there are some risks. But long term, my dream would be we [develop] community libraries that are co-designed around motifs that are used broadly among the applications.

“The poster child is probably [handling] particles. Almost every application has a discrete particle model for something and that’s good and it’s a challenge. So how do you encapsulate the particle [model] in a way that it can be co-designed not as a separate activity that’s not thinking about the [specific] consumer of that motif, but just thinking about making that motif rock and roll. That’s the challenge, to co-design motifs so they can be broadly used and I have high hopes there.”

 

 

STAY ON TARGET
“A big challenge with application developers, is everything sounds cool and looks good, so we want to keep them focused. Year by year the applications have laid out a number of milestones and for the most parts the milestones are step by step progression towards that challenge program. The progression has many dimensions: is the science capability improving, better physics, better algorithms; is the team utilizing the hardware efficiently [such as] state of the art test beds, the latest systems on the floor; are they integrating software technologies and probably one of the most important is they are using co-design efforts,” says Kothe

One ECP-wide tool is a comprehensive project database where “all the R&D projects and applications and software technology, all their plans and milestones are in one place.” A key aspect of ECP, says Kothe, is that everyone can see what everyone else is doing and how they are progressing.

Think of a milestone as a handful of things, says Kothe, that are generally tangible such as software release or a demonstration simulation. “It could be a report or a presentation. It can even be a small write up that says I tried this algorithm and it didn’t work. A milestone is a decision point.

“It’s not always a huge success. Failure can be just as valuable. Sometimes we can force a sense of urgency. We can review this seven-year plan and say, alright you can’t bring in a technology that doesn’t have a line of sight in this timeframe, or you’ve got algorithm A and B going along [and] at this point you have make a decision and choose one and go with it. I like that. I think it imparts a sense of urgency,” Kothe.

Kothe, of course, has his own milestones. One is an annual application assessment report due every September.

“I am hearing I am a slave driver and I didn’t really think had that personality,” says Kothe. One area where he is inflexible is on scheduled releases. “We want you to release on the scheduled date, that date is gospel. What’s in the release may float. So the team and budget, we like to be pretty rigid, but what’s in the release floats based on what you have learned. You have this bag of tasks and try to get as many tasks done as you can but you still must have the release.”

Currently, the comprehensive database of projects isn’t publicly available (would be interesting reading) but Kothe says individual PIs are encouraged to share information widely.

SOFTWARE TECHNOLOGY SHARING
Not surprisingly, close collaboration with the software technology team is emphasized. “Right now what we have this incredible opportunity because applications teams are exposed to a lot of software technologies they’ve never seen or heard of.” It’s a bit like kids in a candy store says Kothe, “They are looking at this technology and saying I want to do that, to do that, to do that, and so the challenge for integration is on managing the interfaces and doing it in a scalable way.”

There a couple of technology projects that everyone wants to integrate, he says, and that’s big bandwidth worry when you have 20-plus application projects lined up saying “let me try your stuff because chances are there will be new APIs and new functionalities and bugs and features [too]. The software technology people are saying, ‘Doug be careful. let’s come up with a scalable process.’” Conversely, says Kothe, it is also true there’s a fair amount of great “software technology the application teams are not exploring which they should be.”

“We have defined a number of integration milestones which are basically milestones that require deliverables from two or three areas. We call that shared fate. [I know] it sounds like we are jumping off a cliff together. A good example is an application project looks at a linear solver and says ‘you don’t have the functionality I need, lets negotiate requirements.’ So the solver negotiates a new API, a new functionality, and the application team will have a milestone that says it will have integrated and tested and the new technology [by a given date] and the software technology team has to have its release say two or three months before. These things tend to be daisy chained like that. You have a release, then an integration assessment, and we might have another release to basically deal with any issues.

“Right now, early on in ECP, we’re having a lot of point-to-point interaction where there’s lots of aps that want to do lots of same or different things with lots of software projects. I think once we settle down on the requirements the software technologies will be kind of one to all [having] settled on a base functionality and a base API. An obvious example is MPI but even with MPI there’s new features and functionalities that certain aspects. We can’t take it for granted that some of these tremendous technologies like MPI are going to be there working the way we need for exascale,” says Kothe.

 

ECP FUTURE WATCH
Even as ECP pushes forward it remains rooted in CMOS technology yet there are several newer technologies – not least neuromorphic and quantum computing – which have made great strides recently and seem on the cusp of practical application.

“One of the things I have been thinking about is even if we don’t have access to a neuromorphic chip what is its behavior like from a hardware simulator point of view. The same thing with quantum computing. Our mindset has to change with regards to the algorithms we lay out for neuromorphic or quantum. The applications teams need to start thinking about different types of algorithms. As Paul [Messina] has pointed out it’s possible quantum computing could fairly soon become an accelerator on traditional node. Making sure applications are compartmentalized is important to make that possible. It would allow us to be more flexible and extensible and perhaps exploit something like a quantum accelerator.”

Looking ahead, says Kothe, he worries most about the unknown unknowns – there will be surprises. “I feel like right now in apps space we kind of have known unknowns and we’ll hit some unknown unknowns, but I believe we are going to have a number of applications ready to go. We’ll have trips along the way and we may not do some things we plan now. I think we have an aggressive but not naive set of metrics. It’s really the people. We have some unbelievable people,” he says.

One can understand today’s attraction. Kothe points out this is likely to be a once-in-a-career opportunity and the mix of experience among the application team members significant. “What we see is millennials sitting at the table showing people new ways of doing software with gray-haired guys like me who have been to the school of hard knocks. There’s a tremendous cross fertilization. I’m confident. I saw it when we selected these teams. We had teams with rosters that looked like the all star team, but I am worried about retention. We are training people to be some of the best, especially the early career folks, so I am worried that they will be in high demand, very marketable.”

Kothe Bio from ECP website:
Douglas B. Kothe (Doug) has over three decades of experience in conducting and leading applied R&D in computational applications designed to simulate complex physical phenomena in the energy, defense, and manufacturing sectors. Kothe is currently the Deputy Associate Laboratory Director of the Computing and Computational Sciences Directorate (CCSD) at Oak Ridge National Laboratory (ORNL). Prior positions for Kothe at ORNL, where he has been since 2006, were Director of the Consortium for Advanced Simulation of Light Water Reactors, DOE’s first Energy Innovation Hub (2010-2015), and Director of Science at the National Center for Computational Sciences (2006-2010).

Feature Caption:
The Transforming Additive Manufacturing through Exascale Simulation project (ExaAM) is building a new multi-physics modeling and simulation platform for 3D printing of metals to provide an up-front assessment of the manufacturability and performance of additively manufactured parts. Pictured: simulation of laser melting of metal powder in a 3D printing process (LLNL) and a fully functional lightweight robotic hand (ORNL).

Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industy updates delivered to you every week!

TACC Researchers Test AI Traffic Monitoring Tool in Austin

December 13, 2017

Traffic jams and mishaps are often painful and sometimes dangerous facts of life. At this week’s IEEE International Conference on Big Data being held in Boston, researchers from TACC and colleagues will present a new Read more…

AMD Wins Another: Baidu to Deploy EPYC on Single Socket Servers

December 13, 2017

When AMD introduced its EPYC chip line in June, the company said a portion of the line was specifically designed to re-invigorate a single socket segment in what has become an overwhelmingly two-socket landscape in the d Read more…

By John Russell

Microsoft Wants to Speed Quantum Development

December 12, 2017

Quantum computing continues to make headlines in what remains of 2017 as tech giants jockey to establish a pole position in the race toward commercialization of quantum. This week, Microsoft took the next step in advanci Read more…

By Tiffany Trader

HPE Extreme Performance Solutions

Explore the Origins of Space with COSMOS and Memory-Driven Computing

From the formation of black holes to the origins of space, data is the key to unlocking the secrets of the early universe. Read more…

ESnet Now Moving More Than 1 Petabyte/wk

December 12, 2017

Optimizing ESnet (Energy Sciences Network), the world's fastest network for science, is an ongoing process. Recently a two-year collaboration by ESnet users – the Petascale DTN Project – achieved its ambitious goal t Read more…

AMD Wins Another: Baidu to Deploy EPYC on Single Socket Servers

December 13, 2017

When AMD introduced its EPYC chip line in June, the company said a portion of the line was specifically designed to re-invigorate a single socket segment in wha Read more…

By John Russell

Microsoft Wants to Speed Quantum Development

December 12, 2017

Quantum computing continues to make headlines in what remains of 2017 as tech giants jockey to establish a pole position in the race toward commercialization of Read more…

By Tiffany Trader

HPC Iron, Soft, Data, People – It Takes an Ecosystem!

December 11, 2017

Cutting edge advanced computing hardware (aka big iron) does not stand by itself. These computers are the pinnacle of a myriad of technologies that must be care Read more…

By Alex R. Larzelere

IBM Begins Power9 Rollout with Backing from DOE, Google

December 6, 2017

After over a year of buildup, IBM is unveiling its first Power9 system based on the same architecture as the Department of Energy CORAL supercomputers, Summit a Read more…

By Tiffany Trader

Microsoft Spins Cycle Computing into Core Azure Product

December 5, 2017

Last August, cloud giant Microsoft acquired HPC cloud orchestration pioneer Cycle Computing. Since then the focus has been on integrating Cycle’s organization Read more…

By John Russell

GlobalFoundries, Ayar Labs Team Up to Commercialize Optical I/O

December 4, 2017

GlobalFoundries (GF) and Ayar Labs, a startup focused on using light, instead of electricity, to transfer data between chips, today announced they've entered in Read more…

By Tiffany Trader

HPE In-Memory Platform Comes to COSMOS

November 30, 2017

Hewlett Packard Enterprise is on a mission to accelerate space research. In August, it sent the first commercial-off-the-shelf HPC system into space for testing Read more…

By Tiffany Trader

SC17 Cluster Competition: Who Won and Why? Results Analyzed and Over-Analyzed

November 28, 2017

Everyone by now knows that Nanyang Technological University of Singapore (NTU) took home the highest LINPACK Award and the Overall Championship from the recently concluded SC17 Student Cluster Competition. We also already know how the teams did in the Highest LINPACK and Highest HPCG competitions, with Nanyang grabbing bragging rights for both benchmarks. Read more…

By Dan Olds

US Coalesces Plans for First Exascale Supercomputer: Aurora in 2021

September 27, 2017

At the Advanced Scientific Computing Advisory Committee (ASCAC) meeting, in Arlington, Va., yesterday (Sept. 26), it was revealed that the "Aurora" supercompute Read more…

By Tiffany Trader

NERSC Scales Scientific Deep Learning to 15 Petaflops

August 28, 2017

A collaborative effort between Intel, NERSC and Stanford has delivered the first 15-petaflops deep learning software running on HPC platforms and is, according Read more…

By Rob Farber

Oracle Layoffs Reportedly Hit SPARC and Solaris Hard

September 7, 2017

Oracle’s latest layoffs have many wondering if this is the end of the line for the SPARC processor and Solaris OS development. As reported by multiple sources Read more…

By John Russell

AMD Showcases Growing Portfolio of EPYC and Radeon-based Systems at SC17

November 13, 2017

AMD’s charge back into HPC and the datacenter is on full display at SC17. Having launched the EPYC processor line in June along with its MI25 GPU the focus he Read more…

By John Russell

Nvidia Responds to Google TPU Benchmarking

April 10, 2017

Nvidia highlights strengths of its newest GPU silicon in response to Google's report on the performance and energy advantages of its custom tensor processor. Read more…

By Tiffany Trader

Japan Unveils Quantum Neural Network

November 22, 2017

The U.S. and China are leading the race toward productive quantum computing, but it's early enough that ultimate leadership is still something of an open questi Read more…

By Tiffany Trader

GlobalFoundries Puts Wind in AMD’s Sails with 12nm FinFET

September 24, 2017

From its annual tech conference last week (Sept. 20), where GlobalFoundries welcomed more than 600 semiconductor professionals (reaching the Santa Clara venue Read more…

By Tiffany Trader

Google Releases Deeplearn.js to Further Democratize Machine Learning

August 17, 2017

Spreading the use of machine learning tools is one of the goals of Google’s PAIR (People + AI Research) initiative, which was introduced in early July. Last w Read more…

By John Russell

Leading Solution Providers

Amazon Debuts New AMD-based GPU Instances for Graphics Acceleration

September 12, 2017

Last week Amazon Web Services (AWS) streaming service, AppStream 2.0, introduced a new GPU instance called Graphics Design intended to accelerate graphics. The Read more…

By John Russell

Perspective: What Really Happened at SC17?

November 22, 2017

SC is over. Now comes the myriad of follow-ups. Inboxes are filled with templated emails from vendors and other exhibitors hoping to win a place in the post-SC thinking of booth visitors. Attendees of tutorials, workshops and other technical sessions will be inundated with requests for feedback. Read more…

By Andrew Jones

EU Funds 20 Million Euro ARM+FPGA Exascale Project

September 7, 2017

At the Barcelona Supercomputer Centre on Wednesday (Sept. 6), 16 partners gathered to launch the EuroEXA project, which invests €20 million over three-and-a-half years into exascale-focused research and development. Led by the Horizon 2020 program, EuroEXA picks up the banner of a triad of partner projects — ExaNeSt, EcoScale and ExaNoDe — building on their work... Read more…

By Tiffany Trader

Delays, Smoke, Records & Markets – A Candid Conversation with Cray CEO Peter Ungaro

October 5, 2017

Earlier this month, Tom Tabor, publisher of HPCwire and I had a very personal conversation with Cray CEO Peter Ungaro. Cray has been on something of a Cinderell Read more…

By Tiffany Trader & Tom Tabor

IBM Begins Power9 Rollout with Backing from DOE, Google

December 6, 2017

After over a year of buildup, IBM is unveiling its first Power9 system based on the same architecture as the Department of Energy CORAL supercomputers, Summit a Read more…

By Tiffany Trader

Tensors Come of Age: Why the AI Revolution Will Help HPC

November 13, 2017

Thirty years ago, parallel computing was coming of age. A bitter battle began between stalwart vector computing supporters and advocates of various approaches to parallel computing. IBM skeptic Alan Karp, reacting to announcements of nCUBE’s 1024-microprocessor system and Thinking Machines’ 65,536-element array, made a public $100 wager that no one could get a parallel speedup of over 200 on real HPC workloads. Read more…

By John Gustafson & Lenore Mullin

Flipping the Flops and Reading the Top500 Tea Leaves

November 13, 2017

The 50th edition of the Top500 list, the biannual publication of the world’s fastest supercomputers based on public Linpack benchmarking results, was released Read more…

By Tiffany Trader

Intel Launches Software Tools to Ease FPGA Programming

September 5, 2017

Field Programmable Gate Arrays (FPGAs) have a reputation for being difficult to program, requiring expertise in specialty languages, like Verilog or VHDL. Easin Read more…

By Tiffany Trader

  • arrow
  • Click Here for More Headlines
  • arrow
Share This