Doug Kothe on the Race to Build Exascale Applications

By John Russell

May 29, 2017

Ensuring there are applications ready to churn out useful science when the first U.S. exascale computers arrive in the 2021-2023 timeframe is Doug Kothe’s job. No pressure. He’s not alone, of course. The U.S. Exascale Computing Project (ECP) is a complicated effort with many interrelated parts and contributors, all necessary for success. Yet Kothe’s job as director of application development is one of the more visible and daunting and perhaps best described by his boss, Paul Messina, ECP director.

“We think of 50 times [current] performance on applications [as the exascale measure of merit], unfortunately there’s a kink in this,” said Messina. “The kink is people won’t be running today’s jobs in these exascale systems. We want exascale systems to do things we can’t do today and we need to figure out a way to quantify that. In some cases it will be relatively easy – just achieving much greater resolutions – but in many cases it will be enabling additional physics to more faithfully represent the phenomena. We want to focus on measuring every capable exascale system based on full applications tackling real problems compared to what they can do today.”

Doug Kothe, ECP

In this wide-ranging discussion with HPCwire, Kothe touches on ECP application development goals and processes; several technical issues such as efforts to combine data analytics with mod/sim and the need for expanded software frameworks to accommodate exascale applications; and early thoughts for incorporating neuromorphic and quantum computing not currently part of the formal ECP plan. Interestingly, his biggest worry isn’t reaching the goal on schedule – he believes the application teams will get there – but post-ECP staff retention when industry comes calling.

By way of review, ECP is a collaborative effort of two Department of Energy organizations: the Office of Science and the National Nuclear Security Administration. Six application areas have been singled out: national security, energy security, economic security, scientific discovery, earth science, and health care. In terms of app-dev, that’s translated into 21 Science & Energy application projects, 3 NNSA application projects, and 1 DOE / NIH application project (precision medicine for cancer).

It’s not yet clear what the just-released FY2018 U.S. Budget proposed by the Trump Administration portends. Funding for science programs was cut nearly across the board, although ECP escaped. Kothe says simply, “It is the beginning of the process for the FY18 budget, and while the overall budget is determined, we will continue working on the applications that are already part of the ECP.”

In keeping with ECP’s broad ambitions, Kothe says, “All of our applications teams are focused on very specific challenge problems and by our definition a challenge problem is one that is intractable today, needs exascale resources, and is a strategic high priority for one of the DOE program offices. We aren’t claiming we are going to solve all the problems but [what] we are claiming is simulation technology that can address the problem. The point is we have the applications vectored in rather specific directions.”

 

RISE OF DATA ANALYTICS
One of the more exciting and new-to-HPC areas is the incorporation of data analytics into the HPC environment overall and ECP in particular. Indeed, harmonizing or at least integrating big data with modeling and simulation is a goal specified by the National Strategic Computing Initiative. Data-driven science isn’t new, nor is researcher familiarity with the underlying statistics. But the sudden rise of machine/deep learning techniques, including many that rely on lower-precision calculations, is somewhat new to the scientific computing community and an area where the commercial world has perhaps taken the lead. Kothe labels the topic “white hot”.

“Not being trained in the data analytics area I’ve been doing a lot of reading and talking [to others]. A large fraction of the area I feel like I know, but I didn’t appreciate the other 20 or 30 percent. The point is by exposing our applications teams to the data analytics community, even just calling libraries, we are going to see some interesting in situ and computational steering use cases. As an example of in situ, think of turbulence. It could be an LES (large eddy simulation) whose parameters could have been tuned a priori by machine learning or chosen on the fly by machine learning. That kind of work is already going on at some universities,” Kothe says.

Climate modeling is a case in point. “A big challenge is subgrid models for clouds. Right now and even at exascale we probably cannot do one km or less resolution everywhere. We may be able to do regional coupled simulations that way, but if we try to do five or ten kilometers everywhere – of course it will vary whether over ocean or land ice, sea ice, or atmosphere – you will still have many clouds lost in one cell. You need a subgrid model. Maybe machine learning could be used to select the parameters. Think of a bunch of little LES models running in a 10km x 10km cell holding lots of clouds that are then scaled into the higher level physics. I think subgrid models are potentially a poster child for machine learning.”

Steering simulations is another emerging use case. “There’s a couple of labs, Lawrence Livermore in particular, that are already using machine learning to make decisions, to automate decisions about mesh quality for fluid and structure simulations where the mesh is just flowing with the moving material and the mesh may start to contort in a way that will cause the numerical solution to break down or errors to increase. You could do quality checks on the fly and correct the mesh with machine learning.”
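To make the idea of an on-the-fly mesh quality check concrete, here is a minimal sketch in Python. It is not the Lawrence Livermore approach and involves no machine learning: it uses a plain geometric metric for 2D triangles (1.0 for an equilateral triangle, near 0 for a collapsed one) and flags cells that fall below a cutoff. The function names and the 0.3 threshold are illustrative assumptions; a learned model would take the place of the fixed threshold when deciding which cells to repair.

```python
import math


def triangle_quality(p0, p1, p2):
    """Normalized shape quality: 1.0 for equilateral, 0.0 for degenerate."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    # Triangle area from the 2D cross product of two edge vectors.
    area = 0.5 * abs((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))
    # Sum of the squared lengths of the three edges.
    edges_sq = ((x1 - x0) ** 2 + (y1 - y0) ** 2
                + (x2 - x1) ** 2 + (y2 - y1) ** 2
                + (x0 - x2) ** 2 + (y0 - y2) ** 2)
    if edges_sq == 0.0:
        return 0.0
    return 4.0 * math.sqrt(3.0) * area / edges_sq


def flag_bad_cells(points, triangles, threshold=0.3):
    """Return indices of triangles whose quality is below the (assumed) threshold."""
    return [i for i, (a, b, c) in enumerate(triangles)
            if triangle_quality(points[a], points[b], points[c]) < threshold]


# Hypothetical mesh: one well-shaped cell and one that has nearly collapsed.
points = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2), (2.0, 0.01)]
triangles = [(0, 1, 2), (0, 1, 3)]
bad = flag_bad_cells(points, triangles)  # only the second triangle is flagged
```

In a moving-mesh simulation a check like this would run every few time steps, with flagged cells handed to a smoothing or remeshing routine before the numerical solution degrades.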

One interesting use is being explored as part of the Exascale CANcer Distributed Learning Environment (CANDLE) project (see HPCwire article, Enlisting Deep Learning in the War on Cancer). Part of the project is clarifying the RAS (gene) network activity. The RAS network is implicated in very many cancers. “You have machine learning orchestrating ensembles of molecular dynamics simulations [looking at docking scenarios with the RAS protein] and examining factors that are involved in docking,” says Kothe. Machine learning can recognize already known areas and reduce the need for computationally intensive simulation in those areas while zeroing in on lesser known areas for intense quantum chemistry simulations. Think of it as zooming in and out as needed.

 

FRAMEWORKS REVISITED
Clearly there’s no shortage of challenges for ECP application development. Kothe cites optimizing node performance and memory management among the especially thorny ones: “We now have many levels of memory exposed to us. We don’t really quite know how best to use it.” Data structure choices can also be problematic, and Kothe suggests frameworks may undergo a revival.

One of the application teams (astrophysics), recalls Kothe, came to him and said, “I am afraid to make a choice for a data structure that would be pervasive in my whole code because it might be the wrong one and I’m stuck with it.”

“The point is I think what we are seeing with the applications [is] a kind of ‘going back to the future’ [to the] late 80s when you saw lots of heavyweight frameworks where an application would call out to a black box and say register this array for me and hand me back the pointer.

“That’s good and it’s bad. The bad part is you’re losing control and now you have to schlep around this black box and you don’t know if it is going to do what you want it to do. The good part is if you are on a KNL system or an NVIDIA system, you are on different nodes, and that black box memory manager would have been tuned for that hardware. [In] dealing with memory hierarchy risks, I think we are probably seeing applications move more towards frameworks which I think is a good idea. We’ve learned kind of what I call the big F or little f frameworks. I think we’re learning how to balance the two so applications can be portable and not have to rely on an army of people but still do something that’s more agile than just choose one data structure and hope it works.”
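The "register this array for me and hand me back the pointer" pattern Kothe describes can be sketched in a few lines of Python. Everything here is hypothetical and not taken from any ECP framework: `ArrayRegistry`, `host_allocator`, and the array name `density` are illustrative. The point is only that the application asks for storage by name, while the allocation policy (which on a real system might be tuned for a KNL or GPU node) lives behind the interface and can be swapped without touching application code.

```python
from array import array


class ArrayRegistry:
    """Hypothetical lightweight framework: the application registers named
    arrays and gets back a handle; the registry owns the allocation policy."""

    def __init__(self, allocator):
        # allocator is a callable (typecode, length) -> buffer. It stands in
        # for a hardware-tuned memory manager (an assumption of this sketch).
        self._allocator = allocator
        self._arrays = {}

    def register(self, name, typecode, length):
        """Allocate storage for `name` and hand the buffer back to the caller."""
        if name in self._arrays:
            raise KeyError(f"array {name!r} is already registered")
        buf = self._allocator(typecode, length)
        self._arrays[name] = buf
        return buf

    def get(self, name):
        """Look up a previously registered array by name."""
        return self._arrays[name]


def host_allocator(typecode, length):
    # Simplest possible policy: a zero-filled array in ordinary host memory.
    return array(typecode, [0]) * length


registry = ArrayRegistry(host_allocator)
density = registry.register("density", "d", 8)  # application never allocates directly
density[0] = 1.5
```

Porting to a different node type would then mean passing a different allocator to `ArrayRegistry`, which is the trade Kothe describes: less direct control over memory in exchange for code that does not hard-wire one data layout.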

Performance portability is naturally a major consideration. Historically, says Kothe of application developers, and he includes himself in the category, “We chose portability over performance because we want to make sure our science can be done anywhere. Performance can’t be an afterthought but it often is. Portability in my mind has several dimensions. So the new system shows up and it is probably not something out of left field, you know something about it, but what’s a reasonable amount of effort that you think should be required to port your code? How much of the code base do you think should change? What is correctness in terms of the problem and getting the answer?

“I would claim that a 64-bit comparison is probably not realistic. I mean it’s probably not even appropriate. What set of problems would you run? You need to run real problems. We’re asking each app team to define what they think portability means and hope that collectively we’ll move towards a good definition and a good target for all the apps but I think it will end up being fairly app specific.”
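One way to read the remark about a "64-bit comparison" is that demanding bit-for-bit agreement of double-precision results across machines is too strict, and that a ported code should instead be judged against a tolerance chosen by the application team. That reading is an interpretation, and the sketch below is only an illustration of it: the function name `solutions_agree` and the tolerance values are assumptions, not ECP criteria.

```python
import math


def solutions_agree(reference, candidate, rel_tol=1e-6, abs_tol=1e-12):
    """Return True if two result vectors match element-wise within tolerance.

    The tolerances are placeholders; each application would choose its own.
    """
    if len(reference) != len(candidate):
        return False
    return all(math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
               for r, c in zip(reference, candidate))


# Hypothetical results from a reference machine and from a freshly ported build.
reference_run = [1.0, 2.0, 3.0]
ported_run = [1.0 + 1e-9, 2.0 - 2e-9, 3.0]

bitwise_equal = reference_run == ported_run                     # exact match fails
within_tolerance = solutions_agree(reference_run, ported_run)   # tolerance check passes
```

A different compiler, math library, or reduction order can easily produce differences of this size, so an exact-match test would reject a port that is scientifically equivalent.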

THE CO-DESIGN IMPERATIVE
The necessity of co-design has become a given throughout HPC as well as with the ECP. Advancing hardware and new systems architectures must be taken into account not merely to push application performance but to get applications to run at all. However, coupling software too tightly to a specific machine or architecture is limiting. Currently ECP has established six co-design centers to help deal with specific challenges. Kothe believes use of motifs may help.

“Every application team at some level will be doing some vertically integrated co-design and there is probably more software co-design going on – the interplay with the compilers and runtime systems and that kind of thing – than anything else. By having the co-design centers identify a small number of motifs that applications are using, I think we can leverage a deep dive co-design on the motifs as opposed to doing kind of an extensive co-design vertically integrated within every application. This is new and there are some risks. But long term, my dream would be we [develop] community libraries that are co-designed around motifs that are used broadly among the applications.

“The poster child is probably [handling] particles. Almost every application has a discrete particle model for something and that’s good and it’s a challenge. So how do you encapsulate the particle [model] in a way that it can be co-designed not as a separate activity that’s not thinking about the [specific] consumer of that motif, but just thinking about making that motif rock and roll. That’s the challenge, to co-design motifs so they can be broadly used and I have high hopes there.”

 

 

STAY ON TARGET
“A big challenge with application developers is everything sounds cool and looks good, so we want to keep them focused. Year by year the applications have laid out a number of milestones and for the most part the milestones are step by step progression towards that challenge problem. The progression has many dimensions: is the science capability improving, better physics, better algorithms; is the team utilizing the hardware efficiently [such as] state of the art test beds, the latest systems on the floor; are they integrating software technologies and probably one of the most important is [are] they using co-design efforts,” says Kothe.

One ECP-wide tool is a comprehensive project database where “all the R&D projects and applications and software technology, all their plans and milestones are in one place.” A key aspect of ECP, says Kothe, is that everyone can see what everyone else is doing and how they are progressing.

Think of a milestone as a handful of things, says Kothe, that are generally tangible such as a software release or a demonstration simulation. “It could be a report or a presentation. It can even be a small write up that says I tried this algorithm and it didn’t work. A milestone is a decision point.

“It’s not always a huge success. Failure can be just as valuable. Sometimes we can force a sense of urgency. We can review this seven-year plan and say, alright you can’t bring in a technology that doesn’t have a line of sight in this timeframe, or you’ve got algorithm A and B going along [and] at this point you have [to] make a decision and choose one and go with it. I like that. I think it imparts a sense of urgency,” says Kothe.

Kothe, of course, has his own milestones. One is an annual application assessment report due every September.

“I am hearing I am a slave driver and I didn’t really think [I] had that personality,” says Kothe. One area where he is inflexible is on scheduled releases. “We want you to release on the scheduled date, that date is gospel. What’s in the release may float. So the team and budget, we like to be pretty rigid, but what’s in the release floats based on what you have learned. You have this bag of tasks and try to get as many tasks done as you can but you still must have the release.”

Currently, the comprehensive database of projects isn’t publicly available (would be interesting reading) but Kothe says individual PIs are encouraged to share information widely.

SOFTWARE TECHNOLOGY SHARING
Not surprisingly, close collaboration with the software technology team is emphasized. “Right now we have this incredible opportunity because applications teams are exposed to a lot of software technologies they’ve never seen or heard of.” It’s a bit like kids in a candy store, says Kothe: “They are looking at this technology and saying I want to do that, to do that, to do that, and so the challenge for integration is on managing the interfaces and doing it in a scalable way.”

There are a couple of technology projects that everyone wants to integrate, he says, and that’s a big bandwidth worry when you have 20-plus application projects lined up saying “let me try your stuff,” because chances are there will be new APIs and new functionalities and bugs and features too. “The software technology people are saying, ‘Doug, be careful. Let’s come up with a scalable process.’” Conversely, says Kothe, it is also true there’s a fair amount of great “software technology the application teams are not exploring which they should be.”

“We have defined a number of integration milestones which are basically milestones that require deliverables from two or three areas. We call that shared fate. [I know] it sounds like we are jumping off a cliff together. A good example is an application project looks at a linear solver and says ‘you don’t have the functionality I need, let’s negotiate requirements.’ So the solver negotiates a new API, a new functionality, and the application team will have a milestone that says it will have integrated and tested the new technology [by a given date] and the software technology team has to have its release say two or three months before. These things tend to be daisy chained like that. You have a release, then an integration assessment, and we might have another release to basically deal with any issues.

“Right now, early on in ECP, we’re having a lot of point-to-point interaction where there’s lots of apps that want to do lots of same or different things with lots of software projects. I think once we settle down on the requirements the software technologies will be kind of one to all [having] settled on a base functionality and a base API. An obvious example is MPI but even with MPI there’s new features and functionalities that certain aspects. We can’t take it for granted that some of these tremendous technologies like MPI are going to be there working the way we need for exascale,” says Kothe.

 

ECP FUTURE WATCH
Even as ECP pushes forward, it remains rooted in CMOS technology, yet there are several newer technologies – not least neuromorphic and quantum computing – which have made great strides recently and seem on the cusp of practical application.

“One of the things I have been thinking about is even if we don’t have access to a neuromorphic chip what is its behavior like from a hardware simulator point of view. The same thing with quantum computing. Our mindset has to change with regards to the algorithms we lay out for neuromorphic or quantum. The applications teams need to start thinking about different types of algorithms. As Paul [Messina] has pointed out it’s possible quantum computing could fairly soon become an accelerator on [a] traditional node. Making sure applications are compartmentalized is important to make that possible. It would allow us to be more flexible and extensible and perhaps exploit something like a quantum accelerator.”

Looking ahead, says Kothe, he worries most about the unknown unknowns – there will be surprises. “I feel like right now in apps space we kind of have known unknowns and we’ll hit some unknown unknowns, but I believe we are going to have a number of applications ready to go. We’ll have trips along the way and we may not do some things we plan now. I think we have an aggressive but not naive set of metrics. It’s really the people. We have some unbelievable people,” he says.

One can understand today’s attraction. Kothe points out this is likely to be a once-in-a-career opportunity and the mix of experience among the application team members is significant. “What we see is millennials sitting at the table showing people new ways of doing software with gray-haired guys like me who have been to the school of hard knocks. There’s a tremendous cross fertilization. I’m confident. I saw it when we selected these teams. We had teams with rosters that looked like the all star team, but I am worried about retention. We are training people to be some of the best, especially the early career folks, so I am worried that they will be in high demand, very marketable.”

Kothe Bio from ECP website:
Douglas B. Kothe (Doug) has over three decades of experience in conducting and leading applied R&D in computational applications designed to simulate complex physical phenomena in the energy, defense, and manufacturing sectors. Kothe is currently the Deputy Associate Laboratory Director of the Computing and Computational Sciences Directorate (CCSD) at Oak Ridge National Laboratory (ORNL). Prior positions for Kothe at ORNL, where he has been since 2006, were Director of the Consortium for Advanced Simulation of Light Water Reactors, DOE’s first Energy Innovation Hub (2010-2015), and Director of Science at the National Center for Computational Sciences (2006-2010).

Feature Caption:
The Transforming Additive Manufacturing through Exascale Simulation project (ExaAM) is building a new multi-physics modeling and simulation platform for 3D printing of metals to provide an up-front assessment of the manufacturability and performance of additively manufactured parts. Pictured: simulation of laser melting of metal powder in a 3D printing process (LLNL) and a fully functional lightweight robotic hand (ORNL).
