In this one-on-one interview, Doug Kothe – associate laboratory director, Computing and Computational Sciences at Oak Ridge National Laboratory, and director of the Exascale Computing Project (ECP) – discusses Frontier’s progress, the significance of breaking the exaflops barrier, and the first applications that will run on Frontier. As Frontier gets up and running, the ECP will benchmark and validate across a broad range of targets in support of U.S. energy and security missions.
“Every one of our applications is targeting a very specific problem that’s really unachievable and unattainable without exascale resources. You need lots of memory and big compute to go after these big problems. So without exascale, a lot of these problems would take months or years to address on a petascale system or they’re just not even attainable.” – Doug Kothe
Transcript (lightly edited):
Tiffany Trader: Hey Doug. How’s it going?
Doug Kothe: It’s going pretty well.
Trader: We’re here at Oak Ridge National Laboratory with the newly appointed Associate Director for Oak Ridge. I believe that was official on June 6.
Kothe: It was. It was. So I’m a few days in, still trying to get my feet under me. I will say too that I retained Director of the Exascale Computing Project. But Lori Diachin from Livermore, my deputy, is really taking the reins there. And she’s doing great already.
Trader: You’re wearing two really big hats now, but they’re hats that I think fit nicely together.
Kothe: I believe so as well. I guess time will tell. I very much care deeply about ECP. I really want to see it across the finish line. We can see the light at the end of the tunnel and it’s not a train.
Trader: And that June 6 date I thought was notable because that’s pretty much exactly one week from the official debut of Frontier as the first supercomputer to cross the exaflops milestone, so it seemed like good timing there. And of course, you’re the director of the ECP, the project responsible for exascale-readiness and getting the applications ready for day one on exascale. And now with these new machines coming online, I think that will be a proving ground for that – and that’s where these two things are coming together. Do you want to provide a mini update on ECP, and the applications that you’ve been developing and how that will roll out to the system and especially Frontier?
Kothe: This is an incredibly exciting time. I used to play high school football, if you can believe it. It’s like we’ve been doing two-a-day practices for five years. We are ready. It’s really an exciting time. I’ll note too that Frontier being delivered was incredible. It’s not formally part of this project of ECP. But we have dear friends and colleagues and we really are right there with them watching this happen, and helping where we could from the software point of view. And the fact that it was delivered just a few weeks ago, kind of given what we’ve seen going on with COVID and supply chain and all that is incredible. To be honest, I didn’t think it would happen when it did. Every machine is unique and this one was certainly unique and complex. But what impressed me was the staff at Oak Ridge working closely with AMD and HPE – really a very cohesive team and that’s what it takes.
So on the ECP side, we are ready. We’ve been working on a couple hundred nodes of Frontier really since January. And so our software stack and our applications are running well. And by well, meaning they’re compiling and they’re getting the right answer. That’s the first step. But now we’re doing kind of single-node or multi-node performance optimizations. Performance, in particular on the MI250X GPU, for most of our apps is meeting expectations. We still have a lot of work to do in scaling up and, you know, being ready because we’re going to now move from 200 nodes to 9,400 nodes. And so it’s not going to be a walk in the park. I think our teams know what’s ahead of us. And that we’ll probably go through a several-week sort of scale-up period. And we’ll be rolling the apps on in terms of readiness, who’s the most ready. But we have apps ready to go, and over 70 software products ready to go. They all have signed contracts with external reviewers for quantifying and demonstrating, you know, in a fair amount of detail, what we signed up to five years ago. So, again, it’s exciting to be here.
Trader: Five years ago… 2016, I think was the start of ECP.
Kothe: We really started funding the teams in September of ’16 and so, you know, it’ll be six years this September. And, you know, we had to… it’s hard to sort of set specific quantitative goals that far in advance within a field that’s so agile and ever changing, but so far it’s worked out well.
Trader: So you have 24 ECP applications – 21 of those are DOE Office of Science and the NNSA [National Nuclear Security Administration] contributes three of those. Can you give us some examples of those applications and the use cases?
Kothe: You bet. It is very exciting to talk about this. So for example, let’s talk power grid, being able to simulate the entire national power grid. It consists of three interconnects, and if you count the points for generation, transformers, houses, you’re up to 10 to the ninth, 10 to the 10th, at some point, maybe exascale level points. We want to be able to simulate what happens on the grid when certain power sources come on, maybe due to a disaster, or when we have wind and solar that tends to sort of ebb and flow with day and night. So we want to be able to do planning so we can help prevent blackouts or brownouts. I mean, we saw that in Texas recently and other places. That’s an exciting new application that’s very non-traditional HPC.
Another example is wind farms. In talking to the experts there, for wind farms consisting of tens of turbines, maybe 50 to 100 close by, they can buffet each other, and because of the turbulence from one turbine to the other, they can lose 20 to 30 percent of the potential wind energy coming into the wind farm. So they’re not nearly as efficient as they could be. So we’re trying to understand that so we can develop more efficient wind farms. We’re simulating quantum materials. Quantum materials are materials where the electrons can flow around very freely. In quantum, we’re trying to understand what makes the material have correlation as it’s called. And that will inform how to build quantum computers, it will inform how to build room temperature superconductors or super insulators. That’s an example of a materials application.
Others include chemistry, being able to design catalysts, basically virtually design a molecule that helps catalyze a reaction without having to do any experiments. And maybe you go into the experiment, and you fabricate and you confirm, rather than just explore. My background is in nuclear engineering – I’d be remiss not to mention this – being able to design in the computer small and micro reactors, and then go out and build a safe operating reactor without necessarily having to do… with very little testing. We’re also engaged in fusion and also clean combustion of coal, being able to burn coal or oil, and have the byproduct be just CO2 that you can then capture and reuse or sequester. So it’s typical for the Department of Energy – and I’ll emphasize Energy – we’re all into energy production, energy transmission, materials and chemistry for energy. But I’ll also mention, the Department of Energy funds fundamental science, the origin of elements in the universe, the evolution of the universe, the nuclear force, which is known as the standard model – very, very fundamental science areas that I think will lead to some fantastic new insights. So that’s just a few examples. And again, these were chosen very carefully and selectively with our sponsors, and so every one is going to have a home and have a steward post-ECP.
Trader: And I understand from speaking with Justin Whitt, who is the project director for Frontier, that it’s nearing its acceptance phase. And so what does that mean for the timeline for the ECP applications and increasing the scale that they run on?
Kothe: It’s a good question. So we’ve negotiated a timeline, it’s been fairly conservative, because fortunately, the OLCF [Oak Ridge Leadership Computing Facility] leadership team has been through many acceptances. And they know, there’ll be fits and starts. The point is probably about four to six weeks from now, some of our most ready apps will get on. Whether or not acceptance will be done, it’s hard to tell, they have an aggressive schedule. But basically, if we get on later than that we have plenty of headroom in our own schedule. So as Justin probably talked to you, the acceptance is very rigorous. There’s functionality, do basic things that we need work? There’s performance, are we getting the performance out of the system? Certainly all indications are based on the HPL [High Performance Linpack] run that we are. And then there’s stability and stability is the one that’s most challenging. Essentially, surrogate workloads that mimic actual production workloads are run for weeks on the system. And there’s very specific metrics in terms of the percent of jobs that have to complete and the percent of those jobs that get the right answer, etc. So acceptance is pretty onerous. And so we feel confident that after that period, the machine will be fairly well shaken out for us to get on.
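[Editor’s sketch: the stability criterion Kothe describes, a share of jobs that must complete and a share of completed jobs that must get the right answer, can be illustrated roughly as follows. The class and function names and both threshold values are hypothetical placeholders chosen for illustration; the interview does not give the actual acceptance numbers.]

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobResult:
    completed: bool  # the job ran to completion
    correct: bool    # the output matched the reference answer


def stability_rates(jobs):
    """Return (completion rate, correctness rate among completed jobs)."""
    completed = [job for job in jobs if job.completed]
    completion_rate = len(completed) / len(jobs) if jobs else 0.0
    correct_rate = (
        sum(job.correct for job in completed) / len(completed) if completed else 0.0
    )
    return completion_rate, correct_rate


def passes_stability(jobs, min_completion=0.95, min_correct=0.99):
    # Both thresholds are hypothetical placeholders, not the facility's real criteria.
    completion_rate, correct_rate = stability_rates(jobs)
    return completion_rate >= min_completion and correct_rate >= min_correct
```

[The correctness rate is computed only over jobs that completed, following Kothe’s phrasing “the percent of those jobs that get the right answer.”]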
Trader: And Oak Ridge is hosting the HPC User Forum this week, and you gave a presentation yesterday. And one of the things that stood out to me was that you said you overestimated the time that it would take to achieve readiness. You want to talk a little bit more about that?
Kothe: Yeah, it’s interesting you picked up on that.
Trader: You don’t hear that very often.
Kothe: Well, you know, scientists, I think, tend to be more pessimistic about, you know, “gosh, I need to hypothesize, test, hypothesize, test, things are going to change, I don’t know the future, there’s lots of risk.” Certainly in software, that’s the case. But I think what we observed – and we want to write a retrospective on this – is if you have the right team together, and in the case of applications, it’s kind of five to 10 people. But not all physicists, not all engineers like myself; you need to have mathematicians, computer scientists, computational scientists, performance engineers. When you have this eclectic mix of people, everybody brings a diverse point of view and a different set of experiences. And the lesson learned there is the teams were smaller than we thought they needed to be, and I think took less time than we thought they needed. Now, we haven’t crossed the finish line yet. But it’s all about, not surprisingly, getting the right people. And so we were lucky because ECP has attracted the best and brightest. And we have great teams and teams that have been together for, you know, five plus years and learned from one another. When DOE set up this large project, yes, it’s complicated to manage, but we brought together teams of people that maybe knew of each other, maybe not, but to watch this cross-fertilization of lessons learned and best practices. And we have a lot of, you know, quarterbacks, A students, Michael Jordans, whatever you want to call them. There’s a lot of one-upmanship that goes on and that – people feed off each other. And so there’s kind of some nice competition going on within the project. So sort of all those things, they’re hard, if not impossible to measure, but that was in my head when I made the comment that in retrospect, you know, these teams pulled off more than we thought.
Trader: And next steps for ECP, I understand there are certain KPPs – key performance parameters – that need to be achieved before the project can conclude.
Kothe: Right. So for a formal project in the Department of Energy, we generally have to sign up to a small number of three to five quantitative metrics that constitute formal success. From our own staff point of view, we want to set that bar reasonably high, but we want to go beyond it. So our threshold KPP metrics have to do with demonstrating our applications are simulating real important challenge problems. Okay. A challenge problem is a problem of strategic interest to the various program offices that we’re building the apps for. And about half of the apps have to show they can do 50x performance relative to 2016 – so most of the apps benchmarked on Titan or Mira or Theta at Argonne. And so they had to sign up, you know, five-and-a-half, six years ago for 50x performance. Now, that doesn’t mean just getting an answer quicker. It also means, in probably every case I can think of, getting a better answer quicker, meaning an answer that has more physics, that is higher confidence, more predictable. So the 50x is for 11 of the 24 apps. And then the other 13 have signed up to demonstrate capabilities. So we’re on the hook for 24 apps, and our minimum performance is half. And I think we can do probably 70 to 80 percent. At least that’s our target goal.
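[Editor’s sketch: the arithmetic behind the 50x performance metric might look something like this. The “figure of merit” terminology, the function names, and the reading of “our minimum performance is half” as “at least half of the participating apps must reach the target” are assumptions made for illustration, not details confirmed in the interview.]

```python
import math


def speedup(baseline_fom, exascale_fom):
    """Ratio of the exascale figure of merit to the 2016-era baseline."""
    return exascale_fom / baseline_fom


def performance_kpp_met(foms, target=50.0, required_fraction=0.5):
    """foms maps an app name to (baseline_fom, exascale_fom).

    Returns (threshold met?, sorted list of apps reaching the target).
    required_fraction=0.5 is an assumed reading of "our minimum is half".
    """
    passing = [
        app for app, (base, exa) in foms.items() if speedup(base, exa) >= target
    ]
    needed = math.ceil(required_fraction * len(foms))
    return len(passing) >= needed, sorted(passing)
```

[With 11 apps and a required fraction of one half, `needed` rounds up to 6 under this reading.]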
On the software side, the way we’ve incentivized integration and portability is, if I’m building a software product, somebody has to care about it, somebody has to use it, and it has to be on an exascale system. So somebody is typically an application. It could be another software product, it could be the facility wants it there. So our software products have signed up for generally four to eight capability integrations. So if I’m building like a linear solver library, I have to demonstrate that let me say two apps are using my library in a critical way on, say, Aurora and Frontier, because you get a kind of a point for each, or, four apps are using my capability, say, on Frontier. And you know, the point is you get a higher score if you show what you can do on both systems. So it sounds complicated, it took us a couple of years to figure out kind of a scoring metric for software integration. And here, I would just say if your stuff is used and useful, then you ought to be fine on the score. So those are the three we have to hit. And we want to hit these by less than a year from now, ideally, much less than a year.
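[Editor’s sketch: one way to read the integration scoring Kothe describes is as a count of distinct (consumer, system) pairs, so that two apps on both Aurora and Frontier and four apps on Frontier alone each earn four points. The function names, the app names, and the default target of four are hypothetical; the actual ECP scoring rules are more detailed than this.]

```python
def integration_score(integrations):
    """Count distinct (consumer, system) pairs; duplicates earn nothing extra."""
    return len(set(integrations))


def meets_target(integrations, target=4):
    # target=4 is the low end of the "four to eight" range mentioned above.
    return integration_score(integrations) >= target


# Two hypothetical apps using a solver library on both systems.
two_apps_two_systems = [
    ("AppA", "Frontier"), ("AppA", "Aurora"),
    ("AppB", "Frontier"), ("AppB", "Aurora"),
]

# Four hypothetical apps using the same library on Frontier only.
four_apps_one_system = [
    ("AppA", "Frontier"), ("AppB", "Frontier"),
    ("AppC", "Frontier"), ("AppD", "Frontier"),
]
```

[Under this reading both example lists score the same, which matches the two alternatives Kothe gives for a linear solver library.]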
Trader: And benchmarking these applications on Frontier and Aurora – Aurora being the Argonne National Laboratory system. So that’s the other system that is part of your goal to benchmark applications on to hit your KPPs.
Kothe: That is correct. Now, the way that KPPs are defined is we can ideally hit all of our KPPs on one system. But we really want to do far better than that, and achieve them on both. Because for the better of science, post-ECP – these are science and engineering tools – for I think decades, we want to show that this ecosystem is robust and portable, and able to deliver great answers on any number of types of hardware. So we definitely want to get on Aurora and do the same thing.
Trader: And what are the steps to go beyond ECP? What are the plans in place as you wind down ECP to prepare for future milestones?
Kothe: So the applications are going to be mature enough to be used in science campaigns by the program offices, and I’ll call science campaign as I’m using the application to discover new science to design new things, but also to further validate the code. The point is validation is comparison against experimental data. We’ve done some of that in ECP. But the program offices – and again, it’s not our decision, but we’ve been engaged with the program office stakeholders that, you know, view a given application as being in their mission space. We’ve been talking to them for the last five years. And we’re also negotiating with ASCR, the Advanced Scientific Computing Research office, on making sure the software stack is sustainable. And like I mentioned at the HPC User Forum, we’ve been releasing our software stack now for three and a half years, every quarter. It’s called E4S – Extreme-scale Scientific Software Stack – and we’ve got really a nice cadence and a nice process for essentially handing off this software stack to DOE, and a lot of us will still be working on it and evolving it post-ECP. So I’m quite confident it will be sustained, not just released, documented and available, but further evolved. We see in the five to 10 years to come, the software stack will evolve to capture edge technologies, likely quantum capabilities – by that meaning, you know, elements of the software stack to support quantum – and, of course, more AI and machine learning. So the stack will continue to evolve as we kind of move through these next two or three tipping points.
Trader: And then high level, how do these exascale systems support the mission of the DOE and the NNSA? And why are exascale systems important?
Kothe: Very good question. So, as you probably remember, there are a number of exascale requirements workshops that were held, pointing to very specific program offices. In the case of ECP, there are on the order of 10 program offices that we’re building applications for. So we sat down with the program offices, as a part of these workshops and in private discussions, and talked about problems of strategic interest for their office that they currently cannot address or solve today that are amenable to computing, at least for part of the solution, and need exascale. So every one of our applications is targeting a very specific problem that’s really unachievable and unattainable without exascale resources. And again, you know, you need lots of memory and big compute, to go after these big problems. So without exascale, some of these, a lot of these problems would take months or years to address on a petascale system, let’s say, or they’re just not even attainable. So the exascale drivers are there. And you know, we sat down with the program offices at the very beginning and laid out plans for each app. And part of their KPP is to show they can do that problem. And that they can do that problem by fully exploiting all the breadth and depth of Frontier or Aurora. So they have very specific metrics about full system runs, doing all the science and incorporating all the science and the physics needed for a given problem with very specific outcome metrics for each problem.
Trader: Computational milestones like exascale are exciting and inspirational. What do you think when you look ahead to future milestones?
Kothe: Well, I’d like to think that I will probably be retired at that time. But these tools and technologies we’ve developed will lead to groundbreaking discoveries, Nobel Prizes, new concepts and designs for the landscape of DOE, from energy production to power grid to materials and chemistry for energy. I mean, we’re in the middle of some real challenges right now. And you know, I really need to mention the national security aspect, too. So, you know, it’s not about just exascale. In my mind as an application person, it’s about, we need to as a nation [need to], and we are, leading applications that deliver solutions to policymakers [and] to decision-makers to make consequential decisions. So I do anticipate the results of the simulation insights provided will greatly sort of de-risk decisions and give us high confidence in making decisions that we can bank on – that’s really an important role for simulation.
Trader: Great. Well, let’s leave it at that. Thanks. It’s been great talking with you.
Kothe: Thanks, Tiffany.