HPCwire Talks Exascale with Doug Kothe at Oak Ridge National Laboratory

By Tiffany Trader

July 6, 2022

In this one-on-one interview, Doug Kothe – associate laboratory director, Computing and Computational Sciences at Oak Ridge National Laboratory, and director of the Exascale Computing Project (ECP) – discusses Frontier’s progress, the significance of breaking the exaflops barrier, and the first applications that will run on Frontier. As Frontier gets up and running, the ECP will benchmark and validate across a broad range of targets in support of U.S. energy and security missions.


“Every one of our applications is targeting a very specific problem that’s really unachievable and unattainable without exascale resources. You need lots of memory and big compute to go after these big problems. So without exascale, a lot of these problems would take months or years to address on a petascale system or they’re just not even attainable.” – Doug Kothe


 


Transcript (lightly edited):

Tiffany Trader: Hey Doug. How’s it going?

Doug Kothe: It’s going pretty well.

Trader: We’re here at Oak Ridge National Laboratory with the newly appointed Associate Director for Oak Ridge. I believe that was official on June 6.

Kothe: It was. It was. So I’m a few days in, still trying to get my feet under me. I will say too that I retained Director of the Exascale Computing Project. But Lori Diachin from Livermore, my deputy, is really taking the reins there. And she’s doing great already.

Trader: You’re wearing two really big hats now, but they’re hats that I think fit nicely together.

Kothe: I believe so as well. I guess time will tell. I very much care deeply about ECP. I really want to see it across the finish line. We can see the light at the end of the tunnel and it’s not a train.

Trader: And that June 6 date I thought was notable because that’s pretty much exactly one week from the official debut of Frontier as the first supercomputer to cross the exaflops milestone, so it seemed like good timing there. And of course, you are the director of the ECP, the project responsible for exascale-readiness and for getting the applications ready for day one on exascale. And now with these new machines coming online, I think that will be a proving ground for that – and that’s where these two things are coming together. Do you want to provide a mini update on ECP, the applications that you’ve been developing, and how that will roll out to the system, especially Frontier?

Kothe: This is an incredibly exciting time. I used to play high school football, if you can believe it. It’s like we’ve been doing two-a-day practices for five years. We are ready. It’s really an exciting time. I’ll note too that Frontier being delivered was incredible. It’s not formally part of this project of ECP. But we have dear friends and colleagues and we really are right there with them watching this happen, and helping where we could from the software point of view. And the fact that it was delivered just a few weeks ago, kind of given what we’ve seen going on with COVID and supply chain and all that is incredible. To be honest, I didn’t think it would happen when it did. Every machine is unique and this one was certainly unique and complex. But what impressed me was the staff at Oak Ridge working closely with AMD and HPE – really a very cohesive team and that’s what it takes.

So on the ECP side, we are ready. We’ve been working on a couple hundred nodes of Frontier really since January. And so our software stack and our applications are running well. And by well, meaning they’re compiling and they’re getting the right answer. That’s the first step. But now we’re doing kind of single node or multi-node performance optimizations. Performance, in particular on the MI250X GPU for most of our apps is meeting expectations. We still have a lot of work to do and scaling up and, you know, being ready because we’re going to now move from 200 nodes to 9,400 nodes. And so it’s not going to be a walk in the park. I think our teams know what’s ahead of us. And that we’ll probably go through a several week, sort of scale-up period. And we’ll be rolling the apps on in terms of readiness, who’s the most ready. But we have [24] apps ready to go, and over 70 software products ready to go. They all have signed contracts with external reviewers for quantifying and demonstrating, you know, in fair amount of detail, what we signed up to five years ago. So it’s, again, it’s, it’s exciting to be here.

Trader: Five years ago… 2016, I think was the start of ECP.

Kothe: We really started funding the teams in September of ’16 and so, you know, it’ll be six years this September. And, you know, we had to… it’s hard to sort of set specific quantitative goals that far in advance within a field that’s so agile and ever changing, but so far it’s worked out well.

Trader: So you have 24 ECP applications – 21 of those are DOE Office of Science and the NNSA [National Nuclear Security Administration] contributes three of those. Can you give us some examples of those applications and the use cases?

Kothe: You bet. It is very exciting to talk about this. So for example, let’s talk power grid: being able to simulate the entire national power grid, which consists of three interconnects, and if you count the points for generation, transformers, houses, you’re up to 10 to the ninth, 10 to the 10th, at some point, maybe exascale level points. We want to be able to simulate what happens on the grid when certain power sources come on, maybe due to a disaster, or when we have wind and solar that tends to sort of ebb and flow with day and night. So we want to be able to do planning so we can help prevent blackouts or brownouts. I mean, we saw that in Texas recently and other places. That’s an exciting new application that’s very non-traditional HPC.

Another example is wind farms. In talking to the experts there, for wind farms consisting of tens of turbines, maybe 50 to 100 close by, they can buffet each other, and because of the turbulence from one turbine to the other, they can lose 20 to 30 percent of the potential wind energy coming into the wind farm. So they’re not nearly as efficient as they could be. So we’re trying to understand that so we can develop more efficient wind farms. We’re simulating quantum materials. Quantum materials are materials where the electrons can flow around very freely. In quantum, we’re trying to understand what makes the material have correlation as it’s called. And that will inform how to build quantum computers, it will inform how to build room temperature superconductors or super insulators. That’s an example of a materials application.

Others include chemistry, being able to design catalysts, basically virtually design a molecule that helps catalyze a reaction without having to do any experiments. And maybe you go into the experiment, and you fabricate and you confirm, rather than just explore. My background in nuclear engineering – I’d be remiss not to mention this – being able to design in the computer small and micro reactors, and then go out and build a safe operating reactor without necessarily having to do… with very little testing. We also engaged in fusion and also clean combustion of coal, being able to burn coal or oil, and have the byproduct be just CO2 that you can then capture and reuse or sequester. So it’s typical for the Department of Energy – and I’ll emphasize Energy – we’re all into energy production, energy transmission, materials and chemistry for energy. But I’ll also mention, the Department of Energy funds fundamental science, the origin of elements in the universe, the evolution of the universe, the nuclear force, which is known as the standard model – very, very fundamental science areas that I think will lead to some fantastic new insights. So that’s just a few examples. And again, these were chosen very carefully and selectively with our sponsors, and so every one is going to have a home and have a steward post-ECP.

Trader: And I understand from speaking with Justin Whitt, who is the project director for Frontier, that it’s nearing its acceptance phase. And so what does that mean for the timeline for the ECP applications and increasing the scale that they run on?

Kothe: It’s a good question. So we’ve negotiated a timeline, it’s been fairly conservative, because fortunately, the OLCF [Oak Ridge Leadership Computing Facility] leadership team has been through many acceptances. And they know, there’ll be fits and starts. The point is probably about four to six weeks from now, some of our most ready apps will get on. Whether or not acceptance will be done, it’s hard to tell, they have an aggressive schedule. But basically, if we get on later than that we have plenty of headroom in our own schedule. So as Justin probably talked to you, the acceptance is very rigorous. There’s functionality, do basic things that we need work? There’s performance, are we getting the performance out of the system? Certainly all indications are based on the HPL [High Performance Linpack] run that we are. And then there’s stability and stability is the one that’s most challenging. Essentially, surrogate workloads that mimic actual production workloads are run for weeks on the system. And there’s very specific metrics in terms of the percent of jobs that have to complete and the percent of those jobs that get the right answer, etc. So acceptance is pretty onerous. And so we feel confident that after that period, the machine will be fairly well shaken out for us to get on.

Trader: And Oak Ridge is hosting the HPC User Forum this week, and you gave a presentation yesterday. And one of the things that stood out to me was that you said you overestimated the time that it would take to achieve readiness. You want to talk a little bit more about that?

Kothe: Yeah, it’s interesting you picked up on that.

Trader: You don’t hear that very often.

Kothe: Well, you know, scientists, I think, tend to be more pessimistic about, you know, “gosh, I need to hypothesize, test, hypothesize, test, things are going to change, I don’t know the future, there’s lots of risk.” Certainly in software, that’s the case. But I think what we observed – and we want to write a retrospective on this – is if you have the right team together, and in the case of applications, it’s kind of five to 10 people. But not all physicists, not all engineers like myself; you need to have mathematicians, computer scientists, computational scientists, performance engineers. When you have this eclectic mix of people, everybody brings a diverse point of view and a different set of experiences. And the lesson learned there is the teams were smaller than we thought they needed to be, and I think took less time than we thought they needed. Now, we haven’t crossed the finish line yet. But it’s all about, not surprisingly, getting the right people. And so we were lucky because ECP has attracted the best and brightest. And we have great teams and teams that have been together for, you know, five plus years and learn from one another. When DOE set up this large project, yes, it’s complicated to manage, but we brought together teams of people that maybe knew of each other, maybe not, but to watch this cross-fertilization of lessons learned and best practices. And we have a lot of, you know, quarterbacks, A students, Michael Jordans, whatever you want to call them. There’s a lot of one-upmanship that goes on and that – people feed off each other. And so there’s kind of some nice competition going on within the project. So sort of all those things, they’re hard, if not impossible to measure, but that was in my head when I made the comment that in retrospect, you know, these teams pulled off more than we thought.

Trader: And next steps for ECP, I understand there are certain KPPs – key performance parameters – that need to be achieved before the project can conclude.

Kothe: Right. So for a formal project in the Department of Energy, we generally have to sign up to a small number of three to five quantitative metrics that constitute formal success. From a sort of our own staff point of view, we want to set that bar reasonably high, but we want to go beyond it. So our threshold KPP metrics have to do with demonstrating our applications are simulating real important challenge problems. Okay. A challenge problem is a problem of strategic interest to the various program offices that we’re building the apps for. And about half of the apps have to show they can do 50x performance relative to 2016 – so most of the apps benchmarked on Titan or Mira or Theta at Argonne. And so they had to sign up, you know, five-and-a-half, six years ago for 50x performance. Now, that doesn’t mean just getting an answer quicker. It also means, in probably every case I can think of, getting a better answer quicker, meaning an answer that has more physics, that is higher confidence, more predictable. So the 50x is for 11 of the 24 apps. And then the other 13 have signed up to demonstrate capabilities. So we’re on the hook for 24 apps, and our minimum performance is half. And I think we can do probably 70 – 80 percent. At least that’s our target goal.
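The 50x target described above can be illustrated with a small calculation. The following is a minimal sketch, not ECP's actual scoring method: it assumes each application reports one "higher is better" performance measure for its 2016 baseline and for the exascale run, and it reads "our minimum performance is half" as "at least half of the apps must reach the target." The function names and the sample ratios are hypothetical.

```python
from typing import Iterable


def speedup(baseline_rate: float, exascale_rate: float) -> float:
    """Ratio of new to baseline performance (higher-is-better measure, assumed)."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return exascale_rate / baseline_rate


def fraction_meeting_target(ratios: Iterable[float], target: float = 50.0) -> float:
    """Fraction of applications whose speedup is at or above the target."""
    ratios = list(ratios)
    if not ratios:
        raise ValueError("need at least one application")
    return sum(r >= target for r in ratios) / len(ratios)


def threshold_met(ratios: Iterable[float], target: float = 50.0,
                  minimum_fraction: float = 0.5) -> bool:
    """True if enough apps hit the target (minimum_fraction is an assumption)."""
    return fraction_meeting_target(ratios, target) >= minimum_fraction


# Hypothetical speedups for the 11 performance apps; not real results.
hypothetical_ratios = [12.0, 35.0, 50.0, 55.0, 61.0, 70.0,
                       48.0, 90.0, 120.0, 51.0, 20.0]
fraction = fraction_meeting_target(hypothetical_ratios)
print(f"{fraction:.0%} of apps at or above 50x; threshold met: "
      f"{threshold_met(hypothetical_ratios)}")
```

A raw ratio like this captures only "quicker"; the "better answer" part of the metric that Kothe describes would have to be folded into how each application defines its performance measure.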

On the software side, the way we’ve incentivized integration and portability is, if I’m building a software product, somebody has to care about it, somebody has to use it, and it has to be on an exascale system. So somebody is typically an application. It could be another software product, it could be the facility wants it there. So our software products have signed up for generally four to eight capability integrations. So if I’m building like a linear solver library, I have to demonstrate that let me say two apps are using my library in a critical way on, say, Aurora and Frontier, because you get a kind of a point for each, or, four apps are using my capability, say, on Frontier. And you know, the point is you get a higher score if you show what you can do on both systems. So it sounds complicated, it took us a couple of years to figure out kind of a scoring metric for software integration. And here, I would just say if your stuff is used and useful, then you ought to be fine on the score. So those are the three we have to hit. And we want to hit these by less than a year from now, ideally, much less than a year.
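The point-per-integration idea can be sketched as a simple count. This is an illustration under stated assumptions, not the project's actual scoring metric: it assumes one point for each distinct (client, system) pair in which a product is used, and the client names are made up.

```python
from typing import Iterable, Tuple


def integration_score(integrations: Iterable[Tuple[str, str]]) -> int:
    """One point per distinct (client, system) pair; duplicates count once."""
    return len(set(integrations))


# Two hypothetical apps using a library on both systems.
two_apps_two_systems = [
    ("app_a", "Frontier"), ("app_a", "Aurora"),
    ("app_b", "Frontier"), ("app_b", "Aurora"),
]

# Four hypothetical apps using the same library on one system.
four_apps_one_system = [(f"app_{c}", "Frontier") for c in "abcd"]

# The same two apps on a single system only.
two_apps_one_system = [("app_a", "Frontier"), ("app_b", "Frontier")]

for label, case in [
    ("two apps, two systems", two_apps_two_systems),
    ("four apps, one system", four_apps_one_system),
    ("two apps, one system", two_apps_one_system),
]:
    print(label, integration_score(case))
```

Under this counting, the two scenarios Kothe mentions score the same, while a given set of clients scores higher when it is demonstrated on both systems than on one.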

Trader: And benchmarking these applications on Frontier and Aurora – Aurora being the Argonne National Laboratory system. So that’s the other system that is part of your goal to benchmark applications on to hit your KPPs.

Kothe: That is correct. Now, the way that KPPs are defined is we can ideally hit all of our KPPs on one system. But we really want to do far better than that, and achieve them on both. Because for the better of science, post-ECP – these are science and engineering tools – for I think decades, we want to show that this ecosystem is robust and portable, and able to deliver great answers on any number of types of hardware. So we definitely want to get on Aurora and do the same thing.

Trader: And what are the steps to go beyond ECP? What are the plans in place as you wind down ECP to prepare for future milestones?

Kothe: So the applications are going to be mature enough to be used in science campaigns by the program offices, and I’ll call science campaign as I’m using the application to discover new science to design new things, but also to further validate the code. The point is validation is comparison against experimental data. We’ve done some of that in ECP. But the program offices – and again, it’s not our decision, but we’ve been engaged with the program office stakeholders that, you know, view a given application as being in their mission space. We’ve been talking to them for the last five years. And also negotiating with ASCR, the Advanced Scientific Computing Research office, on making sure the software stack is sustainable. And like I mentioned at the HPC User Forum, we’ve been releasing our software stack now for three and a half years, every quarter. It’s called E4S – Extreme-scale Scientific Software Stack – and we’ve got really a nice cadence and a nice process for essentially handing off this software stack to DOE, and a lot of us will still be working on it and evolving it post-ECP. So I’m quite confident it will be sustained, not just released, documented and available, but further evolved. We see in the five to 10 years to come, the software stack will evolve to capture edge technologies, likely quantum capabilities – by that meaning, you know, elements of the software stack to support quantum – and, of course, more AI and machine learning. So the stack will continue to evolve as we kind of move through these next two or three tipping points.

Trader: And then high level, how do these exascale systems support the mission of the DOE and the NNSA? And why are exascale systems important?

Kothe: Very good question. So, as you probably remember, there are a number of exascale requirements workshops that were held, pointing to very specific program offices. In the case of ECP, there are on the order of 10 program offices that we’re building applications for. So we sat down with the program offices and as a part of these workshops and in private discussions, and talked about problems of strategic interest for their office that they currently cannot address or solve today that are amenable to computing, at least for part of the solution, and need exascale. So every one of our applications is targeting a very specific problem that’s really unachievable and unattainable without exascale resources. And again, you know, you need lots of memory and big compute, to go after these big problems. So without exascale, some of these, a lot of these problems would take months or years to address on a petascale system, let’s say, or they’re just not even attainable. So the exascale drivers are there. And you know, we sat down with the program offices at the very beginning and laid out plans for each app. And part of their KPP is to show they can do that problem. And that they can do that problem by fully exploiting all the breadth and depth of Frontier or Aurora. So they have very specific metrics about full system runs, doing all the science and incorporating all the science and the physics needed for a given problem with very specific outcome metrics for each problem.

Trader: Computational milestones like exascale are exciting and inspirational. What do you think when you look ahead to future milestones?

Kothe: Well, I’d like to think that I will probably be retired at that time. But these tools and technologies we’ve developed will lead to groundbreaking discoveries, Nobel Prizes, new concepts and designs for the landscape of DOE, from energy production to power grid to materials and chemistry for energy. I mean, we’re in the middle of some real challenges right now. And you know, I really need to mention the national security aspect, too. So, you know, it’s not about just exascale. In my mind as an application person, it’s about, we need to as a nation [need to], and we are, leading applications that deliver solutions to policymakers [and] to decision-makers to make consequential decisions. So I do anticipate the results of the simulation insights provided will greatly sort of de-risk decisions and give us high confidence in making decisions that we can bank on – that’s really an important role for simulation.

Trader: Great. Well, let’s leave it at that. Thanks. It’s been great talking with you.

Kothe: Thanks, Tiffany.
