ISC 2021 Keynote: Thomas Sterling on Urgent Computing, Big Machines, China Speculation

By John Russell

July 1, 2021

In a somewhat shortened version of his annual ISC keynote surveying the HPC landscape, Thomas Sterling lauded the community’s effort in bringing HPC to bear in the fight against the pandemic, welcomed the start of the exascale – if not yet exaflops – era with a quick tour of some big machines, speculated a little on what China may be planning, and paid tribute to new and ongoing efforts to bring fresh talent into HPC.

Sterling is a longtime HPC leader, professor at Indiana University, and one of the co-developers of the Beowulf cluster. Let’s jump in (with apologies for any garbling of quotes).

The pandemic affected everything.

Thomas Sterling

“It has been a tragedy. There have been more than 200 million COVID cases worldwide, and almost 4 million deaths. And frankly, those numbers are probably conservative, and the actual numbers are much greater. We may never know. In the U.S., shockingly, more than half a million people, 600,000 people, have been killed by this virulent disease. And we’ve experienced over 34 million cases just in the U.S. alone, and our case rate per 1 million of the population is greater than 10 percent,” he said.

“One of the things that came out of this is an appreciation for what has been called urgent computing, the ability for high performance computing in general and the resources, both in terms of facility and talent, to be rapidly brought to bear on a problem, even a problem as challenging as that of COVID-19. Over the year across the international community, very quickly, HPC resources were freed up and made available to scientists. In addition, expert assistance and code development optimization were added to the scientific community to minimize the time of deployment of their code and their application to drug discovery, to exploration, and to analysis of new possible candidates of cures. In this sense, the high-performance computing community can be proud of the job [done] yet humbled by its own limitations in attacking this problem.”

Fugaku is an impressive machine

“Much of this slide I have used before. The core design is Arm, done by Fujitsu, and added to that is the use of significant vector extensions that have demonstrated, in their view, that a homogeneous machine can compete with accelerator machines, and that future designs will be more varied than a singular formula. Is the jury done and the verdict in [on this]? No. As rapid changes take place we’ll still see this constructive tension among those. But what we are finding is that a broader range of applications, not just in high performance computing per se, but in AI, in machine learning and in big data and analytics, all of these can be done on machines that are intended for extreme scale,” said Sterling.

“Now, I said extreme scale. Fugaku is not an exaflops Rmax machine, but it comes close. It’s somewhere around 700*. I apologize to our friends, Satoshi Matsuoka, who is standing there in front of his machine. But in the area of lower precision, for intelligence computing, it is indeed an exascale machine. So we are now in an era of exascale if not yet classic exaflops.”

The age of big machines

This era of exascale and exaflops is rapidly dawning around the globe and Sterling briefly reviewed several systems now or soon-to-be rolling out. Importantly, he emphasized, the blurring of the line between AI and HPC is happening fast and that fusion is greatly influencing HPC computer architecture.

About Frontier, which is expected to be the first U.S. exascale system to be stood up, he said:

“The Frontier machine has been announced as going to be the U.S.’s first exaflops and by exaflops, I mean an Rmax supercomputing somewhere around – we don’t have the measurements, of course – but the estimates are about one and a half exaflops Rmax. This will be operated at the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory in Tennessee, where the current Summit machine is, and this will be deployed towards the end of this year or the very beginning of the next year. It is being integrated by the Cray division of Hewlett Packard Enterprise and incorporates AMD chips, providing substantial performance and energy efficiency, although it’s predicted that the power consumption will be on the order of 30 megawatts but in a footprint [that’s] somewhat modest of just over 100 racks. The cost is $600 million. That’s a lot of money. [I’m] looking forward to this machine being operated and the science and the data analytics that can be performed with it.”

Sterling gave a brief tour of several of the forthcoming large systems, most of whose names are familiar to the HPC community. Despite being largely accelerator-based architectures, there is diversity among the approaches. He singled out the UK Met Office-Microsoft project to build the Met Office’s next weather forecasting system in the cloud. That’s a first. He also looked at the EuroHPC Joint Undertaking’s LUMI project, which will be a roughly half-exaflops system.

“[The system] will be in Finland but there are 10 different countries that are involved in the consortium that together will share this machine. You have the list (on the slide below) of such countries starting with Finland and going down to Switzerland. There are multiple partitions for different purposes. So, I think that this is a slightly different way of organizing machines, where distinct countries will be managing different partitions and have different responsibilities,” said Sterling.

About the UK Met Office-Microsoft project, he noted, “They’re saying that this will be the world’s largest [web-based] climate modeling supercomputer, and this will be deployed a year from now, in the summer of 2022. Its floating-point performance will be 60 petaflops distributed among an organization of four quadrants, each 15 petaflops. There’ll be one and a half million CPUs of the AMD Epyc type, and eventually, I don’t know the year, there will be a midlife kicker, giving it a performance increase by a factor of three. So this will have a long life, indeed a life of about 10 years. What I find extraordinary is that this is a commitment of about one and a half billion dollars over a 10-year period. This is very serious, very significant dedication to a single domain of application.”

Here are a few of his slides on the coming systems.



China is the Dragon in the room

“Okay, so I talked about big machines. And there’s obviously one really big hole, and, you know, maybe what we should say is that’s the big dragon in the room. It’s China, of course. China has deployed more than one Top500 machine over the last decade. And over their evolution of machines they’ve taken a strong, organized and, frankly, I’d call it a disciplined approach. In fact, it’s been a three-pronged strategy that they have moved forward. These include the National University of Defense Technology, the National Research Center of Parallel Computer Engineering and Technology (NRCPC), and third, Sugon, which those old gray beards, such as myself, remember as Dawning,” said Sterling.

“All three of these different organizations are pursuing and following different approaches and I don’t know who’s in the lead or when their next big machine will hit the floor, but recently some hints have been exposed for one of them. And this is the NRCPC Sunway custom architecture. Now, you’ll remember the Sunway TaihuLight. Well, I didn’t know this, but in fact, TaihuLight was designed all along to be truly scalable. It was delivering something over 100 petaflops when it was deployed and led the list of HPC systems then, and their intent is to bring that up to exascale. Now I use the term exascale as opposed to exaflops for the same reasons I did before. Their peak floating-point performance will be four exaflops for single precision, and one exaflops for double precision. That’s peak performance. It’s anticipated that their Linpack Rmax will be around 700 petaflops.

“You know, the Sunway architecture is really interesting, because of its use of an enormous number of very lightweight processing elements organized in conjunction with a main processing element to handle sequential execution. The expectation is that, as opposed to 28 nanometers for TaihuLight, this will be 14 nanometers; SMIC, the semiconductor fabrication company, will provide this at just under one and a half gigahertz, which is about the same clock rate as TaihuLight. Why? Well, of course, to try to keep the power down. In doing this, they will have eight core groups**, as opposed to the four core groups you see in the lower black and white schematic (slide below); they will double the size of the words or multi-word lines from 256 bits to 512 bits; and they will increase the total size of the machine from somewhere around 40,000 nodes to 80,000 nodes. I don’t know when. But we can certainly wish our friends in China the best of luck as they push the edge of the envelope,” he said.

QUICK HITS – MPI Still Strong; In Praise of STEM

“Within the next small number of months, exactly when I don’t know, MPI 4.0 will be released with a number of improvements that have been carefully considered, examined and debated, including such things as, but not limited to, persistent collective operations for significant improvements in efficiency, and improvements in error handling. A number of others, as you can see, are either going to be in or are going to be considered for later extensions in 4.1. And if you thought that was it, now, there will be an MPI 5.0. The committee is open for new ideas. I don’t know how long this is going to go. But MPI 4.0 [is] coming to an internet place near you,” said Sterling.

Sterling gave nods to various efforts to support HPC students and STEM efforts generally. He noted the establishment of the new Moscow State University branch at the Russian National Physics and Mathematics Center, near Nizhny Novgorod. “I’ve been there, a lovely small city. The MSU Sarov branch is intended to frankly attract the best scientists and students and faculty. No, I haven’t gotten my invitation letter yet and it (MSU) will be directed by our good friend and respected colleague, Vladimir Voevodin shown here,” he said.

Sterling had praise for the Texas Advanced Computing Center, which helped South Africa by bringing its student cluster team over to Austin for training, and “really giving them sort of a turbocharged experience in this area. Dan Stanzione (TACC director) shown here (slide below) also managed to make possible the repurposing of one of their earlier machines and giving it a second life at CHPC in South Africa.”

He concluded with kudos for the STEM-Trek organization led by Elizabeth Leake:

“The final person here is one who frankly, we really need to acknowledge and that is Elizabeth Leake. Now many of you know Elizabeth, she is part of our community and always with a friendly smile. But she is much more than that. She is the founder of STEM-Trek, a nonprofit organization that is intended to – and let me read this – support scholarly travel, mentoring and advanced skills training [for] STEM scholars and students from underrepresented demographics in the use of 21st century cyberinfrastructure. I can’t read to you the long list of accomplishments, but through STEM-Trek, students are encouraged and engaged in high performance computing. She has singularly managed to acquire travel grants for students who otherwise, frankly, would never get to see conferences like ISC. You see a picture of her with students I met a couple of years ago. Elizabeth deserves very high praise for all of her contributions.”

NOTES

*  Fugaku’s Top500 Rmax is 442 petaflops and Rpeak is 537 petaflops.

** One observer noted in the ISC chat window during the keynote that Sunway would have six not eight core groups.
