In a somewhat shortened version of his annual ISC keynote surveying the HPC landscape, Thomas Sterling lauded the community’s effort in bringing HPC to bear in the fight against the pandemic, welcomed the start of the exascale – if not yet exaflops – era with a quick tour of some big machines, speculated a little on what China may be planning, and paid tribute to new and ongoing efforts to bring fresh talent into HPC.
Sterling is a longtime HPC leader, professor at Indiana University, and one of the co-developers of the Beowulf cluster. Let’s jump in (with apologies for any garbling of quotes).
The pandemic affected everything.
“It has been a tragedy. There have been more than 200 million COVID cases worldwide, and almost 4 million deaths. And frankly, those numbers are probably conservative, and the actual numbers are much greater. We may never know. In the U.S., shockingly, more than half a million people, 600,000 people, have been killed by this virulent disease. And we’ve experienced over 34 million cases just in the U.S. alone, and our case rate per 1 million of the population is greater than 10 percent,” he said.
“One of the things that came out of this is an appreciation for what has been called urgent computing, the ability for high performance computing in general and the resources, both in terms of facility and talent, to be rapidly brought to bear on a problem, even a problem as challenging as that of COVID-19. Over the year across the international community, very quickly, HPC resources were freed up and made available to scientists. In addition, expert assistance and code development optimization were added to the scientific community to minimize the time of deployment of their code and their application to drug discovery, to exploration, and to analysis of new possible candidates for cures. In this sense, the high-performance computing community can be proud of the job [done] yet humbled by its own limitations in attacking this problem.”
Fugaku is an impressive machine
“Much of this slide I have used before. The core design is Arm done by Fujitsu and added to that is the use of significant vector extensions that have demonstrated, in their view, that a homogeneous machine can compete with accelerator machines, and that future designs will be more varied than a singular formula. Is the jury done and the verdict in [on this]? No. As rapid changes are taking place we’ll still see this constructive tension among those. But what we are finding is that the broader range of applications not just in high performance computing, per se, but in AI, in machine learning and in big data and analytics, all of these can be done on machines that are intended for extreme scale,” said Sterling.
“Now, I said extreme scale. Fugaku is not an exaflops Rmax machine, but it comes close. It’s somewhere around 700*. I apologize to our friend, Satoshi Matsuoka, who is standing there in front of his machine. But in the area of lower precision, for intelligence computing, it is indeed an exascale machine. So we are now in an era of exascale if not yet classic exaflops.”
The age of big machines
This era of exascale and exaflops is rapidly dawning around the globe, and Sterling briefly reviewed several systems that are now rolling out or soon will be. Importantly, he emphasized, the line between AI and HPC is blurring fast, and that fusion is greatly influencing HPC computer architecture.
About Frontier, which is expected to be the first U.S. exascale system stood up, he said:
“The Frontier machine has been announced as going to be the U.S.’s first exaflops and by exaflops, I mean an Rmax supercomputing somewhere around – we don’t have the measurements, of course – but the estimates are about one and a half exaflops Rmax. This will be operated in the Oak Ridge National Laboratory or the Oak Ridge Leadership Computing Facility in Tennessee, where the current Summit machine is, and this will be deployed towards the end of this year or the very beginning of the next year. It is being integrated by a Cray division of Hewlett Packard Enterprise and incorporates AMD chips, providing substantial performance and energy efficiency, although it’s predicted that the power consumption will be on the order of 30 megawatts but in a footprint [that’s] somewhat modest of just over 100 racks. The cost is $600 million. That’s a lot of money. [I’m] looking forward to this machine being operated and the science and the data analytics that can be performed with it.”
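As a rough back-of-the-envelope reading of the Frontier figures in the quote above, the short Python sketch below derives the implied energy efficiency, per-rack throughput and cost per petaflops. All inputs are the estimates Sterling cites, not measurements, and the rack count is rounded down to 100 as an assumption for illustration.

```python
# Rough arithmetic on the Frontier estimates quoted above (not measured values).
RMAX_EXAFLOPS = 1.5     # estimated Linpack Rmax
POWER_MEGAWATTS = 30.0  # predicted power draw
RACKS = 100             # "just over 100 racks", rounded down (assumption)
COST_USD = 600e6        # quoted cost


def gflops_per_watt(rmax_exaflops, power_megawatts):
    """Implied energy efficiency in gigaflops per watt."""
    return (rmax_exaflops * 1e18) / (power_megawatts * 1e6) / 1e9


def petaflops_per_rack(rmax_exaflops, racks):
    """Implied Rmax per rack, in petaflops."""
    return rmax_exaflops * 1000.0 / racks


def usd_per_petaflops(cost_usd, rmax_exaflops):
    """Implied acquisition cost per petaflops of Rmax."""
    return cost_usd / (rmax_exaflops * 1000.0)


frontier_efficiency = gflops_per_watt(RMAX_EXAFLOPS, POWER_MEGAWATTS)
frontier_per_rack = petaflops_per_rack(RMAX_EXAFLOPS, RACKS)
frontier_cost_per_pf = usd_per_petaflops(COST_USD, RMAX_EXAFLOPS)

print(f"{frontier_efficiency:.0f} GFLOPS/W")
print(f"{frontier_per_rack:.0f} petaflops per rack")
print(f"${frontier_cost_per_pf:,.0f} per petaflops")
```

Taken at face value, the quoted estimates work out to roughly 50 gigaflops per watt, about 15 petaflops per rack, and on the order of $400,000 per petaflops of Rmax.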
Sterling gave a brief tour of several of the forthcoming large systems, most of whose names are familiar to the HPC community. Despite being largely accelerator-based architectures, there is diversity among the approaches. He singled out the UK Met Office-Microsoft project to build the Met Office’s next system for weather forecasting in the cloud. That’s a first. He also looked at the EuroHPC Joint Undertaking’s LUMI project, which will be a roughly half-exaflops system.
“[The system] will be in Finland but there are 10 different countries that are involved in the consortium that together will share this machine. You have the list (on the slide below) of such countries starting with Finland and going down to Switzerland. There are multiple partitions for different purposes. So, I think that this is a slightly different way of organizing machines, where distinct countries will be managing different partitions and have different responsibilities,” said Sterling.
About the UK Met Office-Microsoft project, he noted, “They’re saying that this will be the world’s largest [web-based] climate modeling supercomputer, and this will be deployed a year from now, in the summer of 2022. Its floating-point performance will be 60 petaflops distributed among an organization of four quadrants, each 15 petaflops. There’ll be one and a half million CPUs of the AMD Epyc type, and eventually, I don’t know the year, there will be a midlife kicker, giving it a performance increase by a factor of three. So this will have a long life, indeed a life of about 10 years. What I find extraordinary is that this is a commitment of about one and a half billion dollars over a 10-year period. This is very serious, very significant dedication to a single domain of application.”
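The Met Office numbers in the quote above can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: it uses the figures as quoted and assumes that the "factor of three" midlife kicker means three times the initial total, which the quote does not spell out.

```python
# Arithmetic on the Met Office figures as quoted (illustrative only).
QUADRANTS = 4
PETAFLOPS_PER_QUADRANT = 15.0
MIDLIFE_FACTOR = 3.0   # assumption: read as 3x the initial total performance
BUDGET_USD = 1.5e9     # "about one and a half billion dollars"
LIFETIME_YEARS = 10


def total_petaflops(quadrants, petaflops_per_quadrant):
    """Aggregate performance across equally sized quadrants."""
    return quadrants * petaflops_per_quadrant


initial_pf = total_petaflops(QUADRANTS, PETAFLOPS_PER_QUADRANT)
upgraded_pf = initial_pf * MIDLIFE_FACTOR
annual_budget_usd = BUDGET_USD / LIFETIME_YEARS

print(f"Initial system: {initial_pf:.0f} petaflops")
print(f"After midlife upgrade: {upgraded_pf:.0f} petaflops")
print(f"Average spend: ${annual_budget_usd / 1e6:.0f} million per year")
```

On that reading, four 15-petaflops quadrants give the quoted 60 petaflops, the upgrade would take the system to about 180 petaflops, and the commitment averages roughly $150 million a year.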
Here are a few of his slides on the coming systems.
China is the dragon in the room
“Okay, so I talked about big machines. And there’s obviously one really big hole, and, you know, maybe what we should say is that’s the big dragon in the room. It’s China, of course. China has deployed over the last decade more than one Top500 machine. And over their evolution of machines they’ve taken a strong, organized and frankly, I’d call it a disciplined approach. In fact, it’s been a three-pronged strategy that they have moved forward. These include the National University of Defense Technology, the National Research Center of Parallel Computer Engineering and Technology (NRCPC), and third, Sugon, which for those old gray beards, such as myself, we remember as Dawning,” said Sterling.
“All three of these different organizations are pursuing and following different approaches and I don’t know who’s in the lead or when their next big machine will hit the floor, but recently there have been some hints that have been exposed for one of them. And this is the NRCPC Sunway custom architecture. Now, you’ll remember the Sunway TaihuLight. Well, I didn’t know this, but in fact, their plan all along with TaihuLight was designed to be scalable, truly scalable. It was delivering something over 100 petaflops when it was deployed and led the list of HPC systems there and their intent is to bring that up to exascale. Now I use the term exascale as opposed to exaflops for the same reasons I did before. Their peak performance will be floating point. Four exaflops for single precision, and one exaflops for double precision. That’s peak performance. It’s anticipated that their Linpack Rmax will be around 700 petaflops.
“You know, the Sunway architecture is really interesting, because of its use of an enormous number of very lightweight processing elements organized in conjunction with main processing elements to handle sequential execution. The expectation is that, as opposed to 28 nanometers for TaihuLight, this will be 14 nanometers, as SMIC, the semiconductor fabrication company, will provide this at just under one and a half gigahertz, which is about the same clock rate as TaihuLight. Why? Well, of course, to try to keep the power down. In doing this, they will have eight core groups**, as opposed to the four core groups you see in the lower black and white schematic (slide below), they will double the size of the words or multi-word lines from 256 bits to 512 bits. And they will increase the total size of the machine from somewhere around 40,000 nodes to 80,000 nodes. I don’t know when. But we can certainly wish our friends in China the best of luck as they push the edge of the envelope,” he said.
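The three doublings Sterling describes (nodes, core groups per chip, and vector width) at a roughly unchanged clock rate suggest a simple scaling estimate. The Python sketch below is an idealized calculation, not a description of the actual design: it assumes peak performance scales linearly with each factor, and it takes TaihuLight's peak as about 125 petaflops, an approximate Top500 figure that is not stated in the talk (the quote says only "something over 100 petaflops").

```python
# Idealized peak-scaling estimate for the next Sunway system (a sketch, not the real design).
TAIHULIGHT_PEAK_PF = 125.0  # assumption: approximate Top500 Rpeak, not quoted in the talk
QUOTED_RMAX_PF = 700.0      # anticipated Linpack Rmax from the quote


def peak_scaling(node_ratio, core_group_ratio, vector_width_ratio, clock_ratio=1.0):
    """Multiplicative peak speedup, assuming linear scaling in every factor."""
    return node_ratio * core_group_ratio * vector_width_ratio * clock_ratio


scale = peak_scaling(
    node_ratio=80_000 / 40_000,    # nodes: about 40,000 -> 80,000
    core_group_ratio=8 / 4,        # core groups per chip: 4 -> 8
    vector_width_ratio=512 / 256,  # vector width: 256 -> 512 bits
)
projected_peak_pf = TAIHULIGHT_PEAK_PF * scale
linpack_efficiency = QUOTED_RMAX_PF / projected_peak_pf

print(f"Scaling factor: {scale:.0f}x")
print(f"Projected double-precision peak: {projected_peak_pf / 1000:.1f} exaflops")
print(f"Implied Linpack efficiency: {linpack_efficiency:.0%}")
```

Under these assumptions the three doublings give an eightfold increase, which lands at about one exaflops of double-precision peak, in line with the figure in the quote, and the anticipated 700-petaflops Rmax would correspond to roughly 70 percent Linpack efficiency.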
QUICK HITS – MPI Still Strong; In Praise of STEM
“Within the next small number of months, exactly when I don’t know, MPI 4.0 will be released with a number of improvements that have been carefully considered, examined and debated, including such things as, but not limited to, persistent collective operations for significant improvements in efficiency, and improvements in error handling. A number of others, as you can see, are either going to be in or are going to be considered for later extensions to 4.1. And if you thought that was it, now, there will be an MPI 5.0. The committee is open for new ideas. I don’t know how long this is going to go. But MPI 4.0 coming to an internet place near you,” said Sterling.
Sterling gave nods to various efforts to support HPC students and STEM efforts generally. He noted the establishment of the new Moscow State University branch at the Russian National Physics and Mathematics Center, near Nizhny Novgorod. “I’ve been there, a lovely small city. The MSU Sarov branch is intended to frankly attract the best scientists and students and faculty. No, I haven’t gotten my invitation letter yet. And it (MSU) will be directed by our good friend and respected colleague, Vladimir Voevodin, shown here,” he said.
Sterling had praise for the Texas Advanced Computing Center, which helped South Africa train its student cluster team by bringing them over to Austin, and “really giving them sort of a turbocharged experience in this area. Dan Stanzione (TACC director) shown here (slide below) also managed to make possible the repurposing of one of their earlier machines and giving it a second life at CHPC in South Africa.”
He concluded with kudos for the STEM-Trek organization led by Elizabeth Leake:
“The final person here is one who frankly, we really need to acknowledge and that is Elizabeth Leake. Now many of you know Elizabeth, she is part of our community and always with a friendly smile. But she is much more than that. She is the founder of STEM-Trek, a nonprofit organization that is intended to – and let me read this – support scholarly travel, mentoring and advanced skills training for STEM scholars and students from underrepresented demographics in the use of 21st century cyberinfrastructure. I can’t read to you the long list of accomplishments, but through STEM-Trek, students are encouraged and engaged in high performance computing. She has singularly managed to acquire travel grants for students who otherwise, frankly, would never get to see conferences like ISC. You see a picture of her with students I met a couple of years ago. Elizabeth deserves very high praise for all of her contributions.”
* Fugaku’s Top500 Rmax is 442 petaflops and Rpeak is 537 petaflops.
** One observer noted in the ISC chat window during the keynote that Sunway would have six not eight core groups.