The Texas Advanced Computing Center (TACC) today announced Stampede3, a powerful new Dell Technologies and Intel-based supercomputer that will enable groundbreaking open science research projects in the U.S. while leveraging the nation’s previous high performance computing investments.
For over a decade, the Stampede systems — Stampede (2012) and Stampede2 (2017) — have been flagships in the U.S. National Science Foundation’s (NSF) ACCESS scientific supercomputing ecosystem. The Stampede systems continue to provide a vital capability for researchers in every field of science.
Made possible by a $10 million award for new computer hardware from the NSF, Stampede3 will be the newest strategic resource for the nation’s open science community when it enters full production in early 2024. It will enable thousands of researchers nationwide to investigate questions that require advanced computing power.
Stampede3 will deliver:
- A new 4 petaflop capability for high-end simulation: 560 new Intel® Xeon® CPU Max Series processors with high-bandwidth memory (HBM)-enabled nodes, adding nearly 63,000 cores for the largest, most performance-intensive compute jobs.
- A new graphics processing unit/AI subsystem including 10 Dell PowerEdge XE9640 servers adding 40 new Intel® Data Center GPU Max Series processors (Ponte Vecchio) for AI/ML and other GPU-friendly applications.
- Reintegration of 224 3rd Gen Intel Xeon Scalable processor nodes (added to Stampede2 in 2021) for higher-memory applications.
- Legacy hardware to support throughput computing — more than 1,000 existing Stampede2 2nd Gen Intel Xeon Scalable processor nodes will be incorporated into the new system to support high-throughput computing, interactive workloads, and other smaller workloads.
- New Omni-Path Fabric 400 Gb/s technology offering highly scalable performance through a network interconnect with 24 TB/s backplane bandwidth, enabling low latency, excellent application scalability, and high connectivity to the I/O subsystem.
- 1,858 compute nodes with more than 140,000 cores, more than 330 terabytes of RAM, 13 petabytes of new storage, and almost 10 petaflops of peak capability.
In addition, Stampede3 will be the only system in the NSF ACCESS environment to integrate the new Intel Max Series GPUs.
Recently, HPCwire had an opportunity to talk with Dan Stanzione, executive director of TACC, about the new Stampede3 system.
HPCwire: The Stampede system has been around for quite a while. Is the plan for a complete replacement or to reuse some existing hardware for Stampede3?
Dan Stanzione: Historically, we have had 11 straight years of Stampede, and Stampede2 is now in its seventh year. We hit six years about two months ago, and it’s getting pretty long in the tooth. NSF had a program for new ACCESS production resources, and we pitched a system smaller in scale than many new systems: Stampede3 will combine new hardware with the best parts of the current Stampede2 cluster. We ended up doing Stampede2 in various phases as well. Phase 1 started in 2016, ended up going into 2017, and consisted of the Knights Landing Xeon Phi systems, which were two thirds of that machine.
And then a little later we added a bunch of Skylake nodes to it for the other third, which was always the plan. They were supposed to be a year apart, but they ended up only being a few months apart (one was a little late and the other was a little early with Intel hardware), and it ran great for years. But when NSF asked us to do one of the extensions, because originally it was a four-year machine and was going to shut down in 2021 (mid-June 2021 was supposed to be the original shutdown date), I said, look, if we are going to extend it, we need something new.
HPCwire: What was the general plan?
Stanzione: We took out the original set of 500 or so Knights Landing systems that we had installed at the end of the Stampede1 project (they were even older than the other hardware in that system) and replaced them with Ice Lake around the beginning of 2022, or late 2021. We now have a few hundred Ice Lake nodes in there, and we still have a bunch of Skylake nodes in there as well. We have about 3,800 nodes going offline this weekend (mid-July 2023), so we have already extended them more than two years beyond their original lifespan. The Ice Lakes are new, and there is no point in tossing them out. Given that we have them in-house already and that we have a smaller budget for Stampede3, what could we really leverage? Stampede has always been an Intel-based system for us, and for the ACCESS audience we address with Stampede, we wanted to keep the software environment and the user model consistent.
HPCwire: The new systems will have HBM memory. How do you plan on using it?
Stanzione: You know, what we’re always starved for on these CPU systems is memory bandwidth, and we are putting in 560 new Sapphire Rapids nodes. It’s about four petaflops worth of Sapphire Rapids, and we are doing HBM only.
HPCwire: Why HBM only?
Stanzione: These are the top SKU: 56 cores per socket, so 112 cores per node. To really leverage that part, in our opinion, for HPC you don’t put DIMMs in, because that slows the HBM down. The systems have 128GB of pure HBM, and the thinking is, if you need more per core and you can’t scale out to more cores to reduce your memory footprint, then that’s what we still have those Ice Lake nodes for. They have 256GB of memory and fewer cores, so they offer about 3GB per core.
In pure HBM mode there is still a little over 1GB per core, but if we put the DIMMs in, you pay for the memory twice: you pay for all the HBM and then you pay for the added memory. Adding another 24 DIMMs to those nodes increases the cost by about four grand. And then, when you hit standard memory by accident, things run slower.
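The per-core figures quoted here can be sanity-checked with quick arithmetic, using the node configurations given in the interview:

```python
# Sanity check of the memory-per-core figures quoted in the interview,
# using the node configurations as described.
node_types = {
    # name: (memory in GB, cores per node)
    "Sapphire Rapids (HBM only)": (128, 112),
    "Ice Lake": (256, 80),
}
for name, (mem_gb, cores) in node_types.items():
    print(f"{name}: {mem_gb / cores:.2f} GB/core")
# Sapphire Rapids (HBM only): 1.14 GB/core ("a little over 1GB per core")
# Ice Lake: 3.20 GB/core ("about 3GB per core")

# 560 Sapphire Rapids nodes at 112 cores each is also the source of the
# "nearly 63,000 cores" cited for the new HBM partition.
print(560 * 112)  # 62720
```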
With this arrangement, we guarantee that every application runs only out of HBM. The best way to enforce that is to not put in any other RAM. In some cases we are seeing as much as a factor of two or better versus a regular Sapphire Rapids with just standard DIMMs; we did some side-by-side comparisons. When you look back at the Cascade Lakes on Frontera or the Skylakes in Stampede2, we are seeing 5X the per-socket bandwidth, from both the Sapphire Rapids generation and the HBM improvement. There are a few codes where performance is not that much different because they are not memory-bandwidth bound, but on average I think we are seeing a 60 to 70% improvement for our most common codes. Keep in mind, one of the big advantages GPUs have had is a lot more memory bandwidth per flop than a CPU has. So the core of what we’re doing is a big Sapphire Rapids system with very fast memory.
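Side-by-side memory-bandwidth comparisons of this kind are commonly made with a STREAM-style kernel. The following is a minimal illustrative sketch in Python/NumPy, not TACC's actual benchmark; real measurements use the compiled STREAM suite with threads pinned across all cores:

```python
import time
import numpy as np

# Minimal STREAM-triad-style sketch: estimate sustained memory bandwidth
# by timing a = b + scalar * c over arrays much larger than cache.
# Illustrative only; NumPy's temporaries and single-threaded execution
# make this a rough lower bound, not a rigorous measurement.
N = 20_000_000                  # 3 arrays x 8 bytes x 20M = ~480 MB total
b = np.random.rand(N)
c = np.random.rand(N)
scalar = 3.0

best = float("inf")
for _ in range(5):              # take the best of several trials
    t0 = time.perf_counter()
    a = b + scalar * c          # the STREAM "triad" kernel
    best = min(best, time.perf_counter() - t0)

# Triad touches three arrays of 8-byte doubles (read b, read c, write a).
gb_moved = 3 * N * 8 / 1e9
print(f"approx. {gb_moved / best:.1f} GB/s")
```

Running the same kernel on a DDR-only node and an HBM node, normalized per socket, is the shape of the comparison Stanzione describes.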
HPCwire: Are you including any other hardware?
Stanzione: Again, we are adding 560 nodes of Sapphire Rapids with HBM, and along with that we are going to put in a small system with exploratory capability using Intel Ponte Vecchio. We are still negotiating exactly how much of that we will have, but I would say a minimum of 40 nodes and a maximum of a hundred or so. We have our Lonestar systems with a few hundred Nvidia GPUs, and that’s where we’re doing most of our AI work. But we want to see if we can move some of that workload onto Intel. We haven’t really exposed that to the user community, and we don’t know what the uptake is going to be. So, we’re just putting a couple of racks of Ponte Vecchio out there to see how people work with it. It is coming up on Aurora right now; if we get good adoption, we hope to add some more. But at this point, we need to get the user software base to kick the tires on this and figure out the software frameworks and all that kind of stuff for it.
On the smaller Ponte Vecchio system, there will be four-way nodes, that is, four GPUs per node, again with Sapphire Rapids. And then what we’re going to do is repurpose a fair amount of what’s left over from Stampede2, starting with the Ice Lake nodes: we’ll have about 224 nodes with 256GB and 80 cores per node.
Those nodes will handle the larger-memory workloads. And then, to continue the broad mission of Stampede2 with Stampede3, we see a ton of single-node Python and Matlab throughput jobs that don’t really care about turnaround time that much. Actually, we don’t have maintenance on our Skylake nodes anymore, so we’re going to keep a reserve and let some fail. We promised to keep 1,000 Skylake nodes going as a throughput system, providing another 48,000 cores of Skylake. So altogether Stampede3 is about 1,858 nodes with 140,000 cores.
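The partition counts described in the interview roughly tally with the headline node figure (the GPU partition is still being negotiated at 40 to 100 nodes, per the interview):

```python
# Rough tally of the partitions described against the headline
# "about 1,858 nodes" figure. The GPU node count is still being
# negotiated (40 to 100 per the interview), so we bracket it.
sapphire_rapids = 560
ice_lake = 224
skylake = 1000
gpu_low, gpu_high = 40, 100

low = sapphire_rapids + ice_lake + skylake + gpu_low
high = sapphire_rapids + ice_lake + skylake + gpu_high
print(low, high)  # 1824 1884, bracketing the ~1,858 total
```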
HPCWire: What about storage?
Stanzione: We are going to completely replace the storage system and share it with another new system we’re putting in this year. We have decided to go with VAST as our storage and do all-solid-state scratch and home volumes. We have had successful exploratory work with VAST on Frontera, and we’re actually going to tie it into the Frontera fabric because we want to try it at a bigger scale and see how it does.
We have been really impressed by the scalability of VAST so far. And of course, the disk on Stampede2 is the thing that is showing its age the most, because those disks on scratch get beat up, and it’s been over six years of day-in, day-out use. Lustre has been a great file system, but the disk drive failures are going way up as the hardware gets older. So, as you might imagine, a new file system is in order.
HPCwire: And what about the interconnect?
Stanzione: When we did Stampede2, it was all Intel Omnipath, so it’s an all-Omnipath system at the moment. For Stampede3, we’re going to reuse a lot of that fabric because we have a ton of 100G Omnipath. All those director-class switches are getting old, however, so we’re getting new top-of-rack switches, but it is now a smaller fabric than Stampede2’s 6,000 nodes. For the Ice Lake and Skylake legacy nodes, and initially for the Sapphire Rapids, we will do 100G Omnipath. Then, when it comes out next year, in 2024, we’ll add 400G Omnipath to the Sapphire Rapids nodes and build a non-blocking network for that side of Stampede3.
HPCwire: How will you do the transition to this hardware?
Stanzione: That will come as a second phase, because we really want to do this without a real break in service from Stampede2. We are hoping to get the nodes in by Supercomputing (SC23), but we certainly want to be in production by the first quarter of next year (2024). To make that first quarter, we didn’t want to wait on the 400G Omnipath, so we decided to bring it up at 100G and then update it when the 400G parts come out. And again, it’s just those 560 nodes and the GPU nodes that will get the 400G network; the other 1,300 nodes we don’t have to re-cable.
So, in general: an updated core fabric, a new 400G fabric for the Sapphire Rapids, a new VAST storage system, new Ponte Vecchio systems, and reuse of about half of the Skylakes, with the rest kept for spare parts so we can keep them running longer. We’ll have a few hundred spares. We’ll keep the Ice Lakes for high-memory applications, since those are only two years old to begin with.
It should be an almost transparent migration for the users, because anything you have already built ought to still work. I sometimes forget how old the system is. We are using CentOS-7 now on Stampede2, though it might have been something older when the system started in 2017. The plan is to update to Rocky-9.
HPCwire: That was my next question. What distribution do you plan to use given the recent changes with Red Hat?
Stanzione: There are some things I’m not willing to say yet on that, but we went with Rocky-8 on Lonestar, our AMD-based system, last year, and we are going to do Rocky-9 for now. There are some conversations happening, but the plan of record today is Rocky-9.
HPCwire: So, the transition/shutdown is basically in progress?
Stanzione: So, again, we are hoping for the first quarter of 2024, a little sooner if we get lucky with hardware deliveries. We are shutting down parts of Stampede2 on the 15th of July. Actually, we’re shutting down submission and letting everything drain down, so really the last days are more like the 17th or 18th of July. We are going to keep the remaining Skylake and Ice Lake partitions running at full speed until mid-October, when we start doing the switchover to the new file system. We’re going to bring up the file system first and let users migrate the data they want from scratch on the old one to scratch on the new one; we’ll take care of home directories. And then we’ll slowly reduce the number of Skylakes and Ice Lakes available as we bring them up on the new machine. So the idea is that there is always a Stampede running at some scale for the users; the plan is to never cut users’ access. To help with this, we stopped taking allocations in the spring, so that we have fewer people actually allocated on the machine for this fall/winter when the big transition happens.