TACC Looks to ‘Horizon’ System for Its Leadership-Class Computing Facility

By Oliver Peckham

April 14, 2022

During a talk for the Ken Kennedy Institute’s 2022 Energy High Performance Computing Conference, Dan Stanzione, executive director of the Texas Advanced Computing Center (TACC), gave a status update on TACC’s forthcoming Leadership Class Computing Facility (LCCF)—a massive NSF-funded expansion of its supercomputing capabilities that will launch with a new flagship supercomputer, “Horizon.”

“We have a lot of computers; they’re kinda large,” Stanzione explained, highlighting systems like Frontera (23.5 Linpack petaflops) and Stampede2 (10.6 Linpack petaflops) that rank highly on the Top500 and support thousands of project teams. “Collectively, we have about 20,000 machines, we have over a million cores, we have a thousand GPUs, we deliver about seven billion core hours a year to our user community.” And TACC is adding more all the time: Stanzione said that the graphics had been added to the new Lonestar-6 system just the day before the talk.

The new art on Lonestar-6.

All of this computing firepower supports projects across the NSF’s domains. “Generally, if it’s unclassified and it’s open science and it’s at an academic institution we probably support it at this point,” Stanzione said. And the demand for more remains: “[At] the beginning of the center, we had had requests for about five times the computation we produced. We have about 80,000 times that much computing available now… and we get about five times the requests for the amount of time that we can produce.”

“So apparently,” he reasoned, “demand for computing is invariant and not in any way dependent on the size of the computer you buy. So obviously, you should buy a bigger one.”

Taking a leadership(-class) role

Enter the LCCF and Horizon. Back in mid-2018, TACC won an NSF award to create Frontera (which, Stanzione said, is “technically phase one of the LCCF”), laying the groundwork for a longer-term leadership computing strategy. By the following year, word had gotten out that TACC was planning to follow up Frontera with a computer 10× as powerful around 2024. In 2020, TACC started talking more about these plans, calling the facility the Leadership-Class Computing Facility, shifting its target slightly to 2025 and releasing concept art for an expanded datacenter.

A concept of the LCCF. Image courtesy of TACC.

The core pitch for the LCCF, Stanzione said, is “about coming up with a more sustainable way to invest in computing than one-off system competitions every four years.” Several times, Stanzione made reference to the National Center for Atmospheric Research (NCAR), an NSF-funded computing center focused on weather and climate research. TACC, he said, wants “to be an anchor of the NSF computing environment in the way that NCAR is for atmospheric research.”

Stanzione explained that the plan for the LCCF is to “actually start construction—fitting out a datacenter— …two years from today” (March 1, 2024), pending budgetary approval from Congress. Then, they’ll aim to deploy the LCCF’s flagship system in the first half of 2025, with science users on the system by the back half of 2025. After that: ten years of support for the center, from 2026 to 2036, with “maybe an upgrade in there,” Stanzione said.

“We’ve never had more than four years of funding on an NSF system before as the initial commitment,” he stressed, citing NSF rules around the duration and renewal of its grants. “The reason I say we wanted to build an NCAR-like facility is because it’s been sitting there funded [by the NSF] in Colorado since 1957, so… perhaps my real challenge is figuring out, personally, how I was going to get around that limit that technically exists.”

The flagship system will, Stanzione said, be named Horizon, “if DOE doesn’t steal that name and I don’t have to change it again when they put out a system.” (Regarding the LCCF itself: “I’m gonna give it a better name and a better logo before it’s over, but that comes later in the funding cycle.”)

The current, apparently ephemeral logo for the LCCF.

On the Horizon

“You probably all wanted me to say what machine [we are] picking,” Stanzione said, “and I’m not going to, because honestly, I haven’t made a final decision, because the best way to be wrong is to tell people four years in advance what your computer’s gonna look like and what it’s gonna do. … Two years before we deploy, ask me what it’s going to be.”

That said, Stanzione did go into some detail on the process of setting goals and attaining support for Horizon.

“[Frontera] is getting old very quickly, so we need a follow-on, and the only written instructions are ‘10× faster!’” he said. “And we’ve had I can’t tell you how many hours of conversations over what 10× actually means in that context—is that application throughput? Is it peak application performance? Does it have anything to do with flops?” (“I’ve argued no,” he said.)

He then broke down where they expect to get the 10× from.

“10× in that time frame is challenging for a couple reasons,” he said. “One: if we look at the Frontera baseline, if we just did nothing but rely on vendor performance improvement … at best, over this five-year timeframe, maybe we’ll get 3× out of that,” mostly from increases in memory bandwidth. “But three is not 10.”

Then, he said: “Buy more — that one almost always works … So we’re gonna double the budget over what Frontera was, and I got away with that, so.” That brings the total to 6×, leaving a further factor of roughly 1.67× (10 divided by 6, or a speedup of about two-thirds) to reach the 10× target. That remainder, Stanzione said, will be accomplished with improvements in software algorithms and methods.
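The arithmetic behind that breakdown can be sketched in a few lines of Python. This is only an illustration: it assumes the gains compose multiplicatively (as the step from 3× to 6× implies), and the constant and function names are hypothetical labels, not TACC terminology.

```python
# Illustrative sketch of the 10x breakdown described in the talk.
# Assumption: the individual gains multiply together.

TARGET_SPEEDUP = 10.0      # the "10x faster" goal relative to Frontera
VENDOR_IMPROVEMENT = 3.0   # best-case gain from vendor hardware over ~five years
BUDGET_MULTIPLIER = 2.0    # roughly double Frontera's budget


def remaining_software_factor(target, *hardware_factors):
    """Return the speedup still required after the given factors are multiplied together."""
    combined = 1.0
    for factor in hardware_factors:
        combined *= factor
    return target / combined


hardware_gain = VENDOR_IMPROVEMENT * BUDGET_MULTIPLIER
software_gain = remaining_software_factor(
    TARGET_SPEEDUP, VENDOR_IMPROVEMENT, BUDGET_MULTIPLIER
)

print(f"Hardware and budget: {hardware_gain:.0f}x")
print(f"Still needed from software: {software_gain:.2f}x")
```

Under these assumptions, hardware and budget together account for 6×, and software would need to supply the remaining factor of about 1.67×.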

Bringing things up to code

To that end, last spring, TACC asked the research community to submit problems that they considered central to future research and representative of problem designs in the supercomputing space. TACC received 140 proposals, whittled those down to 30, then selected 21 to go forward for deep examination, looking to achieve significant speedups “and see where we can get.” The projects, which are funded to the tune of around $300,000 each, span mathematics, physical sciences, engineering, geosciences, life sciences and even the social sciences.

“If all the codes were well-engineered and really good, I’d worry a lot about our ability to make improvements and hit the numbers,” Stanzione said, continuing: “All codes are not really good, well-clustered, well-engineered codes. Most of them are what we’d call ‘software.’”

This code research, he added, would also help TACC sell the benefits of the LCCF and Horizon to decision-makers. “We know if we build a bigger machine, cool stuff’s gonna happen,” he said. “But I can’t sell hundreds of millions of dollars of investment on ‘cool stuff is gonna happen.’”

Now, back to the hardware

On the hardware front, Stanzione showed a laundry list of vendors under evaluation. “We’ve done a huge number of on-site evaluations, we’ve done others with partners,” he said. “We’ve looked at a wide variety of processor technologies … we’ve looked at various and sundry Arms, we’ve looked at NextSilicon … we’ve looked at a bunch of networking options … we’ve looked at a lot of the other, more exotic things with partner sites.”

A non-exhaustive list of hardware under evaluation by TACC for Horizon

  • Processors: AMD; Fujitsu Arm; Intel; NEC; NextSilicon; Nvidia
  • Networking: Cornelis; Nvidia; Rockport
  • Filesystems: BeeGFS; DAOS; VAST
  • Node disaggregation: GigaIO; Liqid
  • AI/quantum: Quantum (via Stanford); Graphcore (via Argonne National Laboratory); Cerebras (also via Argonne); SambaNova (also via Argonne)

(“Argonne buys all these [exotic] chips,” Stanzione said, “so I just call Rick Stevens [associate laboratory director at Argonne] and say ‘how’s it going?’ rather than trying to emulate what they’re doing on that.”)

Stanzione also said that they plan to add about 10 percent capacity to the core system to ensure that there is room for smaller testbeds and research projects when the main system is occupied. “We will define the size of the system,” he said, “and then I will lie about it, because we’ll buy more than that. … We’re gonna have this piece of the system that does the 10× piece and then we’re gonna have a bunch of additional racks to deal with all these other use cases that don’t count towards the capabilities.”

Confounding these efforts, Stanzione said, an “unnamed program officer” had asked him: “what is the peak flops of the entire system if you include all those pieces?”

“That’s the question you weren’t supposed to ever ask!” Stanzione exclaimed. “And now I have to make up an answer, because no one knows what the peak flops in 2025 are of any of these processors, let alone the achievable ones. So I wrote an answer and sent it back.”

The hardware will be supported by an additional 15MW of power capacity, making TACC a 25MW facility. And, regarding that “maybe an upgrade in there” from earlier: “The idea is basically we’d buy a second system halfway through that [2026-2036] life,” Stanzione said, “so it’s two five-year lifespans.” He added that they would prefer to avoid a radical change in architecture between the two systems.

For now, though, Stanzione and TACC are embroiled in what he said was the “terrifying amount of project management” required to begin building the new facility in two years.
