SC21: Larry Smarr on The Rise of Supernetwork Data Intensive Computing

By John Russell

November 26, 2021

Larry Smarr, founding director of Calit2 (now Distinguished Professor Emeritus at the University of California San Diego) and the first director of NCSA, is one of the seminal figures in the U.S. supercomputing community. What began as a personal drive, shared by others, to spur the creation of supercomputers in the U.S. for scientific use, later expanded into a drive to link those supercomputers with high-speed optical networks, and blossomed into the notion of building a distributed, high-performance computing infrastructure – replete with compute, storage and management capabilities – available broadly to the science community.

Larry Smarr

At SC21 last week, Smarr delivered a racing reprise of this (ongoing) HPC tour de force in his talk, The Rise of Supernetwork Data Intensive Computing. Presented here are a sampling of his comments (lightly edited) and quite a few of his slides. Throughout the talk, Smarr readily acknowledged this was hardly a solo effort and liberally shared credit and praise with many colleagues working now and in the past to build out this computational connective tissue for science. Perhaps appropriately, given his championing of networking, Smarr gave his talk remotely at this year’s hybrid SC21 event.

Because he packed in so much material (break out your acronym dictionaries) and because we can hardly cover it all here, it’s worth reading the description from the SC21 program, which does a good job of shoehorning most of the ideas presented into one place. If you know Smarr, you’ll know that he touched on every topic mentioned in the summary:

“Over the last 35 years, a fundamental architectural transformation in high-performance data-intensive computing has occurred, driven by the rise of optical fiber Supernetworks connecting the globe. Ironically, this cyberinfrastructure revolution has been led by supercomputer centers, which then became SuperNodes in this distributed system. I will review key moments, including the birth of the NSF Supercomputer Centers and NSFnet, the gigabit testbeds, the NSF PACI program, the emergence of Internet2 and the Regional Optical Networks, all eventually enabling, through a series of NSF grants, the National and Global Research Platforms.

“Over this same period, a similar cyberinfrastructure architecture allowed the commercial clouds to develop, which are now interconnected with this academic distributed system. Critical to this transformation has been the continual exponential rise of data and a new generation of distributed applications utilizing this connected digital fabric. Throughout this period, the role of the US Federal Government has been essential, anchored by the 1991 High-Performance Computing Act, which established the Networking and Information Technology Research and Development (NITRD) Program. Particularly important to the initiation of this distributed computing paradigm shift was the continued visionary leadership of Representative, then Senator, then Vice President Al Gore in the 1990s.”

Although Smarr retired from the Calit2 directorship in 2020, he remains an active participant and researcher in the HPC community (brief bio at the end of the article). Skipping over Smarr’s storied role in the creation of NSF-funded supercomputers and as director at NCSA: the idea of using fiber optics for computer networking was already being discussed in the ‘80s and got its first push from government at a 1985 federal hearing on Federal Supercomputer Programs and Policies attended by newly elected Senator Al Gore and Larry Smarr, then NCSA director.

“During that talk, Gore came up with this idea of [an] interstate highway for information using fiber optics connecting the major computational centers in the country. That was very inspirational to me, because I was still just trying to get a supercomputer alone. I hadn’t really pushed on the networking side. But I knew we were experimenting at the University of Illinois with fiber optics with the goal of going to a gigabit per second.  This committee hearing was about the strategic role of supercomputing for the United States and the rise of Japanese-built supercomputers. But at the same time, they were very aware that fiber optics were coming and that was going to change everything,” said Smarr.

In 1989, “We asked Senator Gore if he would record a video to introduce a program we did live at SIGGRAPH, using giant AT&T dishes to transmit satellite audio/video between NCSA in Urbana, Illinois, and the SIGGRAPH show floor in Boston. We did full telepresence, which, of course, I’m doing now [at SC21] and you think nothing of it. Back then it took a pair of giant satellite communication dishes. You just didn’t do it on your PC on the Internet. But Al already was talking about how fiber optics would replace this,” said Smarr.

“What we were doing was coupling one of the very first Sun workstations on the Boston stage and interactively running a Cray-2 simulation [at NCSA in Illinois], and then describing it as we were doing it. The visualization was in real time.

“What Senator Gore said in his introduction is that your team is demonstrating a prototype of the way it’s going to be done in the future when we have high-speed fiber optic connections between computers. And that was 1989!” he said.

By 1991, Senator Gore had sponsored and won passage of the High-Performance Computing Act, which created the Networking and Information Technology Research and Development (NITRD) program.

Smarr described the early NITRD period as “the golden years,” when “DOE, DARPA, NSF all worked together.” “This is when the NSF Supercomputer Centers introduced all of these alternate architectures, such as the Thinking Machines CM-5 and the Intel supercomputers that were massively parallel. It was when the whole world of highly parallel supercomputing took off,” said Smarr. “But the interesting thing is it’s also where networking took off, because the NSFNET, which had been set up as a 56-kilobit link between the five NSF centers and NCAR in ’86, by ’91 had grown to T1. This enabled the famous Donna Cox and Bob Patterson visualization of the Merit Network data for the month of September in ’91, which became an icon for the NSFNET,” said Smarr.

“So even though we had just done the SIGGRAPH satellite hookup in ’89, by ’91 we were able to do these remote collaboration demos for real. I want to acknowledge that the annual SC conferences have always been critical to our progress, because we used them to drive things forward. For instance, SC ’95 was in San Diego, with Sid Karin as the overall SC ’95 chair while I served as program chair. A group of us, including Tom DeFanti, Ian Foster and Rick Stevens, organized the I-WAY project to run at the conference. Linda Winkler was absolutely critical for making this all work. I-WAY featured the first interconnection of the telcos’ 155-megabit/s networks. Researchers across the country carried out 65 different science projects on this distributed system using what we called in those days ‘high-speed networking.’ More importantly, the I-Soft software system that Ian Foster developed to control that distributed cyberinfrastructure led directly to the Globus project that is still used everywhere,” said Smarr.

“Bob Kahn, at the same time, had gone from DARPA to setting up the CNRI (Corporation for National Research Initiatives), which started the Gigabit Testbeds, and we were involved with Blanca [testbed] at NCSA. The Testbeds were set up to see if we could use fiber optics to get gigabit-per-second flows. The answer was that getting ATM and other networking protocols to work over the fiber optics went fine. The trouble was the endpoints. When you feed that much data per second into a 1990s computer, it basically couldn’t handle it. As the final report of the Gigabit Testbeds said, ‘the Host I/O turned out to be the Achilles heel.’ So, the problem with using fiber optics at even just a gigabit per second wasn’t with the fiber optics at all. It was the endpoints,” said Smarr.

“That has stuck with me for a long time. As you’ll see later, the whole point of what we did to create the Pacific Research Platform was to solve that bottleneck. Then NSF in ’97 decided to move from supporting the five independent supercomputer centers to two – NCSA and SDSC – with a whole bunch of partners connected by the new NSF vBNS as the backbone, creating the NSF Partnerships for Advanced Computational Infrastructure (PACI). For all intents and purposes this is what a national research platform looks like today! But back then it was completely experimental; it was hooking together virtual reality and supercomputer centers and end users. We called it a ‘Prototype for America’s 21st Century Information Infrastructure’ and amazingly that’s what it turned out to be!” said Smarr.

“I remember sitting on the couch with Miron Livny in my office when we made the decision to name this the National Technology Grid. The NSF-funded vBNS directly led to the formation of Internet2, which is now the dominant network connecting all the universities in the country,” said Smarr.

Smarr credits the building of Roadrunner, a COTS-based Linux “supercluster” at the University of New Mexico, and placing it on the National Technology Grid as a watershed moment. He noted that David Bader, who led its development, received the 2021 Sidney Fernbach Award at SC21, in part for his work on Roadrunner. “This was really the golden spike, in my view, on creating the distributed systems we use today.”

Another critical development was the deployment of I-Wire in Illinois; at the same time, Indiana set up I-Light, both of them dark fiber networks within their states. “These really were very inspirational; they were copied widely,” said Smarr. “When I got to California a year later [~2000],” continued Smarr, who had transitioned from the University of Illinois to UCSD, “I found that CENIC (Corporation for Education Network Initiatives in California) was still buying bandwidth from Verizon and AT&T and did not have a dark fiber system. I joined the nonprofit board of CENIC, and within a few years, we started buying our own fiber. Today CENIC has 8,000 miles of owned and managed fiber. So, this growth over the last 20 years of these regional optical networks is a critical part of what lets us build today’s distributed systems, such as the Pacific Research Platform.”

“We put together a team to create a proposal to the NSF Information Technology Research (ITR) program, and we were able to win one called the OptIPuter. The real issue at that time was that we had a lot of clusters – Beowulf clusters and others, like those David Bader had developed – but their backplanes were much faster than the wide area network. So compute clusters were basically digital islands. The OptIPuter team had the idea that because of the fiber optics, we would be able to make the wide area bandwidth equal to the cluster bandwidth and remove the locality issue,” recalled Smarr.

Working with funds from the OptIPuter grant, the team was able to build a prototype and interconnect it with the NSF TeraGrid over the then-new National LambdaRail. This happened just as 4K video streaming was emerging. “We had the first 4K projector in the United States at Calit2, which I founded in 2000,” said Smarr.

“As all this developed, DOE, and particularly ESnet, were able to codify the notion of what it meant to have this kind of high-speed networking for scientific computing and instrument data on a campus. This led to their Science DMZ concept in 2010, in which you needed a Data Transfer Node (DTN) to terminate the high-speed optical fiber – as we had learned from the gigabit testbeds – with network performance monitoring carried out by perfSONAR,” said Smarr.

“NSF heard about this, and it created a second great example of NSF adopting from DOE (the first being setting up the NSF Supercomputer Centers). NSF created the campus cyberinfrastructure (CC*) program that Kevin Thompson has been running since 2012; [it] has made over 340 awards, funding these optical-fiber-connected Science DMZ cyberinfrastructure systems on campuses across the country,” said Smarr.

The next obvious step, said Smarr, was to take all the campus Science DMZ deployments that NSF had funded and hook them together. By 2015 CENIC had [its own] elaborate fiber optic system that connected all the California campuses and supercomputer centers (NASA, DOE and NSF) together. “We put a proposal together six years ago, called the Pacific Research Platform [PRP], to see if we could create a regional Science DMZ,” he said.

“We’ve created a PRP website over the last few years that has everything you need to get started on the PRP. It includes all the technical details of how to use Jupyter notebooks, how to containerize, and so forth. I recommend you go there if you’re interested,” said Smarr.

To solve the host I/O bottleneck that the Gigabit Testbeds discovered, “We’ve developed these DTNs we call Flash I/O Network Appliances, or FIONAs. Flash made all the difference. It solved the host I/O data bottleneck, the Achilles heel that Bob Kahn talked about with the Gigabit Testbeds, but now at 10, 40, or 100 times the speed that the gigabit testbeds had. They are just PCs with terabytes of flash memory and multicore CPUs. But then you can put in eight gaming 32-bit GPUs and turn it into a machine learning platform. So, these FIONAs are the endpoints that the fiber optics terminate in, and they’re on your campus. Because they’re commodity hardware, if you just get the specs, you can order them yourself,” said Smarr.

Of course, the science community was not the only one vigorously pursuing distributed computing. Commercial cloud vendors also have racks of PC systems tied together with fiber optics and faced similar challenges with regard to managing user access and software execution.

“Google developed Kubernetes, and then made it open source, as a way to orchestrate containers across a global set of optically fiber-connected PCs. Now if you’ve containerized your application, you have this global digital fabric on which Kubernetes can move your things around, and you can run Ceph as the storage system under Rook, which runs under Kubernetes. That means we can manage petabytes of distributed storage and GPUs for data science use,” said Smarr, crediting work by John Graham to accomplish this.
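
To make the orchestration Smarr describes a bit more concrete, here is a minimal sketch of how a containerized, GPU-accelerated job might be submitted to such a Kubernetes fabric using the open-source Kubernetes Python client. The image name, pod name, and namespace below are hypothetical placeholders, not details of the actual PRP deployment; Kubernetes then places the pod on any connected node with a free GPU.

```python
# Minimal sketch: submit a containerized, GPU-accelerated job to a Kubernetes
# cluster of the kind Smarr describes. Image name, pod name, and namespace are
# hypothetical placeholders; requires the official `kubernetes` Python client.
from kubernetes import client, config

config.load_kube_config()  # read the user's local kubeconfig credentials

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="ml-training-demo", namespace="default"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="example.org/containerized-ml-app:latest",  # hypothetical image
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}  # ask the scheduler for one GPU
                ),
            )
        ],
    ),
)

# The scheduler picks a node with an available GPU anywhere on the fabric.
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
print("Pod submitted; Kubernetes will place it on a node with a free GPU.")
```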

The PRP kept expanding beyond the West Coast and across the country. “Now we have 184 of these [FIONAs] located on over 25 campuses, and they’re all interconnected at 10 to 100 gigabits per second. Together the PRP FIONAs contain nearly 7,500 CPU cores and over 500 32-bit GPUs, as well as four petabytes of rotating storage, all running over CENIC, the Quilt regional networks, and Internet2. The extended PRP runs all the way from Korea and out in Guam, to Hawaii, across the U.S., to Amsterdam in Europe,” said Smarr.

There were speedbumps along the way – this is science after all – and Smarr recalled that early efforts to use the FIONAs widely had problems. “When we got started in 2016 we tried to do 10-gigabyte file transfers all-to-all between the FIONAs and then measure it. However, you can see back in 2016, the orange [on this slide] means we couldn’t even get the thing to read; we got basically no throughput, it was broken. But by a year and a half later, you can see the green, indicating that we had accomplished disk-to-disk file transfer at five gigabits per second, and this was pretty much uniform. We also added traceroute to determine the state of all the network segments between the PRP FIONAs. These measurements have by now all been containerized.”
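
For readers curious what such an all-to-all measurement looks like in practice, here is a minimal, hypothetical sketch (not the PRP’s actual containerized measurement code) that sweeps host pairs with iperf3 and records the achieved throughput. The hostnames are placeholders, and it assumes an iperf3 server is already listening on each target host.

```python
# Minimal sketch of an all-to-all network throughput sweep, in the spirit of the
# PRP's FIONA-to-FIONA measurements. Hostnames are hypothetical placeholders;
# assumes `iperf3 -s` is already running on each target host.
import itertools
import json
import subprocess

hosts = ["fiona-a.example.edu", "fiona-b.example.edu", "fiona-c.example.edu"]

for src, dst in itertools.permutations(hosts, 2):
    # This sketch runs the client locally toward `dst`; a real sweep would
    # launch the client on `src` itself (e.g., via ssh or a job on that node).
    result = subprocess.run(
        ["iperf3", "-c", dst, "-t", "10", "-J"],  # 10-second test, JSON output
        capture_output=True, text=True
    )
    if result.returncode != 0:
        print(f"{src} -> {dst}: FAILED ({result.stderr.strip()})")
        continue
    report = json.loads(result.stdout)
    gbps = report["end"]["sum_received"]["bits_per_second"] / 1e9
    print(f"{src} -> {dst}: {gbps:.2f} Gbit/s")
```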

There was a great deal more to Smarr’s talk. Maybe SC21 will make a video of it widely available. That said, it’s important to remember how transformational this work has been in enabling the pursuit of science using powerful distributed computing infrastructure. Smarr cited a few examples. Here’s one.

“We built this [research platform] to enhance multi-campus, data-intensive applications. We brought in a bunch of application teams from particle physics, telescope surveys, atmospheric sciences, biomedical, and virtual reality as drivers for this,” said Smarr. “Just to give you an example, this is a global precipitation map from NASA’s MERRA satellite archives, which send data to a center for hydrometeorology in Irvine to see all the developing precipitation systems,” he said.

“As you’ve been reading, the atmospheric rivers just wiped out a lot of the Northwest US and Western Canada. You need to be able to predict those extreme precipitation events, and you want to be able to do it in real time,” he said. “So, there’s a center for Western weather in San Diego that looks at all of this. We wanted to link the NASA satellite archives to the Irvine center and then to the San Diego center. Well, the workflow was taking 19 days to do one round, and that’s not predictive. But by using the PRP distributed workflow with its machine learning, and optimizing that on the FIONAs, we were able to get it from 19 days down to 52 minutes, which means it goes from being retrospective to predictive.”

So where are we going in the future? Smarr asked.

“The first thing is we’ve just gotten two awards from NSF that are going to allow us to increase participation by a number of Minority Serving Institutions, by building FIONAs on those campuses, as well as bringing in more EPSCoR states – the states least funded by NSF – such as South Carolina and Nebraska, and to increase the spread of the PRP across the country,” he said.

Secondly, he said, “The NSF XSEDE program, which funds the supercomputer centers, has funded a distributed version of SDSC called Expanse that uses the same kind of composable systems [and] Ceph storage as the PRP and as the Open Science Grid. We have already federated the containerized applications on the PRP, Expanse, and the Open Science Grid. Expanse is a five-year proposal.”

“Third, the NSF has just funded Frank Wuerthwein, a PRP co-PI, to build the NRP, a prototype of the National Research Platform. This is going to bring in a lot of emerging computational systems like FPGAs – as well as 64-bit GPUs like those used on the supercomputers – and more distributed data systems, building on the work he’s done as executive director of the Open Science Grid.”

“I think you’re going to see this type of cyberinfrastructure become a permanent part of the research landscape – a completely distributed high-performance computing, networking, storage and analysis facility,” said Smarr.

“As I said, even though NSF is funding it in our country, we already have seen that it’s being interconnected to Europe and the Pacific Rim as well. In closing, I wanted to say that this talk is not meant as a scholarly historical analysis of the last 40 years. Rather, it’s my personal recollection of key moments that I was personally involved in.”

Brief Smarr Bio

Dr. Larry Smarr is a University of California San Diego Distinguished Professor Emeritus in UCSD’s Department of Computer Science and Engineering. For the last 20 years he served as the founding Director of the California Institute for Telecommunications and Information Technology (Calit2). He was earlier the Founding Director of the National Center for Supercomputing Applications (NCSA). In 2006, he received the IEEE Computer Society Tsutomu Kanai Award for his lifetime achievements in distributed computing. Smarr continues to provide national leadership in advanced cyberinfrastructure (CI), currently serving as Principal Investigator on three NSF CI research grants: the NSF Pacific Research Platform, Cognitive Hardware and Software Ecosystem Community Infrastructure, and Toward the National Research Platform. He is a member of the National Academy of Engineering and a Fellow of the American Physical Society and has served on top level advisory committees to NIH, NASA, NSF, and DOE.
