SC21: Larry Smarr on The Rise of Supernetwork Data Intensive Computing

By John Russell

November 26, 2021

Larry Smarr, founding director of Calit2 (now Distinguished Professor Emeritus at the University of California San Diego) and the first director of NCSA, is one of the seminal figures in the U.S. supercomputing community. What began as a personal drive, shared by others, to spur the creation of supercomputers in the U.S. for scientific use, later expanded into a drive to link those supercomputers with high-speed optical networks, and blossomed into the notion of building a distributed, high-performance computing infrastructure – replete with compute, storage and management capabilities – available broadly to the science community.


At SC21 last week, Smarr delivered a racing reprise of this (ongoing) HPC tour de force in his talk, The Rise of Supernetwork Data Intensive Computing. Presented here are a sampling of his comments (lightly edited) and quite a few of his slides. Throughout the talk, Smarr readily acknowledged this was hardly a solo effort and liberally shared credit and praise with many colleagues working now and in the past to build out this computational connective tissue for science. Perhaps appropriately, given his championing of networking, Smarr gave his talk remotely at this year’s hybrid SC21 event.

Because he packed in so much material (break out your acronym dictionaries) and because we can hardly cover it all here, it’s worth reading the description from the SC21 program, which does a good job of shoehorning most of the ideas presented into one place. If you know Smarr, you’ll know that he literally touched on every topic mentioned in the summary:

“Over the last 35 years, a fundamental architectural transformation in high-performance data-intensive computing has occurred, driven by the rise of optical fiber Supernetworks connecting the globe. Ironically, this cyberinfrastructure revolution has been led by supercomputer centers, which then became SuperNodes in this distributed system. I will review key moments, including the birth of the NSF Supercomputer Centers and NSFnet, the gigabit testbeds, the NSF PACI program, the emergence of Internet2 and the Regional Optical Networks, all eventually enabling, through a series of NSF grants, the National and Global Research Platforms.

“Over this same period, a similar cyberinfrastructure architecture allowed the commercial clouds to develop, which are now interconnected with this academic distributed system. Critical to this transformation has been the continual exponential rise of data and a new generation of distributed applications utilizing this connected digital fabric. Throughout this period, the role of the US Federal Government has been essential, anchored by the 1991 High-Performance Computing Act, which established the Networking and Information Technology Research and Development (NITRD) Program. Particularly important to the initiation of this distributed computing paradigm shift was the continued visionary leadership of Representative, then Senator, then Vice President Al Gore in the 1990s.”

Although Smarr retired from the Calit2 directorship in 2020, he remains an active participant and researcher in the HPC community (brief bio at the end of the article). Skipping over Smarr’s storied role in the creation of the NSF-funded supercomputer centers and his tenure as NCSA director: the idea of using fiber optics for computer networking was already being discussed in the ’80s, and it got its first push from government at a 1985 federal hearing on Federal Supercomputer Programs and Policies attended by newly elected Senator Al Gore and Smarr, then NCSA director.

“During that talk, Gore came up with this idea of [an] interstate highway for information using fiber optics connecting the major computational centers in the country. That was very inspirational to me, because I was still just trying to get a supercomputer alone. I hadn’t really pushed on the networking side. But I knew we were experimenting at the University of Illinois with fiber optics with the goal of going to a gigabit per second.  This committee hearing was about the strategic role of supercomputing for the United States and the rise of Japanese-built supercomputers. But at the same time, they were very aware that fiber optics were coming and that was going to change everything,” said Smarr.

In 1989, “We asked Senator Gore if he would record a video to introduce a program we did live at SIGGRAPH, using giant AT&T dishes to transmit satellite audio/video between NCSA in Urbana, Illinois, and the SIGGRAPH show floor in Boston. We did full telepresence, which, of course, I’m doing now [at SC21] and you think nothing of it. Back then it took a pair of giant satellite communication dishes. You just didn’t do it on your PC on the Internet. But Al already was talking about how fiber optics would replace this,” said Smarr.

“What we were doing was coupling one of the very first Sun workstations on the Boston stage and interactively running a Cray-2 simulation [at NCSA in Illinois], and then describing it as we were doing it. The visualization was in real time.

“What Senator Gore said in his introduction is that your team is demonstrating a prototype of the way it’s going to be done in the future, when we have high-speed fiber optic connections between computers. And that was 1989!” he said.

By 1991, Senator Gore had sponsored and won passage of the High-Performance Computing Act, which established the Networking and Information Technology Research and Development (NITRD) program.

Smarr described the early NITRD period as “the golden years,” when “DOE, DARPA, NSF all worked together.” “This is when the NSF Supercomputer Centers introduced all of these alternate architectures, such as the Thinking Machines CM-5 and the Intel supercomputers that were massively parallel. It was when the whole world of highly parallel supercomputing took off,” said Smarr. “But the interesting thing is it’s also where networking took off, because the NSFNET, which had been set up as a 56-kilobit link between the five NSF centers and NCAR in ’86, had grown to T1 by ’91. This enabled the famous Donna Cox and Bob Patterson visualization of the Merit Network data for the month of September in ’91, which became an icon for the NSFNET,” said Smarr.

“So even though we had just done the SIGGRAPH satellite hookup in ’89, by ’91 we were able to do these remote collaboration demos for real. I want to acknowledge that the annual SC conferences have always been critical to our progress, because we used them to drive things forward. For instance, SC ’95 was in San Diego, with Sid Karin as the overall conference chair while I served as program chair. A group of us, including Tom DeFanti, Ian Foster and Rick Stevens, organized the I-WAY project to run at the conference. Linda Winkler was absolutely critical for making this all work. I-WAY featured the first interconnection of the telcos’ 155-megabit/s networks. Researchers across the country carried out 65 different science projects on this distributed system using what we called in those days ‘high-speed networking.’ More importantly, the I-Soft software system that Ian Foster developed to control that distributed cyberinfrastructure led directly to the Globus project that is still used everywhere,” said Smarr.

“Bob Kahn, at the same time, had gone from DARPA to setting up CNRI (the Corporation for National Research Initiatives), which started the Gigabit Testbeds; we were involved with the Blanca testbed at NCSA. The Testbeds were set up to see if we could use fiber optics to get gigabit-per-second flows. The answer was that getting ATM and other networking protocols to work over the fiber optics went fine. The trouble was the endpoints. When you feed that much data per second into a 1990s computer, it basically couldn’t handle it. As the final report of the Gigabit Testbeds said, ‘the Host I/O turned out to be the Achilles heel.’ So, the problem with using fiber optics at even just a gigabit per second wasn’t with the fiber optics at all. It was the endpoints,” said Smarr.

“That has stuck with me for a long time. As you’ll see later, the whole point of what we did to create the Pacific Research Platform was to solve that bottleneck. Then NSF in ’97 decided to move from supporting the five independent supercomputer centers to two – NCSA and SDSC – with a whole bunch of partners connected by the new NSF vBNS as the backbone, creating the NSF Partnerships for Advanced Computational Infrastructure (PACI). For all intents and purposes, this is what a national research platform looks like today! But back then it was completely experimental; it was hooking together virtual reality and supercomputer centers and end users. We called it a ‘Prototype for America’s 21st Century Information Infrastructure’ and amazingly that’s what it turned out to be!” said Smarr.

“I remember sitting on the couch with Miron Livny in my office when we made the decision to name this the National Technology Grid. The NSF-funded vBNS directly led to the formation of Internet2, which is now the dominant network connecting all the universities in the country,” said Smarr.

Smarr credits the building of Roadrunner, a COTS-based Linux “supercluster” at the University of New Mexico, and placing it on the National Technology Grid as a watershed moment. He noted that David Bader, who led its development, received the 2021 Sidney Fernbach Award at SC21, in part for his work on Roadrunner. “This was really the golden spike, in my view, on creating the distributed systems we use today.”

Another critical development was the deployment of I-Wire in Illinois and, at about the same time, I-Light in Indiana, both dark fiber networks within their states. “These really were very inspirational; they were copied widely,” said Smarr. “When I got to California [~2000], a year later,” said Smarr, who had moved from the University of Illinois to UCSD, “I found that CENIC (the Corporation for Education Network Initiatives in California) was still buying bandwidth from Verizon and AT&T and did not have a dark fiber system. I joined the nonprofit board of CENIC, and within a few years, we started buying our own fiber. Today CENIC has 8,000 miles of owned and managed fiber. So, this growth over the last 20 years of these regional optical networks is a critical part of what lets us build today’s distributed systems, such as the Pacific Research Platform.”

“We put together a team to create a proposal to NSF’s Information Technology Research (ITR) program, and we were able to win one called the OptIPuter. This was the real question at that time: we had a lot of clusters – Beowulf clusters and others, like the ones David Bader had developed – but their backplanes were much faster than the wide area network. So compute clusters were basically digital islands. The OptIPuter team had the idea that, because of the fiber optics, we would be able to make the wide-area bandwidth equal to the cluster bandwidth and remove the locality issue,” recalled Smarr.

Working with funds from the OptIPuter grant, Smarr’s team was able to build a prototype and interconnect it with the NSF TeraGrid over what was then the National LambdaRail. This happened as 4K video streaming was just emerging. “We had the first 4K projector in the United States at Calit2, which I founded in 2000,” said Smarr.

“As all this developed, DOE, and particularly ESnet, was able to codify the notion of what it meant to have this kind of high-speed networking for scientific computing and instrument data on a campus. This led to their Science DMZ concept in 2010, in which you needed a Data Transfer Node (DTN) to terminate the high-speed optical fiber – as we had learned from the gigabit testbeds – with network performance monitoring carried out by perfSONAR,” said Smarr.

“NSF heard about this, and that created a second great example of NSF adopting an idea from DOE (the first being setting up the NSF Supercomputer Centers). NSF created the Campus Cyberinfrastructure (CC*) program that Kevin Thompson has been running since 2012; [it] has made over 340 awards, funding these optical-fiber-connected Science DMZ cyberinfrastructure systems on campuses across the country,” said Smarr.

The next obvious step, said Smarr, was to take all the campus Science DMZ deployments that NSF had funded and hook them together. By 2015, CENIC had [its own] elaborate fiber optic system that connected all the California campuses and supercomputer centers (NASA, DOE and NSF) together. “We put a proposal together six years ago called the Pacific Research Platform [PRP] to see if we could create a regional Science DMZ,” he said.

“We’ve created a PRP website over the last few years that has everything you need to get started on the PRP. It includes all the technical details of how to use Jupyter notebooks, how to containerize, and so forth. I recommend you go there if you’re interested,” said Smarr.

To solve the host I/O bottleneck that the Gigabit Testbeds discovered, “we’ve developed these DTNs we call Flash I/O Network Appliances, or FIONAs. Flash made all the difference. It solved the host I/O data bottleneck – the Achilles heel that Bob Kahn talked about with the gigabit testbeds – but now at 10, 40 or 100 times the speed that the gigabit testbeds had. They are just PCs with terabytes of flash memory and multicore CPUs. But then you can put in eight gaming 32-bit GPUs and turn it into a machine learning platform. So, these FIONAs are the endpoints that the fiber optics terminate in, and they’re on your campus. Because they’re commodity, if you just get the specs, you can order them yourself,” said Smarr.
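To see why flash at the endpoint matters, here is a back-of-the-envelope sketch (not from the talk; the per-drive bandwidth figures are rough assumptions) of how much storage bandwidth a FIONA-style DTN must sustain to keep pace with its network link.

```python
import math

# Rough check of the host I/O a FIONA-style DTN must absorb. The per-drive
# sequential bandwidths below are generic assumptions, not FIONA specs.
LINK_GBIT_PER_S = 100                         # 10/40/100 Gb/s links in the PRP
link_bytes_per_s = LINK_GBIT_PER_S * 1e9 / 8  # 12.5 GB/s that storage must sink

HDD_BYTES_PER_S = 0.2e9   # ~200 MB/s sequential, typical spinning disk (assumed)
NVME_BYTES_PER_S = 3.0e9  # ~3 GB/s sequential, typical NVMe flash drive (assumed)

def drives_needed(per_drive_bytes_per_s: float) -> int:
    """Minimum drives to keep up with the link, ignoring protocol overheads."""
    return math.ceil(link_bytes_per_s / per_drive_bytes_per_s)

print(f"Link delivers {link_bytes_per_s / 1e9:.1f} GB/s")
print(f"Spinning disks needed: {drives_needed(HDD_BYTES_PER_S)}")     # ~63
print(f"NVMe flash drives needed: {drives_needed(NVME_BYTES_PER_S)}")  # ~5
```

Under these assumed numbers, a 100 Gb/s link would swamp dozens of spinning disks but only a handful of flash drives, which is the bottleneck the FIONA design addresses.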

Of course, the science community was not the only one vigorously pursuing distributed computing. Commercial cloud vendors also have racks of PC systems tied together with fiber optics and faced similar challenges with regard to managing user access and software execution.

“Google developed Kubernetes, and then made it open source, as a way to orchestrate containers across a global set of optical-fiber-connected PCs. Now if you’ve containerized your application, you have this very global digital fabric on which Kubernetes can move your things around, and you can run Ceph as the storage system under Rook, which runs under Kubernetes. That means we can manage petabytes of distributed storage and GPUs for data science use,” said Smarr, crediting work by John Graham to accomplish this.
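For readers unfamiliar with what “Kubernetes can move your things around” looks like in practice, here is a minimal, hypothetical sketch of submitting a containerized GPU job through the official Kubernetes Python client; the image name, namespace, and resource numbers are placeholders, not details of the PRP’s actual configuration.

```python
# Minimal sketch: submit a containerized GPU job to a Kubernetes cluster with
# the official Python client (pip install kubernetes). Image, namespace, and
# resource requests are illustrative placeholders, not PRP settings.
from kubernetes import client, config

config.load_kube_config()  # read cluster credentials from ~/.kube/config

container = client.V1Container(
    name="demo-training",
    image="example.org/my-ml-app:latest",  # hypothetical containerized app
    command=["python", "train.py"],
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "1", "memory": "16Gi"}  # ask for one GPU
    ),
)

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="demo-training"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        )
    ),
)

# The scheduler places the pod on any node in the (possibly wide-area) cluster
# that has a free GPU; the container image travels with the job.
client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```

In the PRP’s case, as Smarr described it, the “cluster” spans FIONAs on many campuses, with Rook and Ceph layered on top of Kubernetes to provide the distributed storage underneath.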

The PRP kept expanding beyond the West Coast and across the country. “Now we have 184 of these [FIONAs] located on over 25 campuses, and they’re all interconnected at 10 to 100 gigabits per second. Together the PRP FIONAs contain nearly 7,500 CPU cores and over 500 32-bit GPUs, as well as four petabytes of rotating storage, all running over CENIC, the Quilt regional networks, and Internet2. The extended PRP runs all the way from Korea and Guam, to Hawaii, across the U.S., and on to Amsterdam in Europe,” said Smarr.

There were speedbumps along the way – this is science after all – and Smarr recalled that early efforts to use the FIONAs widely had problems. “When we got started in 2016, we tried to do 10-gigabyte file transfers between every pair of FIONAs and then measure it. However, you can see back in 2016, the orange [on this slide] means we couldn’t even get the thing to read; we got basically no throughput, it was broken. But by a year and a half later, you can see the green, indicating that we had accomplished disk-to-disk file transfer at five gigabits per second, and this was pretty much uniform. We also added traceroute to determine the state of all the network segments between the PRP FIONAs. These measurements have by now all been containerized.”
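As a rough illustration of the measurement mesh Smarr described, the sketch below builds an all-to-all throughput matrix; the endpoint names and the simulated transfer time are placeholders (the real PRP measurements run as containerized tests, with traceroute alongside).

```python
import itertools

# Toy sketch of the all-to-all disk-to-disk throughput matrix Smarr described:
# a 10 GB transfer between every ordered pair of FIONAs, reported in Gb/s.
# Endpoint names and the transfer routine are placeholders, not PRP hosts.
FIONAS = ["fiona-a.example.edu", "fiona-b.example.edu", "fiona-c.example.edu"]
TRANSFER_BYTES = 10e9  # 10 GB test payload

def transfer(src: str, dst: str) -> float:
    """Placeholder for a real disk-to-disk copy; returns elapsed seconds.
    Simulated at 16 s, which corresponds to the ~5 Gb/s rate Smarr quoted."""
    return 16.0

def throughput_matrix() -> dict:
    results = {}
    for src, dst in itertools.permutations(FIONAS, 2):
        try:
            gbps = TRANSFER_BYTES * 8 / transfer(src, dst) / 1e9
        except Exception:
            gbps = 0.0  # an 'orange' cell on Smarr's slide: the transfer failed
        results[(src, dst)] = round(gbps, 1)
    return results

print(throughput_matrix())  # every pair should read ~5.0 Gb/s when healthy
```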

There was a great deal more to Smarr’s talk; maybe SC21 will make a video of it widely available. That said, it’s important to remember how transformational this infrastructure is for enabling the pursuit of science using powerful distributed computing. Smarr cited a few examples. Here’s one.

“We built this [research platform] to enhance multi-campus, data-intensive applications. We brought in a bunch of application teams from particle physics, telescope surveys, atmospheric sciences, biomedicine, and virtual reality as drivers for this,” said Smarr. “Just to give you an example, this is a global precipitation map built from the NASA MERRA satellite archives, which send data to a hydrometeorology center in Irvine to see all the developing precipitation systems,” said Smarr.

“As you’ve been reading, the atmospheric rivers just wiped out a lot of the Northwest US and Western Canada. You need to be able to predict those extreme precipitation events, and you want to be able to do it in real time,” he said. “So, there’s a center for Western weather in San Diego that looks at all of this. We wanted to link the NASA satellite archives to the Irvine center and then to the San Diego center. Well, the workflow was taking 19 days to do one round, and that’s not predictive. But by using the PRP distributed workflow with its machine learning, and optimizing that on the FIONAs, we were able to get it from 19 days down to 52 minutes, which means it goes from being retrospective to predictive.”
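For scale, a quick calculation on the numbers Smarr quoted shows the size of that improvement:

```python
# Speedup implied by the precipitation-workflow numbers Smarr quoted.
baseline_minutes = 19 * 24 * 60   # 19 days = 27,360 minutes
optimized_minutes = 52
print(f"Speedup: {baseline_minutes / optimized_minutes:.0f}x")  # ~526x
```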

So where are we going in the future? Smarr asked.

“The first thing is we’ve just gotten two awards from NSF that are going to allow us to increase participation by a number of Minority Serving Institutions by building FIONAs on those campuses, as well as bringing in more EPSCoR states – the states least funded by NSF – in South Carolina and up in Nebraska, and to increase the spread of the PRP across the country,” he said.

Secondly, he said, “The NSF XSEDE program, which funds the supercomputer centers, has funded a distributed version of SDSC called Expanse that uses the same kind of composable systems [and] Ceph storage as the PRP and the Open Science Grid. We have already federated the containerized applications on the PRP, Expanse, and the Open Science Grid. Expanse is a five-year proposal.”

“Third, NSF has just funded Frank Wuerthwein, a PRP co-PI, to build the NRP, a prototype of the National Research Platform. This is going to bring in a lot of emerging computational systems like FPGAs – as well as 64-bit GPUs like those used in the supercomputers – and more distributed data systems, building on the work he’s done as executive director of the Open Science Grid.”

“I think you’re going to see this type of cyberinfrastructure become a permanent part of the research landscape – a completely distributed high-performance computing, networking, storage and analysis facility,” said Smarr.

“As I said, even though NSF is funding it in our country, we already have seen that it’s being interconnected to Europe and the Pacific Rim as well. In closing, I wanted to say that this talk is not meant as a scholarly historical analysis of the last 40 years. Rather, it’s my personal recollection of key moments that I was personally involved in.”

Brief Smarr Bio

Dr. Larry Smarr is a University of California San Diego Distinguished Professor Emeritus in UCSD’s Department of Computer Science and Engineering. For the last 20 years he served as the founding Director of the California Institute for Telecommunications and Information Technology (Calit2). He was earlier the Founding Director of the National Center for Supercomputing Applications (NCSA). In 2006, he received the IEEE Computer Society Tsutomu Kanai Award for his lifetime achievements in distributed computing. Smarr continues to provide national leadership in advanced cyberinfrastructure (CI), currently serving as Principal Investigator on three NSF CI research grants: the NSF Pacific Research Platform, Cognitive Hardware and Software Ecosystem Community Infrastructure, and Toward the National Research Platform. He is a member of the National Academy of Engineering and a Fellow of the American Physical Society and has served on top level advisory committees to NIH, NASA, NSF, and DOE.
