Larry Smarr, founding director of Calit2 (now Distinguished Professor Emeritus at the University of California San Diego) and the first director of NCSA, is one of the seminal figures in the U.S. supercomputing community. What began as a personal drive, shared by others, to spur the creation of supercomputers in the U.S. for scientific use, later expanded into a drive to link those supercomputers with high-speed optical networks, and blossomed into the notion of building a distributed, high-performance computing infrastructure – replete with compute, storage and management capabilities – available broadly to the science community.
At SC21 last week, Smarr delivered a racing reprise of this (ongoing) HPC tour de force in his talk, The Rise of Supernetwork Data Intensive Computing. Presented here is a sampling of his comments (lightly edited) and quite a few of his slides. Throughout the talk, Smarr readily acknowledged this was hardly a solo effort and liberally shared credit and praise with many colleagues working now and in the past to build out this computational connective tissue for science. Perhaps appropriately, given his championing of networking, Smarr gave his talk remotely at this year’s hybrid SC21 event.
Because he packed in so much material (break out your acronym dictionaries) and because we can hardly cover it all here, it’s worth reading the description from the SC21 program, which does a good job of shoehorning most of the ideas presented into one place. If you know Smarr, you’ll know that he literally touched on every topic mentioned in the summary:
“Over the last 35 years, a fundamental architectural transformation in high-performance data-intensive computing has occurred, driven by the rise of optical fiber Supernetworks connecting the globe. Ironically, this cyberinfrastructure revolution has been led by supercomputer centers, which then became SuperNodes in this distributed system. I will review key moments, including the birth of the NSF Supercomputer Centers and NSFnet, the gigabit testbeds, the NSF PACI program, the emergence of Internet2 and the Regional Optical Networks, all eventually enabling, through a series of NSF grants, the National and Global Research Platforms.
“Over this same period, a similar cyberinfrastructure architecture allowed the commercial clouds to develop, which are now interconnected with this academic distributed system. Critical to this transformation has been the continual exponential rise of data and a new generation of distributed applications utilizing this connected digital fabric. Throughout this period, the role of the US Federal Government has been essential, anchored by the 1991 High-Performance Computing Act, which established the Networking and Information Technology Research and Development (NITRD) Program. Particularly important to the initiation of this distributed computing paradigm shift was the continued visionary leadership of Representative, then Senator, then Vice President Al Gore in the 1990s.”
Although Smarr retired from the Calit2 directorship in 2020, he remains an active participant and researcher in the HPC community (brief bio at the end of the article). Skipping over Smarr’s storied role in the creation of NSF-funded supercomputers and as director at NCSA, the idea of using fiber optics for computer networking was being discussed in the ’80s and got its first push from government at a 1985 federal hearing on Federal Supercomputer Programs and Policies attended by newly elected Senator Al Gore and Larry Smarr, then NCSA director.
“During that talk, Gore came up with this idea of [an] interstate highway for information using fiber optics connecting the major computational centers in the country. That was very inspirational to me, because I was still just trying to get a supercomputer alone. I hadn’t really pushed on the networking side. But I knew we were experimenting at the University of Illinois with fiber optics with the goal of going to a gigabit per second. This committee hearing was about the strategic role of supercomputing for the United States and the rise of Japanese-built supercomputers. But at the same time, they were very aware that fiber optics were coming and that was going to change everything,” said Smarr.
In 1989, “We asked Senator Gore if he would record a video to introduce a program we did live at SIGGRAPH, using giant AT&T dishes to transmit satellite audio/video between NCSA in Urbana Illinois and the SIGGRAPH show floor in Boston. We did full telepresence, which, of course, I’m doing now [at SC21] and you think nothing of it. Back then it took a pair of giant satellite communication dishes. You just didn’t do it on your PC on the Internet. But Al already was talking about how fiber optics would replace this,” said Smarr.
“What we were doing was coupling one of the very first Sun workstations on the Boston stage and interactively running a Cray-2 simulation [at NCSA in Illinois], and then describing it as we were doing it. The visualization was in real time.
“What Senator Gore said in his introduction is your team is demonstrating a prototype of the way it’s going to be done in the future when we have high speed fiber optic connections between computers. And that was 1989!” he said.
By 1991, Senator Gore had sponsored and secured passage of the High-Performance Computing Act, which created the Networking and Information Technology Research and Development (NITRD) program.
Smarr described the early NITRD period as “the golden years”, when “DOE, DARPA, NSF all worked together.” “This is when the NSF Supercomputer Centers introduced all of these alternate architectures, such as Thinking Machines CM-5 and the Intel supercomputers that were massively parallel. It was when the whole world of highly parallel supercomputing took off,” said Smarr. “But the interesting thing is it’s also where networking took off, because the NSFNET which had been set up as a 56-kilobit link between the five NSF centers and NCAR in ’86, by ’91 had grown to T1. This enabled the famous Donna Cox and Bob Patterson visualization of the Merit Network data for the month of September in ’91, which became an icon for the NSFNET,” said Smarr.
“So even though we had just done the SIGGRAPH satellite hookup in ’89, by ’91 we were able to do these remote collaboration demos for real. I want to acknowledge that the annual SC conferences have always been critical to our progress, because we used them to drive things forward. For instance, SC ’95 was in San Diego, with Sid Karin as the overall SC95 Chair while I served as Program Chair. A group of us, including Tom DeFanti, Ian Foster and Rick Stevens, organized the I-WAY project to run at the conference. Linda Winkler was absolutely critical for making this all work. I-WAY featured the first interconnection of the telcos’ 155-megabit/s networks. Researchers across the country carried out 65 different science projects on this distributed system using what we called in those days ‘high-speed networking’. More importantly, the I-Soft software system that Ian Foster developed to control that distributed cyberinfrastructure led directly to the GLOBUS project that still is used everywhere,” said Smarr.
“Bob Kahn, at the same time, had gone from DARPA to setting up the CNRI (Corporation for National Research Initiatives), which started the Gigabit Testbeds, and we were involved with Blanca (testbed) at NCSA. The Testbeds were set up to see if we could use fiber optics to get gigabit per second flows. The answer was that getting ATM and other networking protocols to work over the fiber optics went fine. The trouble was the endpoints. When you fed that much data per second into 1990s computers, they basically couldn’t handle it. As the final report of the Gigabit Testbeds said, ‘the Host I/O turned out to be the Achilles heel’. So, the problem with using fiber optics at even just a gigabit per second wasn’t with the fiber optics at all. It was the endpoints,” said Smarr.
“That has stuck with me for a long time. As you’ll see later, the whole point of what we did to create the Pacific Research Platform was to solve that bottleneck. Then NSF in ’97 decided to move from supporting the five independent supercomputer centers to two – NCSA and SDSC – with a whole bunch of partners connected by the new NSF vBNS as the backbone, creating the NSF Partnerships for Advanced Computational Infrastructure (PACI). For all intents and purposes this is what a national research platform looks like today! But back then it was completely experimental; it was hooking together virtual reality and supercomputer centers and end users. We called it a ‘Prototype for America’s 21st Century Information Infrastructure’ and amazingly that’s what it turned out to be!” said Smarr.
“I remember sitting on the couch with Miron Livny in my office when we made the decision to name this the National Technology Grid. The NSF-funded vBNS directly led to the formation of Internet2, which is now the dominant network connecting all the universities in the country,” said Smarr.
Smarr credits the building of Roadrunner, a COTS-based Linux “supercluster” at the University of New Mexico, and its placement on the National Technology Grid as a watershed moment. He noted that David Bader, who led its development, received the 2021 Sidney Fernbach Award at SC21, in part for work on Roadrunner. “This was really the golden spike in my view on creating the distributed systems we use today.”
Another critical development was the deployment of I-Wire in Illinois, and at the same time, Indiana set up I-Light, both of which were dark fiber networks in the States. “These really were very inspirational, they were copied widely,” said Smarr. “When I got to California (~2000), a year later,” said Smarr, who’d transitioned from U Illinois to UCSD, “I found that CENIC (Corporation for Education Network Initiatives in California) was still buying bandwidth from Verizon and AT&T and did not have a dark fiber system. I joined the nonprofit board of CENIC, and within a few years, we started buying our own fiber. Today CENIC has 8,000 miles of owned and managed fiber. So, this growth over the last 20 years of these regional optical networks is a critical part of what lets us build today’s distributed systems, such as the Pacific Research Platform.”
“We put together a team to create a proposal to that ITR program, and we were able to win one called the OptIPuter. This was the real question at that time; we had a lot of clusters, Beowulf clusters and others, like David Bader had developed, but their backplanes were much faster than the Wide Area Network. So compute clusters were basically digital islands. The OptIPuter team had the idea that because of the fiber optics, we would be able to make the wide area bandwidth equal to the cluster bandwidth and remove the locality issue,” recalled Smarr.
“Working with the funds from the OptIPuter grant, we were able to build a prototype and interconnect with the NSF TeraGrid over the then National LambdaRail. This happened as 4K video streaming was just emerging. We had the first 4K projector in the United States at Calit2, which I founded in 2000,” said Smarr.
“As all this developed, DOE, and particularly ESnet, were able to codify the notion of what it meant to have this kind of high-speed networking for scientific computing and instrumental data on a campus. This led to their ScienceDMZ concept in 2010, in which you needed a Data Transfer Node (DTN) to terminate the high-speed optical fiber, as we had learned from the gigabit testbeds, with network performance monitoring being carried out by perfSONAR,” said Smarr.
“NSF heard about this and this created a second great example of NSF adopting from DOE (the first being setting up the NSF Supercomputer Centers). NSF created the campus cyberinfrastructure (CC*) program that Kevin Thompson has been running since 2012; [it] has made over 340 awards, funding these optically fiber connected ScienceDMZ cyberinfrastructure systems on campuses across the country,” said Smarr.
The next obvious step, said Smarr, was to take all the campus ScienceDMZ deployments that NSF had funded, and hook them together. By 2015 CENIC had [its own] elaborate fiber optic system that connected all the California campuses and supercomputer centers (NASA, DOE and NSF) together. “We put a proposal together six years ago called the Pacific Research Platform [PRP] to see if we could create a regional Science DMZ,” he said.
“We’ve created a PRP website over the last few years that has everything you need to get started on the PRP. It includes all the technical details of how to use Jupyter notebooks, how to containerize, and so forth. I recommend you go there if you’re interested,” said Smarr.
To solve the host I/O bottleneck that the Gigabit Testbeds discovered, “We’ve developed these DTNs we call Flash I/O Network Appliances, FIONAs. Flash made all the difference. It solved the host I/O data bottleneck, the Achilles heel that Bob Kahn talked about with four gigabit testbeds, but now at 10, 40, or 100 times the speed that the gigabit testbeds had. They are just PCs with terabytes of Flash memory and multicore CPUs. But then you can put in eight gaming 32-bit GPUs and turn it into a machine learning platform. So, these FIONAs are the endpoints that the fiber optics terminate in and they’re on your campus. Because they’re commodity, if you just get the specs, you can order them yourself,” said Smarr.
Of course, the science community was not the only one vigorously pursuing distributed computing. Commercial cloud vendors also have racks of PC systems tied together with fiber optics and faced similar challenges with regard to managing user access and software execution.
“The Google Cloud developed Kubernetes, and then made it open-source, as a way to orchestrate containers across a global set of optically fiber connected PCs. Now if you’ve containerized your application, you have this very global digital fabric on which Kubernetes can move your things around, and you can run Ceph as the storage system under Rook, which runs under Kubernetes. That means we can manage petabytes of distributed storage and GPUs for data science use,” said Smarr, crediting work by John Graham to accomplish this.
The PRP kept expanding beyond the West Coast and across the country. “Now we have 184 of these [FIONAs] located on over 25 campuses and they’re all interconnected at 10 to 100 gigabits per second. Together the PRP FIONAs contain nearly 7,500 CPU-cores and over 500 32-bit GPUs, as well as four petabytes of rotating storage, all running over CENIC, the Quilt regional networks, and Internet2. The extended PRP runs all the way from Korea and Guam, to Hawaii, across the U.S., to Amsterdam in Europe,” said Smarr.
There were speedbumps along the way – this is science after all – and Smarr recalled that early efforts to use the FIONAs widely had problems. “When we got started in 2016 we tried to do 10-gigabyte file transfers between every pair of FIONAs and then measure it. However, you can see back in 2016, the orange [on this slide] means we couldn’t even get the thing to read; we got basically no throughput, it was broken. But by a year and a half later, you can see the green, indicating that we had accomplished disk-to-disk file transfer at five gigabits per second and this was pretty much uniform. We also added TraceRoute to determine the state of all the network segments between the PRP FIONAs. These measurements have by now all been containerized.”
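For a rough sense of scale, the arithmetic behind that mesh test can be sketched in a few lines of Python. This is an illustrative back-of-the-envelope calculation, not the PRP's measurement code: the function names are hypothetical, the mesh size is made up, and it assumes decimal units (1 gigabyte = 8 gigabits) and a sustained rate with no protocol overhead.

```python
def transfer_seconds(file_gigabytes: float, rate_gigabits_per_s: float) -> float:
    """Ideal time to move a file at a sustained rate (decimal units, no overhead)."""
    return file_gigabytes * 8 / rate_gigabits_per_s


def directed_pairs(node_count: int) -> int:
    """Number of source-to-destination pairs in an all-to-all test mesh."""
    return node_count * (node_count - 1)


if __name__ == "__main__":
    # A 10-gigabyte file at the five gigabits per second Smarr cited.
    print(transfer_seconds(10, 5))
    # Hypothetical mesh of 20 nodes, chosen only for illustration.
    print(directed_pairs(20))
```

Under those assumptions, a 10-gigabyte file moving at five gigabits per second takes about 16 seconds, and the number of pairs to test grows roughly with the square of the number of endpoints.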
There was a great deal more to Smarr’s talk. Maybe SC21 will make a video of it widely available. That said, it’s important to remember how transformational this powerful distributed computing infrastructure is for the pursuit of science. Smarr cited a few examples. Here’s one.
“We built this [research platform] to enhance multi-campus, data-intensive applications. We brought in a bunch of application teams from particle physics, telescope surveys, atmospheric sciences, biomedical, and virtual reality as drivers for this,” said Smarr. “Just to give you an example, this is a global precipitation map from the MERRA NASA satellite archives that sends data to a center in Irvine for hydrometeorology to see all the developing precipitation systems,” said Smarr.
“As you’ve been reading, the atmospheric rivers just wiped out a lot of the Northwest US and Western Canada. You need to be able to predict those extreme precipitation events and you want to be able to do it in real time,” he said. “So, there’s a center for Western weather that looks at all of this in San Diego. We wanted to link the NASA satellite archives to the Irvine Center and then to the San Diego center. Well, the workflow was taking 19 days to do one round and that’s not predictive. But by using the PRP distributed workflow with its machine learning, and optimizing that on the FIONAs, we were able to get it from 19 days down to 52 minutes, which means it goes from being retrospective to predictive.”
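To put the improvement Smarr describes in perspective, the speedup can be worked out directly from the two figures he quotes. The short Python sketch below is only illustrative arithmetic on those numbers; the function and constant names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def speedup(before_minutes: float, after_minutes: float) -> float:
    """Ratio of the old turnaround time to the new one."""
    return before_minutes / after_minutes


# Figures quoted in the talk: 19 days before, 52 minutes after.
before = 19 * MINUTES_PER_DAY
after = 52

if __name__ == "__main__":
    print(before)                         # total minutes in 19 days
    print(round(speedup(before, after)))  # approximate speedup factor
```

Nineteen days is 27,360 minutes, so going to 52 minutes is a speedup of roughly 500-fold.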
“So where are we going in the future?” asked Smarr.
“The first thing is we’ve just gotten two awards from NSF that are going to allow us to increase participation by a number of Minority Serving Institutions, by building FIONAs on those campuses, as well as bringing in more EPSCoR states, the states least funded by NSF, such as South Carolina and Nebraska, as well as to increase the spread of the PRP across the country,” he said.
Secondly, he said, “The NSF XSEDE program that funds the supercomputer centers has funded a distributed version of SDSC called Expanse that uses the same kind of composable systems [and] Ceph storage as the PRP and as the Open Science Grid. We have already federated the containerized applications on the PRP, Expanse, and the Open Science Grid. Expanse is a five-year proposal.”
“Third, the NSF has just funded Frank Wuerthwein, a PRP co-PI, with his NRP, a prototype of the National Research Platform. This is going to put a lot of emerging computational systems like FPGAs – as well as 64-bit GPUs like are used on the supercomputers – and more distributed data systems building on the work he’s done as executive director of the Open Science Grid.”
“I think you’re going to see this type of cyberinfrastructure become a permanent part of the research landscape – a completely distributed high-performance computing, networking, storage and analysis facility,” said Smarr.
“As I said, even though NSF is funding it in our country, we already have seen that it’s being interconnected to Europe and the Pacific Rim as well. In closing, I wanted to say that this talk is not meant as a scholarly historical analysis of the last 40 years. Rather, it’s my personal recollection of key moments that I was personally involved in.”
Brief Smarr Bio
Dr. Larry Smarr is a University of California San Diego Distinguished Professor Emeritus in UCSD’s Department of Computer Science and Engineering. For the last 20 years he served as the founding Director of the California Institute for Telecommunications and Information Technology (Calit2). He was earlier the Founding Director of the National Center for Supercomputing Applications (NCSA). In 2006, he received the IEEE Computer Society Tsutomu Kanai Award for his lifetime achievements in distributed computing. Smarr continues to provide national leadership in advanced cyberinfrastructure (CI), currently serving as Principal Investigator on three NSF CI research grants: the NSF Pacific Research Platform, Cognitive Hardware and Software Ecosystem Community Infrastructure, and Toward the National Research Platform. He is a member of the National Academy of Engineering and a Fellow of the American Physical Society and has served on top level advisory committees to NIH, NASA, NSF, and DOE.