“We’re pretty far from being out of this mess right now,” cautioned BioTeam CEO Ari Berman as he opened the latest ASF Roundtable, Fighting COVID with Advanced Technologies: Reprovisioning Resources During a Disruptive Event. The roundtable, which brought together experts from private industry, national laboratories and computing centers, explored how major HPC centers rapidly pivoted when faced with the sudden needs of pandemic research – and what, a year and a half later, they now consider necessary for the HPC community to face similar crises in the future.
A deluge of data – and strained resources
“Early in the pandemic, we didn’t know much about SARS-CoV-2 [or] the disease that it caused, COVID-19,” said Jack Collins, director of the Advanced Biomedical Computational Science Group at Frederick National Laboratory. As computational scientists, they immediately went to the data – but, as Berman had put it, “the data was, and is, a complete mess. People are still sifting through it … a lot of errors crept into the research due to the urgency.”
Collins’ experience bore this out: the data coming out of Italy was in a mix of Italian and English, and “every hospital had their own way of doing it.” The disorganization was so severe, he said, that one of his data analysts actually learned enough Italian to start translating it. “Ari talked about data early on,” Collins said, “and it’s the most critical source – and oftentimes the least portable.”
Many of the panelists also struggled with data sensitivity, as much of the COVID-related data at the patient level was in some way restricted due to privacy and health data laws. “When it comes to sensitive data, it’s really hard to deploy that and use it on resources that were not deployed for it prior to the need,” said Glenn Brook, senior solution architect for Cornelis Networks.
Collins’ research team was interested in, among other things, identifying the causes of severe COVID-19, and began exploring patient susceptibility using whole-genome sequencing. This work was backed by the Biowulf cluster, an HPC resource provided by the National Institutes of Health (NIH) that is equipped with thousands of CPU nodes and over a hundred GPU nodes.
Still, the scale of the data was daunting. “For every one of the samples, we take about 96 hours of compute,” Collins said. Further, each sample required up to 300GB of storage – and beyond the computational and storage limitations, there were knowledge limitations. “One of the things we run into is … people who don’t know how to use high-performance computing systems,” Collins said. Access to computational resources, it turned out, wasn’t all they needed to support COVID research.
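Those per-sample figures compound quickly. A minimal back-of-envelope sketch in Python makes the scale concrete – the cohort size and node count here are hypothetical, chosen only for illustration; just the per-sample numbers come from Collins:

```python
# Back-of-envelope scaling of Collins' per-sample figures. The cohort
# size and node count are hypothetical, not numbers from the talk.

HOURS_PER_SAMPLE = 96      # quoted compute time per whole-genome sample
GB_PER_SAMPLE = 300        # quoted upper bound on storage per sample
COHORT_SIZE = 10_000       # hypothetical patient cohort
NODES = 500                # hypothetical slice of a cluster like Biowulf

node_hours = HOURS_PER_SAMPLE * COHORT_SIZE       # 960,000 node hours
storage_pb = GB_PER_SAMPLE * COHORT_SIZE / 1e6    # 3 petabytes
wall_clock_days = node_hours / NODES / 24         # ~80 days end to end

print(f"Compute: {node_hours:,} node hours")
print(f"Storage: {storage_pb:.1f} PB")
print(f"Wall clock on {NODES} nodes: {wall_clock_days:.0f} days")
```

Even with a generous slice of a large cluster, a modest cohort consumes months of wall-clock time and petabytes of storage – before accounting for the expertise needed to run the pipelines at all.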
Such resources were suddenly stretched thin at many supercomputing centers: at the Texas Advanced Computing Center (TACC), for instance, up to 30 percent of its total capacity – including personnel – was pivoted to pandemic research. “We supported over 90 different projects for different researchers around the world,” said Dan Stanzione, TACC’s executive director. “We used about fifteen million node hours, call it three-quarters of a billion core hours … Give or take, we probably used about $30 million worth of time on COVID-related things on very little notice.”
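Stanzione’s rounded totals can be sanity-checked with a couple of lines of arithmetic. Assuming Frontera’s 56 cores per node (the per-node core count is our assumption; only the node-hour and dollar figures come from the talk), the numbers hang together:

```python
# Sanity check on the quoted TACC totals. Only NODE_HOURS and
# DOLLAR_VALUE come from the talk; CORES_PER_NODE is assumed from
# Frontera's dual 28-core Xeon Platinum 8280 nodes.

NODE_HOURS = 15_000_000        # "about fifteen million node hours"
CORES_PER_NODE = 56            # assumed Frontera per-node core count
DOLLAR_VALUE = 30_000_000      # "about $30 million worth of time"

core_hours = NODE_HOURS * CORES_PER_NODE         # 840M core hours, in the
                                                 # ballpark of the quoted
                                                 # "three-quarters of a billion"
value_per_node_hour = DOLLAR_VALUE / NODE_HOURS  # about $2 per node hour

print(f"{core_hours / 1e9:.2f} billion core hours")
print(f"${value_per_node_hour:.2f} per node hour")
```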
TACC, through its 23.5-Linpack-petaflops Frontera system, supported the advanced modeling of SARS-CoV-2’s spike protein led by Rommie Amaro at the University of California, San Diego, which won the Gordon Bell Special Prize at SC20 last year. That process, Stanzione explained, was iterative and collaborative. “Your simulations keep getting bigger and bigger and you learn more and more over time. So there really wasn’t a sort of, ‘run the computation and move on,’” he said. “It was a journey.”
But how does a center as large as TACC embark on such a journey – especially when, in the case of the research by the Amaro Lab, that journey costs millions of node hours and a hundred million core hours?
Bypassing the red tape
“All existing policy and agreements about who gets time on the systems and how we allocate time – length of runs, priorities – you throw all of that out the door almost immediately,” Stanzione said. Then, he said, “we had to rapidly build collaborations” to pull off what he called a “fairly large pivot” into heavy biomedical research in a matter of weeks.
Stanzione cited the White House COVID-19 HPC Consortium as a major success, noting that not one organization had signed a formal agreement and that there were zero lawyers involved. “No formal agreements, no legalese,” he said. “We just did it.” (Collins called that lack of legal hang-ups a “huge relief” that “released a lot of creative energy.”)
But Stanzione stressed a crucial caveat. “All of that only worked,” he said, “because there had been a lot of investments that led up to this. … We had a lot of people who had spent their careers getting ready to fight a pandemic … We did not invent new codes overnight. … We didn’t deploy new infrastructure overnight. We repurposed it.”
Many of the partnerships, too, relied on existing connections: in Amaro’s case, her lab had been using TACC’s systems for twenty years. In late February 2020, when Stanzione got an email from her anticipating that the pandemic would continue accelerating and requesting urgent computing time, he knew that her lab was prepared to take full advantage of TACC’s considerable resources. Collins, too, said that they “had a lot of the connections already together” at Frederick National Laboratory.
Moving toward the National Strategic Computing Reserve
The panelists broadly converged on a central theme: the infrastructure and knowledge established before the pandemic worked well, but the gaps that researchers were left scrambling to fill, in many cases, remained problem areas.
“Where do we go from here?” Stanzione asked, answering: “We could do much of the same for many other kinds of disasters.”
Stanzione was referring to the notion of a National Strategic Computing Reserve (NSCR), which he compared to a Merchant Marine for urgent computing. In late December 2020, the National Science Foundation (NSF) and the White House Office of Science and Technology Policy (OSTP) issued a Request for Information (RFI) on the idea of an NSCR, but since then, nothing major has happened – at least, not officially.
“No timeframe, nothing definite, no bill to lobby for,” Stanzione said. “But there is strong interest in high levels of at least four agencies in pushing this forward. And advocacy still needs to happen, I’d say.”
The NSCR, in principle, would allow researchers to maintain expertise and infrastructure critical to managing a sudden catastrophic event, rather than leaving them to fill those needs on short notice when such an event occurred. “You keep doing research and you keep funding the codes that can be repurposed and the computers that can be repurposed,” said Ian Karlin, principal HPC strategist for Lawrence Livermore National Laboratory. “You don’t know exactly how they’re gonna pivot, but you know they’re constantly getting better so they can pivot.”
And that infrastructure, Karlin said, needed to be well-suited for those specific purposes. “It doesn’t matter if people talk to each other if they can’t get to the data,” he said. Brook added that institutions should look closely at rapid provisioning of secure computing for sensitive and restricted data use, while Collins noted that many local public health agencies lacked the infrastructure to participate in HPC activities and that researchers should work on “having a way to reach in and help them in a way where they can actually make use of all this computation.”
Preparedness was key for the panelists, many of whom strongly expressed the need for the research community to be well-supplied for urgent computing in advance of the next disaster.
“If we don’t do that,” Collins said, “we’re just playing Russian roulette.”
To watch the sessions from the ASF Roundtable, follow the links below.
Responding to COVID-19 – Research Pivot – Jack Collins
Fighting COVID at TACC – Dan Stanzione
LLNL and Cornelis Networks Accelerate COVID Research at Scale – Glenn Brook & Ian Karlin
The Impact of Reprovisioning Resources During a Disruptive Event | Best Practices and Lessons Learned – Ari Berman, Glenn Brook, Jack Collins, Ian Karlin & Dan Stanzione