Ask any epidemiologist about another pandemic, and they will reply, "It is not a question of if, but of when." Many hard lessons were learned during the height of the recent COVID-19 pandemic (which continues into the fall of 2024). One of these lessons was the large amount of heroic, overlapping, and redundant work taking place. In this interview, HPCwire talks with Dr. Jonathan Ozik and Dr. Valerie Hayot-Sasson about algorithm-driven HPC workflows; improved data ingestion, curation, and management capabilities; and a shared development environment for rapid response and collaboration. One of the key goals is developing an open science data-flow platform, using tools like those developed by the Globus Project, to aid in better and more timely public health decisions.
HPCwire: Hello and welcome. I'm Doug Eadline, managing editor of HPCwire. Today, we're going to be talking about how the lessons from COVID-19 have helped shape better ways to manage resources and distribute critical, lifesaving information. COVID-19 had an unprecedented impact on scientific collaboration. The pandemic, and the broad response from the scientific community, forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making.
Today, we're speaking with Dr. Jonathan Ozik and Dr. Valerie Hayot-Sasson. By way of introduction, Dr. Valerie Hayot-Sasson is a postdoctoral scholar at the University of Chicago and holds a joint appointment at Argonne National Laboratory. Her interests focus on improving the accessibility of scientific research by developing software solutions that speed up the processing of workflows or improve ease of use. Dr. Jonathan Ozik is a principal computational scientist at Argonne National Laboratory, a senior scientist at the Consortium for Advanced Science and Engineering at the University of Chicago, with a Public Health Sciences affiliation, and a Senior Institute Fellow at the Northwestern-Argonne Institute of Science and Engineering at Northwestern University. Dr. Ozik leads multiple U.S. National Science Foundation and National Institutes of Health projects, some of which we're going to talk about today.
So first, I want to thank both of you for being here. I'd like to start off with Jonathan and ask the basic question: the COVID pandemic brought together valiant efforts from many sectors, yet many challenges seemed to occur, and trying to support critical public health decision making and data modeling seemed a little strained while this was going on. Given that there will be more pandemics, what lessons did you learn, or have we learned, from COVID-19? Thank you.
Dr. Jonathan Ozik: Thank you, Doug, for having us on. Let me first provide some background on our experiences. During COVID, our group dropped what we were doing, and as part of DOE's multi-lab National Virtual Biotechnology Laboratory (NVBL) effort, we were asked to support the nation's COVID response. At Argonne, we were one of four academic and national lab groups that made up the Illinois governor's COVID-19 task force, which had us meeting twice weekly to develop and provide analyses to the Illinois and Chicago departments of public health and to the governor's and the Chicago mayor's offices. We brought to bear our previous experience with HPC workflows, machine learning, and large-scale agent-based epidemiological modeling. But unlike research efforts, we were responding to rapidly evolving policy questions as our understanding of COVID epidemiology changed as well.
So this experience of working directly with public health stakeholders and supporting their decision making was, on one hand, very gratifying and, on the other, extremely difficult. Based on these experiences, we identified ways in which we thought we could improve our ability to enhance evidence-based decision making through better use of compute, automation, and analytics pipelines. Since then, our goal has been to develop and provide these capabilities to the public health modeling and analysis community as an open science platform. What we experienced with COVID is very likely not a one-off event. The better question isn't if something like that will happen again, but when.
HPCwire: During your initial efforts, what were some of the gaps you found that slowed things down or posed real challenges in trying to deliver this information?
Dr. Jonathan Ozik: That's a really good question. During COVID, individual research groups across the world were independently using HPC, data management, machine learning and AI, and attempting to use automation methods to develop, calibrate, modify, verify, and validate their epidemiological models. This involved really large amounts of heroic and overlapping work and also, unfortunately, lacked generalizability. So we identified three broad areas where we thought we could push the field forward. The first is that we determined there was a need for integrated, algorithm-driven HPC workflows.
This integration is critical in bringing together three important areas of computational science: simulation, large-scale workflow coordination, and the algorithms that strategically and efficiently guide the simulation and machine learning based analyses. These workflows need to coordinate across distributed and heterogeneous resources, because epidemiological modeling includes a range of different types of computational tasks. They also need to be fault tolerant and secure, and they need to facilitate automated access to these heterogeneous, distributed resources. Scalability is also important to handle the varying demands of epidemiological workflows, as are fast time-to-solution approaches that can provide actionable insights quickly.
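To make the idea of algorithm-driven workflows spanning heterogeneous resources a bit more concrete, here is a minimal sketch, not OSPREY's actual workflow code, of fanning simulation tasks out to one Globus Compute endpoint and running a lighter analysis step on another. The endpoint UUIDs and the toy simulation and summary functions are placeholders.

```python
from globus_compute_sdk import Executor

# Placeholder endpoint UUIDs: in practice these would be Globus Compute
# endpoints the researcher has configured on an HPC cluster and a workstation.
HPC_ENDPOINT = "11111111-1111-1111-1111-111111111111"
LOCAL_ENDPOINT = "22222222-2222-2222-2222-222222222222"

def run_epi_replicate(r0: float) -> float:
    """Toy stand-in for one agent-based simulation replicate."""
    import math
    # Crude proxy for an epidemic's final size; a real replicate would run a model.
    return (1.0 - math.exp(-r0)) if r0 > 1.0 else 0.0

def summarize(results: list) -> dict:
    """Toy stand-in for a machine-learning or calibration step over replicates."""
    return {"mean_attack_rate": sum(results) / len(results)}

# Fan the heavier simulation replicates out to the HPC endpoint and run the
# lighter summary step locally; fault tolerance and retries are omitted here.
with Executor(endpoint_id=HPC_ENDPOINT) as hpc, Executor(endpoint_id=LOCAL_ENDPOINT) as local:
    futures = [hpc.submit(run_epi_replicate, r0) for r0 in (1.2, 1.5, 2.0)]
    replicate_results = [f.result() for f in futures]
    print(local.submit(summarize, replicate_results).result())
```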
Dr. Jonathan Ozik: So that was the first area. The second area is a need for improved data ingestion, curation, and management capabilities. With all the different data sources that kept changing over time, there was a need to access, move, and track the diverse data sets from their origin to their use within our computational analyses. There was also an overwhelming need to automate data curation through data analysis pipelines for data de-biasing, integration, uncertainty quantification, and metadata and provenance tracking. And the third broad area was a real need for a shared development environment for rapid response and collaboration.
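As a rough illustration of the data movement and provenance tracking described here, the following sketch, assuming an already authenticated globus_sdk.TransferClient, stages a data-portal snapshot onto a staging collection and records a minimal provenance entry. The collection IDs, paths, and the shape of the provenance record are illustrative assumptions, not the project's actual pipeline.

```python
import datetime
import globus_sdk

def ingest_portal_snapshot(tc: globus_sdk.TransferClient,
                           portal_collection: str,
                           staging_collection: str,
                           src_path: str,
                           dest_path: str) -> dict:
    """Stage one data-portal snapshot and return a minimal provenance record."""
    tdata = globus_sdk.TransferData(
        source_endpoint=portal_collection,
        destination_endpoint=staging_collection,
    )
    tdata.add_item(src_path, dest_path)
    task = tc.submit_transfer(tdata)

    # Minimal provenance: origin, destination, timestamp, and the transfer task
    # that moved the data. A full pipeline would also record curation steps
    # such as de-biasing, integration, and uncertainty quantification.
    return {
        "source": f"{portal_collection}:{src_path}",
        "destination": f"{staging_collection}:{dest_path}",
        "transfer_task_id": task["task_id"],
        "ingested_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
```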
This includes the ability to quickly and efficiently share models and create portable workflows that can run across federated HPC systems, as well as ways to house models so that others can reproduce, extend, or scale them. Based on these requirements, we're developing the Open Science Platform for Robust Epidemic analysis, or OSPREY, as we call it, and we're taking a decentralized approach that leverages cloud services to connect compute and storage across national laboratory and university resources.
HPCwire: That’s a lot. Can you touch on the decentralized approach and the tools you’re using to connect all these services?
Dr. Jonathan Ozik: Sure thing. I'm happy to start, and then Valerie can expand. One aspect of decentralization is that different researchers will have different compute and storage resources they can access, and we want to allow any combination of these to support their epidemiological analysis needs. The second aspect of decentralization has to do with the data, some of which might be proprietary or sensitive, which requires us to move away from a central, monolithic storage or compute requirement.
Dr. Valerie Hayot-Sasson: Yeah. Because we needed to find a solution that works with all the resources available to researchers, whether that be workstations, high performance computing, or the cloud, we needed to look at solutions that could seamlessly use all of these services. We decided to use Globus services, which have been designed with the scientific community and diverse, distributed resources in mind. While the services themselves are cloud-hosted, they operate on a hybrid model. These services are also widely used by the scientific community and are already deployed in most research cyberinfrastructure, making adoption of the tools easier.
HPCwire: So here's a question I have. You have all this infrastructure and everything; what does it look like to the end user, or what are you planning to make it look like to the end user?
Dr. Jonathan Ozik: Our vision for the end user is an analyst who defines a policy or a timer that reaches out to data portals. Data from these portals are ingested automatically, then verified, validated, and transformed as needed, and stored in an accessible location. Subsequently, analyses are kicked off with this newly updated data, where we employ HPC workflows to generate, for example, estimates of epidemiological parameters and forecasts of future trends. Public health stakeholders are able to access these analysis products and, in turn, provide additional data into the data portals. The ultimate vision is that all of this happens in the background and individuals access automatically generated analyses, similar to how we experience, let's say, weather forecasting. Now Valerie will introduce the Automated Events-based Research Orchestration, or Aero, tool that we have been building together.
Dr. Valerie Hayot-Sasson: Okay, so here is what it looks like behind the scenes. There are two different types of flows, each of which is a Globus flow. On one side you have the data ingestion, verification, and curation flow, and on the other side you have the data analysis flow. The data ingestion, verification, and curation flow is typically used to process data that are not created by an Aero flow. For example, this is for consolidating data made available by different sources, such as open data portals, hospitals, and so on, and preparing it for future analysis, that is, preparing it for the analysis flows.
These ingestion flows are expected to be periodic in nature, running at the same cadence at which the data portals are updated. The analysis flows, in contrast, typically operate on data already available in the system, unlike the data ingestion flows, and produce outputs that can then be used as inputs to other analysis flows. In this case, users can define two types of policies to relaunch their analyses: timer-based policies or data-update policies.
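For readers curious what such a flow might look like, here is a hedged sketch of a two-step Globus Flows definition: transfer newly curated data to the compute site, then invoke a registered analysis function on a Globus Compute endpoint. This is illustrative rather than Aero's actual flow, and the exact parameter names should be checked against the current Globus action-provider schemas.

```python
# Illustrative two-step flow definition; all values come from the run input.
flow_definition = {
    "Comment": "Transfer curated data, then launch an epidemiological analysis",
    "StartAt": "TransferData",
    "States": {
        "TransferData": {
            "Type": "Action",
            "ActionUrl": "https://actions.globus.org/transfer/transfer",
            "Parameters": {
                "source_endpoint_id.$": "$.input.staging_collection",
                "destination_endpoint_id.$": "$.input.compute_collection",
                "transfer_items": [
                    {
                        "source_path.$": "$.input.source_path",
                        "destination_path.$": "$.input.destination_path",
                    }
                ],
            },
            "ResultPath": "$.TransferResult",
            "Next": "RunAnalysis",
        },
        "RunAnalysis": {
            "Type": "Action",
            "ActionUrl": "https://compute.actions.globus.org",
            "Parameters": {
                "endpoint.$": "$.input.compute_endpoint",
                "function.$": "$.input.analysis_function_id",
                "kwargs.$": "$.input.analysis_kwargs",
            },
            "ResultPath": "$.AnalysisResult",
            "End": True,
        },
    },
}
```

A definition along these lines could be deployed to the Globus Flows service and then triggered either on a timer or whenever the curated data is updated, matching the two relaunch policies described above.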
HPCwire: Yeah, thanks. That's interesting. What struck me was the "weather forecast" analogy. I can imagine how, during the COVID pandemic, the ability to put up heat maps of COVID and so forth, just like the nightly weather forecast, might have helped many people, rather than leaving them guessing about what was going on or getting misinformation from who knows where. Is there any place where this is being used right now, and is it ready to run, or where do you see this going at this point?
Dr. Jonathan Ozik: We are working with the Chicago Department of Public Health to integrate these capabilities into their processes. They have been both very supportive of our work and really keen to incorporate what we develop into evidence-based decision making as we develop the infrastructure. We're also co-developing test use cases, for example, being able to automatically ingest wastewater data and kick off automated analyses as the data is updated. This is of particular interest due to the passive nature of this type of data source, especially as reporting requirements for COVID have largely gone away; as a result, previously available testing or hospitalization data are not as readily available, if at all. However, I do want to say that our goal is to go beyond department of public health partners and to have researchers build on what we create to further advance the role of computation in supporting decision making in public health.
Dr. Valerie Hayot-Sasson: And what's great about Aero is that it's very generalizable. At its base, it's fundamentally a data sharing platform, and the data types and storage it uses can be at the discretion of the users. The same can be said for computation, which can be executed on any infrastructure that the user has, and it enables users to describe their analyses using the software they're most familiar with. This results in a kind of bring-your-own-infrastructure model.
HPCwire: So, Valerie, that intrigues me. What is a bring-your-own-infrastructure model?
Dr. Valerie Hayot-Sasson: The bring-your-own-infrastructure model means that both compute and storage resources are actually provided by the users. This results in users maintaining full ownership of their data, as the data never gets transmitted to the Aero servers, and users get to reuse their configured infrastructure for their services, without having to specify to Aero how to install their code. All users need to do when specifying a flow is provide their Globus Compute endpoint information, so that Aero knows where the automated flow needs to be executed, and a Globus collection URL that tells Aero where to store and retrieve the data.
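To make that concrete, here is a purely hypothetical sketch of what registering a flow could look like from the user's side. The register_analysis_flow function and its fields are invented for illustration and are not Aero's real API, but they capture the point: the user hands over only identifiers for a Globus Compute endpoint and a Globus collection they control, never the data itself.

```python
from dataclasses import dataclass

@dataclass
class UserInfrastructure:
    compute_endpoint_id: str   # Globus Compute endpoint the user operates
    collection_id: str         # Globus collection where data is stored and retrieved
    results_path: str          # path on that collection for analysis outputs

def register_analysis_flow(name: str, infra: UserInfrastructure,
                           trigger: str = "on-data-update") -> dict:
    """Hypothetical registration call: the platform records where to run the
    flow and where to store results, but never takes custody of the data."""
    return {
        "flow": name,
        "compute_endpoint": infra.compute_endpoint_id,
        "collection": infra.collection_id,
        "results_path": infra.results_path,
        "trigger": trigger,
    }

spec = register_analysis_flow(
    "wastewater-forecast",
    UserInfrastructure(
        compute_endpoint_id="<your-globus-compute-endpoint-uuid>",
        collection_id="<your-globus-collection-uuid>",
        results_path="/projects/osprey/results/",
    ),
)
print(spec)
```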
HPCwire: This is all very interesting to me: bringing data together, globally and locally, and making it usable, actionable data in times of, in this case, a pandemic. So, what really excites you about taking this beyond where it is today?
Dr. Valerie Hayot-Sasson: The world is filled with problems that are worthy of investigation, and researchers' progress is often hindered by repetitive tasks that could easily be automated. So the goal of these tools is really to enhance the day-to-day lives of researchers, so that they can focus their energy on the problems they set out to solve rather than spend significant amounts of time on these repetitive tasks. What I hope is that when we build these kinds of tools, it can ultimately lead to faster discoveries.
HPCwire: I think being able to manage data and bring it to bear is really one of the challenges we have at this point in HPC and in many other areas as well. You don't hear a lot about big data anymore, although it's still there, and being able to manipulate, move, and calculate with big data is really important. Providing these tools is a great effort on your part. Jonathan, any closing comments?
Dr. Jonathan Ozik: Yes, Doug, thank you. I'm really glad that we had this opportunity to sit down and present the work that we're doing now, and what we're hoping to do, to expand the ways in which computational science can push this field forward.
HPCwire: I want to thank you as well, both you and Valerie. It's been a great interview, hearing how this can apply to many different areas, particularly pandemics. The lessons learned from COVID made clear that we had some work to do to get things right, and I hope to hear more in the future. We'll be talking to you soon.
This material is based upon work supported by the National Science Foundation under Grant 2200234, the U.S. Department of Energy, Office of Science, under contract number DE-AC02-06CH11357 and the Bio-preparedness Research Virtual Environment (BRaVE) initiative. This research was completed with resources provided by the Research Computing Center at the University of Chicago, the Laboratory Computing Resource Center at Argonne National Laboratory, and the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility.