In November, researchers at the San Diego Supercomputer Center (SDSC) and the Wisconsin IceCube Particle Astrophysics Center (WIPAC) set out to break the internet – or at least, pull off the cloud HPC equivalent. As part of their efforts to simulate the faint flashes of light that neutrinos produce as they pass through the Antarctic ice, the team executed a massive, unprecedented stunt: over the course of a couple of hours, they bought up as many cloud GPUs as they could manage and ran them in tandem in a single HTCondor pool, reaching a peak of 51,000 cloud GPUs simultaneously crunching IceCube’s simulations. Now, just a few months later, the cloud bursting experiment has played an encore with a smaller group of cloud GPUs.
Using some of the remaining funding from the November run (about $60,000), the team pooled cloud resources from Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform. While they again used HTCondor to manage the workloads, the researchers were choosier about which cloud instances they selected, and also combined the cloud resources with on-prem resources from the Open Science Grid, the Extreme Science and Engineering Discovery Environment (XSEDE) and the Pacific Research Platform.
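For readers unfamiliar with HTCondor, GPU work like this is typically described in a submit file that tells the scheduler what each job needs; the pool then matches jobs to whatever cloud or on-prem machines satisfy those requirements. The sketch below is purely illustrative – the executable name, file names, and resource figures are assumptions, not details from the IceCube runs.

```
# Hypothetical HTCondor submit description for GPU simulation jobs.
# All names and values here are illustrative, not IceCube's actual setup.
universe              = vanilla
executable            = run_sim.sh
request_gpus          = 1
request_cpus          = 1
request_memory        = 4GB
# Only match machines whose advertised GPU meets a minimum CUDA capability
requirements          = (CUDACapability >= 6.0)
should_transfer_files = YES
transfer_input_files  = sim_config.json
queue 1000
```

Because each simulation job is independent, a pool like this scales almost linearly: adding cloud machines simply gives the matchmaker more slots to fill.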
In total, the experiment commandeered 15,000 GPUs over a sustained period, aggregating the equivalent of 170 peak single-precision petaflops. At about eight hours, this experiment ran around four times longer than the first, larger cloud bursting experiment, allowing a much smaller group of cloud resources to process 151,000 jobs compared to the initial experiment’s 101,000. “This means that the second IceCube cloud run produced 50% more science, even though the peak was significantly lower,” explained Igor Sfiligoi, lead scientific software developer and researcher at SDSC.
“We drew several key conclusions from this second demonstration,” said Sfiligoi. “We showed that the cloudburst run can actually be sustained during an entire workday instead of just one or two hours, and have moreover measured the cost of using only the two most cost-effective cloud instances for each cloud provider.” The researchers also found that cloud resources with Nvidia Tesla T4 GPUs ranked as the most suitable for their needs – about three times as cost-effective as the next-best option (the Nvidia Tesla V100).
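The cost-effectiveness comparison boils down to dollars spent per unit of simulation work: a V100 is faster than a T4, but its hourly price is higher by a larger factor. The sketch below illustrates the arithmetic with assumed hourly prices and an assumed relative throughput – neither figure comes from the IceCube runs or any provider's actual price list.

```python
# Illustrative cost-per-work comparison between GPU types.
# Prices (USD/hour) and relative throughput are assumptions
# chosen only to demonstrate the calculation, not measured figures.
PRICE_PER_HOUR = {"T4": 0.35, "V100": 2.48}
RELATIVE_THROUGHPUT = {"T4": 1.0, "V100": 2.4}  # work done per hour, T4 = 1.0

def cost_per_unit_work(gpu: str) -> float:
    """Dollars spent per unit of simulation work on the given GPU."""
    return PRICE_PER_HOUR[gpu] / RELATIVE_THROUGHPUT[gpu]

for gpu in ("T4", "V100"):
    print(f"{gpu}: ${cost_per_unit_work(gpu):.3f} per unit of work")

# Ratio of V100 cost-per-work to T4 cost-per-work.
ratio = cost_per_unit_work("V100") / cost_per_unit_work("T4")
```

With these assumed numbers the T4 comes out roughly three times more cost-effective, in line with the article's claim; the real ranking would depend on each provider's current pricing and the workload's actual per-GPU throughput.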
Crucially, this second experiment also broke from the extraordinary circumstances of its predecessor, which was carefully planned for a time of year, time of week and time of day when global cloud GPU usage was expected to be minimal.
“This second cloud burst was performed on a random day, during regular business hours, adding cloud resources to the standard IceCube on-prem production infrastructure,” said Frank Würthwein, SDSC’s lead for high-throughput computing. “It’s the kind of cloud burst that we could do routinely, a few times a week if there were principal investigators willing to pay the bill to do so.”
Header image: the IceCube Neutrino Observatory in Antarctica.
To read SDSC’s release discussing this research, click here.