Cycling through Genomics and Other Cloud HPC Applications
HPC applications run in the cloud tend to be experimental in nature. That property lends cloud-based HPC nicely to scientific purposes, especially in genomics, to the extent that such efforts are being recognized as a ‘best practice’ in a biological IT context.
HPC in the Cloud caught up with Cycle Computing CEO and Co-Founder Jason Stowe, who discussed the company’s work with Schrodinger, Inc., a company focused on chemical simulation for biotechnical and pharmaceutical purposes, which won Bio-IT World’s best practices award last month. Stowe also discussed how Cycle’s Utility HPC software advances the state of scientific HPC in the cloud, as well as the company’s initiatives in the months and years to come.
“Schrodinger won the best practice award,” Stowe said, “for a large-scale run that we did with them where we had a 50,000-core computing environment and ran approximately 12 years of science on it in 3 hours.” For Stowe, the biggest benefits here are cost and speed. According to analysts from firms like IDC, buying and operating a system capable of running those computations could easily cost millions of dollars.
That cost is worth it for national labs and large institutions that would continually use those servers. For a company like Schrodinger, however, the cost and space requirements to install such a datacenter would be prohibitive.
As such, through Cycle’s Utility HPC software running in the Amazon Web Services cloud, Schrodinger was able to significantly reduce costs on the simulation. “We turned [the system] off,” Stowe explained, “and the total cost at the time to do this was $4829 to run per hour so about $14,500 total for the workload.”
However, as one would surmise from previous HPC in the Cloud articles on organizations like CERN and the European Space Agency running experimental applications in a virtualized cloud environment, cloud-based HPC is not limited to those who can ill afford an idle datacenter. “We have customers who use 40 cores and customers who use 40,000 cores,” Stowe said.
According to Stowe, Cycle recently worked with a large pharmaceutical company running genomics simulations to achieve similar cost and time compression: the company reportedly ran “39 years of science in 11 hours” on a 10,000-server infrastructure, at a cost of only about $4,400.
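The figures quoted for the two runs can be checked with a little arithmetic. The sketch below is only illustrative: it assumes that "years of science" means total compute time, takes a year as 365.25 days, and uses a hypothetical helper named `summarize`. None of this comes from Cycle's own accounting.

```python
# Rough arithmetic on the two runs described in the article.
# Assumption: "years of science" is total compute time, with 365.25-day years.
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours


def summarize(years_of_science, wall_hours, total_cost):
    """Return time compression and hourly cost for one run (hypothetical helper)."""
    compute_hours = years_of_science * HOURS_PER_YEAR
    return {
        "compute_hours": compute_hours,
        "compression": compute_hours / wall_hours,
        "cost_per_wall_hour": total_cost / wall_hours,
    }


# Schrodinger run: 12 years of science in 3 hours at $4,829 per hour.
schrodinger_cost = 3 * 4829
schrodinger = summarize(12, 3, schrodinger_cost)

# Pharmaceutical run: 39 years of science in 11 hours for about $4,400.
pharma = summarize(39, 11, 4400)

print("Schrodinger total cost:", schrodinger_cost)
print("Schrodinger compression:", round(schrodinger["compression"]))
print("Pharma compression:", round(pharma["compression"]))
print("Pharma cost per wall-clock hour:", pharma["cost_per_wall_hour"])
```

Under these assumptions, three hours at $4,829 per hour comes to $14,487, consistent with the "about $14,500" Stowe quoted, and both runs compress the work by a factor of roughly 30,000 to 35,000.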
Stowe explained how Cycle’s software harnesses server clusters so that they mimic an in-house scientific HPC machine. “Our premise here with utility supercomputing is basically that individual researchers can now grab very large high throughput capability machines.”
High throughput is important because it is what the majority of new scientific applications being built and run today require. “[The new science is] data parallel, it’s big data, it’s analytics. All of those workloads work well on high throughput computing environments. Basically we have the ability to create large-scale environments that operate quickly to run these newer classes of workloads that require a high throughput,” Stowe said.
Specifically, according to Stowe, Cycle’s Utility HPC software creates that throughput with a heavy emphasis on job scheduling and workload management. The software also bids automatically for Amazon’s idle computing capacity, acquiring additional resources when jobs require them. “As you accumulate more and more samples from the sequencer, we would be able to deploy large scale clusters that would be capable of analyzing that data and then turn around and managing cost across those clusters by handling spot market bidding, which is Amazon’s marketplace for idle computing.”
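To make the idea of sizing a cluster against spot prices concrete, here is a minimal sketch. It is not Cycle's algorithm and calls no Amazon API: the `SpotOffer` type, the `plan_cluster` function, the instance labels, and all prices are hypothetical values invented for illustration. It simply picks the cheapest offer per core under a price cap and requests enough instances to finish the pending work by a deadline.

```python
import math
from dataclasses import dataclass


@dataclass
class SpotOffer:
    """A hypothetical snapshot of one spot-market offer."""
    instance_type: str     # made-up label, not a real instance name
    cores: int             # cores per instance
    price_per_hour: float  # current spot price in USD (invented)


def plan_cluster(pending_core_hours, deadline_hours, offers, max_price_per_core_hour):
    """Choose the cheapest affordable offer and size the cluster for the deadline.

    Returns (instance_type, instance_count, estimated_cost), or None when
    every offer exceeds the price cap and the work should wait.
    """
    affordable = [
        o for o in offers
        if o.price_per_hour / o.cores <= max_price_per_core_hour
    ]
    if not affordable:
        return None
    best = min(affordable, key=lambda o: o.price_per_hour / o.cores)
    cores_needed = math.ceil(pending_core_hours / deadline_hours)
    instances = math.ceil(cores_needed / best.cores)
    estimated_cost = instances * best.price_per_hour * deadline_hours
    return best.instance_type, instances, estimated_cost


offers = [SpotOffer("small", 8, 0.40), SpotOffer("large", 32, 1.28)]

# 6,400 core-hours of queued analysis, to be finished within 4 hours.
plan = plan_cluster(6400, 4, offers, max_price_per_core_hour=0.045)
print(plan)

# With a cap below every offer, the sketch declines to bid.
print(plan_cluster(6400, 4, offers, max_price_per_core_hour=0.01))
```

A real scheduler would also have to handle instances being reclaimed when the spot price rises above the bid, which this sketch ignores.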
To give an example, Stowe spoke of a genomics company that needed to run MPI jobs requiring many processors and high throughput. “If you’ve got a next gen sequencer, putting data down on a local cloud system, our software would schedule copying the data externally and would deploy clusters to run secondary and tertiary analysis on the genomic data, it would handle automatically archiving a copy of that data into [Amazon] Glacier so you always had a backup at a very low cost point.”
Genomics is one of the more notable use cases for running HPC applications in a virtualized environment. This makes sense: the ability to run genomic sequencing cheaply and quickly, relative to ten years ago (when it took a decade and several billion dollars), is impressive. Genomic sequencing is also highly data-intensive, and most of that data is needed for the analytics. Stowe noted that Cycle’s goal is to be able to run background analytics while the data is stored on various cloud servers.
However, Cycle does not aim to focus solely on genomics. Stowe noted that cloud-based HPC applications are attracting the attention of the manufacturing and finance sectors, which look to run multiple experimental simulations without further taxing their in-house HPC resources, and Cycle hopes to be at the forefront of that.