June 5, 2013

Floating Genomics to the Cloud with AWS

Ian Armas Foster

As more institutions adopt cloud strategies to supplement their existing HPC practices, it is worth asking how widely companies run HPC applications in the cloud and which applications benefit most.

David Pellerin and Jafar Shameen, both of HPC Business Development at Amazon Web Services, gave a presentation at AWS Summit 2013 on which industries and companies are using the cloud service to run HPC applications. Not surprisingly, the talk mostly centered on applications in genomics and the life sciences, as highlighted by a third speaker, Alex Dickinson, SVP of Cloud Genomics at Illumina.

“What you end up doing is building a cluster for the worst, nastiest problem you have,” said Pellerin on the risks and costs of building in-house HPC clusters. “You get this big, expensive cluster that for most of the workload, it doesn’t need to be there.” No company should know this better than Amazon, which became a cloud services provider because it had an excess of computing resources that were put to use only at certain peak times.

Scientific disciplines such as genomics and high-energy particle physics turn to cloud computing for certain HPC applications for a fairly basic reason: cloud computing is optimal for experimentation. For Pellerin, computing on AWS provides ‘the ability to fail fast.’ An in-house system is subject to job queue and scheduling limitations that generally prove both costly and time-consuming.

Again, ‘the ability to fail fast’ matters to a researcher who wants to quickly test several hypotheses against a large dataset. The capability doesn’t help only those in the sciences: financial services firms are running risk analytics on AWS, while engineering firms run CAD and CAE simulations for aerospace, according to Pellerin. Still, terms like ‘risk analytics’ and ‘CAD simulations’ imply an experimental approach to computing, in which running multiple scenarios in a short amount of time has considerable value.

The focus here, though, was on the life sciences and on genomics in particular. The advances of the last decade have shifted the challenge in genome sequencing from performing the procedure itself to storing and processing the resulting data. As Dickinson explained, “When we ask our customers where do they spend their time…the actual time they spend sequencing is relatively small. What really kills them is the bioinformatics, which is comprised of a lot of computationally intensive processing and also now interpretation.”

Ten years ago, the Human Genome Project was completed after 13 years and a $4 billion investment. Today, sequencing a human genome takes only a day and costs about a thousand dollars.

As such, genomic sequencing has scaled faster than Moore’s Law over the last decade, as seen in the figure below. This presents an obvious storage issue, especially when policy requires that the information be kept for several years.
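To put the Moore’s Law comparison in rough numbers, the short Python sketch below uses only the figures quoted above ($4 billion then, about $1,000 now, ten years apart). The two-year doubling period for Moore’s Law is an assumption made for illustration, not a figure from the presentation, and the $4 billion covered the entire 13-year project, so this is a back-of-the-envelope comparison rather than a like-for-like one.

```python
import math

# Figures quoted in the article.
COST_THEN_USD = 4e9    # investment in the original human genome effort
COST_NOW_USD = 1e3     # approximate cost of sequencing a genome today
YEARS_ELAPSED = 10

# Assumption (not from the article): Moore's Law as a doubling every 2 years.
MOORE_DOUBLING_YEARS = 2.0


def improvement_factor(cost_then, cost_now):
    """How many times cheaper sequencing has become."""
    return cost_then / cost_now


def moore_factor(years, doubling_years=MOORE_DOUBLING_YEARS):
    """Improvement expected over `years` from a fixed doubling period."""
    return 2 ** (years / doubling_years)


def halving_time_years(cost_then, cost_now, years):
    """Average time for the cost to halve, given the overall drop."""
    return years / math.log2(cost_then / cost_now)


if __name__ == "__main__":
    print(f"Sequencing cost improvement: {improvement_factor(COST_THEN_USD, COST_NOW_USD):,.0f}x")
    print(f"Moore's Law over {YEARS_ELAPSED} years: {moore_factor(YEARS_ELAPSED):,.0f}x")
    print(f"Implied cost halving time: {halving_time_years(COST_THEN_USD, COST_NOW_USD, YEARS_ELAPSED):.2f} years")
```

Under these assumptions, the cost fell by a factor of about 4,000,000, against a factor of 32 for a two-year doubling, which implies the cost halved roughly every five and a half months.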

Last week, we highlighted the work being done in BonFIRE to test angles of incidence that maximize the destruction of cancer cells while harming as few healthy cells as possible. Illumina isn’t working on exactly this problem, but it is working on individual genomes to determine the causes of cancer. Dickinson argued that because everyone has a different genome and tumor growth is sparked by a malfunction in how cells process genetic instructions, personalizing cancer treatment means running individual genomes.

“Our solution was to build something called BaseSpace,” Dickinson explained as he delved deeper into how Illumina works with AWS. “In the labs we connect the instruments to BaseSpace using standard internet connections. It turns out that even though they produce a lot of data, they do it at a relatively steady pace.”

Scientists like to keep the raw data of every genome that is sequenced, a commitment that requires approximately 120 GB per genome. One might expect a genome, which consists of about 3 billion bases, to require significantly more than 120 GB to unravel. However, since humans are genetically very similar, with variation among individuals representing only about 0.1 percent of the genetic signature, the dataset can be pared down to that 120 GB level. Once that’s done, according to Dickinson, Illumina can comfortably transfer the data to AWS through BaseSpace at a rate of about 7 Mbps.
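For a sense of scale, the Python sketch below estimates how long one genome’s worth of data would take to upload at the quoted rate. It assumes decimal units (1 GB = 10^9 bytes, 1 Mbps = 10^6 bits per second), a sustained 7 Mbps, and no protocol overhead; none of these details were specified in the talk.

```python
def transfer_time_seconds(size_gb, rate_mbps):
    """Seconds needed to move `size_gb` gigabytes at `rate_mbps` megabits/s.

    Assumes decimal units and a perfectly sustained rate with no overhead.
    """
    bits = size_gb * 1e9 * 8          # gigabytes -> bits
    return bits / (rate_mbps * 1e6)   # bits / (bits per second)


if __name__ == "__main__":
    seconds = transfer_time_seconds(120, 7)
    print(f"{seconds / 3600:.1f} hours")  # about 38.1 hours
```

At those figures a single 120 GB dataset takes roughly a day and a half of continuous transfer, which is workable if the data is uploaded steadily as the instruments produce it rather than in one burst at the end.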

Beyond storing genomes and running experimental tests on them, the cloud, and AWS in particular, hopes to facilitate scientific collaboration. Today, the top method for sending massive datasets is mailing physical hard drives, according to Dickinson. The hope is that the cloud will someday become the first choice for delivering massive datasets, such as those in genome sequencing, to other facilities, and Illumina is one of the life science companies pushing that paradigm.

Of course, there are more examples of institutions running HPC applications on AWS, as Shameen explained. Among them is Pfizer, which uses the Amazon Virtual Private Cloud to run pharmaceutical computational experiments in an extra-secure environment. Globus is a genomics company that, like Illumina, transfers its data to AWS, in this case over the Amazon-implemented Galaxy platform. Further, Shameen pointed to Harvard Medical School as an early adopter of AWS for excess and experimental HPC workloads.

As Illumina shows, running experimental HPC applications on a cloud service like AWS is gaining traction, especially in genomics and the life sciences.

Related Articles

The Science Cloud Cometh

Throwing Cancer on the BonFIRE

CERN, Google, and the Future of Global Science Initiatives