July 24, 2013

Coding Bioinformatics into the Cloud

Ian Armas Foster

A team composed primarily of researchers from the Technical University of Munich published a paper this month describing a system that brings a bioinformatics toolset, including its high performance applications, to the cloud.

“On demand servers in the cloud promise to fit computing power to most tasks economically, and without a fair portion of the usual worries of system management: hardware purchasing, recruiting a system manager, high availability issues, and so forth,” the report noted in discussing some of the benefits of the cloud for compute-heavy problems like genomics. For the team, the difficulty starts with porting a toolkit that was designed to run on non-virtualized systems.

“One problem remains: how to get the often adhoc analysis toolset from the desktop environment into the cloud?” the report asks. The team answers that question by offering the PredictProtein bioinformatics toolkit on an operating system called Cloud BioLinux. “Directly addressing this problem, here we report the first Debian package release of the protein feature prediction toolset ‘PredictProtein,’ developed at the Rost Lab.”

Third Wave Systems, a Minnesota-based manufacturing company, recently answered a similar question regarding specialized HPC cloud services by developing its own toolkit for cloud-based high performance manufacturing applications.

The paper starts with an assertion that will not surprise those familiar with the HPC cloud space: cloud computing is gaining greater acceptance in the bioinformatics realm. The focus here has largely been on genomics, as the massive datasets associated with genomics are hypothetically easier to access and utilize when input into a cloud.

The attractiveness of cloud computing for bioinformatics runs deeper. According to the team, cost analytics models favor sustainability in the cloud. More importantly, the paper cites a commonly noted advantage of cloud computing: the ability to offload peak workloads without needing large in-house clusters.

“The rate of data generation of ‘next generation’ sequencing (NGS) drives the efforts to turn to cloud computing as a solution to handling peak-time loads, without the need to maintain large clusters,” they argue. Finally, according to the team, high performance functionality is already within reach. “Cloud-enabled bioinformatics tools are now available in the context of high throughput sequencing and genomics.”

Those tools, as mentioned earlier, come in the PredictProtein toolset, an open source package that has been around and improved upon for several years. Specifically, according to the report, “The PredictProtein cloud solution builds upon the open source operating system Debian and provides its functionality as a set of free software packages.”

They tested the system on two bioinformatics use cases, both of which focused on dealing with the “increase in the protein annotation gap.” Per the report, only 500 thousand of the 35 million sequences in a database called the UniProt Knowledgebase have been annotated.

For the first use case, they set up an in-house compute system along with a cluster in an OpenNebula cloud provided by CSC Finland. Over the course of a couple of years, they combined those resources, with resource and workload management apparently done by hand. “Grid job submissions to the local and the cloud grid were manually adjusted according to available resources. Over 9 million disorder predictions were made over the course of the past few years,” the report stated.

The second was more cloud intensive, set up on the Amazon Elastic Compute Cloud (EC2), a provider familiar to scientific HPC users. Here, they were able to allocate workloads through the Grid Engine. “Because every protein can be computed independently, we formed a grid job out of each protein and used the Grid Engine (GE) to distribute work on the machine instances,” the report noted on the workload management point.
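The one-job-per-protein pattern the report describes can be illustrated with a short Python sketch. It is not the team's actual pipeline: the wrapper script name `run_prediction.sh`, the `pp_` job-name prefix, and the per-protein file naming are hypothetical placeholders, and the sketch only builds `qsub` command lines rather than submitting them.

```python
import shlex


def parse_fasta(text):
    """Split FASTA text into (identifier, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Use the first token of the header as the ID; fall back to a counter.
            header = line[1:].split()
            name = header[0] if header else f"seq{len(records) + 1}"
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def build_job_commands(records, tool_cmd=("run_prediction.sh",)):
    """Return one qsub argument list per protein; nothing is submitted here.

    tool_cmd is a hypothetical wrapper script, not a documented PredictProtein
    command line.
    """
    jobs = []
    for name, _sequence in records:
        # -N names the job, -cwd runs it in the current directory,
        # -b y tells Grid Engine the command is an executable, not a job script.
        jobs.append(
            ["qsub", "-N", f"pp_{name}", "-cwd", "-b", "y", *tool_cmd, f"{name}.fasta"]
        )
    return jobs


example = ">P1 demo protein\nMKT\nAYI\n>P2\nGGS\n"
jobs = build_job_commands(parse_fasta(example))
for job in jobs:
    print(shlex.join(job))
```

Because each protein becomes its own job, the scheduler is free to spread the work across however many machine instances are available, which is what makes the workload a natural fit for an elastic cluster.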

Each individual PredictProtein that was to be analyzed took up 28 GB of space. Due to the large data requirements and the intensity of the computing that was to take place, the group opted for high performance instances. “We used StarCluster to automate grid setup on the EC2. Because a lot of CPU power was needed, the ‘Cluster Compute Eight Extra Large Instance’ was chosen,” the team said.

Each instance reportedly had 60.5 GB of memory and 88 EC2 Compute Units, supplied by two eight-core Intel Xeon E5-2670 processors. In total, each instance provided 3370 GB of storage.

Over the course of their study, they were able to process precisely 29,036 sequences with over 16 million residues. According to the team, “this amounted to predicting the functional effect of 315,753,552 individual amino acid changes.”
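The two figures quoted above fit together if each residue is assumed to be swapped for each of the 19 other standard amino acids. That factor is an inference from the numbers, not something stated in the passage. A quick Python check under that assumption:

```python
# Assumption: every residue is substituted with each of the other
# 19 standard amino acids (20 in total, minus the original one).
STANDARD_AMINO_ACIDS = 20
substitutions_per_residue = STANDARD_AMINO_ACIDS - 1

reported_changes = 315_753_552  # figure quoted from the report
reported_sequences = 29_036     # figure quoted from the report

# If the assumption holds, the changes divide evenly by 19.
residues, remainder = divmod(reported_changes, substitutions_per_residue)
average_length = residues / reported_sequences

print(f"implied residues: {residues:,} (remainder {remainder})")
print(f"implied average sequence length: {average_length:.1f}")
```

The division comes out even at 16,618,608 residues, which matches the "over 16 million residues" figure and implies an average sequence length of a little over 570 residues.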

Like Third Wave Systems, discussed earlier this month, this group looks to provide access to HPC capabilities through a specialized, cloud-available toolkit. For TWS, that toolkit dealt with drill manufacturing, while this one falls in the scientific research field of bioinformatics and genomics. As such, it makes sense that the former is a commercially available SaaS, while this example is an open source package developed over years by computational biologists.

Either way, as cloud gains acceptance as a medium for high performance applications, more institutions are developing software to facilitate those applications in the cloud.