Storage and data management have become the perhaps the most challenging computational bottlenecks in life sciences (LS) research. The volume and diversity of data generated by modern life science lab instruments and the varying requirements of analysis applications make creating effective solutions far from trivial. What’s more, where LS was once adequately served by conventional cluster technology, HPC is now becoming important – one estimate is 25% of bench scientists will require HPC resources in 2015.
Currently, the emphasis is on sequence data analysis although imaging data is quickly joining the fray. Sometimes the sequence data is generated in one place and largely kept there – think of major biomedical research and sequencing centers such as the Broad Institute and Wellcome Trust Sanger Institute. Other times, the data is generated by thousands of far-flung researchers whose results must be pooled to optimize LS community benefit – think of the Cancer Genome Hub (CGHub) at UC Santa Cruz, which now holds about 2.3 petabytes of data, all contributed from researchers spread worldwide.
Given the twin imperatives of collaboration and faster analysis turnaround times, optimizing storage system performance is a high priority. Complicating the effort is the fact that genomics analysis workflows are themselves complicated and each step can be IO or CPU intensive and involve repetitively reading and writing many large files to and from disk. Beyond the need to scale storage capacity to support what can be petabytes of data in a single laboratory or organization, there is usually a need for a high-performance distributed file system to take advantage of today’s high core density, multi-server compute clusters.
Broadly speaking, accelerating genomics analysis pipelines can be tricky. CPU and memory issues are typically easier to resolve. Disk throughput is often the most difficult variable to tweak and researchers report it’s not always clear which combination of disk technology and distributed file system (NFS, GlusterFS, Lustre, PanFS, etc.) will produce the best results. IO is especially problematic.
Alignment and de-duplication, for example, is usually a multi-step disk intensive process: Perform alignment and write BAM file to disk, sort original BAM file to disk, deduplicate BAM file to disk. Researchers are using a full arsenal of approaches – powerful hardware, parallelization, algorithm refinement, storage system optimization – to accelerate throughput. Simply put, storage infrastructures must address two general areas:
- The infrastructure must be capable of handling the volume of data being generated by today’s lab equipment. Modern DNA sequencers already produce up to a few hundred TB per instrument per year, a rate that is expected to grow 100-fold as capacities increase and more annotation data is captured. With many genomics workflows, many terabytes of data must routinely be moved from the DNA sequencing machines that generate the data to the computational component that performs the DNA alignment, assembly, and subsequent genomic analysis.
- The analysis process, multiple tools are used on the data in its various forms. Each of the tools has different IO characteristics. For example, in a typical workflow, the output data from a sequencer might be prepared in some way, partitioning it into smaller working packages for an initial analysis. This type of operation involves many read/writes and is IO bound. The output from that step might then be used to perform a read alignment. This operation is CPU-bound. Finally, the work done on the smaller packages of data needs to be sorted and merged, aggregating the results into one file. This process requires many temporary file writes and is IO bound.
One the compute side, there are a variety of solutions available to help meet the raw processing demands of today’s genomics analysis workflows. Organizations can select high performance servers with multi-core, multi-threaded processors; systems with large amounts of shared memory; analytics nodes with lots of high-speed flash; systems that make use of in-memory processing; and servers that take advantage of co-processors or other forms of hardware-based acceleration.
One the storage side, the choices can be more limiting. Life sciences code, as noted earlier, tends to be IO bound. There are large numbers of rapid, read/write calls. The throughput demands per core can easily exceed practical IO limitations. Given the size and number of files being moved in genomic analysis workflows, traditional NAS storage solutions and NFS-based file systems frequently don’t scale out adequately and slow performance. High performance parallel file systems such as Lustre and the General Parallel File System (GPFS) are often needed.
Determining the right file system for use isn’t always straightforward. One example of this challenge is a project undertaken by a major biomedical research organization seeking to conduct whole genome sequencing (WGS) analysis on 1,500 subjects; that translated into 110 terabytes (TB) of data with each whole genome sample accounting for about 75GB. Samples were processed in batches of 75 to optimize throughput, requiring about 5TB of data to be read and written to disk multiple times during the 96 hour processing workflow, with intermediate files adding another 5TB to the I/O load.
Many of processing steps were IO intensive and involved reading and writing large 100GB BAM files to and from disk. These did not scale well. Several strategies were tried (e.g., upgrading the network bandwidth, minimizing IO operations, improving workload splitting). Despite the I/O improvements, significant bottlenecks remained in running disk intensive processes at scale. Specifically the post-alignment processing slowed down on NFS shared file systems due to a high number of concurrent file writes. In this instance, switching to Lustre delivered a threefold improvement in write performance.
Conversely Purdue chose GPFS during an upgrade of its cyberinfrastructure which serves a large community of very diverse domains.
“We have researchers pulling in data from instruments to a scratch file and this may be the sole repository of their data for several months while they are analyzing it, cleaning the data, and haven’t yet put it into archives,” said Mike Shuey, research infrastructure architect at Purdue. “We are taking advantage of a couple of GPFS RAS (reliability, availability, and serviceability) features, specifically data replication and snapshot capabilities to protect against site-wide failure and to protect against accidental data deletion. While Lustre is great for other workloads – and we use it in some places – it doesn’t have those sorts of features right now,” said Shuey.
LS processing requirements – a major portion of Purdue’s research activity – can be problematic in a mixed-use environment. Shuey noted LS workflows often have millions of tiny files whose IO access requirements can interfere with the more typical IO stream of simulation applications; larger files in a mechanical engineering simulation, for example, can be slowed by accesses to these millions of tiny files from a life sciences workflow. Purdue adopted deployed DataDirect Networks acceleration technology to help cope with this issue.
Two relatively new technologies that continue to gain traction are Hadoop and iRODS.
Hadoop, of course, uses a distributed file system and framework (MapReduce) to break large data sets into chunks, to distribute/store (Map) those chunks to nodes in a cluster, and to gather (Reduce) results following computation. Hadoop’s distinguishing feature is it automatically stores the chunks of data on the same nodes on which they will be processed. This strategy of co-locating of data and processing power (proximity computing) significantly accelerates performance.
It also turns out that Hadoop architecture is a good choice for many life sciences applications. This is largely because so much of life sciences data is semi- or unstructured file-based data and ideally suited for ‘embarrassingly parallel’ computation. Moreover, the use of commodity hardware (e.g. Linux cluster) keeps cost down, and little or no hardware modification is required. Conversely issues remain, say some observers.
“[W]hile genome scientists have adopted the concept of MapReduce for parallelizing IO, they have not embraced the Hadoop ecosystem. For example, the popular Genome Analysis Toolkit (GATK) uses a proprietary MapReduce implementation that can scale vertically but not horizontally…Efforts exist for adapting existing genomics data structures to Hadoop, but these don’t support the full range of analytic requirements,” noted Alan Day (principal data scientist) and Sungwook Yoon (data scientist) at vendor MapR, in a blog post during the Strata & Hadoop World conference, held earlier this month.
MapR’s approach is to implement an end-to-end analysis pipeline based on GATK and running on Hadoop. “The benefit of combining GATK and Hadoop is two-fold. First, Hadoop provides a more cost-effective solution than a traditional HPC+SAN substrate. Second, Hadoop applications are much easier for software engineers to design and scale,” they wrote, adding the MapR solution follows Hadoop and the GATK best practices. They argue results can be generated on easily available hardware and users can expect immediate ROI by moving existing GATK use cases to Hadoop.
iRODS solves a different challenge. It is a data grid technology that essentially puts a unified namespace on data files, regardless of where those files are physically located. You may have files in four or five different storage systems, but to the user it appears as one directory tree. iRODS also allows setting enforcement rules on any access to the data or submission of data. For example, if someone entered data into the system, that might trigger a rule to replicate the data to another system and compress it at the same time. Access protection rules based on metadata about a file can be set.
At the Renaissance Computing Institute (RENCI) of the University of North Carolina, iRODS has been used in several aspects of its genomics analysis pipeline. When analytical pipelines are processing the data they also register that data into iRODS, according to Charles Schmitt, director of informatics, RENCI[iv]. At the end of the pipeline, the data exists on disks and is registered into iRODS. Anyone wanting to use the data must come in through iRODS to get the data; this allows RENCI to set policies on access and data use.
Broad benefits cited by the iRODS consortium include:
- iRODS enables data discovery using a metadata catalog that describes every file, every directory, and every storage resource in the data grid.
- iRODS automates data workflows, with a rule engine that permits any action to be initiated by any trigger on any server or client in the grid.
- iRODS enables secure collaboration, so users only need to log in to their home grid to access data hosted on a remote grid.
- iRODS implements data virtualization, allowing access to distributed storage assets under a unified namespace, and freeing organizations from getting locked in to single-vendor storage solutions.
It’s worth noting that RENCI has been an important participant in iRODS consortium whose members include, for example, Seagate (NASDAQ: STX), DDN, Novartis (NYSE: NVS), IBM(NYSE: IBM), Wellcome Trust Sanger Institute, and EMC (NYSE: EMC).
 RENCI white paper, Life sciences at RENCI: Big Data IT to manage, decipher, and inform, http://www.emc.com/collateral/white-papers/h11692-life-sciences-renci-big-data-manage-info-wp.pdf