Stepping Up to the Life Science Storage System Challenge

By Louis Vistola

October 5, 2015

Storage and data management have become the perhaps the most challenging computational bottlenecks in life sciences (LS) research. The volume and diversity of data generated by modern life science lab instruments and the varying requirements of analysis applications make creating effective solutions far from trivial. What’s more, where LS was once adequately served by conventional cluster technology, HPC is now becoming important – one estimate is 25% of bench scientists will require HPC resources in 2015.

Currently, the emphasis is on sequence data analysis although imaging data is quickly joining the fray. Sometimes the sequence data is generated in one place and largely kept there – think of major biomedical research and sequencing centers such as the Broad Institute and Wellcome Trust Sanger Institute. Other times, the data is generated by thousands of far-flung researchers whose results must be pooled to optimize LS community benefit – think of the Cancer Genome Hub (CGHub) at UC Santa Cruz, which now holds about 2.3 petabytes of data, all contributed from researchers spread worldwide.

Given the twin imperatives of collaboration and faster analysis turnaround times, optimizing storage system performance is a high priority. Complicating the effort is the fact that genomics analysis workflows are themselves complicated and each step can be IO or CPU intensive and involve repetitively reading and writing many large files to and from disk. Beyond the need to scale storage capacity to support what can be petabytes of data in a single laboratory or organization, there is usually a need for a high-performance distributed file system to take advantage of today’s high core density, multi-server compute clusters.

Broadly speaking, accelerating genomics analysis pipelines can be tricky. CPU and memory issues are typically easier to resolve. Disk throughput is often the most difficult variable to tweak and researchers report it’s not always clear which combination of disk technology and distributed file system (NFS, GlusterFS, Lustre, PanFS, etc.) will produce the best results. IO is especially problematic.

Alignment and de-duplication, for example, is usually a multi-step disk intensive process: Perform alignment and write BAM file to disk, sort original BAM file to disk, deduplicate BAM file to disk. Researchers are using a full arsenal of approaches – powerful hardware, parallelization, algorithm refinement, storage system optimization – to accelerate throughput. Simply put, storage infrastructures must address two general areas:

  • The infrastructure must be capable of handling the volume of data being generated by today’s lab equipment. Modern DNA sequencers already produce up to a few hundred TB per instrument per year, a rate that is expected to grow 100-fold as capacities increase and more annotation data is captured. With many genomics workflows, many terabytes of data must routinely be moved from the DNA sequencing machines that generate the data to the computational component that performs the DNA alignment, assembly, and subsequent genomic analysis.
  • The analysis process, multiple tools are used on the data in its various forms. Each of the tools has different IO characteristics. For example, in a typical workflow, the output data from a sequencer might be prepared in some way, partitioning it into smaller working packages for an initial analysis. This type of operation involves many read/writes and is IO bound. The output from that step might then be used to perform a read alignment. This operation is CPU-bound. Finally, the work done on the smaller packages of data needs to be sorted and merged, aggregating the results into one file. This process requires many temporary file writes and is IO bound.

One the compute side, there are a variety of solutions available to help meet the raw processing demands of today’s genomics analysis workflows. Organizations can select high performance servers with multi-core, multi-threaded processors; systems with large amounts of shared memory; analytics nodes with lots of high-speed flash; systems that make use of in-memory processing; and servers that take advantage of co-processors or other forms of hardware-based acceleration.

One the storage side, the choices can be more limiting. Life sciences code, as noted earlier, tends to be IO bound. There are large numbers of rapid, read/write calls. The throughput demands per core can easily exceed practical IO limitations. Given the size and number of files being moved in genomic analysis workflows, traditional NAS storage solutions and NFS-based file systems frequently don’t scale out adequately and slow performance. High performance parallel file systems such as Lustre and the General Parallel File System (GPFS) are often needed.

Determining the right file system for use isn’t always straightforward. One example of this challenge is a project undertaken by a major biomedical research organization seeking to conduct whole genome sequencing (WGS) analysis on 1,500 subjects; that translated into 110 terabytes (TB) of data with each whole genome sample accounting for about 75GB. Samples were processed in batches of 75 to optimize throughput, requiring about 5TB of data to be read and written to disk multiple times during the 96 hour processing workflow, with intermediate files adding another 5TB to the I/O load.

Many of processing steps were IO intensive and involved reading and writing large 100GB BAM files to and from disk. These did not scale well. Several strategies were tried (e.g., upgrading the network bandwidth, minimizing IO operations, improving workload splitting). Despite the I/O improvements, significant bottlenecks remained in running disk intensive processes at scale. Specifically the post-alignment processing slowed down on NFS shared file systems due to a high number of concurrent file writes. In this instance, switching to Lustre delivered a threefold improvement in write performance.

Conversely Purdue chose GPFS during an upgrade of its cyberinfrastructure which serves a large community of very diverse domains.

“We have researchers pulling in data from instruments to a scratch file and this may be the sole repository of their data for several months while they are analyzing it, cleaning the data, and haven’t yet put it into archives,” said Mike Shuey, research infrastructure architect at Purdue. “We are taking advantage of a couple of GPFS RAS (reliability, availability, and serviceability) features, specifically data replication and snapshot capabilities to protect against site-wide failure and to protect against accidental data deletion. While Lustre is great for other workloads – and we use it in some places – it doesn’t have those sorts of features right now,” Vertical Focus: HPC in BioITsaid Shuey.

LS processing requirements – a major portion of Purdue’s research activity – can be problematic in a mixed-use environment. Shuey noted LS workflows often have millions of tiny files whose IO access requirements can interfere with the more typical IO stream of simulation applications; larger files in a mechanical engineering simulation, for example, can be slowed by accesses to these millions of tiny files from a life sciences workflow. Purdue adopted deployed DataDirect Networks acceleration technology to help cope with this issue.

Two relatively new technologies that continue to gain traction are Hadoop and iRODS.

hadoop_logo.jpgHadoop, of course, uses a distributed file system and framework (MapReduce) to break large data sets into chunks, to distribute/store (Map) those chunks to nodes in a cluster, and to gather (Reduce) results following computation. Hadoop’s distinguishing feature is it automatically stores the chunks of data on the same nodes on which they will be processed. This strategy of co-locating of data and processing power (proximity computing) significantly accelerates performance.

It also turns out that Hadoop architecture is a good choice for many life sciences applications. This is largely because so much of life sciences data is semi- or unstructured file-based data and ideally suited for ‘embarrassingly parallel’ computation. Moreover, the use of commodity hardware (e.g. Linux cluster) keeps cost down, and little or no hardware modification is required. Conversely issues remain, say some observers.

“[W]hile genome scientists have adopted the concept of MapReduce for parallelizing IO, they have not embraced the Hadoop ecosystem. For example, the popular Genome Analysis Toolkit (GATK) uses a proprietary MapReduce implementation that can scale vertically but not horizontally…Efforts exist for adapting existing genomics data structures to Hadoop, but these don’t support the full range of analytic requirements,” noted Alan Day (principal data scientist) and Sungwook Yoon (data scientist) at vendor MapR, in a blog post during the Strata & Hadoop World conference, held earlier this month.

MapR’s approach is to implement an end-to-end analysis pipeline based on GATK and running on Hadoop. “The benefit of combining GATK and Hadoop is two-fold. First, Hadoop provides a more cost-effective solution than a traditional HPC+SAN substrate. Second, Hadoop applications are much easier for software engineers to design and scale,” they wrote, adding the MapR solution follows Hadoop and the GATK best practices. They argue results can be generated on easily available hardware and users can expect immediate ROI by moving existing GATK use cases to Hadoop.

iRODS-Logo.2015iRODS solves a different challenge. It is a data grid technology that essentially puts a unified namespace on data files, regardless of where those files are physically located. You may have files in four or five different storage systems, but to the user it appears as one directory tree. iRODS also allows setting enforcement rules on any access to the data or submission of data. For example, if someone entered data into the system, that might trigger a rule to replicate the data to another system and compress it at the same time. Access protection rules based on metadata about a file can be set.[1]

At the Renaissance Computing Institute (RENCI) of the University of North Carolina, iRODS has been used in several aspects of its genomics analysis pipeline. When analytical pipelines are processing the data they also register that data into iRODS, according to Charles Schmitt, director of informatics, RENCI[iv]. At the end of the pipeline, the data exists on disks and is registered into iRODS. Anyone wanting to use the data must come in through iRODS to get the data; this allows RENCI to set policies on access and data use.

Broad benefits cited by the iRODS consortium include:

  • iRODS enables data discovery using a metadata catalog that describes every file, every directory, and every storage resource in the data grid.
  • iRODS automates data workflows, with a rule engine that permits any action to be initiated by any trigger on any server or client in the grid.
  • iRODS enables secure collaboration, so users only need to log in to their home grid to access data hosted on a remote grid.
  • iRODS implements data virtualization, allowing access to distributed storage assets under a unified namespace, and freeing organizations from getting locked in to single-vendor storage solutions.

It’s worth noting that RENCI has been an important participant in iRODS consortium whose members include, for example, Seagate (NASDAQ: STX), DDN, Novartis (NYSE: NVS), IBM(NYSE: IBM), Wellcome Trust Sanger Institute, and EMC (NYSE: EMC).


[1] RENCI white paper, Life sciences at RENCI: Big Data IT to manage, decipher, and inform,


Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industy updates delivered to you every week!

Researchers Scale COSMO Climate Code to 4888 GPUs on Piz Daint

October 17, 2017

Effective global climate simulation, sorely needed to anticipate and cope with global warming, has long been computationally challenging. Two of the major obstacles are the needed resolution and prolonged time to compute Read more…

By John Russell

Student Cluster Competition Coverage New Home

October 16, 2017

Hello computer sports fans! This is the first of many (many!) articles covering the world-wide phenomenon of Student Cluster Competitions. Finally, the Student Cluster Competition coverage has come to its natural home: H Read more…

By Dan Olds

UCSD Web-based Tool Tracking CA Wildfires Generates 1.5M Views

October 16, 2017

Tracking the wildfires raging in northern CA is an unpleasant but necessary part of guiding efforts to fight the fires and safely evacuate affected residents. One such tool – Firemap – is a web-based tool developed b Read more…

By John Russell

HPE Extreme Performance Solutions

Transforming Genomic Analytics with HPC-Accelerated Insights

Advancements in the field of genomics are revolutionizing our understanding of human biology, rapidly accelerating the discovery and treatment of genetic diseases, and dramatically improving human health. Read more…

Exascale Imperative: New Movie from HPE Makes a Compelling Case

October 13, 2017

Why is pursuing exascale computing so important? In a new video – Hewlett Packard Enterprise: Eighteen Zeros – four HPE executives, a prominent national lab HPC researcher, and HPCwire managing editor Tiffany Trader Read more…

By John Russell

Student Cluster Competition Coverage New Home

October 16, 2017

Hello computer sports fans! This is the first of many (many!) articles covering the world-wide phenomenon of Student Cluster Competitions. Finally, the Student Read more…

By Dan Olds

Intel Delivers 17-Qubit Quantum Chip to European Research Partner

October 10, 2017

On Tuesday, Intel delivered a 17-qubit superconducting test chip to research partner QuTech, the quantum research institute of Delft University of Technology (TU Delft) in the Netherlands. The announcement marks a major milestone in the 10-year, $50-million collaborative relationship with TU Delft and TNO, the Dutch Organization for Applied Research, to accelerate advancements in quantum computing. Read more…

By Tiffany Trader

Fujitsu Tapped to Build 37-Petaflops ABCI System for AIST

October 10, 2017

Fujitsu announced today it will build the long-planned AI Bridging Cloud Infrastructure (ABCI) which is set to become the fastest supercomputer system in Japan Read more…

By John Russell

HPC Chips – A Veritable Smorgasbord?

October 10, 2017

For the first time since AMD's ill-fated launch of Bulldozer the answer to the question, 'Which CPU will be in my next HPC system?' doesn't have to be 'Whichever variety of Intel Xeon E5 they are selling when we procure'. Read more…

By Dairsie Latimer

Delays, Smoke, Records & Markets – A Candid Conversation with Cray CEO Peter Ungaro

October 5, 2017

Earlier this month, Tom Tabor, publisher of HPCwire and I had a very personal conversation with Cray CEO Peter Ungaro. Cray has been on something of a Cinderell Read more…

By Tiffany Trader & Tom Tabor

Intel Debuts Programmable Acceleration Card

October 5, 2017

With a view toward supporting complex, data-intensive applications, such as AI inference, video streaming analytics, database acceleration and genomics, Intel i Read more…

By Doug Black

OLCF’s 200 Petaflops Summit Machine Still Slated for 2018 Start-up

October 3, 2017

The Department of Energy’s planned 200 petaflops Summit computer, which is currently being installed at Oak Ridge Leadership Computing Facility, is on track t Read more…

By John Russell

US Exascale Program – Some Additional Clarity

September 28, 2017

The last time we left the Department of Energy’s exascale computing program in July, things were looking very positive. Both the U.S. House and Senate had pas Read more…

By Alex R. Larzelere

How ‘Knights Mill’ Gets Its Deep Learning Flops

June 22, 2017

Intel, the subject of much speculation regarding the delayed, rewritten or potentially canceled “Aurora” contract (the Argonne Lab part of the CORAL “ Read more…

By Tiffany Trader

Reinders: “AVX-512 May Be a Hidden Gem” in Intel Xeon Scalable Processors

June 29, 2017

Imagine if we could use vector processing on something other than just floating point problems.  Today, GPUs and CPUs work tirelessly to accelerate algorithms Read more…

By James Reinders

NERSC Scales Scientific Deep Learning to 15 Petaflops

August 28, 2017

A collaborative effort between Intel, NERSC and Stanford has delivered the first 15-petaflops deep learning software running on HPC platforms and is, according Read more…

By Rob Farber

Oracle Layoffs Reportedly Hit SPARC and Solaris Hard

September 7, 2017

Oracle’s latest layoffs have many wondering if this is the end of the line for the SPARC processor and Solaris OS development. As reported by multiple sources Read more…

By John Russell

US Coalesces Plans for First Exascale Supercomputer: Aurora in 2021

September 27, 2017

At the Advanced Scientific Computing Advisory Committee (ASCAC) meeting, in Arlington, Va., yesterday (Sept. 26), it was revealed that the "Aurora" supercompute Read more…

By Tiffany Trader

Google Releases Deeplearn.js to Further Democratize Machine Learning

August 17, 2017

Spreading the use of machine learning tools is one of the goals of Google’s PAIR (People + AI Research) initiative, which was introduced in early July. Last w Read more…

By John Russell

GlobalFoundries Puts Wind in AMD’s Sails with 12nm FinFET

September 24, 2017

From its annual tech conference last week (Sept. 20), where GlobalFoundries welcomed more than 600 semiconductor professionals (reaching the Santa Clara venue Read more…

By Tiffany Trader

Graphcore Readies Launch of 16nm Colossus-IPU Chip

July 20, 2017

A second $30 million funding round for U.K. AI chip developer Graphcore sets up the company to go to market with its “intelligent processing unit” (IPU) in Read more…

By Tiffany Trader

Leading Solution Providers

Amazon Debuts New AMD-based GPU Instances for Graphics Acceleration

September 12, 2017

Last week Amazon Web Services (AWS) streaming service, AppStream 2.0, introduced a new GPU instance called Graphics Design intended to accelerate graphics. The Read more…

By John Russell

Nvidia Responds to Google TPU Benchmarking

April 10, 2017

Nvidia highlights strengths of its newest GPU silicon in response to Google's report on the performance and energy advantages of its custom tensor processor. Read more…

By Tiffany Trader

EU Funds 20 Million Euro ARM+FPGA Exascale Project

September 7, 2017

At the Barcelona Supercomputer Centre on Wednesday (Sept. 6), 16 partners gathered to launch the EuroEXA project, which invests €20 million over three-and-a-half years into exascale-focused research and development. Led by the Horizon 2020 program, EuroEXA picks up the banner of a triad of partner projects — ExaNeSt, EcoScale and ExaNoDe — building on their work... Read more…

By Tiffany Trader

Delays, Smoke, Records & Markets – A Candid Conversation with Cray CEO Peter Ungaro

October 5, 2017

Earlier this month, Tom Tabor, publisher of HPCwire and I had a very personal conversation with Cray CEO Peter Ungaro. Cray has been on something of a Cinderell Read more…

By Tiffany Trader & Tom Tabor

Cray Moves to Acquire the Seagate ClusterStor Line

July 28, 2017

This week Cray announced that it is picking up Seagate's ClusterStor HPC storage array business for an undisclosed sum. "In short we're effectively transitioning the bulk of the ClusterStor product line to Cray," said CEO Peter Ungaro. Read more…

By Tiffany Trader

Intel Launches Software Tools to Ease FPGA Programming

September 5, 2017

Field Programmable Gate Arrays (FPGAs) have a reputation for being difficult to program, requiring expertise in specialty languages, like Verilog or VHDL. Easin Read more…

By Tiffany Trader

IBM Advances Web-based Quantum Programming

September 5, 2017

IBM Research is pairing its Jupyter-based Data Science Experience notebook environment with its cloud-based quantum computer, IBM Q, in hopes of encouraging a new class of entrepreneurial user to solve intractable problems that even exceed the capabilities of the best AI systems. Read more…

By Alex Woodie

Intel, NERSC and University Partners Launch New Big Data Center

August 17, 2017

A collaboration between the Department of Energy’s National Energy Research Scientific Computing Center (NERSC), Intel and five Intel Parallel Computing Cente Read more…

By Linda Barney

  • arrow
  • Click Here for More Headlines
  • arrow
Share This