Stepping Up to the Life Science Storage System Challenge

By Louis Vistola

October 5, 2015

Storage and data management have become the perhaps the most challenging computational bottlenecks in life sciences (LS) research. The volume and diversity of data generated by modern life science lab instruments and the varying requirements of analysis applications make creating effective solutions far from trivial. What’s more, where LS was once adequately served by conventional cluster technology, HPC is now becoming important – one estimate is 25% of bench scientists will require HPC resources in 2015.

Currently, the emphasis is on sequence data analysis although imaging data is quickly joining the fray. Sometimes the sequence data is generated in one place and largely kept there – think of major biomedical research and sequencing centers such as the Broad Institute and Wellcome Trust Sanger Institute. Other times, the data is generated by thousands of far-flung researchers whose results must be pooled to optimize LS community benefit – think of the Cancer Genome Hub (CGHub) at UC Santa Cruz, which now holds about 2.3 petabytes of data, all contributed from researchers spread worldwide.

Given the twin imperatives of collaboration and faster analysis turnaround times, optimizing storage system performance is a high priority. Complicating the effort is the fact that genomics analysis workflows are themselves complicated and each step can be IO or CPU intensive and involve repetitively reading and writing many large files to and from disk. Beyond the need to scale storage capacity to support what can be petabytes of data in a single laboratory or organization, there is usually a need for a high-performance distributed file system to take advantage of today’s high core density, multi-server compute clusters.

Broadly speaking, accelerating genomics analysis pipelines can be tricky. CPU and memory issues are typically easier to resolve. Disk throughput is often the most difficult variable to tweak and researchers report it’s not always clear which combination of disk technology and distributed file system (NFS, GlusterFS, Lustre, PanFS, etc.) will produce the best results. IO is especially problematic.

Alignment and de-duplication, for example, is usually a multi-step disk intensive process: Perform alignment and write BAM file to disk, sort original BAM file to disk, deduplicate BAM file to disk. Researchers are using a full arsenal of approaches – powerful hardware, parallelization, algorithm refinement, storage system optimization – to accelerate throughput. Simply put, storage infrastructures must address two general areas:

  • The infrastructure must be capable of handling the volume of data being generated by today’s lab equipment. Modern DNA sequencers already produce up to a few hundred TB per instrument per year, a rate that is expected to grow 100-fold as capacities increase and more annotation data is captured. With many genomics workflows, many terabytes of data must routinely be moved from the DNA sequencing machines that generate the data to the computational component that performs the DNA alignment, assembly, and subsequent genomic analysis.
  • The analysis process, multiple tools are used on the data in its various forms. Each of the tools has different IO characteristics. For example, in a typical workflow, the output data from a sequencer might be prepared in some way, partitioning it into smaller working packages for an initial analysis. This type of operation involves many read/writes and is IO bound. The output from that step might then be used to perform a read alignment. This operation is CPU-bound. Finally, the work done on the smaller packages of data needs to be sorted and merged, aggregating the results into one file. This process requires many temporary file writes and is IO bound.

One the compute side, there are a variety of solutions available to help meet the raw processing demands of today’s genomics analysis workflows. Organizations can select high performance servers with multi-core, multi-threaded processors; systems with large amounts of shared memory; analytics nodes with lots of high-speed flash; systems that make use of in-memory processing; and servers that take advantage of co-processors or other forms of hardware-based acceleration.

One the storage side, the choices can be more limiting. Life sciences code, as noted earlier, tends to be IO bound. There are large numbers of rapid, read/write calls. The throughput demands per core can easily exceed practical IO limitations. Given the size and number of files being moved in genomic analysis workflows, traditional NAS storage solutions and NFS-based file systems frequently don’t scale out adequately and slow performance. High performance parallel file systems such as Lustre and the General Parallel File System (GPFS) are often needed.

Determining the right file system for use isn’t always straightforward. One example of this challenge is a project undertaken by a major biomedical research organization seeking to conduct whole genome sequencing (WGS) analysis on 1,500 subjects; that translated into 110 terabytes (TB) of data with each whole genome sample accounting for about 75GB. Samples were processed in batches of 75 to optimize throughput, requiring about 5TB of data to be read and written to disk multiple times during the 96 hour processing workflow, with intermediate files adding another 5TB to the I/O load.

Many of processing steps were IO intensive and involved reading and writing large 100GB BAM files to and from disk. These did not scale well. Several strategies were tried (e.g., upgrading the network bandwidth, minimizing IO operations, improving workload splitting). Despite the I/O improvements, significant bottlenecks remained in running disk intensive processes at scale. Specifically the post-alignment processing slowed down on NFS shared file systems due to a high number of concurrent file writes. In this instance, switching to Lustre delivered a threefold improvement in write performance.

Conversely Purdue chose GPFS during an upgrade of its cyberinfrastructure which serves a large community of very diverse domains.

“We have researchers pulling in data from instruments to a scratch file and this may be the sole repository of their data for several months while they are analyzing it, cleaning the data, and haven’t yet put it into archives,” said Mike Shuey, research infrastructure architect at Purdue. “We are taking advantage of a couple of GPFS RAS (reliability, availability, and serviceability) features, specifically data replication and snapshot capabilities to protect against site-wide failure and to protect against accidental data deletion. While Lustre is great for other workloads – and we use it in some places – it doesn’t have those sorts of features right now,” Vertical Focus: HPC in BioITsaid Shuey.

LS processing requirements – a major portion of Purdue’s research activity – can be problematic in a mixed-use environment. Shuey noted LS workflows often have millions of tiny files whose IO access requirements can interfere with the more typical IO stream of simulation applications; larger files in a mechanical engineering simulation, for example, can be slowed by accesses to these millions of tiny files from a life sciences workflow. Purdue adopted deployed DataDirect Networks acceleration technology to help cope with this issue.

Two relatively new technologies that continue to gain traction are Hadoop and iRODS.

hadoop_logo.jpgHadoop, of course, uses a distributed file system and framework (MapReduce) to break large data sets into chunks, to distribute/store (Map) those chunks to nodes in a cluster, and to gather (Reduce) results following computation. Hadoop’s distinguishing feature is it automatically stores the chunks of data on the same nodes on which they will be processed. This strategy of co-locating of data and processing power (proximity computing) significantly accelerates performance.

It also turns out that Hadoop architecture is a good choice for many life sciences applications. This is largely because so much of life sciences data is semi- or unstructured file-based data and ideally suited for ‘embarrassingly parallel’ computation. Moreover, the use of commodity hardware (e.g. Linux cluster) keeps cost down, and little or no hardware modification is required. Conversely issues remain, say some observers.

“[W]hile genome scientists have adopted the concept of MapReduce for parallelizing IO, they have not embraced the Hadoop ecosystem. For example, the popular Genome Analysis Toolkit (GATK) uses a proprietary MapReduce implementation that can scale vertically but not horizontally…Efforts exist for adapting existing genomics data structures to Hadoop, but these don’t support the full range of analytic requirements,” noted Alan Day (principal data scientist) and Sungwook Yoon (data scientist) at vendor MapR, in a blog post during the Strata & Hadoop World conference, held earlier this month.

MapR’s approach is to implement an end-to-end analysis pipeline based on GATK and running on Hadoop. “The benefit of combining GATK and Hadoop is two-fold. First, Hadoop provides a more cost-effective solution than a traditional HPC+SAN substrate. Second, Hadoop applications are much easier for software engineers to design and scale,” they wrote, adding the MapR solution follows Hadoop and the GATK best practices. They argue results can be generated on easily available hardware and users can expect immediate ROI by moving existing GATK use cases to Hadoop.

iRODS-Logo.2015iRODS solves a different challenge. It is a data grid technology that essentially puts a unified namespace on data files, regardless of where those files are physically located. You may have files in four or five different storage systems, but to the user it appears as one directory tree. iRODS also allows setting enforcement rules on any access to the data or submission of data. For example, if someone entered data into the system, that might trigger a rule to replicate the data to another system and compress it at the same time. Access protection rules based on metadata about a file can be set.[1]

At the Renaissance Computing Institute (RENCI) of the University of North Carolina, iRODS has been used in several aspects of its genomics analysis pipeline. When analytical pipelines are processing the data they also register that data into iRODS, according to Charles Schmitt, director of informatics, RENCI[iv]. At the end of the pipeline, the data exists on disks and is registered into iRODS. Anyone wanting to use the data must come in through iRODS to get the data; this allows RENCI to set policies on access and data use.

Broad benefits cited by the iRODS consortium include:

  • iRODS enables data discovery using a metadata catalog that describes every file, every directory, and every storage resource in the data grid.
  • iRODS automates data workflows, with a rule engine that permits any action to be initiated by any trigger on any server or client in the grid.
  • iRODS enables secure collaboration, so users only need to log in to their home grid to access data hosted on a remote grid.
  • iRODS implements data virtualization, allowing access to distributed storage assets under a unified namespace, and freeing organizations from getting locked in to single-vendor storage solutions.

It’s worth noting that RENCI has been an important participant in iRODS consortium whose members include, for example, Seagate (NASDAQ: STX), DDN, Novartis (NYSE: NVS), IBM(NYSE: IBM), Wellcome Trust Sanger Institute, and EMC (NYSE: EMC).

 

[1] RENCI white paper, Life sciences at RENCI: Big Data IT to manage, decipher, and inform, http://www.emc.com/collateral/white-papers/h11692-life-sciences-renci-big-data-manage-info-wp.pdf

 

Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industy updates delivered to you every week!

Bill Gropp – Pursuing the Next Big Thing at NCSA

March 28, 2017

About eight months ago Bill Gropp was elevated to acting director of the National Center for Supercomputing Applications (NCSA). Read more…

By John Russell

UK to Launch Six Major HPC Centers

March 27, 2017

Six high performance computing centers will be formally launched in the U.K. later this week intended to provide wider access to HPC resources to U.K. Read more…

By John Russell

AI in the News: Rao in at Intel, Ng out at Baidu, Nvidia on at Tencent Cloud

March 26, 2017

Just as AI has become the leitmotif of the advanced scale computing market, infusing much of the conversation about HPC in commercial and industrial spheres, it also is impacting high-level management changes in the industry. Read more…

By Doug Black

Scalable Informatics Ceases Operations

March 23, 2017

On the same day we reported on the uncertain future for HPC compiler company PathScale, we are sad to learn that another HPC vendor, Scalable Informatics, is closing its doors. Read more…

By Tiffany Trader

HPE Extreme Performance Solutions

Quants Achieving Maximum Compute Power without the Learning Curve

The financial services industry is a fast-paced and data-intensive environment, and financial firms are realizing that they must modernize their IT infrastructures and invest in high performance computing (HPC) tools in order to survive. Read more…

‘Strategies in Biomedical Data Science’ Advances IT-Research Synergies

March 23, 2017

“Strategies in Biomedical Data Science: Driving Force for Innovation” by Jay A. Etchings is both an introductory text and a field guide for anyone working with biomedical data. Read more…

By Tiffany Trader

HPC Compiler Company PathScale Seeks Life Raft

March 23, 2017

HPCwire has learned that HPC compiler company PathScale has fallen on difficult times and is asking the community for help or actively seeking a buyer for its assets. Read more…

By Tiffany Trader

Google Launches New Machine Learning Journal

March 22, 2017

On Monday, Google announced plans to launch a new peer review journal and “ecosystem” Read more…

By John Russell

Swiss Researchers Peer Inside Chips with Improved X-Ray Imaging

March 22, 2017

Peering inside semiconductor chips using x-ray imaging isn’t new, but the technique hasn’t been especially good or easy to accomplish. Read more…

By John Russell

Bill Gropp – Pursuing the Next Big Thing at NCSA

March 28, 2017

About eight months ago Bill Gropp was elevated to acting director of the National Center for Supercomputing Applications (NCSA). Read more…

By John Russell

HPC Compiler Company PathScale Seeks Life Raft

March 23, 2017

HPCwire has learned that HPC compiler company PathScale has fallen on difficult times and is asking the community for help or actively seeking a buyer for its assets. Read more…

By Tiffany Trader

Quantum Bits: D-Wave and VW; Google Quantum Lab; IBM Expands Access

March 21, 2017

For a technology that’s usually characterized as far off and in a distant galaxy, quantum computing has been steadily picking up steam. Read more…

By John Russell

Trump Budget Targets NIH, DOE, and EPA; No Mention of NSF

March 16, 2017

President Trump’s proposed U.S. fiscal 2018 budget issued today sharply cuts science spending while bolstering military spending as he promised during the campaign. Read more…

By John Russell

CPU-based Visualization Positions for Exascale Supercomputing

March 16, 2017

In this contributed perspective piece, Intel’s Jim Jeffers makes the case that CPU-based visualization is now widely adopted and as such is no longer a contrarian view, but is rather an exascale requirement. Read more…

By Jim Jeffers, Principal Engineer and Engineering Leader, Intel

US Supercomputing Leaders Tackle the China Question

March 15, 2017

Joint DOE-NSA report responds to the increased global pressures impacting the competitiveness of U.S. supercomputing. Read more…

By Tiffany Trader

New Japanese Supercomputing Project Targets Exascale

March 14, 2017

Another Japanese supercomputing project was revealed this week, this one from emerging supercomputer maker, ExaScaler Inc., and Keio University. The partners are working on an original supercomputer design with exascale aspirations. Read more…

By Tiffany Trader

Nvidia Debuts HGX-1 for Cloud; Announces Fujitsu AI Deal

March 9, 2017

On Monday Nvidia announced a major deal with Fujitsu to help build an AI supercomputer for RIKEN using 24 DGX-1 servers. Read more…

By John Russell

For IBM/OpenPOWER: Success in 2017 = (Volume) Sales

January 11, 2017

To a large degree IBM and the OpenPOWER Foundation have done what they said they would – assembling a substantial and growing ecosystem and bringing Power-based products to market, all in about three years. Read more…

By John Russell

Quantum Bits: D-Wave and VW; Google Quantum Lab; IBM Expands Access

March 21, 2017

For a technology that’s usually characterized as far off and in a distant galaxy, quantum computing has been steadily picking up steam. Read more…

By John Russell

Trump Budget Targets NIH, DOE, and EPA; No Mention of NSF

March 16, 2017

President Trump’s proposed U.S. fiscal 2018 budget issued today sharply cuts science spending while bolstering military spending as he promised during the campaign. Read more…

By John Russell

HPC Compiler Company PathScale Seeks Life Raft

March 23, 2017

HPCwire has learned that HPC compiler company PathScale has fallen on difficult times and is asking the community for help or actively seeking a buyer for its assets. Read more…

By Tiffany Trader

TSUBAME3.0 Points to Future HPE Pascal-NVLink-OPA Server

February 17, 2017

Since our initial coverage of the TSUBAME3.0 supercomputer yesterday, more details have come to light on this innovative project. Of particular interest is a new board design for NVLink-equipped Pascal P100 GPUs that will create another entrant to the space currently occupied by Nvidia's DGX-1 system, IBM's "Minsky" platform and the Supermicro SuperServer (1028GQ-TXR). Read more…

By Tiffany Trader

Tokyo Tech’s TSUBAME3.0 Will Be First HPE-SGI Super

February 16, 2017

In a press event Friday afternoon local time in Japan, Tokyo Institute of Technology (Tokyo Tech) announced its plans for the TSUBAME3.0 supercomputer, which will be Japan’s “fastest AI supercomputer,” Read more…

By Tiffany Trader

IBM Wants to be “Red Hat” of Deep Learning

January 26, 2017

IBM today announced the addition of TensorFlow and Chainer deep learning frameworks to its PowerAI suite of deep learning tools, which already includes popular offerings such as Caffe, Theano, and Torch. Read more…

By John Russell

Lighting up Aurora: Behind the Scenes at the Creation of the DOE’s Upcoming 200 Petaflops Supercomputer

December 1, 2016

In April 2015, U.S. Department of Energy Undersecretary Franklin Orr announced that Intel would be the prime contractor for Aurora: Read more…

By Jan Rowell

Leading Solution Providers

Is Liquid Cooling Ready to Go Mainstream?

February 13, 2017

Lost in the frenzy of SC16 was a substantial rise in the number of vendors showing server oriented liquid cooling technologies. Three decades ago liquid cooling was pretty much the exclusive realm of the Cray-2 and IBM mainframe class products. That’s changing. We are now seeing an emergence of x86 class server products with exotic plumbing technology ranging from Direct-to-Chip to servers and storage completely immersed in a dielectric fluid. Read more…

By Steve Campbell

Enlisting Deep Learning in the War on Cancer

December 7, 2016

Sometime in Q2 2017 the first ‘results’ of the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) will become publicly available according to Rick Stevens. He leads one of three JDACS4C pilot projects pressing deep learning (DL) into service in the War on Cancer. Read more…

By John Russell

BioTeam’s Berman Charts 2017 HPC Trends in Life Sciences

January 4, 2017

Twenty years ago high performance computing was nearly absent from life sciences. Today it’s used throughout life sciences and biomedical research. Genomics and the data deluge from modern lab instruments are the main drivers, but so is the longer-term desire to perform predictive simulation in support of Precision Medicine (PM). There’s even a specialized life sciences supercomputer, ‘Anton’ from D.E. Shaw Research, and the Pittsburgh Supercomputing Center is standing up its second Anton 2 and actively soliciting project proposals. There’s a lot going on. Read more…

By John Russell

HPC Startup Advances Auto-Parallelization’s Promise

January 23, 2017

The shift from single core to multicore hardware has made finding parallelism in codes more important than ever, but that hasn’t made the task of parallel programming any easier. Read more…

By Tiffany Trader

HPC Technique Propels Deep Learning at Scale

February 21, 2017

Researchers from Baidu’s Silicon Valley AI Lab (SVAIL) have adapted a well-known HPC communication technique to boost the speed and scale of their neural network training and now they are sharing their implementation with the larger deep learning community. Read more…

By Tiffany Trader

CPU Benchmarking: Haswell Versus POWER8

June 2, 2015

With OpenPOWER activity ramping up and IBM’s prominent role in the upcoming DOE machines Summit and Sierra, it’s a good time to look at how the IBM POWER CPU stacks up against the x86 Xeon Haswell CPU from Intel. Read more…

By Tiffany Trader

US Supercomputing Leaders Tackle the China Question

March 15, 2017

Joint DOE-NSA report responds to the increased global pressures impacting the competitiveness of U.S. supercomputing. Read more…

By Tiffany Trader

IDG to Be Bought by Chinese Investors; IDC to Spin Out HPC Group

January 19, 2017

US-based publishing and investment firm International Data Group, Inc. (IDG) will be acquired by a pair of Chinese investors, China Oceanwide Holdings Group Co., Ltd. Read more…

By Tiffany Trader

  • arrow
  • Click Here for More Headlines
  • arrow
Share This