Stepping Up to the Life Science Storage System Challenge

By Louis Vistola

October 5, 2015

Storage and data management have become the perhaps the most challenging computational bottlenecks in life sciences (LS) research. The volume and diversity of data generated by modern life science lab instruments and the varying requirements of analysis applications make creating effective solutions far from trivial. What’s more, where LS was once adequately served by conventional cluster technology, HPC is now becoming important – one estimate is 25% of bench scientists will require HPC resources in 2015.

Currently, the emphasis is on sequence data analysis although imaging data is quickly joining the fray. Sometimes the sequence data is generated in one place and largely kept there – think of major biomedical research and sequencing centers such as the Broad Institute and Wellcome Trust Sanger Institute. Other times, the data is generated by thousands of far-flung researchers whose results must be pooled to optimize LS community benefit – think of the Cancer Genome Hub (CGHub) at UC Santa Cruz, which now holds about 2.3 petabytes of data, all contributed from researchers spread worldwide.

Given the twin imperatives of collaboration and faster analysis turnaround times, optimizing storage system performance is a high priority. Complicating the effort is the fact that genomics analysis workflows are themselves complicated and each step can be IO or CPU intensive and involve repetitively reading and writing many large files to and from disk. Beyond the need to scale storage capacity to support what can be petabytes of data in a single laboratory or organization, there is usually a need for a high-performance distributed file system to take advantage of today’s high core density, multi-server compute clusters.

Broadly speaking, accelerating genomics analysis pipelines can be tricky. CPU and memory issues are typically easier to resolve. Disk throughput is often the most difficult variable to tweak and researchers report it’s not always clear which combination of disk technology and distributed file system (NFS, GlusterFS, Lustre, PanFS, etc.) will produce the best results. IO is especially problematic.

Alignment and de-duplication, for example, is usually a multi-step disk intensive process: Perform alignment and write BAM file to disk, sort original BAM file to disk, deduplicate BAM file to disk. Researchers are using a full arsenal of approaches – powerful hardware, parallelization, algorithm refinement, storage system optimization – to accelerate throughput. Simply put, storage infrastructures must address two general areas:

  • The infrastructure must be capable of handling the volume of data being generated by today’s lab equipment. Modern DNA sequencers already produce up to a few hundred TB per instrument per year, a rate that is expected to grow 100-fold as capacities increase and more annotation data is captured. With many genomics workflows, many terabytes of data must routinely be moved from the DNA sequencing machines that generate the data to the computational component that performs the DNA alignment, assembly, and subsequent genomic analysis.
  • The analysis process, multiple tools are used on the data in its various forms. Each of the tools has different IO characteristics. For example, in a typical workflow, the output data from a sequencer might be prepared in some way, partitioning it into smaller working packages for an initial analysis. This type of operation involves many read/writes and is IO bound. The output from that step might then be used to perform a read alignment. This operation is CPU-bound. Finally, the work done on the smaller packages of data needs to be sorted and merged, aggregating the results into one file. This process requires many temporary file writes and is IO bound.

One the compute side, there are a variety of solutions available to help meet the raw processing demands of today’s genomics analysis workflows. Organizations can select high performance servers with multi-core, multi-threaded processors; systems with large amounts of shared memory; analytics nodes with lots of high-speed flash; systems that make use of in-memory processing; and servers that take advantage of co-processors or other forms of hardware-based acceleration.

One the storage side, the choices can be more limiting. Life sciences code, as noted earlier, tends to be IO bound. There are large numbers of rapid, read/write calls. The throughput demands per core can easily exceed practical IO limitations. Given the size and number of files being moved in genomic analysis workflows, traditional NAS storage solutions and NFS-based file systems frequently don’t scale out adequately and slow performance. High performance parallel file systems such as Lustre and the General Parallel File System (GPFS) are often needed.

Determining the right file system for use isn’t always straightforward. One example of this challenge is a project undertaken by a major biomedical research organization seeking to conduct whole genome sequencing (WGS) analysis on 1,500 subjects; that translated into 110 terabytes (TB) of data with each whole genome sample accounting for about 75GB. Samples were processed in batches of 75 to optimize throughput, requiring about 5TB of data to be read and written to disk multiple times during the 96 hour processing workflow, with intermediate files adding another 5TB to the I/O load.

Many of processing steps were IO intensive and involved reading and writing large 100GB BAM files to and from disk. These did not scale well. Several strategies were tried (e.g., upgrading the network bandwidth, minimizing IO operations, improving workload splitting). Despite the I/O improvements, significant bottlenecks remained in running disk intensive processes at scale. Specifically the post-alignment processing slowed down on NFS shared file systems due to a high number of concurrent file writes. In this instance, switching to Lustre delivered a threefold improvement in write performance.

Conversely Purdue chose GPFS during an upgrade of its cyberinfrastructure which serves a large community of very diverse domains.

“We have researchers pulling in data from instruments to a scratch file and this may be the sole repository of their data for several months while they are analyzing it, cleaning the data, and haven’t yet put it into archives,” said Mike Shuey, research infrastructure architect at Purdue. “We are taking advantage of a couple of GPFS RAS (reliability, availability, and serviceability) features, specifically data replication and snapshot capabilities to protect against site-wide failure and to protect against accidental data deletion. While Lustre is great for other workloads – and we use it in some places – it doesn’t have those sorts of features right now,” Vertical Focus: HPC in BioITsaid Shuey.

LS processing requirements – a major portion of Purdue’s research activity – can be problematic in a mixed-use environment. Shuey noted LS workflows often have millions of tiny files whose IO access requirements can interfere with the more typical IO stream of simulation applications; larger files in a mechanical engineering simulation, for example, can be slowed by accesses to these millions of tiny files from a life sciences workflow. Purdue adopted deployed DataDirect Networks acceleration technology to help cope with this issue.

Two relatively new technologies that continue to gain traction are Hadoop and iRODS.

hadoop_logo.jpgHadoop, of course, uses a distributed file system and framework (MapReduce) to break large data sets into chunks, to distribute/store (Map) those chunks to nodes in a cluster, and to gather (Reduce) results following computation. Hadoop’s distinguishing feature is it automatically stores the chunks of data on the same nodes on which they will be processed. This strategy of co-locating of data and processing power (proximity computing) significantly accelerates performance.

It also turns out that Hadoop architecture is a good choice for many life sciences applications. This is largely because so much of life sciences data is semi- or unstructured file-based data and ideally suited for ‘embarrassingly parallel’ computation. Moreover, the use of commodity hardware (e.g. Linux cluster) keeps cost down, and little or no hardware modification is required. Conversely issues remain, say some observers.

“[W]hile genome scientists have adopted the concept of MapReduce for parallelizing IO, they have not embraced the Hadoop ecosystem. For example, the popular Genome Analysis Toolkit (GATK) uses a proprietary MapReduce implementation that can scale vertically but not horizontally…Efforts exist for adapting existing genomics data structures to Hadoop, but these don’t support the full range of analytic requirements,” noted Alan Day (principal data scientist) and Sungwook Yoon (data scientist) at vendor MapR, in a blog post during the Strata & Hadoop World conference, held earlier this month.

MapR’s approach is to implement an end-to-end analysis pipeline based on GATK and running on Hadoop. “The benefit of combining GATK and Hadoop is two-fold. First, Hadoop provides a more cost-effective solution than a traditional HPC+SAN substrate. Second, Hadoop applications are much easier for software engineers to design and scale,” they wrote, adding the MapR solution follows Hadoop and the GATK best practices. They argue results can be generated on easily available hardware and users can expect immediate ROI by moving existing GATK use cases to Hadoop.

iRODS-Logo.2015iRODS solves a different challenge. It is a data grid technology that essentially puts a unified namespace on data files, regardless of where those files are physically located. You may have files in four or five different storage systems, but to the user it appears as one directory tree. iRODS also allows setting enforcement rules on any access to the data or submission of data. For example, if someone entered data into the system, that might trigger a rule to replicate the data to another system and compress it at the same time. Access protection rules based on metadata about a file can be set.[1]

At the Renaissance Computing Institute (RENCI) of the University of North Carolina, iRODS has been used in several aspects of its genomics analysis pipeline. When analytical pipelines are processing the data they also register that data into iRODS, according to Charles Schmitt, director of informatics, RENCI[iv]. At the end of the pipeline, the data exists on disks and is registered into iRODS. Anyone wanting to use the data must come in through iRODS to get the data; this allows RENCI to set policies on access and data use.

Broad benefits cited by the iRODS consortium include:

  • iRODS enables data discovery using a metadata catalog that describes every file, every directory, and every storage resource in the data grid.
  • iRODS automates data workflows, with a rule engine that permits any action to be initiated by any trigger on any server or client in the grid.
  • iRODS enables secure collaboration, so users only need to log in to their home grid to access data hosted on a remote grid.
  • iRODS implements data virtualization, allowing access to distributed storage assets under a unified namespace, and freeing organizations from getting locked in to single-vendor storage solutions.

It’s worth noting that RENCI has been an important participant in iRODS consortium whose members include, for example, Seagate (NASDAQ: STX), DDN, Novartis (NYSE: NVS), IBM(NYSE: IBM), Wellcome Trust Sanger Institute, and EMC (NYSE: EMC).

 

[1] RENCI white paper, Life sciences at RENCI: Big Data IT to manage, decipher, and inform, http://www.emc.com/collateral/white-papers/h11692-life-sciences-renci-big-data-manage-info-wp.pdf

 

Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industy updates delivered to you every week!

With New Owner and New Roadmap, an Independent Omni-Path Is Staging a Comeback

July 23, 2021

Put on a shelf by Intel in 2019, Omni-Path faced a uncertain future, but under new custodian Cornelis Networks, OmniPath is looking to make a comeback as an independent high-performance interconnect solution. A "significant refresh" – called Omni-Path Express – is coming later this year according to the company. Cornelis Networks formed last September as a spinout of Intel's Omni-Path division. Read more…

PEARC21 Panel Reviews Eight New NSF-Funded HPC Systems Debuting in 2021

July 23, 2021

Over the past few years, the NSF has funded a number of HPC systems to further supply the open research community with computational resources to meet that community’s changing and expanding needs. A review of these systems at the PEARC21 conference (July 19-22) highlighted... Read more…

Chameleon’s HPC Testbed Sharpens Its Edge, Presses ‘Replay’

July 22, 2021

“One way of saying what I do for a living is to say that I develop scientific instruments,” said Kate Keahey, a senior fellow at the University of Chicago and a computer scientist at Argonne National Laboratory, as s Read more…

PEARC21 Plenary Session: AI for Innovative Social Work

July 21, 2021

AI analysis of social media poses a double-edged sword for social work and addressing the needs of at-risk youths, said Desmond Upton Patton, senior associate dean, Innovation and Academic Affairs, Columbia University. S Read more…

Summer Reading: “High-Performance Computing Is at an Inflection Point”

July 21, 2021

At last month’s 11th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART), a group of researchers led by Martin Schulz of the Leibniz Supercomputing Center (Munich) presented a “position paper” in which they argue HPC architectural landscape... Read more…

AWS Solution Channel

Accelerate innovation in healthcare and life sciences with AWS HPC

With Amazon Web Services, researchers can access purpose-built HPC tools and services along with scientific and technical expertise to accelerate the pace of discovery. Whether you are sequencing the human genome, using AI/ML for disease detection or running molecular dynamics simulations to develop lifesaving drugs, AWS has the infrastructure you need to run your HPC workloads. Read more…

PEARC21 Panel: Wafer-Scale-Engine Technology Accelerates Machine Learning, HPC

July 21, 2021

Early use of Cerebras’ CS-1 server and wafer-scale engine (WSE) has demonstrated promising acceleration of machine-learning algorithms, according to participants in the Scientific Research Enabled by CS-1 Systems panel Read more…

With New Owner and New Roadmap, an Independent Omni-Path Is Staging a Comeback

July 23, 2021

Put on a shelf by Intel in 2019, Omni-Path faced a uncertain future, but under new custodian Cornelis Networks, OmniPath is looking to make a comeback as an independent high-performance interconnect solution. A "significant refresh" – called Omni-Path Express – is coming later this year according to the company. Cornelis Networks formed last September as a spinout of Intel's Omni-Path division. Read more…

Chameleon’s HPC Testbed Sharpens Its Edge, Presses ‘Replay’

July 22, 2021

“One way of saying what I do for a living is to say that I develop scientific instruments,” said Kate Keahey, a senior fellow at the University of Chicago a Read more…

Summer Reading: “High-Performance Computing Is at an Inflection Point”

July 21, 2021

At last month’s 11th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART), a group of researchers led by Martin Schulz of the Leibniz Supercomputing Center (Munich) presented a “position paper” in which they argue HPC architectural landscape... Read more…

PEARC21 Panel: Wafer-Scale-Engine Technology Accelerates Machine Learning, HPC

July 21, 2021

Early use of Cerebras’ CS-1 server and wafer-scale engine (WSE) has demonstrated promising acceleration of machine-learning algorithms, according to participa Read more…

15 Years Later, the Green500 Continues Its Push for Energy Efficiency as a First-Order Concern in HPC

July 15, 2021

The Green500 list, which ranks the most energy-efficient supercomputers in the world, has virtually always faced an uphill battle. As Wu Feng – custodian of the Green500 list and an associate professor at Virginia Tech – tells it, “noone" cared about energy efficiency in the early 2000s, when the seeds... Read more…

Frontier to Meet 20MW Exascale Power Target Set by DARPA in 2008

July 14, 2021

After more than a decade of planning, the United States’ first exascale computer, Frontier, is set to arrive at Oak Ridge National Laboratory (ORNL) later this year. Crossing this “1,000x” horizon required overcoming four major challenges: power demand, reliability, extreme parallelism and data movement. Read more…

Quantum Roundup: IBM, Rigetti, Phasecraft, Oxford QC, China, and More

July 13, 2021

IBM yesterday announced a proof for a quantum ML algorithm. A week ago, it unveiled a new topology for its quantum processors. Last Friday, the Technical Univer Read more…

ExaWind Prepares for New Architectures, Bigger Simulations

July 10, 2021

The ExaWind project describes itself in terms of terms like wake formation, turbine-turbine interaction and blade-boundary-layer dynamics, but the pitch to the Read more…

AMD Chipmaker TSMC to Use AMD Chips for Chipmaking

May 8, 2021

TSMC has tapped AMD to support its major manufacturing and R&D workloads. AMD will provide its Epyc Rome 7702P CPUs – with 64 cores operating at a base cl Read more…

Intel Launches 10nm ‘Ice Lake’ Datacenter CPU with Up to 40 Cores

April 6, 2021

The wait is over. Today Intel officially launched its 10nm datacenter CPU, the third-generation Intel Xeon Scalable processor, codenamed Ice Lake. With up to 40 Read more…

Berkeley Lab Debuts Perlmutter, World’s Fastest AI Supercomputer

May 27, 2021

A ribbon-cutting ceremony held virtually at Berkeley Lab's National Energy Research Scientific Computing Center (NERSC) today marked the official launch of Perlmutter – aka NERSC-9 – the GPU-accelerated supercomputer built by HPE in partnership with Nvidia and AMD. Read more…

Ahead of ‘Dojo,’ Tesla Reveals Its Massive Precursor Supercomputer

June 22, 2021

In spring 2019, Tesla made cryptic reference to a project called Dojo, a “super-powerful training computer” for video data processing. Then, in summer 2020, Tesla CEO Elon Musk tweeted: “Tesla is developing a [neural network] training computer called Dojo to process truly vast amounts of video data. It’s a beast! … A truly useful exaflop at de facto FP32.” Read more…

Google Launches TPU v4 AI Chips

May 20, 2021

Google CEO Sundar Pichai spoke for only one minute and 42 seconds about the company’s latest TPU v4 Tensor Processing Units during his keynote at the Google I Read more…

CentOS Replacement Rocky Linux Is Now in GA and Under Independent Control

June 21, 2021

The Rocky Enterprise Software Foundation (RESF) is announcing the general availability of Rocky Linux, release 8.4, designed as a drop-in replacement for the soon-to-be discontinued CentOS. The GA release is launching six-and-a-half months after Red Hat deprecated its support for the widely popular, free CentOS server operating system. The Rocky Linux development effort... Read more…

CERN Is Betting Big on Exascale

April 1, 2021

The European Organization for Nuclear Research (CERN) involves 23 countries, 15,000 researchers, billions of dollars a year, and the biggest machine in the worl Read more…

Iran Gains HPC Capabilities with Launch of ‘Simorgh’ Supercomputer

May 18, 2021

Iran is said to be developing domestic supercomputing technology to advance the processing of scientific, economic, political and military data, and to strengthen the nation’s position in the age of AI and big data. On Sunday, Iran unveiled the Simorgh supercomputer, which will deliver.... Read more…

Leading Solution Providers

Contributors

HPE Launches Storage Line Loaded with IBM’s Spectrum Scale File System

April 6, 2021

HPE today launched a new family of storage solutions bundled with IBM’s Spectrum Scale Erasure Code Edition parallel file system (description below) and featu Read more…

Julia Update: Adoption Keeps Climbing; Is It a Python Challenger?

January 13, 2021

The rapid adoption of Julia, the open source, high level programing language with roots at MIT, shows no sign of slowing according to data from Julialang.org. I Read more…

10nm, 7nm, 5nm…. Should the Chip Nanometer Metric Be Replaced?

June 1, 2020

The biggest cool factor in server chips is the nanometer. AMD beating Intel to a CPU built on a 7nm process node* – with 5nm and 3nm on the way – has been i Read more…

GTC21: Nvidia Launches cuQuantum; Dips a Toe in Quantum Computing

April 13, 2021

Yesterday Nvidia officially dipped a toe into quantum computing with the launch of cuQuantum SDK, a development platform for simulating quantum circuits on GPU-accelerated systems. As Nvidia CEO Jensen Huang emphasized in his keynote, Nvidia doesn’t plan to build... Read more…

Microsoft to Provide World’s Most Powerful Weather & Climate Supercomputer for UK’s Met Office

April 22, 2021

More than 14 months ago, the UK government announced plans to invest £1.2 billion ($1.56 billion) into weather and climate supercomputing, including procuremen Read more…

Q&A with Jim Keller, CTO of Tenstorrent, and an HPCwire Person to Watch in 2021

April 22, 2021

As part of our HPCwire Person to Watch series, we are happy to present our interview with Jim Keller, president and chief technology officer of Tenstorrent. One of the top chip architects of our time, Keller has had an impactful career. Read more…

Quantum Roundup: IBM, Rigetti, Phasecraft, Oxford QC, China, and More

July 13, 2021

IBM yesterday announced a proof for a quantum ML algorithm. A week ago, it unveiled a new topology for its quantum processors. Last Friday, the Technical Univer Read more…

Senate Debate on Bill to Remake NSF – the Endless Frontier Act – Begins

May 18, 2021

The U.S. Senate today opened floor debate on the Endless Frontier Act which seeks to remake and expand the National Science Foundation by creating a technology Read more…

  • arrow
  • Click Here for More Headlines
  • arrow
HPCwire