New Genomics Pipeline Combines AWS, Local HPC, and Supercomputing

By John Russell

September 22, 2016

Declining DNA sequencing costs and the rush to do whole genome sequencing (WGS) of large cohort populations – think 5000 subjects now, but many more thousands soon – presents a formidable computational challenge to researchers attempting to make sense of large cohort datasets. No single architecture is best. This month researchers report developing a hybrid approach that combines cloud (AWS), local high performance compute (LHPC) clusters, and supercomputers.

Their fascinating paper, A hybrid computational strategy to address WGS variant analysis in >5000 samples, spells out in some detail the obstacles associated with using each resource and how to divide the work to maximize throughput and minimize cost. Computational resources used included: Amazon AWS; a 4000-core in-house cluster at Baylor College of Medicine; IBM power PC Blue BioU at Rice University and Rhea at Oak Ridge National Laboratory (ORNL). DNAnexus was also a collaborator.

“Large cohort studies,” write the authors, “are extremely useful for discovering genotype phenotype associations and to characterize variation with great public health significance. The decreasing costs of sequencing are increasingly making it possible to sequence whole genomes in the millions in the coming years. The past decade has also seen the development of many joint calling approaches for genomic data produced with low coverage whole genome sequencing. Joint calling is necessary for low to medium coverage sequencing projects (~10×) as it further reduces false positives rate especially at the rarer end of the site frequency spectrum.”

The multidisciplinary team, led by Baylor, developed a genomics analysis pipeline – goSNAP – that distributes the workflow across the platforms. As a proof of principle, analysis was performed of Cohorts for Heart And Aging Research in Genomic Epidemiology (CHARGE) WGS freeze 3 dataset in which joint calling, imputation and phasing of over 5300 whole genome samples was produced in under six weeks using four state-of-the-art callers (SNPTools, GATK-HaplotypeCaller, GATK-UnifiedGenotyper, and GotCloud.)

“The entire operation was finished in 50 days with a total core hour usage of ~ 5.2 million across all the infrastructures. Each aligned BAM file was split into 1 Mbp region for joint calling on AWS. This created a cache data footprint of 360 TB with a time to live not exceeding 14 days. Only 6 TB of data was transferred across all platforms. The goSNAP pipeline is designed to minimize egress charges, data storage charges and data transfer costs. It optimizes on concurrent core usage to be cost effective and fast. To the best of our knowledge, ensemble calling on a WGS cohort with over 5000 samples has not been done before and this approach can be easily scaled to 10,000 samples.”

Manjunath Gorentla Venkata, ORNL
Manjunath Gorentla Venkata, ORNL

“This is an excellent example of two scientific communities coming together to address challenging science problems. We are happy to have played a part in conducting the analysis of such unprecedented scale,” said Manjunath Gorentla Venkata, co-author and ORNL computer scientist in an account of the work on the ORNL website. “While researchers from Baylor discussed the problem, we did not have a ready-made solution. After multiple discussions, we were convinced that mapping pipeline components based on system architecture strengths and tailoring parameters to the architecture would provide quality analysis with a relatively short turnaround.”

“There was previously no infrastructure for this large of a set, at 5,000 samples,” said Dr. Eric Boerwinkle, associate director of Baylor’s Human Genome Sequencing Center and dean of UT Health School of Public Health. “To address this, we employed a combination of platforms to perform large-scale variant calling, while maintaining high quality data.”

Fuli Yu, Baylor College of Medicine, led the study
Fuli Yu, Baylor College of Medicine, led the study

Their work, report the authors, demonstrates variant calling pipelines using a hybrid computational environment can leverage the strengths of each architecture to process cohorts with thousands of whole genome samples in real-time while minimizing operational costs.

The specifics of how the workflow (variant site identification; consensus site filtering step; genotype likelihood; and imputation & phasing) is divided up among the computational resources are best gleaned directly from the paper as some steps overlap. The authors write,” There has been some past work on porting state-of-the-art variant calling pipelines for targeted whole exome sequencing of thousands of samples to the Amazon Web Services (AWS) cloud, but a cloud based ensemble calling workflow for thousands of whole genomes is lacking.”

More broadly the authors note the following issues with each class of infrastructure:

  • Most LHPCs with typical research environments have few PBs of storage and millions of core-hours per month and are constrained by hardware limits on data storage, computing power and data transfer bandwidth to carry out large computes.
  • Scalability is not a problem for the AWS computing environment as it allows flexibility to increases the compute and data resources with a ‘pay per use’ model. However, the outbound data transfers incurs a cost which scales linearly with the amount of data transferred. It is also necessary to optimize on all aspects of the compute including memory bandwidth and capacity (RAM), computing cores (CPU) and IO capacity and bandwidth (HDD) to make optimal use of the instances and achieve cost-effectiveness. For projects involving big data, there is an additional cost of implementing data parallelization to overcome the limitations of local instance on HDD space.
  • The large supercomputing infrastructure has an extremely large data store, premium hardware optimized for high IO bandwidth, low-latency and high bandwidth network, and dedicated hardware and software support for CPU-intensive operations, but computing jobs have to finish within hard wall time limits. (For example, Titan at ORNL requires all jobs to finish within 24 hrs. Scheduling delays in allocating large number of resources can add to the turnaround times.)

Click on the image below to get a better sense of how the computational were used in this study.

screen-shot-2016-09-15-at-4-21-12-pm

The team used the Rhea computing cluster at the Oak Ridge Leadership Computing Facility to reconstruct chromosomal segments inherited from parents and to statistically predict the makeup of incomplete or missing genetic sequences from discovered genetic markers. This step was the most computationally intensive and required the greatest amount of power to calculate the probabilities of the most likely genetic patterns. More than 75 percent of this step was finished on Rhea and the rest was completed on supercomputers at Rice University. Baylor utilized the Amazon Web Services cloud computing environment to store raw data and discover genetic variants across the thousands of genome samples.

The authors conclude:

“With increasing number of genomic datasets freely available on the AWS cloud, the next generation of variant calling pipelines will also be increasingly common in the AWS environment. While the costs of storage and compute cores in the AWS environment is declining, it may still be prohibitively costly to carry out many steps of standard variant calling workflow on the cloud. A hybrid computational approach involving multiple HPC systems may be an important future direction to explore. Our work on the goSNAP pipeline demonstrates that using a hybrid computation strategy can be cost effective and fast even with thousands of individual genomes.”

Link to ORNL article:

https://www.ornl.gov/news/ornl-helps-develop-hybrid-computational-strategy-efficient-sequencing-massive-genome-datasets

Link to Baylor article:

https://www.bcm.edu/news/genome-sequencing/new-scalable-whole-genome-data-analysis

Link to paper on open access publisher BioMed Central (Sep 10, 2016,) https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1211-6

Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industy updates delivered to you every week!

Watch Nvidia’s GTC21 Keynote with Jensen Huang Livestreamed Here at HPCwire

April 9, 2021

Join HPCwire right here on Monday, April 12, at 8:30 am PT to see the Nvidia GTC21 keynote from Nvidia’s CEO, Jensen Huang, livestreamed in its entirety. Hosted by HPCwire, you can click to join the Huang keynote on our livestream to hear Nvidia’s expected news and... Read more…

The US Places Seven Additional Chinese Supercomputing Entities on Blacklist

April 8, 2021

As tensions between the U.S. and China continue to simmer, the U.S. government today added seven Chinese supercomputing entities to an economic blacklist. The U.S. Entity List bars U.S. firms from supplying key technolog Read more…

Argonne Supercomputing Supports Caterpillar Engine Design

April 8, 2021

Diesel fuels still account for nearly ten percent of all energy-related U.S. carbon emissions – most of them from heavy-duty vehicles like trucks and construction equipment. Energy efficiency is key to these machines, Read more…

Habana’s AI Silicon Comes to San Diego Supercomputer Center

April 8, 2021

Habana Labs, an Intel-owned AI company, has partnered with server maker Supermicro to provide high-performance, high-efficiency AI computing in the form of new training and inference servers that will power the upcoming Read more…

Intel Partners Debut Latest Servers Based on the New Intel Gen 3 ‘Ice Lake’ Xeons

April 7, 2021

Fresh from Intel’s launch of the company’s latest third-generation Xeon Scalable “Ice Lake” processors on April 6 (Tuesday), Intel server partners Cisco, Dell EMC, HPE and Lenovo simultaneously unveiled their first server models built around the latest chips. And though arch-rival AMD may... Read more…

AWS Solution Channel

Volkswagen Passenger Cars Uses NICE DCV for High-Performance 3D Remote Visualization

 

Volkswagen Passenger Cars has been one of the world’s largest car manufacturers for over 70 years. The company delivers more than 6 million automobiles to global customers every year, from 50 production locations on five continents. Read more…

What’s New in HPC Research: Tundra, Fugaku, µHPC & More

April 6, 2021

In this regular feature, HPCwire highlights newly published research in the high-performance computing community and related domains. From parallel programming to exascale to quantum computing, the details are here. Read more…

The US Places Seven Additional Chinese Supercomputing Entities on Blacklist

April 8, 2021

As tensions between the U.S. and China continue to simmer, the U.S. government today added seven Chinese supercomputing entities to an economic blacklist. The U Read more…

Habana’s AI Silicon Comes to San Diego Supercomputer Center

April 8, 2021

Habana Labs, an Intel-owned AI company, has partnered with server maker Supermicro to provide high-performance, high-efficiency AI computing in the form of new Read more…

Intel Partners Debut Latest Servers Based on the New Intel Gen 3 ‘Ice Lake’ Xeons

April 7, 2021

Fresh from Intel’s launch of the company’s latest third-generation Xeon Scalable “Ice Lake” processors on April 6 (Tuesday), Intel server partners Cisco, Dell EMC, HPE and Lenovo simultaneously unveiled their first server models built around the latest chips. And though arch-rival AMD may... Read more…

Intel Launches 10nm ‘Ice Lake’ Datacenter CPU with Up to 40 Cores

April 6, 2021

The wait is over. Today Intel officially launched its 10nm datacenter CPU, the third-generation Intel Xeon Scalable processor, codenamed Ice Lake. With up to 40 Read more…

HPE Launches Storage Line Loaded with IBM’s Spectrum Scale File System

April 6, 2021

HPE today launched a new family of storage solutions bundled with IBM’s Spectrum Scale Erasure Code Edition parallel file system (description below) and featu Read more…

RIKEN’s Ongoing COVID Research Includes New Vaccines, New Tests & More

April 6, 2021

RIKEN took the supercomputing world by storm last summer when it launched Fugaku – which became (and remains) the world’s most powerful supercomputer – ne Read more…

CERN Is Betting Big on Exascale

April 1, 2021

The European Organization for Nuclear Research (CERN) involves 23 countries, 15,000 researchers, billions of dollars a year, and the biggest machine in the worl Read more…

AI Systems Summit Keynote: Brace for System Level Heterogeneity Says de Supinski

April 1, 2021

Heterogeneous computing has quickly come to mean packing a couple of CPUs and one-or-many accelerators, mostly GPUs, onto the same node. Today, a one-such-node system has become the standard AI server offered by dozens of vendors. This is not to diminish the many advances... Read more…

Julia Update: Adoption Keeps Climbing; Is It a Python Challenger?

January 13, 2021

The rapid adoption of Julia, the open source, high level programing language with roots at MIT, shows no sign of slowing according to data from Julialang.org. I Read more…

Intel Launches 10nm ‘Ice Lake’ Datacenter CPU with Up to 40 Cores

April 6, 2021

The wait is over. Today Intel officially launched its 10nm datacenter CPU, the third-generation Intel Xeon Scalable processor, codenamed Ice Lake. With up to 40 Read more…

CERN Is Betting Big on Exascale

April 1, 2021

The European Organization for Nuclear Research (CERN) involves 23 countries, 15,000 researchers, billions of dollars a year, and the biggest machine in the worl Read more…

Programming the Soon-to-Be World’s Fastest Supercomputer, Frontier

January 5, 2021

What’s it like designing an app for the world’s fastest supercomputer, set to come online in the United States in 2021? The University of Delaware’s Sunita Chandrasekaran is leading an elite international team in just that task. Chandrasekaran, assistant professor of computer and information sciences, recently was named... Read more…

HPE Launches Storage Line Loaded with IBM’s Spectrum Scale File System

April 6, 2021

HPE today launched a new family of storage solutions bundled with IBM’s Spectrum Scale Erasure Code Edition parallel file system (description below) and featu Read more…

10nm, 7nm, 5nm…. Should the Chip Nanometer Metric Be Replaced?

June 1, 2020

The biggest cool factor in server chips is the nanometer. AMD beating Intel to a CPU built on a 7nm process node* – with 5nm and 3nm on the way – has been i Read more…

Saudi Aramco Unveils Dammam 7, Its New Top Ten Supercomputer

January 21, 2021

By revenue, oil and gas giant Saudi Aramco is one of the largest companies in the world, and it has historically employed commensurate amounts of supercomputing Read more…

Quantum Computer Start-up IonQ Plans IPO via SPAC

March 8, 2021

IonQ, a Maryland-based quantum computing start-up working with ion trap technology, plans to go public via a Special Purpose Acquisition Company (SPAC) merger a Read more…

Leading Solution Providers

Contributors

Can Deep Learning Replace Numerical Weather Prediction?

March 3, 2021

Numerical weather prediction (NWP) is a mainstay of supercomputing. Some of the first applications of the first supercomputers dealt with climate modeling, and Read more…

Livermore’s El Capitan Supercomputer to Debut HPE ‘Rabbit’ Near Node Local Storage

February 18, 2021

A near node local storage innovation called Rabbit factored heavily into Lawrence Livermore National Laboratory’s decision to select Cray’s proposal for its CORAL-2 machine, the lab’s first exascale-class supercomputer, El Capitan. Details of this new storage technology were revealed... Read more…

New Deep Learning Algorithm Solves Rubik’s Cube

July 25, 2018

Solving (and attempting to solve) Rubik’s Cube has delighted millions of puzzle lovers since 1974 when the cube was invented by Hungarian sculptor and archite Read more…

African Supercomputing Center Inaugurates ‘Toubkal,’ Most Powerful Supercomputer on the Continent

February 25, 2021

Historically, Africa hasn’t exactly been synonymous with supercomputing. There are only a handful of supercomputers on the continent, with few ranking on the Read more…

The History of Supercomputing vs. COVID-19

March 9, 2021

The COVID-19 pandemic poses a greater challenge to the high-performance computing community than any before. HPCwire's coverage of the supercomputing response t Read more…

HPE Names Justin Hotard New HPC Chief as Pete Ungaro Departs

March 2, 2021

HPE CEO Antonio Neri announced today (March 2, 2021) the appointment of Justin Hotard as general manager of HPC, mission critical solutions and labs, effective Read more…

Microsoft, HPE Bringing AI, Edge, Cloud to Earth Orbit in Preparation for Mars Missions

February 12, 2021

The International Space Station will soon get a delivery of powerful AI, edge and cloud computing tools from HPE and Microsoft Azure to expand technology experi Read more…

AMD Launches Epyc ‘Milan’ with 19 SKUs for HPC, Enterprise and Hyperscale

March 15, 2021

At a virtual launch event held today (Monday), AMD revealed its third-generation Epyc “Milan” CPU lineup: a set of 19 SKUs -- including the flagship 64-core, 280-watt 7763 part --  aimed at HPC, enterprise and cloud workloads. Notably, the third-gen Epyc Milan chips achieve 19 percent... Read more…

  • arrow
  • Click Here for More Headlines
  • arrow
HPCwire