New Genomics Pipeline Combines AWS, Local HPC, and Supercomputing

By John Russell

September 22, 2016

Declining DNA sequencing costs and the rush to do whole genome sequencing (WGS) of large cohort populations – think 5000 subjects now, but many more thousands soon – present a formidable computational challenge to researchers attempting to make sense of large cohort datasets. No single architecture is best. This month researchers report developing a hybrid approach that combines cloud (AWS), local high performance compute (LHPC) clusters, and supercomputers.

Their fascinating paper, “A hybrid computational strategy to address WGS variant analysis in >5000 samples,” spells out in some detail the obstacles associated with using each resource and how to divide the work to maximize throughput and minimize cost. Computational resources used included Amazon AWS; a 4000-core in-house cluster at Baylor College of Medicine; the IBM Power PC Blue BioU at Rice University; and Rhea at Oak Ridge National Laboratory (ORNL). DNAnexus was also a collaborator.

“Large cohort studies,” write the authors, “are extremely useful for discovering genotype phenotype associations and to characterize variation with great public health significance. The decreasing costs of sequencing are increasingly making it possible to sequence whole genomes in the millions in the coming years. The past decade has also seen the development of many joint calling approaches for genomic data produced with low coverage whole genome sequencing. Joint calling is necessary for low to medium coverage sequencing projects (~10×) as it further reduces false positives rate especially at the rarer end of the site frequency spectrum.”

The multidisciplinary team, led by Baylor, developed a genomics analysis pipeline – goSNAP – that distributes the workflow across the platforms. As a proof of principle, the team analyzed the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) WGS freeze 3 dataset: joint calling, imputation and phasing of more than 5300 whole genome samples were completed in under six weeks using four state-of-the-art callers (SNPTools, GATK-HaplotypeCaller, GATK-UnifiedGenotyper, and GotCloud).

“The entire operation was finished in 50 days with a total core hour usage of ~ 5.2 million across all the infrastructures. Each aligned BAM file was split into 1 Mbp region for joint calling on AWS. This created a cache data footprint of 360 TB with a time to live not exceeding 14 days. Only 6 TB of data was transferred across all platforms. The goSNAP pipeline is designed to minimize egress charges, data storage charges and data transfer costs. It optimizes on concurrent core usage to be cost effective and fast. To the best of our knowledge, ensemble calling on a WGS cohort with over 5000 samples has not been done before and this approach can be easily scaled to 10,000 samples.”
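The numbers in that passage invite a quick back-of-the-envelope check. The Python sketch below is illustrative only and is not taken from goSNAP: the function names and the toy chromosome lengths are assumptions. It shows one way a genome could be cut into 1 Mbp regions for parallel calling, and the average number of cores kept busy if roughly 5.2 million core hours are spread evenly over 50 days.

```python
def split_into_regions(chrom_lengths, region_size=1_000_000):
    """Yield (chrom, start, end) tuples, 1-based and inclusive, tiling each chromosome."""
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, region_size):
            end = min(start + region_size - 1, length)
            yield (chrom, start, end)


def average_concurrent_cores(core_hours, days):
    """Average cores in use if core_hours were spread evenly over the given days."""
    return core_hours / (days * 24)


# Toy chromosome lengths, chosen for illustration (not real genome coordinates).
toy_lengths = {"chrA": 2_500_000, "chrB": 1_000_000}
regions = list(split_into_regions(toy_lengths))

# Figures quoted in the article: ~5.2 million core hours over 50 days.
avg_cores = average_concurrent_cores(5.2e6, 50)

for region in regions:
    print(region)
print(f"average concurrent cores: {avg_cores:.0f}")
```

Under the even-spread assumption the average works out to a little over 4,300 cores in continuous use, which is on the order of the size of the in-house Baylor cluster alone. Actual usage was presumably far burstier than that.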

Manjunath Gorentla Venkata, ORNL

“This is an excellent example of two scientific communities coming together to address challenging science problems. We are happy to have played a part in conducting the analysis of such unprecedented scale,” said Manjunath Gorentla Venkata, co-author and ORNL computer scientist, in an account of the work on the ORNL website. “While researchers from Baylor discussed the problem, we did not have a ready-made solution. After multiple discussions, we were convinced that mapping pipeline components based on system architecture strengths and tailoring parameters to the architecture would provide quality analysis with a relatively short turnaround.”

“There was previously no infrastructure for this large of a set, at 5,000 samples,” said Dr. Eric Boerwinkle, associate director of Baylor’s Human Genome Sequencing Center and dean of UT Health School of Public Health. “To address this, we employed a combination of platforms to perform large-scale variant calling, while maintaining high quality data.”

Fuli Yu, Baylor College of Medicine, led the study

Their work, the authors report, demonstrates that variant calling pipelines running in a hybrid computational environment can leverage the strengths of each architecture to process cohorts with thousands of whole genome samples in real time while minimizing operational costs.

The specifics of how the workflow (variant site identification; consensus site filtering step; genotype likelihood; and imputation & phasing) is divided up among the computational resources are best gleaned directly from the paper as some steps overlap. The authors write, “There has been some past work on porting state-of-the-art variant calling pipelines for targeted whole exome sequencing of thousands of samples to the Amazon Web Services (AWS) cloud, but a cloud based ensemble calling workflow for thousands of whole genomes is lacking.”

More broadly the authors note the following issues with each class of infrastructure:

  • Most LHPCs in typical research environments have a few PBs of storage and millions of core-hours per month, and are constrained by hardware limits on data storage, computing power and data transfer bandwidth when carrying out large computes.
  • Scalability is not a problem for the AWS computing environment, as it allows the flexibility to increase compute and data resources with a ‘pay per use’ model. However, outbound data transfers incur a cost that scales linearly with the amount of data transferred. It is also necessary to optimize all aspects of the compute, including memory bandwidth and capacity (RAM), computing cores (CPU) and IO capacity and bandwidth (HDD), to make optimal use of the instances and achieve cost-effectiveness. For projects involving big data, there is an additional cost of implementing data parallelization to overcome the limited HDD space on local instances.
  • The large supercomputing infrastructure has an extremely large data store, premium hardware optimized for high IO bandwidth, a low-latency and high-bandwidth network, and dedicated hardware and software support for CPU-intensive operations, but computing jobs have to finish within hard wall time limits. (For example, Titan at ORNL requires all jobs to finish within 24 hrs. Scheduling delays in allocating a large number of resources can add to the turnaround times.)
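Two of the constraints listed above reduce to simple arithmetic. The Python sketch below is a rough illustration under stated assumptions, not part of goSNAP. The per-GB egress price is a hypothetical placeholder rather than an actual AWS rate, and the function names are invented here. It models egress cost as linear in data volume, and counts how many separate submissions a long-running task needs under a hard wall time limit.

```python
import math


def egress_cost(tb_transferred, usd_per_gb):
    """Linear egress model: cost is proportional to outbound data volume."""
    return tb_transferred * 1000 * usd_per_gb  # 1 TB taken as 1000 GB


def jobs_needed(total_task_hours, wall_limit_hours=24):
    """Minimum number of restartable jobs needed to fit a task under the wall limit."""
    return math.ceil(total_task_hours / wall_limit_hours)


HYPOTHETICAL_USD_PER_GB = 0.09  # placeholder price, assumed for illustration only

# Hypothetical comparison: moving the 6 TB the article says was transferred,
# versus moving the entire 360 TB cache footprint out of the cloud.
cost_small = egress_cost(6, HYPOTHETICAL_USD_PER_GB)
cost_large = egress_cost(360, HYPOTHETICAL_USD_PER_GB)

print(f"6 TB: ${cost_small:,.0f}; 360 TB: ${cost_large:,.0f}")
print(f"a 100-hour task needs {jobs_needed(100)} jobs under a 24-hour limit")
```

Because the model is linear, the 60-fold difference in data volume becomes a 60-fold difference in cost whatever the actual price, which is one reason to keep bulky intermediate data in place and move only compact results between platforms.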

[Figure: how the computational resources were used in this study.]

The team used the Rhea computing cluster at the Oak Ridge Leadership Computing Facility to reconstruct chromosomal segments inherited from parents and to statistically predict the makeup of incomplete or missing genetic sequences from discovered genetic markers. This step was the most computationally intensive and required the greatest amount of power to calculate the probabilities of the most likely genetic patterns. More than 75 percent of this step was finished on Rhea and the rest was completed on supercomputers at Rice University. Baylor utilized the Amazon Web Services cloud computing environment to store raw data and discover genetic variants across the thousands of genome samples.
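As a rough illustration of dividing one step between two systems, the Python sketch below splits a batch of work units in proportion to a share table. Everything in it is an assumption made for illustration: the function name, the unit count, and the 75/25 split, which treats the article's "more than 75 percent" as a lower bound for Rhea rather than a reported figure.

```python
def split_units(total_units, shares):
    """Allocate whole work units in proportion to shares.

    Fractional units are rounded down, and any leftover units are given
    to the platform with the largest share so the total is preserved.
    """
    alloc = {platform: int(total_units * share) for platform, share in shares.items()}
    leftover = total_units - sum(alloc.values())
    biggest = max(shares, key=shares.get)
    alloc[biggest] += leftover
    return alloc


# Assumed split for the imputation and phasing step (lower bound for Rhea).
imputation_shares = {"Rhea (ORNL)": 0.75, "Rice University": 0.25}

# Hypothetical number of independent work units, e.g. genomic regions.
allocation = split_units(1001, imputation_shares)
print(allocation)
```

A real scheduler would also weigh queue wait times and the wall time limits noted earlier. The sketch captures only the proportional split.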

The authors conclude:

“With increasing number of genomic datasets freely available on the AWS cloud, the next generation of variant calling pipelines will also be increasingly common in the AWS environment. While the costs of storage and compute cores in the AWS environment is declining, it may still be prohibitively costly to carry out many steps of standard variant calling workflow on the cloud. A hybrid computational approach involving multiple HPC systems may be an important future direction to explore. Our work on the goSNAP pipeline demonstrates that using a hybrid computation strategy can be cost effective and fast even with thousands of individual genomes.”

Link to ORNL article:

https://www.ornl.gov/news/ornl-helps-develop-hybrid-computational-strategy-efficient-sequencing-massive-genome-datasets

Link to Baylor article:

https://www.bcm.edu/news/genome-sequencing/new-scalable-whole-genome-data-analysis

Link to paper on open access publisher BioMed Central (Sep 10, 2016):

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1211-6
