Pathogens Can’t Hide from Novel HPC Approach

By Tiffany Trader

January 21, 2015

In the realm of pathogenic threats, there are the usual suspects – anthrax, botulism, tuberculosis – but in reality a whole host of bacterial and viral microbes can threaten human and animal health. Protecting the public against these threats falls under the domain of biosecurity. While staying ahead of Mother Nature is a daunting task, scientists are beginning to unlock the secrets of the microbial world thanks to powerful sequencing technology and advanced computing tools.

Metagenomic sequencing is an offshoot of traditional genomic sequencing that is emerging as a key enabler for biosecurity. The application involves detecting and characterizing potentially dangerous pathogens and assessing the threat those organisms pose to human health. Before this research tool can be deployed more widely, however, serious data challenges need to be addressed.

Scientists at Lawrence Livermore National Laboratory (LLNL) are on the cusp of a breakthrough that would bring this problem down to size to facilitate use cases at a range of scales. Led by bioinformatics scientist Jonathan Allen, the team developed a novel approach to metagenomic sequencing using flash drives as a supplemental memory source to more efficiently search very large datasets.

Dr. Allen explains that while conventional sequencing targets a known biological isolate, metagenomic sequencing is applied when the sample contains organisms or DNA of unknown origin. In this scenario, researchers take a biological mass and do a DNA extraction step to determine what DNA fragments can be recovered. “It’s a different tool in the toolbox,” says Allen. “From the pathogen or biosecurity perspective, it’s a last resort for when you don’t know what you might be dealing with.”

Because of the element of the unknown, metagenomic sequencing is orders of magnitude more challenging than conventional sequencing, says Allen. One reason is just the sheer abundance of microbial life. “Each sample has potentially hundreds or more organisms. In human clinical samples, some of the DNA may come from the host, a lot of it may be benign organisms. So sorting through all of that to understand the key functionally relevant portions of the sample is a major challenge,” he adds.

It’s one of the emerging data-intensive problems in life sciences. A single sequencing run can generate potentially billions of genetic fragments, and each of these unlabeled fragments needs to be compared against every reference genome independently. The team’s objective was to take this large dataset and provide a fast, efficient, and scalable way to perform an accurate assessment of what organisms and genes are present.

By creating a searchable index of everything that has been previously sequenced and assigning organizational information to each genomic sequence, researchers can build a hierarchical picture: certain fragments are conserved at the species level, some are conserved at the family level, and others may be unique to a particular isolate. They then search against all of that information to assess where a given fragment fits within that pantheon of previously seen data.
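
To make the hierarchy concrete, here is a minimal Python sketch of how a fragment might be assigned to the deepest taxonomy node shared by every reference genome it appears in. The toy taxonomy, the names, and the lowest-common-ancestor rule are illustrative assumptions, not LMAT's actual scheme.

```python
# Hypothetical sketch: assign a fragment the deepest taxonomy node shared by
# every reference genome it has been seen in. Names and tree are toy examples.

# parent links for a toy taxonomy: isolate -> species -> family -> root
PARENT = {
    "B. anthracis Ames": "Bacillus anthracis",
    "B. anthracis Sterne": "Bacillus anthracis",
    "Bacillus anthracis": "Bacillaceae",
    "Bacillus cereus": "Bacillaceae",
    "Bacillaceae": "root",
}

def lineage(node):
    """Return the path from a node up to the root."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def lowest_common_ancestor(nodes):
    """Deepest taxonomy node shared by all of the given nodes."""
    paths = [lineage(n) for n in nodes]
    common = set(paths[0]).intersection(*map(set, paths[1:]))
    # the first entry of any lineage that is common to all is the deepest shared node
    return next(n for n in paths[0] if n in common)

# A fragment seen only in two anthracis isolates is conserved at the species
# level; one seen across the family is tagged at the family level.
print(lowest_common_ancestor(["B. anthracis Ames", "B. anthracis Sterne"]))  # Bacillus anthracis
print(lowest_common_ancestor(["B. anthracis Ames", "Bacillus cereus"]))      # Bacillaceae
```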

“That’s what makes this the classic data-intensive computing challenge,” he says. “We have these large query sets, but then we also have this similarly growing reference database. As more and more isolates are being sequenced – hundreds of potential isolates of a similar species on a steady, ongoing basis – we want to be able to take advantage of all of that new genetic diversity that’s being captured, so we can provide a more accurate assessment.”

The efforts of Allen and his Livermore colleagues, computer scientists Maya Gokhale and Sasha Ames and bioinformaticist Shea Gardner, led to the development of the Livermore Metagenomic Analysis Toolkit (LMAT), a custom reference database with a fast searchable index that addresses the scaling limitations of existing metagenomic classification methods.

When the team gets a new query sequence, they break it into its constituent elements, referred to as k-mers. A look-up key is then assigned for the purpose of tracking where each short fragment has been seen before. One of the primary challenges is that the look-up table quickly becomes quite large. The original database occupied 620 gigabytes of DRAM, which limited its use to researchers with access to large-memory machines.
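
The decomposition itself is straightforward; the sketch below shows one way to break a read into k-mers and pack each into an integer look-up key. The choice of k=20, the 2-bit encoding, and the dictionary-style index are assumptions for illustration, not LMAT's actual data structures.

```python
# Illustrative sketch of k-mer decomposition and key assignment (the value of
# k, the encoding, and the dict-based index are assumptions, not LMAT's format).

K = 20

def kmers(sequence, k=K):
    """Yield every overlapping length-k substring of a DNA read."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]

def kmer_key(kmer):
    """Pack a k-mer into a compact integer key (2 bits per base)."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    key = 0
    for base in kmer:
        key = (key << 2) | code[base]
    return key

# The reference index maps each key to taxonomic information about where that
# fragment has been seen before; with billions of distinct k-mers, this table
# is what grows to hundreds of gigabytes.
index = {}  # key -> taxonomy label (empty placeholder here)
read = "ACGTACGTACGTACGTACGTACGTACGT"
hits = [index.get(kmer_key(km)) for km in kmers(read)]
```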

Although LLNL researchers have access to terabyte-plus DRAM machines, wider accessibility was hampered. The team looked at how to reduce the size of the database and how to fit it onto architectures that would be more scalable and more affordable. The essential innovation of the project was a data structure optimized to store the searchable index on flash drives as if it were in memory: during lookups, the database is memory-mapped from the flash drive into a single address space and treated just as if it resided in DRAM.
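
The memory-mapping idea can be illustrated with a short sketch using Python's mmap module; the file path, flat-file layout, and 16-byte record size are assumptions, since the article does not describe LMAT's actual on-disk format.

```python
# Sketch of memory-mapping a flash-resident index and treating it as an
# in-memory array (path and record layout are hypothetical).
import mmap

RECORD = 16  # e.g. an 8-byte k-mer key followed by an 8-byte taxonomy ID

with open("/flash/kmer_index.bin", "rb") as f:
    # Map the whole file into the process address space; the OS pages in only
    # the records actually touched and keeps hot pages cached in DRAM.
    index = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    def record(i):
        """Return the raw bytes of the i-th fixed-width record."""
        return index[i * RECORD:(i + 1) * RECORD]

    # Lookups then proceed as if the hundreds-of-gigabytes table were an
    # ordinary in-memory array, e.g. via binary search over sorted keys.
    n_records = index.size() // RECORD
```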

By tuning the software to exploit a combination of DRAM and NVRAM, and by reducing the full database to 458 gigabytes, the team moved toward greater accessibility, but the database was still beyond the reach of a very low-cost machine. Dr. Allen explains that there are two workarounds. One is to build much smaller databases, called marker databases, which retain only the most essential information for identifying which organisms are present. This approach gets the database down to 17 gigabytes, but there is a tradeoff: it is no longer possible to tag every read in order to separate all the knowns from the unknowns.
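
As a rough illustration of the marker-database idea, one could imagine pruning the full index down to only those k-mers that resolve unambiguously to a single species. The selection rule below is a simplifying assumption; the real reduction from 458 gigabytes to 17 gigabytes involves more careful marker selection.

```python
# Hypothetical pruning rule: keep only k-mers that map to exactly one species.
def build_marker_db(full_index):
    """full_index maps a k-mer key to the set of species containing it."""
    return {key: next(iter(species))
            for key, species in full_index.items()
            if len(species) == 1}  # drop k-mers shared across species
```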

The harder problem requires the much larger, complete database. That’s where the Catalyst machine comes in. The Cray CS300 cluster supercomputer was designed with expanded DRAM and fast, persistent NVRAM to be well suited to big data problems.

“The Catalyst cluster with 128 gigabytes of DRAM [per node] has been outstanding in terms of performance,” Allen reports. “We can put an entire database on flash drive, treat that as memory, and just cache what’s used in practice on DRAM and it works very well.”

Accommodating both small- and large-scale deployments creates new avenues for metagenomic sequencing. Appropriately downscaled to a single node of the Catalyst cluster, the software can be deployed on substantially lower-cost machines, making it possible for LMAT to be used for post-sequencing analysis in tandem with the sequencer.

“We can still have these very large searchable indexes that are stored on single compute nodes with the vision of lower-cost computers that can be widely distributed,” says Dr. Allen. “It doesn’t necessarily have to be a huge compute cluster in the cloud; it could potentially sit out in the field closer to where the sequencing is taking place, but you’d still have this efficient analysis tool that could be available on a relatively lower cost computing platform.”

At the other end, the software can also be scaled up for very large-scale analysis. In an upcoming paper, the team reports on how they used the full-scale approach to analyze the entire collection of Human Microbiome Project (HMP) data in about 26 hours. They did this by taking the searchable index and replicating it across the Catalyst cluster, with a copy on every flash drive.
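
Because every node carries its own full copy of the index on local flash, the scale-out run is embarrassingly parallel. The sketch below, using mpi4py, shows how a list of samples could simply be divided among nodes; the file names and the classify() stub are placeholders, not LMAT's actual interface.

```python
# Sketch of the scale-out run: each MPI rank works through its share of the
# samples against a node-local copy of the index (paths are hypothetical).
from mpi4py import MPI

def classify(sample, index_path):
    """Placeholder for the per-sample lookup step against the local index."""
    pass

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

with open("hmp_sample_list.txt") as f:
    samples = sorted(line.strip() for line in f if line.strip())

for sample in samples[rank::size]:  # round-robin split across nodes
    classify(sample, index_path="/flash/kmer_index.bin")  # node-local copy
```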
