Pathogens Can’t Hide from Novel HPC Approach

By Tiffany Trader

January 21, 2015

In the realm of pathogenic threats, there are the usual suspects – anthrax, botulism, tuberculosis – but in reality a whole host of bacterial and viral microbes can be problematic to human and animal health. Protecting the public against these threats falls under the domain of biosecurity. While staying ahead of Mother Nature is a daunting task, scientists are beginning to unlock the secrets of the microbial world thanks to powerful sequencing technology and advanced computing tools.

Metagenomic sequencing is an offshoot of traditional genomic sequencing that is emerging as a key enabler for biosecurity. This application area involves detecting and characterizing potentially dangerous pathogens and assessing the threat potential of the organisms for human health. In order for this research tool to be deployed more widely, however, there are serious data challenges that need to be addressed.

Scientists at Lawrence Livermore National Laboratory (LLNL) are on the cusp of a breakthrough that would bring this problem down to size to facilitate use cases at a range of scales. Led by bioinformatics scientist Jonathan Allen, the team developed a novel approach to metagenomic sequencing using flash drives as a supplemental memory source to more efficiently search very large datasets.

Dr. Allen explains that while conventional sequencing targets a known biological isolate, metagenomic sequencing is applied when the sample contains organisms or DNA of unknown origin. In this scenario, researchers take a biological mass and do a DNA extraction step to determine what DNA fragments can be recovered. “It’s a different tool in the toolbox,” says Allen. “From the pathogen or biosecurity perspective, it’s a last resort for when you don’t know what you might be dealing with.”

Because of the element of the unknown, metagenomic sequencing is orders of magnitude more challenging than conventional sequencing, says Allen. One reason is just the sheer abundance of microbial life. “Each sample has potentially hundreds or more organisms. In human clinical samples, some of the DNA may come from the host, a lot of it may be benign organisms. So sorting through all of that to understand the key functionally relevant portions of the sample is a major challenge,” he adds.

It’s one of the emerging data-intensive problems in life sciences. A single sequencing run can generate potentially billions of genetic fragments, and each of these unlabeled fragments needs to be compared with every reference genome independently. The team’s objective was to take this large dataset and provide a fast, efficient, and scalable way to perform an accurate assessment of what organisms and genes are present.

By creating a searchable index of everything that has previously been sequenced and attaching organizational information to each genomic sequence, researchers can build a hierarchical picture: certain fragments are conserved at the species level, some are conserved at the family level, and others may be unique to a particular isolate. They then search against all of that information to assess where a given fragment fits in the pantheon of previously seen data.
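The idea can be illustrated with a small sketch. In the toy example below, each indexed k-mer is labeled with the most specific taxonomic node shared by every genome that contains it, so a fragment unique to one isolate keeps an isolate-level label while a fragment shared across sibling species rises to the family level. The miniature taxonomy, the lca() helper, and the observation list are illustrative assumptions, not LMAT's actual data model.

```cpp
// Hedged sketch: label each indexed k-mer with the most specific taxonomic
// node shared by all genomes containing it. The toy taxonomy and data below
// are assumptions for illustration, not LMAT's real schema.
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// parent[node] walks toward the root of a tiny taxonomy (isolate -> species -> family -> root).
std::unordered_map<std::string, std::string> parent = {
    {"isolate_A1", "species_A"}, {"isolate_A2", "species_A"},
    {"species_A", "family_X"},   {"species_B", "family_X"},
    {"family_X", "root"}};

// Lowest common ancestor of two taxa, found by climbing toward the root.
std::string lca(const std::string& a, const std::string& b) {
    std::vector<std::string> path;
    for (std::string t = a; ; t = parent[t]) {
        path.push_back(t);
        if (t == "root") break;
    }
    for (std::string t = b; ; t = parent[t]) {
        for (const auto& p : path)
            if (p == t) return t;
        if (t == "root") break;
    }
    return "root";
}

int main() {
    // Fold in every taxon whose genome contains a given k-mer. A k-mer seen
    // only in isolate_A1 stays labeled at the isolate level; one shared by
    // species_A and species_B rises to their common family, family_X.
    std::unordered_map<std::string, std::string> kmer_label;
    std::vector<std::pair<std::string, std::string>> observations = {
        {"ACGTACGTACGT", "isolate_A1"},   // unique to one isolate
        {"TTGACCTTGACC", "species_A"},    // conserved within a species...
        {"TTGACCTTGACC", "species_B"}};   // ...and also seen in a sibling species
    for (const auto& [kmer, taxon] : observations) {
        auto it = kmer_label.find(kmer);
        kmer_label[kmer] = (it == kmer_label.end()) ? taxon : lca(it->second, taxon);
    }
    for (const auto& [kmer, label] : kmer_label)
        std::cout << kmer << " -> " << label << "\n";
}
```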

“That’s what makes this the classic data-intensive computing challenge,” he said. “We have these large query sets but then we also have this similarly-growing reference database as more and more isolates are being sequenced, we’ve got many different hundreds of potential isolates of a similar species that are being sequenced on a steady on-going basis, so we want to be able to take advantage of all of that new genetic diversity that’s being captured, so we can provide a more accurate assessment.”

The efforts of Allen and his Livermore colleagues – computer scientists Maya Gokhale and Sasha Ames and bioinformaticist Shea Gardner – led to the development of the Livermore Metagenomic Analysis Toolkit (LMAT), a custom reference database with a fast searchable index that addresses the scaling limitations of existing metagenomic classification methods.

When the team gets a new query sequence, they break it into its constituent elements, referred to as k-mers. A look-up key is then assigned to each one for the purpose of tracking where that short fragment has been seen before. One of the primary challenges is that the look-up table quickly becomes quite large: the original database required 620 gigabytes of DRAM, which limited its use to researchers who had access to large-memory machines.
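That first step can be sketched in a few lines: slide a window of length k along the read and pack each k-mer into a compact 64-bit integer that serves as its look-up key. The 2-bits-per-base encoding and the k value of 20 are illustrative choices, not LMAT's exact parameters.

```cpp
// Hedged sketch: break a query read into overlapping k-mers and turn each
// one into a 64-bit look-up key (2 bits per base). Parameter choices here
// are assumptions for illustration only.
#include <cstdint>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

constexpr size_t K = 20;  // k-mer length; must be <= 32 to fit in 64 bits

// Pack one k-mer into a 64-bit integer; returns nothing if the k-mer
// contains an ambiguous base such as 'N'.
std::optional<uint64_t> encode(const std::string& kmer) {
    uint64_t key = 0;
    for (char c : kmer) {
        uint64_t code;
        switch (c) {
            case 'A': code = 0; break;
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            case 'T': code = 3; break;
            default:  return std::nullopt;
        }
        key = (key << 2) | code;
    }
    return key;
}

// Produce the look-up keys for every overlapping k-mer in a read.
std::vector<uint64_t> read_to_keys(const std::string& read) {
    std::vector<uint64_t> keys;
    for (size_t i = 0; i + K <= read.size(); ++i) {
        if (auto key = encode(read.substr(i, K))) keys.push_back(*key);
    }
    return keys;
}

int main() {
    std::string read = "ACGTTGACCTGATTACAGGCATCGA";  // toy sequencing read
    for (uint64_t key : read_to_keys(read))
        std::cout << key << "\n";  // each key would index into the reference table
}
```

Every key produced this way is then looked up in the reference index, which is where the size of the table becomes the limiting factor.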

Although LLNL researchers have access to terabyte-plus DRAM machines, wider accessibility was hampered. The team looked at how they could reduce the size of the database and how they could fit it to architectures that would be more scalable and more affordable. The essential innovation of the project was the development of a data structure optimized to store the searchable index on flash drives as if it were in memory: when lookups are performed, the database is memory-mapped from the flash drive into a single address space and treated as if it resided in memory.
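The memory-mapping pattern is essentially what the POSIX mmap call provides, as the sketch below shows: a hypothetical index file on the flash drive is mapped into the process address space and read like an ordinary array, with the kernel's page cache keeping recently touched pages in DRAM. The file path and the flat 64-bit-integer layout are assumptions for illustration, not LMAT's real on-disk format.

```cpp
// Hedged sketch of the memory-mapping idea: map an index file stored on a
// flash drive into the address space and treat it as an in-memory array.
// The path and layout below are illustrative assumptions.
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const char* path = "/flash/lmat_index.bin";  // hypothetical index location
    int fd = open(path, O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { std::perror("fstat"); close(fd); return 1; }

    // Map the whole file read-only; pages are pulled from flash on demand
    // and cached in DRAM by the kernel, so lookups behave like array reads.
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { std::perror("mmap"); close(fd); return 1; }

    const uint64_t* table = static_cast<const uint64_t*>(base);
    size_t n_entries = st.st_size / sizeof(uint64_t);

    // A "lookup" is now just an index into the mapped region.
    if (n_entries > 0)
        std::printf("first entry: %llu\n",
                    static_cast<unsigned long long>(table[0]));

    munmap(base, st.st_size);
    close(fd);
    return 0;
}
```

Because only the pages that are actually touched end up resident, a node with modest DRAM can work against an index far larger than its physical memory.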

By tuning the software to exploit a combination of DRAM and NVRAM, and also reducing the full database size down to 458 gigabytes, the team was moving toward greater accessibility, but the database was still too large for a very low-cost machine. Dr. Allen explains there are two workarounds for this. One is to build much smaller databases called marker databases, which retain only the most essential information for identifying which fragments are present. This approach gets the database down to 17 gigabytes, but there is a tradeoff: it is no longer possible to tag every read in order to separate all the knowns from the unknowns.

The harder problem requires the much larger, complete database. That’s where the Catalyst machine comes in. The Cray CS300 cluster supercomputer was designed with expanded DRAM and fast, persistent NVRAM to be well suited to big data problems.

“The Catalyst cluster with 128 gigabytes of DRAM [per node] has been outstanding in terms of performance,” Allen reports. “We can put an entire database on flash drive, treat that as memory, and just cache what’s used in practice on DRAM and it works very well.”

Accommodating both small- and large-scale deployments creates new avenues for metagenomic sequencing. Appropriately downscaled to one node of the Catalyst cluster, the software can be deployed on substantially lower-cost machines, making it possible for LMAT to be used for post-sequencing analysis in tandem with the sequencer.

“We can still have these very large searchable indexes that are stored on single compute nodes with the vision of lower-cost computers that can be widely distributed,” says Dr. Allen. “It doesn’t necessarily have to be a huge compute cluster in the cloud; it could potentially sit out in the field closer to where the sequencing is taking place, but you’d still have this efficient analysis tool that could be available on a relatively lower cost computing platform.”

At the other end, the software can also be scaled up for very large-scale analysis. In an upcoming paper, the team reports on how they used the full-scale approach to analyze the entire collection of Human Microbiome Project (HMP) data in about 26 hours. They did this by taking the searchable index and replicating it across the Catalyst cluster, with a copy on every flash drive.
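The article does not spell out how the work was divided, but the replication-plus-partition pattern it describes can be sketched as follows: every node keeps an identical copy of the index on its local flash drive, and the query files are split among nodes so each one classifies its own slice against its local copy. The MPI harness, the file names, and the classify_file() stub are assumptions for illustration.

```cpp
// Hedged sketch of the scale-up pattern: identical index copy on every node's
// flash drive, query files partitioned across nodes. All names and paths are
// illustrative assumptions, not the team's actual workflow.
#include <mpi.h>
#include <cstdio>
#include <string>
#include <vector>

// Placeholder for running the classifier on one input file against the
// node-local, memory-mapped index copy.
void classify_file(const std::string& query_file, const std::string& local_index) {
    std::printf("classifying %s against %s\n",
                query_file.c_str(), local_index.c_str());
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Hypothetical list of query files (e.g., one per sequencing sample).
    std::vector<std::string> queries = {
        "sample_000.fastq", "sample_001.fastq", "sample_002.fastq",
        "sample_003.fastq", "sample_004.fastq", "sample_005.fastq"};

    // Same index file sits on every node's flash drive, so lookups stay local.
    const std::string local_index = "/flash/lmat_index.bin";

    // Round-robin partition: rank r takes files r, r + nprocs, r + 2*nprocs, ...
    for (size_t i = rank; i < queries.size(); i += nprocs)
        classify_file(queries[i], local_index);

    MPI_Finalize();
    return 0;
}
```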
