Pathogens Can’t Hide from Novel HPC Approach

By Tiffany Trader

January 21, 2015

In the realm of pathogenic threats, there are the usual suspects – anthrax, botulism, tuberculosis – but in reality there are a whole host of bacterial and viral pathogenic microbes that can be problematic to human and animal health. Protecting the public against these threats falls under the domain of biosecurity. While staying ahead of mother nature is a daunting task, scientists are beginning to unlock the secrets of the microbial world thanks to powerful sequencing technology and advanced computing tools.

Metagenomic sequencing is an offshoot of traditional genomic sequencing that is emerging as a key enabler for biosecurity. This application area involves detecting and characterizing potentially dangerous pathogens and assessing the threat potential of the organisms for human health. In order for this research tool to be deployed more widely, however, there are serious data challenges that need to be addressed.

Scientists at Lawrence Livermore National Laboratory (LLNL) are on the cusp of a breakthrough that would bring this problem down to size to facilitate use cases at a range of scales. Led by bioinformatics scientist Jonathan Allen, the team developed a novel approach to metagenomic sequencing using flash drives as a supplemental memory source to more efficiently search very large datasets.

Dr. Allen explains that while conventional sequencing targets a known biological isolate, metagenomic sequencing is applied when the sample contains organisms or DNA of unknown origin. In this scenario, researchers take a biological mass and do a DNA extraction step to determine what DNA fragments can be recovered. “It’s a different tool in the toolbox,” says Allen. “From the pathogen or biosecurity perspective, it’s a last resort for when you don’t know what you might be dealing with.”

Because of the element of the unknown, metagenomic sequencing is orders of magnitude more challenging than conventional sequencing, says Allen. One reason is the sheer abundance of microbial life. “Each sample has potentially hundreds or more organisms. In human clinical samples, some of the DNA may come from the host, a lot of it may be benign organisms. So sorting through all of that to understand the key functionally relevant portions of the sample is a major challenge,” he adds.

It’s one of the emerging data-intensive problems in life sciences. A single sequencing run can generate potentially billions of genetic fragments, and each of these unlabeled fragments needs to be compared against every reference genome independently. The team’s objective was to take this large dataset and provide a fast, efficient, and scalable way to accurately assess which organisms and genes are present.

By creating a searchable index of everything that’s been previously sequenced, and assigning organizational information to each genomic sequence, researchers can build a hierarchy showing that certain fragments are conserved at the species level, some are conserved at the family level, and others may be unique to a particular isolate. They then search against all of that information to assess where a given fragment fits in that pantheon of previously seen data.
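The hierarchical labeling idea can be illustrated with a lowest-common-ancestor computation over a toy taxonomy. This is a sketch of the concept only: the lineage tuples below are hypothetical stand-ins for real taxonomic paths, not LMAT's actual data model.

```python
def lowest_common_rank(lineages):
    """Given lineages as root-to-leaf tuples, return the deepest prefix
    shared by all of them -- the level at which a fragment is conserved."""
    shared = []
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        shared.append(level[0])
    return shared

# A fragment seen in two E. coli isolates is conserved at species level...
ecoli_k12 = ("Bacteria", "Enterobacteriaceae", "Escherichia", "E. coli", "K-12")
ecoli_o157 = ("Bacteria", "Enterobacteriaceae", "Escherichia", "E. coli", "O157:H7")
print(lowest_common_rank([ecoli_k12, ecoli_o157]))   # deepest shared rank: species

# ...while a fragment also seen in Salmonella is only conserved at family level.
salmonella = ("Bacteria", "Enterobacteriaceae", "Salmonella", "S. enterica", "LT2")
print(lowest_common_rank([ecoli_k12, salmonella]))   # deepest shared rank: family
```

A fragment unique to one isolate would share its full lineage only with itself, which is how isolate-specific markers fall out of the same scheme.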

“That’s what makes this the classic data-intensive computing challenge,” he said. “We have these large query sets, but then we also have this similarly growing reference database as more and more isolates are being sequenced. We’ve got many different hundreds of potential isolates of a similar species that are being sequenced on a steady, ongoing basis, so we want to be able to take advantage of all of that new genetic diversity that’s being captured, so we can provide a more accurate assessment.”

The efforts of Allen and his Livermore colleagues, computer scientists Maya Gokhale and Sasha Ames and bioinformaticist Shea Gardner, led to the development of the Livermore Metagenomic Analysis Toolkit (LMAT), a custom reference database with a fast searchable index that addresses the scaling limitations of existing metagenomic classification methods.

When the team gets a new query sequence, they break it into its constituent elements, referred to as k-mers. A look-up key is then assigned to each k-mer for the purpose of tracking where that short fragment has been seen before. One of the primary challenges is that the look-up table quickly becomes quite large: the original reference database occupied 620 gigabytes of DRAM, which limited its use to researchers with access to large-memory machines.
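The k-mer decomposition and look-up step can be sketched in a few lines. The fixed k, the dictionary-based table, and the taxon sets here are illustrative assumptions; LMAT's real index uses a far more compact encoding to cope with the sizes described above.

```python
def kmers(sequence, k=20):
    """Break a DNA read into its overlapping k-mers (constituent elements)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Toy look-up table mapping each k-mer to the taxa it has been seen in.
index = {}

def add_reference(sequence, taxon, k=20):
    """Index every k-mer of a reference genome under its taxon label."""
    for kmer in kmers(sequence, k):
        index.setdefault(kmer, set()).add(taxon)

def classify(read, k=20):
    """Collect every taxon whose reference shares a k-mer with the read."""
    hits = set()
    for kmer in kmers(read, k):
        hits |= index.get(kmer, set())
    return hits
```

The table grows with every indexed genome, which is exactly why a naive in-memory dictionary stops scaling long before the reference collection does.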

Although LLNL researchers have access to terabyte-plus DRAM machines, the memory requirement hampered wider accessibility. The team looked at how they could reduce the size of the database and fit it to architectures that would be more scalable and more affordable. The essential innovation of the project was a data structure optimized to store the searchable index on flash drives as if it were in memory: during lookups, the database is memory-mapped from the flash drive into a single address space and treated as if it were resident in DRAM.
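The memory-mapping trick can be sketched with Python's mmap module. The flat, fixed-width record layout here is a hypothetical simplification of the on-flash index, but the mechanism is the same: the file is mapped into the address space and the operating system pages in only the parts actually touched by a lookup.

```python
import mmap
import os
import struct

# Hypothetical flat index: 16-byte records (8-byte k-mer key, 8-byte payload),
# sorted by key so lookups can binary-search without loading the whole file.
RECORD = struct.Struct("<QQ")

def lookup(path, key):
    """Binary-search a sorted on-disk index through a memory mapping."""
    n = os.path.getsize(path) // RECORD.size
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        lo, hi = 0, n
        while lo < hi:
            mid = (lo + hi) // 2
            # Only the pages backing this record are faulted in from flash.
            k, v = RECORD.unpack_from(mm, mid * RECORD.size)
            if k == key:
                return v
            if k < key:
                lo = mid + 1
            else:
                hi = mid
    return None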

By tuning the software to exploit a combination of DRAM and NVRAM, and by reducing the full database to 458 gigabytes, the team moved toward greater accessibility, but the database was still too large for a very low-cost machine. Dr. Allen explains that there are two workarounds. One is to build much smaller databases, called marker databases, which retain only the most essential information for identifying which organisms are present. This approach gets the database down to 17 gigabytes, but there is a tradeoff: it is no longer possible to tag every read in order to separate all the knowns from the unknowns.
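The flavor of the marker-database tradeoff can be mimicked by keeping only k-mers that discriminate between taxa. The pruning rule below (keep a k-mer only if it maps to a single taxon) is a guess at the style of reduction for illustration, not LMAT's actual selection criterion.

```python
def build_marker_index(full_index):
    """Keep only k-mers seen in exactly one taxon: enough to flag which
    organisms are present, but reads hitting shared k-mers go unlabeled."""
    return {kmer: taxa for kmer, taxa in full_index.items() if len(taxa) == 1}

full = {
    "ACGTACGTAC": {"E. coli"},                  # discriminative: kept
    "TTTTTTTTTT": {"E. coli", "Salmonella"},    # shared: dropped
}
marker = build_marker_index(full)
```

A read composed entirely of shared k-mers matches nothing in the marker index, so it can no longer be tagged as a known, which is the loss Allen describes.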

The harder problem requires the much larger, complete database. That’s where the Catalyst machine comes in. The Cray CS300 cluster supercomputer was designed with expanded DRAM and fast, persistent NVRAM to be well suited to big data problems.

“The Catalyst cluster with 128 gigabytes of DRAM [per node] has been outstanding in terms of performance,” Allen reports. “We can put an entire database on flash drive, treat that as memory, and just cache what’s used in practice on DRAM and it works very well.”

Accommodating both small- and large-scale deployments creates new avenues for metagenomic sequencing. Appropriately downscaled to one node of the Catalyst cluster, the software can be deployed on substantially lower cost machines, making it possible for LMAT to be used for post-sequencing analysis in tandem with the sequencer.

“We can still have these very large searchable indexes that are stored on single compute nodes with the vision of lower-cost computers that can be widely distributed,” says Dr. Allen. “It doesn’t necessarily have to be a huge compute cluster in the cloud; it could potentially sit out in the field closer to where the sequencing is taking place, but you’d still have this efficient analysis tool that could be available on a relatively lower cost computing platform.”

At the other end, the software can also be scaled up for very large scale analysis. In an upcoming paper, the team reports on how they used the full-scale approach to analyze the entire collection of Human Microbiome Project (HMP) data in about 26 hours. They did this by taking the searchable index and replicating it across the Catalyst cluster with a copy on every flash drive.
