In the realm of pathogenic threats, there are the usual suspects – anthrax, botulism, tuberculosis – but in reality there is a whole host of bacterial and viral pathogens that can be problematic to human and animal health. Protecting the public against these threats falls under the domain of biosecurity. While staying ahead of Mother Nature is a daunting task, scientists are beginning to unlock the secrets of the microbial world thanks to powerful sequencing technology and advanced computing tools.
Metagenomic sequencing is an offshoot of traditional genomic sequencing that is emerging as a key enabler for biosecurity. This application area involves detecting and characterizing potentially dangerous pathogens and assessing the threat they pose to human health. Before this research tool can be deployed more widely, however, there are serious data challenges that need to be addressed.
Scientists at Lawrence Livermore National Laboratory (LLNL) are on the cusp of a breakthrough that would bring this problem down to size and enable use cases at a range of scales. Led by bioinformatics scientist Jonathan Allen, the team developed a novel approach to metagenomic analysis that uses flash drives as a supplemental memory source to search very large datasets more efficiently.
Dr. Allen explains that while conventional sequencing targets a known biological isolate, metagenomic sequencing is applied when the sample contains organisms or DNA of unknown origin. In this scenario, researchers take a biological mass and do a DNA extraction step to determine what DNA fragments can be recovered. “It’s a different tool in the toolbox,” says Allen. “From the pathogen or biosecurity perspective, it’s a last resort for when you don’t know what you might be dealing with.”
Because of the element of the unknown, metagenomic sequencing is orders of magnitude more challenging than conventional sequencing, says Allen. One reason is the sheer abundance of microbial life. “Each sample has potentially hundreds or more organisms. In human clinical samples, some of the DNA may come from the host, a lot of it may be benign organisms. So sorting through all of that to understand the key functionally relevant portions of the sample is a major challenge,” he adds.
It’s one of the emerging data-intensive problems in the life sciences. A single sequencing run can generate billions of genetic fragments, and each of these unlabeled fragments must then be compared independently against every reference genome. The team’s objective was to take this large dataset and provide a fast, efficient, and scalable way to accurately assess which organisms and genes are present.
By building a searchable index of everything that has been previously sequenced and attaching taxonomic information to each genomic sequence, researchers can capture a hierarchical organization: certain fragments are conserved at the species level, some are conserved at the family level, and others may be unique to a particular isolate. Each query is then searched against all of that information to assess where a given fragment fits within that body of previously seen data.
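To make the idea concrete, here is a minimal Python sketch of a taxonomy-aware k-mer index. The toy taxonomy, genome strings, and function names are illustrative assumptions rather than LMAT’s actual code; the key behavior is that a k-mer seen in multiple taxa gets pushed up to the lowest common ancestor of everything it has appeared in.

```python
# Illustrative sketch only; the toy taxonomy and k-mer length are assumptions.
K = 20  # example k-mer length

# Toy taxonomy: child -> parent (the root points to itself)
PARENT = {"root": "root", "family_A": "root",
          "species_1": "family_A", "species_2": "family_A"}

def lineage(node):
    """Path from a taxonomy node up to the root."""
    path = [node]
    while PARENT[node] != node:
        node = PARENT[node]
        path.append(node)
    return path

def lowest_common_ancestor(a, b):
    """Deepest taxonomy node shared by both lineages."""
    ancestors = set(lineage(a))
    for node in lineage(b):
        if node in ancestors:
            return node
    return "root"

def kmers(seq, k=K):
    """All overlapping k-length substrings of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def build_index(reference_genomes, k=K):
    """Map each k-mer to the most specific taxonomy node consistent with
    every genome it appears in: species-specific k-mers keep a species
    label, while broadly conserved ones get pushed up toward the root."""
    index = {}
    for taxon, genome in reference_genomes.items():
        for kmer in kmers(genome, k):
            if kmer in index:
                index[kmer] = lowest_common_ancestor(index[kmer], taxon)
            else:
                index[kmer] = taxon
    return index

# Toy usage: two short "genomes" that share one conserved stretch.
genomes = {"species_1": "ACGTACGTAATTCCGG", "species_2": "ACGTACGTAAGGTTCC"}
index = build_index(genomes, k=8)
# Shared 8-mers map to "family_A"; unique ones keep their species label.
```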
“That’s what makes this the classic data-intensive computing challenge,” he says. “We have these large query sets, but then we also have this similarly growing reference database. As more and more isolates are being sequenced, we’ve got hundreds of potential isolates of a similar species being sequenced on a steady, ongoing basis, so we want to be able to take advantage of all of that new genetic diversity that’s being captured and provide a more accurate assessment.”
The efforts of Allen and his Livermore colleagues, computer scientists Maya Gokhale and Sasha Ames and bioinformaticist Shea Gardner, led to the development of the Livermore Metagenomic Analysis Toolkit (LMAT), which pairs a custom reference database with a fast searchable index to address the scaling limitations of existing metagenomic classification methods.
When the team gets a new query sequence, they break it into its constituent elements, referred to as k-mers. A look-up key is then assigned to each k-mer for the purpose of tracking where that short fragment has been seen before. One of the primary challenges is that the look-up table quickly becomes very large: the original database required 620 gigabytes of DRAM, which limited its use to researchers who had access to large-memory machines.
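One common way to turn a k-mer into a compact look-up key is to pack its bases into an integer; whether LMAT uses exactly this encoding is an assumption, but the sketch below shows why the table still grows into the hundreds of gigabytes once billions of distinct k-mers are indexed.

```python
# Hedged sketch: 2-bit encoding of DNA bases packs a k-mer (k <= 32) into a
# single 64-bit integer that can serve as a look-up key.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_key(kmer):
    """Pack a k-mer into one integer, two bits per base."""
    key = 0
    for base in kmer:
        key = (key << 2) | ENCODE[base]
    return key

print(hex(kmer_key("ACGTACGTACGTACGTACGT")))  # one 20-mer -> one compact key

# Illustrative arithmetic: at roughly 12-16 bytes per entry (key plus
# taxonomy information), an index covering tens of billions of distinct
# k-mers lands in the hundreds-of-gigabytes range.
```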
Although LLNL researchers have access to terabyte-plus DRAM machines, that requirement was hampering wider accessibility. The team looked at how they could reduce the size of the database and fit it to architectures that would be more scalable and more affordable. The essential innovation of the project was a data structure optimized to store the searchable index on flash drives as if it were in memory: when doing lookups, the database is memory-mapped from the flash drive into a single address space and treated just as if it were in memory.
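The memory-mapping pattern itself is straightforward to sketch. The fixed-width record layout and binary-search lookup below are illustrative assumptions rather than LMAT’s actual on-disk format; the point is that the operating system pages in only the regions that are touched, so hot entries end up cached in DRAM while the bulk of the table stays on flash.

```python
# Illustrative sketch of memory-mapping an index stored on a flash drive;
# the record layout is an assumption, not LMAT's actual format.
import mmap
import struct

RECORD = struct.Struct("<QI")  # 8-byte k-mer key, 4-byte taxonomy id

def open_index(path):
    """Map the whole index file into the process address space, read-only."""
    f = open(path, "rb")
    return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def lookup(index_map, key):
    """Binary search over sorted, fixed-width records in the mapped file."""
    lo, hi = 0, len(index_map) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        k, taxid = RECORD.unpack_from(index_map, mid * RECORD.size)
        if k == key:
            return taxid
        if k < key:
            lo = mid + 1
        else:
            hi = mid
    return None  # k-mer not seen in any reference genome
```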
By tuning the software to exploit a combination of DRAM and NVRAM, and by reducing the full database size down to 458 gigabytes, the team was moving toward greater accessibility, but the database was still too large for a very low-cost machine. Dr. Allen explains that there are two workarounds. One is to build much smaller databases, called marker databases, which retain only the most essential information for identifying which organisms are present. This approach gets the database down to 17 gigabytes, but there is a tradeoff: it is no longer possible to tag every read in order to separate all the knowns from the unknowns.
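The marker idea can be illustrated as a simple filter over the full index; the selection rule shown here (keeping only k-mers that still resolve to a species-level node) is a simplified assumption standing in for LMAT’s actual marker selection.

```python
# Simplified illustration of a marker database: keep only the most
# discriminative k-mers and drop broadly conserved ones.
def build_marker_database(full_index, rank_of):
    """full_index maps k-mer -> taxonomy node; rank_of maps node -> rank."""
    return {kmer: node for kmer, node in full_index.items()
            if rank_of(node) == "species"}

# Toy usage: the family-level (conserved) k-mer is dropped, shrinking the table.
full = {"ACGTACGT": "family_A", "AATTCCGG": "species_1", "AAGGTTCC": "species_2"}
rank = {"family_A": "family", "species_1": "species", "species_2": "species"}
markers = build_marker_database(full, rank.get)  # keeps the two species-specific k-mers
```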
The harder problem requires the much larger, complete database. That’s where the Catalyst machine comes in. The Cray CS300 cluster supercomputer was designed with expanded DRAM and fast, persistent NVRAM to be well suited to big data problems.
“The Catalyst cluster with 128 gigabytes of DRAM [per node] has been outstanding in terms of performance,” Allen reports. “We can put an entire database on flash drive, treat that as memory, and just cache what’s used in practice on DRAM and it works very well.”
Accommodating both small- and large-scale deployments creates new avenues for metagenomic sequencing. Appropriately downscaled to a single node of the Catalyst cluster, the software can be deployed on substantially lower-cost machines, making it possible for LMAT to be used for post-sequencing analysis in tandem with the sequencer.
“We can still have these very large searchable indexes that are stored on single compute nodes with the vision of lower-cost computers that can be widely distributed,” says Dr. Allen. “It doesn’t necessarily have to be a huge compute cluster in the cloud; it could potentially sit out in the field closer to where the sequencing is taking place, but you’d still have this efficient analysis tool that could be available on a relatively lower cost computing platform.”
At the other end, the software can also be scaled up for very large-scale analysis. In an upcoming paper, the team reports using the full-scale approach to analyze the entire collection of Human Microbiome Project (HMP) data in about 26 hours. They did this by replicating the searchable index across the Catalyst cluster, with a copy on every node's flash drive.
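Because every node holds its own replica of the index, the scale-out pattern is embarrassingly parallel: the read set is split into independent chunks, one per node, and each chunk is classified against the local, memory-mapped copy. A minimal sketch of the partitioning step (the function name and chunking rule are illustrative assumptions):

```python
# Illustrative partitioning: one independent chunk of reads per node, each
# classified against that node's local copy of the index.
def split_reads(reads, n_nodes):
    """Round-robin split of the query reads into n_nodes roughly equal chunks."""
    return [reads[i::n_nodes] for i in range(n_nodes)]
```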