The importance of the human ‘microbiome’ – the large and diverse population of microbes inside each of us – has become increasingly apparent in recent years. From cancer to diabetes to mental health, the microbiome exerts powerful influences, and we are only now learning how to manage it more effectively to improve health.
It turns out exascale compute capacity may prove decisive in the effort to understand the microbiome. The Exascale Computing Project (ECP) recently posted a podcast with Lenny Oliker and Kathy Yelick, both of Lawrence Berkeley National Laboratory, outlining the computational challenges of, and approaches to, metagenomics analysis.
“It’s estimated that the human body has at least as many bacteria in its microbiome as human cells. This is a pretty significant community living inside each of us. To understand the behavior and application of this rich genomic community, we first have to learn to analyze what’s called the metagenome,” says Oliker (executive director), who, along with Yelick (principal investigator), runs ECP’s project ExaBiome: Exascale Solutions for Microbiome Analysis.
With respect to computation, genomic analysis is a departure from the traditional approach to simulation problems. “The initial high-level structure of the relationship between the different sequences or of the genomics is unknown,” Oliker said. “So that makes it much more difficult to parallelize, and it requires data structures that are much harder to handle at large scale: hash tables, histograms, graphs, and very sparse unstructured matrices. We also have to worry about dynamic load balancing. We have little locality and unpredictable communication, and the connections between the processors are arbitrary, so there’s irregularity in both space and time. Putting all of those things together creates a very complex computational problem, especially as we scale up toward the exascale regime.”
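To make the irregular-data-structure point concrete, here is a deliberately tiny Python sketch (illustrative only, not ExaBiome code) of a k-mer histogram: a hash table whose size and access pattern are determined entirely by the input reads, which is exactly why such workloads are hard to partition and load-balance in advance.

```python
from collections import defaultdict

def kmer_histogram(reads, k):
    """Count occurrences of every length-k substring (k-mer) across all reads.

    The table's size and the order of key accesses depend on the input
    data, so neither memory use nor communication can be predicted up
    front -- the irregularity Oliker describes.
    """
    counts = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return dict(counts)

# Two short toy reads; "TTA" appears in both.
histogram = kmer_histogram(["GATTACA", "ATTACAG"], k=3)
```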
The ExaBiome project aims to provide scalable tools for three core computational problems in metagenomics.
“The first is genome assembly, which is a problem of turning raw sequencing data into genomes,” Yelick said. “So in this case, we’re examining sequencing data that comes from, say, a scoop of soil or from the human microbiome—where all the microbes are mixed together and we’re trying to then turn those into complete genomes for each species or something that at least has much longer strands so that we can find out what genes they have, what proteins they code for, and so on. The second problem is what we call protein clustering. That’s exploring the relationships between the different proteins that come from those genes. And then the third problem is a comparative metagenome analysis where you have maybe two different samples of soil from different points in time or from nearby locations and you’re trying to understand the similarities or how they may change over time.”
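The first of those problems, assembly, can be illustrated with a toy greedy assembler: extract k-mers from the mixed-together reads, then chain k-mers whose (k-1)-length suffix matches another k-mer's prefix into one longer contig. This is a minimal sketch of the general idea only, assuming error-free reads and a single linear path; real metagenome assemblers such as ExaBiome's handle errors, repeats, and many species at once.

```python
def assemble(reads, k):
    """Toy assembler: merge k-mers along (k-1)-length overlaps into one contig.

    Assumes error-free reads covering a single linear sequence with
    unique overlaps -- far simpler than real metagenome assembly.
    """
    kmers = set()
    for r in reads:
        for i in range(len(r) - k + 1):
            kmers.add(r[i:i + k])

    # Index each k-mer by its (k-1)-length prefix for suffix->prefix chaining.
    by_prefix = {km[:-1]: km for km in kmers}
    # A starting k-mer is one whose prefix is no other k-mer's suffix.
    suffixes = {km[1:] for km in kmers}
    contig = next(km for km in kmers if km[:-1] not in suffixes)

    # Extend while some k-mer's prefix overlaps the contig's end.
    while contig[-(k - 1):] in by_prefix:
        nxt = by_prefix.pop(contig[-(k - 1):])  # consume to avoid cycles
        contig += nxt[-1]
    return contig

# Two overlapping toy reads reassemble into one longer strand.
contig = assemble(["GATTAC", "TTACAG"], k=4)
```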
The three core computational problems involve very fine-grained, irregular communication patterns. “For that reason, we deploy one-sided communication and partitioned global address space languages, at least in the assembly problem,” Yelick said. “And we work closely with other parts of ECP on the software support for this communication in the algorithms.”
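The owner-computes idea behind a PGAS-style distributed hash table can be sketched in a few lines. This is a hedged single-process Python illustration (not UPC++, MPI, or any ECP software): hashing a k-mer determines which rank's partition owns it, so any process can insert with a single one-sided write into the owner's table, with no matching receive on the other side.

```python
import zlib

NRANKS = 4  # pretend we have 4 ranks, each owning one partition

def owner(kmer, nranks=NRANKS):
    """Map a k-mer to its owning rank via a stable hash (crc32)."""
    return zlib.crc32(kmer.encode()) % nranks

# Each rank's local slice of the global hash table.
partitions = [dict() for _ in range(NRANKS)]

def remote_insert(kmer, count):
    """Stand-in for a one-sided put: write directly into the owner's
    partition without the owner participating in the transfer."""
    part = partitions[owner(kmer)]
    part[kmer] = part.get(kmer, 0) + count

for km in ["GAT", "ATT", "TTA", "TAC", "ACA"]:
    remote_insert(km, 1)
```

In a real PGAS implementation the partitions live in different processes' memories and `remote_insert` becomes a remote put or atomic update; the key point is that the hash function alone decides where data lands, which matches the arbitrary, irregular communication pattern Yelick describes.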
Link to ECP article and podcast: https://exascaleproject.org/providing-exascale-solutions-for-the-assembly-and-analysis-of-metagenomic-data/