The importance of the human microbiome – all of the bacteria living in and on us – to maintaining health is well established and is being widely explored. In fact, the human gut microbiome contains about 100 times as many genes as the human genome of its host. These genes not only serve the bacteria but also carry out important functions for the host, such as modulating immune development, synthesizing amino acids, and harvesting energy from food.
Recently, researchers from the University of California, San Diego (UCSD) and the J. Craig Venter Institute (JCVI) used machine learning to teach a computer to distinguish between healthy and unhealthy gut microbiomes. The new approach shows promise for quickly deciphering microbiome genomes, predicting related health issues, and guiding therapy development. It turns out that accomplishing this task is a big data problem cum HPC project.
A paper on the work – Using Machine Learning to Identify Major Shifts in Human Gut Microbiome Protein Family Abundance in Disease – was presented at the IEEE International Conference on Big Data last month, and an article describing the effort was posted on the UCSD website last week. Notably, the software for the study (developed by Weizhong Li, associate professor at JCVI) was run on the data-intensive Gordon supercomputer at the San Diego Supercomputer Center (SDSC) and used 180,000 core-hours – roughly equivalent to running a PC 24 hours a day for about 20 years.
Data from 30 healthy people (using sequencing data from the National Institutes of Health's Human Microbiome Project) were combined with data from 30 samples from people suffering from the autoimmune Inflammatory Bowel Disease (IBD), including those with ulcerative colitis and with ileal or colonic Crohn's disease. The mix of roughly 600 billion DNA bases was then fed into the Gordon supercomputer to reconstruct the relative abundance of the bacterial species present – for instance, how many E. coli are present compared to other bacterial species. Ultimately, the technique demonstrated high accuracy for these data sets.
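To make the idea of relative abundance concrete, here is a toy illustration (the species names and read counts are hypothetical, not from the study): per-species sequencing read counts are normalized into fractions of the whole community.

```python
# Toy illustration of relative abundance: convert hypothetical per-species
# read counts into fractions of the total community.
counts = {"E. coli": 1200, "B. fragilis": 4800, "F. prausnitzii": 6000}

total = sum(counts.values())
rel_abundance = {species: n / total for species, n in counts.items()}

for species, frac in rel_abundance.items():
    print(f"{species}: {frac:.0%}")
# With these made-up counts, E. coli makes up 10% of the community.
```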
Here’s an excerpt from the paper’s abstract: “We use machine learning to analyze results obtained previously from computing relative abundance of ~10,000 KEGG orthologous protein families in the gut microbiome of a set of healthy individuals and IBD patients. We develop a machine learning pipeline, involving the Kolmogorov-Smirnov test, to identify the 100 most statistically significant entries in the KEGG database. Then we use these 100 as a training set for a Random Forest classifier to determine the ~5% of the KEGGs which are best at separating disease and healthy states. Lastly, we developed a Natural Language Processing classifier of the KEGG description files to predict KEGG relative over- or under-abundance. As we expand our analysis from 10,000 KEGG protein families to one million proteins identified in the gut microbiome, scalable methods for quickly identifying such anomalies between health and disease states will be increasingly valuable for biological interpretation of sequence data.”
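The first two stages of the pipeline described above – per-feature Kolmogorov-Smirnov screening followed by Random Forest ranking – can be sketched roughly as follows. This is a minimal illustration on synthetic abundance data, not the authors' actual code; the sample sizes, feature counts, and effect sizes are stand-ins.

```python
# Sketch of a KS-test + Random Forest feature-selection pipeline on
# synthetic KEGG-style abundance data (toy stand-in for the real study).
import numpy as np
from scipy.stats import ks_2samp
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

n_healthy, n_ibd, n_keggs = 30, 30, 1000   # toy scale; the paper uses ~10,000 families
healthy = rng.lognormal(0.0, 1.0, size=(n_healthy, n_keggs))
ibd = rng.lognormal(0.0, 1.0, size=(n_ibd, n_keggs))
ibd[:, :20] *= 3.0                         # spike 20 families to be differentially abundant

# Stage 1: two-sample KS test per protein family; keep the 100 most significant.
pvals = np.array([ks_2samp(healthy[:, j], ibd[:, j]).pvalue for j in range(n_keggs)])
top100 = np.argsort(pvals)[:100]

# Stage 2: Random Forest on the selected features; feature importances rank
# which families best separate the disease and healthy states.
X = np.vstack([healthy[:, top100], ibd[:, top100]])
y = np.array([0] * n_healthy + [1] * n_ibd)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

ranked = top100[np.argsort(clf.feature_importances_)[::-1]]
print("Most discriminative KEGG indices:", ranked[:5])
```

On this synthetic data, the spiked families dominate both the KS ranking and the forest's importance scores, mirroring the paper's goal of isolating the small fraction of families that distinguish the two states.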
In their discussion section, the authors note that by looking at the function of specific disease-associated microbial communities, it should be possible to better identify targets for future intervention (e.g., small-molecule development to target a specific gene pathway). Using machine learning methods greatly reduces the time required to investigate the immense amounts of data generated by metagenomic sequencing.
Link to the paper: http://lsmarr.calit2.net/repository/IEEE_BigData_KEGGs_CAMERA_READY.pdf
Link to the UCSD article: http://www.sdsc.edu/News%20Items/PR20170118_microbiome.html