Sorting out computational biology’s future is tricky. It likely won’t be singular. First-principles, mechanistic simulation has so far proven challenging but could eventually become game-changing. Meanwhile, pattern recognition and matching in massive ‘omics’ datasets have been extremely productive and are likely to remain dominant for now. Now, an MIT professor and colleagues write that two characteristics of biological datasets – low metric entropy and low fractal dimension – suggest compressive algorithms hold promise for the future.
Bonnie Berger, professor in the Computer Science and Artificial Intelligence Laboratory (CSAIL), the Department of Mathematics, and the Department of Electrical Engineering and Computer Science (EECS) at MIT, has co-written an interesting look-ahead article in the August Communications of the ACM, “Computational Biology in the 21st Century: Scaling with Compressive Algorithms.” Berger and coauthors Noah Daniels and Y. William Yu cover a lot of ground, first setting the stage with an overview of biological datasets and the algorithms commonly used to mine them, then tackling why and how compressive algorithms are likely to become important.
“On the one hand, the scale and scope of data should allow new insights into genetic and infectious diseases, cancer, basic biology, and even human migration patterns. On the other hand, researchers are generating datasets so massive that it has become difficult to analyze them to discover patterns that give clues to the underlying biological processes,” write the MIT researchers.
We are entering the age of compressive algorithms, contend the authors, which make use of a completely different paradigm for the structure of biological data. As an example of compressive genomics, they write that the new approach “provides orders-of-magnitude runtime improvements to BLAST nucleotide and protein search; these runtime improvements increase as databases grow.”
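To make the idea concrete, here is a minimal, hypothetical Python sketch of the coarse-to-fine pattern behind compressive acceleration: cluster redundant sequences under representatives, run an initial search against the representatives only, then refine within any cluster whose representative scores well. The similarity scorer, thresholds, and greedy clustering here are illustrative stand-ins, not the authors’ actual BLAST-based method.

```python
from difflib import SequenceMatcher  # crude stand-in for a real alignment scorer


def similarity(a, b):
    """Rough similarity ratio in [0, 1]; a placeholder for an alignment score."""
    return SequenceMatcher(None, a, b).ratio()


def compress(sequences, threshold=0.9):
    """Greedy clustering: each sequence either becomes a new representative
    or is linked to an existing representative it closely resembles."""
    reps, links = [], {}
    for seq in sequences:
        for i, rep in enumerate(reps):
            if similarity(seq, rep) >= threshold:
                links.setdefault(i, []).append(seq)  # store as cluster member
                break
        else:
            reps.append(seq)  # no close representative found; start a cluster
    return reps, links


def compressive_search(query, reps, links, coarse=0.5, fine=0.8):
    """Coarse pass over representatives only; fine pass inside the clusters
    of representatives that pass the coarse threshold."""
    hits = []
    for i, rep in enumerate(reps):
        if similarity(query, rep) >= coarse:
            candidates = [rep] + links.get(i, [])
            hits += [s for s in candidates if similarity(query, s) >= fine]
    return hits


if __name__ == "__main__":
    db = ["ACGTACGTAA", "ACGTACGTAT", "TTGGCCAATT", "ACGTACGAAA"]
    reps, links = compress(db)
    print(compressive_search("ACGTACGTAC", reps, links))
```

Because the coarse pass touches only the representatives, its cost grows with the amount of non-redundant data in the database – roughly, its metric entropy – rather than with the raw database size, which is why the reported speedups improve as redundant databases grow.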
In their conclusion, the authors note:
“The approach of compressive acceleration, and its demonstrated ability to scale with the metric entropy of the data, while providing orthogonal benefits to many other useful indexing techniques, is an important tool for coping with the deluge of data. The extension of this compressive acceleration approach to metagenomics, NGS read mapping, and chemogenomics suggests its flexibility. Likewise, compressive storage for these applications can be shown to scale with the information-theoretic entropy of the dataset.”
Here’s a link to the ACM article: http://cacm.acm.org/magazines/2016/8/205052-computational-biology-in-the-21st-century/fulltext