“Sailfish” is a new computational method out of Carnegie Mellon University and the University of Maryland that speeds up RNA sequencing analysis by a factor of 20 or greater.
The method – dubbed Sailfish after the super-speedy fish – estimates gene expression levels much faster than previous methods, so a job that once took hours can now be completed in a few minutes without loss of accuracy. Details of the research have been published online in the journal Nature Biotechnology.
Gene expression is the process by which the information encoded in genes (stretches of DNA) is used to make functional products such as RNA and proteins, which in turn shape traits such as blue eyes or a predisposition toward cancer. Gene expression occurs in all known life – it’s how the genetic code stored in DNA is “interpreted.”
Along with major advances in genomics, gene expression analysis has grown in importance for both basic researchers and medical practitioners. There now exist large stores of RNA-seq data that scientists are using to re-analyze experiments. However, the analysis is notoriously time-intensive, with an average run taking about 15 hours.
Fifteen hours might not seem like a lot, but when you multiply that by 100 experiments, it adds up, says paper co-author Carl Kingsford, an associate professor in CMU’s Lane Center for Computational Biology, adding, “With Sailfish, we can give researchers everything they got from previous methods, but faster.”
An organism’s genetic makeup is static, but the activity of individual genes varies greatly over time, explains the writeup from Carnegie Mellon. Gene expression is the key – it’s a research area that holds tremendous promise for disease prevention. Although gene activity can’t be measured directly, it can be inferred by tracking RNA, large molecules that perform vital roles in the coding, decoding, regulation, and expression of genes.
To observe RNA, scientists typically use a method called RNA-seq, which has proved useful in genomic medicine for analyzing certain cancers. The process yields short segments of RNA, called “reads.” Previous methods measured RNA molecules by first reconstructing them through a step called mapping, in which reads are placed back at their original positions in the larger molecules, like pieces in a puzzle. The research team eliminated this time-consuming step by allocating parts of the reads to different types of RNA molecules. Essentially, each read provides several up-votes for a given molecule. By leaving out the mapping step, Sailfish performs its RNA analysis 20 to 30 times faster than previous methods.
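The up-vote idea can be illustrated with a toy sketch in Python. This is not the Sailfish implementation: it assumes that the "parts of the reads" are short fixed-length substrings (k-mers), that a part shared by several molecules splits its vote evenly among them, and that raw vote totals stand in for abundance. The sequences, the names `tx_A` and `tx_B`, the value of `K`, and the helper functions are all invented for illustration.

```python
from collections import defaultdict

K = 4  # toy k-mer length (assumption); chosen small so the example is easy to trace


def kmers(seq, k=K):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(transcripts, k=K):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return index


def tally_votes(reads, index, k=K):
    """Let each k-mer of each read cast one vote, split evenly among
    the transcripts that contain that k-mer. No read is ever mapped
    to a position."""
    votes = defaultdict(float)
    for read in reads:
        for km in kmers(read, k):
            owners = index.get(km, ())
            for name in owners:
                votes[name] += 1.0 / len(owners)
    return dict(votes)


# Hypothetical toy data, invented for this example.
transcripts = {
    "tx_A": "ACGTACGGTCA",
    "tx_B": "TTGACCGTAGG",
}
reads = ["ACGTACG", "CCGTAGG", "GTACGGT"]

index = build_index(transcripts)
votes = tally_votes(reads, index)
print(votes)
```

With this toy data, `tx_A` collects 8 votes and `tx_B` collects 4, because two of the three reads come from `tx_A` and the one shared k-mer (`CGTA`) contributes half a vote to each molecule. The point of the sketch is only that abundance can be estimated by counting, with no step that places a read at a position.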
The numerical approach will be more familiar to computer scientists than biologists, Kingsford notes, but Sailfish is more robust and better able to tolerate errors. Errors that would disrupt a mapping are not a problem for the “+1” approach. The result is increased accuracy.
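A small Python sketch, under simplifying assumptions, shows why a sequencing error matters less when counting votes. Here exact substring search is a deliberately strict stand-in for mapping (real mapping tools tolerate some mismatches), the votes come from fixed-length substrings (k-mers) of the read, and the sequences and function names are hypothetical.

```python
K = 4  # toy k-mer length (assumption), chosen only to keep the example small

# Hypothetical sequences, invented for this example.
transcript = "ACGTACGGTCATTGCA"
clean_read = "GTACGGTCATTG"  # exact copy of part of the transcript
noisy_read = "GTACGGTCATAG"  # same read with one miscalled base near the end


def exact_map(read, reference):
    """Strict stand-in for mapping: start position of an exact match, or -1."""
    return reference.find(read)


def count_votes(read, reference, k=K):
    """Count how many k-mers of the read also occur in the reference."""
    ref_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    return sum(1 for i in range(len(read) - k + 1) if read[i:i + k] in ref_kmers)


print(exact_map(clean_read, transcript), count_votes(clean_read, transcript))
print(exact_map(noisy_read, transcript), count_votes(noisy_read, transcript))
```

In this sketch, the single wrong base makes the strict placement fail outright, while the noisy read still casts 7 of its 9 possible votes: only the k-mers that overlap the error are lost.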
“By facilitating frequent reanalysis of data and reducing the need to optimize parameters, Sailfish exemplifies the potential of lightweight algorithms for efficiently processing sequencing reads,” the authors write in the paper abstract.
The Sailfish code is available for download at http://www.cs.cmu.edu/~ckingsf/software/sailfish/.