This blog post was contributed by Guilherme Coppini, Bioinformatician, and Javier Quilez, Associate Director – Bioinformatics, at G42 Healthcare; Chris Seymour, Vice President of Advanced Platform Development at Oxford Nanopore; and Doruk Ozturk, Senior Solutions Architect, Container Technologies, Michael Mueller, Senior Solutions Architect, Genomics, and Stefan Dittforth, Senior Solutions Architect, Healthcare, at AWS.
Oxford Nanopore sequencers enable direct, real-time analysis of long DNA or RNA fragments. They work by monitoring changes to an electrical current as nucleic acids pass through a protein nanopore. The resulting signal is decoded into the DNA or RNA sequence by compute-intensive algorithms called basecallers. This blog post presents benchmarking results for two Oxford Nanopore basecallers, Guppy and Dorado, on AWS. The benchmarking project was a collaboration between G42 Healthcare, Oxford Nanopore Technologies, and AWS.
We ran Guppy and Dorado on 20 different Amazon Elastic Compute Cloud (Amazon EC2) instance types with GPU accelerators. The top performance was achieved on the p4d.24xlarge instance type, which delivered 490 million samples/second with Dorado and 250 million samples/second with Guppy. A sample is one measurement of the current flowing through the nanopore. Typically, the current signal is sampled at 10 times the rate at which bases pass through the nanopore. For example, when 400 bases per second (bps) pass through the nanopore, the sampling rate is 4,000 samples per second. Dorado outperformed Guppy by a factor of 3.8x when performing methylation calling with the 5-hydroxymethylcytosine group (5hmCG). Our cost evaluation showed that the g5.xlarge instance delivers the lowest cost for basecalling a whole human genome (WHG) with Guppy.
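To make the relationship between samples and bases concrete, here is a minimal Python sketch of the conversion. It assumes the typical ratio of 10 samples per base quoted above; the function and constant names are ours and purely illustrative, and the derived base rates are rough approximations rather than measured results.

```python
# Illustrative conversion between sequencing speed (bases/s) and
# signal sampling rate (samples/s). The factor of 10 is the typical
# oversampling ratio mentioned in the text; real runs may differ.
SAMPLES_PER_BASE = 10


def sampling_rate(bases_per_second: float,
                  samples_per_base: int = SAMPLES_PER_BASE) -> float:
    """Samples per second produced by one pore at a given translocation speed."""
    return bases_per_second * samples_per_base


def approx_bases_per_second(samples_per_second: float,
                            samples_per_base: int = SAMPLES_PER_BASE) -> float:
    """Approximate bases per second a basecaller processes at a given sample throughput."""
    return samples_per_second / samples_per_base


if __name__ == "__main__":
    # Example from the text: 400 bases/s through the pore.
    print(sampling_rate(400))  # 4000

    # Peak throughputs reported for p4d.24xlarge, converted to approximate bases/s.
    for tool, samples_s in [("Dorado", 490e6), ("Guppy", 250e6)]:
        print(f"{tool}: ~{approx_bases_per_second(samples_s):,.0f} bases/s")
```

Under that assumption, 490 million samples/second corresponds to roughly 49 million bases per second, and 250 million samples/second to roughly 25 million bases per second.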