In November 2020, the COVID-19 Genomics UK Consortium (COG-UK) won the HPCwire Readers’ Choice Award for Best HPC Collaboration for its CLIMB-COVID sequencing project. Launched in March 2020, CLIMB-COVID has now resulted in the sequencing of over 675,000 coronavirus genomes – an increasingly critical task as variants like Delta threaten the tenuous prospect of a return to normalcy in much of the world. Now, a new paper from dozens of UK researchers on behalf of the COG-UK Consortium has elaborated on how CLIMB-COVID became one of the most meaningful anti-COVID tools in advanced computing.
The COG-UK Consortium, which was established in March 2020, consists of the UK’s National Health Service (NHS), four public health agencies, the Wellcome Sanger Institute and more than 20 academic partners. From its inception, it was squarely aimed at viral sequencing for UK SARS-CoV-2 samples, which were being collected by the NHS. The Consortium quickly developed CLIMB-COVID, which they outlined in the graphic below.
“A network of sampling sites (e.g. hospitals and testing centers) produce samples and sample metadata which are received by a regional sequencing center,” the authors explained. “The sample is extracted and sequenced and a locally run bioinformatics pipeline generates both a consensus viral genome and an alignment of sequenced read fragments against the SARS-CoV-2 reference genome. The consensus sequence and alignments are uploaded via secure file transfer to be stored on CLIMB-COVID. Metadata is securely transferred over HTTPS to an API that transforms metadata into a model to be stored in a database on CLIMB-COVID. The core quality control pipeline executes every day to integrate newly uploaded samples and metadata into the single canonical dataset of all uploaded sequences.”
“Once this pipeline is finished,” they continued, “it notifies downstream analysis pipelines through a messaging protocol to generate analysis artifacts like phylogenetic trees. Downstream analysis pipelines also automatically deposit genomes in public databases[.]”
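The flow the authors describe (merge new uploads into one canonical dataset, then notify downstream consumers) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the CLIMB-COVID codebase: the `Sample` record, the `passes_qc` rule and its 50% ambiguity threshold, and the in-process `Notifier` that stands in for the unnamed messaging protocol.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Sample:
    """Hypothetical canonical record: one uploaded genome plus its metadata."""
    sample_id: str
    collection_date: str  # ISO 8601 date string
    sequence: str


def passes_qc(sequence: str, max_n_fraction: float = 0.5) -> bool:
    """Illustrative QC rule (an assumption): reject empty or mostly ambiguous genomes."""
    if not sequence:
        return False
    return sequence.upper().count("N") / len(sequence) <= max_n_fraction


def integrate(canonical: Dict[str, Sample],
              metadata: Dict[str, dict],
              sequences: Dict[str, str]) -> List[str]:
    """Merge uploads that have both metadata and a QC-passing sequence.

    Returns the sorted IDs that were newly added to the canonical dataset.
    """
    added = []
    for sample_id, meta in metadata.items():
        seq = sequences.get(sample_id)
        if seq is None or sample_id in canonical or not passes_qc(seq):
            continue
        canonical[sample_id] = Sample(sample_id, meta["collection_date"], seq)
        added.append(sample_id)
    return sorted(added)


class Notifier:
    """In-process stand-in for the messaging layer between pipelines."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[List[str]], None]] = []

    def subscribe(self, callback: Callable[[List[str]], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, new_ids: List[str]) -> None:
        for callback in self._subscribers:
            callback(new_ids)


def run_daily(canonical: Dict[str, Sample],
              metadata: Dict[str, dict],
              sequences: Dict[str, str],
              notifier: Notifier) -> List[str]:
    """One integration pass: update the dataset, then notify only if it changed."""
    added = integrate(canonical, metadata, sequences)
    if added:
        notifier.publish(added)
    return added
```

In this sketch, a downstream job such as tree building would register itself with `notifier.subscribe(...)` and receive the list of newly integrated sample IDs after each pass. Running the pass a second time on the same uploads adds nothing and sends no notification.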
In essence, CLIMB-COVID serves three core functions: producing data through a “distributed, democratized network for sequencing SARS-CoV-2 genomes”; collecting data by providing a uniform system for transferring it; and integrating data into a single dataset through harmonization. CLIMB-COVID also accomplishes most of this in near real time, enabling rapid insights for public health workers.
The computing functions of CLIMB-COVID are supported by a variety of HPC resources, including cloud computing from the CLIMB-BIG-DATA project and the BlueBEAR HPC infrastructure at the University of Birmingham. BlueBEAR is equipped with hundreds of Intel-based nodes spanning Broadwell, Haswell and Cascade Lake, as well as two Nvidia-based nodes equipped with P100 GPUs. Together, these cloud and on-premises resources manage the substantial storage and computing needs of the CLIMB-COVID infrastructure.
“The core problem we faced when tasked to build this infrastructure is one of data interoperability,” the authors wrote. “With geographically dispersed sequencing operations and the four public agencies all producing data with a wide variety of different techniques and platforms, it was necessary to deploy an infrastructure to collate this data into a single, consistent, canonical data set, available for everyone within the consortium and for consistent public dissemination.”
CLIMB-COVID was developed as a hub model in the existing Cloud Infrastructure for Microbial Bioinformatics (CLIMB) facility, which had been helping microbiologists analyze genomic data for six years prior. By establishing CLIMB-COVID as a “walled garden” within the CLIMB infrastructure, the developers ensured that the data would be rapidly accessible and that uploaders retained authority over the data they generated.
The collective genomic data is fed into the Grapevine phylogenetics pipeline, which builds a tree “that captures the evolutionary relationships between the sampled viruses, placing UK sequences in the global context.” Over time, the data load has become so enormous that the researchers have had to restrict the phylogenetics pipeline to just the most recent six months – and more recently, the last hundred days. “To cope with the scale,” they wrote, “a new phylogenetics pipeline is under development.” That pipeline is called Phylopipe. Various other visualizations are also produced, including geospatial visualizations that can identify clusters and dominant variants within an area.
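Restricting the tree-building input to a recent window amounts to a date filter. The sketch below is a simple illustration of that idea, not Grapevine's actual logic: the `(sample_id, collection_date)` record shape and the `recent_window` function name are assumptions.

```python
from datetime import date, timedelta
from typing import List, Tuple

# Hypothetical record shape: (sample identifier, collection date).
Record = Tuple[str, date]


def recent_window(records: List[Record], today: date, days: int = 100) -> List[Record]:
    """Keep only records collected within the last `days` days, inclusive."""
    cutoff = today - timedelta(days=days)
    return [rec for rec in records if cutoff <= rec[1] <= today]
```

With `days=100` this mirrors the hundred-day restriction described above; a value of roughly 180 would approximate the earlier six-month window.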
The developers also faced challenges in data security: a sample’s identifier within the healthcare system could not be shared, which forced them to relabel the samples with a “COG ID” that could exist in more public databases and which, initially, identified the date and county of sample collection (“a necessary but unfortunate compromise”). Later, contractual arrangements were ironed out to incorporate more sample metadata in a manner that could only be accessed by public health agencies.
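The general idea of relabeling, in which a shareable label replaces an identifier that cannot leave the healthcare system, can be sketched as a private lookup table. This is an illustration only: the `Relabeler` class and the `PUB-000001` label format are invented here and do not reflect the real COG ID scheme, which the article does not specify.

```python
from typing import Dict


class Relabeler:
    """Assign a stable public label to each internal identifier.

    The mapping itself stays private; only the public label is shared.
    """

    def __init__(self, prefix: str = "PUB") -> None:
        self._prefix = prefix
        self._private_map: Dict[str, str] = {}

    def public_id(self, internal_id: str) -> str:
        # Reuse the existing label so repeated uploads stay consistent.
        if internal_id not in self._private_map:
            next_number = len(self._private_map) + 1
            self._private_map[internal_id] = f"{self._prefix}-{next_number:06d}"
        return self._private_map[internal_id]
```

Because the labels in this sketch are sequential, they reveal nothing about the internal identifier, and anyone holding the private mapping can still link a public record back to its source.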
“CLIMB is probably still the largest dedicated compute infrastructure for microbial genomics in the world,” the researchers concluded. “The shared nature of the platform was critical for immediate sharing and analysis across the four nations in the UK. Within 3 days of booting the first virtual machine, we were receiving uploads of sequence data. Within a week, 260 complete genomes from 7 sequencing centers had been uploaded and processed by our inbound distribution pipeline – already more genomes than any other country in the world other than China at the time. Within 2 months, COG-UK was responsible for half of all the international SARS-CoV-2 sequences deposited into [open-access genomic database] GISAID.”
“Building this kind of decentralized sequencing system has not been possible before now,” said Samuel Nicholls, a researcher in the University of Birmingham’s Institute of Microbiology and Infection and first author on the paper. “By designing that system, we have shown how genetic sequencing can be used as a vital tool in any public health response. … We have never seen such a coordinated, sustained effort to generate real-time genomic surveillance data at this scale and pace, and this is why the UK is world-leading in the genomic sequencing of SARS-CoV-2.”
To learn more, read the paper, which was published in the July 2021 issue of Genome Biology as “CLIMB-COVID: continuous integration supporting decentralized sequencing for SARS-CoV-2 genomic surveillance”.