Computational biology, particularly when it combines HPC and AI, has taken the spotlight during the pandemic as pharmaceutical companies and research institutes raced to conduct in silico research to understand the virus and develop effective drugs against it. Now, GSK is opening a window into its computational biology research, describing in a new blog post how it worked with AI supercomputing firm Cerebras Systems to power unprecedented epigenomic models.
Cerebras made headlines in 2019 when it launched the largest chip ever built: the AI-focused, 1.2-trillion-transistor Wafer-Scale Engine (WSE). This chip later powered Cerebras’ CS-1 system, which was first deployed at Argonne National Laboratory for Covid research (and which Cerebras claimed, at the time, was the fastest AI supercomputer in the world). Since then, Cerebras has launched both the WSE-2 and the WSE-2-based CS-2 system (which was also deployed at Argonne for Covid research).
GSK was another customer of the CS-1 system, as detailed by the blog post from GSK’s Kim Branson (senior vice president and global head of AI/ML), Meredith Trotter and Stephen Young (both AI/ML researchers), as well as Natalia Vassilieva, director of product for ML at Cerebras.
The work discussed revolves around epigenomics, the study of the mechanisms that control the expression (or lack of expression) of specific genes in the human body. “We need to understand the epigenome to help us understand the genetic data we have in databases,” the authors wrote. “These biobanks give us clues about which genes may be involved in a disease, and the epigenetics help us understand which cell types (i.e. skin, eyes, liver) a gene may be expressed in. This information along with other data helps us work out what our medicine should do, which genes it should target to hopefully treat a disease.”
The problem, of course: the human epigenome is immense, requiring extraordinary computational resources to model or study at a high level with conventional techniques. “Fortunately, AI gives us a shortcut,” the authors said. “We have enough real-world examples of the effects of epigenomics to teach a computer to do the same thing, creating a model that can then be used to predict many important biological processes.”
So the team repurposed the BERT neural network models to create the cutely named EBERT, short for “epigenomic BERT.” Using the same underlying mechanisms as other neural network applications like translation, EBERT predicts biological structures. But even with this shortcut, running EBERT on such large datasets would be prohibitively difficult on traditional computing infrastructure, so GSK turned to Cerebras’ hardware.
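To make the language-model analogy concrete, here is a minimal Python sketch of how a DNA sequence might be split into "words" and partially hidden, the way BERT-style models are commonly trained on text by predicting masked tokens. The overlapping k-mer scheme, the `kmer_tokenize` and `mask_tokens` helpers, the example sequence and the 15% masking rate are all illustrative assumptions borrowed from common BERT practice, not details confirmed by GSK's post.

```python
import random


def kmer_tokenize(sequence, k=3):
    """Split a DNA string into overlapping k-mers (a hypothetical scheme)."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens behind [MASK].

    Returns the masked token list and a dict mapping each masked
    position to the original token a model would be asked to predict.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Assumed masking rate; always mask at least one token.
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = "[MASK]"
    return masked, targets


# Made-up example sequence, purely for illustration.
tokens = kmer_tokenize("ACGTTGCAAC", k=3)
masked, targets = mask_tokens(tokens)
print(tokens)
print(masked)
print(targets)
```

In this sketch, a model would see the `masked` list and be scored on how well it recovers the entries in `targets`, which is the same fill-in-the-blank idea used for text.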
GSK had purchased a CS-1 system for internal use back in Q4 2020. Using the CS-1, the team cut the EBERT training process to 2.5 days, versus an estimated 24 days on a 16-node GPU cluster. “The training speedup afforded by the Cerebras system enabled us to explore architecture variations, tokenization schemes and hyperparameter settings in a way that would have been prohibitively time and resource intensive on a typical GPU cluster,” wrote the authors of the research paper.
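As a back-of-the-envelope check on those figures, the short Python sketch below works out the implied speedup. The 2.5-day and 24-day numbers come from the article (the GPU figure is itself an estimate); the variable names are illustrative.

```python
# Training times reported in the article (the GPU figure is an estimate).
cs1_days = 2.5
gpu_cluster_days = 24.0

# Implied speedup of the CS-1 over the 16-node GPU cluster.
speedup = gpu_cluster_days / cs1_days

# Whole CS-1 training runs that fit into the time of one GPU-cluster run.
runs_per_gpu_run = int(gpu_cluster_days // cs1_days)

print(f"Implied speedup: {speedup:.1f}x")
print(f"CS-1 runs per GPU-cluster run: {runs_per_gpu_run}")
```

That works out to a speedup of roughly 9.6x, or about nine complete training runs in the time a single GPU-cluster run was estimated to take, which is consistent with the authors' point about being able to explore more variations.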
The researchers say that EBERT, after training, “achieved the highest prediction accuracy on four of the 13 datasets in an industry benchmark called ENCODE-DREAM.” The model ranked third overall on the benchmark’s leaderboard, and the researchers say the results are “very promising.”
Next, the researchers will perform the same work on the much newer CS-2 system, which will be delivered to GSK this quarter. The CS-2 system promises to double throughput relative to the CS-1 and even allow training of a larger version of EBERT (the version used in this research was EBERTBASE, while the larger version is, appropriately, EBERTLARGE).
“AI plays a key role at GSK and we have invested heavily in the intersection of human genetics, functional genomics and AI,” the authors wrote. “AI is what allows us to analyze and understand the data from genetic databases and this means we can take a more predictive approach. Strong evidence has shown that drug targets with genetic validation are twice as likely to succeed.”
Cerebras was featured on HPCwire just a month ago for its work with Argonne National Laboratory to model the replication of SARS-CoV-2.
About the research
The research covered in this article is described in more detail in a paper, “Epigenomic language models powered by Cerebras,” written by Meredith V. Trotter, Cuong Q. Nguyen, Stephen Young, Rob T. Woodruff and Kim M. Branson.