There are hundreds of millions of sequenced proteins and counting—but only 170,000 have had their structures solved by researchers, bottlenecking our understanding of proteins and their functions across organisms’ genomes. Now, researchers led by Oak Ridge National Laboratory (ORNL) and the Georgia Institute of Technology have applied supercomputer-powered deep learning to quickly predict the structures and functions of tens of thousands more proteins.
“We’re now dealing with the amount of data that astrophysicists deal with, all because of the genome sequencing revolution,” said Ada Sedova, one of the researchers on the project, in an interview with ORNL. “We want to be able to use high-performance computing to take that sequencing data and come up with useful inferences to narrow the field for experiments. We want to quickly answer questions such as ‘what does this protein do, and how does it affect the cell? How can we harness proteins to achieve goals such as making needed chemicals, medicines and sustainable fuels, or to engineer organisms that can help mitigate the effects of climate change?’”
The novel pipeline used in the research applies deep learning tools such as SAdLSA (short for “sequence alignments from deep-learning of structural alignments”), which predicts protein structures from known structures that are even vaguely (≥10 percent) similar, and AlphaFold 2, a DeepMind tool that predicts and models 3D protein structures.
“SAdLSA can detect distantly related proteins that may or may not have the same function,” said Jerry Parks, a computational chemist with ORNL and lead for the research group. “Combine that with AlphaFold, which provides a 3D structural model of the protein, and you can analyze the active site to determine which amino acids are doing the chemistry and how they contribute to the function.”
These intensive tools were deployed on Summit, which (for now) remains the most powerful publicly ranked supercomputer in the United States at 148.6 Linpack petaflops. “This is a technology that is difficult for many research groups to just spin up,” Sedova said. “We hope to make it more accessible now that we’ve formatted it for Summit.”
The research team focused on the full protein sets (proteomes) of four microbes, totaling some 20,000 proteins, along with the 24,000 proteins found in sphagnum moss. All of the subjects were chosen for their functions: the microbes for their ability to help manufacture plastics or break down metals, and the moss for its ability to store large amounts of carbon in bogs.
“With these kinds of tools in our tool belt that are both structure-based and deep learning-based, this resource can help give us information about these proteins of unknown function — sequences that have no matches to other sequences in the entire repository of known proteins,” Sedova said. “This unlocks a lot of new knowledge and potential to address national priorities through bioengineering. For instance, there are potentially many enzymes with useful functions that have not yet been discovered.”
To learn more, read ORNL’s reporting on the research.
Header image: a protein that helps to control sulfide use in methane-producing microorganisms. Image courtesy of Ada Sedova/ORNL.