Around the world, supercomputers are sifting through billions of molecules in a desperate search for a viable therapeutic to treat COVID-19. Those molecules are pulled from enormous databases of known compounds, ranging from preexisting drugs to plants and other natural substances. But now, researchers at the University of Washington are using supercomputing power to revisit a decades-old concept that would allow them to design a completely new drug from the ground up.
This approach – called de novo protein design – works by linking amino acids together to create specific proteins. Thus far, de novo design has only been used for a few drugs that are still undergoing trials. In large part, de novo design has been stymied by the extreme difficulty of predicting how a chain of amino acids will fold, which makes it hard to predict the protein's full three-dimensional shape and other drug-critical properties.
At the University of Washington’s Institute for Protein Design, David Baker – a professor of biochemistry and head of the institute – applied supercomputing to tackle this roadblock. Baker and his colleagues developed methods for the prediction of proteins’ folded forms and for the rapid design of targeted protein binders. The researchers use computer simulations to generate a library of candidates, after which the most promising candidates are tested in-depth in further simulations and wet labs.
For the last six months, the Baker Lab has been using this approach to zero in on COVID-19, predicting the folded shapes of millions of proteins and then matching them with various parts of the SARS-CoV-2 virus. This massive undertaking requires correspondingly massive computing – and for that, the researchers turned to Stampede2 at the Texas Advanced Computing Center (TACC). Stampede2 is a Dell EMC system with Intel Xeon Phi CPUs rated at 10.7 Linpack petaflops, which placed it 21st on the most recent Top500 list of the world’s most powerful supercomputers.
For their COVID-19 efforts, the team started by testing 20,000 “scaffold” proteins – starting points for drug design – each of which possesses more than a thousand possible orientations, with each orientation tested around a thousand times: in total, 20 billion interactions to test. The best million candidates from these went on to the second stage, sequence design, where the “scaffold” is covered in amino acids, with 20 possibilities at each position.
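The scale of the first two stages can be checked with some quick arithmetic. The sketch below is only a back-of-the-envelope illustration: it treats "more than a thousand" and "around a thousand" as exactly 1,000, and the 60-position protein length used for the sequence-design estimate is a hypothetical value chosen for illustration, not a figure from the Baker Lab.

```python
# Stage 1: docking. Figures are the rounded numbers quoted in the article.
scaffolds = 20_000                  # scaffold proteins tested
orientations_per_scaffold = 1_000   # "more than a thousand", rounded down
trials_per_orientation = 1_000      # "around a thousand"

docking_interactions = (
    scaffolds * orientations_per_scaffold * trials_per_orientation
)
print(f"{docking_interactions:,}")  # 20,000,000,000


def sequence_space(positions: int, residues_per_position: int = 20) -> int:
    """Number of distinct sequences when each position has 20 choices."""
    return residues_per_position ** positions


# Stage 2: sequence design. The length of 60 is an assumption for illustration.
print(f"{sequence_space(60):.2e}")  # 1.15e+78
```

Even for a short protein, the number of possible sequences is far too large to enumerate, which is why each stage passes only its best candidates on to the next.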
In the third stage, the best hundred thousand protein candidates are forwarded to Agilent, a DNA synthesis firm, which returns physical DNA samples encoding those proteins so that the Baker Lab can test them against the real-life virus. The team then looks at the results and mutates individual amino acids on the proteins to see if docking performance improves or worsens. After that, the proteins undergo a barrage of other tests and modifications, eventually resulting in the 50 promising leads that have been found so far.
“TACC has a lot of computing power and that has been really helpful for us,” said Brian Coventry, a PhD student working on the research, in an interview with TACC’s Aaron Dubrow. “Everything we do is purely parallel. We’re able to rapidly test 20 million different designs and the calculations don’t need to talk to each other.”
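The kind of workload Coventry describes, where each design is evaluated on its own and no calculation depends on another, can be sketched as a simple parallel map. This is a toy illustration rather than the Baker Lab's software: `score_design` is a hypothetical stand-in that just counts hydrophobic residues, the example sequences are made up, and a thread pool is used only to keep the sketch portable (a real campaign would spread the work across many processes or cluster jobs).

```python
from concurrent.futures import ThreadPoolExecutor

HYDROPHOBIC = "AILMFVW"


def score_design(sequence: str) -> int:
    """Hypothetical stand-in score: count of hydrophobic residues.

    This is not a docking energy; it only gives each task some work to do.
    """
    return sum(sequence.count(residue) for residue in HYDROPHOBIC)


def score_all(designs, workers=4):
    """Score every design independently; tasks never communicate."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so scores[i] belongs to designs[i].
        return list(pool.map(score_design, designs))


designs = ["MKVLA", "GGSGG", "AILMF"]   # made-up example sequences
scores = score_all(designs)
best_score, best_design = max(zip(scores, designs))
print(scores, best_design)
```

Because no task waits on another, adding workers shortens the wall-clock time without changing the results, which is what makes this style of calculation a good fit for a large machine.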
“Our goal for the next pandemic will be to have computational methods in place that, coupled with high performance computing centers like TACC, will be able to generate high affinity inhibitors within weeks of determination of the pathogen genome sequence,” Baker said. “To get to this stage will require continued research and development, and centers like TACC will play a critical role in this effort as they do in scientific research generally.”
Header image: antiviral protein binders (blue) targeting the spike proteins of the coronavirus. Image courtesy of Ian Haydon, Institute for Protein Design.
TACC’s Aaron Dubrow has also reported on this research.