It seems, at this moment, that the very ground beneath biology is shaking. A torrent of papers has, in recent months, pushed the limits of artificial intelligence (AI) for predicting protein structures and protein-protein interactions.

In 2020, the DeepMind team traveled from their black steel building, near King’s Cross in London, to the 14th CASP (Critical Assessment of Techniques for Protein Structure Prediction) competition, a contest held every two years. Their model, AlphaFold2, shattered all expectations, achieving a median GDT_TS (roughly, the percentage of a protein’s residues that are placed correctly when compared to a ‘true,’ experimentally determined structure) of 92.4 out of 100. The results were published in Nature the following year.1
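
For readers who want to see what goes into a GDT_TS number, here is a minimal sketch in Python. It assumes the predicted and experimental structures have already been superimposed and that each residue is represented by one C-alpha coordinate; the real CASP evaluation searches over many superpositions to find the best score, which this sketch does not attempt. The function name and the toy coordinates are illustrative, not taken from any CASP software.

```python
import math

def gdt_ts(pred_ca, ref_ca, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS on a 0-100 scale.

    pred_ca, ref_ca: equal-length lists of (x, y, z) C-alpha coordinates in
    angstroms, assumed to be ALREADY superimposed (an assumption of this sketch).
    """
    if len(pred_ca) != len(ref_ca) or not ref_ca:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Distance between each predicted residue and its experimental counterpart.
    dists = [math.dist(p, r) for p, r in zip(pred_ca, ref_ca)]
    # Fraction of residues within each distance cutoff, averaged over the cutoffs.
    fractions = [sum(d <= c for d in dists) / len(dists) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

# Toy example: four residues whose predicted positions are off by 0.5, 1.5, 3.0 and 9.0 angstroms.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
errors = [0.5, 1.5, 3.0, 9.0]
predicted = [(x, y + e, z) for (x, y, z), e in zip(reference, errors)]
print(gdt_ts(predicted, reference))  # 56.25
```

A perfect prediction scores 100. In the toy case, one residue lies within 1 Å, two within 2 Å, and three within both 4 Å and 8 Å, which averages to 56.25.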

Upon seeing the results, many in the field called the protein-folding problem ‘solved.’ Mohammed AlQuraishi, a systems biologist at Columbia University, titled his blog post on AlphaFold’s achievement “It feels like one’s child has left home.”2

CASP14 was just two years ago. Since then, a veritable army of researchers has jumped on the AI train, applying their own ideas to expand the technology’s potential in a whirlwind of progress.

Better structures

CASP14 was a watershed moment in AI protein structure prediction, but the true revolution began several years earlier. Jinbo Xu, a professor at the Toyota Technological Institute at Chicago, used a convolutional neural network—a type of deep learning model—in 2016 to predict the tertiary structure of hundreds of proteins.3 At that time, “the improvement was really significant for single protein structure prediction,” he says. “In CASP13, this deep learning model improved prediction quality from less than 40 points to more than 60 points.”

In early 2020, the first AlphaFold model was published in Nature. It, too, used a neural network to predict structures, achieving a template modelling (TM) score of 0.7 or higher for 24 of the 43 free-modelling domains in CASP13.4 The following year, David Baker’s laboratory at the University of Washington released its own model, RoseTTAFold, which could rapidly generate “accurate protein-protein complex models from sequence information alone.”5
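
The 0.7 figure quoted for the first AlphaFold is a TM-score, a 0-to-1 measure of how closely a predicted fold matches the experimental one. As a rough illustration, the sketch below computes it from per-residue distances using the standard length-dependent scale d0 = 1.24 (L - 15)^(1/3) - 1.8. It assumes the two structures are already aligned and superimposed, whereas the full metric takes the maximum over superpositions. The 0.5 Å floor on d0 for short proteins is an assumption of this sketch, and the function names are illustrative.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0, in angstroms."""
    if target_length > 15:
        d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    else:
        d0 = 0.0
    # Assumption of this sketch: floor d0 at 0.5 so short proteins do not
    # produce a tiny or negative scale.
    return max(d0, 0.5)

def tm_score(distances, target_length):
    """TM-score for ONE fixed superposition.

    distances: per-residue C-alpha distances (angstroms) for the aligned residues.
    target_length: number of residues in the experimental (target) structure.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    d0 = tm_d0(target_length)
    # Each aligned residue contributes between 0 and 1; unaligned residues contribute 0.
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length

# A 100-residue protein predicted perfectly scores 1.0 ...
print(tm_score([0.0] * 100, 100))
# ... and one where every residue is off by exactly d0 scores 0.5.
print(tm_score([tm_d0(100)] * 100, 100))
```

Because d0 grows with protein length, the same error in angstroms is penalized less in a large protein than in a small one, which is what makes scores comparable across targets of different sizes.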

Moving from modest results to stunning accuracy, such as that achieved with AlphaFold2, was not easy. Most prediction models, even today, rely upon two broad classes of data: protein sequences and protein structures. It’s still challenging to predict a natural protein’s structure with high accuracy, says Xu, if that protein does “not have any sequence homologs” (similar, conserved proteins) “in the database.” In 2021, however, Xu and colleagues showed, in Nature Machine Intelligence, that a large neural network, with only sequence data, could predict the structure of more than half of ‘hard test’ proteins with more than 80 percent precision and without relying upon co-evolution information.6

Today, AlphaFold2 remains the dominant model for structure predictions, but other groups have expanded upon the work. The model is now routinely used to ‘hallucinate,’ or dream up, entirely new proteins with functions not found in nature. For a recent study, researchers developed a large language model, much like DALL-E, the text-to-image tool, to generate de novo proteins. A software tool called ProteinMPNN could do so in about one second, without any expert training.7 By coupling this tool with AlphaFold, researchers could rapidly generate proteins, simulate their structures, and refine the approach to find proteins with desired properties.

This achievement was, in many ways, the dawn of a new era in AI for biology: using algorithms not to solve structures for existing proteins, but rather to dream up entirely new possibilities.

Touching proteins

The human genome holds about 20,000 protein-coding genes. It’s thought that there are more than a hundred thousand unique protein-protein interactions, or PPIs, in any given cell.8 Mapping this complexity is a grand challenge in biology, and another problem uniquely suited to AI tools.

Unfortunately, predicting PPIs is more challenging than predicting the binding of single proteins, says Felix Wong, a postdoctoral fellow at MIT. “Even a small molecule could be dozens of atoms, and figuring out where it hits a protein is complicated,” he adds. “There are, possibly, dozens of binding pockets in a single protein.”

AlphaFold currently only predicts a single snapshot of protein structures. But, within a living cell, proteins twist and bend when they come in contact with other proteins. A more useful predictive tool, then, would generate a ‘range’ of potential structures.

Still, many groups have applied AI models specifically to protein interactions and complexes. Last year, DeepMind released AlphaFold-Multimer, a model for multi-chain protein complexes.9 It achieved decent improvements over baseline AlphaFold methods. It works as follows: first, build a multiple sequence alignment for the complex to infer evolutionary relationships; then, predict the structure of the complex using basically the same deep learning method as AlphaFold2. An open-source Python package, called AlphaPulldown, can also be used to quickly run the AlphaFold-Multimer model.10
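
One common way to build an alignment ‘for the complex’ is to pair rows from each chain’s alignment that come from the same species and join them end to end, so that co-variation between the two chains becomes visible to the model. The toy sketch below illustrates only that pairing idea. It is not the AlphaFold-Multimer or AlphaPulldown code, the function name is hypothetical, and the species labels and sequences are made up; real pipelines also have to choose among several candidate sequences per species and keep unpaired rows.

```python
def pair_alignments_by_species(msa_a, msa_b):
    """Pair aligned rows for chains A and B that share a species label.

    msa_a, msa_b: lists of (species, aligned_sequence) tuples.
    Returns a list of (species, concatenated_sequence) in chain A's order.
    Simplification of this sketch: only the first row per species is used.
    """
    first_b = {}
    for species, seq in msa_b:
        first_b.setdefault(species, seq)  # keep the first hit for each species

    paired, seen = [], set()
    for species, seq_a in msa_a:
        if species in first_b and species not in seen:
            paired.append((species, seq_a + first_b[species]))
            seen.add(species)
    return paired

# Made-up alignments for two interacting chains.
chain_a = [("E. coli", "MKT-A"), ("S. enterica", "MKTLA"), ("B. subtilis", "MRT-A")]
chain_b = [("S. enterica", "GHW"), ("E. coli", "GH-"), ("P. aeruginosa", "GQW")]

for species, row in pair_alignments_by_species(chain_a, chain_b):
    print(species, row)
# E. coli MKT-AGH-
# S. enterica MKTLAGHW
```

Species found in only one of the two alignments (B. subtilis and P. aeruginosa here) drop out of the paired block, which is one reason complexes with few shared homologs are harder to predict.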

Similar models for protein complexes have been used to study the E. coli proteome, for instance, and decipher structures for challenging protein clusters, including those in the cytochrome c biogenesis system.11

As AI models improve for PPIs, Xu intends to use them to run virtual drug screenings. “If we had a very good algorithm to predict [protein-protein interactions],” he says, “then we can do virtual screening of antibodies.” A recent study by Wong and Aarti Krishnan, another postdoctoral fellow at MIT, suggested, however, that AlphaFold-based docking models are not currently able to accurately predict protein-antibiotic interactions, even though antibiotics are small molecules with far fewer atoms than antibodies.12

Still, promising advances are on the horizon. AI models can be used to directly improve docking tools, such as AutoDock, DOCK, LeDock or FlexAID, to rapidly screen small molecules against proteins. A recent tool called EquiBind uses specific assumptions about a protein’s geometry and couples them with machine learning models to speed up drug-protein interaction testing,13 which may hold promise in bringing a more deep learning-guided approach to docking.

A key limitation for future advances, says Krishnan, is data. “We want to use experimental training datasets to improve machine learning models,” she explains. It’d be particularly useful to have, for example, more cryo-electron microscopy structures for proteins in complex with ligands, which may improve training datasets for models that are currently trained mostly on isolated proteins.

References

1. Jumper J. et al. Highly accurate protein structure prediction with AlphaFold. Nature (2021).

2. AlQuraishi M. AlphaFold2 @ CASP14: “It feels like one’s child has left home.” Blog post (2020).

3. Wang S. et al. Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model. PLOS Computational Biology (2017).

4. Senior A.W. et al. Improved protein structure prediction using potentials from deep learning. Nature (2020).

5. Baek M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science (2021).

6. Xu J., McPartlon M. & Li J. Improved protein structure prediction by deep learning irrespective of co-evolution information. Nature Machine Intelligence (2021).

7. Wicky B.I.M. et al. Hallucinating symmetric protein assemblies. Science (2022).

8. Hu X. et al. Deep learning frameworks for protein–protein interaction prediction. Computational and Structural Biotechnology Journal (2022).

9. Evans R. et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv (2022).

10. Yu D. et al. AlphaPulldown—a python package for protein–protein interaction screens using AlphaFold-Multimer. Bioinformatics (2022).

11. Gao M. et al. AF2Complex predicts direct physical interactions in multimeric proteins with deep learning. Nature Communications (2022).

12. Wong F., Krishnan A. et al. Benchmarking AlphaFold-enabled molecular docking predictions for antibiotic discovery. Molecular Systems Biology (2022).

13. Stärk H. et al. EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction. arXiv (2022).