DeepMind adds a diffusion engine to latest protein-folding software

[Image: Prediction of the structure of a coronavirus Spike protein from a virus that causes the common cold. Credit: Google DeepMind]

Most of the activities that go on inside cells—the activities that keep us living, breathing, thinking animals—are handled by proteins. They allow cells to communicate with each other, run a cell’s basic metabolism, and help convert the information stored in DNA into even more proteins. And all of that depends on the ability of the protein’s string of amino acids to fold up into a complicated yet specific three-dimensional shape that enables it to function.

Up until this decade, understanding that 3D shape meant purifying the protein and subjecting it to a time- and labor-intensive process to determine its structure. But that changed with the work of DeepMind, one of Google’s AI divisions, which released AlphaFold in 2021, and with a similar academic effort that followed shortly afterward. The software wasn’t perfect; it struggled with larger proteins and didn’t offer high-confidence solutions for every protein. But many of its predictions turned out to be remarkably accurate.

Even so, these structures only told half of the story. To function, almost every protein has to interact with something else—other proteins, DNA, chemicals, membranes, and more. And, while the initial version of AlphaFold could handle some protein-protein interactions, the rest remained black boxes. Today, DeepMind is announcing the availability of version 3 of AlphaFold, which has seen parts of its underlying engine either heavily modified or replaced entirely. Thanks to these changes, the software now handles various additional protein interactions and modifications.

Changing parts

The original AlphaFold relied on two underlying software functions. One of those took evolutionary limits on a protein into account. By looking at the same protein in multiple species, you can get a sense for which parts are always the same, and therefore likely to be central to its function. That centrality implies that they’re always likely to be in the same location and orientation in the protein’s structure. To do this, the original AlphaFold found as many versions of a protein as it could and lined up their sequences to look for the portions that showed little variation.
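
As a rough illustration of the idea of lining up sequences and looking for portions with little variation, here is a minimal Python sketch. It is not AlphaFold's code: the function names, the toy sequences, and the 0.9 threshold are all assumptions made for the example, and it presumes the sequences have already been aligned to the same length (with "-" marking gaps).

```python
from collections import Counter

def column_conservation(aligned):
    """For each alignment column, return the fraction of sequences
    that carry the most common residue (gaps never count as a match)."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores

def conserved_positions(aligned, threshold=0.9):
    """Indices of columns whose conservation meets the (assumed) threshold."""
    return [i for i, s in enumerate(column_conservation(aligned)) if s >= threshold]

# Hypothetical, hand-made "same protein in four species" example.
toy = ["MKTAYIA", "MKSAYLA", "MKTAYVA", "MRTAYIA"]
print(conserved_positions(toy))
```

Positions that come out of a check like this are the ones that, per the reasoning above, are most likely to sit in the same place in the folded structure.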

Doing so, however, is computationally expensive since the more proteins you line up, the more constraints you have to resolve. In the new version, the AlphaFold team still identified multiple related proteins but switched to largely performing alignments using pairs of protein sequences from within the set of related ones. This probably isn’t as information-rich as a multi-alignment, but it’s far more computationally efficient, and the lost information doesn’t appear to be critical to figuring out protein structures.
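
To make "alignments using pairs of protein sequences" a bit more concrete, the sketch below scores every pair in a small set with a textbook global alignment (Needleman-Wunsch). This is a swapped-in classroom technique, not the procedure AlphaFold 3 uses, and the scoring values and toy sequences are assumptions chosen for the example.

```python
from itertools import combinations

def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score for two sequences, keeping only one
    row of the dynamic-programming table in memory at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def pairwise_scores(sequences):
    """Score every unordered pair of sequences in the set."""
    return {
        (i, j): global_alignment_score(sequences[i], sequences[j])
        for i, j in combinations(range(len(sequences)), 2)
    }

# Hypothetical related sequences.
related = ["MKTAY", "MKAY", "MKTAY"]
print(pairwise_scores(related))
```

Each pairwise comparison is an independent, fixed-size problem, which is part of why working with pairs is easier to handle than resolving one large multi-sequence alignment.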

Using these alignments, a separate software module figured out the spatial relationships among pairs of amino acids within the target protein. Those relationships were then translated into spatial coordinates for each atom by code that took into account some of the physical properties of amino acids, like which portions of an amino acid could rotate relative to others, etc.

In AlphaFold 3, the prediction of atomic positions is handled by a diffusion module, which is trained by being given both a known structure and versions of that structure where noise (in the form of shifting the positions of some atoms) has been added. This allows the diffusion module to take the inexact locations described by relative positions and convert them into exact predictions of the location of every atom in the protein. It doesn’t need to be told the physical properties of amino acids, because it can figure out what they normally do by looking at enough structures.

(DeepMind had to train on two different levels of noise to get the diffusion module to work: one in which the locations of atoms were shifted while the general structure was left intact and a second where the noise involved shifting the large-scale structure of the protein, thus affecting the location of lots of atoms.)
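
The two kinds of noise described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the function names, the Gaussian noise, the sigma values, and the choice to model the large-scale case as one shared displacement of a block of atoms. The article does not describe DeepMind's actual noise procedure at this level of detail.

```python
import random

def jitter_atoms(coords, sigma, rng):
    """Small-scale noise: move every atom independently by a little."""
    return [tuple(c + rng.gauss(0.0, sigma) for c in atom) for atom in coords]

def shift_segment(coords, start, stop, sigma, rng):
    """Large-scale noise: move a whole block of atoms by one shared offset,
    so local geometry inside the block is preserved."""
    offset = tuple(rng.gauss(0.0, sigma) for _ in range(3))
    return [
        tuple(c + o for c, o in zip(atom, offset)) if start <= i < stop else atom
        for i, atom in enumerate(coords)
    ]

def mean_squared_error(pred, target):
    """Sum of squared coordinate errors, averaged over atoms. A denoiser
    would be trained to push this toward zero."""
    total = sum((p - t) ** 2 for a, b in zip(pred, target) for p, t in zip(a, b))
    return total / len(target)

rng = random.Random(0)
clean = [(float(i), 0.0, 0.0) for i in range(10)]   # toy "known structure"
local_noisy = jitter_atoms(clean, sigma=0.1, rng=rng)
global_noisy = shift_segment(clean, 5, 10, sigma=5.0, rng=rng)

# Each (noisy, clean) pair is one training example: input and target.
print(mean_squared_error(local_noisy, clean))
print(mean_squared_error(global_noisy, clean))
```

In a real system, the model would see the noisy coordinates and be scored on how close its output comes to the clean ones.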

During training, the team found that it took about 20,000 instances of protein structures for AlphaFold 3 to get about 97 percent of a set of test structures right. By 60,000 instances, it started getting protein-protein interfaces correct at that frequency, too. And, critically, it started getting proteins complexed with other molecules right, as well.

Accuracy and hallucinations

None of the complexes reached the same level of accuracy as a basic protein structure. But, when looking at proteins complexed with a signaling molecule, about three-quarters of the predictions turned out to be right. Protein-DNA complexes were at about 60 percent accuracy, while protein-RNA complexes were at about 40 percent accuracy. All of those figures are significantly better than other leading prediction software. AlphaFold 3 could also produce predictions for proteins that have been chemically modified, such as by the addition of links to sugars (a very common modification).

The adoption of a diffusion engine was a major source of concern since these tend to be prone to hallucinations. Many proteins have segments where there isn’t a defined structure, such as a loop of amino acids that flops around in the water that surrounds the protein. Since the diffusion module’s job is to find a structure, it could make one of these up even though it doesn’t exist, an output called a hallucination.

To try to limit hallucinations, the DeepMind team trained the module on structure predictions from an earlier version of its software, which typically puts unstructured pieces of protein into a very easy-to-identify configuration. This helped, and the team found that most hallucinations were labeled as low-confidence predictions, so they could at least be identified.

The other problems noted by the team all cropped up erratically. Sometimes the software didn’t handle chirality correctly, where a molecule could exist in one of two mirror-image configurations (biomolecules tend to have a single chirality). And it also sometimes placed atoms in locations where they’d physically overlap. This could be reduced by lowering the scores of predictions where this took place, but not eliminated entirely.
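
The idea of lowering the score of a prediction that contains overlapping atoms can be sketched as follows. The function names, the 1.0 distance cutoff, and the 0.1 penalty per overlap are assumptions for illustration only. A realistic check would also need to skip bonded atoms and use element-specific radii, which this toy version does not do.

```python
import math
from itertools import combinations

def count_clashes(coords, min_distance=1.0):
    """Count atom pairs that sit closer together than min_distance."""
    return sum(
        1 for a, b in combinations(coords, 2) if math.dist(a, b) < min_distance
    )

def penalized_score(confidence, coords, penalty=0.1, min_distance=1.0):
    """Subtract a fixed (assumed) penalty from the confidence per clash."""
    return confidence - penalty * count_clashes(coords, min_distance)

# Toy prediction: the first two atoms are only 0.5 units apart.
atoms = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
print(count_clashes(atoms))
print(penalized_score(0.9, atoms))
```

A penalty like this makes clashing structures less likely to be ranked first, but it does not stop them from being generated, which fits with the problem being reduced rather than eliminated.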

Finally, the software could be used to predict interactions between proteins and antibodies that recognize them, but it was very computationally expensive since it often required the software to make multiple predictions and rate the probable accuracy of each one. That’s not out of keeping with what others have found but is disappointing considering how useful it can be to understand antibody-target interactions.
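
The cost of that approach is easy to see in a sketch: every extra candidate is another full prediction run. In the Python below, `toy_predictor` is a purely hypothetical stand-in that returns a random confidence value for a given seed. It does not call AlphaFold or any real library, and the dictionary layout is an assumption for the example.

```python
import random

def toy_predictor(seed):
    """Hypothetical stand-in for one expensive prediction run."""
    rng = random.Random(seed)
    return {"seed": seed, "confidence": rng.random()}

def best_of_n(predict, seeds):
    """Run the predictor once per seed and rank results by confidence."""
    predictions = [predict(seed) for seed in seeds]
    if not predictions:
        raise ValueError("need at least one seed")
    ranked = sorted(predictions, key=lambda p: p["confidence"], reverse=True)
    return ranked[0], ranked

best, ranked = best_of_n(toy_predictor, range(5))
print(best["seed"], round(best["confidence"], 3))
```

With a real predictor, the total runtime grows in step with the number of seeds, which is where the computational expense comes from.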

What does this mean?

Adding the ability to predict the structure of proteins complexed with the molecules they operate on could ultimately give us a new view into how life operates. As we mentioned up top, these interactions are central to how proteins function. And, even with the somewhat limited accuracy of these predictions, they’re potentially useful for developing hypotheses that could be tested using standard biochemical techniques.

These same sorts of interactions are also key to drug development. If you know what the complex between a protein and signaling molecule looks like, then it makes it far easier to develop molecules that disrupt that interaction—an idea that’s behind some key drug developments in recent years. You could also potentially test the strength of interactions between potential drugs identified this way and the proteins they target.

Are the current structural predictions accurate enough for that? At least one company thinks so, based on Google’s announcement. Only pharmaceutical companies can answer whether that’s likely to be the case today. The key determinant in the long run, however, is likely to be whether we’re back here in a few years with a discussion of the improvements found in AlphaFold 4.

Nature, 2024. DOI: 10.1038/s41586-024-07487-w
