Meta also presents a protein oracle
A new artificial intelligence (AI) system could significantly accelerate the prediction of high-resolution protein structures. In mid-March, the technology group Meta (Facebook) presented a large language model called “ESMFold” that can determine three-dimensional (3D) shapes up to sixty times faster than the previously leading AI, DeepMind’s “AlphaFold 2.0” from Google’s parent company Alphabet. Although ESMFold is said to deliver less accurate results, the deviations are relatively small, according to experts.
The 3D structure of proteins is one of the most important pieces of information in biology and pharmacology. Proteins are like tiny bio-machines that shape our bodies and keep them running, for example as building material in hair and nails, as hormones and as antibodies. Knowing the shape of a protein helps to elucidate its biological function in the body, determine its potency as a drug, and test its suitability as a drug target. Like other language models, including ProGen, ESMFold determines the 3D structure directly from the sequence of the amino acid building blocks. The order in which the amino acids follow each other is encoded in the base sequence of the DNA. The amino acid sequence determines how the chain folds in three dimensions, since each amino acid bears different side groups with different charges that attract or repel each other.
Language model finds patterns in amino acid sequences
However, the language model does not need to know how the amino acids interact with each other. Instead, during training on 138 million proteins from large protein databases, it learned to find patterns in the amino acid sequences that correlate with specific structures. The AI also learned to fill in gaps in amino acid chains, that is, to determine the most probable amino acid for each missing position.
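The gap-filling task described above can be illustrated with a deliberately tiny sketch. It is not ESMFold's model, which is a large neural network; it only counts which residue appears between a given left and right neighbour in a handful of made-up sequences. The names `train_counts` and `predict_masked`, the `?` mask symbol and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

MASK = "?"  # hypothetical placeholder for a missing residue


def train_counts(sequences):
    """Count which residue occurs between each (left, right) neighbour pair."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(1, len(seq) - 1):
            counts[(seq[i - 1], seq[i + 1])][seq[i]] += 1
    return counts


def predict_masked(seq, counts):
    """Return (most probable residue, estimated probability) for one interior mask."""
    i = seq.index(MASK)
    if i == 0 or i == len(seq) - 1:
        raise ValueError("this toy model needs a neighbour on both sides of the mask")
    options = counts.get((seq[i - 1], seq[i + 1]), Counter())
    if not options:
        return None, 0.0  # context never seen during training
    residue, n = options.most_common(1)[0]
    return residue, n / sum(options.values())


# Made-up training sequences in one-letter amino acid code.
TOY_SEQUENCES = ["MKTAYIAK", "MKTAYLAK", "MKSAYIAK", "GKTAYIAR"]

counts = train_counts(TOY_SEQUENCES)
residue, prob = predict_masked("MKTAY?AK", counts)
print(residue, round(prob, 2))  # I 0.75
```

In three of the four toy sequences the residue between Y and A is isoleucine (I), so the sketch proposes I with an estimated probability of 0.75. A real protein language model conditions on the whole sequence rather than on two neighbours.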
What is new is that ESMFold no longer has to compare the amino acid sequence under investigation with other amino acid sequences of known 3D structure (multiple sequence alignment, MSA) in order to infer the protein structure from similarities. “A very interesting aspect of ESMFold is that this information is no longer used explicitly, but was learned implicitly by the language model,” says Gunnar Schröder, who heads the “Computational Structural Biology” research group at Forschungszentrum Jülich. “It only seems surprising,” adds Alfonso Valencia from the Barcelona Supercomputing Centre. “The logic of the sequence of amino acids in known proteins is the result of an evolutionary process that has led to them having the specific structure with which they fulfill a specific function.”
Freely accessible database with 3D structure predictions
DeepMind’s “AlphaFold” and the “RoseTTAFold” developed at the University of Washington rely on the aforementioned multiple sequence alignment. These AIs compare the amino acid sequence of the protein to be examined with the sequences of proteins with a known 3D structure. If many similar amino acids are found in similar places and in the same order, this indicates a structural or functional relationship between the proteins. From this, conclusions can then be drawn about the 3D structure of the examined protein with atomic accuracy. Last year, DeepMind published a freely accessible AlphaFold database containing 3D structure predictions for almost every protein known to science: around 200 million proteins from animals, plants and other organisms. At that time, only 190,000 of these structures had been determined experimentally, that is, using methods such as X-ray crystallography or cryo-electron microscopy.
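What “similar sequences in similar places” means can be sketched with a toy alignment. The example below assumes the sequences are already aligned (gaps written as `-`) and computes, for each column, the most common residue and the fraction of sequences that share it. This is only an illustration of the signal an alignment carries, not the way AlphaFold or RoseTTAFold process alignments; the function name `column_conservation` and the sequences are hypothetical.

```python
from collections import Counter


def column_conservation(alignment):
    """For each column, return (most common residue, fraction of sequences sharing it)."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]  # ignore gap characters
        if not residues:
            result.append(("-", 0.0))
            continue
        residue, n = Counter(residues).most_common(1)[0]
        result.append((residue, n / len(column)))
    return result


# Made-up, pre-aligned sequences in one-letter amino acid code.
TOY_ALIGNMENT = [
    "MKTAYIAK",
    "MKTAYLAK",
    "MKS-YIAK",
    "GKTAYIAR",
]

conservation = column_conservation(TOY_ALIGNMENT)
for position, (residue, fraction) in enumerate(conservation, start=1):
    print(position, residue, f"{fraction:.2f}")
```

Columns in which every sequence carries the same residue (here positions 2, 5 and 7) score 1.00; such strongly conserved positions are the kind of similarity that points to a shared structure or function.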
As the ESMFold researchers led by Alexander Rives write in the journal “Science”, the higher speed allowed them to predict far more protein structures. Meta AI published the so-called “ESM Metagenomics Atlas”, which contains 617 million high-resolution 3D structure predictions, roughly three times as many structures as the AlphaFold database. By their own account, the researchers were able to predict around 225 million of these “with high reliability”. That also convinced Valencia. Even if the quality of the results across the complete data set is somewhat lower than that achieved with other methods, it is at least comparable for these 225 million structures.
A second major advantage of the ESMFold method is that it can also predict the structure of previously unknown proteins from environmental samples. In addition, “the new method can be applied directly to predicting the consequences of point mutations, which was outside the scope of previous methods and has direct implications for biomedical applications,” Valencia continues.