Protein folding via machine learning may spawn medical advances

Complex of bacteria-infecting viral proteins modeled in CASP 13

Introduction

In an advance that may presage a dramatic new era of pharmaceuticals and medicine, DeepMind (a subsidiary of Alphabet, Google’s parent company) recently applied their machine learning software to the challenging problem of protein folding, with remarkable success. In the wake of this success, DeepMind and other private companies are racing to further extend these capabilities and apply them to real-world biology and medicine.

The protein folding problem

Protein folding is the name for the physical process in which a protein chain, defined by a linear sequence of amino acids, assumes its equilibrium 3-dimensional structure, a process that in nature typically occurs within a few milliseconds. The equilibrium or “native” structure determines most of the protein’s biological properties. Protein enzymes, for instance, control chemical reactions at the molecular scale. Accurately predicting protein structures is a “holy grail” of modern biology.

As a single example, the protein Cas9 is used to snip DNA in CRISPR gene editing, but the potential of the CRISPR technique is limited by the fact that using Cas9 entails an increased risk of off-target mutations. A better understanding of the structure of Cas9 is thus essential to further advances in gene editing technology.

The human genome contains roughly 20,000 protein-coding genes, so there are at least that many distinct proteins in human biology; counting variants, the actual tally is far higher, possibly as many as several billion.

Computationally simulating the protein folding process, and determining the final equilibrium conformation, is a formidable problem; indeed, it has long been listed among the grand challenges of high-performance computing. Exhaustively trying all possible conformations of a chain of n amino acids is computationally infeasible, since the number of conformations grows exponentially with n (Levinthal's paradox). Researchers have therefore broken the folding process into numerous steps, each of which has spawned numerous algorithmic approaches. One recent survey of the computational problem and underlying physics is given in this paper.
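A quick back-of-envelope calculation shows why exhaustive search is hopeless. The figures below (three accessible conformations per residue, a 100-residue chain, a sampling rate of 10^13 conformations per second) are illustrative assumptions, not measured values, but any remotely similar numbers lead to the same conclusion:

```python
# Illustrative arithmetic behind Levinthal's paradox. The specific
# constants here are hypothetical assumptions chosen for illustration.

CONFORMATIONS_PER_RESIDUE = 3    # assumed accessible backbone states
CHAIN_LENGTH = 100               # a modest-sized protein
SAMPLES_PER_SECOND = 1e13        # a very optimistic sampling rate
SECONDS_PER_YEAR = 3.15e7

total_states = CONFORMATIONS_PER_RESIDUE ** CHAIN_LENGTH
years_to_enumerate = total_states / SAMPLES_PER_SECOND / SECONDS_PER_YEAR

print(f"conformations to search: {total_states:.2e}")
print(f"years to enumerate them: {years_to_enumerate:.2e}")
```

Under these assumptions the search space exceeds 10^47 conformations, requiring more than 10^26 years to enumerate, many orders of magnitude longer than the age of the universe. Yet real proteins fold in milliseconds, which is precisely Levinthal's point: folding cannot be a random search.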

The challenge of protein folding has led some to develop new computer architectures specifically devoted to this task. Notable among these efforts is the “Anton” system, which was designed and constructed by a team of researchers led by David E. Shaw, founder of the D.E. Shaw hedge fund. A technical paper they wrote describing the Anton system and one of their landmark calculations was named a winner of the 2009 ACM Gordon Bell Prize.

DeepMind’s prior machine learning achievements

As mentioned above, DeepMind is a research subsidiary of Alphabet (Google’s parent company) devoted to machine learning (ML) and artificial intelligence (AI). In 2016, the “AlphaGo” program developed by DeepMind defeated Lee Se-dol, a Korean Go master, by winning four games of a five-game match, surprising observers who had predicted that this would not be done for decades, if ever. A year later, DeepMind’s improved AlphaGo program defeated Ke Jie, a 19-year-old Chinese player thought to be the world’s best.

Then DeepMind tried a new approach — rather than feeding their program over 100,000 published games by human competitors, they merely programmed the system with the rules of Go and a relatively simple reward function, and had it play games against itself. After just three days and 4.9 million training games, the new “AlphaGo Zero” program had advanced to the point that it defeated the earlier program 100 games to zero. After 40 days of training, its measured skill level was as far ahead of Ke Jie as Ke Jie is ahead of a typical amateur. For additional details, see this Math Scholar blog, this Scientific American article, and this DeepMind article.

Other DeepMind programs, using a similar machine learning approach, have conquered chess and the Japanese game of shogi. For details, see DeepMind’s technical paper and this nicely written New York Times article. This and some other recent developments in the ML/AI arena are summarized in this Math Scholar blog.

DeepMind conquers protein folding

DeepMind’s latest conquest is protein folding: In their first attempt, DeepMind’s team easily took top honors at the 13th Critical Assessment of Structure Prediction (CASP), an international competition for protein structure prediction programs. For protein sequences for which no other information was known (43 of the 90 test problems), DeepMind’s AlphaFold program made the most accurate prediction among the 98 competitors in 25 cases, far better than the second-place entrant, which won only three of the 43 test cases. On average, AlphaFold was 15% more accurate than its closest competitors on the most rigorous tests.

Diagram of AlphaFold’s strategy; courtesy Exxact.

DeepMind’s team developed AlphaFold by training a neural network on a large dataset of known protein structures, enabling the program to efficiently predict the distances between pairs of amino acids and the angles between the chemical bonds connecting them. They then employed a more classical “gradient descent” approach to minimize the overall energy level. In short, their approach combined sophisticated deep learning models with brute-force computational resources. A nice summary of these techniques is given in this Exxact blog (see diagram above) and in this DeepMind article.
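The second stage can be illustrated with a toy sketch. Here a hypothetical matrix of predicted pairwise distances stands in for the neural network’s output (four residues spaced 1.0 apart along a line), and plain gradient descent on a squared-error “energy” recovers coordinates matching those distances. This is a simplified illustration under those assumptions, not AlphaFold’s actual potential or optimizer:

```python
import numpy as np

# Toy sketch (not AlphaFold's code): recover coordinates from
# predicted pairwise distances by gradient descent on an "energy".

rng = np.random.default_rng(0)
n = 4
# Hypothetical "predicted" distances: residues 1.0 apart on a line.
target = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :]).astype(float)

x = rng.normal(size=(n, 2))   # random starting coordinates in 2-D

def pairwise(x):
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)  # epsilon avoids /0
    return diff, dist

def energy(x):
    _, dist = pairwise(x)
    return ((dist - target) ** 2).sum()

e0 = energy(x)
lr = 0.01
for _ in range(5000):
    diff, dist = pairwise(x)
    # dE/dx_i = 4 * sum_j (d_ij - t_ij)/d_ij * (x_i - x_j)
    grad = 4 * ((dist - target) / dist)[:, :, None] * diff
    x -= lr * grad.sum(axis=1)

print(f"energy: {e0:.3f} -> {energy(x):.5f}")
```

The energy drops steeply as the coordinates settle into a configuration consistent with the predicted distances. AlphaFold’s real potential is far richer (it also folds in predicted bond angles and physical terms), but the minimize-a-learned-potential idea is the same.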

Most researchers in the field were very impressed. One researcher described these results as “absolutely stunning.” The Guardian predicted that these results would “usher in a new era of medical progress.”

Commercial applications

Large pharmaceutical firms have not traditionally paid much attention to computational approaches, preferring more conventional experimental methods. But the costs of such laboratory work are rising rapidly: by one estimate, large pharma firms spend roughly $2.5 billion bringing a new drug to market, a cost that must ultimately be paid by consumers and their medical insurers in the form of sky-high prices for prescription drugs. One major reason for these escalating costs is the embarrassing fact that only about 10% of drugs that enter clinical trials are eventually approved by governmental regulatory agencies. Clearly the pharmaceutical companies must increase this success rate.

And the challenge looming ahead is even more daunting — the 20,000 genes of the human genome can malfunction in at least 100,000 ways, and the total number of interactions between human biology proteins is in the millions, if not higher. As Chris Gibson, founder of Recursion Pharmaceuticals, explains, “If we want to understand the other 97 percent of human biology, we will have to acknowledge it is too complex for humans.”

In the wake of DeepMind’s achievement (December 2018), numerous commercial enterprises are pursuing computational protein folding using ML/AI-based strategies. Venture capital operations are certainly taking notice, having poured more than $1 billion into ML/AI-based startups in the pharmaceutical field during the past year. In addition to Recursion, mentioned above, other new firms include Insitro, which has partnered with Gilead Sciences; Benevolent AI, which has teamed up with AstraZeneca; and a University of California-based team that has partnered with GlaxoSmithKline. Along these lines, Juan Alvarez, an associate vice president for computational chemistry at Merck, says that ML-based methods will be “critical” to the drug discovery and development process in the coming years. See this Bloomberg article for additional details.

Conclusion

So it appears that machine learning and artificial intelligence-based technology is destined to have a major impact in the pharmaceutical-biomedical world in the coming years. The reasons are not hard to find. As GlaxoSmithKline senior Vice President Tony Wood explains, “Where else would you accept a 1-in-10 success rate? … If we could double that to 20% it would be phenomenal.”

[This was also posted to the Math Scholar blog.]
