Highly accurate protein structure prediction with AlphaFold
Nature. 2021 Aug;596(7873):583-589.
doi: 10.1038/s41586-021-03819-2. Epub 2021 Jul 15.
John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A A Kohl, Andrew J Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W Senior, Koray Kavukcuoglu, Pushmeet Kohli, Demis Hassabis
- PMID: 34265844
- PMCID: PMC8371605
- DOI: 10.1038/s41586-021-03819-2
John Jumper et al. Nature. 2021 Aug.
Abstract
Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort [1–4], the structures of around 100,000 unique proteins have been determined [5], but this represents a small fraction of the billions of known protein sequences [6,7]. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence (the structure prediction component of the 'protein folding problem' [8]) has been an important open research problem for more than 50 years [9]. Despite recent progress [10–14], existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14) [15], demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.
© 2021. The Author(s).
Conflict of interest statement
J.J., R.E., A. Pritzel, T.G., M.F., O.R., R.B., A.B., S.A.A.K., D.R. and A.W.S. have filed non-provisional patent applications 16/701,070 and PCT/EP2020/084238, and provisional patent applications 63/107,362, 63/118,917, 63/118,918, 63/118,921 and 63/118,919, each in the name of DeepMind Technologies Limited, each pending, relating to machine learning for predicting protein structures. The other authors declare no competing interests.
Figures

a, The performance of AlphaFold on the CASP14 dataset (n = 87 protein domains) relative to the top-15 entries (out of 146 entries); group numbers correspond to the numbers assigned to entrants by CASP. Data are median and the 95% confidence interval of the median, estimated from 10,000 bootstrap samples. b, Our prediction of CASP14 target T1049 (PDB 6Y4F, blue) compared with the true (experimental) structure (green). Four residues in the C terminus of the crystal structure are B-factor outliers and are not depicted. c, CASP14 target T1056 (PDB 6YJ1). An example of a well-predicted zinc-binding site (AlphaFold has accurate side chains even though it does not explicitly predict the zinc ion). d, CASP target T1044 (PDB 6VR4), a 2,180-residue single chain, was predicted with correct domain packing (the prediction was made after CASP using AlphaFold without intervention). e, Model architecture. Arrows show the information flow among the various components described in this paper. Array shapes are shown in parentheses with s, number of sequences (N_seq in the main text); r, number of residues (N_res in the main text); c, number of channels.
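Panel a reports a median with a 95% confidence interval estimated from 10,000 bootstrap samples. The following is a minimal sketch of one common way to obtain such an interval (the percentile bootstrap); the caption does not say which bootstrap variant was used, and the function name, seed, and example scores are hypothetical.

```python
import random
import statistics


def bootstrap_median_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Return (median, lower, upper) using the percentile bootstrap.

    Assumption: a plain percentile interval; the caption only states
    that 10,000 bootstrap samples were used.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = medians[int((alpha / 2) * n_resamples)]
    upper = medians[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return statistics.median(values), lower, upper


# Hypothetical per-domain scores, not data from the paper.
scores = [92.1, 88.4, 95.0, 71.3, 90.2, 85.7, 93.8, 64.9, 89.5, 91.0]
median, lower, upper = bootstrap_median_ci(scores)
print(f"median={median:.1f}, 95% CI=({lower:.1f}, {upper:.1f})")
```

With only a handful of values the interval is wide; on the 87 CASP14 domains the same procedure would resample 87 scores per bootstrap draw.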

The analysed structures are newer than any structure in the training set. Further filtering is applied to reduce redundancy (see Methods). a, Histogram of backbone r.m.s.d. for full chains (Cα r.m.s.d. at 95% coverage). Error bars are 95% confidence intervals (Poisson). This dataset excludes proteins with a template (identified by hmmsearch) from the training set with more than 40% sequence identity covering more than 1% of the chain (n = 3,144 protein chains). The overall median is 1.46 Å (95% confidence interval = 1.40–1.56 Å). Note that this measure will be highly sensitive to domain packing and domain accuracy; a high r.m.s.d. is expected for some chains with uncertain packing or packing errors. b, Correlation between backbone accuracy and side-chain accuracy. Filtered to structures with any observed side chains and resolution better than 2.5 Å (n = 5,317 protein chains); side chains were further filtered to B-factor <30 Å². A rotamer is classified as correct if the predicted torsion angle is within 40°. Each point aggregates a range of lDDT-Cα, with a bin size of 2 units above 70 lDDT-Cα and 5 units otherwise. Points correspond to the mean accuracy; error bars are 95% confidence intervals (Student t-test) of the mean on a per-residue basis. c, Confidence score compared to the true accuracy on chains. Least-squares linear fit lDDT-Cα = 0.997 × pLDDT − 1.17 (Pearson’s r = 0.76). n = 10,795 protein chains. The shaded region of the linear fit represents a 95% confidence interval estimated from 10,000 bootstrap samples. In the companion paper, additional quantification of the reliability of pLDDT as a confidence measure is provided. d, Correlation between pTM and full chain TM-score. Least-squares linear fit TM-score = 0.98 × pTM + 0.07 (Pearson’s r = 0.85). n = 10,795 protein chains. The shaded region of the linear fit represents a 95% confidence interval estimated from 10,000 bootstrap samples.
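Panels c and d summarise each confidence measure with a least-squares linear fit and Pearson's r. As a rough illustration of how such a fit is computed, and of how the reported pLDDT fit could be applied to a single confidence value, here is a small sketch. The function names are hypothetical, and the example points are made up rather than taken from the paper.

```python
import math


def linear_fit_and_pearson(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, plus Pearson's r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


def expected_lddt_ca(plddt):
    """Apply the fit reported in panel c: lDDT-Ca = 0.997 * pLDDT - 1.17."""
    return 0.997 * plddt - 1.17


# Hypothetical (pLDDT, lDDT-Ca) pairs for a few chains, for illustration only.
plddt = [55.0, 70.0, 82.0, 90.0, 96.0]
lddt = [50.0, 71.0, 78.0, 91.0, 94.0]
slope, intercept, r = linear_fit_and_pearson(plddt, lddt)
print(f"slope={slope:.3f}, intercept={intercept:.2f}, r={r:.2f}")
print(f"fit applied to pLDDT 90: {expected_lddt_ca(90.0):.2f}")
```

A slope near 1 with a small intercept, as reported in the caption, means pLDDT can be read almost directly as an estimate of lDDT-Cα at the chain level.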

a, Evoformer block. Arrows show the information flow. The shape of the arrays is shown in parentheses. b, The pair representation interpreted as directed edges in a graph. c, Triangle multiplicative update and triangle self-attention. The circles represent residues. Entries in the pair representation are illustrated as directed edges and in each diagram, the edge being updated is ij. d, Structure module including Invariant point attention (IPA) module. The single representation is a copy of the first row of the MSA representation. e, Residue gas: a representation of each residue as one free-floating rigid body for the backbone (blue triangles) and χ angles for the side chains (green circles). The corresponding atomic structure is shown below. f, Frame aligned point error (FAPE). Green, predicted structure; grey, true structure; (R_k, t_k), frames; x_i, atom positions.
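Panel f names the ingredients of the frame aligned point error: frames (R_k, t_k) and atom positions x_i for both the predicted and the true structure. The sketch below shows one way to compute such an error in plain Python: each atom is expressed in the local coordinates of each frame, and the distances between predicted and true local positions are averaged. The clamp of 10 Å, the scale of 10 Å, and the small epsilon are assumptions not stated in this caption, and the function names are hypothetical.

```python
import math


def to_local(R, t, x):
    """Express point x in the local frame (R, t), i.e. compute R^T (x - t)."""
    d = [x[a] - t[a] for a in range(3)]
    return [sum(R[a][b] * d[a] for a in range(3)) for b in range(3)]


def fape(pred_frames, pred_atoms, true_frames, true_atoms,
         clamp=10.0, scale=10.0, eps=1e-4):
    """Mean clamped distance between atoms seen from every frame.

    Frames are (R, t) pairs with R a 3x3 rotation (list of rows) and t a
    3-vector. clamp, scale, and eps are assumed values for illustration.
    """
    total = 0.0
    count = 0
    for (Rp, tp), (Rt, tt) in zip(pred_frames, true_frames):
        for xp, xt in zip(pred_atoms, true_atoms):
            lp = to_local(Rp, tp, xp)
            lt = to_local(Rt, tt, xt)
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(lp, lt)) + eps)
            total += min(dist, clamp)
            count += 1
    return total / (scale * count)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Toy example: two identity frames and two atoms; one predicted atom is 3 A off.
frames = [(IDENTITY, [0.0, 0.0, 0.0]), (IDENTITY, [1.0, 0.0, 0.0])]
true_atoms = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
pred_atoms = [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
print(fape(frames, pred_atoms, frames, true_atoms))
```

Because every atom is compared in the local coordinates of a frame, applying the same rigid motion to a structure's frames and atoms together leaves the value unchanged, so no global superposition step is needed.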

a, Ablation results on two target sets: the CASP14 set of domains (n = 87 protein domains) and the PDB test set of chains with template coverage of ≤30% at 30% identity (n = 2,261 protein chains). Domains are scored with GDT and chains are scored with lDDT-Cα. The ablations are reported as a difference compared with the average of the three baseline seeds. Means (points) and 95% bootstrap percentile intervals (error bars) are computed using bootstrap estimates of 10,000 samples. b, Domain GDT trajectory over 4 recycling iterations and 48 Evoformer blocks on CASP14 targets LmrP (T1024) and Orf8 (T1064) where D1 and D2 refer to the individual domains as defined by the CASP assessment. Both T1024 domains obtain the correct structure early in the network, whereas the structure of T1064 changes multiple times and requires nearly the full depth of the network to reach the final structure. Note, 48 Evoformer blocks comprise one recycling iteration.
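Domains in panel a are scored with GDT. As a rough guide to what that number means, the sketch below computes a GDT_TS-style score from per-residue Cα distances: the average, over cutoffs of 1, 2, 4 and 8 Å, of the percentage of residues within the cutoff. This is a simplification. It assumes the distances come from one fixed superposition, whereas the full measure searches over superpositions to maximise each percentage. The function name and example distances are hypothetical.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """GDT_TS-style score (0-100) from per-residue Ca distances in angstroms.

    Assumption: distances were measured after a single, fixed superposition.
    """
    n = len(distances)
    # Fraction of residues within each distance cutoff.
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Hypothetical Ca deviations for a four-residue toy domain.
example = [0.5, 1.5, 3.0, 9.0]
print(gdt_ts(example))
```

Under this definition a difference of a few GDT units between an ablation and the baseline corresponds to a small percentage of residues moving across one of the cutoffs.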

a, Backbone accuracy (lDDT-Cα) for the redundancy-reduced set of the PDB after our training data cut-off, restricting to proteins in which at most 25% of the long-range contacts are between different heteromer chains. We further consider two groups of proteins based on template coverage at 30% sequence identity: covering more than 60% of the chain (n = 6,743 protein chains) and covering less than 30% of the chain (n = 1,596 protein chains). MSA depth is computed by counting the number of non-gap residues for each position in the MSA (using the N_eff weighting scheme; see Methods for details) and taking the median across residues. The curves are obtained through Gaussian kernel average smoothing (window size is 0.2 units in log10(N_eff)); the shaded area is the 95% confidence interval estimated using bootstrap of 10,000 samples. b, An intertwined homotrimer (PDB 6SK0) is correctly predicted without input stoichiometry and only a weak template (blue is predicted and green is experimental).
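The curves in panel a are described as Gaussian kernel average smoothing over log10 of the MSA depth. A minimal sketch of that kind of smoother (a kernel-weighted average) is given below. Treating the stated window size of 0.2 as the Gaussian standard deviation is an assumption, and the function name and data points are hypothetical.

```python
import math


def gaussian_kernel_smooth(xs, ys, query, bandwidth=0.2):
    """Kernel-weighted average of ys at position `query`.

    Assumption: `bandwidth` is the Gaussian standard deviation, in the
    same units as xs (here log10 of the MSA depth).
    """
    weights = [math.exp(-0.5 * ((x - query) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("query is too far from all data points")
    return sum(w * y for w, y in zip(weights, ys)) / total


# Hypothetical (MSA depth, lDDT-Ca) pairs, for illustration only.
depths = [3, 10, 30, 100, 300, 1000]
lddt_ca = [45.0, 60.0, 75.0, 85.0, 88.0, 90.0]
log_depths = [math.log10(d) for d in depths]
curve = [gaussian_kernel_smooth(log_depths, lddt_ca, q) for q in log_depths]
print([round(v, 1) for v in curve])
```

Smoothing on the log scale gives shallow and deep alignments comparable influence on the curve, which suits a quantity that spans several orders of magnitude.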
Comment in
- Protein-structure prediction revolutionized. AlQuraishi M. Nature. 2021 Aug;596(7873):487-488. doi: 10.1038/d41586-021-02265-4. PMID: 34426694.
- Agard DA, Bowman GR, DeGrado W, Dokholyan NV, Zhou HX. Fac Rev. 2022 Dec 14;11:38. doi: 10.12703/r-01-0000020. eCollection 2022. PMID: 36644294.