Observations of amino acid gain and loss during protein evolution are explained by statistical bias - PubMed
Richard A Goldstein et al. Mol Biol Evol. 2006 Jul.
Abstract
The authors of a recent manuscript in "Nature" claim to have discovered "universal trends" of amino acid gain and loss in protein evolution. Here, we show that this universal trend can be simply explained by a bias that is unavoidable with the 3-taxon trees used in the original analysis. We demonstrate that a rigorously reversible equilibrium model, when analyzed with the same methods as the "Nature" manuscript, yields identical (and in this case, clearly erroneous) conclusions. A main source of the bias is the division of the sequence data into "informative" and "noninformative" sites, which favors the observation of certain transitions.
Figures

Figure 1. A simple 3-taxon tree. In (a), the 2 closest “sister” taxa (S1 and S2) diverge at an internal node (I), which represents the most recent common ancestor of these 2 taxa. At some unknown root point lies the common ancestor of these 2 taxa and the outgroup (O). The distances, or branch lengths, separating I from S1, S2, and O are labeled b1, b2, and b3, respectively. In (b), parsimony reconstruction of an “informative” site is depicted. The amino acid i (dashed lines) is inferred to exist at the root and in the common ancestor of the sister taxa because it is in one of the sisters and in the outgroup. The amino acid in the ancestral node is in parentheses to emphasize that the state here (and therefore the substitution from i to j along the branch leading to S2) is inferred rather than observed, as it is in the 2 sister taxa and the outgroup.
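The parsimony rule described in this caption can be sketched in a few lines of Python. This is an illustrative reading of the caption, not code from the paper: the names `Inference` and `classify_site` are hypothetical, and a site is treated as informative only when the sisters differ and the outgroup matches one of them.

```python
from typing import NamedTuple, Optional


class Inference(NamedTuple):
    ancestor: str  # state inferred (not observed) at the internal node I
    lost: str      # residue i inferred to have been replaced
    gained: str    # residue j inferred to have replaced it
    branch: str    # sister branch that carries the inferred substitution


def classify_site(s1: str, s2: str, outgroup: str) -> Optional[Inference]:
    """Apply 3-taxon parsimony to one aligned site.

    Returns None for sites that carry no polarized substitution:
    either the sisters agree, or all three residues differ
    (a "noninformative" site in the caption's terminology).
    """
    if s1 == s2:
        return None
    if outgroup == s1:
        # S1 and O share residue i, so i is inferred at I and i -> j occurs on the S2 branch.
        return Inference(ancestor=s1, lost=s1, gained=s2, branch="S2")
    if outgroup == s2:
        # Mirror case: the substitution is placed on the S1 branch.
        return Inference(ancestor=s2, lost=s2, gained=s1, branch="S1")
    return None


if __name__ == "__main__":
    print(classify_site("A", "G", "A"))
    print(classify_site("A", "G", "S"))
```

Note that the ancestral state is assigned purely from the outgroup, so any substitution on the outgroup branch silently flips the inferred direction of change.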

Figure 2. Normalized flux difference versus length of branch to sister taxa for 2 amino acid types. The normalized flux for residue A, dA, was calculated with b1 = b2 = (1/3)b3 for 3 different equilibrium frequencies for the majority amino acid: πA = 0.6 (blue), 0.7 (red), and 0.8 (green). Results are shown for homogeneous (solid lines) or variable (dashed lines) rates across sites, for which 40% of the sites are invariant, 40% are slowly varying, and 20% vary 10 times faster than the slowly varying sites. Results are also shown (dotted red line) when πA = 0.6 for the invariant sites, πA = 0.7 for the slowly varying sites, and πA = 0.9 for the rapidly varying sites, for an average equilibrium frequency of πA = 0.7.
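A two-state calculation of this kind can be reproduced exactly, without simulation, by summing over the eight possible site patterns. The sketch below is an assumption-laden illustration rather than the authors' code: it assumes a reversible two-state model scaled to one expected substitution per unit branch length, defines the normalized flux as (gains − losses)/(gains + losses), and pools counts over site classes using raw relative rates. The function names and the `rates`/`weights` parameters are hypothetical.

```python
import math


def transition_prob(i, j, t, pi):
    """P(state j at the end | state i at the start) over branch length t."""
    mu = 1.0 / (2.0 * pi[0] * pi[1])  # assumed scaling: one expected substitution per unit length
    decay = math.exp(-mu * t)
    return pi[j] * (1.0 - decay) + (decay if i == j else 0.0)


def normalized_flux_A(pi_A, b_sister, b_out, rates=(1.0,), weights=(1.0,)):
    """Apparent (gain - loss) / (gain + loss) for residue A under 3-taxon parsimony.

    The model is at equilibrium and reversible, so the true flux is zero;
    any nonzero value is produced by the reconstruction alone.
    """
    pi = (pi_A, 1.0 - pi_A)  # index 0 = A, index 1 = B
    gain = loss = 0.0
    for rate, weight in zip(rates, weights):
        for s1 in (0, 1):
            s2 = 1 - s1  # only sites where the sisters differ are counted
            for out in (0, 1):
                # Root the tree at the internal node I, which is valid for a reversible model.
                p = sum(
                    pi[x]
                    * transition_prob(x, s1, rate * b_sister, pi)
                    * transition_prob(x, s2, rate * b_sister, pi)
                    * transition_prob(x, out, rate * b_out, pi)
                    for x in (0, 1)
                )
                if out == 0:
                    loss += weight * p  # A inferred ancestral, so A -> B is scored
                else:
                    gain += weight * p  # B inferred ancestral, so B -> A is scored
    return (gain - loss) / (gain + loss)


if __name__ == "__main__":
    for pi_A in (0.6, 0.7, 0.8):
        print(pi_A, normalized_flux_A(pi_A, b_sister=0.1, b_out=0.3))
```

Under these assumptions the value is zero only when πA = 0.5, and with gains counted as positive the majority residue shows an apparent net loss. Passing, for example, `rates=(0.0, 1.0, 10.0)` and `weights=(0.4, 0.4, 0.2)` gives one possible reading of the variable-rate case; the paper's exact rate normalization is not stated here.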

Figure 3. Apparent normalized flux for the various amino acids as reported by Jordan and colleagues (solid) versus those obtained by applying their analysis to synthetic data prepared from a rigorously reversible model, both without (striped) and with (crosshatched) their recommended corrections for statistical bias. The model was based on the symmetric JTT matrix with 5 site classes, each defined by a relative substitution rate, a relative fraction of all locations, and the equilibrium frequencies of the 20 amino acids. Overall equilibrium frequencies were constrained to the JTT values. Branch lengths were b1 = b2 = 0.05, b3 = 0.15, corresponding to 8% sequence difference between sister taxa and 15% between either sister taxon and the outgroup. Seventy-eight percent of the locations where the 2 sister taxa differ are informative, similar to the values for the real data analyzed by Jordan and colleagues. Published values represent the average over all data sets.
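The effect of keeping only informative sites can be illustrated with a much smaller model than the one used for this figure. The sketch below swaps in a toy k-state reversible model in which every substitution lands on residue j with probability proportional to its equilibrium frequency. It is not the JTT site-class model, and the function name, the three-state frequencies, and the flux definition (gains − losses)/(gains + losses) are assumptions made for illustration. Site-pattern probabilities are enumerated exactly, so no random sampling is involved.

```python
import math
from itertools import product


def site_pattern_stats(pi, b_sister, b_out):
    """Return (apparent normalized flux per state, informative fraction of variable sites)."""
    k = len(pi)
    # Assumed scaling: one expected substitution per unit branch length.
    mu = 1.0 / (1.0 - sum(p * p for p in pi))

    def transition_prob(i, j, t):
        decay = math.exp(-mu * t)
        return pi[j] * (1.0 - decay) + (decay if i == j else 0.0)

    gain = [0.0] * k
    loss = [0.0] * k
    informative = variable = 0.0
    for s1, s2, out in product(range(k), repeat=3):
        if s1 == s2:
            continue  # sisters agree: nothing to score
        # Probability of this pattern, rooting at the internal node (valid when reversible).
        p = sum(
            pi[x]
            * transition_prob(x, s1, b_sister)
            * transition_prob(x, s2, b_sister)
            * transition_prob(x, out, b_out)
            for x in range(k)
        )
        variable += p
        if out not in (s1, s2):
            continue  # noninformative site: discarded by the analysis
        informative += p
        derived = s2 if out == s1 else s1
        loss[out] += p      # outgroup residue is inferred ancestral and scored as lost
        gain[derived] += p  # the other sister residue is scored as gained
    flux = [(g - l) / (g + l) for g, l in zip(gain, loss)]
    return flux, informative / variable


if __name__ == "__main__":
    flux, fraction = site_pattern_stats([0.5, 0.3, 0.2], b_sister=0.05, b_out=0.15)
    print("apparent flux per state:", flux)
    print("informative fraction of variable sites:", fraction)
```

Even though this toy process is at equilibrium, the commonest state shows an apparent net loss and the rarer states an apparent net gain, while uniform frequencies give zero flux. This is the qualitative pattern at issue, though the magnitudes have no bearing on the 20-residue results in the figure.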

Figure 4. Root mean square (rms) flux imbalance for the site-class model as a function of branch lengths. The lengths between both sister taxa and the internal node (b1 and b2) were equal, and the length between the internal node and the outgroup (b3) was constrained to be longer. The flux imbalance was measured as the rms value $\langle d^2 \rangle^{1/2}$, where the average is over all residue types. Other details of the model were the same as in figure 3. The corresponding quantity for the genome data was 0.165, which corresponds approximately to the turquoise stripe. Because rates of change differ among locations, the center point on the plot (b1 = b2 = 0.10, b3 = 0.25) corresponds to sister taxa with a sequence difference of 15%, whereas the outgroup sequence differs from each of the sister taxa by 21%.
Similar articles
- ProtTest: selection of best-fit models of protein evolution. Abascal F, Zardoya R, Posada D. Bioinformatics. 2005 May 1;21(9):2104-5. doi: 10.1093/bioinformatics/bti263. Epub 2005 Jan 12. PMID: 15647292
- Modeling evolution at the protein level using an adjustable amino acid fitness model. Dimmic MW, Mindell DP, Goldstein RA. Pac Symp Biocomput. 2000:18-29. doi: 10.1142/9789814447331_0003. PMID: 10902153
- A universal trend of amino acid gain and loss in protein evolution. Jordan IK, Kondrashov FA, Adzhubei IA, Wolf YI, Koonin EV, Kondrashov AS, Sunyaev S. Nature. 2005 Feb 10;433(7026):633-8. doi: 10.1038/nature03306. Epub 2005 Jan 19. PMID: 15660107
- Evolution as a Guide to Designing xeno Amino Acid Alphabets. Mayer-Bacon C, Agboha N, Muscalli M, Freeland S. Int J Mol Sci. 2021 Mar 10;22(6):2787. doi: 10.3390/ijms22062787. PMID: 33801827 Free PMC article. Review.
- How similar are amino acid mutations in human genetic diseases and evolution. Wu H, Ma BG, Zhao JT, Zhang HY. Biochem Biophys Res Commun. 2007 Oct 19;362(2):233-7. doi: 10.1016/j.bbrc.2007.07.141. Epub 2007 Aug 2. PMID: 17681277 Review.
Cited by
- A universal trend among proteomes indicates an oily last common ancestor. Mannige RV, Brooks CL, Shakhnovich EI. PLoS Comput Biol. 2012;8(12):e1002839. doi: 10.1371/journal.pcbi.1002839. Epub 2012 Dec 27. PMID: 23300421 Free PMC article.
- The universal trend of amino acid gain-loss is caused by CpG hypermutability. Misawa K, Kamatani N, Kikuno RF. J Mol Evol. 2008 Oct;67(4):334-42. doi: 10.1007/s00239-008-9141-1. Epub 2008 Sep 23. PMID: 18810523
- Evolution of proteomes: fundamental signatures and global trends in amino acid compositions. Tekaia F, Yeramian E. BMC Genomics. 2006 Dec 5;7:307. doi: 10.1186/1471-2164-7-307. PMID: 17147802 Free PMC article.
- Matsumoto T, Akashi H, Yang Z. Genetics. 2015 Jul;200(3):873-90. doi: 10.1534/genetics.115.177386. Epub 2015 May 6. PMID: 25948563 Free PMC article.
- Gaffney DJ, Keightley PD. BMC Evol Biol. 2008 Sep 30;8:265. doi: 10.1186/1471-2148-8-265. PMID: 18826599 Free PMC article.