Observations of amino acid gain and loss during protein evolution are explained by statistical bias

Richard A Goldstein et al. Mol Biol Evol. 2006 Jul.

Abstract

The authors of a recent manuscript in "Nature" claim to have discovered "universal trends" of amino acid gain and loss in protein evolution. Here, we show that this universal trend can be simply explained by a bias that is unavoidable with the 3-taxon trees used in the original analysis. We demonstrate that a rigorously reversible equilibrium model, when analyzed with the same methods as the "Nature" manuscript, yields identical (and in this case, clearly erroneous) conclusions. A main source of the bias is the division of the sequence data into "informative" and "noninformative" sites, which favors the observation of certain transitions.

Figures

Fig. 1

A simple 3-taxon tree. In (a), the 2 closest “sister” taxa (S1 and S2) diverge at an internal node (I), which represents the most recent common ancestor of these 2 taxa. At some unknown root point lies the common ancestor of these 2 taxa and the outgroup (O). The distances, or branch lengths, separating I from S1, S2, and O are labeled b1, b2, and b3, respectively. In (b) parsimony reconstruction of an “informative” site is depicted. The amino acid i (dashed lines) is inferred to exist at the root and in the common ancestor of the sister taxa because it is in one of the sisters and in the outgroup. The amino acid in the ancestral node is in parentheses to emphasize that the state here (and therefore the substitution from i to j along the branch leading to S2) is inferred rather than observed as in the 2 sister taxa and the outgroup.
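The parsimony rule described in the Fig. 1 caption can be sketched in a few lines of Python. This is only an illustration: the function name `reconstruct_site` and its return format are hypothetical, and treating every site where the sisters agree or where all three residues differ as "noninformative" is an assumption drawn from the caption rather than a definition given here.

```python
def reconstruct_site(s1, s2, outgroup):
    """Parsimony reading of one aligned site on a 3-taxon tree.

    Returns None for a site treated as noninformative, otherwise a tuple
    (ancestral, derived, branch) describing the inferred substitution.
    """
    if s1 == s2:
        return None  # sisters agree: no difference between them to polarize
    if outgroup == s1:
        return (s1, s2, "S2")  # ancestor inferred as s1, change on branch to S2
    if outgroup == s2:
        return (s2, s1, "S1")  # ancestor inferred as s2, change on branch to S1
    return None  # three different residues: direction cannot be inferred


print(reconstruct_site("A", "G", "A"))
```

The ancestral state returned here is always the outgroup residue, which is the inference (not an observation) that the caption marks with parentheses.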

Fig. 2

Normalized flux difference versus length of branch to sister taxa for 2 amino acid types. The normalized flux for residue A, dA, was calculated with b1 = b2 = b3/3 for 3 different equilibrium frequencies for the majority amino acid: πA = 0.6 (blue), 0.7 (red), and 0.8 (green). Results are shown for homogeneous (solid lines) or variable (dashed lines) rates across sites, for which 40% of the sites are invariant, 40% are slowly varying, and 20% vary 10 times faster than the slowly varying sites. Results are also shown (dotted red line) when πA = 0.6 for the invariant sites, πA = 0.7 for the slowly varying sites, and πA = 0.9 for the rapidly varying sites, for an average equilibrium frequency of πA = 0.7.
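To make the bias concrete, the following Python sketch computes the apparent flux exactly for a reversible two-state model on the 3-taxon tree, with homogeneous rates only. Several pieces are assumptions not stated in this section: the normalized flux is taken to be (gains - losses) / (gains + losses), the rate constant `mu` is scaled so that one unit of branch length equals one expected substitution per site, and the function names are hypothetical. It is not the authors' code and is not meant to reproduce the plotted curves.

```python
import math
from itertools import product


def transition_prob(x, y, t, pi, mu):
    """P(state y after time t | state x at time 0) for a reversible 2-state model."""
    decay = math.exp(-mu * t)
    return pi[y] * (1.0 - decay) + (decay if x == y else 0.0)


def apparent_flux_A(pi_A, b_sister, b_out):
    """Apparent normalized flux of residue 'A' as parsimony would report it."""
    pi = {"A": pi_A, "B": 1.0 - pi_A}
    # Assumed scaling: one expected substitution per site per unit branch length.
    mu = 1.0 / (2.0 * pi["A"] * pi["B"])
    gain = loss = 0.0
    # Enumerate the internal node I, both sisters, and the outgroup.
    for node, s1, s2, out in product("AB", repeat=4):
        if s1 == s2:
            continue  # sisters agree: site is not counted
        p = (pi[node]
             * transition_prob(node, s1, b_sister, pi, mu)
             * transition_prob(node, s2, b_sister, pi, mu)
             * transition_prob(node, out, b_out, pi, mu))
        if out == "A":
            loss += p  # parsimony infers ancestor A, so an apparent A -> B
        else:
            gain += p  # parsimony infers ancestor B, so an apparent B -> A
    return (gain - loss) / (gain + loss)


for pi_A in (0.6, 0.7, 0.8):
    print(pi_A, apparent_flux_A(pi_A, b_sister=0.1, b_out=0.3))
```

Under these assumptions the returned value is negative whenever `pi_A` exceeds 0.5 and the branch lengths are positive, so the majority residue appears to be lost even though the model is at equilibrium and has no true net flux.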

Fig. 3

Apparent normalized flux for the various amino acids as reported by Jordan and colleagues (solid) versus those obtained by applying their analysis to synthetic data prepared from a rigorously reversible model, both without (striped) and with (crosshatched) their recommended corrections for statistical bias. The model was based on the symmetric JTT matrix with 5 site classes, each defined by a relative substitution rate, a relative fraction of all locations, and the equilibrium frequencies of the 20 amino acids. Overall equilibrium frequencies were constrained to the JTT values. Branch lengths were b1 = b2 = 0.05, b3 = 0.15, corresponding to 8% sequence difference between sister taxa and 15% between either sister taxon and the outgroup. Seventy-eight percent of the locations where the 2 sister taxa differ are informative, similar to the values for the real data analyzed by Jordan and colleagues. Published values represent the average over all data sets.

Fig. 4

Root mean square (rms) flux imbalance for the site-class model as a function of branch lengths. The lengths between both sister taxa and the internal node (b1 and b2) were equal, and the length between the internal node and the outgroup (b3) was constrained to be longer. The flux imbalance was measured as rms values, √⟨d²⟩, where the average is over all residue types. Other details of the model were the same as in figure 3. The corresponding quantity for the genome data was 0.165, which corresponds approximately to the turquoise stripe. Because of the different rate of change for different locations, the center point on the plot (b1 = b2 = 0.10, b3 = 0.25) corresponds to sister taxa with a sequence difference of 15%, whereas the outgroup sequence differs from each of the sister taxa by 21%.
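As a small worked illustration of the rms measure in the Fig. 4 caption, the sketch below takes a list of per-residue normalized flux values and returns the square root of their mean square. The function name is hypothetical and the input values are made up for illustration; they are not data from the paper.

```python
import math


def rms_flux_imbalance(d_values):
    """Square root of the mean of d squared, averaged over residue types."""
    return math.sqrt(sum(d * d for d in d_values) / len(d_values))


# Hypothetical per-residue flux values, for illustration only.
example = [0.3, -0.4, 0.0, 0.0]
print(rms_flux_imbalance(example))
```

Because each value is squared before averaging, apparent gains and apparent losses both add to the imbalance instead of cancelling.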
