The effect of ambiguous data on phylogenetic estimates obtained by maximum likelihood and Bayesian inference
Alan R Lemmon et al. Syst Biol. 2009 Feb.
Abstract
Although an increasing number of phylogenetic data sets are incomplete, the effect of ambiguous data on phylogenetic accuracy is not well understood. We use 4-taxon simulations to study the effects of ambiguous data (i.e., missing characters or gaps) in maximum likelihood (ML) and Bayesian frameworks. By introducing ambiguous data in a way that removes confounding factors, we provide the first clear understanding of 1 mechanism by which ambiguous data can mislead phylogenetic analyses. We find that in both ML and Bayesian frameworks, among-site rate variation can interact with ambiguous data to produce misleading estimates of topology and branch lengths. Furthermore, within a Bayesian framework, priors on branch lengths and rate heterogeneity parameters can exacerbate the effects of ambiguous data, resulting in strongly misleading bipartition posterior probabilities. The magnitude and direction of the ambiguous data bias are a function of the number and taxonomic distribution of ambiguous characters, the strength of topological support, and whether or not the model is correctly specified. The results of this study have major implications for all analyses that rely on accurate estimates of topology or branch lengths, including divergence time estimation, ancestral state reconstruction, tree-dependent comparative methods, rate variation analysis, phylogenetic hypothesis testing, and phylogeographic analysis.
Figures

Figure 1. Simulation design. Among-site rate variation was simulated using 6 rates of evolution (chosen to produce the desired PP for the true tree with 500 sites) combined across 2 genes to form 36 rate combinations. Gene A contained unambiguous sites, whereas Gene B contained ambiguous sites. Ambiguous characters were present for either sister or nonsister taxa. Although Gene A always contained 500 sites, the length of Gene B varied from 0 to 500 sites. Note that Gene B contained no topological information, regardless of the rate of evolution. PP = posterior probabilities.
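The layout described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' simulation code: the taxon names, the assumed true tree ((T1,T2),(T3,T4)), the choice of exactly two ambiguous taxa in Gene B, and the function names are all hypothetical, and the six rates are left as opaque labels because the caption does not list them.

```python
from itertools import product

GENE_A_LENGTH = 500                 # Gene A always has 500 unambiguous sites
TAXA = ("T1", "T2", "T3", "T4")     # hypothetical names; true tree assumed ((T1,T2),(T3,T4))


def rate_combinations(rates):
    """Pair every Gene A rate with every Gene B rate (6 rates give 36 pairs)."""
    return list(product(rates, repeat=2))


def ambiguity_mask(gene_b_length, pattern):
    """Return, per taxon, a list of booleans marking which sites are ambiguous.

    Assumption: Gene B is ambiguous for two of the four taxa, either a
    sister pair or a nonsister pair; "none" gives the unambiguous control.
    """
    if pattern == "sister":
        ambiguous = {"T1", "T2"}
    elif pattern == "nonsister":
        ambiguous = {"T1", "T3"}
    elif pattern == "none":
        ambiguous = set()
    else:
        raise ValueError(f"unknown pattern: {pattern!r}")
    return {
        taxon: [False] * GENE_A_LENGTH + [taxon in ambiguous] * gene_b_length
        for taxon in TAXA
    }


if __name__ == "__main__":
    combos = rate_combinations(["r1", "r2", "r3", "r4", "r5", "r6"])
    print(len(combos))                          # number of rate combinations
    mask = ambiguity_mask(100, "nonsister")
    print({t: sum(m) for t, m in mask.items()})  # ambiguous sites per taxon
```

Because only two taxa carry data at any Gene B site in this layout, those sites cannot distinguish among the three possible 4-taxon topologies by themselves, which matches the caption's note that Gene B contains no topological information.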

Figure 2. The effect of ambiguous characters on topological support when both genes are effectively invariable (rate = 0.000015 substitutions per site per My). Each graph plots the support for the true tree as a function of the length of Gene B. Each point represents the mean across 100 replicate data sets. In an ML framework (a), ambiguous characters do not affect topological support when branch lengths can be collapsed to polytomies. Here support is measured as Pr, the proportion of 100 replicates in which the true tree was chosen, with a value of 1/3 given to replicates with equal support across topologies. In a Bayesian framework (b), topological support (PP) changes as a function of the length of Gene B and whether Gene B is ambiguous for sister (black) or nonsister (dark gray) taxa. When Gene B is unambiguous (light gray), topological support is unaffected by the length of Gene B. When branch lengths are forced to take a small but nonzero value (10⁻⁶ substitutions/site) in an ML framework (c), ambiguous characters bias topological support (measured as the ratio of likelihood scores for the true tree to one of the false trees) in the manner seen in the Bayesian framework. Note that in a Bayesian framework, the flat prior on bifurcating topologies requires branch lengths to take a nonzero value. PP = posterior probability; Pr = probability.
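The Pr statistic defined in this caption can be sketched as follows. The caption only specifies the three-way tie (a value of 1/3); splitting credit evenly across any tied best topologies, the numerical tolerance used to detect ties, and the function name are assumptions made for this illustration.

```python
import math


def true_tree_support(replicate_scores, true_index=0, tol=1e-9):
    """Mean credit for the true tree across replicates.

    replicate_scores: one tuple of log-likelihoods per replicate, one entry
    per candidate topology. A replicate scores 1 if the true tree is the
    unique best, 1/k if it is tied for best with k topologies in total
    (k = 3 reproduces the caption's 1/3 rule), and 0 otherwise.
    """
    total = 0.0
    for scores in replicate_scores:
        best = max(scores)
        tied = [
            i for i, s in enumerate(scores)
            if math.isclose(s, best, rel_tol=0.0, abs_tol=tol)
        ]
        if true_index in tied:
            total += 1.0 / len(tied)
    return total / len(replicate_scores)


if __name__ == "__main__":
    # Made-up scores for four replicates, for illustration only.
    scores = [
        (-10.0, -12.0, -13.0),   # true tree wins
        (-11.0, -11.0, -11.0),   # three-way tie
        (-15.0, -12.0, -14.0),   # a false tree wins
        (-9.0, -9.0, -9.0),      # three-way tie
    ]
    print(true_tree_support(scores))
```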

Figure 3. The effect of ambiguous characters on topological support when both genes are evolving at the same rate. Axes and shades of gray are the same as in Figure 2. Note that the graphs in the left column of (a), (b), and (d) are identical to those presented in Figure 2. In an ML framework (a), ambiguous characters do not lead to a systematic bias in topological support, regardless of the rate of evolution (increasing from left to right columns). In a Bayesian framework (b), however, the magnitude and direction of the bias are a function of the rate of evolution. This bias is strongest when the rate of evolution is low or high and weakest when the rate of evolution is intermediate (e.g., when Gene A provides strong support for the true tree). When the rate of evolution is high, the bias exists when an exponential branch-length prior is assumed (b) but is absent when a uniform branch-length prior is assumed (c). The type of bias seen in the Bayesian framework can be demonstrated in the ML framework (d) if branch lengths are fixed at an arbitrarily low value (results for 10⁻⁶ substitutions per site per My are shown in the lower left graph) or a very high value (results for 1.0 substitutions per site per My are shown in the lower right graph). Note that in the Bayesian framework, the flat topological prior prohibits zero-length branches and the exponential branch-length prior penalizes long branches.
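One way to build intuition for why fixed branch lengths let ambiguous sites carry a topological signal is a small Jukes-Cantor calculation. This is a simplified sketch, not the analysis performed in the study. It assumes a JC69 model, all five branches of the 4-taxon tree fixed at the same length, and Gene B sites observed in only two taxa. Under those assumptions, summing over the unobserved tips reduces the site likelihood to a two-taxon likelihood that depends only on the path length between the observed taxa: 2 branches if they are sisters on the tree being scored, 3 branches otherwise. The function names are hypothetical.

```python
import math


def jc69_pair_likelihood(same_state, path_length):
    """Likelihood of one site observed in two taxa under JC69.

    path_length is in expected substitutions per site. expm1 keeps the
    calculation accurate for very short paths such as 1e-6.
    """
    e = math.expm1(-4.0 * path_length / 3.0)      # exp(-4d/3) - 1
    p = 1.0 + 0.75 * e if same_state else -0.25 * e
    return 0.25 * p                               # stationary frequency 1/4


def log_support_for_true_tree(n_sites, same_state, observed_are_sisters, branch_length):
    """Log-likelihood of the true tree minus that of a false tree.

    The false tree is one that changes whether the two observed taxa are
    sisters. All branches are fixed at branch_length, so the observed taxa
    are 2 branches apart when sisters and 3 branches apart otherwise.
    """
    short, long_ = 2.0 * branch_length, 3.0 * branch_length
    true_path, false_path = (short, long_) if observed_are_sisters else (long_, short)
    per_site = (math.log(jc69_pair_likelihood(same_state, true_path))
                - math.log(jc69_pair_likelihood(same_state, false_path)))
    return n_sites * per_site


if __name__ == "__main__":
    for sisters in (True, False):
        for same in (True, False):
            value = log_support_for_true_tree(500, same, sisters, 0.1)
            print(f"sisters={sisters} identical_states={same} log ratio={value:.3f}")
```

In this sketch, sites with identical states favor whichever tree places the two observed taxa closer together, and sites with different states favor the tree that separates them. When branch lengths are free to be estimated, that preference can be absorbed by the branch lengths rather than by the topology.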

Figure 4. The effect of ambiguous characters on Bayesian posterior probabilities when rates differ between unambiguous (A) and ambiguous (B) genes. In each graph, the average posterior probability of the true tree (y-axis) is plotted as a function of the length of Gene B (x-axis), pattern of ambiguous characters (shade of gray), rate of Gene A (column), and rate of Gene B (row). Graphs show results from analyses in which rate variation was modeled in a partitioned analysis (partitioned by gene) with a Dirichlet(1,1) rate prior. Therefore, the model of evolution is overparameterized along the diagonal (equal rates; analogous to Figure 3b) and correctly parameterized off the diagonal. Note that the magnitude and direction of bias are a function of the relative rates of the ambiguous and unambiguous genes. Also note that in some cases, the bias is strongest when the number of ambiguous sites is low.

Figure 5. The effect of ambiguous characters on Bayesian posterior probabilities under different models of among-site rate variation. Axes and shades of gray have the same meaning as in Figure 2b. Each row corresponds to the top row of Figure 4 (Gene B is effectively invariable). Results are shown for the following models of rate variation: discrete gamma with 4 rate categories (Γ), invariable sites (I), and partitioned with variable rate prior (P). The subscript f indicates that the rate variation parameter was fixed at the true value, removing the effect of the prior on that parameter. Note that in each case, the light gray lines show the results from analyses of data sets in which both Genes A and B were completely unambiguous (i.e., the control). Under the (incorrect) gamma model of rate heterogeneity, the posterior probabilities were slightly biased even for the unambiguous data sets.

Figure 6. The effect of ambiguous characters on the probability of incorrectly rejecting a molecular clock model of evolution in an ML framework. The proportion of 100 replicates in which the clock model was rejected in a χ² test (df = 2; y-axis) is plotted against the length of Gene B (x-axis), distribution of ambiguous characters (shade of gray), the rate of Gene A (columns), and the rate of Gene B (rows). Because rate heterogeneity was not accommodated in these analyses (see text), the model of evolution was underparameterized in analyses presented off the diagonal. Note that substantial inflation of Type I error requires both rate variation (off diagonal) and ambiguous characters (black or dark gray points).
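The clock test summarized in this caption can be sketched as a standard likelihood ratio test. This is an illustration under stated assumptions rather than the authors' procedure: the 0.05 significance level, the function names, and the example log-likelihoods are hypothetical. With df = 2, the χ² upper-tail probability has the closed form exp(-x/2), so no statistics library is needed.

```python
import math


def clock_lrt(lnl_clock, lnl_nonclock, alpha=0.05):
    """Likelihood ratio test of a clock model against an unconstrained model.

    Returns (test statistic, p-value, reject_clock). Uses df = 2, for which
    the chi-square survival function is exactly exp(-x / 2). The default
    alpha of 0.05 is an assumption; the caption does not state the level.
    """
    statistic = max(0.0, 2.0 * (lnl_nonclock - lnl_clock))
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value, p_value < alpha


def rejection_rate(replicates, alpha=0.05):
    """Proportion of (lnl_clock, lnl_nonclock) pairs in which the clock is rejected.

    When the data were simulated under a clock, this proportion estimates
    the Type I error rate.
    """
    rejected = sum(clock_lrt(c, n, alpha)[2] for c, n in replicates)
    return rejected / len(replicates)


if __name__ == "__main__":
    # Made-up log-likelihoods for two replicates, for illustration only.
    replicates = [(-1003.0, -1000.0), (-1001.0, -1000.0)]
    for lnl_clock, lnl_nonclock in replicates:
        print(clock_lrt(lnl_clock, lnl_nonclock))
    print(rejection_rate(replicates))
```

The df = 2 matches a 4-taxon comparison in which the unconstrained unrooted tree has 5 free branch lengths and the clock tree has 3 free node heights.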

Figure 7. The effect of ambiguous characters on estimates of an empirical phylogeny estimated in a Bayesian framework. In (a), we present results based on an empirical data set with up to 1000 variable sites appended. The character state at each appended site was unambiguous but different for the sister taxa Hydromantes brunus (Hb) and Hydromantes italicus (Hi) and was ambiguous (“?”) for the other 6 taxa: Aneides flavipunctatus (Af), Aneides hardii (Ah), Desmognathus fuscus (Df), Desmognathus wrighti (Dw), Ensatina eschscholtzii (Ee), and Phaeognathus hubrichti (Ph). The number of appended sites is given above each phylogeny, and the bipartition posterior probability estimate is given at each internal branch. In (b), we present results based on the same empirical data set but with up to 1000 invariable sites appended. Here, the character state at each appended site was identical for the distant taxa Df and Ee and was ambiguous for the other 6 taxa: Af, Ah, Dw, Hb, Hi, and Ph. Note that when variable sites are added, taxa with unambiguous characters are pushed apart on the phylogeny, whereas when invariable sites are added, taxa with unambiguous characters are pulled together. Topologies estimated in an ML framework matched those estimated in a Bayesian framework.

Figure 8. Ambiguous characters interact with model misspecification to produce misleading branch-length estimates. A data set is simulated using an ultrametric tree. Some sites evolve under a slow rate, whereas others evolve under a fast rate. Ambiguous characters are introduced nonrandomly with respect to rate and taxon. If rate variation is correctly modeled, the estimated tree is ultrametric. If rate variation is not correctly modeled, the estimated tree is non-ultrametric. The interaction between ambiguous characters and model misspecification causes among-site rate variation to be manifested as among-branch rate variation. Note that the pattern of inferred branch lengths depends on the taxonomic distribution of the ambiguous characters, even though the ambiguous sites contain no topological information.