Resolving tricky nodes in the tree of life through amino acid recoding

Sat Jan 01 2022

Mattia Giacomelli et al. iScience. 2022.

Abstract

Genomic data allowed a detailed resolution of the Tree of Life, but "tricky nodes" such as the root of the animals remain unresolved. Genome-scale datasets are heterogeneous, as genes and species are exposed to different pressures, and this can negatively impact phylogenetic accuracy. We use simulated genomic-scale datasets and show that recoding amino acid data improves accuracy when the model does not account for the compositional heterogeneity of the amino acid alignment. We apply our findings to three datasets addressing the root of the animal tree, where the debate centers on whether sponges (Porifera) or comb jellies (Ctenophora) represent the sister of all other animals. We show that results from empirical data follow predictions from simulations and suggest that, at least in phylogenies inferred from amino acid sequences, a placement of the ctenophores as sister to all the other animals is best explained as a tree reconstruction artifact.

Keywords: Biological sciences; Evolutionary biology; Phylogenetics.

© 2022 The Authors.

Conflict of interest statement

The authors declare no competing interests.

Figures

Graphical abstract
Figure 1

Recoding data. The six-bin Dayhoff-6 recoding scheme (see STAR methods) is the most widely used recoding strategy and is the one primarily tested in this study. Dayhoff-6 recoding partitions amino acids into six differently sized bins, based on how frequently they are expected to exchange with each other. (A) The bins of the Dayhoff-6 scheme and the biochemical properties of the amino acids in each bin. (B) An exemplar amino acid dataset and its Dayhoff-6 recoded representation. Dayhoff-6 recoding is achieved by replacing, in multiple sequence alignments, each one-letter amino acid code with a one-letter code representing the bin in which that amino acid is clustered.
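The replacement step described in this caption can be sketched in a few lines of Python. The bin memberships below follow the commonly used Dayhoff-6 grouping (AGPST, DENQ, HKR, ILMV, FWY, C). The numeric bin labels, the function name `recode_dayhoff6`, the toy alignment, and the decision to pass gaps and ambiguity characters through unchanged are illustrative assumptions, not details taken from the paper.

```python
# Dayhoff-6 bins as commonly defined. The labels "0"-"5" are arbitrary
# placeholders (assumption); any six distinct symbols would work.
DAYHOFF6_BINS = {
    "0": "AGPST",
    "1": "DENQ",
    "2": "HKR",
    "3": "ILMV",
    "4": "FWY",
    "5": "C",
}

# Invert the table: one-letter amino acid code -> bin label.
AA_TO_BIN = {aa: label for label, aas in DAYHOFF6_BINS.items() for aa in aas}


def recode_dayhoff6(sequence: str) -> str:
    """Replace each amino acid with its bin label.

    Characters that are not one of the 20 standard amino acids
    (gaps, 'X', '?') are left unchanged (assumption).
    """
    return "".join(AA_TO_BIN.get(ch, ch) for ch in sequence.upper())


# Hypothetical two-taxon alignment, for illustration only.
alignment = {"taxon_A": "MKTAYC-", "taxon_B": "MRSGFCX"}
recoded = {name: recode_dayhoff6(seq) for name, seq in alignment.items()}
```

In this toy case the two sequences differ at several amino acid positions but recode to the same bin string apart from the gap and ambiguity characters, which illustrates how recoding collapses exchanges within a bin.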

Figure 2

Recoding the data improves accuracy when the model fails to fit the amino acid alignment. (A) Accuracy of amino acid and recoded data as models that can account for more across-site compositional heterogeneity are used. (B) Table summarizing the Total Accuracy (TA) for amino acid and recoded data under each model. TA is calculated (from the values in A) as the percentage of accurate trees (see STAR methods) under both Porifera- and Ctenophora-sister. (C) Change in the fit (expressed as Z-scores) of the model to the data (estimated using PPA-Div) as models that can account for more across-site compositional heterogeneity are used. Orange: amino acid datasets; Blue: recoded datasets. (D) Correlation between the difference in Z-scores achieved by each considered model on the amino acid and recoded datasets (δPPA-Div) and the difference in TA achieved before and after recoding (δTA). See Figures S4–S7 for sensitivity tests showing that our conclusions would not have changed if we had used Maximum Likelihood instead of Bayesian analyses, if we had run our Bayesian analyses for 8,000 more generations (convergence was achieved before 1,000 cycles), if we had used a more stringent threshold to define success (PP = 0.95 instead of PP = 0.5), or if we had used alternative recoding schemes (SR6 and KGB6; see STAR methods for details) instead of Dayhoff-6.
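The Z-scores in panel C can be read as a standardized distance between a statistic observed on the real alignment and the same statistic on datasets simulated under the fitted model. The sketch below is a rough illustration under stated assumptions: it takes the diversity statistic to be the mean number of distinct states per alignment column, and the Z-score to be the absolute difference between the observed value and the simulated mean, divided by the simulated standard deviation. The function names, the choice of population standard deviation, and the toy data are hypothetical, and sign conventions differ between tools.

```python
from statistics import mean, pstdev


def site_diversity(alignment, missing="-?X"):
    """Mean number of distinct states per column (assumed diversity statistic).

    `alignment` is a list of equal-length strings; characters listed in
    `missing` are ignored when counting states.
    """
    n_sites = len(alignment[0])
    per_site = []
    for i in range(n_sites):
        states = {seq[i] for seq in alignment if seq[i] not in missing}
        per_site.append(len(states))
    return mean(per_site)


def z_score(observed, simulated):
    """Absolute standardized deviation of `observed` from simulated values."""
    sd = pstdev(simulated)  # population SD; an assumption of this sketch
    if sd == 0:
        raise ValueError("simulated values have zero variance")
    return abs(observed - mean(simulated)) / sd


# Toy alignment (hypothetical): three taxa, four sites.
observed = site_diversity(["MKTA", "MRSA", "LKTA"])

# Hypothetical diversity values from posterior predictive simulations.
z = z_score(observed, [2.0, 2.5, 3.0])
```

Under this reading, a larger Z-score means the model reproduces the observed per-site diversity less well.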

Figure 3

Recoding as a tool in green phylogenomics. Time taken to complete 50 GTR and CAT-GTR analyses of amino acid and recoded datasets. Recoded analyses are invariably completed in a shorter time, with the difference becoming significantly more marked when using the complex CAT-GTR model. Given the high accuracy of recoded analyses, this result suggests that recoding could play a significant role in the development of "green phylogenomics".

Figure 4

Dayhoff-6 outperforms random recoding schemes. (A) Boxplot representing the distribution of tree lengths for 0%-recoded, 90%-recoded, and Dayhoff-6 recoded datasets. (B) Comparison of PPA-Div scores for 0%-recoded, 90%-recoded, and Dayhoff-6 recoded datasets. The figure indicates that PPA-Div scores of Dayhoff-6 recoded data are significantly better than PPA-Div scores from 0%-recoded data; the distributions do not overlap. (C) A comparison of the TA values achieved by amino acid data, 0%-recoded, 90%-recoded, and Dayhoff-6 recoded data, indicating that Dayhoff-6 outperforms both amino acid and randomly recoded data.

Figure 5

Accuracy of recoded data increases with alignment size. (A) Accuracy of amino acid and recoded datasets of 1,000, 5,000, 10,000 and 30,000 sites analyzed under nCAT10, when the generating tree assumes Ctenophora-sister to be true. (B) Success rate of amino acid and Dayhoff-6 datasets of 1,000, 5,000, 10,000 and 30,000 sites analyzed under nCAT10, when the generating tree assumes Porifera-sister to be true. Analyses performed in Phylobayes. Green: correct trees; Dark Orange: incorrect trees; Light Orange: uncertain trees.

Figure 6

Results from empirical datasets follow predictions from simulations. (A) PPA-Div and PPA-Mean scores for all three empirical datasets when the alignments are analyzed as amino acids and as Dayhoff-6 recoded data. Average PPA scores for our simulated data are also reported (under nCAT10), indicating that PPA scores achieved for the simulated data under nCAT10 are comparable to those achieved under nCAT60 for the empirical datasets. (B) Support for Ctenophora-sister as different data types (amino acids, 0%-recodings, 90%-recodings, and Dayhoff-6) are used. (C) Support for Porifera-sister as different data types (amino acids, 0%-recodings, 90%-recodings, and Dayhoff-6) are used. Orange: reference support values obtained for the two considered clades in our simulations; Light Green: Whelan2015; Dark Green: Whelan2017; Blue: Laumer2019. Note: Support values for 0%-recodings and 90%-recodings represent average values calculated over all random recodings generated.

References

    1. Metzker M.L. Sequencing technologies — the next generation. Nat. Rev. Genet. 2010;11:31–46. doi: 10.1038/nrg2626.
    2. Dunn C.W., Giribet G., Edgecombe G.D., Hejnol A. Animal phylogeny and its evolutionary implications. Annu. Rev. Ecol. Evol. Syst. 2014;45:371–395. doi: 10.1146/annurev-ecolsys-120213-091627.
    3. Kapli P., Yang Z., Telford M.J. Phylogenetic tree building in the genomic age. Nat. Rev. Genet. 2020;21:428–444. doi: 10.1038/s41576-020-0233-0.
    4. Tihelka E., Cai C., Giacomelli M., Lozano-Fernandez J., Rota-Stabelli O., Huang D., Engel M.S., Donoghue P.C.J., Pisani D. The evolution of insect biodiversity. Curr. Biol. 2021;31:R1299–R1311. doi: 10.1016/j.cub.2021.08.057.
    5. Laumer C.E., Fernández R., Lemer S., Combosch D., Kocot K.M., Riesgo A., Andrade S.C.S., Sterrer W., Sørensen M.V., Giribet G. Revisiting metazoan phylogeny with genomic sampling of all phyla. Proc. Biol. Sci. 2019;286:20190831. doi: 10.1098/rspb.2019.0831.