pubmed.ncbi.nlm.nih.gov

Can deliberately incomplete gene sample augmentation improve a phylogeny estimate for the advanced moths and butterflies (Hexapoda: Lepidoptera)?

Soowon Cho et al. Syst Biol. 2011 Dec.

Abstract

This paper addresses the question of whether one can economically improve the robustness of a molecular phylogeny estimate by increasing gene sampling in only a subset of taxa, without having the analysis invalidated by artifacts arising from large blocks of missing data. Our case study stems from an ongoing effort to resolve poorly understood deeper relationships in the large clade Ditrysia (>150,000 species) of the insect order Lepidoptera (butterflies and moths). Seeking to remedy the overall weak support for deeper divergences in an initial study based on five nuclear genes (6.6 kb) in 123 exemplars, we nearly tripled the total gene sample (to 26 genes, 18.4 kb) but only in a third (41) of the taxa. The resulting partially augmented data matrix (45% intentionally missing data) consistently increased bootstrap support for groupings previously identified in the five-gene (nearly) complete matrix, while introducing no contradictory groupings of the kind that missing data have been predicted to produce. Our results add to growing evidence that data sets differing substantially in gene and taxon sampling can often be safely and profitably combined. The strongest overall support for nodes above the family level came from including all nucleotide changes, while partitioning sites into sets undergoing mostly nonsynonymous versus mostly synonymous change. In contrast, support for the deepest node for which any persuasive molecular evidence has yet emerged (78-85% bootstrap) was weak or nonexistent unless synonymous change was entirely excluded, a result plausibly attributed to compositional heterogeneity. This node (Gelechioidea + Apoditrysia), tentatively proposed by previous authors on the basis of four morphological synapomorphies, is the first major subset of ditrysian superfamilies to receive strong statistical support in any phylogenetic study. A "more-genes-only" data set (41 taxa × 26 genes) also gave strong signal for a second deep grouping (Macrolepidoptera) that was obscured, but not strongly contradicted, in more taxon-rich analyses.

Figures

FIGURE 1.

Diagram of gene and taxon sampling design, showing relationships among the three data sets analyzed. a) Five-gene complete matrix (123 taxa; from Regier et al. 2009). b) Partially augmented matrix, deliberately incomplete, created by adding 21 genes for just 41 of the 123 taxa in the five-gene complete matrix. c) More-genes-only matrix, consisting of just the 41 species sequenced for all 26 genes.
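
The sampling design above implies a block of structurally empty cells in the partially augmented matrix, and its size can be estimated from the figures given in the abstract. The following is a minimal sketch, not the authors' calculation: the function name is hypothetical, and it assumes that the five original genes (6.6 kb) are a subset of the 26 genes (18.4 kb) and that the only missing cells are the 21 added genes in the 82 non-augmented taxa.

```python
def structural_missing_fraction(n_taxa, n_augmented, base_size, total_size):
    """Fraction of matrix cells left empty by design.

    Hypothetical helper: every taxon is assumed to have `base_size` units of
    data (genes or kb), and only `n_augmented` taxa have the full `total_size`.
    """
    if not 0 <= n_augmented <= n_taxa:
        raise ValueError("n_augmented must lie between 0 and n_taxa")
    if not 0 < base_size <= total_size:
        raise ValueError("base_size must be positive and at most total_size")
    # Cells that were never sequenced: added data in the non-augmented taxa.
    missing = (n_taxa - n_augmented) * (total_size - base_size)
    return missing / (n_taxa * total_size)


# Values taken from the abstract: 123 taxa, 41 augmented, 6.6 kb -> 18.4 kb.
by_length = structural_missing_fraction(123, 41, 6.6, 18.4)
# Same design counted in genes rather than sequence length: 5 -> 26 genes.
by_gene = structural_missing_fraction(123, 41, 5, 26)

print(f"missing by sequence length: {by_length:.1%}")
print(f"missing by gene count:      {by_gene:.1%}")
```

Under these assumptions the design alone accounts for roughly 43% missing data by sequence length. The 45% reported in the abstract is slightly higher, presumably because the actual alignment also contains scattered gaps in the "(nearly) complete" five-gene block, although the abstract does not state how the figure was computed.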

FIGURE 2.

Comparison of ML trees of family relationships inferred from the five-gene complete matrix (left column) to those from the partially augmented matrix (123 taxa × 5 or 26 genes; right column), simplified from full 123-taxon trees shown in Figures S1–S4. Black triangles denote families with multiple exemplars. Numbers in parentheses after the family name represent the number of exemplars for the five-gene complete matrix, or the number with 26 genes/total number for the partially augmented matrix. ### denotes families with one or more exemplars scored for 26 genes for the partially augmented matrix. BPs > 50% are shown above branches; number of replicates is 1000 for a) and b), 2000 for c) and d). a) nt123 partitioned, five-gene complete matrix; b) nt123 partitioned, partially augmented matrix; c) all nonsynonymous coding, five-gene complete matrix; d) all nonsynonymous coding, partially augmented matrix.

FIGURE 3.

a)–c) ML trees of family relationships inferred from the more-genes-only data set (41 taxa × 26 genes), simplified (except in c) from full 41-taxon trees shown in Figures S5 and S6. Black triangles denote families with multiple exemplars; number of exemplars shown in parentheses after family name. a) all nonsynonymous coding, phylogram; b) all nonsynonymous coding, cladogram; c) nt123 partitioned. BPs > 50% are shown above branches; number of replicates is 1000 for a), 2000 for b). d) Relationships among the sampled families (only) according to the morphology-based working hypothesis of Kristensen and Skalski (1998).
