Using classical population genetics tools with heterochroneous data: time matters!

Frantz Depaulis et al. PLoS One. 2009.

Abstract

Background: New polymorphism datasets from heterochroneous data have arisen thanks to recent advances in experimental and microbial molecular evolution, and the sequencing of ancient DNA (aDNA). However, classical tools for population genetics analyses do not take into account heterochrony between subsets, despite potential bias on neutrality and population structure tests. Here, we characterize the extent of such possible biases using serial coalescent simulations.

Methodology/principal findings: We first use a coalescent framework to generate datasets assuming no or different levels of heterochrony and contrast most classical population genetic statistics. We show that even weak levels of heterochrony (approximately 10% of the average depth of a standard population tree) affect the distribution of polymorphism substantially, leading to overestimation of the level of polymorphism theta and to star-like trees, with an excess of rare mutations and a deficit of linkage disequilibrium, which are the hallmark of, e.g., population expansion (possibly after a drastic bottleneck). Substantial departures of the tests are detected in the opposite direction for more heterochroneous and equilibrated datasets, with balanced trees mimicking in particular population contraction, balancing selection, and population differentiation. We therefore introduce simple corrections to classical estimators of polymorphism and of the genetic distance between populations, in order to remove heterochrony-driven bias. Finally, we show that these effects do occur on real aDNA datasets, taking advantage of the currently available sequence data for Cave Bears (Ursus spelaeus), for which large mtDNA haplotypes have been reported over a substantial time period (22-130 thousand years ago (KYA)).

Conclusions/significance: Considering serial sampling changed the conclusion of several tests, indicating that neglecting heterochrony could provide significant support for false past history of populations and inappropriate conservation decisions. We therefore argue for systematically considering heterochroneous models when analyzing heterochroneous samples covering a large time scale.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Heterochrony effects on gene genealogies.

(A) Contemporaneous dataset. (B) Heterochroneous dataset. Lineages of sequences cannot reach a common ancestor before they are contemporaneous.

Figure 2. Outline of the main models simulated.

(A) Single panmictic population with variable proportions of ancient data (n1/n; one third in this example) of moderate age (0.1) with respect to the time unit of 2Ne generations (i.e. the average age of the root of a population tree in the homogeneous, contemporaneous case; see figure 3). (B) Corresponding simulations for the population differentiation (Fst) analysis: a single homogeneous set of individuals, labeled as randomly split into two equally sampled populations, one of which shows variable proportions of ancient data (n1/n again, one third in this example) of moderate age (again t1 = 0.1; see figure 3, Fst). (C) Single panmictic population with equal proportions of ancient and modern data (n1/n = 1/2, equivalent to the two population samples for the Fst analysis); the age of the ancient samples ranges from 0 to 20 Ne generations (see figure 4). See text for more explanations.
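The Fst design in panel (B) can be illustrated with a minimal Python sketch, assuming the estimator takes the form Fst = 1 - Hw/Hb (Hw: mean pairwise differences within subsets, Hb: mean pairwise differences between subsets), which is how the Hudson, Slatkin and Maddison statistic named in the Figure 3 caption is usually written. The function names, the unweighted averaging of the two within-subset means, and the p-value convention are assumptions made for illustration, not the authors' code.

```python
import random
from itertools import combinations


def pairwise_diff(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mean_within(pop):
    """Mean pairwise difference among sequences of one subset."""
    pairs = list(combinations(pop, 2))
    return sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs)


def fst_hw_hb(pop1, pop2):
    """Fst = 1 - Hw/Hb (assumed form; Hw is the unweighted mean of the two subsets)."""
    hw = (mean_within(pop1) + mean_within(pop2)) / 2.0
    hb = sum(pairwise_diff(a, b) for a in pop1 for b in pop2) / (len(pop1) * len(pop2))
    if hb == 0:
        return 0.0  # no variation between subsets: treated as no differentiation (assumption)
    return 1.0 - hw / hb


def permutation_p_value(pop1, pop2, n_perm=500, seed=0):
    """One-sided p-value: share of label permutations with Fst >= the observed Fst."""
    rng = random.Random(seed)
    observed = fst_hw_hb(pop1, pop2)
    pooled = list(pop1) + list(pop2)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if fst_hw_hb(pooled[:len(pop1)], pooled[len(pop1):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    modern = ["0000"] * 5
    ancient = ["1111"] * 5
    print(fst_hw_hb(modern, ancient), permutation_p_value(modern, ancient))
```

In the simulated design, the two subsets come from one panmictic population, so any significant Fst reflects the age difference between samples rather than real population structure.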

Figure 3. Effect of subset size on statistical tests.

The temporal spacing between the two subsets is set to 0.2 Ne generations. Ten thousand runs were simulated for each set of parameter values. The X axis corresponds to the proportion of the ancient subset. DT: Tajima's D; D*FL: Fu and Li's D*; HFW: Fay and Wu's H [note that this statistic is not standardized by its variance and can thus potentially show high absolute values, hence a rather erratic behavior on fig. 2a]; ZnS: Kelly's ZnS; K and H: Depaulis and Veuille's haplotype tests (K is scaled to the expected S+1, its expected maximal value in the absence of recombination and homoplasy); Slope: recombination test, Pearson correlation coefficient between pairwise allelic correlation and distance between mutations, tested by permutations according to Awadalla and colleagues; Fst: Hudson, Slatkin and Maddison's Fst between two population subsamples of equal size (50:50), in which case the X axis corresponds to the proportion of ancient sequences in the second subset. This Fst is tested by permutations between subsets. Five hundred permutations were used in these last two tests. (A) Mean (bias) and (B) proportion of significant runs that show deviation from the standard coalescent expectations (rate of false positives). Only portions of curves above 6% (as an arbitrary threshold of marginal significance) are shown for clarity. Note the different scale of the Y axis on the top part of figure B. Empty symbols: deficit of the statistics; filled symbols: excess of the statistics.
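As a concrete reference for the first statistic in this list, the following Python sketch computes Tajima's D from its standard published formula (Tajima 1989). The function names and the 0/1 haplotype encoding are assumptions made for illustration; this is not the software used for the simulations.

```python
from math import sqrt


def tajimas_d_from_summary(n, num_seg_sites, pi):
    """Tajima's D from sample size n, number of segregating sites S and
    mean pairwise difference pi. Returns None when D is undefined."""
    if n < 4 or num_seg_sites == 0:
        return None  # the variance term is zero for n < 4
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    s = num_seg_sites
    variance = e1 * s + e2 * s * (s - 1)
    # Numerator: pi minus Watterson's estimator S / a1
    return (pi - s / a1) / sqrt(variance)


def tajimas_d(haplotypes):
    """Tajima's D from equal-length haplotypes encoded as lists of 0/1 alleles."""
    n = len(haplotypes)
    counts = [sum(column) for column in zip(*haplotypes)]
    segregating = [c for c in counts if 0 < c < n]
    pi = sum(2.0 * c * (n - c) for c in segregating) / (n * (n - 1))
    return tajimas_d_from_summary(n, len(segregating), pi)
```

An excess of rare mutations pulls pi below S/a1 and gives a negative D, whereas an excess of intermediate-frequency mutations gives a positive D, which are the two directions of departure described for weak and strong heterochrony.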

Figure 4. Effect of the time spacing with a 50% subset on statistical tests.

ni = 50, whole second population subsample in the Fst analysis. The X axis is expressed in units of 2Ne generations. Same labeling and other parameter values as in figure 3.

Figure 5. Effect of time sampling schemes on the statistics.

(a) Means. For comparison, statistics with non-null means in the contemporaneous case are scaled to the upper bound of their confidence interval under such null hypothesis. (b) Proportions of significant runs. Only the direction in which the heterochroneous case potentially deviates (if at all) is shown (the other one remaining below 5%). ‘inf’: deficit of the statistic; ‘sup’: excess of the statistic. Open bars: contemporaneous; striped bars: regular in the range [0–0.2]; homogeneous gray bars: uniform, same range; gradient-filled bars: exponential with mean 0.1 (truncated at 10 to limit CPU and assuming that there was no chance at all to obtain as old DNA for a species that may not have even existed at that time).

Figure 6. Heterochrony-driven biases on summary statistics: a synthesis.

(A) Contemporaneous case. (B) Heterochroneous dataset with limited time range. Lineages of sequences cannot reach a common ancestor before they are contemporaneous, leading to genealogies with proportionally longer external branches and an excess of rare mutations, thus mimicking bottlenecks, expansions or tightly linked selection. (C) Two subsets separated by a large time lapse. The coalescence process is finished within the most recent subset before reaching the ancient subset sampling point, leading to a genealogy with a long internal branch, more variation, especially for intermediate frequency mutations, and a genetically isolated subset, thus mimicking simple population structure or contraction. t1: time lapse; n1: oldest subset's size.
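The mechanism summarized in this caption can be sketched with a small two-epoch serial coalescent in Python. This is a minimal illustration, not the simulator used in the paper: the function names are hypothetical, and time is assumed to be measured in coalescent units in which each pair of lineages coalesces at rate 1, which may differ from the scaling of the figures by a constant factor.

```python
import random


def simulate_branch_lengths(n_modern, n_ancient, t1, rng):
    """Simulate one genealogy with n_modern samples at time 0 and n_ancient
    samples at time t1 (backward in time). Returns (total, external) branch length."""
    lineages = [True] * n_modern  # True marks a lineage that is still an external branch
    pending = n_ancient           # ancient samples not yet added to the genealogy
    t = total = external = 0.0
    while len(lineages) + pending > 1:
        k = len(lineages)
        rate = k * (k - 1) / 2.0
        wait = rng.expovariate(rate) if rate > 0 else float("inf")
        if pending and t + wait >= t1:
            # No coalescence before t1: advance to t1 and add the ancient samples.
            dt = t1 - t
            total += k * dt
            external += sum(lineages) * dt
            lineages += [True] * pending
            pending = 0
            t = t1
            continue
        total += k * wait
        external += sum(lineages) * wait
        t += wait
        # Merge two random lineages into a single internal lineage.
        i, j = rng.sample(range(k), 2)
        for idx in sorted((i, j), reverse=True):
            lineages.pop(idx)
        lineages.append(False)
    return total, external


def mean_branch_lengths(n_modern, n_ancient, t1, reps=2000, seed=1):
    """Monte Carlo means of (total, external) branch length over reps genealogies."""
    rng = random.Random(seed)
    sum_total = sum_external = 0.0
    for _ in range(reps):
        total, ext = simulate_branch_lengths(n_modern, n_ancient, t1, rng)
        sum_total += total
        sum_external += ext
    return sum_total / reps, sum_external / reps


if __name__ == "__main__":
    for t1 in (0.0, 0.1, 2.0):
        print(t1, mean_branch_lengths(5, 5, t1))
```

Because the expected number of segregating sites is proportional to total branch length, a longer tree under heterochrony translates into an inflated estimate of theta when the sample is analyzed as if it were contemporaneous.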
