Principal components analysis of population admixture

Jianzhong Ma et al. PLoS One. 2012.

Abstract

With the availability of high-density genotype information, principal components analysis (PCA) is now routinely used to detect and quantify the genetic structure of populations in both population genetics and genetic epidemiology. An important issue is how to make appropriate and correct inferences about population relationships from the results of PCA, especially when admixed individuals are included in the analysis. We extend our recently developed theoretical formulation of PCA to allow for admixed populations. Because the sampled individuals are treated as features, our generalized formulation of PCA directly relates the pattern of the scatter plot of the top eigenvectors to the admixture proportions and parameters reflecting the population relationships, and thus can provide valuable guidance on how to properly interpret the results of PCA in practice. Using our formulation, we theoretically justify the diagnostic of two-way admixture. More importantly, our theoretical investigations based on the proposed formulation yield a diagnostic of multi-way admixture. For instance, we found that admixed individuals with three parental populations are distributed inside the triangle formed by their parental populations and divide the triangle into three smaller triangles whose areas have the same proportions in the big triangle as the corresponding admixture proportions. We tested and illustrated these findings using simulated data and data from HapMap III and the Human Genome Diversity Project.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Simulation of genetically homogeneous admixed populations: two-way admixture.

The first two eigenvectors are shown for a simulated data set with five populations. P1 and P2, the two parental populations of sample size formula image each, were both simulated using formula image. P3 is an admixture of P1 and P2 with proportion 0.5∶0.5. P4 is an admixture of P1 and P2 with proportion 0.3∶0.7. P5 is an admixture of P1 and P2 with proportion 0.7∶0.3. The sample size for the three admixed populations is 35. The ratios of the distances between the centroids of P3, P4, and P5 and those of P1 and P2 were found to be approximately equal to the corresponding admixture proportions.
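
The centroid-distance diagnostic in this caption can be sketched numerically. The snippet below is a minimal illustration, not the authors' code: the function name `two_way_proportion` and the example coordinates are hypothetical, and it assumes the admixed centroid lies on or near the segment joining the two parental centroids in the eigenvector plot.

```python
import numpy as np

def two_way_proportion(admixed, parent_a, parent_b):
    """Estimate the share of ancestry from parent_a for an admixed centroid.

    The admixed point is projected onto the segment from parent_b to
    parent_a; the returned value is its relative position along that
    segment (0 at parent_b, 1 at parent_a).
    """
    a = np.asarray(parent_a, dtype=float)
    b = np.asarray(parent_b, dtype=float)
    x = np.asarray(admixed, dtype=float)
    ab = a - b
    return float(np.dot(x - b, ab) / np.dot(ab, ab))

# Hypothetical centroid coordinates in the plane of the first two eigenvectors.
p1 = (-1.0, 0.0)
p2 = (1.0, 0.0)
p4 = (0.4, 0.0)  # 0.3 * p1 + 0.7 * p2, mimicking the 0.3:0.7 population

# Share of P1 ancestry for P4; approximately 0.3 for these coordinates.
share_p1 = two_way_proportion(p4, p1, p2)
```

The projection also tolerates points that sit slightly off the segment, which is what one would expect from sampling noise in real or simulated eigenvector plots.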

Figure 2. Simulation of genetically homogeneous admixed populations with an additional population: two-way admixture.

The first two eigenvectors are shown for a simulated data set with five populations. P4 and P5, each with sample size formula image, were simulated as admixed populations of P1 and P2 with admixture proportions formula image:formula image and formula image:formula image, respectively. P3 was simulated as an additional population. P1, P2, and P3, each with sample size formula image, were simulated using formula image. The clusters of P4 and P5 were found to lie on the line segment connecting the centroids of P1 and P2, and they divided the segment according to ratios that are approximately equal to the corresponding simulating values of the admixture proportions. The third eigenvector in the left panel addresses the within-population variations.

Figure 3. Simulation of genetically recently admixed populations with an additional population: two-way admixture.

The first two eigenvectors are shown for a simulated data set with five populations. P4 and P5, with sample size formula image, were simulated as admixed populations of P1 and P2 with admixture proportions drawn from a beta distribution with shape parameters formula image and formula image. P3 was simulated as an additional population. P1, P2, and P3, each with sample size formula image, were simulated using formula image. Samples from P4 and P5 were distributed along the line connecting the centroids of P1 and P2. Because there were only three independent populations (P1, P2, and P3), only two eigenvectors are needed to address the population variations. This is why along the third eigenvector, only the within-population variations were addressed.
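
The gradient of recently admixed samples along the line between the parental clusters can be reproduced with a small simulation. This is a rough sketch under stated assumptions, not the simulation scheme used in the paper: parental allele frequencies are drawn independently from a uniform distribution, the beta shape parameters (2, 2), sample sizes, and SNP count are all hypothetical choices, and the additional population P3 is omitted for brevity.

```python
import numpy as np

def simulate_recent_admixture(n_parent=50, n_admixed=50, n_snps=2000, seed=0):
    """Simulate two parental groups plus recently admixed individuals and
    return the true admixture proportions and PC1 scores of the admixed group."""
    rng = np.random.default_rng(seed)

    # Hypothetical parental allele frequencies, drawn independently per SNP.
    p1 = rng.uniform(0.1, 0.9, n_snps)
    p2 = rng.uniform(0.1, 0.9, n_snps)

    # Ancestry share from P1: 1 for P1 samples, 0 for P2 samples,
    # beta-distributed for each admixed individual (assumed shapes 2, 2).
    alpha_admixed = rng.beta(2.0, 2.0, n_admixed)
    alpha = np.concatenate([np.ones(n_parent), np.zeros(n_parent), alpha_admixed])

    # Each individual's allele frequency is the ancestry-weighted mix.
    freq = alpha[:, None] * p1 + (1.0 - alpha[:, None]) * p2
    genotypes = rng.binomial(2, freq).astype(float)  # allele counts 0, 1, 2

    # PCA via SVD of the column-centered genotype matrix.
    centered = genotypes - genotypes.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    pc1 = u[:, 0] * s[0]

    return alpha_admixed, pc1[2 * n_parent:]

alpha_true, pc1_admixed = simulate_recent_admixture()
# The sign of an eigenvector is arbitrary, so compare the absolute correlation.
r = abs(np.corrcoef(alpha_true, pc1_admixed)[0, 1])
print(f"|correlation| between true proportion and PC1 score: {r:.3f}")
```

A strong correlation here indicates that each admixed individual's position along the first eigenvector tracks its own admixture proportion, which is why such samples spread out as a gradient instead of forming a tight cluster.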

Figure 4. An example of two-way admixture from HapMap data.

The first two eigenvectors are shown for the four HapMap populations ASW, CEU, CHB, and YRI. A dispersion, or gradient, was formed by the ASW samples as a recently admixed population. CEU and YRI served as the proxy parental populations of ASW. CHB was included in the analysis to introduce an additional dimension of variation, so that the dispersion can be seen in the two-dimensional space. The third eigenvector addresses the within-population variation.

Figure 5. Theoretical prediction of PCA: three-way admixture.

The eigenvectors for the four hypothetical populations defined in Table 1 were calculated from the reduced eigenequation (1). In the plane spanned by the first two eigenvectors (left panel), the representative point of the admixed population, P4, was located inside the triangle formed by the representative points of the three parental populations, P1, P2, and P3, and divided the triangle into three small triangles with areas according to the admixture proportions. On the right panel, P4 was outside the triangle because the third eigenvector, corresponding to a small eigenvalue, did not reflect population structure.
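
The area-ratio diagnostic for three-way admixture can be illustrated with a short calculation. The sketch below is hypothetical and is not taken from the paper: the function names and coordinates are invented for illustration, and it assumes the usual barycentric convention in which the sub-triangle opposite a parental vertex corresponds to that parent's admixture proportion. The area formula works for points in two or three dimensions.

```python
import numpy as np

def triangle_area(p, q, r):
    """Area of the triangle pqr in any dimension (Gram determinant form)."""
    p, q, r = (np.asarray(v, dtype=float) for v in (p, q, r))
    u, v = q - p, r - p
    gram = np.dot(u, u) * np.dot(v, v) - np.dot(u, v) ** 2
    return 0.5 * np.sqrt(max(gram, 0.0))

def three_way_proportions(admixed, parent1, parent2, parent3):
    """Area of each sub-triangle as a fraction of the parental triangle.

    The sub-triangle opposite parent k is attributed to parent k
    (assumed convention). The three fractions sum to 1 only when the
    admixed point lies inside the triangle and in its plane.
    """
    total = triangle_area(parent1, parent2, parent3)
    a1 = triangle_area(admixed, parent2, parent3)
    a2 = triangle_area(parent1, admixed, parent3)
    a3 = triangle_area(parent1, parent2, admixed)
    return a1 / total, a2 / total, a3 / total

# Hypothetical parental representative points in the eigenvector plane.
p1, p2, p3 = np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([0.0, 4.0])
p4 = 0.5 * p1 + 0.3 * p2 + 0.2 * p3  # admixed point with shares 0.5:0.3:0.2

shares = three_way_proportions(p4, p1, p2, p3)  # close to (0.5, 0.3, 0.2)
```

Because unsigned areas are used, a point outside the triangle yields fractions that sum to more than 1, which can serve as a quick check that the three-way interpretation does not hold for that sample.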

Figure 6. Simulation of a genetically homogeneous admixed population with an additional population: three-way admixture.

The first three eigenvectors are shown for a simulated data set with five populations. P5, with sample size formula image, was an admixed population of P1, P2, and P3 with admixture proportions formula image:formula image:formula image. P4 was an additional population. P1, P2, P3, and P4 were simulated using formula image, with sample sizes formula image, formula image, formula image, and formula image, respectively. In the three-dimensional space, the samples from P5 were found to cluster around a point inside the triangle formed by the centroids of the three parental populations, and they divided the triangle into three small triangles, the ratio of the areas of which was approximately equal to the corresponding ratio of the simulating admixture proportions. Population P4 was included to introduce an additional dimension of variation, so that the admixed population and the parental populations formed an inclined triangle.

Figure 7. An example of three-way admixture from the HapMap and HGDP samples.

The first three eigenvectors are shown for pooled data of the HapMap populations CEU, CHB, MEX, and YRI and the HGDP population Pima. Samples from MEX were found to be distributed around the inclined triangular plane formed by the clusters of CEU, Pima, and YRI, and most of them were inside the triangle. CEU, Pima, and YRI served as the proxy parental populations of the MEX population. CHB was included to introduce an additional dimension of variation, so that the three-way admixture-related populations formed an inclined triangle.
