Principal components analysis of population admixture
Jianzhong Ma et al. PLoS One. 2012.
Abstract
With the availability of high-density genotype information, principal components analysis (PCA) is now routinely used to detect and quantify the genetic structure of populations in both population genetics and genetic epidemiology. An important issue is how to make appropriate and correct inferences about population relationships from the results of PCA, especially when admixed individuals are included in the analysis. We extend our recently developed theoretical formulation of PCA to allow for admixed populations. Because the sampled individuals are treated as features, our generalized formulation of PCA directly relates the pattern of the scatter plot of the top eigenvectors to the admixture proportions and parameters reflecting the population relationships, and thus can provide valuable guidance on how to properly interpret the results of PCA in practice. Using our formulation, we theoretically justify the diagnostic of two-way admixture. More importantly, our theoretical investigations based on the proposed formulation yield a diagnostic of multi-way admixture. For instance, we found that admixed individuals with three parental populations are distributed inside the triangle formed by their parental populations and divide the triangle into three smaller triangles whose areas have the same proportions in the big triangle as the corresponding admixture proportions. We tested and illustrated these findings using simulated data and data from HapMap III and the Human Genome Diversity Project.
Conflict of interest statement
Competing Interests: The authors have declared that no competing interests exist.
Figures

The first two eigenvectors are shown for a simulated data set with five populations. P1 and P2, the two parental populations of sample size […] each, were both simulated using […]. P3 is an admixture of P1 and P2 with proportion 0.5∶0.5. P4 is an admixture of P1 and P2 with proportion 0.3∶0.7. P5 is an admixture of P1 and P2 with proportion 0.7∶0.3. The sample size for the three admixed populations is 35. The ratios of the distances between the centroids of P3, P4, and P5 and those of P1 and P2 were found to be approximately equal to the corresponding admixture proportions.
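The distance-ratio reading of this figure can be illustrated with a minimal sketch. It assumes the cluster centroids in the plane of the top eigenvectors are already available; the function name `two_way_proportions` and the example coordinates are hypothetical and do not come from the paper.

```python
import numpy as np

def two_way_proportions(admixed, parent1, parent2):
    """Estimate (P1 share, P2 share) for an admixed centroid.

    The admixed centroid is projected onto the segment joining the two
    parental centroids. Its fractional distance from the P2 centroid is
    read as the P1 admixture proportion, and vice versa.
    """
    x, c1, c2 = (np.asarray(p, dtype=float) for p in (admixed, parent1, parent2))
    d = c1 - c2                              # direction from P2 towards P1
    share_p1 = np.dot(x - c2, d) / np.dot(d, d)
    return share_p1, 1.0 - share_p1

# Hypothetical centroids in the (eigenvector 1, eigenvector 2) plane.
c1 = np.array([0.0, 0.0])
c2 = np.array([1.0, 0.0])
x = 0.3 * c1 + 0.7 * c2                      # a 0.3:0.7 admixture of P1 and P2
print(two_way_proportions(x, c1, c2))
```

A cluster that sits closer to one parental centroid therefore carries the larger share of that parent's ancestry.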

The first two eigenvectors are shown for a simulated data set with five populations. P4 and P5, each with sample size […], were simulated as admixed populations of P1 and P2 with admixture proportions […]:[…] and […]:[…], respectively. P3 was simulated as an additional population. P1, P2, and P3, each with sample size […], were simulated using […]. The clusters of P4 and P5 were found to lie on the line segment connecting the centroids of P1 and P2, and they divided the segment according to ratios that are approximately equal to the corresponding simulating values of the admixture proportions. The third eigenvector in the left panel addresses the within-population variations.

The first two eigenvectors are shown for a simulated data set with five populations. P4 and P5, with sample size […], were simulated as admixed populations of P1 and P2 with admixture proportions drawn from a beta distribution with shape parameters […] and […]. P3 was simulated as an additional population. P1, P2, and P3, each with sample size […], were simulated using […]. Samples from P4 and P5 were distributed along the line connecting the centroids of P1 and P2. Because there were only three independent populations (P1, P2, and P3), only two eigenvectors are needed to address the population variations. This is why along the third eigenvector, only the within-population variations were addressed.
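A small simulation can illustrate why individuals with varying admixture proportions spread along the line between the parental centroids. The sketch below is not the authors' simulation: the allele-frequency model, sample sizes, SNP count, and the Beta(2, 2) shape parameters are all assumptions chosen for illustration, and it uses an ordinary SVD-based PCA rather than the formulation developed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_snps, n_parent, n_admixed = 2000, 30, 40   # assumed sizes, not from the paper

# Assumed allele frequencies: independent uniform draws for P1, P2, P3.
freqs = rng.uniform(0.1, 0.9, size=(3, n_snps))

def sample_genotypes(p, n):
    """Draw n diploid genotypes (allele counts 0/1/2) given allele frequencies p."""
    return rng.binomial(2, np.broadcast_to(p, (n, n_snps)))

parents = [sample_genotypes(freqs[k], n_parent) for k in range(3)]

# Individual P1 ancestry shares; the Beta(2, 2) parameters are an assumption.
alpha = rng.beta(2.0, 2.0, size=n_admixed)
p_admixed = alpha[:, None] * freqs[0] + (1.0 - alpha[:, None]) * freqs[1]
admixed = sample_genotypes(p_admixed, n_admixed)

# Standard PCA: centre each SNP, then take the top two components via SVD.
G = np.vstack(parents + [admixed]).astype(float)
G -= G.mean(axis=0)
U, S, _ = np.linalg.svd(G, full_matrices=False)
pcs = U[:, :2] * S[:2]

# Project admixed individuals onto the line joining the P1 and P2 centroids.
c1 = pcs[:n_parent].mean(axis=0)
c2 = pcs[n_parent:2 * n_parent].mean(axis=0)
d = c1 - c2
alpha_hat = (pcs[3 * n_parent:] - c2) @ d / (d @ d)

corr = np.corrcoef(alpha, alpha_hat)[0, 1]
print(f"correlation between simulated and projected P1 shares: {corr:.3f}")
```

Under these assumptions, each admixed individual's position along the P1-P2 line tracks its simulated ancestry share, which is the pattern the caption describes.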

The first two eigenvectors are shown for the four HapMap populations ASW, CEU, CHB, and YRI. The ASW samples, drawn from a recently admixed population, formed a dispersion, or gradient. CEU and YRI served as the proxy parental populations of ASW. CHB was included in the analysis to introduce an additional dimension of variation, so that the dispersion could be seen in the two-dimensional space. The third eigenvector addresses the within-population variation.

The eigenvectors for the four hypothetical populations defined in Table 1 were calculated from the reduced eigenequation (1). In the plane spanned by the first two eigenvectors (left panel), the representative point of the admixed population, P4, was located inside the triangle formed by the representative points of the three parental populations, P1, P2, and P3, and divided the triangle into three small triangles with areas according to the admixture proportions. On the right panel, P4 was outside the triangle because the third eigenvector, corresponding to a small eigenvalue, did not reflect population structure.

The first three eigenvectors are shown for a simulated data set with five populations. P5, with sample size […], was an admixed population of P1, P2, and P3 with admixture proportions […]:[…]:[…]. P4 was an additional population. P1, P2, P3, and P4 were simulated using […], with sample sizes […], […], […], and […], respectively. In the three-dimensional space, the samples from P5 were found to cluster around a point inside the triangle formed by the centroids of the three parental populations, and they divided the triangle into three small triangles, the ratio of the areas of which was approximately equal to the corresponding ratio of the simulating admixture proportions. Population P4 was included to introduce an additional dimension of variation, so that the admixed population and the parental populations formed an inclined triangle.
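The triangle-area diagnostic can be written down as a short sketch. It assumes the admixed centroid lies in the plane of the three parental centroids and inside their triangle; the small triangle opposite a parental vertex is then read as that parent's share, which is the standard barycentric-coordinate relation. The function names and example coordinates are hypothetical.

```python
import numpy as np

def triangle_area(a, b, c):
    """Area of triangle abc in any dimension, via the Gram determinant."""
    u, v = b - a, c - a
    gram = np.dot(u, u) * np.dot(v, v) - np.dot(u, v) ** 2
    return 0.5 * np.sqrt(max(gram, 0.0))

def three_way_proportions(admixed, parent1, parent2, parent3):
    """Estimate (P1, P2, P3) shares from the areas of the three sub-triangles.

    Only meaningful when the admixed point lies inside the parental triangle;
    for a point outside it, the normalised areas no longer match the shares.
    """
    x, c1, c2, c3 = (np.asarray(p, dtype=float)
                     for p in (admixed, parent1, parent2, parent3))
    areas = np.array([
        triangle_area(x, c2, c3),   # opposite P1, read as the P1 share
        triangle_area(x, c1, c3),   # opposite P2, read as the P2 share
        triangle_area(x, c1, c2),   # opposite P3, read as the P3 share
    ])
    return areas / areas.sum()

# Hypothetical centroids in the space of the first three eigenvectors.
c1 = np.array([1.0, 0.0, 0.0])
c2 = np.array([0.0, 1.0, 0.0])
c3 = np.array([0.0, 0.0, 1.0])
x = 0.5 * c1 + 0.3 * c2 + 0.2 * c3           # a 0.5:0.3:0.2 admixture
print(three_way_proportions(x, c1, c2, c3))
```

Because the area formula works in any dimension, the same function applies to the two-dimensional plot of the first two eigenvectors and to the inclined triangle in three dimensions.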

The first three eigenvectors are shown for pooled data of the HapMap populations CEU, CHB, MEX, and YRI and the HGDP population Pima. Samples from MEX were found to be distributed around the inclined triangular plane formed by the clusters of CEU, Pima, and YRI, and most of them were inside the triangle. CEU, Pima, and YRI served as the proxy parental populations of the MEX population. CHB was included to introduce an additional dimension of variation, so that the three-way admixture-related populations formed an inclined triangle.