Comprehensive characterization, annotation and innovative use of Infinium DNA methylation BeadChip probes

Sun Jan 01 2017

Wanding Zhou et al. Nucleic Acids Res. 2017.

Abstract

Illumina Infinium DNA Methylation BeadChips represent the most widely used genome-scale DNA methylation assays. Existing strategies for masking Infinium probes overlapping repeats or single nucleotide polymorphisms (SNPs) are based largely on ad hoc assumptions and subjective criteria. In addition, the recently introduced MethylationEPIC (EPIC) array expands on the utility of this platform, but has not yet been well characterized. We present in this paper an extensive characterization of probes on the EPIC and HM450 microarrays, including mappability to the latest genome build, genomic copy number of the 3΄ nested subsequence and influence of polymorphisms including a previously unrecognized color channel switch for Type I probes. We show empirical evidence for exclusion criteria for underperforming probes, providing a sounder basis than current ad hoc criteria for exclusion. In addition, we describe novel probe uses, exemplified by the addition of a total of 1052 SNP probes to the existing 59 explicit SNP probes on the EPIC array and the use of these probes to predict ethnicity. Finally, we present an innovative out-of-band color channel application for the dual use of 62 371 probes as internal bisulfite conversion controls.

© The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

Figures

Figure 1.

Influence of SNPs on probe performance. (A) Illustration of the Infinium probe design. Type I probes utilize a pair of methylated (M) and unmethylated (U) probes, designed against the methylated and unmethylated versions of the target DNA, respectively. Signals representing these two alleles are measured in the same color channel, determined by the base incorporated (nucleotides A and T are labeled red, and C labeled green) complementary to the extension base (D = A, T, G in IUPAC code) (top). Type II probes use a single probe and the extension occurs at the target CpG, with a red-labeled A measuring the unmethylated allele and green-labeled G measuring the methylated allele (middle). For Type I probes, color-channel-switching (CCS) SNPs at the extension base can cause signals to come from the alternative color channel (bottom). Green and red signals reflect different alleles of the SNP. (B) Inter-individual variation (calculated as standard deviations; SDs) in beta values measured by Type I probes associated with a SNP located within the probe at a given distance from the 3΄-end of the probe (the target CpG). Normal samples (n = 705) studied in the TCGA project were used to calculate the variation. SD was first calculated within each tissue type (to avoid variance introduced by tissue-specific methylation) and averaged over 13 tissue types; (C) The averaged SD for beta values measured by Type II probes with a SNP present at a given distance from the 3΄ end of the probe, measured in the normal samples studied in the TCGA project; (D) Variation in beta values for Type I probes with CCS SNPs (red), non-channel-switching SNPs (blue) and beta value rescue for CCS SNPs by combining two color channels (green, see ‘Materials and Methods’ section), stratified by minor allele frequency.
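
The variation measure in panels B and C (SD within each tissue type, then averaged across tissue types) can be illustrated with a minimal Python sketch. The function name, the toy beta values and the use of the sample standard deviation are assumptions for illustration only; the caption does not specify these details.

```python
from collections import defaultdict
from statistics import mean, stdev


def tissue_averaged_sd(betas, tissues):
    """SD of one probe's beta values, computed per tissue and then averaged.

    betas   : beta values for one probe, one per sample
    tissues : tissue label for each sample (same order as betas)
    """
    by_tissue = defaultdict(list)
    for beta, tissue in zip(betas, tissues):
        by_tissue[tissue].append(beta)
    # Assumption: sample SD; tissues with fewer than two samples are skipped.
    sds = [stdev(values) for values in by_tissue.values() if len(values) >= 2]
    return mean(sds)


# Toy data: the probe is unmethylated in colon and methylated in lung.
betas = [0.1, 0.3, 0.8, 0.8]
tissues = ["colon", "colon", "lung", "lung"]
print(tissue_averaged_sd(betas, tissues))  # within-tissue variation only
print(stdev(betas))                        # pooled SD, inflated by tissue differences
```

In the toy data the pooled SD is dominated by the colon versus lung difference, which is the tissue-specific variance that the per-tissue calculation is meant to exclude.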

Figure 2.

(A) Distribution of 8620 TCGA samples on the first and the second principal components identified from beta values measured by explicit SNP probes (rs probes designated by the manufacturer). Samples are colored by self-reported ethnicity; (B) Distribution of 8620 TCGA samples on the first and the second principal components identified from allele frequency recovered from CCS SNP probes; (C) Concordance between self-reported ethnicity and predicted ethnicity using explicit SNP probes on the test dataset (n = 1103, methods); (D) Concordance between self-reported ethnicity and predicted ethnicity using the recovered CCS SNP probes on the test dataset.

Figure 3.

(A) Box plots showing copy numbers of 3΄ nested subsequence of the EPIC probes in the bisulfite-converted genome, with varying length of the 3΄ subsequence (x axis). (B) The fraction of probes with a unique 3΄ subsequence at a given length from all the probes; (C) Total signal intensities (sum of methylated and unmethylated alleles; y axis) for uniquely mappable Type I probes (see Supplementary Data for Type II probes) with varying copy numbers of the 3΄ subsequence (x axis) of 15, 30 and 40 bases long, respectively. The signal intensities are measured in a normal colon sample; (D) Linear regression lines showing the dependence of total signal intensities on the genomic copy number of 3΄-subsequences of different lengths, for Type I probes (left panel) and Type II probes (right panel); (E) Association between the copy number of 3΄ subsequence and measurement accuracy. Averaged absolute differences in beta value measurement between HM450 and WGBS measurements for the same set of samples (n = 18) are plotted against different ranges of the copy number of the 3΄ subsequence of lengths 22, 30 and 40 bases.
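
The idea of a copy number for the 3΄ nested subsequence (panels A and B) can be sketched on a toy sequence. This is a simplified illustration under stated assumptions: the function names and sequences are hypothetical, only one strand is searched by exact string matching, and no bisulfite conversion is modeled, whereas the caption refers to the bisulfite-converted genome.

```python
def count_occurrences(genome, query):
    """Count occurrences of query in genome, allowing overlaps."""
    count = 0
    start = genome.find(query)
    while start != -1:
        count += 1
        start = genome.find(query, start + 1)
    return count


def nested_copy_numbers(probe, genome, lengths):
    """Copy number of the 3'-terminal subsequence of probe at each length."""
    return {
        n: count_occurrences(genome, probe[-n:])
        for n in lengths
        if 0 < n <= len(probe)
    }


# Toy example: the last 4 bases of the probe occur three times in the
# toy genome, while the last 5 or 6 bases occur only once.
toy_genome = "AATTGACCTTGAGGTTGA"
toy_probe = "GGTTGA"
print(nested_copy_numbers(toy_probe, toy_genome, [4, 5, 6]))
# {4: 3, 5: 1, 6: 1}
```

A probe would count as unique at a given length when its copy number at that length is 1, which is the quantity summarized as a fraction in panel B.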

Figure 4.

(A) Illustration of the use of CpC and TpC probes as bisulfite conversion controls. Incomplete conversion at C extension sites (CpC probes, 5΄ CYG in template DNA, left) leads to hybridization of a green-fluorescent G with the retained C. Successful bisulfite conversion for CpC probes should be equivalent to a TpC probe (5΄ TYG in template DNA), and lead to hybridization of a red-fluorescent A with T (right panel). In contrast, probes with a T reference allele at the extension site (TpC probes) should not have green signals in the absence of SNPs, and any green signal for these probes should reflect background. (B) Mean green signal across all 46 733 probes for CpC (y axis) versus TpC probes (x axis) in 8652 TCGA samples; (C) Mean CpH beta values versus Green CpC to TpC (GCT) score for TCGA samples (n = 8652). Testicular Germ Cell Tumors (TGCT) within the TCGA datasets are colored by histology, while other tumors are black. Overall Pearson's correlation coefficient = 0.2, P < 1e-10; (D) Top left: density plot of CpG beta value distribution for the cell line replicates with the highest GCT score (green) and the lowest GCT score (red); Top right: density plot of CpG beta value distribution for ten samples in Lung Adenocarcinoma (LUAD) with the highest GCT score (green) and ten LUAD samples with the lowest GCT score. The same is repeated for Prostate Adenocarcinoma (PRAD, bottom left) and Bladder Urothelial Carcinoma (BLCA, bottom right).
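
The logic of panels A and B suggests a simple per-sample summary, sketched below in Python. It assumes that the GCT score is the ratio of the mean green signal of CpC probes to the mean green signal of TpC probes; the caption names the score but does not give its formula, so this definition, the function name and the intensity values are illustrative assumptions rather than the authors' stated method.

```python
from statistics import mean


def gct_score(green_cpc, green_tpc):
    """Ratio of mean green signal: CpC probes over TpC probes.

    Assumed definition. Under complete bisulfite conversion, CpC probes
    should behave like TpC probes, so the ratio should be close to 1.
    Retained (unconverted) C yields extra green signal and a higher ratio.
    """
    background = mean(green_tpc)
    if background <= 0:
        raise ValueError("mean TpC green signal must be positive")
    return mean(green_cpc) / background


# Hypothetical green-channel intensities for one sample.
well_converted = gct_score([120, 100, 110], [100, 120, 110])
poorly_converted = gct_score([300, 300], [100, 100])
print(well_converted, poorly_converted)  # 1.0 3.0
```

Using the TpC probes as the denominator treats their green signal as a background estimate, which matches the caption's statement that any green signal on TpC probes should reflect background.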

Figure 5.

(A) Numbers of probes in each masking category, including previous practices that we do not recommend. Black boxes suggest the enclosed masking is recommended; (B) Circos plot of the distribution of (i) number of EPIC probes in each 1 Mb bin; (ii) number of NCBI RefGene; (iii) number of masks because of mapping issues; (iv) number of probes masked because of non-unique 30-base or longer 3΄ subsequence; (v) number of masks for SNPs using global allele frequency from the entire 1000 Genomes Project population. (C) The resulting population-specific masking with merged population-specific SNP masking and other recommended masking (shown in panel A).
