Impact of training sets on classification of high-throughput bacterial 16s rRNA gene surveys

Jeffrey J Werner et al. ISME J. 2012 Jan.

Abstract

Taxonomic classification of the thousands to millions of 16S rRNA gene sequences generated in microbiome studies is often achieved using a naïve Bayesian classifier (for example, the Ribosomal Database Project II (RDP) classifier), due to favorable trade-offs among automation, speed and accuracy. The resulting classification depends on the reference sequences and taxonomic hierarchy used to train the model; although the influence of primer sets and classification algorithms has been explored in detail, the influence of the training set has not been characterized. We compared classification results obtained using three different publicly available databases as training sets, applied to five different bacterial 16S rRNA gene pyrosequencing data sets (generated from human body, mouse gut, python gut, soil and anaerobic digester samples). We observed numerous advantages to using the largest, most diverse training set available, which we constructed from the Greengenes (GG) bacterial/archaeal 16S rRNA gene sequence database and the latest GG taxonomy. Phylogenetic clusters of previously unclassified experimental sequences were identified with notable improvements (for example, 50% reduction in reads unclassified at the phylum level in mouse gut, soil and anaerobic digester samples), especially for phylotypes belonging to specific phyla (Tenericutes, Chloroflexi, Synergistetes and Candidate phyla TM6, TM7). Trimming the reference sequences to the primer region resulted in systematic improvements in classification depth, with the greatest gains at higher confidence thresholds. Phylotypes unclassified at the genus level represented a greater proportion of the total community variation than classified operational taxonomic units in mouse gut and anaerobic digester samples, underscoring the need for greater diversity in existing reference databases.
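To make the roles of the training set and the confidence threshold concrete, here is a minimal sketch of a k-mer naïve Bayesian classifier with bootstrap confidence, loosely modeled on how the RDP classifier is usually described. It is not the RDP implementation or the pipeline used in the study. The function names (`kmers`, `train`, `classify`), the smoothing formula, the word size of 8 and the bootstrap scheme (100 trials, each drawing one-eighth of the query words) are assumptions chosen for illustration.

```python
import math
import random
from collections import defaultdict

WORD_SIZE = 8  # assumed word length; illustrative default


def kmers(seq, word_size=WORD_SIZE):
    """Return the set of distinct words (k-mers) found in a sequence."""
    return {seq[i:i + word_size] for i in range(len(seq) - word_size + 1)}


def train(references, word_size=WORD_SIZE):
    """Build a model from (taxon, sequence) pairs: the training set.

    Returns the list of taxa and a function giving log P(word | taxon).
    """
    n_total = len(references)
    word_count = defaultdict(int)   # training sequences containing each word
    taxon_size = defaultdict(int)   # training sequences per taxon
    taxon_word = defaultdict(lambda: defaultdict(int))
    for taxon, seq in references:
        taxon_size[taxon] += 1
        for w in kmers(seq, word_size):
            word_count[w] += 1
            taxon_word[taxon][w] += 1

    def log_prob(taxon, w):
        # Smoothed estimate (assumed form): unseen words get a small prior.
        prior = (word_count.get(w, 0) + 0.5) / (n_total + 1)
        seen = taxon_word[taxon].get(w, 0)
        return math.log((seen + prior) / (taxon_size[taxon] + 1))

    return list(taxon_size), log_prob


def classify(query, model, word_size=WORD_SIZE, n_boot=100, seed=0):
    """Assign the most probable taxon and a bootstrap confidence in [0, 1]."""
    taxa, log_prob = model
    words = sorted(kmers(query, word_size))
    rng = random.Random(seed)

    def best(ws):
        return max(taxa, key=lambda t: sum(log_prob(t, w) for w in ws))

    assignment = best(words)
    subset = max(1, len(words) // word_size)
    # Confidence: fraction of resampled word subsets giving the same taxon.
    hits = sum(
        best(rng.choices(words, k=subset)) == assignment
        for _ in range(n_boot)
    )
    return assignment, hits / n_boot


if __name__ == "__main__":
    # Hypothetical toy training set; real training sets hold many sequences.
    refs = [
        ("Firmicutes", "ACGTACGTTAGCTAGCTAGGATCC"),
        ("Firmicutes", "ACGTACGTTAGCTAGCTAGGATCA"),
        ("Proteobacteria", "TTTTGGGGCCCCAAAATTGGCCAA"),
        ("Proteobacteria", "TTTTGGGGCCCCAAAATTGGCCAT"),
    ]
    taxon, confidence = classify("ACGTACGTTAGCTAGCTAGGATCC", train(refs))
    # Keep the assignment only if it meets a chosen threshold, e.g. 0.8.
    print(taxon if confidence >= 0.8 else "unclassified", confidence)
```

In this sketch a query can only ever receive a label that exists in the training set, and raising the threshold turns low-confidence assignments into "unclassified". This mirrors the abstract's point that classification depth depends on both the reference sequences and the confidence threshold applied.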

Figures

Figure 1

Relative abundance of the 10 major phyla identified by naïve Bayesian classification using five different training sets: three of approximately the same size (GG91.3, SILVA98.7 and RDP TS6) and two larger training sets (GG99 and the SILVA subset for Mothur). Relative abundances were averaged for samples of five different studies (note that human gut is shown apart from non-gut samples from the same study): (a) human gut, (b) non-gut human body locations, (c) mouse gut, (d) anaerobic digester, (e) python gut and (f) soils.

Figure 2

Summary of OTU classification depth using each of the three training sets for two of the four studies: (a) human body OTUs, and (b) soil OTUs (other three data sets shown in Supplementary Figure S6). OTUs are organized according to evolutionary history, as determined by the FastTree approximately-maximum-likelihood tree constructed in the default QIIME pipeline. Inset charts summarize the total number of OTUs classified at each taxonomic level (GG99=dark blue, GG91.3=light blue, SILVA=green, RDP TS6=orange).

Figure 3

The effect of trimming the GG99 training set on classification depth, for each of the five data sets: (a) the total number of OTUs classified at each taxonomic level (Tr=trimmed; full=full length; color key indicates different percentage confidence thresholds applied to the naïve Bayesian model: 60%, 80% and 95%), (b) the total number of classified OTUs gained as a consequence of trimming the training set and (c) the percentage gain of total OTUs classified as a consequence of trimming the training set. Note that all values in (b) and (c) are positive, indicating that trimming always afforded a net gain in classification precision.
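The quantities in panels (b) and (c) amount to simple arithmetic, sketched below. The function name `trimming_gain` and the counts in the usage example are hypothetical. The caption does not say which denominator the percentage uses, so dividing by the total number of OTUs in the data set is an assumption.

```python
def trimming_gain(n_classified_trimmed, n_classified_full, n_total_otus):
    """Return (absolute gain, percentage gain) from trimming the training set.

    Assumption: the percentage is expressed relative to all OTUs in the
    data set, not relative to the number classified with full-length
    reference sequences.
    """
    gain = n_classified_trimmed - n_classified_full
    percent_gain = 100.0 * gain / n_total_otus
    return gain, percent_gain


if __name__ == "__main__":
    # Hypothetical counts at a single taxonomic level and threshold.
    print(trimming_gain(450, 400, 1000))
```

A positive gain at every taxonomic level and confidence threshold corresponds to the caption's observation that all values in (b) and (c) are positive.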

Figure 4

Ability of either classified or unclassified OTUs alone to recapture the variance of the whole data set. OTUs were classified using the GG99 training set. Each plot represents the phylogenetic variation among all soil samples, calculated using all OTUs, along the x axis (UniFrac principal coordinate 1; PC1), correlated to either OTUs that were classified to the genus level (a) or to OTUs unclassified at the genus level (b), on the y axis. If the variance in the OTU subsets (classified or unclassified) explains as much variance as the whole set, then a straight diagonal line is expected. The summary of R² values for a similar analysis of each of the five sample types is shown for comparison (c). Error bars, and errors on R² values, represent the s.d. of 10 rarefactions, 200 sequences each. Soil sample data are shown in (a) and (b); the other four UniFrac data sets are available in Supplementary Figure S7.
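The statistical steps named in this caption (rarefaction to a fixed depth, R² between two sets of PC1 coordinates, and the spread across rarefactions) can be sketched as follows. This is an illustration only, not the authors' pipeline: the UniFrac distance calculation and principal coordinates analysis that produce PC1 are not shown. The function names are hypothetical, and using the sample standard deviation is an assumption.

```python
import random
import statistics


def rarefy(otu_counts, depth, rng):
    """Subsample `depth` sequences without replacement from one sample.

    otu_counts maps OTU identifiers to sequence counts.
    """
    pool = [otu for otu, n in otu_counts.items() for _ in range(n)]
    if len(pool) < depth:
        raise ValueError("sample has fewer sequences than the rarefaction depth")
    rarefied = {}
    for otu in rng.sample(pool, depth):
        rarefied[otu] = rarefied.get(otu, 0) + 1
    return rarefied


def r_squared(x, y):
    """Squared Pearson correlation between two coordinate vectors."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def summarize(r2_values):
    """Mean and sample s.d. of R² across rarefactions (assumed estimator)."""
    return statistics.fmean(r2_values), statistics.stdev(r2_values)


if __name__ == "__main__":
    rng = random.Random(0)
    sample = {"otu_a": 300, "otu_b": 150, "otu_c": 50}  # hypothetical counts
    print(sum(rarefy(sample, 200, rng).values()))        # depth is preserved
    # Hypothetical PC1 values: whole data set vs. one OTU subset.
    print(r_squared([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

In a full analysis, each of the 10 rarefactions would yield one R² value per OTU subset, and `summarize` would give the mean and error reported in panel (c).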
