Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform

Melanie Schirmer et al. Nucleic Acids Res. 2015.

Abstract

With read lengths of currently up to 2 × 300 bp, high throughput and low sequencing costs, Illumina's MiSeq is becoming one of the most utilized sequencing platforms worldwide. The platform is manageable and affordable even for smaller labs. This enables quick turnaround on a broad range of applications such as targeted gene sequencing, metagenomics, small genome sequencing and clinical molecular diagnostics. However, Illumina error profiles are still poorly understood and programs are therefore not designed for the idiosyncrasies of Illumina data. A better knowledge of the error patterns is essential for sequence analysis and vital if we are to draw valid conclusions. Studying true genetic variation in a population sample is fundamental for understanding diseases, evolution and origin. We conducted a large study on the error patterns for the MiSeq based on 16S rRNA amplicon sequencing data. We tested state-of-the-art library preparation methods for amplicon sequencing and showed that the library preparation method and the choice of primers are the most significant sources of bias and cause distinct error patterns. Furthermore, we tested the efficiency of various error correction strategies and identified quality trimming (Sickle) combined with error correction (BayesHammer) followed by read overlapping (PANDAseq) as the most successful approach, reducing substitution error rates on average by 93%.

© The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
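The best-performing strategy named in the abstract chains three existing tools: quality trimming with Sickle, error correction with BayesHammer (distributed as the read-error-correction stage of SPAdes), and read overlapping with PANDAseq. The following is a minimal sketch of such a pipeline, assuming the three tools are installed and on PATH; all file names, parameter values and output paths are illustrative assumptions, not settings taken from the paper.

```python
# Sketch of a Sickle -> BayesHammer (SPAdes) -> PANDAseq pipeline.
# File names and parameters are illustrative assumptions.
import subprocess

def run(cmd):
    """Run a command and fail loudly if it returns a non-zero exit code."""
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Quality trimming of paired-end reads with Sickle.
run(["sickle", "pe",
     "-f", "raw_R1.fastq", "-r", "raw_R2.fastq",
     "-t", "sanger",                      # FASTQ quality encoding
     "-o", "trimmed_R1.fastq", "-p", "trimmed_R2.fastq",
     "-s", "trimmed_singles.fastq"])

# 2. Error correction with BayesHammer via SPAdes' error-correction-only mode.
run(["spades.py", "--only-error-correction",
     "-1", "trimmed_R1.fastq", "-2", "trimmed_R2.fastq",
     "-o", "bayeshammer_out"])

# 3. Overlap the corrected read pairs with PANDAseq.
#    The corrected FASTQ paths below assume SPAdes' default output layout.
run(["pandaseq",
     "-f", "bayeshammer_out/corrected/trimmed_R1.fastq.00.0_0.cor.fastq.gz",
     "-r", "bayeshammer_out/corrected/trimmed_R2.fastq.00.0_0.cor.fastq.gz",
     "-w", "merged_reads.fasta"])
```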


Figures

Figure 1.

Nucleotide-specific substitution error profiles for data set DS 35: each graph shows the substitution rates for a specific original nucleotide and the colours indicate the substituting nucleotide. The first four graphs show the R1 profiles and the last four graphs show the R2 profiles.

Figure 2.

Error profiles for insertions, deletions and unknown nucleotides (Ns): the first three graphs show the R1 error profiles. For insertions the colour identifies the inserted nucleotide, and for deletions the colour refers to the type of nucleotide that was deleted. The lower three graphs display the corresponding error profiles for the R2 reads.

Figure 3.

Quality profiles for R1 and R2 reads: the boxplots in the first column display the distribution of quality scores for all reads. The second column shows the distribution of quality scores associated with errors and the last column shows the average quality score of substitution errors for each position across the read.

Figure 4.

The figure compares the theoretical accuracy (blue) of the quality scores to the actual accuracy (red) observed for data set DS 35.
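The theoretical accuracy in Figure 4 follows from the Phred convention: a quality score Q implies an expected per-base error probability of 10^(-Q/10). A small sketch of how theoretical and observed error rates can be compared is given below; the tallies are hypothetical placeholders (in practice they would come from aligning reads to the known reference sequences), not numbers from the paper.

```python
# Theoretical vs. observed accuracy of Phred quality scores (cf. Figure 4).
# A Phred score Q implies an expected error probability of 10^(-Q/10).
def phred_error_probability(q: int) -> float:
    """Expected per-base error probability implied by Phred score q."""
    return 10 ** (-q / 10)

# Hypothetical tallies: for each quality score, how many aligned bases carried
# that score and how many of those bases were sequencing errors.
observed = {
    10: {"bases": 50_000, "errors": 6_500},
    20: {"bases": 400_000, "errors": 9_000},
    30: {"bases": 2_000_000, "errors": 4_000},
    38: {"bases": 5_000_000, "errors": 1_500},
}

for q, counts in sorted(observed.items()):
    theoretical = phred_error_probability(q)
    actual = counts["errors"] / counts["bases"]
    print(f"Q{q:>2}: theoretical error rate {theoretical:.4%}, "
          f"observed error rate {actual:.4%}")
```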

Figure 5.

Comparison of error distributions for all data sets. We used the Hellinger distance to construct similarity matrices for the error distributions and summed over all types of substitutions, insertions and deletions, respectively. The colour indicates the library preparation method (see the legend) and the shape indicates different runs.
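The Hellinger distance used in Figure 5 is H(P, Q) = (1/√2) · sqrt(Σ_i (√p_i − √q_i)²) for two discrete distributions P and Q. A minimal sketch of that computation follows; the two example distributions are made up for illustration, whereas in the paper the distributions are position-wise error rates summed over error types.

```python
# Hellinger distance between two discrete error distributions (cf. Figure 5).
# H(P, Q) = (1/sqrt(2)) * sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2)
import math

def hellinger(p, q):
    """Hellinger distance between two probability distributions of equal length."""
    assert len(p) == len(q)
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(p, q))) / math.sqrt(2)

# Two hypothetical substitution-error distributions over four outcomes.
dataset_a = [0.70, 0.15, 0.10, 0.05]
dataset_b = [0.60, 0.20, 0.15, 0.05]
print(f"Hellinger distance: {hellinger(dataset_a, dataset_b):.4f}")
```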

Figure 6.

We compared the overall error rates for each data set. The lower x-axis indicates the name of the data set and the upper x-axis specifies the library preparation method. The error bars show the extent to which each original nucleotide contributed to the overall error rate. Data sets are grouped by library preparation (solid lines) and primers (dashed lines).

Figure 7.

We recorded all 3-mers preceding substitution, insertion or deletion errors, respectively. The first column displays the three most common motifs for each data set and the second column illustrates the percentage of errors that were associated with the respective motif. The solid lines separate the data sets according to library preparation methods and the dashed lines further divide them according to the different forward primers that were used.
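A small sketch of how the 3-mers immediately preceding error positions could be tallied is shown below; the reads and error positions are hypothetical placeholders (in practice the error positions would come from the alignments against the reference sequences).

```python
# Sketch of tallying the 3-mers (motifs) that precede error positions (cf. Figure 7).
from collections import Counter

def preceding_kmers(read: str, error_positions, k: int = 3):
    """Yield the k-mer immediately preceding each error position in a read."""
    for pos in error_positions:
        if pos >= k:                      # need k bases upstream of the error
            yield read[pos - k:pos]

# Hypothetical input: (read sequence, 0-based positions of substitution errors).
reads_with_errors = [
    ("ACGGCGGTAC", [4, 8]),
    ("TTGGCGCATG", [4]),
    ("ACGTGGGATC", [7]),
]

motif_counts = Counter()
for read, errors in reads_with_errors:
    motif_counts.update(preceding_kmers(read, errors))

# The three most common motifs preceding an error, as in Figure 7's first column.
print(motif_counts.most_common(3))
```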

Figure 8.

Overview of the 50th and 75th percentiles of quality scores associated with errors across all data sets. The results for the R1 reads are displayed on the left and the results for the R2 reads are on the right. Data sets were grouped by library preparation method (DI = dual index; SI = single index; FG = Fusion Golay; XT = NexteraXT), and substitution, insertion and deletion errors are displayed separately. Note that for none of the single-index data sets did enough R1 reads align to construct meaningful quality profiles (threshold = 1000 reads).

Figure 9.

Insertion and deletion rates of raw reads, after trimming up to 10 bp off the read start and after additionally trimming up to 10 bp off the read end. (Note: none of the R1 single index data sets contained ≥1000 reads after alignment.)

Figure 10.

The figure compares the error rates of the raw reads (R1+R2 rates) to different error correction approaches, including trimming plus BayesHammer, overlapping reads with PANDAseq and overlapping reads with PEAR. We only included data sets for which at least 1000 reads aligned for all methods. Data sets not included: 19–26, 52+53 (not enough raw R1 reads aligned); 39–45+47 (not enough raw R2 reads aligned).

Figure 11.

The figure compares the number of aligned reads relative to the initial number of raw reads. For the raw reads and for reads that were processed with Sickle plus BayesHammer, we summed the R1 and R2 rates. We also included trimming plus BayesHammer and overlapping with PANDAseq and PEAR, respectively, as those combinations of approaches returned the lowest error rates. Data sets are grouped by library preparation (solid lines) and primers (dashed lines).

Figure 12.

The figure shows the range of average error rates for the different library preparation methods (indicated on the upper x-axis). The grey bar plots show the error rates for the raw reads; in red are the error rates after trimming and error correction; in blue and yellow are the error rates after additionally overlapping reads with PANDAseq and PEAR, respectively.
