Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform
Melanie Schirmer et al. Nucleic Acids Res. 2015.
Abstract
With read lengths of currently up to 2 × 300 bp, high throughput and low sequencing costs, Illumina's MiSeq is becoming one of the most utilized sequencing platforms worldwide. The platform is manageable and affordable even for smaller labs. This enables quick turnaround on a broad range of applications such as targeted gene sequencing, metagenomics, small genome sequencing and clinical molecular diagnostics. However, Illumina error profiles are still poorly understood and programs are therefore not designed for the idiosyncrasies of Illumina data. A better knowledge of the error patterns is essential for sequence analysis and vital if we are to draw valid conclusions. Studying true genetic variation in a population sample is fundamental for understanding diseases, evolution and origin. We conducted a large study on the error patterns for the MiSeq based on 16S rRNA amplicon sequencing data. We tested state-of-the-art library preparation methods for amplicon sequencing and showed that the library preparation method and the choice of primers are the most significant sources of bias and cause distinct error patterns. Furthermore, we tested the efficiency of various error correction strategies and identified quality trimming (Sickle) combined with error correction (BayesHammer) followed by read overlapping (PANDAseq) as the most successful approach, reducing substitution error rates on average by 93%.
© The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Figures

Nucleotide-specific substitution error profiles for data set DS 35: each graph shows the substitution rates for a specific original nucleotide and the colours indicate the substituting nucleotide. The first four graphs show the R1 profiles and the last four graphs show the R2 profiles.
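A position-wise substitution profile of this kind can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline: the function name `substitution_profile` is hypothetical, and the input is assumed to be gap-free reference/read pairs of equal length that have already been aligned.

```python
from collections import defaultdict

def substitution_profile(pairs):
    """Per-position substitution rates keyed by (position, original, substitute).

    `pairs` is an iterable of (reference, read) strings that are assumed to be
    aligned, gap-free and of equal length (an assumption of this sketch).
    """
    totals = defaultdict(int)  # (position, original nucleotide) -> bases observed
    subs = defaultdict(int)    # (position, original, substitute) -> substitutions
    for ref, read in pairs:
        for pos, (orig, called) in enumerate(zip(ref, read)):
            if orig == "N" or called == "N":
                continue  # unknown nucleotides are profiled separately
            totals[(pos, orig)] += 1
            if orig != called:
                subs[(pos, orig, called)] += 1
    # Rate = substitutions of one type / all observations of that original base.
    return {key: count / totals[key[:2]] for key, count in subs.items()}

# Toy input (made up): the second read has a C->T substitution at position 1.
profile = substitution_profile([("ACGT", "ACGT"), ("ACGT", "ATGT")])
print(profile)
```

Running the sketch separately on R1 and R2 reads would give one set of rates per read direction, which is how the graphs above are split.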

Error profiles for insertions, deletions and unknown nucleotides (Ns): the first three graphs show the R1 error profiles. For insertions the colour identifies the inserted nucleotide and for deletions the colour refers to the type of nucleotide that was deleted. The lower three graphs display the error profiles for the R2 reads, respectively.

Quality profiles for R1 and R2 reads: the boxplots in the first column display the distribution of quality scores for all reads. The second column shows the distribution of quality scores associated with errors and the last column shows the average quality score of substitution errors for each position across the read.

The figure compares the theoretical accuracy (blue) of the quality scores to the actual accuracy (red) observed for data set DS 35.
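The theoretical accuracy of a quality score follows from the standard Phred definition, Q = -10 log10(p), where p is the probability that the base call is wrong. The sketch below compares that theoretical value with an accuracy computed from error counts. The helper names and the toy counts are assumptions made for illustration and are not taken from the paper.

```python
import math

def phred_to_error_prob(q):
    """Error probability implied by Phred quality score q (p = 10^(-q/10))."""
    return 10 ** (-q / 10)

def empirical_quality(n_errors, n_bases):
    """Phred-scaled quality implied by observed error counts (hypothetical helper)."""
    if n_errors == 0:
        return float("inf")
    return -10 * math.log10(n_errors / n_bases)

# Toy counts (made up): reported Q -> (bases called, errors observed).
observed = {20: (100_000, 1_500), 30: (500_000, 900)}
for q, (n_bases, n_errors) in observed.items():
    theoretical_accuracy = 1 - phred_to_error_prob(q)
    actual_accuracy = 1 - n_errors / n_bases
    print(q, round(theoretical_accuracy, 4), round(actual_accuracy, 4))
```

When the actual accuracy falls below the theoretical one for a given score, the reported quality is overestimating how reliable the base call is.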

Comparison of error distributions for all data sets. We used the Hellinger distance to construct similarity matrices for the error distributions and summed over all types of substitutions, insertions and deletions, respectively. The colour indicates the library preparation method (see the legend) and the shape indicates different runs.
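For reference, the Hellinger distance between two discrete distributions P and Q is H(P, Q) = sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2) / sqrt(2), which ranges from 0 (identical) to 1 (no overlap). The following is a minimal sketch of that formula only. How the paper bins the errors and sums over error types is not reproduced here, and the input vectors are assumed to be non-negative counts over the same categories.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two count vectors over the same categories.

    Both vectors are normalised to sum to 1 before the distance is computed.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of categories")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("each distribution needs a positive total")
    squared = sum(
        (math.sqrt(a / total_p) - math.sqrt(b / total_q)) ** 2
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / 2)

# Toy error counts per read position for two data sets (made up).
print(hellinger([5, 10, 40], [6, 9, 45]))
```

Computing this for every pair of data sets gives a symmetric matrix that can then be passed to an ordination or clustering method.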

We compared the overall error rates for each data set. The lower x-axis indicates the name of the data set and the upper x-axis specifies the library preparation method. The error bars show the extent to which each original nucleotide contributed to the overall error rate. Data sets are grouped by library preparation (solid lines) and primers (dashed lines).

We recorded all 3-mers preceding substitution, insertion or deletion errors, respectively. The first column displays the three most common motifs for each data set and the second column illustrates the percentage of errors that were associated with the respective motif. The solid lines separate the data sets according to library preparation methods and the dashed lines further divide them according to the different forward primers that were used.
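Counting the motifs that precede errors can be sketched as below. This is an illustration under assumptions: the function name `preceding_kmers` is hypothetical, error positions are taken as 0-based indices into a single reference sequence, and the 3-mer is read from the reference (the caption does not say whether the read or the reference was used).

```python
from collections import Counter

def preceding_kmers(reference, error_positions, k=3):
    """Count the k-mers immediately upstream of each error position.

    Positions with fewer than k preceding bases are skipped.
    """
    counts = Counter()
    for pos in error_positions:
        if pos >= k:
            counts[reference[pos - k:pos]] += 1
    return counts

# Toy example (made up): errors at positions 1, 5 and 11 of a short reference.
counts = preceding_kmers("ACGGGTACGGGA", [1, 5, 11])
total = sum(counts.values())
for motif, n in counts.most_common(3):
    print(motif, f"{100 * n / total:.1f}%")
```

The percentage printed per motif corresponds to the share of errors associated with that motif, as shown in the second column of the figure.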

Overview of the 50th and 75th percentiles of quality scores associated with errors across all data sets. The results for the R1 reads are displayed on the left and the results for the R2 reads are on the right. Data sets were grouped by library preparation method (DI = dual index; SI = single index; FG = Fusion Golay; XT = NexteraXT) and substitution, insertion and deletion errors are displayed separately. Note that none of the single index data sets had enough aligned R1 reads to construct meaningful quality profiles (threshold = 1000 reads).

Insertion and deletion rates of raw reads, after trimming up to 10 bp off the read start and after additionally trimming up to 10 bp off the read end. (Note: none of the R1 single index data sets contained ≥1000 reads after alignment.)

The figure compares the error rates of the raw reads (R1+R2 rates) to different error correction approaches including Trimming+BayesHammer, overlapping reads with PANDAseq and overlapping reads with PEAR. We only included data sets for which at least 1000 reads aligned for all methods. Data sets not included: 19–26, 52+53 (not enough raw R1 reads aligned), 39–45+47 (not enough raw R2 reads aligned).

The figure compares the number of aligned reads relative to the initial number of raw reads. For the raw reads and for reads that were processed with Sickle plus BayesHammer, we summed the R1 and R2 rates. We also included trimming plus BayesHammer and overlapping with PANDAseq and PEAR, respectively, as those combinations of approaches returned the lowest error rates. Data sets are grouped by library preparation (solid line) and primers (dashed line).

The figure shows the range of average error rates for the different library preparation methods (indicated on the upper x-axis). The grey bar plots show the error rates for the raw reads, in red are the error rates after trimming and error correction, in blue and yellow are the error rates after additionally overlapping reads with PANDAseq and PEAR, respectively.