Fast and Accurate Protein False Discovery Rates on Large-Scale Proteomics Data Sets with Percolator 3.0
Matthew The et al. J Am Soc Mass Spectrom. 2016 Nov.
Abstract
Percolator is a widely used software tool that increases yield in shotgun proteomics experiments and assigns reliable statistical confidence measures, such as q values and posterior error probabilities, to peptides and peptide-spectrum matches (PSMs) from such experiments. Percolator's processing speed has been sufficient for typical data sets consisting of hundreds of thousands of PSMs. With our new scalable approach, we can now also analyze millions of PSMs in a matter of minutes on a commodity computer. Furthermore, with the increasing awareness of the need for reliable statistics on the protein level, we compared several easy-to-understand protein inference methods and implemented the best-performing method in the Percolator package: grouping proteins by their corresponding sets of theoretical peptides and then considering only the best-scoring peptide for each protein. We used Percolator 3.0 to analyze the data from a recent study of the draft human proteome containing 25 million spectra (PMID: 24870542). The source code and Ubuntu, Windows, MacOS, and Fedora binary packages are available from http://percolator.ms/ under an Apache 2.0 license.
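The protein inference idea described in the abstract can be illustrated with a minimal sketch. This is not Percolator's implementation; the function names, the input dictionaries, and the convention that a higher peptide score is better are all assumptions made for illustration only.

```python
from collections import defaultdict


def group_proteins(protein_to_peptides):
    """Group proteins whose sets of theoretical peptides are identical.

    protein_to_peptides: dict mapping a protein ID to an iterable of its
    theoretical peptides (hypothetical input format).
    Returns a dict mapping a sorted tuple of protein IDs to the shared
    peptide set.
    """
    by_peptide_set = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        by_peptide_set[frozenset(peptides)].append(protein)
    return {tuple(sorted(members)): peptides
            for peptides, members in by_peptide_set.items()}


def best_peptide_scores(groups, peptide_scores):
    """Score each protein group by its single best-scoring observed peptide.

    peptide_scores: dict mapping an observed peptide to its score
    (assumption: higher is better). Groups without any observed peptide
    are left out of the result.
    """
    scored = {}
    for members, peptides in groups.items():
        observed = [peptide_scores[p] for p in peptides if p in peptide_scores]
        if observed:
            scored[members] = max(observed)
    return scored


# Toy example: proteins A and B are indistinguishable by their peptides.
proteins = {
    "A": ["PEPTIDEK", "SAMPLER"],
    "B": ["SAMPLER", "PEPTIDEK"],
    "C": ["UNIQUEK"],
    "D": ["NEVERSEENK"],
}
scores = {"PEPTIDEK": 0.5, "SAMPLER": 2.0, "UNIQUEK": 1.0}

groups = group_proteins(proteins)
group_scores = best_peptide_scores(groups, scores)
print(group_scores)
```

In this toy input, A and B collapse into one group scored by their best peptide, and D is dropped because none of its peptides was observed.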
Keywords: Data processing and analysis; Large scale studies; Mass spectrometry - LC-MS/MS; Protein inference; Statistical analysis.
Figures

SVM training on downsampled data retains the performance achieved using the full data set. From the full Kim data set of 73 million target+decoy PSMs, we evaluated subset sizes of 100,000, 500,000, 1,000,000, and 5,000,000 PSMs to train the SVMs, repeating this for 10 randomized sets, and scored all 73 million PSMs using the resulting support vectors. The figure plots, as a function of data set size, the ratio of significant peptides (left) and PSMs (right) at a q value threshold of 0.01 over the same number when using the full training set of 73 million PSMs. The number of significant PSMs and unique peptides does not drop significantly, even for subsets of 100,000 PSMs.

Retaining shared peptides leads to poor calibration of the decoy model for all the tested protein inference methods. The figure plots the reported q values from the decoy model (the decoy FDR) against the fraction of entrapment proteins in the set of identified target proteins (the observed entrapment FDR), using a peptide-level FDR threshold of 10% (a), 5% (b), and 1% (c). Dotted lines correspond to y = 1.5x and y = 0.67x. For a peptide-level FDR threshold of 10%, all five methods produce anti-conservative FDR estimates, with Fisher’s method and product of PEPs achieving reasonable accuracy above 3% decoy FDR. For the stricter thresholds of 5% and 1%, the FDR estimates of those two methods are anti-conservative for very low FDRs, but quickly become conservative for higher FDRs. In comparison, the FDR estimates produced by Fido are better calibrated in the very low FDR range, but show rather erratic behavior by suddenly switching from conservative to anti-conservative estimates around 6% entrapment FDR.
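The two quantities compared in this caption can be sketched as follows. This is a simplified illustration, not the estimator used by Percolator: the decoy FDR here is a plain decoy-to-target count ratio at a score threshold, the entrapment FDR is the fraction of accepted targets that are entrapment proteins (as defined in the caption), and all function names and the toy data are hypothetical.

```python
def decoy_fdr(scores, is_decoy, threshold):
    """Simple target-decoy estimate: accepted decoys / accepted targets."""
    targets = sum(1 for s, d in zip(scores, is_decoy) if s >= threshold and not d)
    decoys = sum(1 for s, d in zip(scores, is_decoy) if s >= threshold and d)
    if targets == 0:
        return 0.0
    return min(1.0, decoys / targets)


def entrapment_fdr(scores, is_decoy, is_entrapment, threshold):
    """Fraction of accepted target proteins that are entrapment proteins."""
    accepted = [e for s, d, e in zip(scores, is_decoy, is_entrapment)
                if s >= threshold and not d]
    if not accepted:
        return 0.0
    return sum(accepted) / len(accepted)


def within_bands(estimated, observed, upper=1.5, lower=0.67):
    """Check whether the estimate lies between y = 0.67x and y = 1.5x."""
    return lower * observed <= estimated <= upper * observed


# Toy protein list (assumption: higher score is better).
scores = [5.0, 4.0, 3.0, 2.0, 1.0]
is_decoy = [False, False, False, True, False]
is_entrapment = [False, False, True, False, False]

est = decoy_fdr(scores, is_decoy, threshold=2.0)
obs = entrapment_fdr(scores, is_decoy, is_entrapment, threshold=2.0)
print(est, obs, within_bands(est, obs))
```

An estimate below the observed entrapment FDR would be anti-conservative, and one above it conservative; the bands mirror the dotted lines mentioned in the caption.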

Using only protein-unique peptides gives accurate estimates of the protein-level FDR. (a) The figure plots the decoy FDR against observed entrapment FDR. All four methods produce accurate FDR estimates. (b) A logarithmic plot of the region [0.001, 0.1] with the same axes as in (a).

Comparison of protein inference methods. (a) The figure plots the number of accepted protein groups against the observed entrapment FDR for the hm_yeast set. Fisher’s method, the product of peptide-level PEPs, and the best-scoring peptide approach all perform about equally, whereas Fido and the two-peptide rule are much less sensitive. We used a peptide-level threshold of 5% for Fido in this plot, but thresholds of 10% and 1% gave very similar results. (b) A plot of the number of accepted protein groups against the decoy FDR for the Wu data set. The product of peptide-level PEPs and the best-scoring peptide approach perform best, whereas the two-peptide rule and Fisher’s method inferred far fewer protein groups. (c, d) Same axes as in (b) but for the Kim data set, searched using the Swiss-Prot (c) and Swiss-Prot+TrEMBL (d) databases. The best-scoring peptide approach inferred the most protein groups for both databases.

Sample size dependence of protein inference methods. We plotted the average number of accepted protein groups at 1% protein-level FDR over triplicate random subsets of different sizes drawn from the Kim set matched to the Swiss-Prot database. The number of PSMs was reduced from the original 73 million PSMs by factors of two down to 18 K PSMs. Fisher’s method, the product of peptide-level PEPs, and the best-scoring peptide approach all perform about equally up to 200 K PSMs. Above this number of PSMs, the best-scoring peptide approach outperforms all the other methods.