Fast and Accurate Protein False Discovery Rates on Large-Scale Proteomics Data Sets with Percolator 3.0

Matthew The et al. J Am Soc Mass Spectrom. 2016 Nov.

Abstract

Percolator is a widely used software tool that increases yield in shotgun proteomics experiments and assigns reliable statistical confidence measures, such as q values and posterior error probabilities, to peptides and peptide-spectrum matches (PSMs) from such experiments. Percolator's processing speed has been sufficient for typical data sets consisting of hundreds of thousands of PSMs. With our new scalable approach, we can now also analyze millions of PSMs in a matter of minutes on a commodity computer. Furthermore, with the increasing awareness of the need for reliable statistics on the protein level, we compared several easy-to-understand protein inference methods and implemented the best-performing method (grouping proteins by their corresponding sets of theoretical peptides and then considering only the best-scoring peptide for each protein) in the Percolator package. We used Percolator 3.0 to analyze the data from a recent study of the draft human proteome containing 25 million spectra (PM:24870542). The source code and Ubuntu, Windows, MacOS, and Fedora binary packages are available from http://percolator.ms/ under an Apache 2.0 license.
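The protein inference method described in the abstract can be illustrated with a minimal Python sketch. It is not Percolator's implementation: the function names, the toy protein and peptide identifiers, and the convention that a higher peptide score is better are all assumptions made for illustration.

```python
from collections import defaultdict


def group_proteins(protein_to_peptides):
    """Group proteins that share an identical set of theoretical peptides."""
    groups = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        groups[frozenset(peptides)].append(protein)
    # Sort member names so that group identifiers are deterministic.
    return {tuple(sorted(members)): peps for peps, members in groups.items()}


def best_peptide_scores(protein_groups, peptide_scores):
    """Score each protein group by its best-scoring identified peptide."""
    scores = {}
    for group, peptides in protein_groups.items():
        observed = [peptide_scores[p] for p in peptides if p in peptide_scores]
        if observed:  # groups without any identified peptide get no score
            scores[group] = max(observed)
    return scores


# Hypothetical toy data: theoretical peptides per protein.
theoretical = {
    "P1": {"AAK", "CCR"},
    "P2": {"AAK", "CCR"},  # same peptide set as P1, so they form one group
    "P3": {"DDK"},
    "P4": {"EEK"},         # no identified peptide below
}
# Hypothetical peptide-level scores (assumed: higher is better).
peptide_scores = {"AAK": 1.2, "CCR": 3.4, "DDK": 0.7}

groups = group_proteins(theoretical)
group_scores = best_peptide_scores(groups, peptide_scores)
```

Here `P1` and `P2` collapse into a single group scored by `CCR`, while `P4` is left unscored because none of its peptides was identified.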

Keywords: Data processing and analysis; Large scale studies; Mass spectrometry - LC-MS/MS; Protein inference; Statistical analysis.

Figures

Figure 1

SVM training on downsampled data retains the performance achieved using the full data set. From the full Kim data set of 73 million target+decoy PSMs, we evaluated subset sizes of 100,000, 500,000, 1,000,000, and 5,000,000 PSMs to train the SVMs, repeating this for 10 randomized sets, and scored all 73 million PSMs using the resulting support vectors. The figure plots, as a function of training subset size, the ratio of the number of significant peptides (left) and PSMs (right) at a q value threshold of 0.01 to the corresponding number obtained with the full training set of 73 million PSMs. The number of significant PSMs and unique peptides does not drop significantly, even for subsets of 100,000 PSMs.
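The quantity plotted in Figure 1 depends on counting target PSMs at a q value threshold of 0.01. The sketch below shows one simple way to do that with a plain target-decoy estimate (decoys divided by targets above each score threshold). This is a simplified stand-in and not necessarily the estimator Percolator uses; the function names and toy scores are hypothetical, and score ties are not treated specially.

```python
def q_values(psms):
    """psms: list of (score, is_decoy) pairs. Returns q values in input order."""
    order = sorted(range(len(psms)), key=lambda i: psms[i][0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for i in order:
        if psms[i][1]:
            decoys += 1
        else:
            targets += 1
        # Simple FDR estimate at this score threshold: decoys / targets.
        fdrs.append(decoys / max(targets, 1))
    # The q value is the minimum FDR over this and all more permissive thresholds.
    qs = [0.0] * len(psms)
    running_min = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[rank])
        qs[order[rank]] = min(running_min, 1.0)
    return qs


def count_significant(psms, threshold=0.01):
    """Count target PSMs whose q value is at or below the threshold."""
    return sum(
        1
        for (_, is_decoy), q in zip(psms, q_values(psms))
        if not is_decoy and q <= threshold
    )


# Hypothetical toy PSM list: (score, is_decoy).
toy_psms = [(5.0, False), (4.0, False), (3.0, False), (2.0, True), (1.0, False)]
toy_q = q_values(toy_psms)
n_significant = count_significant(toy_psms)
```

The ratio shown in the figure would then be `count_significant` for PSMs scored with a subset-trained model divided by the same count for the model trained on the full set.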

Figure 2

Retaining shared peptides leads to poor calibration of the decoy model for all the tested protein inference methods. The figure plots the q values reported by the decoy model (the decoy FDR) against the fraction of entrapment proteins in the set of identified target proteins (the observed entrapment FDR), using a peptide-level FDR threshold of 10% (a), 5% (b), and 1% (c). Dotted lines correspond to y = 1.5x and y = 0.67x. For a peptide-level FDR threshold of 10%, all five methods produce anti-conservative FDR estimates, with Fisher’s method and product of PEPs achieving reasonable accuracy above 3% decoy FDR. For the stricter thresholds of 5% and 1%, the FDR estimates of those two methods are anti-conservative for very low FDRs, but quickly become conservative for higher FDRs. In comparison, the FDR estimates produced by Fido are better calibrated in the very low FDR range, but show rather erratic behavior by suddenly switching from conservative to anti-conservative estimates around 6% entrapment FDR.
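The two quantities compared in this caption can be written down in a few lines. The sketch below computes the observed entrapment FDR as defined above and checks whether a point lies between the two dotted lines. The function names and protein identifiers are hypothetical, and the axis orientation (decoy FDR as y, entrapment FDR as x) is an assumption read from the phrase "plots ... against".

```python
def entrapment_fdr(accepted_targets, entrapment_ids):
    """Fraction of entrapment proteins among the identified target proteins."""
    if not accepted_targets:
        return 0.0
    n_entrapment = sum(1 for p in accepted_targets if p in entrapment_ids)
    return n_entrapment / len(accepted_targets)


def within_band(decoy_fdr, observed, lower=0.67, upper=1.5):
    """Check lower * x <= y <= upper * x, assuming y = decoy FDR, x = entrapment FDR."""
    return lower * observed <= decoy_fdr <= upper * observed


# Hypothetical identifiers: three sample proteins and one entrapment protein.
accepted = ["sample_1", "sample_2", "sample_3", "entrap_1"]
entrapment = {"entrap_1", "entrap_2"}

observed_fdr = entrapment_fdr(accepted, entrapment)
```

Under this reading, a reported decoy FDR well below the observed entrapment FDR corresponds to an anti-conservative estimate, and one well above it to a conservative estimate.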

Figure 3

Using only protein-unique peptides gives accurate estimates of the protein-level FDR. (a) The figure plots the decoy FDR against the observed entrapment FDR. All four methods produce accurate FDR estimates. (b) A logarithmic plot of the region [0.001, 0.1] with the same axes as in (a).
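Restricting the analysis to protein-unique peptides amounts to discarding every peptide that maps to more than one protein. A minimal sketch follows, assuming uniqueness is judged against individual proteins (the study may instead define it relative to protein groups); the function name and toy data are hypothetical.

```python
from collections import defaultdict


def protein_unique_peptides(protein_to_peptides):
    """Return {peptide: protein} for peptides that occur in exactly one protein."""
    owners = defaultdict(set)
    for protein, peptides in protein_to_peptides.items():
        for peptide in peptides:
            owners[peptide].add(protein)
    # Keep only peptides with a single owning protein; shared peptides are dropped.
    return {
        peptide: next(iter(proteins))
        for peptide, proteins in owners.items()
        if len(proteins) == 1
    }


# Hypothetical toy data: "CCR" is shared between P1 and P2.
toy_proteins = {"P1": {"AAK", "CCR"}, "P2": {"CCR", "DDK"}}
unique = protein_unique_peptides(toy_proteins)
```

In this toy example only `AAK` and `DDK` survive, so each remaining peptide is evidence for exactly one protein.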

Figure 4

Comparison of protein inference methods. (a) The figure plots the number of accepted protein groups against the observed entrapment FDR for the hm_yeast set. Fisher’s method, the product of peptide-level PEPs, and the best-scoring peptide approach all perform about equally, whereas Fido and the two-peptide rule are much less sensitive. We used a peptide-level threshold of 5% for Fido in this plot, but thresholds of 10% and 1% gave very similar results. (b) A plot of the number of accepted protein groups against the decoy FDR for the Wu data set. The product of peptide-level PEPs and the best-scoring peptide approach perform best, whereas the two-peptide rule and Fisher’s method inferred far fewer protein groups. (c, d) Same axes as in (b) but for the Kim data set, searched using the Swiss-Prot (c) and Swiss-Prot+TrEMBL (d) databases. The best-scoring peptide approach inferred the most protein groups for both databases.
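Three of the compared scoring rules have compact textbook forms, sketched below. Fisher's method combines k p values via the statistic X = -2 Σ ln p_i, which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. How the study derives the peptide-level inputs, and which threshold it uses for the two-peptide rule, is not stated in the caption, so the function names, inputs, and the default threshold here are assumptions.

```python
import math


def fisher_combined_p(p_values):
    """Combine p values (each > 0) with Fisher's method.

    Uses the closed-form chi-square survival function for even degrees of
    freedom: P(X > x) = exp(-x/2) * sum_{j<k} (x/2)^j / j!.
    """
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # equals x / 2
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return math.exp(-half) * total


def product_of_peps(peps):
    """Protein-level score as the product of peptide-level PEPs."""
    return math.prod(peps)


def passes_two_peptide_rule(peptide_q_values, threshold=0.01):
    """Accept a protein only if at least two peptides pass the threshold."""
    return sum(q <= threshold for q in peptide_q_values) >= 2


single = fisher_combined_p([0.05])     # one p value is returned unchanged
pair = fisher_combined_p([0.5, 0.5])   # 0.25 * (1 - ln 0.25)
```

The best-scoring peptide approach, by contrast, ignores all but the strongest peptide of each protein, so it needs no combination formula at all.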

Figure 5

Sample size dependence of protein inference methods. We plotted the average number of accepted protein groups at 1% protein-level FDR over triplicate random subsets, using different subset sizes of the Kim set matched to the Swiss-Prot database. The number of PSMs was reduced from the original 73 million PSMs by factors of two down to 18 K PSMs. Fisher’s method, the product of peptide-level PEPs, and the best-scoring peptide approach all perform about equally up to 200 K PSMs. Above this number of PSMs, the best-scoring peptide approach outperforms all the other methods.
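The subsampling scheme in this caption (repeated halving, with three random subsets per size) can be sketched as follows. The starting value of exactly 73,000,000 is a rounded figure taken from the caption, so the resulting sizes are illustrative only; the function names and the `evaluate` callback are hypothetical.

```python
import random


def halving_sizes(full_size, min_size):
    """Subset sizes obtained by halving full_size until it drops below min_size."""
    sizes = []
    size = full_size
    while size >= min_size:
        sizes.append(size)
        size //= 2
    return sizes


def average_over_replicates(items, size, evaluate, replicates=3, seed=0):
    """Average evaluate(subset) over several random subsets of the given size."""
    rng = random.Random(seed)
    results = [evaluate(rng.sample(items, size)) for _ in range(replicates)]
    return sum(results) / replicates


sizes = halving_sizes(73_000_000, 17_000)

# Toy usage: the stand-in evaluate function just returns the subset size.
toy_average = average_over_replicates(list(range(100)), 10, len)
```

Twelve halvings of 73 million give about 17.8 thousand PSMs, which is consistent with the lower end of roughly 18 K stated in the caption.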
