Proteogenomics from a bioinformatics angle: A growing field

Review

Mass Spectrom Rev. 2017 Sep;36(5):584-599.

doi: 10.1002/mas.21483. Epub 2015 Dec 15.

Gerben Menschaert et al. Mass Spectrom Rev. 2017 Sep.

Abstract

Proteogenomics is a research area that combines proteomics and genomics in a multi-omics setup using both mass spectrometry and high-throughput sequencing technologies. Currently, the main goals of the field are to aid genome annotation and to unravel proteome complexity. Mass spectrometry-based identification of matching or homologous peptides can further refine gene models. The identification of novel proteoforms is also made possible by detecting novel translation initiation sites (cognate or near-cognate), novel transcript isoforms, sequence variation, or novel (small) open reading frames in intergenic or untranslated genic regions through the analysis of high-throughput sequencing data from RNAseq or ribosome profiling experiments. Other proteogenomics studies that combine proteomics and genomics techniques focus on antibody sequencing or on the identification of immunogenic peptides or venom peptides. Over the years, a growing number of bioinformatics tools and databases have become available to help streamline these cross-omics studies. Some of these solutions only help in specific steps of a proteogenomics study, e.g. building custom sequence databases (based on next-generation sequencing output) for mass spectrometry fragmentation spectrum matching. Over the last few years, a handful of integrative tools have also become available that can execute complete proteogenomics analyses. Some of these are presented as stand-alone solutions, whereas others are implemented in a web-based framework such as Galaxy. In this review, we aimed to sketch a comprehensive overview of all the bioinformatics solutions that are available for this growing research area. © 2015 Wiley Periodicals, Inc. Mass Spec Rev 36:584-599, 2017.

Keywords: bioinformatics; gene annotation; mass spectrometry; next-generation sequencing; proteoform; proteogenomics; ribosome profiling.
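
As an illustration of the custom sequence database step mentioned in the abstract, the following is a minimal sketch of six-frame translation of a nucleotide sequence into candidate protein entries. It is not taken from any tool covered by the review. The function names, the `min_len` cutoff, and the choice to split translations at stop codons (stop-to-stop segments, using the standard genetic code) are assumptions made for this example.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; unknown codons (e.g. containing N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_segments(dna, min_len=7):
    """Yield (strand, frame_offset, peptide_sequence) for all six frames.

    Each frame is translated in full and split at stop codons; segments
    shorter than min_len (a hypothetical cutoff) are discarded to limit
    the size of the search database.
    """
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            for segment in translate(seq[offset:]).split("*"):
                if len(segment) >= min_len:
                    yield strand, offset, segment


def to_fasta(entries, prefix="sixframe"):
    """Format the segments as FASTA text with simple, made-up identifiers."""
    lines = []
    for i, (strand, offset, seq) in enumerate(entries, start=1):
        lines.append(f">{prefix}_{i} strand={strand} frame={offset}")
        lines.append(seq)
    return "\n".join(lines)


if __name__ == "__main__":
    example = "ATGGCCAAAGGGTTTCCCTGA"
    print(to_fasta(six_frame_segments(example, min_len=6)))
```

In practice, such a database grows very quickly with genome size, which is why filtering of the search space matters.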

Figures

Figure 1

Classes of peptides identified in proteogenomics. A. Proteogenomics peptide types can be divided according to the genomic region to which they map. The majority of enzymatically cleaved peptides map to coding genic locations (intragenic), whereas a small number map to non-coding RNAs, pseudogenes, or intergenic regions. Exceptionally, peptides can point to chimeric proteins (fusion products in, for example, oncoproteogenomics studies) or, when gene-fusion peptides are identified, to gene fusions. Within the intragenic subclass, the majority map to a single exon and a minority span exon splice sites (possibly leading to the identification of alternative splice isoforms). Proteogenomics can lead to the identification of novel peptides located in untranslated regions (5’ and 3’ UTR) or intronic regions, internal out-of-frame peptides, peptides that reside on the reverse strand, or single amino acid variant (SAV) peptides (introduced through genetic variation or RNA editing). Other novel findings can point to exon-intron junctions (cross-junction peptides). B. Another application of proteogenomics is the study of antibody or nanobody peptides in the highly variable regions. Here, sequencing of B-cells is combined with mass spectrometry of the blood antibodies or nanobodies after affinity selection. C. Venomics is another research field in which proteogenomics can be extremely useful. Here, combining RNAseq of the venom gland (of, for example, cone snails, spiders, snails) with matching mass spectrometry boosts the identification rate of the impressively diverse arsenal of toxin peptides, most of which carry multiple post-translational modifications.
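
The region-based division in panel A can be sketched in code. The example below is only an illustration under simplifying assumptions: the `GeneModel` structure, the label strings, and the half-open 0-based coordinates are hypothetical, only a single gene is considered, and the peptide is assumed to map as one contiguous genomic interval (so spliced peptides, frame, and strand-of-origin checks are not handled).

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    """Hypothetical, minimal gene model; coordinates are 0-based, half-open."""
    start: int       # transcript start on the genome
    end: int         # transcript end on the genome
    cds_start: int   # leftmost coding position
    cds_end: int     # one past the rightmost coding position
    exons: list      # sorted list of (start, end) tuples
    strand: str = "+"


def classify_peptide(pep_start, pep_end, gene):
    """Assign a contiguous genomic peptide interval to a region class."""
    if pep_end <= gene.start or pep_start >= gene.end:
        return "intergenic"
    if pep_start < gene.start or pep_end > gene.end:
        return "gene boundary"

    inside_exon = any(s <= pep_start and pep_end <= e for s, e in gene.exons)
    touches_exon = any(pep_start < e and s < pep_end for s, e in gene.exons)
    if not touches_exon:
        return "intronic"
    if not inside_exon:
        return "exon-intron junction"

    # Fully exonic: decide between UTR and coding sequence.
    if pep_end <= gene.cds_start:
        return "5'UTR" if gene.strand == "+" else "3'UTR"
    if pep_start >= gene.cds_end:
        return "3'UTR" if gene.strand == "+" else "5'UTR"
    if gene.cds_start <= pep_start and pep_end <= gene.cds_end:
        return "coding exon"
    return "UTR-CDS boundary"


if __name__ == "__main__":
    gene = GeneModel(start=100, end=1000, cds_start=150, cds_end=900,
                     exons=[(100, 300), (500, 1000)])
    for interval in [(10, 40), (110, 140), (200, 260), (280, 320), (350, 400)]:
        print(interval, classify_peptide(*interval, gene))
```

A real pipeline would take these coordinates from an annotation file and would also need to compare the reading frame of the peptide with the annotated frame to detect out-of-frame peptides.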

Figure 2

Comprehensive overview of a proteogenomics workflow. A typical proteogenomics strategy consists of several steps. Novel peptide identifications are mostly obtained by scanning comprehensive, custom-built protein sequence databases using database search engines. The search database creation step (1. DB Creation) is very important; the database can hold sequences from annotated protein repositories, translated genomic and/or RNA sequences, or processed sequences from sequence read archives. Furthermore, specific protein-related information can be incorporated into the search space. Routinely, the (very large) custom database is filtered, based on different criteria, to help manage its size. Fragmentation spectra can be obtained experimentally or downloaded from public repositories. The upper green box, connected to the arrow, gives an overview of the different proteogenomics tools reported to create such custom databases. (2. MS/MS data): PRIDE, PeptideAtlas, MassIVE, ProteomicsDB, Chorus, and CPTAC all hold MS/MS data from MS-based proteomics experiments on different cell/tissue types and species. (3. Peptide identification): As mentioned, identification is mostly achieved by database searching (multiple search algorithms are available). Other methods, so-called hybrid tag-based or de novo methods, have also been applied successfully in many proteogenomics experiments. (4. Validation & interpretation): After the identification step, validation and interpretation of the PSMs/peptides/proteins remain indispensable, using appropriate statistical significance estimation (FDR/PEP calculation). Further global annotation analysis based on gene ontology or protein interaction is also routinely performed. (5. Mapping & visualization): The vast amount of multi-omics data can be further mapped and visualized in a genome-centric way; many tools are available for both the mapping and the visualization step. Integrative visualization of protein interaction networks based on these cross-omics data is also possible using tools such as Circos and Cytoscape. The lower green box, connected to the arrow, gives an overview of reported pipelines or solutions that combine the different necessary steps in a complete proteogenomics workflow.
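
The statistical validation in step 4 is commonly based on a target-decoy strategy, although the caption itself does not name a specific method. The sketch below assumes that approach: each peptide-spectrum match (PSM) is a hypothetical `(score, is_decoy)` pair with higher scores being better, the FDR at a score threshold is estimated as decoys divided by targets, and q-values are made monotone by taking the minimum FDR over all more permissive thresholds. PEP estimation is not covered.

```python
def target_decoy_q_values(psms):
    """Estimate q-values for PSMs given as (score, is_decoy) pairs.

    Returns a list of (score, is_decoy, q_value) sorted by descending score.
    The simple estimate FDR = decoys / targets is an assumption of this
    sketch; other estimators exist.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)

    # Running FDR estimate when accepting everything down to each PSM.
    targets = decoys = 0
    fdrs = []
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: smallest FDR at this threshold or any more permissive one.
    q_values = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()

    return [(score, is_decoy, q)
            for (score, is_decoy), q in zip(ranked, q_values)]


def accepted_targets(psms, threshold=0.01):
    """Return scores of target PSMs whose q-value is at or below threshold."""
    return [score for score, is_decoy, q in target_decoy_q_values(psms)
            if not is_decoy and q <= threshold]


if __name__ == "__main__":
    example = [(9.0, False), (8.0, False), (7.0, True),
               (6.0, False), (5.0, False), (4.0, True)]
    for row in target_decoy_q_values(example):
        print(row)
    print(accepted_targets(example))
```

For proteogenomics searches against very large custom databases, novel peptides are often validated more stringently than known ones, so a single global threshold as used here should be read as a simplification.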
