Detecting sequence signals in targeting peptides using deep learning

  • Tue Jan 01 2019

Jose Juan Almagro Armenteros et al. Life Sci Alliance. 2019.

Abstract

In bioinformatics, machine learning methods have been used to predict features embedded in the sequences. In contrast to what is generally assumed, machine learning approaches can also provide new insights into the underlying biology. Here, we demonstrate this by presenting TargetP 2.0, a novel state-of-the-art method to identify N-terminal sorting signals, which direct proteins to the secretory pathway, mitochondria, and chloroplasts or other plastids. By examining the strongest signals from the attention layer in the network, we find that the second residue in the protein, that is, the one following the initial methionine, has a strong influence on the classification. We observe that two-thirds of chloroplast and thylakoid transit peptides have an alanine in position 2, compared with 20% in other plant proteins. We also note that in fungi and single-celled eukaryotes, less than 30% of the targeting peptides have an amino acid that allows the removal of the N-terminal methionine compared with 60% for the proteins without targeting peptide. The importance of this feature for predictions has not been highlighted before.

© 2019 Armenteros et al.

Conflict of interest statement

The authors declare that they have no conflict of interest.

Figures

Figure S1. Representation of the attention weights for a few randomly chosen proteins.

The height of the letter represents the attention weight in that position and the letter the type of amino acid. The shaded area corresponds to the predicted targeting peptide (SP, mTP, cTP, or luTP).

Figure 1. This figure depicts the frequencies of the second residue in proteins with different targeting peptides.

The proteins are divided by type of targeting peptide: signal peptides (SPs), mitochondrial transit peptides (mTPs), chloroplast transit peptides (cTPs), luminal transit peptides (luTPs), and proteins without a targeting peptide (noTPs). They are further divided by kingdom: Viridiplantae (P), Metazoa (M), Fungi (F), and other eukaryotic organisms (O). Inspired by sequence LOGOs, the height of each letter corresponds to the frequency of that amino acid. Only the frequencies for the short side-chained amino acids that allow the cleavage of the N-terminal methionine are shown.
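
The grouping described in this caption can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the record layout, and the toy sequences are hypothetical, and the residue set `ACGPSTV` is an assumption based on the small-side-chain residues commonly reported to permit N-terminal methionine removal, which may differ from the exact set used in the paper.

```python
from collections import Counter, defaultdict

# Assumption: small-side-chain residues commonly reported to permit removal
# of the initiator methionine. The paper's exact set may differ.
NME_PERMISSIVE = frozenset("ACGPSTV")


def second_residue_frequencies(records):
    """Return {(peptide_type, kingdom): {residue: frequency}} for position 2.

    `records` is an iterable of (peptide_type, kingdom, sequence) tuples,
    a hypothetical layout chosen for this sketch.
    """
    counts = defaultdict(Counter)
    for peptide_type, kingdom, seq in records:
        if len(seq) < 2:
            continue  # no second residue to count
        counts[(peptide_type, kingdom)][seq[1].upper()] += 1

    freqs = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        freqs[group] = {aa: n / total for aa, n in counter.items()}
    return freqs


def permissive_fraction(freq):
    """Sum the frequencies of residues in the assumed permissive set."""
    return sum(p for aa, p in freq.items() if aa in NME_PERMISSIVE)


# Toy, made-up sequences purely to exercise the functions.
toy = [
    ("cTP", "P", "MASSMLSS"),
    ("cTP", "P", "MAAT"),
    ("cTP", "P", "MKLV"),
    ("noTP", "P", "MSEQ"),
    ("noTP", "P", "MKDE"),
]
freqs = second_residue_frequencies(toy)
ctp_alanine = freqs[("cTP", "P")].get("A", 0.0)
notp_permissive = permissive_fraction(freqs[("noTP", "P")])
```

With real data, `records` would come from an annotated sequence set, and the per-group dictionaries would give the letter heights plotted for each targeting-peptide type and kingdom.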

Figure 2. The TargetP 2.0 architecture.

Figure 3. Receiver operator curves for identification of SPs, mitochondrial-, chloroplast-, and luminal transit peptides.

Figure 4. Recall (or accuracy) for the cleavage site (CS) prediction in SPs, mTPs, cTPs, and luTPs by the different prediction methods. Note that not all methods can predict all types of targeting peptides.

Figure 5. Attention layer LOGOs showing the impact strength of the attention layer and the frequency of amino acids. All sequences are aligned at the predicted CS.

Figure 6. Attention layer LOGOs showing the impact strength of the attention layer and the frequency of amino acids. All sequences are aligned at the N terminus.

Figure 7. Sequence LOGOs showing the amino acid frequencies in the pre-sequences.

All sequences are aligned according to the predicted CS.

Figure S2. Sequence LOGOs showing the experimental amino-terminal pre-sequences.

Sequences are aligned according to the annotated CS.

Figure S3. Distribution of the distance from true and predicted CSs to the nearest arginine in mTPs.

Figure 8. Sequence LOGOs showing the amino-terminal pre-sequences.

All sequences are aligned at the N terminus.

Figure S4. Sequence LOGOs showing the experimental amino-terminal pre-sequences.

Sequences are aligned at the N terminus.

Figure S5. Log odds ratio of secondary structure preferences for the different targeting peptides.

Upper two rows show the peptides aligned at the N terminus and the lower two rows show the peptides aligned at the CS.
