Natively unstructured loops differ from other loops
Avner Schlessinger et al. PLoS Comput Biol. 2007 Jul.
Abstract
Natively unstructured or disordered protein regions may increase the functional complexity of an organism; they are particularly abundant in eukaryotes and often evade structure determination. Many computational methods predict unstructured regions by training on outliers in otherwise well-ordered structures. Here, we introduce an approach that uses a neural network in a very different and novel way. We hypothesize that very long contiguous segments with nonregular secondary structure (NORS regions) differ significantly from regular, well-structured loops, and that a method detecting such features could predict natively unstructured regions. Training our new method, NORSnet, on predicted information rather than on experimental data yielded three major advantages: it removed the overlap between testing and training, it systematically covered entire proteomes, and it explicitly focused on one particular aspect of unstructured regions with a simple structural interpretation, namely that they are loops. Our hypothesis was correct: well-structured and unstructured loops differ so substantially that NORSnet succeeded in their distinction. Benchmarks on previously used and new experimental data of unstructured regions revealed that NORSnet performed very well. Although it was not the best single prediction method, NORSnet was sufficiently accurate to flag unstructured regions in proteins that were previously not annotated. In one application, NORSnet revealed previously undetected unstructured regions in putative targets for structural genomics and may thereby contribute to increasing structural coverage of large eukaryotic families. NORSnet found unstructured regions more often in domain boundaries than expected at random. In another application, we estimated that 50%-70% of all worm proteins observed to have more than seven protein-protein interaction partners have unstructured regions. 
The comparative analysis between NORSnet and DISOPRED2 suggested that long unstructured loops are a major part of unstructured regions in molecular networks.
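The notion of a "very long contiguous segment with nonregular secondary structure" can be illustrated with a small sketch. It is not the NORSnet method itself, which uses a neural network. It only scans a predicted three-state secondary structure string for long runs that are neither helix nor strand. The state letters (`H`, `E`, anything else counted as loop) and the `min_len` cutoff are assumptions made for illustration; the abstract does not state a length threshold.

```python
from typing import List, Tuple


def long_loop_segments(ss: str, min_len: int = 70) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index pairs of loop runs of at least min_len.

    Assumption: 'H' marks helix, 'E' marks strand, and every other
    character counts as loop. min_len is a hypothetical cutoff.
    """
    segments = []
    start = None  # start index of the loop run currently being scanned
    for i, state in enumerate(ss):
        is_loop = state not in ("H", "E")
        if is_loop and start is None:
            start = i
        elif not is_loop and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(ss) - start >= min_len:
        segments.append((start, len(ss)))
    return segments


# Toy string: 4 helix residues, 8 loop, 3 strand, 3 loop.
toy = "HHHH" + "L" * 8 + "EEE" + "L" * 3
print(long_loop_segments(toy, min_len=5))  # prints [(4, 12)]
```

With a cutoff of 5 on the toy string, only the 8-residue loop run qualifies; the trailing 3-residue run is too short.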
Conflict of interest statement
Competing interests. The authors have declared that no competing interests exist.
Figures
![Figure 1](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fae9/1933475/d29891b6ff91/pcbi.0030140.g001.gif)
Proteins with unstructured regions are likely to occupy large portions of sequence space [7,24,27,42] as sketched by the light-gray inner rectangle. The space of all proteins with unstructured regions is likely to be considerably larger than what today's experimental techniques capture. The rounded darker gray rectangle labeled experiment sketches proteins for which some experimental method annotated natively unstructured regions. While most NORS regions (predicted long loops, striped gray ellipse) are likely to be natively unstructured, many unstructured regions are not NORS; i.e., they contain helices and strands even in their native form. Previous methods for the prediction of unstructured regions (left lens) are optimized to reflect today's experiments. In contrast, the method introduced here (NORSnet, right lens) is developed based on predictions. This is an advantage because it avoids the bias of today's experimental techniques in a field that is just beginning to grasp its own dimensions, and it is a disadvantage because performance on today's datasets appears somewhat limited.
![Figure 2](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fae9/1933475/32f7323762b3/pcbi.0030140.g002.gif)
We compared the amino acid compositions between four different subsets representing four types of “loops” (nonhelix/nonstrand): loops from regular, well-ordered structures; i.e., from proteins without natively unstructured regions (states T, S, L from the Dictionary of Secondary Structure of Proteins; in blue); unstructured loops as predicted by NORSnet (in green); “flexible loops” from regular structures (TSL states with normalized B-factors ≥1 [82]; in red); and unstructured regions as predicted by DISOPRED2 (in orange). The sign of the bar corresponds to overrepresentation (positive) or underrepresentation (negative) of amino acids in a subset with respect to the PDB. The NORS and DISOPRED2 residue subsets were taken from the worm genome (from the IntAct database [67]) and were predicted to be unstructured by NORSnet and DISOPRED2. Flexible loops were enriched in amino acids with net charges such as lysine and glutamate (as described before [16,39]). Predicted unstructured regions by NORSnet, however, differed in their composition from regular loops, flexible loops, and from any type of disorder that has been described previously (unpublished data) [39,44]. Cysteines were not overabundant in the unstructured regions predicted by DISOPRED2. Overall, these data suggested that NORSnet captured something other than just “loop” and other than what is captured by methods such as DISOPRED2.
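The comparison described in this caption, over- or underrepresentation of each amino acid in a subset relative to a background, can be sketched as follows. The caption does not state the exact measure, so this sketch assumes a plain difference of relative frequencies; the function names and the toy sequences are hypothetical and are not data from the paper.

```python
from collections import Counter
from typing import Dict, Iterable

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(residues: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each of the 20 standard amino acids."""
    counts = Counter(r for r in residues if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids in input")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def composition_difference(subset: Iterable[str],
                           background: Iterable[str]) -> Dict[str, float]:
    """Positive values: overrepresented in subset; negative: underrepresented.

    Assumption: the measure is a simple frequency difference.
    """
    sub = composition(subset)
    bg = composition(background)
    return {aa: sub[aa] - bg[aa] for aa in AMINO_ACIDS}


# Toy residue sets, invented for illustration only.
diff = composition_difference("KKEEGS", "KEGSAVLLIF")
for aa in "KEL":
    print(aa, round(diff[aa], 3))
```

In the toy input, lysine and glutamate come out positive (enriched in the subset) and leucine negative. Because both compositions sum to one, the differences sum to zero.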
![Figure 3](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fae9/1933475/d5feea5f28ba/pcbi.0030140.g003.gif)
(A) ROC-like curve for NORSnet (green), DISOPRED2 (orange), and their combination (through arithmetic average; gray). While the performance of NORSnet and DISOPRED2 was similar, the combined method seemed to outperform both methods. Particularly, at accuracy = 100% (inset), the combined method covers significantly more sequences than each one of the methods individually. IUPred (purple) outperformed all other methods on this dataset. Note that IUPred was optimized on a set similar to the one used in this study. In contrast, NORSnet and DISOPRED2 were optimized on different sets defining disorder differently. (B) Venn diagram of overlap between very accurate predictions by NORSnet, DISOPRED2, and the combined method. The numbers in the circles are mutually exclusive; for instance, two proteins were identified only by DISOPRED2 to have an unstructured region, and 17 proteins were identified by both NORSnet and by the combined method to have an unstructured region.
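The combination "through arithmetic average" mentioned in panel (A) can be sketched in a few lines. This assumes both predictors emit one score per residue on a comparable scale, with higher values meaning more likely unstructured; the caption does not say whether any rescaling was applied, and the score values below are invented.

```python
from typing import List, Sequence


def combine_by_average(scores_a: Sequence[float],
                       scores_b: Sequence[float]) -> List[float]:
    """Per-residue arithmetic mean of two predictors' scores.

    Assumption: both score lists refer to the same residues in the same
    order and share a comparable scale.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must cover the same residues")
    return [(a + b) / 2.0 for a, b in zip(scores_a, scores_b)]


# Hypothetical per-residue scores for a three-residue toy protein.
norsnet_scores = [0.9, 0.2, 0.6]
disopred2_scores = [0.7, 0.4, 0.2]
combined = combine_by_average(norsnet_scores, disopred2_scores)
print(combined)
```

Averaging keeps the combined score on the same scale as its inputs, so a single reliability threshold can be applied to it in the same way as to either individual method.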
![Figure 4](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fae9/1933475/c70276b8209d/pcbi.0030140.g004.gif)
(A) The NESG set contains many proteins with unstructured regions that are not in DisProt and have never been used for method optimization. We compared NORSnet (in green), DISOPRED2 (in orange), their combined method (in gray), and IUPred (in purple) on these proteins. While DISOPRED2 performed better than all other methods in the low accuracy/high coverage region (top left), the combined method, NORSnet, and IUPred individually excelled in the high accuracy/low coverage region (lower right). (B) Venn diagram of overlap between very accurate predictions by NORSnet, DISOPRED2, and IUPred. The numbers in the circles are mutually exclusive. Note that five proteins were identified only by NORSnet to have an unstructured region.
![Figure 5](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fae9/1933475/513006a0d9c6/pcbi.0030140.g005.gif)
Kappa-casein precursor has been shown to be unstructured by different experiments [50]. Despite its low content in predicted helices and strands, not all prediction methods identify it as unstructured. We compared outputs of DISOPRED2 (A), NORSnet (B), and FoldIndex (C) for this protein. For DISOPRED2 and NORSnet, higher values indicate unstructured regions; for FoldIndex, low values indicate unstructured regions (red). Note that FoldIndex and DISOPRED2 do not use any explicit information about secondary structure. DISOPRED2 disorder probability, however, is somewhat correlated with coil predictions (Figure S1). DISOPRED2 was not able to distinguish these loops from structured loops. Only NORSnet clearly picked up the strong signal for unstructured regions for most of the protein.
![Figure 6](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fae9/1933475/0b3fe7f22107/pcbi.0030140.g006.gif)
DFF45 (white, yellow, and red) becomes structured upon complex formation with DFF40 (purple; [55]). The interface includes a buried hydrophobic patch surrounded by hydrophilic interactions. Usually, charged residues disrupt the formation of tertiary structure; in this case, however, when the complex is formed, the negative charge of the Asp groups in DFF45 is cancelled out by the positive charges of DFF40, allowing the protein to fold. Visualization was done using GRASP2 [85]. Since DFF45 has high secondary structure content, it is a relatively hard target for NORSnet prediction. However, NORSnet correctly identified its unstructured region at a rather stringent cutoff.
![Figure 7](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fae9/1933475/74ed1b24f57f/pcbi.0030140.g007.gif)
We ran both NORSnet and DISOPRED2 on worm proteins that are involved in protein–protein interactions (as identified by yeast two-hybrid [66]). The number of proteins that are predicted to be either unstructured or well-structured is plotted against the number of interacting partners for two different thresholds of reliability of the two methods: (A) and (B) were compiled for thresholds at which both methods maintained 100% accuracy for the NESG data (Figure 4), while (C) and (D) were compiled for 100% accuracy on DisProt (Figure 3). Since the number of observed interaction partners falls off dramatically, we had to group the data into bins of roughly equal sizes (x-axes). (A) and (C) show the results for the number of proteins predicted in each bin of interaction partners, while (B) and (D) show the normalized ratios to zoom into the difference between unstructured and structured proteins in each bin. These ratios were compiled as Ratio(bin) = {#unstructured(bin) / #structured(bin)} / {#unstructured(1) / #structured(1)}. As all ratios are greater than 1, proteins with more than one interaction partner have more unstructured regions than proteins with one partner. (A) These graphs were compiled with the reliability threshold at which each method achieved 100% accuracy on the NESG data (Figure 4). Overall, this threshold resulted in NORSnet (filled bars) predicting many more proteins with unstructured regions than DISOPRED2 (hatched bars). The difference was particularly relevant for proteins with more interacting partners. (B) NORSnet (filled, dark green) predicted many more unstructured regions in proteins with seven or more interaction partners than did DISOPRED2 (hatched, light green). (C) For the thresholds at which both methods achieved 100% accuracy on the DisProt dataset, DISOPRED2 identified more proteins with unstructured regions than did NORSnet. In contrast to the situation for the NESG set (A), the difference was not as significant for promiscuous proteins (ten or more partners). (D) Although NORSnet (filled, dark green) predicted as many unstructured as structured regions in hubs (seven or more), this ratio was significantly smaller than the one for proteins with a single interaction partner. In other words, even on this dataset NORSnet picked up a much stronger overrepresentation of unstructured regions in hubs than did DISOPRED2 (hatched, light green).
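The normalized ratio defined in this caption, Ratio(bin) = {#unstructured(bin) / #structured(bin)} / {#unstructured(1) / #structured(1)}, can be written out as a short sketch. The bin labels and all counts below are made up to show the arithmetic and are not taken from the paper; only the formula comes from the caption.

```python
from typing import Dict


def normalized_ratios(unstructured: Dict[str, int],
                      structured: Dict[str, int],
                      reference_bin: str = "1") -> Dict[str, float]:
    """Unstructured/structured ratio per bin, divided by the ratio of the
    reference bin (proteins with a single interaction partner)."""
    for label in unstructured:
        if structured[label] == 0:
            raise ValueError(f"bin {label!r} has no structured proteins")
    if unstructured[reference_bin] == 0:
        raise ValueError("reference bin has no unstructured proteins")
    reference = unstructured[reference_bin] / structured[reference_bin]
    return {
        label: (unstructured[label] / structured[label]) / reference
        for label in unstructured
    }


# Hypothetical counts per bin of interaction partners.
n_unstructured = {"1": 20, "2-6": 30, "7+": 15}
n_structured = {"1": 80, "2-6": 60, "7+": 15}
ratios = normalized_ratios(n_unstructured, n_structured)
print(ratios)  # prints {'1': 1.0, '2-6': 2.0, '7+': 4.0}
```

By construction the reference bin has ratio 1, so any bin with a value above 1 has relatively more predicted unstructured proteins than the single-partner bin.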
References
- Lesk AM. Introduction to protein architecture: The structural biology of proteins. Oxford: Oxford University Press; 2004. 347 p.
- Brändén C, Tooze J. Introduction to protein structure. New York: Garland; 1991. 302 p.
- Dyson HJ, Wright PE. Coupling of folding and binding for unstructured proteins. Curr Opin Struct Biol. 2002;12:54–60.
- Fuxreiter M, Simon I, Friedrich P, Tompa P. Preformed structural elements feature in partner recognition by intrinsically unstructured proteins. J Mol Biol. 2004;338:1015–1026.
- Dyson HJ, Wright PE. Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol. 2005;6:197–208.