Natively unstructured loops differ from other loops

Avner Schlessinger et al. PLoS Comput Biol. 2007 Jul.

Abstract

Natively unstructured or disordered protein regions may increase the functional complexity of an organism; they are particularly abundant in eukaryotes and often evade structure determination. Many computational methods predict unstructured regions by training on outliers in otherwise well-ordered structures. Here, we introduce an approach that uses a neural network in a very different and novel way. We hypothesize that very long contiguous segments with nonregular secondary structure (NORS regions) differ significantly from regular, well-structured loops, and that a method detecting such features could predict natively unstructured regions. Training our new method, NORSnet, on predicted information rather than on experimental data yielded three major advantages: it removed the overlap between testing and training, it systematically covered entire proteomes, and it explicitly focused on one particular aspect of unstructured regions with a simple structural interpretation, namely that they are loops. Our hypothesis was correct: well-structured and unstructured loops differ so substantially that NORSnet succeeded in their distinction. Benchmarks on previously used and new experimental data of unstructured regions revealed that NORSnet performed very well. Although it was not the best single prediction method, NORSnet was sufficiently accurate to flag unstructured regions in proteins that were previously not annotated. In one application, NORSnet revealed previously undetected unstructured regions in putative targets for structural genomics and may thereby contribute to increasing structural coverage of large eukaryotic families. NORSnet found unstructured regions more often in domain boundaries than expected at random. In another application, we estimated that 50%-70% of all worm proteins observed to have more than seven protein-protein interaction partners have unstructured regions. The comparative analysis between NORSnet and DISOPRED2 suggested that long unstructured loops are a major part of unstructured regions in molecular networks.
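
The notion of a NORS region in the abstract, a very long contiguous segment without regular secondary structure, can be illustrated with a minimal sketch that scans a per-residue secondary-structure string for long loop runs. The three-state encoding (`H`, `E`, anything else as loop), the function name `find_long_loops`, and the `min_len` cutoff are assumptions made for illustration; the abstract does not state the length cutoff, and NORSnet itself is a neural network rather than this simple rule.

```python
from typing import List, Tuple


def find_long_loops(ss: str, min_len: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index pairs of loop runs of length >= min_len.

    Assumed encoding: 'H' = helix, 'E' = strand, any other character = loop.
    """
    regions: List[Tuple[int, int]] = []
    start = None  # index where the current loop run began, if any
    for i, state in enumerate(ss):
        is_loop = state not in ("H", "E")
        if is_loop and start is None:
            start = i
        elif not is_loop and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    # Close a loop run that extends to the end of the sequence.
    if start is not None and len(ss) - start >= min_len:
        regions.append((start, len(ss)))
    return regions


# Toy string: 4 helix residues, 12 loop residues, 3 strand residues, 3 loop residues.
example = "HHHH" + "L" * 12 + "EEE" + "LLL"
long_loops = find_long_loops(example, min_len=10)
```

With the hypothetical cutoff of 10, only the 12-residue run qualifies; the short loop at the end is ignored.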

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

Figure 1. Putative “Map” of Unstructured Regions

Proteins with unstructured regions are likely to occupy large portions of sequence space [7,24,27,42] as sketched by the light-gray inner rectangle. The space of all proteins with unstructured regions is likely to be considerably larger than what today's experimental techniques capture. The rounded darker gray rectangle labeled experiment sketches proteins for which some experimental method annotated natively unstructured regions. While most NORS regions (predicted long loops, striped gray ellipse) are likely to be natively unstructured, many unstructured regions are not NORS; i.e., they contain helices and strands even in their native form. Previous methods for the prediction of unstructured regions (left lens) are optimized to somehow reflect today's experiments. In contrast, the method introduced here (NORSnet, right lens) is developed based on predictions. This is an advantage because it avoids the bias of today's experimental techniques in a field that is just beginning to grasp its own dimensions, and it is a disadvantage because performance on today's datasets appears somehow limited.

Figure 2. Regular, Flexible, and Predicted-To-Be Unstructured Loops Differed

We compared the amino acid compositions between four different subsets representing four types of “loops” (nonhelix/nonstrand): loops from regular, well-ordered structures; i.e., from proteins without natively unstructured regions (states T, S, L from the Dictionary of Secondary Structure of Proteins; in blue); unstructured loops as predicted by NORSnet (in green); “flexible loops” from regular structures (TSL states with normalized B-factors ≥1 [82]; in red); and unstructured regions as predicted by DISOPRED2 (in orange). The sign of the bar corresponds to overrepresentation (positive) or underrepresentation (negative) of amino acids in a subset with respect to the PDB. The NORS and DISOPRED2 residue subsets were taken from the worm genome (from the IntAct database [67]) and were predicted to be unstructured by NORSnet and DISOPRED2. Flexible loops were enriched in amino acids with net charges such as lysine and glutamate (as described before [16,39]). Predicted unstructured regions by NORSnet, however, differed in their composition from regular loops, flexible loops, and from any type of disorder that has been described previously (unpublished data) [39,44]. Cysteines were not overabundant in the unstructured regions predicted by DISOPRED2. Overall, these data suggested that NORSnet captured something other than just “loop” and other than what is captured by methods such as DISOPRED2.
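
The comparison in this caption, where the sign of each bar marks over- or underrepresentation of an amino acid relative to a background, can be sketched as a simple difference of residue frequencies. This is a minimal sketch under assumptions: the caption does not say whether the bars show differences, ratios, or log-ratios, so the plain difference used here, the function names, and the toy sequences are all hypothetical.

```python
from collections import Counter
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq: str) -> Dict[str, float]:
    """Fraction of each of the 20 standard amino acids in seq."""
    counts = Counter(aa for aa in seq.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def composition_difference(subset: str, background: str) -> Dict[str, float]:
    """Positive values: enriched in subset; negative values: depleted."""
    sub = composition(subset)
    bg = composition(background)
    return {aa: sub[aa] - bg[aa] for aa in AMINO_ACIDS}


# Toy data (invented): a charged "subset" against an alanine-rich "background".
diff = composition_difference("KKEEKKEE", "AKEAAKEA")
```

In this toy case lysine and glutamate come out positive and alanine negative, mirroring how the bar signs in the figure are read.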

Figure 3. Predictions for DisProt

(A) ROC-like curve for NORSnet (green), DISOPRED2 (orange), and their combination (through arithmetic average; gray). While the performances of NORSnet and DISOPRED2 were similar, the combined method seemed to outperform both. In particular, at accuracy = 100% (inset), the combined method covers significantly more sequences than either method individually. IUPred (purple) outperformed all other methods on this dataset. Note that IUPred was optimized on a set similar to the one used in this study. In contrast, NORSnet and DISOPRED2 were optimized on different sets defining disorder differently. (B) Venn diagram of overlap between very accurate predictions by NORSnet, DISOPRED2, and the combined method. The numbers in the circles are mutually exclusive; for instance, two proteins were identified only by DISOPRED2 to have an unstructured region, and 17 proteins were identified by both NORSnet and by the combined method to have an unstructured region.
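
The combination by arithmetic average mentioned for panel (A) can be sketched as a per-residue mean of two score lists. This is a minimal sketch that assumes both methods emit one score per residue on a comparable scale; the caption does not say whether the raw outputs were rescaled before averaging, and the function name and example values are hypothetical.

```python
from typing import List, Sequence


def combine_scores(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Per-residue arithmetic average of two equally long score lists."""
    if len(a) != len(b):
        raise ValueError("score lists must cover the same residues")
    return [(x + y) / 2.0 for x, y in zip(a, b)]


# Invented per-residue scores for a three-residue toy protein.
scores_method_1 = [0.9, 0.2, 0.6]
scores_method_2 = [0.7, 0.4, 0.2]
combined = combine_scores(scores_method_1, scores_method_2)
```

A residue is then called unstructured when its averaged score exceeds a chosen threshold, so the combined curve is traced by sweeping that threshold.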

Figure 4. Predictions for NESG Data

(A) The NESG set contains many proteins with unstructured regions that are not in DisProt and have never been used for method optimization. We compared NORSnet (in green), DISOPRED2 (in orange), their combined method (in gray), and IUPred (in purple) on these proteins. While DISOPRED2 performed better than all other methods in the low accuracy/high coverage region (top left), the combined method, NORSnet, and IUPred individually excelled in the high accuracy/low coverage region (lower right). (B) Venn diagram of overlap between very accurate predictions by NORSnet, DISOPRED2, and IUPred. The numbers in the circles are mutually exclusive. Note that five proteins were identified only by NORSnet to have an unstructured region.

Figure 5. Different Prediction Method Outputs for Kappa-Casein Precursor

Kappa-casein precursor has been shown to be unstructured by different experiments [50]. Despite its low content in predicted helices and strands, not all prediction methods identify it as unstructured. We compared outputs of DISOPRED2 (A), NORSnet (B), and FoldIndex (C) for this protein. For DISOPRED2 and NORSnet, higher values indicate unstructured regions; for FoldIndex, low values indicate unstructured regions (red). Note that FoldIndex and DISOPRED2 do not use any explicit information about secondary structure. DISOPRED2 disorder probability, however, is somewhat correlated with coil predictions (Figure S1). DISOPRED2 was not able to distinguish these loops from structured loops. Only NORSnet clearly picked up the strong signal for unstructured regions for most of the protein.

Figure 6. NORSnet Captured Unstructured Regions Related to High Net Charge/Low Hydrophobicity

DFF45 (white, yellow, and red) becomes structured upon complex formation with DFF40 (purple; [55]). The interface includes a buried hydrophobic patch surrounded by hydrophilic interactions. Usually, charged residues disrupt the formation of tertiary structure; in this case, however, when the complex is formed, the negative charge of the Asp groups in DFF45 is cancelled out by the positive charges of DFF40, allowing the protein to fold. Visualization was done using GRASP2 [85]. Since DFF45 has high secondary structure content, it is a relatively hard target for NORSnet prediction. However, NORSnet correctly identified its unstructured region at a rather stringent cutoff.

Figure 7. Unstructured Regions Overrepresented in Protein–Protein Hubs of the Worm

We ran both NORSnet and DISOPRED2 on worm proteins that are involved in protein–protein interactions (as identified by yeast two-hybrid [66]). The number of proteins that are predicted to be either unstructured or well-structured is plotted against the number of interacting partners for two different thresholds of reliability of the two methods: (A) and (B) were compiled for thresholds at which both methods maintained 100% accuracy for the NESG data (Figure 4), while (C) and (D) were compiled for 100% accuracy on DisProt (Figure 3). Since the number of observed interaction partners falls off dramatically, we had to group the data into bins of roughly equal sizes (x-axes). (A) and (C) show the results for the number of proteins predicted in each bin of interaction partners, while (B) and (D) show the normalized ratios to zoom into the difference between unstructured and structured proteins in each bin. These ratios were compiled as Ratio(bin) = {#unstructured(bin) / #structured(bin)} / {#unstructured(1) / #structured(1)}. As all ratios are greater than 1, proteins with more than one interaction partner have more unstructured regions than proteins with one partner. (A) These graphs were compiled with the reliability threshold at which each method achieved 100% accuracy on the NESG data (Figure 4). Overall, this threshold resulted in NORSnet (filled bars) predicting many more proteins with unstructured regions than DISOPRED2 (hatched bars). The difference was particularly relevant for proteins with more interacting partners. (B) NORSnet (filled, dark green) predicted many more unstructured regions in proteins with seven or more interaction partners than did DISOPRED2 (hatched, light green). (C) For the thresholds at which both methods achieved 100% accuracy on the DisProt dataset, DISOPRED2 identified more proteins with unstructured regions than did NORSnet. In contrast to the situation for the NESG set (A), the difference was not as significant for promiscuous proteins (ten or more partners). (D) Although NORSnet (filled, dark green) predicted as many unstructured as structured regions in hubs (seven or more), this ratio was significantly smaller than the one for proteins with a single interaction partner. In other words, even on this dataset NORSnet picked up a much stronger overrepresentation of unstructured regions in hubs than did DISOPRED2 (hatched, light green).
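
The normalized ratio defined in this caption, Ratio(bin) = {#unstructured(bin) / #structured(bin)} / {#unstructured(1) / #structured(1)}, can be worked through with a short sketch. The bin labels, the function name `normalized_ratios`, and all counts below are invented for illustration and are not the paper's data.

```python
from typing import Dict, Tuple


def normalized_ratios(
    counts: Dict[str, Tuple[int, int]], reference_bin: str = "1"
) -> Dict[str, float]:
    """Map each bin to (unstructured/structured) divided by the same ratio
    in the reference bin (proteins with one interaction partner).

    counts maps a bin label to (n_unstructured, n_structured).
    """
    ref_unstructured, ref_structured = counts[reference_bin]
    if ref_unstructured == 0 or ref_structured == 0:
        raise ValueError("reference bin needs nonzero counts in both classes")
    ref_ratio = ref_unstructured / ref_structured
    result = {}
    for label, (n_unstructured, n_structured) in counts.items():
        if n_structured == 0:
            raise ValueError(f"bin {label!r} has no structured proteins")
        result[label] = (n_unstructured / n_structured) / ref_ratio
    return result


# Toy counts (invented): (predicted unstructured, predicted structured) per bin.
toy_counts = {"1": (20, 80), "2-6": (15, 30), "7+": (10, 10)}
ratios = normalized_ratios(toy_counts)
```

By construction the reference bin has ratio 1, so any value above 1 means that bin holds proportionally more proteins predicted to be unstructured than the single-partner bin does.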

References

    1. Lesk AM. Introduction to protein architecture: The structural biology of proteins. Oxford: Oxford University Press; 2004. 347 p.
    2. Brändén C, Tooze J. Introduction to protein structure. New York: Garland; 1991. 302 p.
    3. Dyson HJ, Wright PE. Coupling of folding and binding for unstructured proteins. Curr Opin Struct Biol. 2002;12:54–60.
    4. Fuxreiter M, Simon I, Friedrich P, Tompa P. Preformed structural elements feature in partner recognition by intrinsically unstructured proteins. J Mol Biol. 2004;338:1015–1026.
    5. Dyson HJ, Wright PE. Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol. 2005;6:197–208.
