Large-scale prediction of long disordered regions in proteins using random forests

  • ️Thu Jan 01 2009

Pengfei Han et al. BMC Bioinformatics. 2009.

Abstract

Background: Many proteins contain disordered regions that lack a fixed three-dimensional (3D) structure under physiological conditions yet have important biological functions. Predicting disordered regions in protein sequences is important for understanding protein function and for high-throughput determination of protein structures. Machine learning techniques, including neural networks and support vector machines, have been widely used for such predictions. Predictors designed for long disordered regions are usually less successful at predicting short disordered regions. Combining the prediction of short and long disordered regions dramatically increases the complexity of the prediction algorithm and makes the predictor unsuitable for large-scale applications. Efficient batch prediction of long disordered regions alone is therefore of greater interest in large-scale proteome studies.

Results: This paper proposes IUPforest-L, a new algorithm that predicts long disordered regions using the random forest learning model. IUPforest-L is based on the Moreau-Broto auto-correlation function of amino acid indices (AAIs) and other physicochemical features of the primary sequences. In 10-fold cross-validation tests, IUPforest-L achieves an area under the receiver operating characteristic (ROC) curve of 89.5%. Compared with existing disorder predictors, IUPforest-L has high prediction accuracy and is efficient for predicting long disordered regions in large-scale proteomes.
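
The abstract does not give the formula for the Moreau-Broto auto-correlation, so the following is only a minimal sketch under a commonly used definition: the mean product of index values at positions separated by a lag d, AC(d) = 1/(N-d) * sum of P(i) * P(i+d). The index values in TOY_INDEX are hypothetical placeholders, not a real AAindex entry, and the function names are illustrative rather than taken from IUPforest-L.

```python
from typing import Dict, List

# Hypothetical toy amino acid index (NOT a real AAindex entry).
TOY_INDEX: Dict[str, float] = {"A": 0.5, "G": 1.0, "S": 0.8, "L": -0.6, "P": 0.9}


def moreau_broto(seq: str, index: Dict[str, float], lag: int) -> float:
    """Normalized Moreau-Broto auto-correlation of a sequence at one lag."""
    values = [index[aa] for aa in seq]
    n = len(values)
    if lag < 1 or lag >= n:
        raise ValueError("lag must satisfy 1 <= lag < len(seq)")
    # Sum the products of index values that are `lag` positions apart.
    total = sum(values[i] * values[i + lag] for i in range(n - lag))
    # Normalizing by the number of pairs is an assumption of this sketch.
    return total / (n - lag)


def autocorrelation_features(seq: str, index: Dict[str, float], max_lag: int) -> List[float]:
    """One feature per lag, from lag 1 up to and including max_lag."""
    return [moreau_broto(seq, index, d) for d in range(1, max_lag + 1)]


features = autocorrelation_features("GSPLAG", TOY_INDEX, 3)
```

Each amino acid index would contribute its own vector of lagged values, so a fragment ends up described by one block of features per index.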

Conclusion: A random forest model based on the auto-correlation functions of the AAIs within a protein fragment and on other physicochemical features can effectively detect long disordered regions in proteins. A new predictor, IUPforest-L, was developed to batch predict long disordered regions in proteins, and the server can be accessed from http://dmg.cs.rmit.edu.au/IUPforest/IUPforest-L.php.

Figures

Figure 1

A sample random forest. In the decision tree on the left, the root node tests an attribute, such as the first-order auto-correlation function of the normalized flexibility parameters (see below). If the value is higher than a given threshold, the residue is classified as disordered (the right branch labelled D); otherwise another input attribute is tested, and further tests are performed until a decision is made. A random forest can comprise hundreds of decision trees.
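
The traversal described in this caption can be sketched as a small hand-built tree. The feature names ("flex_ac1", "hydropathy"), the thresholds and the tree shape are hypothetical and chosen only for illustration; in a real random forest they are learned from training data.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    label: str  # "D" for disordered, "O" for ordered


@dataclass
class Node:
    feature: str                  # name of the attribute tested at this node
    threshold: float
    high: Union["Node", Leaf]     # branch taken when value > threshold
    low: Union["Node", Leaf]      # branch taken otherwise


def classify(tree: Union[Node, Leaf], features: Dict[str, float]) -> str:
    """Walk from the root to a leaf, testing one attribute per node."""
    node = tree
    while isinstance(node, Node):
        node = node.high if features[node.feature] > node.threshold else node.low
    return node.label


# Hypothetical tree: the root tests a first-order auto-correlation value.
TOY_TREE = Node(
    feature="flex_ac1", threshold=0.5,
    high=Leaf("D"),
    low=Node(feature="hydropathy", threshold=0.0, high=Leaf("O"), low=Leaf("D")),
)

label = classify(TOY_TREE, {"flex_ac1": 0.9, "hydropathy": 1.0})
```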

Figure 2

Flow chart of IUPforest-L. The sequence features are calculated as a window slides along a protein sequence. IUPforest-L models are trained on the six types of features, and each tree in the forest produces its own prediction. The final prediction score combines the outcomes of all trees by voting.
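
A minimal sketch of the window-and-vote flow in this caption follows. It assumes one window centered on each residue and truncated at the sequence ends, and it scores a window as the fraction of trees voting "disordered"; the source states neither detail. The three toy "trees" are hypothetical stand-ins for trained decision trees.

```python
from typing import Callable, List, Sequence


def sliding_windows(seq: str, width: int) -> List[str]:
    """Return one window per residue, truncated at the sequence ends."""
    half = width // 2
    return [seq[max(0, i - half): i + half + 1] for i in range(len(seq))]


def vote_score(window: str, trees: Sequence[Callable[[str], bool]]) -> float:
    """Fraction of trees that vote 'disordered' for this window."""
    votes = sum(1 for tree in trees if tree(window))
    return votes / len(trees)


# Hypothetical stand-ins for trained trees; each returns True for 'disordered'.
TOY_TREES: List[Callable[[str], bool]] = [
    lambda w: "P" in w,
    lambda w: "G" in w,
    lambda w: "S" in w,
]


def score_sequence(seq: str, width: int) -> List[float]:
    """Per-residue disorder score for a whole sequence."""
    return [vote_score(w, TOY_TREES) for w in sliding_windows(seq, width)]


scores = score_sequence("MKLPPGSSA", 5)
```

Because each residue is scored independently from its own window, the per-sequence work parallelizes naturally, which suits batch prediction over many proteins.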

Figure 3

ROC curves of 10-fold cross-validation tests. The ROC curves of IUPforest-L in 10-fold cross-validation tests are shown. On the training data set, with a window of 31 aa, IUPforest-L reaches a 76% true positive rate at a 10% false positive rate, with MCC = 0.67, Sproduct = 0.64 and an area of 89.5% under the ROC curve.
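
The metrics named in this caption can be computed from a confusion matrix as sketched below. The true and false positive rates and the Matthews correlation coefficient (MCC) follow their standard definitions. Taking Sproduct as sensitivity times specificity is an assumption based on common usage in disorder prediction, since the abstract does not define it. The counts in the example call are hypothetical.

```python
import math
from typing import Dict


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Threshold-level metrics; assumes both classes are present."""
    tpr = tp / (tp + fn)          # sensitivity (true positive rate)
    fpr = fp / (fp + tn)          # false positive rate = 1 - specificity
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "tpr": tpr,
        "fpr": fpr,
        "mcc": mcc,
        # Assumed definition: sensitivity * specificity.
        "s_product": tpr * (1.0 - fpr),
    }


# Hypothetical counts for a balanced set of 100 residues per class.
metrics = confusion_metrics(tp=76, fp=10, tn=90, fn=24)
```

Sweeping the decision threshold on the forest's voting score and recomputing the true and false positive rates at each step traces out the ROC curve.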

Figure 4

ROC curves on test set Hirose-ADS1. The ROC curves for IUPforest-L and nine publicly available predictors on the blind test dataset Hirose-ADS1 are shown. IUPforest-L has the best performance in terms of the AUC.

Figure 5

ROC curves on test set Han-ADS1. The ROC curves for IUPforest-L and some publicly available predictors on the blind test dataset Han-ADS1 are shown. IUPforest-L performs better in terms of the AUC than most of the predictors.

Figure 6

ROC curves on test set Peng-DB. The ROC curves for IUPforest-L and some publicly available predictors on the blind test dataset Peng-DB are shown. IUPforest-L performs better in terms of the AUC than most of the predictors.
