Large-scale prediction of long disordered regions in proteins using random forests
Pengfei Han et al. BMC Bioinformatics. 2009.
Abstract
Background: Many proteins contain disordered regions that lack fixed three-dimensional (3D) structure under physiological conditions but have important biological functions. Prediction of disordered regions in protein sequences is important for understanding protein function and for high-throughput determination of protein structures. Machine learning techniques, including neural networks and support vector machines, have been widely used in such predictions. Predictors designed for long disordered regions are usually less successful in predicting short disordered regions. Combining prediction of short and long disordered regions dramatically increases the complexity of the prediction algorithm and makes the predictor unsuitable for large-scale applications. Efficient batch prediction of long disordered regions alone is of greater interest in large-scale proteome studies.
Results: A new algorithm, IUPforest-L, for predicting long disordered regions using the random forest learning model is proposed in this paper. IUPforest-L is based on the Moreau-Broto auto-correlation function of amino acid indices (AAIs) and other physicochemical features of the primary sequences. In 10-fold cross validation tests, IUPforest-L can achieve an area of 89.5% under the receiver operating characteristic (ROC) curve. Compared with existing disorder predictors, IUPforest-L has high prediction accuracy and is efficient for predicting long disordered regions in large-scale proteomes.
Conclusion: The random forest model based on the auto-correlation functions of the AAIs within a protein fragment and other physicochemical features could effectively detect long disordered regions in proteins. A new predictor, IUPforest-L, was developed to batch predict long disordered regions in proteins, and the server can be accessed from http://dmg.cs.rmit.edu.au/IUPforest/IUPforest-L.php.
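The Moreau-Broto auto-correlation function named in the Results can be illustrated with a minimal sketch. It assumes the common form in which the index values of residues separated by a lag d are multiplied and averaged over the N - d pairs in the fragment; the abstract does not state the exact normalization used by IUPforest-L. The Kyte-Doolittle hydropathy scale stands in for an arbitrary amino acid index, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy scale, used here only as an illustrative
# amino acid index (an assumption; the abstract does not list the AAIs used).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def moreau_broto(fragment, lag, index=KYTE_DOOLITTLE):
    """Lag-d auto-correlation of an index, averaged over the N - d residue pairs."""
    values = [index[aa] for aa in fragment]
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(fragment)")
    total = sum(values[i] * values[i + lag] for i in range(n - lag))
    return total / (n - lag)


def autocorrelation_features(fragment, max_lag=3, index=KYTE_DOOLITTLE):
    """Feature vector holding the auto-correlation at lags 1..max_lag."""
    return [moreau_broto(fragment, d, index) for d in range(1, max_lag + 1)]
```

Calling `autocorrelation_features` on a fragment yields one value per lag, which is the kind of fixed-length numeric vector a random forest can take as input.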
Figures
![Figure 1](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3907/2637845/a1a17be0453f/1471-2105-10-8-1.gif)
A sample random forest. In the decision tree on the left, the node at the root tests an attribute, such as the first order auto-correlation function of the normalized flexibility parameters (see below). If it is higher than a given threshold then the residue is in a disordered state (the right branch labelled D); otherwise another input attribute is tested, and further tests are performed until a decision is made. A random forest can comprise hundreds of decision trees.
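The threshold tests described in this caption can be sketched as a toy forest. Everything below is hypothetical: the feature names, thresholds, tree shapes and the `"O"` label for ordered residues are invented for illustration and are not taken from the trained IUPforest-L model.

```python
def classify(tree, features):
    """Walk one tree.

    Internal nodes are (feature_name, threshold, low_branch, high_branch);
    leaves are the strings "D" (disordered) or "O" (ordered).
    """
    node = tree
    while not isinstance(node, str):
        name, threshold, low, high = node
        node = high if features[name] > threshold else low
    return node


def forest_score(forest, features):
    """Fraction of trees voting 'D' for this residue."""
    votes = [classify(tree, features) for tree in forest]
    return votes.count("D") / len(votes)


# Three hand-written trees with made-up features and thresholds.
tree_a = ("flex_ac1", 0.5, ("hydropathy", 0.0, "D", "O"), "D")
tree_b = ("net_charge", 0.2, "O", "D")
tree_c = ("hydropathy", 1.0, "D", "O")
forest = [tree_a, tree_b, tree_c]

residue = {"flex_ac1": 0.7, "net_charge": 0.1, "hydropathy": -0.5}
score = forest_score(forest, residue)  # two of the three trees vote "D"
```

A real forest would learn the attribute and threshold at each node from training data; only the traversal and the vote counting are shown here.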
![Figure 2](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3907/2637845/c8dda9363f11/1471-2105-10-8-2.gif)
Flow chart of IUPforest-L. The sequence features were calculated as a window slid along a protein sequence. IUPforest-L models were trained on the six types of features, and each tree in the forest produced its own prediction. The final prediction score combined the outcomes of all trees by voting.
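The sliding-window step in this flow chart can be sketched as follows. The sketch assumes the window is centred on each residue and truncated at the sequence termini; the caption does not say how IUPforest-L handles the ends. The default of 31 residues matches the window length quoted for the cross validation results, and both function names are hypothetical.

```python
def sliding_windows(sequence, window=31):
    """Yield (position, fragment) pairs, one fragment centred on each residue.

    Fragments near the termini are shorter because the window is truncated
    there (an assumption made for this sketch).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    for i in range(len(sequence)):
        yield i, sequence[max(0, i - half): i + half + 1]


def per_residue_scores(sequence, score_fragment, window=31):
    """Apply a fragment-level scoring function at every residue position."""
    return [score_fragment(frag) for _, frag in sliding_windows(sequence, window)]
```

Here `score_fragment` stands for any function that maps a fragment to a number, for example a feature calculation followed by a forest vote.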
![Figure 3](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3907/2637845/9167472cc1bf/1471-2105-10-8-3.gif)
ROC curves of 10-fold cross validation tests. The ROC curves of IUPforest-L in 10-fold cross validation tests are shown. IUPforest-L reached a 76% true positive rate at a 10% false positive rate with MCC = 0.67, Sproduct = 0.64 and an area of 89.5% under the ROC curve on the training data set with a window of 31 aa.
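For readers unfamiliar with the quantities in this caption, the sketch below computes the true positive rate, false positive rate and Matthews correlation coefficient (MCC) at one score threshold, and the area under the ROC curve as the probability that a disordered residue outscores an ordered one. These are the standard textbook definitions, not code from the paper; the labels and scores are made-up toy data, and both classes must be present for the rates to be defined.

```python
import math


def confusion_counts(labels, scores, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are called disordered."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = s >= threshold
        if y and predicted:
            tp += 1
        elif y:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def rates_and_mcc(labels, scores, threshold):
    """Return (true positive rate, false positive rate, MCC) at one threshold."""
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return tpr, fpr, mcc


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 1 = disordered, 0 = ordered.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
tpr, fpr, mcc = rates_and_mcc(labels, scores, threshold=0.5)
area = auc(labels, scores)
```

Sweeping the threshold from high to low and plotting each (false positive rate, true positive rate) pair traces out the ROC curve itself.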
![Figure 4](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3907/2637845/b1b1542e1cd1/1471-2105-10-8-4.gif)
ROC curves on test set Hirose-ADS1. The ROC curves for IUPforest-L and nine publicly available predictors on the blind test dataset Hirose-ADS1 are shown. IUPforest-L has the best performance in terms of the AUC.
![Figure 5](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3907/2637845/f397858c76d4/1471-2105-10-8-5.gif)
ROC curves on test set Han-ADS1. The ROC curves for IUPforest-L and some publicly available predictors on the blind test dataset Han-ADS1 are shown. IUPforest-L performs better in terms of the AUC than most of the predictors.
![Figure 6](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3907/2637845/497a8995fe8c/1471-2105-10-8-6.gif)
ROC curves on test set Peng-DB. The ROC curves for IUPforest-L and some publicly available predictors on the blind test dataset Peng-DB are shown. IUPforest-L performs better in terms of the AUC than most of the predictors.
Similar articles
- Predicting disordered regions in proteins using the profiles of amino acid indices. Han P, Zhang X, Feng ZP. BMC Bioinformatics. 2009 Jan 30;10 Suppl 1(Suppl 1):S42. doi: 10.1186/1471-2105-10-S1-S42. PMID: 19208144. Free PMC article.
- Predicting disordered regions in proteins based on decision trees of reduced amino acid composition. Han P, Zhang X, Norton RS, Feng ZP. J Comput Biol. 2006 Dec;13(10):1723-34. doi: 10.1089/cmb.2006.13.1723. PMID: 17238841.
- Length-dependent prediction of protein intrinsic disorder. Peng K, Radivojac P, Vucetic S, Dunker AK, Obradovic Z. BMC Bioinformatics. 2006 Apr 17;7:208. doi: 10.1186/1471-2105-7-208. PMID: 16618368. Free PMC article.
- A comprehensive overview of computational protein disorder prediction methods. Deng X, Eickholt J, Cheng J. Mol Biosyst. 2012 Jan;8(1):114-21. doi: 10.1039/c1mb05207a. Epub 2011 Aug 26. PMID: 21874190. Free PMC article. Review.
- Liu Y, Wang X, Liu B. Brief Bioinform. 2019 Jan 18;20(1):330-346. doi: 10.1093/bib/bbx126. PMID: 30657889. Review.
Cited by
- Reciprocal regulation of metabolic and signaling pathways. Barth AS, Kumordzie A, Colantuoni C, Margulies KB, Cappola TP, Tomaselli GF. BMC Genomics. 2010 Mar 24;11:197. doi: 10.1186/1471-2164-11-197. PMID: 20334672. Free PMC article.
- Wang XF, Chen Z, Wang C, Yan RX, Zhang Z, Song J. PLoS One. 2011;6(10):e26767. doi: 10.1371/journal.pone.0026767. Epub 2011 Oct 28. PMID: 22046350. Free PMC article.
- Montanari F, Shields DC, Khaldi N. PLoS One. 2011;6(9):e24989. doi: 10.1371/journal.pone.0024989. Epub 2011 Sep 15. PMID: 21949823. Free PMC article.
- Nespoulous C, Rofidal V, Sommerer N, Hem S, Rossignol M. Proteome Sci. 2012 Oct 30;10(1):62. doi: 10.1186/1477-5956-10-62. PMID: 23110452. Free PMC article.
- Protein inter-domain linker prediction using Random Forest and amino acid physiochemical properties. Shatnawi M, Zaki N, Yoo PD. BMC Bioinformatics. 2014;15 Suppl 16(Suppl 16):S8. doi: 10.1186/1471-2105-15-S16-S8. Epub 2014 Dec 8. PMID: 25521329. Free PMC article.