Thu Jan 01 2009

Large-scale prediction of long disordered regions in proteins using random forests

Pengfei Han et al. BMC Bioinformatics. 2009.

Abstract

Background: Many proteins contain disordered regions that lack a fixed three-dimensional (3D) structure under physiological conditions but have important biological functions. Prediction of disordered regions in protein sequences is important for understanding protein function and for high-throughput determination of protein structures. Machine learning techniques, including neural networks and support vector machines, have been widely used in such predictions. Predictors designed for long disordered regions are usually less successful in predicting short disordered regions. Combining the prediction of short and long disordered regions dramatically increases the complexity of the prediction algorithm and makes the predictor unsuitable for large-scale applications. Efficient batch prediction of long disordered regions alone is of greater interest in large-scale proteome studies.

Results: This paper proposes a new algorithm, IUPforest-L, for predicting long disordered regions using the random forest learning model. IUPforest-L is based on the Moreau-Broto auto-correlation function of amino acid indices (AAIs) and other physicochemical features of the primary sequences. In 10-fold cross validation tests, IUPforest-L achieves an area of 89.5% under the receiver operating characteristic (ROC) curve. Compared with existing disorder predictors, IUPforest-L has high prediction accuracy and is efficient for predicting long disordered regions in large-scale proteomes.
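To make the feature definition concrete, the following is a minimal sketch of a Moreau-Broto auto-correlation computed over one sequence window. It is not the authors' implementation: the function names, the division by the number of residue pairs, the choice of lags, and the use of the Kyte-Doolittle hydropathy scale as a stand-in amino acid index are all assumptions made for illustration.

```python
# Illustrative sketch only; not the IUPforest-L code.
# KD_HYDROPATHY is the Kyte-Doolittle hydropathy scale, used here as a
# stand-in for whichever amino acid indices (AAIs) a predictor would use.
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def moreau_broto(values, lag):
    """Average product of index values that are `lag` positions apart."""
    n = len(values)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(values)")
    pairs = n - lag  # number of (i, i + lag) pairs inside the window
    return sum(values[i] * values[i + lag] for i in range(pairs)) / pairs


def autocorrelation_features(window, index=KD_HYDROPATHY, max_lag=3):
    """Return the auto-correlation at lags 1..max_lag for one window."""
    values = [index[aa] for aa in window]
    return [moreau_broto(values, lag) for lag in range(1, max_lag + 1)]


if __name__ == "__main__":
    # Arbitrary example fragment, not taken from any data set in the paper.
    print(autocorrelation_features("MKTAYIAKQRQISFVKSHFSRQ"))
```

Each lag yields one number per amino acid index, so a window described by several indices and several lags becomes a fixed-length feature vector that a classifier can consume.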

Conclusion: The random forest model based on the auto-correlation functions of the AAIs within a protein fragment and other physicochemical features could effectively detect long disordered regions in proteins. A new predictor, IUPforest-L, was developed to batch predict long disordered regions in proteins, and the server can be accessed from http://dmg.cs.rmit.edu.au/IUPforest/IUPforest-L.php.

Figures

**Figure 1**
**A sample random forest**. In the decision tree on the left, the node at the root tests an attribute, such as the first order auto-correlation function of the normalized flexibility parameters (see below). If it is higher than a given threshold then the residue is in a disordered state (the right branch labelled D); otherwise another input attribute is tested and a set of other tests are further performed until a decision is made. A random forest can comprise hundreds of decision trees.
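The traversal described in this caption can be sketched with a small nested structure. The attribute names, thresholds, and tree shape below are hypothetical and chosen only to mirror the caption; in a trained forest they are learned from data.

```python
# Hypothetical single decision tree. Each internal node tests one attribute
# against a threshold; leaves carry "D" (disordered) or "O" (ordered).
TREE = {
    "attribute": "flexibility_autocorr_lag1",
    "threshold": 0.5,
    "above": "D",  # higher than the threshold: disordered
    "below": {     # otherwise test another attribute
        "attribute": "hydropathy_autocorr_lag1",
        "threshold": -0.2,
        "above": "O",
        "below": "D",
    },
}


def classify(node, features):
    """Walk from the root to a leaf and return the leaf label."""
    while isinstance(node, dict):
        value = features[node["attribute"]]
        node = node["above"] if value > node["threshold"] else node["below"]
    return node


if __name__ == "__main__":
    example = {"flexibility_autocorr_lag1": 0.1, "hydropathy_autocorr_lag1": 0.3}
    print(classify(TREE, example))
```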

**Figure 2**
**Flow chart of IUPforest-L**. The sequence features are calculated as a window slides along a protein sequence. The IUPforest-L models were trained on the six types of features, and each tree in the forest produces its own prediction. The final prediction score combines the outcomes of all trees by voting.
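The windowing and voting steps in this flow chart can be sketched as follows. This is an assumption-laden illustration, not the published pipeline: the 31-residue default is taken from the Figure 3 caption, residues too close to the termini are simply skipped, and the toy feature and the three toy "trees" are hypothetical placeholders for the trained model.

```python
def sliding_windows(sequence, size=31):
    """Yield (center_index, window) for every full window of `size` residues."""
    if size % 2 == 0:
        raise ValueError("size must be odd so the window has a center residue")
    half = size // 2
    for center in range(half, len(sequence) - half):
        yield center, sequence[center - half:center + half + 1]


def forest_score(trees, features):
    """Fraction of trees that vote 'disordered' for one feature vector."""
    votes = [bool(tree(features)) for tree in trees]
    return sum(votes) / len(votes)


def toy_features(window):
    """Hypothetical feature: fraction of proline residues in the window."""
    return {"frac_P": window.count("P") / len(window)}


# Hypothetical stand-ins for trained trees: each returns True for "disordered".
TOY_TREES = [
    lambda f: f["frac_P"] > 0.05,
    lambda f: f["frac_P"] > 0.10,
    lambda f: f["frac_P"] > 0.20,
]

if __name__ == "__main__":
    sequence = "MKTAYIAKQRQISFVKSHFSRQPPSPEPSPKPEEPSPKKEEAPAL"  # arbitrary example
    for center, window in sliding_windows(sequence):
        print(center, round(forest_score(TOY_TREES, toy_features(window)), 2))
```

Thresholding the vote fraction at different cut-offs is what produces the points of an ROC curve.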

**Figure 3**
**ROC curves of 10-fold cross validation tests**. The ROC curves of IUPforest-L in 10-fold cross validation tests are shown. On the training data set with a window of 31 aa, IUPforest-L reached a 76% true positive rate at a 10% false positive rate, with *MCC* = 0.67, *Sproduct* = 0.64 and an area of 89.5% under the ROC curve.
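As a worked illustration of the quantities in this caption, the sketch below computes the true and false positive rates, the Matthews correlation coefficient (MCC), and a trapezoidal area under an ROC curve. The confusion-matrix counts are hypothetical: they assume equal numbers of disordered and ordered residues and are chosen only to give a 76% true positive rate and a 10% false positive rate. The class balance of the actual training set is not stated here, and *Sproduct* is not computed because its definition does not appear in this text.

```python
import math


def rates(tp, fn, fp, tn):
    """Return (true positive rate, false positive rate)."""
    return tp / (tp + fn), fp / (fp + tn)


def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auc_trapezoid(points):
    """Area under a curve given as (fpr, tpr) points, by the trapezoid rule."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


if __name__ == "__main__":
    # Hypothetical balanced counts: 100 disordered and 100 ordered residues.
    tp, fn, fp, tn = 76, 24, 10, 90
    print(rates(tp, fn, fp, tn))
    print(round(mcc(tp, fn, fp, tn), 2))
    # A three-point curve through that operating point (a crude lower bound).
    print(auc_trapezoid([(0.0, 0.0), (0.10, 0.76), (1.0, 1.0)]))
```

Under the balanced-class assumption the MCC rounds to 0.67, which agrees with the value in the caption.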

**Figure 4**
**ROC curves on test set Hirose-ADS1**. The ROC curves for IUPforest-L and nine publicly available predictors on the blind test dataset Hirose-ADS1 are shown. IUPforest-L has the best performance in terms of the *AUC*.

**Figure 5**
**ROC curves on test set Han-ADS1**. The ROC curves for IUPforest-L and some publicly available predictors on the blind test dataset Han-ADS1 are shown. IUPforest-L performs better in terms of the *AUC* than most of the predictors.

**Figure 6**
**ROC curves on test set Peng-DB**. The ROC curves for IUPforest-L and some publicly available predictors on the blind test dataset Peng-DB are shown. IUPforest-L performs better in terms of the *AUC* than most of the predictors.

Cited by

Reciprocal regulation of metabolic and signaling pathways.
Barth AS, Kumordzie A, Colantuoni C, Margulies KB, Cappola TP, Tomaselli GF. BMC Genomics. 2010 Mar 24;11:197. doi: 10.1186/1471-2164-11-197. PMID: 20334672.
Predicting residue-residue contacts and helix-helix interactions in transmembrane proteins using an integrative feature-based random forest approach.
Wang XF, Chen Z, Wang C, Yan RX, Zhang Z, Song J. PLoS One. 2011;6(10):e26767. doi: 10.1371/journal.pone.0026767. Epub 2011 Oct 28. PMID: 22046350.
Differences in the number of intrinsically disordered regions between yeast duplicated proteins, and their relationship with functional divergence.
Montanari F, Shields DC, Khaldi N. PLoS One. 2011;6(9):e24989. doi: 10.1371/journal.pone.0024989. Epub 2011 Sep 15. PMID: 21949823.
Phosphoproteomic analysis reveals major default phosphorylation sites outside long intrinsically disordered regions of Arabidopsis plasma membrane proteins.
Nespoulous C, Rofidal V, Sommerer N, Hem S, Rossignol M. Proteome Sci. 2012 Oct 30;10(1):62. doi: 10.1186/1477-5956-10-62. PMID: 23110452.
Protein inter-domain linker prediction using Random Forest and amino acid physiochemical properties.
Shatnawi M, Zaki N, Yoo PD. BMC Bioinformatics. 2014;15 Suppl 16:S8. doi: 10.1186/1471-2105-15-S16-S8. Epub 2014 Dec 8. PMID: 25521329.
