PredRSA: a gradient boosted regression trees approach for predicting protein solvent accessibility
Fri Jan 01 2016
Chao Fan et al. BMC Bioinformatics. 2016.
Abstract
Background: Protein solvent accessibility prediction is a pivotal intermediate step towards modeling protein tertiary structures directly from one-dimensional sequences. It also plays an important part in identifying protein folds and domains. Although several methods for protein solvent accessibility prediction have been presented in recent years, their performance is far from satisfactory. In this work, we propose PredRSA, a computational method that can accurately predict the relative solvent accessible surface area (RSA) of residues by exploring various local and global sequence features that have been observed to be associated with solvent accessibility. Based on these features, Gradient Boosted Regression Trees (GBRT), a novel and efficient approach, is adopted for the first time to predict RSA.
Results: Experimental results obtained from 5-fold cross-validation on the Manesh-215 dataset show that the mean absolute error (MAE) and the Pearson correlation coefficient (PCC) of PredRSA are 9.0% and 0.75, respectively, which are better than those of existing methods. Moreover, we evaluate the performance of PredRSA on an independent test set of 68 proteins. Compared with the state-of-the-art approaches (SPINE-X and ASAquick), PredRSA achieves a significant improvement in prediction quality.
Conclusions: Our experimental results show that the Gradient Boosted Regression Trees algorithm and the novel feature combination are quite effective in relative solvent accessibility prediction. The proposed PredRSA method could assist the prediction of protein structures by supplying the predicted RSA as restraints.
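The two metrics reported in the Results, MAE and PCC, can be illustrated with a minimal sketch. The function names and the toy RSA values below are hypothetical and are not taken from the paper or its datasets; the code only shows the standard definitions of the two quantities.

```python
import math


def mean_absolute_error(observed, predicted):
    """Average absolute difference between paired values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def pearson_correlation(observed, predicted):
    """Pearson correlation coefficient between paired values."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    covariance = sum((o - mean_o) * (p - mean_p)
                     for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_o * var_p)


# Hypothetical per-residue RSA values in percent (illustration only).
observed_rsa = [5.0, 40.0, 75.0, 20.0]
predicted_rsa = [10.0, 35.0, 70.0, 30.0]

mae = mean_absolute_error(observed_rsa, predicted_rsa)
pcc = pearson_correlation(observed_rsa, predicted_rsa)
print(f"MAE = {mae:.2f} %, PCC = {pcc:.3f}")
```

A lower MAE and a PCC closer to 1 both indicate closer agreement between predicted and experimental RSA.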
Figures

The definition of the six side-chain environment categories. This figure shows how side-chain environments are classified. RSA represents the relative accessible surface area and F represents the fraction of the whole side-chain area covered by polar atoms. If RSA < 0.09, the residue is placed into class B (buried). If 0.09 ≤ RSA < 0.36, the residue is placed into class P (partially buried). If RSA ≥ 0.36, the residue is placed into class E (exposed). Within class B, if F < 0.45, the residue is placed into B1; if 0.45 ≤ F < 0.58, into B2; and if F ≥ 0.58, into B3. Within class P, if F < 0.67, the residue is placed into P1, and if F ≥ 0.67, into P2
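The thresholds in this caption can be written as a small decision function. This is a sketch under the assumption that RSA and F are both given as fractions between 0 and 1; the function name `side_chain_environment` is hypothetical and not from the paper.

```python
def side_chain_environment(rsa, polar_fraction):
    """Assign one of the six classes B1, B2, B3, P1, P2, E.

    rsa: relative accessible surface area as a fraction (assumed 0..1).
    polar_fraction: F, the fraction of the side-chain area covered by
        polar atoms (assumed 0..1).
    """
    if rsa < 0.09:  # class B (buried), split by F into three subclasses
        if polar_fraction < 0.45:
            return "B1"
        if polar_fraction < 0.58:
            return "B2"
        return "B3"
    if rsa < 0.36:  # class P (partially buried), split by F into two
        return "P1" if polar_fraction < 0.67 else "P2"
    return "E"      # class E (exposed) is not subdivided


# Hypothetical (RSA, F) pairs, one per class.
examples = [(0.05, 0.30), (0.05, 0.50), (0.05, 0.70),
            (0.20, 0.50), (0.20, 0.80), (0.50, 0.10)]
for rsa, f in examples:
    print(rsa, f, side_chain_environment(rsa, f))
```

Each lower bound is inclusive and each upper bound exclusive, matching the inequalities in the caption.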

The framework of PredRSA for protein relative solvent accessibility prediction. Five different types of sequence-derived features are generated and used as input to build the GBRT model. These features consist of PSSM in the form of PSI-BLAST profiles, predicted secondary structure information by PSIPRED, predicted native disorder information by DISOPRED, conservation score and side-chain environment
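The general idea behind gradient boosted regression trees can be sketched in a few lines. This is not the PredRSA implementation: it assumes squared-error loss, uses depth-1 trees (stumps) for brevity, and runs on hypothetical two-feature toy data rather than the five feature types listed in the caption. All function names are illustrative.

```python
def fit_stump(rows, residuals):
    """Find the one-feature threshold split that minimizes squared error."""
    best = None
    for j in range(len(rows[0])):
        values = sorted(set(row[j] for row in rows))
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [r for row, r in zip(rows, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(rows, residuals) if row[j] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, j, threshold, left_mean, right_mean)
    if best is None:  # every feature is constant: predict the mean residual
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, row):
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value


def fit_gbrt(rows, targets, n_trees=50, learning_rate=0.1):
    """Fit stumps one after another to the residuals of the current model."""
    base = sum(targets) / len(targets)  # initial constant prediction
    predictions = [base] * len(targets)
    stumps = []
    for _ in range(n_trees):
        # For squared-error loss the negative gradient is the residual.
        residuals = [t - p for t, p in zip(targets, predictions)]
        stump = fit_stump(rows, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, row)
                       for p, row in zip(predictions, rows)]
    return base, stumps, learning_rate


def predict_gbrt(model, row):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)


# Hypothetical toy data: two made-up features per residue, RSA target in %.
rows = [[0.1, 0.0], [0.2, 0.1], [0.8, 0.2], [0.9, 0.9], [0.4, 0.5], [0.6, 0.7]]
targets = [5.0, 10.0, 60.0, 80.0, 25.0, 45.0]

model = fit_gbrt(rows, targets)
mean_target = sum(targets) / len(targets)
baseline_sse = sum((t - mean_target) ** 2 for t in targets)
model_sse = sum((t - predict_gbrt(model, row)) ** 2
                for row, t in zip(rows, targets))
print(f"baseline SSE = {baseline_sse:.1f}, boosted SSE = {model_sse:.1f}")
```

In practice a library implementation with deeper trees and tuned hyperparameters would normally be used; the sketch only shows how each new tree is fitted to the residuals left by the trees before it.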

The importance of the five relevant features used in PredRSA. PSSM, SS, DISO, SCE and CS stand for position specific scoring matrix, protein secondary structure, protein native disorder, side-chain environment and conservation score, respectively

Predicted and experimental values (%) of RSA for each residue of CASP10 T0675

Correlation between experimental RSA values and predicted RSA values of CASP10 T0675. The Pearson correlation coefficient score is 0.92 and the most buried residues are well predicted with the RSA values near zero

Comparison between true mean values and predicted mean values for 20 amino acids on the Manesh-215 dataset

Mean predicted errors of 20 amino acids on the Manesh-215 dataset. The green line represents standard root mean square error, the red line represents mean absolute error and the blue line represents the corresponding data distribution of 20 amino acids