pubmed.ncbi.nlm.nih.gov

PredRSA: a gradient boosted regression trees approach for predicting protein solvent accessibility - PubMed

Fri Jan 01 2016

PredRSA: a gradient boosted regression trees approach for predicting protein solvent accessibility

Chao Fan et al. BMC Bioinformatics. 2016.

Abstract

Background: Protein solvent accessibility prediction is a pivotal intermediate step towards modeling protein tertiary structures directly from one-dimensional sequences. It also plays an important part in identifying protein folds and domains. Although several methods for protein solvent accessibility prediction have been presented in recent years, their performance is far from satisfactory. In this work, we propose PredRSA, a computational method that can accurately predict the relative solvent accessible surface area (RSA) of residues by exploring various local and global sequence features that have been observed to be associated with solvent accessibility. Based on these features, Gradient Boosted Regression Trees (GBRT), a novel and efficient approach, is adopted for the first time to predict RSA.

Results: Experimental results obtained from 5-fold cross-validation on the Manesh-215 dataset show that the mean absolute error (MAE) and the Pearson correlation coefficient (PCC) of PredRSA are 9.0% and 0.75, respectively, both better than those of the existing methods. Moreover, we evaluate the performance of PredRSA using an independent test set of 68 proteins. Compared with the state-of-the-art approaches (SPINE-X and ASAquick), PredRSA achieves a significant improvement in prediction quality.
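The two metrics reported above can be sketched as follows. This is a minimal illustration of the standard MAE and Pearson correlation formulas, not the authors' evaluation code; the function names and the RSA values are made up for the example.

```python
import math


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def pearson_correlation(observed, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_o * var_p)


# Hypothetical per-residue RSA values in percent, for illustration only.
observed_rsa = [5.0, 20.0, 45.0, 70.0, 10.0]
predicted_rsa = [8.0, 15.0, 50.0, 60.0, 12.0]

mae = mean_absolute_error(observed_rsa, predicted_rsa)
pcc = pearson_correlation(observed_rsa, predicted_rsa)
print(f"MAE = {mae:.1f}%, PCC = {pcc:.3f}")
```

A lower MAE and a higher PCC both indicate closer agreement between predicted and experimental RSA.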

Conclusions: Our experimental results show that the Gradient Boosted Regression Trees algorithm and the novel feature combination are quite effective in relative solvent accessibility prediction. The proposed PredRSA method could assist the prediction of protein structures by applying the predicted RSA as restraints.

Figures

**Fig. 1**
The definition of the six side-chain environment categories. This figure shows how side-chain environments are classified. *RSA* represents the relative accessible surface area and *F* represents the fraction of the whole side-chain area covered by polar atoms. If RSA<0.09, the residue is placed into class B (buried). If 0.09≤RSA<0.36, the residue is placed into class P (partially buried). If RSA≥0.36, the residue is placed into class E (exposed). Within class B, if F<0.45, the residue is placed into class B₁; if 0.45≤F<0.58, into class B₂; and if F≥0.58, into class B₃. Within class P, if F<0.67, the residue is placed into class P₁, and if F≥0.67, into class P₂
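The thresholds in the Fig. 1 caption translate directly into a small decision function. The sketch below assumes RSA and F are given as fractions (not percentages); the function name and the plain-text labels `B1` to `P2` are choices made for this example rather than anything taken from the paper.

```python
def side_chain_environment(rsa, polar_fraction):
    """Assign one of the six side-chain environment classes from Fig. 1.

    rsa            -- relative accessible surface area, as a fraction
    polar_fraction -- F, the fraction of side-chain area covered by polar atoms
    """
    if rsa < 0:
        raise ValueError("rsa must be non-negative")
    if not 0.0 <= polar_fraction <= 1.0:
        raise ValueError("polar_fraction must lie in [0, 1]")

    if rsa < 0.09:  # class B (buried), subdivided by F
        if polar_fraction < 0.45:
            return "B1"
        if polar_fraction < 0.58:
            return "B2"
        return "B3"
    if rsa < 0.36:  # class P (partially buried), subdivided by F
        return "P1" if polar_fraction < 0.67 else "P2"
    return "E"  # class E (exposed), not subdivided


# Illustrative inputs only.
for rsa, f in [(0.05, 0.30), (0.20, 0.70), (0.50, 0.10)]:
    print(rsa, f, side_chain_environment(rsa, f))
```

Each lower bound is inclusive and each upper bound exclusive, matching the inequalities in the caption.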

**Fig. 2**
The framework of PredRSA for protein relative solvent accessibility prediction. Five different types of sequence-derived features are generated and used as input to build the GBRT model. These features consist of PSSM in the form of PSI-BLAST profiles, predicted secondary structure information by PSIPRED, predicted native disorder information by DISOPRED, conservation score and side-chain environment
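For readers unfamiliar with GBRT, the idea of fitting each new tree to the residuals of the current ensemble can be sketched in plain Python. This is a generic illustration of gradient boosting with squared loss and depth-1 trees (stumps), not the PredRSA implementation: the two-column feature rows, the target values, and the hyperparameters `n_trees` and `learning_rate` are all placeholders, and the real model uses the five feature types listed in the caption.

```python
def fit_stump(rows, residuals):
    """Find the single-feature threshold split that minimizes squared error."""
    best = None  # (error, feature, threshold, left_value, right_value)
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [r for row, r in zip(rows, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(rows, residuals) if row[j] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, j, threshold, left_mean, right_mean)
    if best is None:  # no split possible: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, row):
    feature, threshold, left_value, right_value = stump
    return left_value if row[feature] <= threshold else right_value


def fit_gbrt(rows, targets, n_trees=50, learning_rate=0.1):
    """Boosting with squared loss: each stump is fit to the current residuals."""
    base = sum(targets) / len(targets)  # initial constant prediction
    predictions = [base] * len(targets)
    stumps = []
    for _ in range(n_trees):
        residuals = [t - p for t, p in zip(targets, predictions)]
        stump = fit_stump(rows, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, row)
                       for p, row in zip(predictions, rows)]
    return base, stumps, learning_rate


def predict_gbrt(model, row):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)


# Toy data: two placeholder features per residue and made-up RSA targets (%).
rows = [[0.1, 1.0], [0.4, 0.0], [0.5, 1.0], [0.9, 0.0], [0.3, 1.0], [0.8, 0.0]]
targets = [5.0, 30.0, 20.0, 70.0, 10.0, 60.0]
model = fit_gbrt(rows, targets)


def mean_squared_error(predicted):
    return sum((t - p) ** 2 for t, p in zip(targets, predicted)) / len(targets)


baseline = sum(targets) / len(targets)
mse_baseline = mean_squared_error([baseline] * len(targets))
mse_model = mean_squared_error([predict_gbrt(model, row) for row in rows])
print(f"training MSE: baseline {mse_baseline:.1f}, boosted {mse_model:.1f}")
```

In practice a library implementation with deeper trees and tuned hyperparameters would be used; the sketch only shows why adding many weak trees, each scaled by a small learning rate, steadily reduces the training error.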

**Fig. 3**
The importance of the five relevant features used in PredRSA. PSSM, SS, DISO, SCE and CS stand for position specific scoring matrix, protein secondary structure, protein native disorder, side-chain environment and conservation score, respectively

**Fig. 4**
Predicted and experimental values (%) of RSA for each residue of CASP10 T0675

**Fig. 5**
Correlation between experimental RSA values and predicted RSA values of CASP10 T0675. The Pearson correlation coefficient score is 0.92 and the most buried residues are well predicted with the RSA values near zero

**Fig. 6**
Comparison between true mean values and predicted mean values for 20 amino acids on the Manesh-215 dataset

**Fig. 7**
Mean prediction errors for the 20 amino acids on the Manesh-215 dataset. The green line represents the standard root mean square error, the red line represents the mean absolute error, and the blue line represents the corresponding data distribution of the 20 amino acids

Cited by

Interpretable machine learning prediction of all-cause mortality.
Qiu W, Chen H, Dincer AB, Lundberg S, Kaeberlein M, Lee SI. Commun Med (Lond). 2022 Oct 3;2:125. doi: 10.1038/s43856-022-00180-x. eCollection 2022. PMID: 36204043. Free PMC article.
A sparse autoencoder-based deep neural network for protein solvent accessibility and contact number prediction.
Deng L, Fan C, Zeng Z. BMC Bioinformatics. 2017 Dec 28;18(Suppl 16):569. doi: 10.1186/s12859-017-1971-7. PMID: 29297299. Free PMC article.
Accurate prediction of protein relative solvent accessibility using a balanced model.
Wu W, Wang Z, Cong P, Li T. BioData Min. 2017 Jan 24;10:1. doi: 10.1186/s13040-016-0121-5. eCollection 2017. PMID: 28127402. Free PMC article.
SDADB: a functional annotation database of protein structural domains.
Zeng C, Zhan W, Deng L. Database (Oxford). 2018 Jan 1;2018:bay064. doi: 10.1093/database/bay064. PMID: 29961821. Free PMC article.
Predicting effective drug combinations using gradient tree boosting based on features extracted from drug-protein heterogeneous network.
Liu H, Zhang W, Nie L, Ding X, Luo J, Zou L. BMC Bioinformatics. 2019 Dec 9;20(1):645. doi: 10.1186/s12859-019-3288-1. PMID: 31818267. Free PMC article.
