PredRSA: a gradient boosted regression trees approach for predicting protein solvent accessibility

  • Fri Jan 01 2016

Chao Fan et al. BMC Bioinformatics. 2016.

Abstract

Background: Protein solvent accessibility prediction is a pivotal intermediate step towards modeling protein tertiary structures directly from one-dimensional sequences. It also plays an important part in identifying protein folds and domains. Although several methods for protein solvent accessibility prediction have been presented in recent years, their performance is far from satisfactory. In this work, we propose PredRSA, a computational method that can accurately predict the relative solvent accessible surface area (RSA) of residues by exploring various local and global sequence features that have been observed to be associated with solvent accessibility. Based on these features, a novel and efficient approach, Gradient Boosted Regression Trees (GBRT), is adopted for the first time to predict RSA.
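As an illustration of the gradient boosting idea named above, the following is a minimal sketch, not the PredRSA implementation. It assumes a squared-error loss and uses one-split regression stumps in place of full regression trees; the function names (`fit_stump`, `fit_gbrt`, `predict_gbrt`) and the default hyperparameters are hypothetical choices made for this example.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals (least squares)."""
    best = None
    for j in range(len(xs[0])):
        values = sorted({x[j] for x in xs})
        # Use each observed value except the largest as a threshold, so that
        # both sides of the split are guaranteed to be non-empty.
        for t in values[:-1]:
            left = [r for x, r in zip(xs, residuals) if x[j] <= t]
            right = [r for x, r in zip(xs, residuals) if x[j] > t]
            lm, rm = mean(left), mean(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:
        # No feature has two distinct values: fall back to a constant.
        m = mean(residuals)
        return (0, float("inf"), m, m)
    return best[1:]


def stump_predict(stump, x):
    """Return the stump's output for one feature vector."""
    j, t, left_value, right_value = stump
    return left_value if x[j] <= t else right_value


def fit_gbrt(xs, ys, n_trees=50, learning_rate=0.1):
    """Boosting loop: each stump is fitted to the current residuals."""
    base = mean(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # For squared-error loss the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict_gbrt(model, x):
    """Sum the base value and the shrunken contributions of all stumps."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)
```

In this sketch each row of `xs` would be the feature vector of one residue and each entry of `ys` its RSA value. A practical model would use deeper trees and a tuned learning rate and tree count.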

Results: Experimental results obtained from 5-fold cross-validation on the Manesh-215 dataset show that the mean absolute error (MAE) and the Pearson correlation coefficient (PCC) of PredRSA are 9.0% and 0.75, respectively, both better than those of existing methods. Moreover, we evaluate the performance of PredRSA using an independent test set of 68 proteins. Compared with the state-of-the-art approaches (SPINE-X and ASAquick), PredRSA achieves a significant improvement in prediction quality.
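The two evaluation measures reported above can be computed as in the following sketch. It uses the standard definitions of MAE and PCC; the function names and the sample values in the usage note are made up for illustration and are not taken from the paper.

```python
from math import sqrt


def mean_absolute_error(true_rsa, pred_rsa):
    """Average absolute difference between true and predicted RSA values."""
    return sum(abs(t - p) for t, p in zip(true_rsa, pred_rsa)) / len(true_rsa)


def pearson_cc(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

For example, with hypothetical per-residue RSA values in percent, `mean_absolute_error([0.0, 20.0, 40.0, 60.0], [5.0, 15.0, 45.0, 55.0])` gives the MAE in the same percentage units, while `pearson_cc` on the same two lists gives a unitless value between -1 and 1.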

Conclusions: Our experimental results show that the Gradient Boosted Regression Trees algorithm and the novel feature combination are quite effective in relative solvent accessibility prediction. The proposed PredRSA method could assist the prediction of protein structures by supplying the predicted RSA as restraints.

Figures

Fig. 1

The definition of the six side-chain environment categories. This figure shows how side-chain environments are classified. RSA is the relative accessible surface area and F is the fraction of the whole side-chain area covered by polar atoms. If RSA < 0.09, the residue is placed in class B (buried). If 0.09 ≤ RSA < 0.36, the residue is placed in class P (partially buried). If RSA ≥ 0.36, the residue is placed in class E (exposed). Within class B, the residue is assigned to B1 if F < 0.45, to B2 if 0.45 ≤ F < 0.58, and to B3 if F ≥ 0.58. Within class P, the residue is assigned to P1 if F < 0.67 and to P2 if F ≥ 0.67
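The thresholds in this caption translate directly into a small decision function. The sketch below assumes RSA and F are both given as fractions between 0 and 1; the function name `side_chain_environment` and the string labels are hypothetical choices for this example.

```python
def side_chain_environment(rsa, polar_fraction):
    """Assign one of the six side-chain environment classes.

    rsa: relative accessible surface area, as a fraction (assumed 0 to 1).
    polar_fraction: F, the fraction of side-chain area covered by polar atoms.
    """
    if rsa < 0.09:
        # Buried residues are split three ways by polar coverage.
        if polar_fraction < 0.45:
            return "B1"
        if polar_fraction < 0.58:
            return "B2"
        return "B3"
    if rsa < 0.36:
        # Partially buried residues are split two ways.
        return "P1" if polar_fraction < 0.67 else "P2"
    # Exposed residues form a single class regardless of F.
    return "E"
```

A residue with RSA 0.05 and F 0.50 falls into B2 under these rules, while any residue with RSA of 0.36 or more falls into E.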

Fig. 2

The framework of PredRSA for protein relative solvent accessibility prediction. Five different types of sequence-derived features are generated and used as input to build the GBRT model. These features consist of PSSM in the form of PSI-BLAST profiles, secondary structure predicted by PSIPRED, native disorder predicted by DISOPRED, conservation score and side-chain environment

Fig. 3

The importance of the five relevant features used in PredRSA. PSSM, SS, DISO, SCE and CS stand for position specific scoring matrix, protein secondary structure, protein native disorder, side-chain environment and conservation score, respectively

Fig. 4

Predicted and experimental values (%) of RSA for each residue of CASP10 T0675

Fig. 5

Correlation between experimental RSA values and predicted RSA values of CASP10 T0675. The Pearson correlation coefficient is 0.92, and the most buried residues are well predicted, with RSA values near zero

Fig. 6

Comparison between true mean values and predicted mean values for the 20 amino acids on the Manesh-215 dataset

Fig. 7

Mean prediction errors for the 20 amino acids on the Manesh-215 dataset. The green line represents the standard root mean square error, the red line represents the mean absolute error and the blue line represents the corresponding data distribution of the 20 amino acids
