Predicting and analyzing DNA-binding domains using a systematic approach to identifying a set of informative physicochemical and biochemical properties

Sat Jan 01 2011

Hui-Lin Huang et al. BMC Bioinformatics. 2011.

Abstract

Background: Existing methods for predicting DNA-binding proteins use features derived from physicochemical properties to design support vector machine (SVM) based classifiers. The selection of physicochemical properties and the determination of their corresponding feature vectors generally rely on known properties of the binding mechanism and on the experience of the designers. This is problematic for designers because some distinct physicochemical properties have similar vectors over the 20 amino acids, while some closely related physicochemical properties have dissimilar vectors.
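To make the notion of a property vector concrete, the following minimal sketch treats a property as a table of 20 values, one per amino acid, compares two such tables, and turns a sequence into a feature vector. The Kyte-Doolittle hydropathy values are a well-known published scale. The `CHARGE` table, the min-max-normalized Euclidean distance in `property_distance`, and the mean-over-residues encoding in `encode` are illustrative assumptions, not details given in the abstract.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Kyte-Doolittle hydropathy scale, one value per amino acid (same order as above).
HYDROPATHY = dict(zip(AMINO_ACIDS, [
    1.8, -4.5, -3.5, -3.5, 2.5, -3.5, -3.5, -0.4, -3.2, 4.5,
    3.8, -3.9, 1.9, 2.8, -1.6, -0.8, -0.7, -0.9, -1.3, 4.2]))

# Simplified side-chain charge near neutral pH.
# Assumption: an illustrative table, not an actual AAindex entry.
CHARGE = {aa: 0.0 for aa in AMINO_ACIDS}
CHARGE.update({"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0})


def normalize(prop):
    """Rescale a property table to the range [0, 1] (min-max normalization)."""
    lo, hi = min(prop.values()), max(prop.values())
    return {aa: (value - lo) / (hi - lo) for aa, value in prop.items()}


def property_distance(p, q):
    """Euclidean distance between two normalized 20-value property vectors."""
    p, q = normalize(p), normalize(q)
    return math.sqrt(sum((p[aa] - q[aa]) ** 2 for aa in AMINO_ACIDS))


def encode(sequence, properties):
    """Represent a sequence by the mean value of each property over its residues."""
    return [sum(prop[aa] for aa in sequence) / len(sequence) for prop in properties]


# Hypothetical six-residue peptide, used only to exercise the functions.
features = encode("MKRLDE", [HYDROPATHY, CHARGE])
distance = property_distance(HYDROPATHY, CHARGE)
```

A small distance between two tables with different names is the situation the Background describes: the labels differ, but a classifier sees nearly the same numbers.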

Results: This study proposes a systematic approach (named Auto-IDPCPs) that automatically identifies a set of physicochemical and biochemical properties in the AAindex database for designing SVM-based classifiers to predict and analyze DNA-binding domains/proteins. Auto-IDPCPs consists of 1) clustering 531 amino acid indices in AAindex into 20 clusters using a fuzzy c-means algorithm, 2) using an efficient genetic-algorithm-based optimization method, IBCGA, to select an informative feature set of size m to represent sequences, and 3) analyzing the selected features to identify related physicochemical properties that may affect the binding mechanism of DNA-binding domains/proteins. Auto-IDPCPs identified m = 22 features of properties belonging to five clusters for predicting DNA-binding domains, with a five-fold cross-validation accuracy of 87.12%, compared with 86.62% for the existing method PSSM-400. For predicting DNA-binding sequences, an accuracy of 75.50% was obtained using m = 28 features, whereas PSSM-400 has an accuracy of 74.22%. On an independent test data set of DNA-binding domains, Auto-IDPCPs and PSSM-400 have accuracies of 80.73% and 82.81%, respectively. Typical physicochemical properties discovered include hydrophobicity, secondary structure, charge, solvent accessibility, polarity, flexibility, normalized van der Waals volume, and pK (pK-C, pK-N, pK-COOH and pK-a(RCOOH)).
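Step 1 above can be sketched with a plain fuzzy c-means loop. The abstract names the algorithm but gives no parameters, so the fuzzifier of 2.0, the fixed iteration count, the random initialization, and the toy data below are all assumptions made for illustration. The real input would be 531 vectors of 20 values each, clustered into 20 groups; the toy input uses six hypothetical vectors of four values each.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, fuzzifier=2.0, n_iter=100, seed=0):
    """Return (centers, memberships) for a list of equal-length vectors."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Start from random memberships; each row is normalized to sum to 1.
    memberships = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        memberships.append([value / total for value in row])
    power = 2.0 / (fuzzifier - 1.0)
    centers = []
    # Fixed number of iterations; no convergence test, to keep the sketch short.
    for _ in range(n_iter):
        # Each center is the membership-weighted mean of all points.
        centers = []
        for j in range(n_clusters):
            weights = [memberships[i][j] ** fuzzifier for i in range(n)]
            total = sum(weights)
            centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)])
        # Membership in a cluster shrinks as the relative distance to its center grows.
        for i in range(n):
            dists = [max(math.dist(points[i], c), 1e-12) for c in centers]
            memberships[i] = [
                1.0 / sum((dists[j] / dists[k]) ** power for k in range(n_clusters))
                for j in range(n_clusters)]
    return centers, memberships


def hard_labels(memberships):
    """Assign each point to the cluster in which its membership is highest."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]


# Hypothetical toy "indices" forming two obvious groups.
toy_indices = [
    [0.90, 0.80, 0.10, 0.20], [0.80, 0.90, 0.20, 0.10], [0.85, 0.85, 0.15, 0.15],
    [0.10, 0.20, 0.90, 0.80], [0.20, 0.10, 0.80, 0.90], [0.15, 0.15, 0.85, 0.85],
]
centers, memberships = fuzzy_c_means(toy_indices, n_clusters=2)
labels = hard_labels(memberships)
```

Unlike k-means, every index keeps a graded membership in every cluster, so an index that sits between two groups is not forced into one of them until `hard_labels` is applied.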

Conclusions: The proposed approach Auto-IDPCPs helps designers investigate informative physicochemical and biochemical properties by considering prediction accuracy and analysis of the binding mechanism simultaneously. Auto-IDPCPs is also applicable to predicting and analyzing other protein functions from sequences.

Figures

**Figure 1**
**An illustrative example**. H88 and A392 are two different properties, yet their distance of 0.0178 is relatively small. In contrast, H88 and H178, which belong to the same AAindex group (Hydrophobicity), have a large distance of 0.0877. H88 and H151, also in the same group, have a distance of 0.0299, which is larger than that between H88 and A392.

**Figure 2**
**The minimum spanning tree of the amino acid indices stored in the AAindex1 release 9.0** [10]. Each rectangle is an amino acid index. Coloured nodes represent the indices classified by Tomii and Kanehisa [11]: red (A), alpha and turn propensities; yellow (B), beta propensity; green (C), composition; blue (H), hydrophobicity; cyan (P), physicochemical properties; gray (O), other properties. White: the indices added to the AAindex after the release 3.0 by Tomii and Kanehisa [11].

**Figure 3**
The flowchart of the proposed approach Auto-IDPCPs

**Figure 4**
The statistical results of property distribution in the six groups: (a) 531 amino acid indices and (b) 402 amino acid indices.

**Figure 5**
The statistical result of S_t in selecting property sets from R = 30 independent runs on DNAset and DNAaset.

**Figure 6**
Prediction accuracies for various numbers of selected properties (a) DNAset and (b) DNAaset.

**Figure 7**
The m properties ranked by main effect difference (MED): (a) m = 22 for DNAset and (b) m = 28 for DNAaset.

**Figure 8**
The appearance frequency of each identified cluster in the 30 runs. The clusters 7, 9, 10, 16 and 18 are more informative.

**Figure 9**
An illustrative example of exploring promising properties. H151 can be inferred from feature sets S1 and S2.

Cited by

GIpred: a computational tool for prediction of GIGANTEA proteins using machine learning algorithm.
Meher PK, Dash S, Sahu TK, Satpathy S, Pradhan SK. Physiol Mol Biol Plants. 2022 Jan;28(1):1-16. doi: 10.1007/s12298-022-01130-6. Epub 2022 Jan 24. PMID: 35221569.
Integration of residue attributes for sequence diversity characterization of terpenoid enzymes.
Kibinge N, Ikeda S, Ono N, Altaf-Ul-Amin M, Kanaya S. Biomed Res Int. 2014;2014:753428. doi: 10.1155/2014/753428. Epub 2014 May 11. PMID: 24900985.
Analysis and prediction of single-stranded and double-stranded DNA binding proteins based on protein sequences.
Wang W, Sun L, Zhang S, Zhang H, Shi J, Xu T, Li K. BMC Bioinformatics. 2017 Jun 12;18(1):300. doi: 10.1186/s12859-017-1715-8. PMID: 28606086.
Use Chou's 5-Step Rule to Predict DNA-Binding Proteins with Evolutionary Information.
Lu W, Song Z, Ding Y, Wu H, Cao Y, Zhang Y, Li H. Biomed Res Int. 2020 Jul 27;2020:6984045. doi: 10.1155/2020/6984045. eCollection 2020. PMID: 32775434.
Identifying DNA-binding proteins by combining support vector machine and PSSM distance transformation.
Xu R, Zhou J, Wang H, He Y, Wang X, Liu B. BMC Syst Biol. 2015;9 Suppl 1(Suppl 1):S10. doi: 10.1186/1752-0509-9-S1-S10. Epub 2015 Feb 6. PMID: 25708928.

References

1. Gao M, Skolnick J. A threading-based method for the prediction of DNA-binding proteins with application to the human genome. PLoS Comput Biol. 2009;5(11):e1000567. doi: 10.1371/journal.pcbi.1000567.
2. Shanahan HP, Garcia MA, Jones S, Thornton JM. Identifying DNA-binding proteins using structural motifs and the electrostatic potential. Nucleic Acids Res. 2004;32:4732–4741. doi: 10.1093/nar/gkh803.
3. Ahmad S, Gromiha MM, Sarai A. Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information. Bioinformatics. 2004;20:477–486. doi: 10.1093/bioinformatics/btg432.
4. Ho SY, Yu FC, Chang CY, Huang HL. Design of accurate predictors for DNA-binding sites in proteins using hybrid SVM-PSSM method. Biosystems. 2007;90(1):234–241. doi: 10.1016/j.biosystems.2006.08.007.
5. Cai YD, Lin SL. Support vector machines for predicting rRNA-, RNA-, and DNA-binding proteins from amino acid sequence. Biochim Biophys Acta. 2003;1648(1-2):127–133.
