Improved predictions of transcription factor binding sites using physicochemical features of DNA
Mark Maienschein-Cline et al. Nucleic Acids Res. 2012 Dec.
Abstract
Typical approaches for predicting transcription factor binding sites (TFBSs) involve use of a position-specific weight matrix (PWM) to statistically characterize the sequences of the known sites. Recently, an alternative physicochemical approach, called SiteSleuth, was proposed. In this approach, a linear support vector machine (SVM) classifier is trained to distinguish TFBSs from background sequences based on local chemical and structural features of DNA. SiteSleuth appears to generally perform better than PWM-based methods. Here, we improve the SiteSleuth approach by considering both new physicochemical features and algorithmic modifications. New features are derived from Gibbs energies of amino acid-DNA interactions and hydroxyl radical cleavage profiles of DNA. Algorithmic modifications consist of inclusion of a feature selection step, use of a nonlinear kernel in the SVM classifier, and use of a consensus-based post-processing step for predictions. We also considered SVM classification based on letter features alone to distinguish performance gains from use of SVM-based models versus use of physicochemical features. The accuracy of each of the variant methods considered was assessed by cross validation using data available in the RegulonDB database for 54 Escherichia coli TFs, as well as by experimental validation using published ChIP-chip data available for Fis and Lrp.
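As a point of reference for the PWM-based baseline described above, the following is a minimal sketch of building and applying a log-odds weight matrix. It assumes a uniform background frequency of 0.25, a pseudocount of 1, and a small set of hypothetical aligned sites; the function names are illustrative and this is not the specific PWM method evaluated in the paper.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from equal-length aligned sites.

    Pseudocount and uniform background are assumptions for illustration.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(length):
        column = [s[pos] for s in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def score(pwm, seq):
    """Sum the per-position log-odds weights for a candidate site."""
    return sum(col[base] for col, base in zip(pwm, seq))

def best_window(pwm, sequence):
    """Return (score, start index) of the highest-scoring window."""
    width = len(pwm)
    windows = [sequence[i:i + width] for i in range(len(sequence) - width + 1)]
    return max((score(pwm, w), i) for i, w in enumerate(windows))

# Hypothetical aligned binding sites, for illustration only.
known_sites = ["TGACA", "TGACA", "TGACT", "TGTCA"]
pwm = build_pwm(known_sites)
top_score, top_start = best_window(pwm, "GGGTGACAGGG")
```

A PWM of this kind treats each position independently and sees only letters, which is the limitation the physicochemical features are meant to address.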
Figures

Schematic of amino acid–DNA interaction. The DNA structure for sequence CAG (and two flanking GC pairs on either end) is superimposed with 108 grid points around the center adenosine nucleotide arranged in a 6 × 6 × 3 grid and a sample isoleucine structure; the three sub-grids in the DNA minor groove, major groove and outside the DNA are colored orange, red and purple, respectively. G is colored green, C blue, A yellow and T red. In the calculations, the α-C of the amino acid is centered at each grid point and rotations and energy calculations are performed as described in Materials and Methods. The grid is defined by six bounding planes: the bounding planes above and below the grid are centered halfway between the central A and the adjacent nucleotides C and G above and below, respectively, and are parallel to each other as well as to the plane of the rings in adenine. The bounding plane on the left is centered between the A and T base-pairing nucleotides and is perpendicular to the previously defined planes. The bounding plane on the right is placed 20 Å from the plane on the left (and parallel to it). The bounding planes in and out of the page are perpendicular to all previously defined planes and 10 Å in or out from the center of the adenine ring. Thus the volume of the grid is 20 × 20 × D Å³, where D is the distance between adjacent nucleotides, typically about 3.5 Å. This figure was created using VMD (51,52).

Schematic of feature mapping procedure. For illustration, we demonstrate the mapping with the features used in SVM-PMM. (A) For a given sequence, we include three flanking sequences (lower case letters) on either side. (B) N-mers of length 3, 4 and 7 are slid across the sequence (the first 3-mer subsequence is indicated in panel (A)) and the resulting subsequences are mapped to chemical features α, structural bp features β, structural bp step features γ and hydroxyl radical cleavage features δ. (C) Those features are concatenated into a feature vector x for the sequence. The feature vector for this 10-mer will have 334 dimensions (6 + 20 for each of 10 3-mers, 2 for each of 10 7-mers and 6 for each of 9 4-mers).
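The dimension count in this caption can be checked with a short sketch. The per-window feature counts (6 + 20, 2 and 6) and the window counts (10, 10 and 9) come from the caption; the way windows are centered on core positions and on steps between adjacent core positions is an assumption chosen to reproduce those counts, and the sequence and function names are hypothetical.

```python
def centered_windows(padded, flank, core_len, width):
    """Windows of odd `width` centered on each core position (assumed layout)."""
    half = width // 2
    return [padded[flank + i - half: flank + i + half + 1]
            for i in range(core_len)]

def step_windows(padded, flank, core_len, width=4):
    """Windows of even `width` centered on each step between adjacent
    core positions (assumed layout)."""
    half = width // 2
    return [padded[flank + i - half + 1: flank + i + half + 1]
            for i in range(core_len - 1)]

# Hypothetical 10-mer with three flanking bases (lower case) on either side.
flank, core = 3, "TGACATTGCA"
padded = "acg" + core + "tta"

w3 = centered_windows(padded, flank, len(core), 3)
w7 = centered_windows(padded, flank, len(core), 7)
w4 = step_windows(padded, flank, len(core), 4)

# Feature counts per window, as stated in the caption.
dimension = len(w3) * (6 + 20) + len(w7) * 2 + len(w4) * 6
```

Under these assumptions the three flanking bases are exactly what the 7-mer windows need at the first and last core positions.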

Training results for all 54 TFs. (A) Average F-measure and training time. (B) Plots of F-measure for each of the six methods relative to BvH, in log-scale (for all TFs with F > 0). We do not report training times for BvH in (A), as they are ≈0. See Figure 4 for scatterplots showing the full range of F-measures for each method, and Table S1 of the Supplementary Materials for their tabulated values.
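For readers unfamiliar with the metric, here is a minimal sketch of an F-measure and of the log-scale comparison against a baseline. It assumes the standard balanced F-measure (harmonic mean of precision and recall); the exact definition used in the paper is not given in this excerpt, and the function names are illustrative.

```python
import math

def f_measure(tp, fp, fn):
    """Balanced F-measure from true positives, false positives, false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def log_ratio(f_method, f_baseline):
    """log10 of a method's F-measure relative to a baseline.

    Returns None when either value is zero, mirroring the caption's
    restriction to TFs with F > 0.
    """
    if f_method <= 0 or f_baseline <= 0:
        return None
    return math.log10(f_method / f_baseline)
```

For example, `f_measure(8, 2, 2)` gives precision and recall of 0.8 each, so the F-measure is also 0.8.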

Scatterplots of F-measure for each method versus SVMR-PMM-FS. For each panel, the horizontal axis is the F-measure for SVMR-PMM-FS. Vertical axes are F-measures for: (A) BvH, (B) SVM-LMM, (C) SiteSleuth, (D) SVM-PMM and (E) SVM-PMM-FS. The boxed dots are the Fis F-measures, and the circled dots are the Lrp F-measures.

Results from verification with ChIP-chip data for Fis and Lrp. Error bars reflect the standard deviation over five independent runs, and thick horizontal bars are the results of five-way consensus analysis. Shown are (A) accuracy (number of ChIP-regions with a predicted TFBS over total number of predicted TFBSs) and (B) the number of predicted TFBSs (in inverted scale). There is no model variability in BvH, so there is no extra consensus-based result for this method. In panel (A), it should be noted that the bar for SiteSleuth has zero height and that the bar for SVMR-PMM-FS is actually much taller than depicted. In panel (B), SiteSleuth has 0 predicted TFBSs for all runs, and SVM-LMM and SVMR-PMM-FS have 0 predicted TFBSs in the five-way consensus.
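The two quantities in this caption can be sketched as follows. This assumes that five-way consensus means keeping only sites predicted at the same position in all five runs, and it implements accuracy literally as worded in the caption; the positions, regions and function names are hypothetical.

```python
def consensus(runs):
    """Sites predicted in every run (exact position match is an assumption)."""
    if not runs:
        return set()
    return set.intersection(*(set(r) for r in runs))

def chip_accuracy(predicted_sites, chip_regions):
    """ChIP regions containing at least one predicted site, divided by the
    number of predicted sites. Returns None when nothing was predicted."""
    if not predicted_sites:
        return None
    hit_regions = sum(
        any(start <= site < end for site in predicted_sites)
        for start, end in chip_regions
    )
    return hit_regions / len(predicted_sites)

# Hypothetical predicted site positions from five independent runs.
runs = [
    [100, 250, 900],
    [100, 900, 1200],
    [100, 900],
    [100, 300, 900],
    [50, 100, 900],
]
agreed = consensus(runs)                      # sites found in all five runs
regions = [(80, 150), (400, 500)]             # hypothetical ChIP regions
accuracy = chip_accuracy(agreed, regions)
```

Returning None for an empty prediction set reflects the cases noted in the caption where a method predicts 0 TFBSs, for which the ratio is undefined.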
Similar articles
- Bauer AL, Hlavacek WS, Unkefer PJ, Mu F. PLoS Comput Biol. 2010 Nov 18;6(11):e1001007. doi: 10.1371/journal.pcbi.1001007. PMID: 21124945. Free PMC article.
- Gao R, Helfant LJ, Wu T, Li Z, Brokaw SE, Stock AM. Nucleic Acids Res. 2021 Nov 18;49(20):11537-11549. doi: 10.1093/nar/gkab935. PMID: 34669947. Free PMC article.
- Hu J, Wang J, Lin J, Liu T, Zhong Y, Liu J, Zheng Y, Gao Y, He J, Shang X. MD-SVM: a novel SVM-based algorithm for the motif discovery of transcription factor binding sites. BMC Bioinformatics. 2019 May 1;20(Suppl 7):200. doi: 10.1186/s12859-019-2735-3. PMID: 31074373. Free PMC article.
- Grainger DC, Busby SJ. Global regulators of transcription in Escherichia coli: mechanisms of action and methods for study. Adv Appl Microbiol. 2008;65:93-113. doi: 10.1016/S0065-2164(08)00604-7. PMID: 19026863. Review.
- Liu B, Yang J, Li Y, McDermaid A, Ma Q. An algorithmic perspective of de novo cis-regulatory motif finding based on ChIP-seq data. Brief Bioinform. 2018 Sep 28;19(5):1069-1081. doi: 10.1093/bib/bbx026. PMID: 28334268. Review.
Cited by
- Trerotola M, Antolini L, Beni L, Guerra E, Spadaccini M, Verzulli D, Moschella A, Alberti S. NAR Genom Bioinform. 2022 Mar 4;4(1):lqac008. doi: 10.1093/nargab/lqac008. eCollection 2022 Mar. PMID: 35261972. Free PMC article.
- Tsai ZT, Shiu SH, Tsai HK. PLoS Comput Biol. 2015 Aug 20;11(8):e1004418. doi: 10.1371/journal.pcbi.1004418. eCollection 2015 Aug. PMID: 26291518. Free PMC article.
- Chen W, Zhang L. The pattern of DNA cleavage intensity around indels. Sci Rep. 2015 Feb 9;5:8333. doi: 10.1038/srep08333. PMID: 25660536. Free PMC article.
- Peng PC, Sinha S. Quantitative modeling of gene expression using DNA shape features of binding sites. Nucleic Acids Res. 2016 Jul 27;44(13):e120. doi: 10.1093/nar/gkw446. Epub 2016 Jun 1. PMID: 27257066. Free PMC article.
- Nowak-Lovato K, Alexandrov LB, Banisadr A, Bauer AL, Bishop AR, Usheva A, Mu F, Hong-Geller E, Rasmussen KØ, Hlavacek WS, Alexandrov BS. Binding of nucleoid-associated protein fis to DNA is regulated by DNA breathing dynamics. PLoS Comput Biol. 2013;9(1):e1002881. doi: 10.1371/journal.pcbi.1002881. Epub 2013 Jan 17. PMID: 23341768. Free PMC article.
References
- Holtz WJ, Keasling JD. Engineering static and dynamic control of synthetic pathways. Cell. 2010;140:19–23.
- Johnson DS, Mortazavi A, Myers RM, Wold B. Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007;316:1497–1502.
- Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier B, Varhol R, Delaney A, et al. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat. Methods. 2007;4:651–657.
- Stormo GD, Zhao Y. Determining the specificity of protein-DNA interactions. Nat. Rev. Genet. 2010;11:751–760.