Improved predictions of transcription factor binding sites using physicochemical features of DNA

Mark Maienschein-Cline et al. Nucleic Acids Res. 2012 Dec.

Abstract

Typical approaches for predicting transcription factor binding sites (TFBSs) involve use of a position-specific weight matrix (PWM) to statistically characterize the sequences of the known sites. Recently, an alternative physicochemical approach, called SiteSleuth, was proposed. In this approach, a linear support vector machine (SVM) classifier is trained to distinguish TFBSs from background sequences based on local chemical and structural features of DNA. SiteSleuth appears to generally perform better than PWM-based methods. Here, we improve the SiteSleuth approach by considering both new physicochemical features and algorithmic modifications. New features are derived from Gibbs energies of amino acid-DNA interactions and hydroxyl radical cleavage profiles of DNA. Algorithmic modifications consist of inclusion of a feature selection step, use of a nonlinear kernel in the SVM classifier, and use of a consensus-based post-processing step for predictions. We also considered SVM classification based on letter features alone to distinguish performance gains from use of SVM-based models versus use of physicochemical features. The accuracy of each of the variant methods considered was assessed by cross validation using data available in the RegulonDB database for 54 Escherichia coli TFs, as well as by experimental validation using published ChIP-chip data available for Fis and Lrp.
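For readers unfamiliar with the PWM baseline mentioned above, the following is a minimal sketch of a log-odds position weight matrix built from aligned known sites. It is not the BvH or SiteSleuth implementation: the function names, the pseudocount of 1, the uniform background and the toy sites are all assumptions made for illustration.

```python
import math
from collections import Counter

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=None):
    """Return one {base: log2-odds score} dict per aligned position.

    The pseudocount and uniform background are illustrative assumptions.
    """
    background = background or {b: 0.25 for b in BASES}
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        pwm.append({
            b: math.log2((counts[b] + pseudocount) / total / background[b])
            for b in BASES
        })
    return pwm

def score(pwm, seq):
    """Sum the per-position log-odds scores for a sequence of PWM width."""
    return sum(column[base] for column, base in zip(pwm, seq))

def best_window(pwm, sequence):
    """Return (score, start) of the highest-scoring window in a sequence."""
    width = len(pwm)
    return max(
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

# Made-up aligned sites, not taken from RegulonDB.
toy_sites = ["TTGACA", "TTGACT", "TTTACA", "TTGATA"]
pwm = build_pwm(toy_sites)
```

Calling `best_window(pwm, "GGGTTGACAGGG")` scans every 6-letter window and returns the score and start index of the best match. A PWM of this kind sees only letter frequencies at each position, which is the limitation the physicochemical features are meant to address.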

Figures

Figure 1.

Schematic of amino acid–DNA interaction. The DNA structure for sequence CAG (and two flanking GC pairs on either end) is superimposed with 108 grid points around the center adenosine nucleotide arranged in a 6 × 6 × 3 grid and a sample isoleucine structure; the three sub-grids in the DNA minor groove, major groove and outside the DNA are colored orange, red and purple, respectively. G is colored green, C blue, A yellow and T red. In the calculations, the α-C of the amino acid is centered at each grid point and rotations and energy calculations are performed as described in Materials and Methods. The grid is defined by six bounding planes: the bounding planes above and below the grid are centered halfway between the central A and the adjacent nucleotides C and G above and below, respectively, and are parallel to each other as well as to the plane of the rings in adenine. The bounding plane on the left is centered between the A and T base-pairing nucleotides and is perpendicular to the previously defined planes. The bounding plane on the right is placed 20 Å from the plane on the left (and parallel to it). The bounding planes in and out of the page are perpendicular to all previously defined planes and 10 Å in or out from the center of the adenine ring. Thus the volume of the grid is 20 × 20 × D Å³, where D is the distance between adjacent nucleotides, typically about 3.5 Å. This figure was created using VMD (51,52).

Figure 2.

Schematic of feature mapping procedure. For illustration, we demonstrate the mapping with the features used in SVM-PMM. (A) For a given sequence, we include three flanking sequences (lower case letters) on either side. (B) N-mers of length 3, 4 and 7 are slid across the sequence (the first 3-mer subsequence is indicated in panel (A)) and the resulting subsequences are mapped to chemical features α, structural bp features β, structural bp step features γ and hydroxyl radical cleavage features δ. (C) Those features are concatenated into a feature vector x for the sequence. The feature vector for this 10-mer will have 334 dimensions (6 + 20 for each of 10 3-mers, 2 for each of 10 7-mers and 6 for each of 9 4-mers).
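The dimension count in the Figure 2 caption can be checked with a short sketch. The per-window feature counts (6 + 20, 2 and 6) and the window counts (10, 10 and 9) come from the caption; the rule used here to place windows, one 3-mer and one 7-mer centered on each site position and one 4-mer centered on each step between adjacent positions, is an assumption that happens to reproduce those counts. The names `FEATURE_GROUPS`, `windows` and `feature_dimension`, and the example sequence, are hypothetical.

```python
# window name -> (width, features per window, anchor type)
# Feature counts are from the Figure 2 caption; anchoring is an assumption.
FEATURE_GROUPS = {
    "3-mer": (3, 6 + 20, "base"),  # one window per site position
    "7-mer": (7, 2, "base"),       # one window per site position
    "4-mer": (4, 6, "step"),       # one window per step between positions
}

def windows(site, left, right, width, anchor):
    """Slide a window of the given width across a flanked site.

    Flanks are lower-cased and the site upper-cased, as in panel (A).
    """
    padded = left.lower() + site.upper() + right.lower()
    offset = len(left)
    n_anchors = len(site) if anchor == "base" else len(site) - 1
    half = (width - 1) // 2
    found = []
    for i in range(n_anchors):
        start = offset + i - half
        if start < 0 or start + width > len(padded):
            raise ValueError("flanking sequence too short for this window")
        found.append(padded[start:start + width])
    return found

def feature_dimension(site, left, right):
    """Total length of the concatenated feature vector for one site."""
    return sum(
        len(windows(site, left, right, width, anchor)) * per_window
        for width, per_window, anchor in FEATURE_GROUPS.values()
    )

# Hypothetical 10-mer with three flanking letters on either side.
dim = feature_dimension("ACGTACGTAC", "ttt", "ggg")
```

Under these assumptions `dim` is 10 × 26 + 10 × 2 + 9 × 6 = 334, matching the caption. Three flanking letters are exactly what the 7-mer window needs at the first and last site positions.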

Figure 3.

Training results for all 54 TFs. (A) Average F-measure and training time. (B) Plots of F-measure for each of the six methods relative to BvH, in log-scale (for all TFs with F > 0). We do not report training times for BvH in (A), as they are ≈0. See Figure 4 for scatterplots showing the full range of F-measures for each method, and Table S1 of the Supplementary Materials for their tabulated values.

Figure 4.

Scatterplots of F-measure for each method versus SVMR-PMM-FS. For each panel, the horizontal axis is the F-measure for SVMR-PMM-FS. Vertical axes are F-measures for: (A) BvH, (B) SVM-LMM, (C) SiteSleuth, (D) SVM-PMM and (E) SVM-PMM-FS. The boxed dots are the Fis F-measures, and the circled dots are the Lrp F-measures.

Figure 5.

Results from verification with ChIP-chip data for Fis and Lrp. Error bars reflect the standard deviation over five independent runs, and thick horizontal bars are the results of five-way consensus analysis. Shown are (A) accuracy (number of ChIP regions with a predicted TFBS over the total number of predicted TFBSs) and (B) the number of predicted TFBSs (in inverted scale). There is no model variability in BvH, so there is no extra consensus-based result for this method. In panel (A), it should be noted that the bar for SiteSleuth has zero height and that the bar for SVMR-PMM-FS is actually much taller than depicted. In panel (B), SiteSleuth has 0 predicted TFBSs for all runs, and SVM-LMM and SVMR-PMM-FS have 0 predicted TFBSs in the five-way consensus.
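The two quantities in this caption can be illustrated with a small sketch. It assumes that five-way consensus means keeping only sites predicted in all five runs, and it takes the accuracy ratio literally from the caption; the paper's exact procedure is not given here. Representing a predicted site by an integer coordinate and a ChIP region by a (start, end) pair, and reporting 0.0 when nothing is predicted, are further assumptions, as are the function names and numbers.

```python
from collections import Counter

def consensus(runs, min_votes=None):
    """Keep sites predicted in at least min_votes runs (default: all runs)."""
    if min_votes is None:
        min_votes = len(runs)
    votes = Counter(site for run in runs for site in set(run))
    return {site for site, n in votes.items() if n >= min_votes}

def chip_accuracy(predicted_sites, chip_regions):
    """ChIP regions containing a predicted site / total predicted sites.

    Returns 0.0 when there are no predictions (an assumed convention).
    """
    if not predicted_sites:
        return 0.0
    regions_hit = sum(
        1
        for start, end in chip_regions
        if any(start <= site <= end for site in predicted_sites)
    )
    return regions_hit / len(predicted_sites)

# Made-up predictions from five independent training runs.
runs = [
    {100, 250, 900},
    {100, 250},
    {100, 250, 400},
    {100, 250, 900},
    {100, 250, 777},
]
agreed = consensus(runs)

# Made-up ChIP-chip regions as inclusive (start, end) coordinates.
regions = [(90, 120), (500, 600)]
accuracy = chip_accuracy(agreed, regions)
```

Here only sites 100 and 250 survive the consensus step, and one of the two regions contains a surviving site, so `accuracy` is 0.5. Requiring agreement across runs trades the number of predictions for reliability, which fits the caption's note that some methods have 0 predicted TFBSs in the five-way consensus.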

References

    1. Khalil AS, Collins JJ. Synthetic biology: applications come of age. Nat. Rev. Genet. 2010;11:367–379.
    2. Holtz WJ, Keasling JD. Engineering static and dynamic control of synthetic pathways. Cell. 2010;140:19–23.
    3. Johnson DS, Mortazavi A, Myers RM, Wold B. Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007;316:1497–1502.
    4. Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier B, Varhol R, Delaney A, et al. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat. Methods. 2007;4:651–657.
    5. Stormo GD, Zhao Y. Determining the specificity of protein-DNA interactions. Nat. Rev. Genet. 2010;11:751–760.
