Local descriptors of protein structure: a systematic analysis of the sequence-structure relationship in proteins using short- and long-range interactions

Torgeir R Hvidsten et al. Proteins. 2009 Jun.

Abstract

Local protein structure representations that incorporate long-range contacts between residues are often considered in protein structure comparison but have found relatively little use in structure prediction, where assembly from single backbone fragments dominates. Here, we introduce the concept of local descriptors of protein structure to characterize local neighborhoods of amino acids including short- and long-range interactions. We build a library of recurring local descriptors and show that this library is general enough to allow assembly of unseen protein structures. The library could on average re-assemble 83% of 119 unseen structures, and showed little or no performance decrease between homologous targets and targets with folds not represented among domains used to build it. We then systematically evaluate the descriptor library to establish the level of the sequence signal in sets of protein fragments of similar geometrical conformation. In particular, we test whether that signal is strong enough to facilitate correct assignment and alignment of these local geometries to new sequences. We use the signal to assign descriptors to a test set of 479 sequences with less than 40% sequence identity to any domain used to build the library, and show that on average more than 50% of the backbone fragments constituting descriptors can be correctly aligned. We also use the assigned descriptors to infer SCOP folds, and show that correct predictions can be made in many of the 151 cases where PSI-BLAST was unable to detect significant sequence similarity to proteins in the library. Although the combinatorial problem of simultaneously aligning several fragments to sequence is a major bottleneck compared with single-fragment methods, the advantage of the current approach is that correct alignments imply correct long-range distance constraints. The lack of these constraints is most likely the major reason why structure prediction methods fail to consistently produce adequate models when good templates are unavailable or undetectable. Thus, we believe that the current study offers new and valuable insight into the prediction of sequence-structure relationships in proteins.

Copyright 2008 Wiley-Liss, Inc.

Figures

Figure 1

(a) Three local descriptors centered on residues 79 (bottom), 173 (middle), and 192 (top) are illustrated as protein cartoons on the wireframe structure of the anaerobic cobalt chelatase (PDB code 1qgo, chain A). Cβ atoms of the central residues of the descriptors are shown as dark balls. (b) The descriptor centered on residue 80 in the same protein as in (a). Because local descriptors incorporate side-chain directionality, close residues pointing in opposite directions may result in very different descriptors. This is particularly well illustrated by the central residues 79 and 80 in this example [i.e., (a) bottom vs. (b)].

Figure 2

Optimal structural superpositions of descriptors similar to the seed descriptors (a) 1qgoa_#192 (202 descriptors) and (b) 1qgoa_#80 (378 descriptors). Descriptors are named using the syntax “domain name ‘#’ central amino acid”. The seed descriptors of these two groups are also shown in Figure 1, slightly rotated. The corresponding sequence alignment for some of the proteins in group 1qgoa_#192 is shown in Table I.
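The naming syntax in this caption can be read mechanically. The following is a minimal sketch, assuming the text before ‘#’ is the domain name (kept verbatim, including any trailing underscore) and the text after it is a plain integer residue number. The function name `parse_descriptor_name` is hypothetical and is not taken from the paper.

```python
def parse_descriptor_name(name):
    """Split a descriptor name such as '1qgoa_#192' into (domain, central residue).

    Assumption: the central residue is a plain non-negative integer; residue
    numbers with insertion codes or negative values are not handled here.
    """
    domain, sep, central = name.rpartition("#")
    if not sep or not domain or not central.isdigit():
        raise ValueError(f"not a descriptor name: {name!r}")
    return domain, int(central)


# The two seed descriptors named in the caption.
seed_a = parse_descriptor_name("1qgoa_#192")
seed_b = parse_descriptor_name("1qgoa_#80")
```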

Figure 3

The number and size of descriptor groups (logarithmic scale) for different numbers of segments.

Figure 4

Structural coverage of the Nifb protein (PDB code 1O13, CASP comparative modeling target) and the SOR45 protein (PDB code 1TZA, CASP fold recognition target) with local structures from the group library. The bars show how many groups cover the particular residue in the sequence.
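The per-residue bars described in this caption amount to counting, for each sequence position, how many descriptor groups cover it. The sketch below illustrates that bookkeeping only. It assumes each group match is given as a list of inclusive, zero-based (start, end) segments; this input format and the function names are hypothetical, not the authors' data structures.

```python
def coverage_per_residue(length, group_matches):
    """Count how many groups cover each residue of a sequence.

    group_matches: one entry per descriptor group, each a list of inclusive,
    zero-based (start, end) segments (an assumed input format).
    """
    counts = [0] * length
    for segments in group_matches:
        covered = set()
        for start, end in segments:
            covered.update(range(start, end + 1))
        # A group counts at most once per residue, even if its segments overlap.
        for i in covered:
            if 0 <= i < length:
                counts[i] += 1
    return counts


def fraction_covered(counts):
    """Fraction of residues covered by at least one group."""
    return sum(c > 0 for c in counts) / len(counts)


# Toy example: a 10-residue sequence matched by two groups.
counts = coverage_per_residue(10, [[(0, 2), (6, 7)], [(1, 4)]])
```

Here `counts` holds the bar heights for the toy sequence, and `fraction_covered(counts)` gives the share of residues covered by at least one group.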

Figure 5

Coverage of CASP targets using the local substructures from the group library. ALL, all targets assessed; CM, comparative modeling targets (87 domains); FR, fold recognition (56 domains); NF, new fold (15 domains). Splines have been fitted through the data for each of the four presented data sets.

Figure 6

Frequencies for valine and hydrophobic amino acids as a function of descriptor group size. The solid lines correspond to the upper and lower boundaries for the 90% confidence interval defining significant under- and over-representation. In general, smaller variation around the a priori frequencies is required for an observed frequency to be significant as the group size increases and more evidence is available. The small a priori frequencies for single amino acids give little room for significant under-representation, something that may only be observed for relatively large groups.
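The dependence of the significance boundaries on group size can be illustrated with a simple interval. The caption does not say how the 90% interval was computed, so the sketch below assumes a normal approximation to the binomial distribution, which may differ from the authors' procedure. The function name and the a priori frequency of 0.07 are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def frequency_bounds(prior, group_size, level=0.90):
    """Two-sided bounds around an a priori frequency for a given group size.

    Assumption: normal approximation to the binomial; bounds are clipped
    to the valid frequency range [0, 1].
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(prior * (1 - prior) / group_size)
    return max(0.0, prior - half_width), min(1.0, prior + half_width)


# Hypothetical a priori frequency for a single amino acid.
small_group = frequency_bounds(0.07, 10)
large_group = frequency_bounds(0.07, 1000)
```

Under these assumptions the lower bound for the small group is clipped to zero, so under-representation cannot be significant there, while the large group has a narrow interval with a positive lower bound.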

Figure 7

The 479 test domains plotted against the E-score to the closest training domain as given by PSI-BLAST, and the P-value for the top fold assignment as given by the descriptor approach. E-scores were not available (NA) for targets where PSI-BLAST could not find any relation to any training domain; these domains were therefore given random E-scores larger than 10. The test domains are plotted with different colors and shapes to show which method, if any, was able to correctly identify domain folds. Hollow shapes indicate that the descriptor approach did not identify the correct fold as its top assignment, but that the correct fold was still among the top five assignments given by this method. The horizontal line indicates the E-score threshold at which predictions from PSI-BLAST may no longer be trusted. The vertical line indicates the corresponding P-value threshold for the descriptor approach. This P-value threshold was selected to maximize the sum of sensitivity (i.e., the fraction of correctly predicted test domains with a lower P-value) and specificity (i.e., the fraction of incorrectly predicted test domains with a higher P-value). The Receiver Operating Characteristic (ROC) curve shows sensitivity versus specificity for the full spectrum of such P-value thresholds. The selected threshold of 3.39E-14 corresponds to a sensitivity of 0.84 and a specificity of 0.80.
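The threshold selection described in this caption can be sketched as a scan over candidate P-value cutoffs. This is a minimal illustration under stated assumptions: candidate thresholds are the observed P-values, a prediction at exactly the threshold counts as below it, and ties keep the smallest threshold. The function name and the toy data are hypothetical.

```python
def best_threshold(p_values, is_correct):
    """Return (threshold, sensitivity, specificity) maximizing their sum.

    Sensitivity: fraction of correct predictions with P-value <= threshold.
    Specificity: fraction of incorrect predictions with P-value > threshold.
    """
    correct = [p for p, ok in zip(p_values, is_correct) if ok]
    incorrect = [p for p, ok in zip(p_values, is_correct) if not ok]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect prediction")

    best = None
    for t in sorted(set(p_values)):
        sensitivity = sum(p <= t for p in correct) / len(correct)
        specificity = sum(p > t for p in incorrect) / len(incorrect)
        # Strict '>' keeps the smallest threshold among equally good ones.
        if best is None or sensitivity + specificity > best[1] + best[2]:
            best = (t, sensitivity, specificity)
    return best


# Toy data: two correct and two incorrect fold assignments.
threshold, sens, spec = best_threshold(
    [1e-20, 1e-15, 1e-10, 1e-3], [True, True, False, False]
)
```

Evaluating sensitivity and specificity at every candidate threshold, rather than only at the best one, gives the points of the ROC curve mentioned in the caption.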
