Building native protein conformation from NMR backbone chemical shifts using Monte Carlo fragment assembly
Abstract
We have been analyzing the extent to which protein secondary structure determines protein tertiary structure in simple protein folds. An earlier paper demonstrated that three-dimensional structure can be obtained successfully using only highly approximate backbone torsion angles for every residue. Here, the initial information is further diluted by introducing a realistic degree of experimental uncertainty into this process. In particular, we tackle the practical problem of determining three-dimensional structure solely from backbone chemical shifts, which can be measured directly by NMR and are known to be correlated with a protein's backbone torsion angles. Extending our previous algorithm to incorporate these experimentally determined data, clusters of structures compatible with the experimentally determined chemical shifts were generated by fragment assembly Monte Carlo. The cluster that corresponds to the native conformation was then identified based on four energy terms: steric clash, solvent-squeezing, hydrogen-bonding, and hydrophobic contact. Currently, the method has been applied successfully to five small proteins with simple topology. Although still under development, this approach offers promise for high-throughput NMR structure determination.
Keywords: protein structure/folding, NMR spectroscopy, computational analysis
Our work focuses on the backbone, that protein component from which almost all relevant structural information can be determined (Rose et al. 2006). From the backbone structure alone, the fold can be classified (Murzin et al. 1995; Orengo et al. 1997; Kamat and Lesk 2007), evolutionary relationships can be identified (Murzin et al. 1995), and side-chain torsions can be calculated (Canutescu et al. 2003). Previously, we demonstrated that proteins with similar backbone torsion angles have similar folds (Gong and Rose 2005). Later work indicated that even highly approximate backbone torsion angles are sufficient to generate native backbone topology, using a simple algorithm based on hydrogen bonding, molecular compaction, and steric exclusion (Gong et al. 2005). This approach was further extended for proteins with uncomplicated topologies, showing that native backbone conformation can often be captured solely from knowledge of secondary structure (Fleming et al. 2006).
Approximate backbone torsion angles are the only experimental data required by these previous exercises (Gong and Rose 2005; Gong et al. 2005; Fleming et al. 2006), and distributions of backbone torsions can be obtained readily from NMR chemical shifts (Cornilescu et al. 1999). Combining these approaches, we present a new, three-stage algorithm to build native topology from chemical shifts alone.
In the first stage, experimentally determined chemical shifts are used to search the protein database (Berman et al. 2000) for backbone fragments with similar chemical shifts. In the second stage, these backbone fragments are “stitched” together in a self-consistent manner, using Monte Carlo simulation (Gong et al. 2005). In the third stage, side chains are added to the lowest energy backbone conformation, using a side-chain rotamer library (Canutescu et al. 2003) followed by conjugate gradient energy minimization (Phillips et al. 2005). Results are presented for calbindin, CspA, GB3, ubiquitin, and DinI, five proteins with uncomplicated topologies (Table 1). This three-stage algorithm, which was motivated by ideas about the mechanism of protein folding (Rose et al. 2006), holds promise for further development as a high-throughput method for NMR structure determination.
Table 1.
Results for proteins used in this paper
Results
Values of backbone root-mean-square deviation (RMSD) from the native conformation are plotted as histograms in Figure 1. All conformations from GB3 simulations are tightly clustered around the native conformation, with RMSD values <5 Å. Although corresponding distributions for the other four proteins are broader, all extend to low RMSD values, and the native conformation can be successfully identified in the simulated ensemble, as described next.
Figure 1.
Root-mean-square differences (RMSD) between the experimentally determined native structure and all simulated structures. For the five proteins, these RMSD values (backbone atoms only) are plotted as histograms, with binned RMSD values shown on the abscissa and the fraction of simulated structures having that RMSD shown on the ordinate.
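For reference, the backbone RMSD values discussed here are assumed to follow the standard definition, evaluated after optimal rigid-body superposition of the model onto the native structure. The source does not state the superposition procedure, so the notation below is illustrative: x_i and y_i denote the positions of the N corresponding backbone atoms in the simulated and native structures, R is a rotation, and t is a translation.

```latex
\mathrm{RMSD}
  = \min_{R,\,\mathbf{t}}
    \sqrt{\frac{1}{N}\sum_{i=1}^{N}
      \left\lVert R\,\mathbf{x}_i + \mathbf{t} - \mathbf{y}_i \right\rVert^{2}}
```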
Distributions of simulation energy vs. backbone RMSD are replotted in Figure 2 for the entire ensemble. With the exception of a few outliers, all plots reveal the same trend: Simulation energy is positively correlated with backbone RMSD, suggesting that our four-term potential can discriminate native from nonnative conformations.
Figure 2.
Distributions of simulation energy vs. backbone RMSD in 400 independent simulations for each of the five proteins.
Three of the energy terms, namely steric exclusion (E_soft_debump), hydrogen bonding (E_HB), and global compaction (E_confine), were used in our earlier studies (Gong et al. 2005; Fleming et al. 2006). All are necessary (Fleming et al. 2006), and plausibly so: Native globular proteins are compact, self-avoiding chains with few unsatisfied hydrogen bonding groups (Fleming and Rose 2005; Rose et al. 2006). Here, an additional contact energy term, E_contact, was introduced to help capture the tendency of hydrophobic side chains to be sequestered within the protein interior and polar side chains to be exposed to the surrounding solvent (Kauzmann 1959; Dill 1985). E_contact is only a crude approximation to the energy of pairwise interaction, but it serves to bias the simulated structure toward secondary structure elements having native handedness, especially in all-alpha proteins (e.g., calbindin). Typically, opposite handedness of alpha-helices results in exposed apolar residues and buried polar residues (data not shown). Of the four energy terms, hydrogen bonding is dominant.
Owing to the presence of occasional confounding outliers in Figure 2, the positive correlation between energy and backbone RMSD is not quite sufficient to identify the most stable conformation. Such outliers are an unavoidable consequence of the highly approximate energy function used here. However, this problem can be overcome using structural clustering to recognize and retain all large conformational clusters while eliminating any sparsely distributed, uncorrelated conformations.
The major conformational clusters for all proteins are listed in Table 1. In each case, the lowest energy structures and the native structure are subsumed within the largest cluster (Table 1).
For each protein, the representative structural cluster is selected by total simulation energy. Of note, the hydrogen bond potential alone is sufficient to discriminate among them, underscoring the guiding importance of hydrogen bond satisfaction in protein folding (Myers and Pace 1996; Fleming and Rose 2005; Street et al. 2006).
The selected clusters are shown in Table 1. All are native-like, with low mean backbone RMSD (<4.2 Å) from the native conformation. Within each, the single most stable conformation was chosen as the final structural model. For the five proteins, the average backbone RMSD from the native conformation is 3.6 Å. Stereoviews of these chosen models are shown superimposed on their native counterparts in Figure 3, visual corroboration that our protocol successfully identifies the native topology. These backbone models were then decorated with side chains, starting with a backbone-dependent rotamer library (Canutescu et al. 2003) followed by energy minimization (Phillips et al. 2005), resulting in an average all-atom RMSD from the native conformation of 4.1 Å for the five proteins (Table 1).
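The two-level selection described above (first a cluster, then a single model within it) can be sketched as follows. This is a minimal illustration rather than the authors' code: ranking clusters by their mean energy is an assumption, since the text says only that the cluster is selected by total simulation energy, and the function name and data layout are hypothetical.

```python
def select_model(clusters):
    """Pick the final model from clustered simulation results.

    clusters: dict mapping a cluster id to a list of (energy, model_id) pairs.
    Returns (cluster_id, model_id).
    """
    def mean_energy(members):
        # Assumption: a cluster is ranked by the mean energy of its members.
        return sum(energy for energy, _ in members) / len(members)

    # Step 1: choose the cluster with the lowest (most favorable) mean energy.
    best_cluster = min(clusters, key=lambda cid: mean_energy(clusters[cid]))
    # Step 2: within that cluster, choose the single lowest energy conformation.
    _, best_model = min(clusters[best_cluster], key=lambda member: member[0])
    return best_cluster, best_model
```

Ranking clusters by their mean, rather than by their single best member, keeps one low-energy outlier from deciding the outcome, which is in the spirit of the clustering step.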
Figure 3.
Stereoviews of the lowest energy model from the lowest energy cluster (red) superimposed on the experimentally determined native structure (green). (A) calbindin, (B) CspA, (C) GB3, (D) ubiquitin, and (E) DinI.
Discussion
Protein structure is an indispensable prerequisite to understanding protein function. Despite an exponentially increasing number of solved protein structures, the gap between known structures and known sequences is widening even more rapidly, escalating the need for high-throughput structure determination. X-ray crystallography and solution-state nuclear magnetic resonance (NMR) are the primary methods for structure determination. Despite recent strides, NMR remains a lengthy, labor-intensive process owing to the time required for both data collection and resonance assignment (Montelione et al. 2000). Among the three major steps in traditional NMR structure determination (assignment of backbone resonances, side-chain resonances, and NOE cross-peaks), the latter two steps are especially time-consuming (Moseley and Montelione 1999; Montelione et al. 2000).
More recently, residual dipolar couplings (RDC) have been used to facilitate rapid structure determination by NMR. RDCs depend on the orientation of internuclear vectors with respect to a global axis system, and they can be measured in weakly aligned proteins dissolved in anisotropic media. The acquisition of RDCs requires comparatively little additional data collection time beyond backbone resonance assignments but provides considerable additional structural information (Prestegard et al. 2000; Tolman et al. 2001). Several structure determination methods based on RDC-restrained models have been developed (Delaglio et al. 2000; Andrec et al. 2001; Rohl and Baker 2002; Kontaxis et al. 2005; Mayer et al. 2006). However, methods that rely primarily on RDC restraints require multiple data sets collected in different media to overcome restraint degeneracy. Alternatively, chemical shifts and NOEs can be used to supplement RDC restraints for full structure determination (Mayer et al. 2006). Yet another approach involves post-processing database analysis to distinguish between true and false positives (Andrec et al. 2001).
In related work, Baker and colleagues extended their Rosetta programs to identify the native fold using residual dipolar couplings (RDC) together with chemical shifts (Rohl and Baker 2002). They conclude that addition of RDC data helps to filter out false positives in fragment selection. Our simulations, which rely solely on data from chemical shifts, seek to filter false positives in fragment assembly by using a simple potential function followed by clustering.
Previous work demonstrated the strong correlation between chemical shifts and protein backbone conformation (Cornilescu et al. 1999). Markley and coworkers developed PECAN (http://bija.nmrfam.wisc.edu/PECAN/), a program for conformational analysis from chemical shifts that emphasizes secondary structure prediction (Eghbalnia et al. 2005). Wishart and coworkers developed ShiftX and related programs (http://redpoll.pharmacy.ualberta.ca/shiftx/) that focus on the converse problem of predicting chemical shifts from known protein structure (Neal et al. 2003).
Extending these earlier ideas, the five examples presented here raise the possibility that chemical shifts may be sufficient to decipher native protein backbone conformation, at least for proteins with uncomplicated topologies. Our approach affords further possibilities as well. For example, recent work of Avbelj et al. (2004) has potential for use as a fragment filter by incorporating the under-realized relationship between chemical shifts and solvent accessibility. In sum, the algorithm described here represents a promising, open-ended approach to rapid and high-throughput protein structure determination by measuring backbone chemical shifts and running short simulations.
Materials and Methods
Suitable candidates were selected from a fragment library and incorporated into self-consistent structures using fragment-assembly Monte Carlo simulation. Then, ensembles of structures constructed in this way were clustered and assessed.
Stage I: Fragment library construction and fragment searching
A fragment library was constructed using 5665 protein chains from the PISCES server (Wang and Dunbrack Jr. 2003), all with sequence identity <40%, resolution <2.5 Å, and an R factor of 1.0 or better. Chains were split into consecutively overlapping six-residue fragments, and chemical shifts of each fragment were calculated using the SPARTA program (Shen and Bax 2007), available from http://spin.niddk.nih.gov/bax. Chemical shifts of target proteins were downloaded from the BioMagResBank (BMRB) server (http://www.bmrb.wisc.edu). Suitable candidates for fragment substitution were identified by comparing experimentally determined chemical shifts of a target protein against calculated chemical shifts of library fragments, after first eliminating any fragments from proteins having the same topology as the target protein. For every consecutive six-residue segment in the target protein, the 20 most similar fragments were selected from the library. Fragment similarity was scored based on both chemical shifts and primary structure.
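The fragment search can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each residue is represented by a single shift value for brevity, and the scoring function (root-mean-square shift difference plus a fixed penalty per sequence mismatch), the penalty weight, and all function names are hypothetical, since the text states only that similarity was scored on both chemical shifts and primary structure.

```python
import math

FRAGMENT_LENGTH = 6  # six-residue fragments, as in the text
TOP_N = 20           # 20 most similar fragments per segment, as in the text
SEQ_PENALTY = 0.5    # hypothetical weight per sequence mismatch (assumption)


def windows(items, size=FRAGMENT_LENGTH):
    """Return all consecutive overlapping windows of the given size."""
    return [items[i:i + size] for i in range(len(items) - size + 1)]


def fragment_score(target_shifts, target_seq, frag_shifts, frag_seq):
    """Lower is more similar. Hypothetical score: RMS shift difference
    plus a penalty for each position whose residue identity differs."""
    squared = [(t - f) ** 2 for t, f in zip(target_shifts, frag_shifts)]
    rms = math.sqrt(sum(squared) / len(squared))
    mismatches = sum(1 for a, b in zip(target_seq, frag_seq) if a != b)
    return rms + SEQ_PENALTY * mismatches


def select_candidates(target_shifts, target_seq, library, top_n=TOP_N):
    """For every six-residue segment of the target, return the top_n
    library fragments, each a (shifts, sequence) pair, with the best score."""
    candidates = []
    for seg_shifts, seg_seq in zip(windows(target_shifts), windows(target_seq)):
        ranked = sorted(
            library,
            key=lambda frag: fragment_score(seg_shifts, seg_seq, frag[0], frag[1]),
        )
        candidates.append(ranked[:top_n])
    return candidates
```

A chain of length n yields n − 5 candidate lists, one per six-residue segment, matching the number of steps per cycle used in Stage II.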
Stage II: Fragment assembly Monte Carlo simulation
Simulations were initiated with the protein backbone in an extended conformation; all side-chain atoms beyond beta carbons were discarded. Standard van der Waals radii and hydrogen-bond criteria were used, as described in Gong et al. (2005). Similar to that earlier protocol (Gong et al. 2005), 50,000 cycles of Metropolis Monte Carlo simulation (Metropolis et al. 1953) were performed, preceded by 5000 relaxation cycles; each cycle consisted of n − 5 steps for a chain of length n. At each step, a randomly chosen six-residue segment of the target peptide was replaced by a randomly chosen library fragment from the list of 20 candidates. Simulated annealing was introduced into the simulation by systematically incrementing β in the Metropolis acceptance factor, exp(−βΔE), over the range 0.5–4.0, where β = 1/RT ≈ 2 at 300 K. Typical processing times are 2–3 h on a desktop computer for each simulation.
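The acceptance test and annealing schedule can be sketched as follows. This is a minimal sketch, not the authors' code: the text says only that β was incremented systematically over the stated range, so the linear ramp is an assumption, and all function names are hypothetical.

```python
import math
import random

BETA_START, BETA_END = 0.5, 4.0  # annealing range given in the text


def beta_schedule(cycle, n_cycles):
    """Increase beta from BETA_START to BETA_END over the run.
    A linear ramp is assumed; the source does not specify the schedule."""
    if n_cycles <= 1:
        return BETA_END
    return BETA_START + (BETA_END - BETA_START) * cycle / (n_cycles - 1)


def metropolis_accept(delta_e, beta, rng=random):
    """Standard Metropolis test: always accept moves that do not raise the
    energy; accept uphill moves with probability exp(-beta * delta_e)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-beta * delta_e)


def steps_per_cycle(chain_length, fragment_length=6):
    """Number of six-residue windows in the chain: n - 5 for length n."""
    return chain_length - fragment_length + 1
```

As β rises, uphill fragment substitutions are accepted less often, so the chain explores broadly early in the run and settles into low-energy conformations toward the end.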
The Metropolis criterion was applied using an energy function with four simple terms: (1) steric exclusion (E_soft_debump), (2) hydrogen bonding (E_HB), (3) global compaction (E_confine), and (4) contact energy (E_contact). The first three terms are identical to those described in our earlier protocol (Gong et al. 2005). Here, a small additional contact energy term, E_contact, was introduced to bias structures toward forming an interior hydrophobic core by assigning one of four discrete values (0.5, −0.25, −0.5, −1.0) to pairwise spatial neighbors based on their polarity (polar:apolar, small apolar:small apolar, small apolar:large apolar, and large apolar:large apolar, respectively). As was the case previously (Gong et al. 2005), hydrogen bonding remains the dominant term in the simulation energy.
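The contact term amounts to a small lookup table over residue classes, which can be sketched as follows. The four energy values come from the text; everything else is an assumption made for illustration: the assignment of residues to the small apolar and large apolar classes, the value of 0.0 for polar:polar pairs (which the text does not specify), and the function names.

```python
# Hypothetical class assignment for illustration only; the text does not
# list which residues belong to each class.
SMALL_APOLAR = frozenset("AVPC")
LARGE_APOLAR = frozenset("LIMFW")


def residue_class(aa):
    """Map a one-letter residue code to 'small', 'large', or 'polar'."""
    if aa in SMALL_APOLAR:
        return "small"
    if aa in LARGE_APOLAR:
        return "large"
    return "polar"


# Values from the text, keyed by the unordered pair of residue classes.
CONTACT_ENERGY = {
    frozenset(["polar", "small"]): 0.5,   # polar:apolar
    frozenset(["polar", "large"]): 0.5,   # polar:apolar
    frozenset(["small"]): -0.25,          # small apolar:small apolar
    frozenset(["small", "large"]): -0.5,  # small apolar:large apolar
    frozenset(["large"]): -1.0,           # large apolar:large apolar
}


def contact_energy(aa_i, aa_j):
    """Energy for one pair of spatial neighbors."""
    pair = frozenset([residue_class(aa_i), residue_class(aa_j)])
    # Assumption: polar:polar pairs contribute nothing (not given in the text).
    return CONTACT_ENERGY.get(pair, 0.0)


def total_contact_energy(sequence, neighbor_pairs):
    """Sum the contact term over (i, j) index pairs of spatial neighbors."""
    return sum(contact_energy(sequence[i], sequence[j]) for i, j in neighbor_pairs)
```

Keying the table on an unordered pair makes the term symmetric in the two residues, so the order in which neighbors are listed does not matter.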
Post-simulation processing
This protocol was applied to each target protein in 400 independent simulations. Conformations with the lowest simulation energy from every simulation were collected and used to construct a representative ensemble. Members of this ensemble were then clustered by structure (viz., an α-carbon distance matrix) using Pycluster (de Hoon et al. 2004). Clusters with a Pearson correlation coefficient >95% and spanning >5% of the ensemble were retained for further analysis, and their energy distribution and RMSD from the native conformation were calculated.
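The quantities used in this filtering step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the clustering itself (done with Pycluster in the paper) is not reproduced, cluster labels are taken as given, and the function names are hypothetical.

```python
import math


def ca_distance_matrix(coords):
    """Flattened upper triangle of the C-alpha distance matrix for a list
    of (x, y, z) coordinates, one per residue."""
    return [
        math.dist(coords[i], coords[j])
        for i in range(len(coords))
        for j in range(i + 1, len(coords))
    ]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def retain_large_clusters(labels, min_fraction=0.05):
    """Return the labels of clusters holding more than min_fraction of the
    ensemble; labels has one cluster label per ensemble member."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    threshold = min_fraction * len(labels)
    return sorted(label for label, count in counts.items() if count > threshold)
```

Comparing distance matrices rather than raw coordinates avoids any need to superimpose structures before measuring their similarity. With 400 simulations per protein, the 5% threshold corresponds to clusters of more than 20 members.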
Side-chain decoration
SCWRL3.0 (Canutescu et al. 2003) was used to add side chains to the lowest energy backbone conformation within the most stable cluster, as specified by the amino acid sequence. Steric clashes, an unavoidable byproduct of SCWRL side-chain decoration, were relieved by 1000 steps of side-chain torsional angle conjugate gradient minimization, using a soft-sphere potential. The model was further optimized by an additional 1000 steps of conjugate gradient minimization using the program NAMD-2.6 (http://www.ks.uiuc.edu/Research/namd) (Phillips et al. 2005), with parameters from the CHARMM22 all-hydrogen force field.
Acknowledgments
We thank Patrick Fleming, Nicholas Fitzkee, Timothy Street, Lauren Perskie, and Ad Bax for useful suggestions. Support from the Burroughs Wellcome Fund (H.G.) and the Mathers Foundation (G.D.R.) is gratefully acknowledged.
Footnotes
Reprint requests to: George D. Rose, T.C. Jenkins Department of Biophysics, Johns Hopkins University, 3400 N. Charles Street, Baltimore, MD 21218, USA; e-mail: grose@jhu.edu; fax: (410) 516-4118.
References
- Andrec, M., Du, P., and Levy, R.M. 2001. Protein backbone structure determination using only residual dipolar couplings from one ordering medium. J. Biomol. NMR 21: 335–347.
- Avbelj, F., Kocjan, D., and Baldwin, R.L. 2004. Protein chemical shifts arising from α-helices and β-sheets depend on solvent exposure. Proc. Natl. Acad. Sci. 101: 17394–17397.
- Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The Protein Data Bank. Nucleic Acids Res. 28: 235–242.
- Canutescu, A.A., Shelenkov, A.A., and Dunbrack Jr., R.L. 2003. A graph-theory algorithm for rapid protein side-chain prediction. Protein Sci. 12: 2001–2014.
- Cornilescu, G., Delaglio, F., and Bax, A. 1999. Protein backbone angle restraints from searching a database for chemical shift and sequence homology. J. Biomol. NMR 13: 289–302.
- de Hoon, M.J., Imoto, S., Nolan, J., and Miyano, S. 2004. Open source clustering software. Bioinformatics 20: 1453–1454.
- Delaglio, F., Kontaxis, G., and Bax, A. 2000. Protein structure determination using molecular fragment replacement and NMR dipolar couplings. J. Am. Chem. Soc. 122: 2142–2143.
- Dill, K.A. 1985. Theory for the folding and stability of globular proteins. Biochemistry 24: 1501–1509.
- Eghbalnia, H.R., Wang, L., Bahrami, A., Assadi, A., and Markley, J.L. 2005. Protein energetic conformational analysis from NMR chemical shifts (PECAN) and its use in determining secondary structural elements. J. Biomol. NMR 32: 71–81.
- Fleming, P.J. and Rose, G.D. 2005. Do all backbone polar groups in proteins form hydrogen bonds? Protein Sci. 14: 1911–1917.
- Fleming, P.J., Gong, H., and Rose, G.D. 2006. Secondary structure determines protein topology. Protein Sci. 15: 1829–1834.
- Gong, H. and Rose, G.D. 2005. Does secondary structure determine tertiary structure in proteins? Proteins 61: 338–343.
- Gong, H., Fleming, P.J., and Rose, G.D. 2005. Building native protein conformation from highly approximate backbone torsion angles. Proc. Natl. Acad. Sci. 102: 16227–16232.
- Kamat, A.P. and Lesk, A.M. 2007. Contact patterns between helices and strands of sheet define protein folding patterns. Proteins 66: 869–876.
- Kauzmann, W. 1959. Some factors in the interpretation of protein denaturation. Adv. Protein Chem. 14: 1–63.
- Kontaxis, G., Delaglio, F., and Bax, A. 2005. Molecular fragment replacement approach to protein structure determination by chemical shift and dipolar homology database mining. Methods Enzymol. 394: 42–78.
- Mayer, K.L., Qu, Y., Bansal, S., LeBlond, P.D., Jenney Jr., F.E., Brereton, P.S., Adams, M.W., Xu, Y., and Prestegard, J.H. 2006. Structure determination of a new protein from backbone-centered NMR data and NMR-assisted structure prediction. Proteins 65: 480–489.
- Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., and Teller, E. 1953. Equation of state calculations by fast computing machines. J. Chem. Phys. 21: 1087–1092.
- Montelione, G.T., Zheng, D., Huang, Y.J., Gunsalus, K.C., and Szyperski, T. 2000. Protein NMR spectroscopy in structural genomics. Nat. Struct. Biol. 7 (Suppl): 982–985.
- Moseley, H.N. and Montelione, G.T. 1999. Automated analysis of NMR assignments and structures for proteins. Curr. Opin. Struct. Biol. 9: 635–642.
- Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536–540.
- Myers, J.K. and Pace, C.N. 1996. Hydrogen bonding stabilizes globular proteins. Biophys. J. 71: 2033–2039.
- Neal, S., Nip, A.M., Zhang, H., and Wishart, D.S. 2003. Rapid and accurate calculation of protein 1H, 13C and 15N chemical shifts. J. Biomol. NMR 26: 215–240.
- Orengo, C.A., Michie, A.D., Jones, D.T., Swindells, M.B., and Thornton, J.M. 1997. CATH – A hierarchic classification of protein domain structures. Structure 5: 1093–1108.
- Phillips, J.C., Braun, R., Wang, W., Gumbart, J., Tajkhorshid, E., Villa, E., Chipot, C., Skeel, R.D., Kale, L., and Schulten, K. 2005. Scalable molecular dynamics with NAMD. J. Comput. Chem. 26: 1781–1802.
- Prestegard, J.H., al-Hashimi, H.M., and Tolman, J.R. 2000. NMR structures of biomolecules using field oriented media and residual dipolar couplings. Q. Rev. Biophys. 33: 371–424.
- Rohl, C.A. and Baker, D. 2002. De novo determination of protein backbone structure from residual dipolar couplings using Rosetta. J. Am. Chem. Soc. 124: 2723–2729.
- Rose, G.D., Fleming, P.J., Banavar, J.R., and Maritan, A. 2006. A backbone-based theory of protein folding. Proc. Natl. Acad. Sci. 103: 16623–16633.
- Shen, Y. and Bax, A. 2007. Protein backbone chemical shifts predicted from searching a database for torsion angle and sequence homology. J. Biomol. NMR (in press).
- Street, T.O., Bolen, D.W., and Rose, G.D. 2006. A molecular mechanism for osmolyte-induced protein stability. Proc. Natl. Acad. Sci. 103: 13997–14002.
- Tolman, J.R., Al-Hashimi, H.M., Kay, L.E., and Prestegard, J.H. 2001. Structural and dynamic analysis of residual dipolar coupling data for proteins. J. Am. Chem. Soc. 123: 1416–1424.
- Wang, G. and Dunbrack Jr., R.L. 2003. PISCES: A protein sequence culling server. Bioinformatics 19: 1589–1591.