PSI-2: Structural Genomics to Cover Protein Domain Family Space

  • ️Fri Jul 01 2005

Author manuscript; available in PMC: 2010 Aug 12.

Published in final edited form as: Structure. 2009 Jun 10;17(6):869–881. doi: 10.1016/j.str.2009.03.015

Summary

One major objective of structural genomics efforts, including the NIH-funded Protein Structure Initiative (PSI), has been to increase the structural coverage of protein sequence space. Here, we present the target selection strategy used during the second phase of PSI (PSI-2). This strategy, jointly devised by the bioinformatics groups associated with the PSI-2 large-scale production centres, targets representatives from large, structurally uncharacterised protein domain families, and from structurally uncharacterised subfamilies in very large and diverse families with incomplete structural coverage. These very large families are extremely diverse both structurally and functionally, and are highly over-represented in known proteomes. On the basis of several metrics, we then discuss to what extent PSI-2, during its first three years, has increased the structural coverage of genomes, and contributed structural and functional novelty. Together, the results presented here suggest that PSI-2 is successfully meeting its objectives and provides useful insights into structural and functional space.

Background

The multiple international genomics and metagenomics initiatives are providing us with sequences of hundreds of genomes and millions of genes. Analysis of this windfall is greatly aided by the fact that these millions of genes can be grouped into a much smaller number of gene families that, being related by evolution, share similarities in function and three-dimensional structure (Todd et al., 2001;Pegg et al., 2006;Reeves et al., 2006;Finn et al., 2008;Redfern et al., 2008). Gene family sizes seem to follow a power law with most families containing small numbers of members, and relatively few families being very large and diverse (Todd et al., 2001;Gerlt et al., 2001;Reeves et al., 2006;Marsden et al., 2007) (Figure 1).

Figure 1.

Distribution of numbers of sequences from Gene3D v6.0 (Yeats et al., 2008) for all CATH superfamilies (Greene et al., 2007).

Several explanations that rely on historical, functional or thermodynamic arguments have been proposed to rationalise the existence of these very large families (Goldstein, 2008). For example, it was suggested that certain types of functions in ancestral proteins made these more amenable to duplication and diversification (Ranea et al., 2006;Goldstein, 2008). Results from other analyses imply that particular structural folds are more likely to accommodate insertions and deletions, which in turn allow functional diversification during evolution (Reeves et al., 2006). Interestingly, whilst the total number of families keeps growing at an almost linear pace with new sequencing data, the number of such very large families remains essentially constant (Goldstein, 2008;Redfern et al., 2008). This phenomenon not only reflects the laws of statistics, but also seems to hint at the history of life on Earth, since several of these families contain very ancient genes that are present in organisms from all domains of life, often in multiple paralogues (Aravind et al., 2002;Goldstein, 2008). During their long evolution, genes from these families have had ample chances to diversify, both in structure and function. For example, analysis of bacterial genomes has shown that some of these ancient families have expanded linearly with genome size, and that the occurrence of multiple paralogues has allowed functions to diversify, increasing the functional repertoire of the organisms (Ranea et al., 2006). It is also worth noting that the largest domain families are often involved in essential functions (Shakhnovich et al., 2006), which makes them potentially interesting targets for gaining a better understanding of disease-related processes, for instance.

Because of the observed modularity of proteins and the fact that many proteins, especially in eukaryotes, consist of multiple domains that can combine differently in other proteins, it is generally convenient to consider domains as the fundamental units of protein evolution (Ponting et al., 2002;Moore et al., 2008). Analyses of completed genomes have characterized the extent to which domains are duplicated and fused in different domain contexts. Whilst fewer than ten percent of the protein families in an organism are common to all kingdoms of life, over half the domain sequences in an organism are likely to belong to fewer than 200 families that are universal to all kingdoms of life (Lee et al., 2005;Ranea et al., 2006) and appear in diverse multi-domain contexts. A number of domain family resources, e.g. Pfam (Finn et al., 2008), CATH (Greene et al., 2007) and SCOP (Murzin et al., 1995), have emerged to capture evolutionary relationships between domains, enabling studies on the evolution of different functional roles in diverse relatives. Various large-scale efforts, among them structural genomics, are directed at attaining some level of description of all the known domain families.

Structural genomics initiatives that have been set up world-wide have undoubtedly started modifying the way protein three-dimensional structures are used to address issues in several disciplines, among them enzymology (Gerlt, 2007), protein folding (Fersht, 2008) and protein function prediction (Watson et al., 2007;Allali-Hassani et al., 2007). One major historical focus of many structural genomics efforts has been to increase structural coverage of known protein space, by selecting targets from novel, structurally uncharacterised protein families (Sali, 1998;Chandonia et al., 2006;Liu et al., 2007). More recently, it has been argued that structural coverage of protein space could only be completed by concomitantly selecting targets from very large and diverse superfamilies, which often display extreme structural and functional diversity (Todd et al., 2001;Pegg et al., 2006;Reeves et al., 2006;Marsden et al., 2007). In addition, it can be expected that a more comprehensive sampling of structures from such very large superfamilies would help us to understand better the determinants and the extent of their functional diversity (Reeves et al., 2006;Redfern et al., 2008). Accordingly, as part of the Protein Structure Initiative (PSI) funded by the NIH, structural genomics production centres have committed a significant part of their resources to solving structures of proteins from such diverse families (Norvell et al., 2007).

Structural data will not only help rationalise the mechanisms of functional divergence in these extremely large families but may also explain why some families are more highly recurrent in particular organisms or environmental contexts. Recently, the structural genomics target selection strategy was extended to protein families that were shown to be over-represented in uncultured bacteria present in specific environments, such as the human distal gut, as identified by metagenomics studies.

In this article, we present the target selection strategy that is being followed by the four large-scale production centres of the NIH Protein Structure Initiative (i.e. JCSG (www.jcsg.org), MCSG (www.mcsg.anl.gov), NESG (www.nesg.org) and NYSGXRC (www.nysgrc.org)). This strategy has been aimed at two major objectives, namely (a) to provide structures from protein families representing significant proportions of the genome sequences, and (b) to study the structural basis of functional diversity in the most diverse and highly populated families. We specifically address the benefits of increasing our sampling of structure space in these very large families.

Historical considerations on PSI target selection strategy

The first phase of the Protein Structure Initiative (PSI-1), which started in September 2000 and ended in June 2005, did not specify the exact meaning of structural coverage of protein space and generally targeted ‘novel’ proteins showing no close relationship to any proteins of known structure. A simplistic general threshold of 30% sequence identity was widely adopted as a definition of novel targets (Vitkup et al., 2001). This threshold was selected based on evidence from CASP quality assessments (Moult, 2005), which suggested that 30% sequence identity was a reasonable cut-off for building homology models. The underlying idea was that each novel structure solved could in turn be exploited to provide approximate models of all close homologues (Sali, 1998). Centres participating in the PSI focused on targets from specific organisms, metabolic pathways or other medically relevant topics. For example, the JCSG focused on targets from Thermotoga maritima, the NESG on targets from human and other eukaryotes, the MCSG on various pathogenic organisms, and the NYSGXRC on proteins involved in metabolic pathways and cancer. These four large-scale centres have continued to participate in PSI-2 since July 2005.
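As an illustration of the 30% sequence identity criterion described above, the following minimal Python sketch flags a target as novel when it falls below the threshold against every protein of known structure it has been aligned to. The function names, the toy aligned sequences and the choice to compute identity over gap-free aligned columns are assumptions made for illustration only; the text does not specify how identity was normalised.

```python
NOVELTY_THRESHOLD = 30.0  # percent sequence identity, as stated in the text


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap.

    Assumption: both strings come from the same pairwise alignment and use
    '-' for gaps. Other normalisations (e.g. by the shorter sequence) exist.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def is_novel(alignments, threshold: float = NOVELTY_THRESHOLD) -> bool:
    """True if the target is below the threshold against every known structure.

    `alignments` is a list of (aligned_target, aligned_known) pairs.
    """
    return all(percent_identity(t, k) < threshold for t, k in alignments)


# Toy example with made-up sequences.
close_homologue = ("ACDEFGHIKL", "ACDQFGHAKL")   # 8 of 10 positions identical
remote_homologue = ("ACDEFGHIKL", "AWWWWWWWWL")  # 2 of 10 positions identical

print(percent_identity(*close_homologue))
print(is_novel([remote_homologue]))
print(is_novel([remote_homologue, close_homologue]))
```

In this sketch a single close homologue of known structure is enough to disqualify a target, which mirrors the idea that such a target could already be modelled from the existing structure.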

PSI-1 succeeded in establishing a new model of structure determination whereby very large numbers of structures are solved by an automated high throughput experimental pipeline, providing unparalleled productivity and cost savings. The four large-scale centres now involved in PSI-2 solved over 800 protein structures during the first phase of PSI, far more than conventional structural biology labs could have solved alone with a comparable amount of funding. PSI-1 thereby achieved one of its goals, namely to reduce significantly the cost of solving protein structures. Several reports in the literature have detailed the success of PSI-1 according to different criteria (Todd et al., 2005;Chandonia et al., 2006;Watson et al., 2007). All these publications clearly suggest that PSI-1 was successful in significantly increasing the proportion of novel distinct protein structures deposited in the PDB (Berman et al., 2000), as well as the proportion of novel structural superfamilies and novel fold groups. The analysis by Todd et al. (2005) showed encouraging increases in numbers of structures solved in these different categories over the first 3 years. More recent analyses by Chandonia and Brenner (2006) confirmed these observations over 5 years (see Table 1).

Table 1.

This table summarises results from two previous studies regarding the increase in the number of structures for novel distinct proteins, proteins from novel superfamilies in existing folds, and proteins with novel folds, due to PSI, and structural biology worldwide (excluding structural genomics initiatives) over the same period of time. These results were obtained from (a) Todd et al. 2005 and (b) Chandonia and Brenner 2006. Superfamily and fold definitions were obtained from SCOP (Murzin et al., 1995) in both studies.

| Source (study) | Novel protein | Novel superfamily | Novel fold |
|---|---|---|---|
| Non-SG (a) | 42% | 2% | 2% |
| PSI (a) | 92% | 7% | 11% |
| Non-SG (b) | 24% | 1% | 2% |
| PSI (b) | 91% | 6% | 10% |

However, it was clear from both analyses that there were still considerable levels of redundancy between the PSI large-scale centres and also between the centres and the general structural biology community, despite the adoption of a centralized target tracking system (TargetDB, http://targetdb.pdb.org/) for publicizing information on selected targets and progress with these targets. In many cases this was due to similar targets having advanced too far in different pipelines by the time conflicts surfaced. In some situations, targets were not stopped because they involved a relative from a different species or with a different ligand bound that could potentially provide useful biological insights.

In order to reduce the redundancy in structures targeted and solved by the four PSI large-scale centres, the target selection strategy from PSI-1 was reviewed at the start of PSI-2, and a new joint initiative was started involving four BioInformatics Groups (jointly referred to as the BIG4), each being associated with one of the four large-scale centres. The aim was to improve the productivity of PSI by reducing the overlap among centres as far as possible and to coordinate efforts of all the centres towards the main goal of PSI.

All four large-scale PSI centres split their efforts among three major target lists, spending about 70% of their efforts on a centralized list targeting structural novelty, 15% on community-nominated targets and 15% on biomedically important targets. Here we present the strategies developed by the BIG4 in PSI-2 for assembling the centralized list targeting structural novelty, by focusing on uncharacterised domain families, as well as diverse relatives in very highly populated domain families of known structure which are predicted to be structurally and functionally dissimilar to previously determined structures. We also present our initial analyses of the structures deposited in the PDB by PSI-2 large-scale centres during the first three years of PSI-2, and examine the degree to which PSI-2 has been successful in increasing the proportion of distinct (less than 98% sequence identity to any structure pre-existing in the PDB – see Methods) and structurally novel structures solved since the beginning of the initiative. Even though it was not an explicit goal of PSI, we also assess the success of PSI-2 in contributing structural information for functional families and thereby the degree to which PSI-2 has illuminated both structure and function space, by identifying the functional categories within the Gene Ontology classification (The Gene Ontology Consortium, 2000) for which PSI-2 has solved the first structure.

Target Selection Strategy and Domain Families Targeted in PSI-2

A primary aim in PSI-2 has been to increase the proportion of domain families for which one or more structures have been characterized, by a coarse-grained sampling of sequence space. One major challenge for the target selection strategy was therefore to construct a list of domain families with no representative structure. Domain families with at least 10 relatives were targeted more specifically so as to maximize the potential impact of PSI-2 structures via homology modelling. Here, these domain families are referred to as structurally uncharacterised large families, and were also referred to internally as BIG families. Even though a family with 10 relatives may arguably not qualify as large, this threshold was chosen as a result of a compromise between selecting families having a significant size and not restricting the final list to too few families given the other constraints we had in the selection procedure (e.g. no features that might affect structure determination – see below).

Some domain families are very large and very diverse both in terms of structure and function. The largest 200 CATH (Greene et al., 2007) families in Gene3D (Yeats et al., 2008) account for at least 50% of domain sequences in the genomes and yet Figure 2 shows that, for all but two of these very large CATH superfamilies, less than 10% of the sequence subfamilies (so-called modelling subfamilies – see Methods) within them have structural representatives in the PDB (Berman et al., 2000). Previous analyses of some of these very large families reveal that some proteins can be up to five times larger than other members of the same family, sometimes to the point of actually adopting a different fold (Reeves et al., 2006). Such structural divergence is often clearly correlated with divergence in function (see Figure 3). Our structural sampling of most of these very large and diverse families is very incomplete. Here, these domain families are referred to as very large and diverse families with incomplete structural coverage, and were also referred to internally as MEGA families. Another aim of PSI-2 has therefore been to target additional relatives in these MEGA families, with the expectation that this will give us deeper insights into the nature of structural divergence within a family, and into how structural changes between related domains bring about changes in function. This, in turn, should trigger improvements in algorithms that attempt to predict functions from structures. Finally, such a fine-grained sampling of subfamilies within diverse families is required to fully characterize the structural repertoire in nature.
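The statement that the largest 200 families account for at least half of all domain sequences is a cumulative-share calculation. A minimal Python sketch of that calculation is given here; the function name and the toy family sizes are hypothetical and are not taken from Gene3D.

```python
def top_n_fraction(family_sizes, n):
    """Fraction of all domain sequences that fall in the n largest families."""
    sizes = sorted(family_sizes, reverse=True)
    total = sum(sizes)
    if total == 0:
        return 0.0
    return sum(sizes[:n]) / total


# Toy, skewed size distribution: a few large families and many small ones.
toy_sizes = [500, 300, 100, 50, 25, 15, 10]

# Share of sequences captured by the two largest toy families.
print(top_n_fraction(toy_sizes, 2))
```

With a strongly skewed size distribution, a small number of families captures most sequences, which is why these families weigh so heavily in any measure of genome coverage.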

Figure 2.

Proportion of structurally characterized modelling sub-families in very large and diverse families (referred to as MEGA families). MEGA families are the 200 largest superfamilies in CATH, and taken together they represent more than 50% of domains in genome sequences. These families are typically very diverse in terms of structure and function.

Figure 3.

Correlation between structural and functional diversity in CATH superfamilies. For each superfamily, the x-axis gives the number of molecular function GO terms identified for members of that superfamily in Gene3D. The y-axis gives the number of structurally similar sub-groups (see Methods) obtained by clustering domains from the superfamily with a normalised RMSD cut-off of 5 Å.

In recent years, metagenomics experiments have revealed the extent of previously uncovered parts of the protein universe, which are found in complex communities of uncultured microbes from various environments (e.g. ocean, soil, human skin or gastrointestinal tract). On that account, PSI-2 centres also started a pilot project in which the above-mentioned target selection strategies were applied to include domain families that are over-represented in one of the most studied environments, namely the human distal gut microbiome. Sequence information from metagenomics can illuminate important functional roles being carried out by the bacterial communities found in specific habitats (Riesenfeld et al., 2004). For example many bacterial proteins in the human gut are essential for breaking down complex food substrates and synthesizing vital nutrients such as vitamins. Understanding how these communities function and what populations are most beneficial to the human host is likely to be important for understanding and promoting human health and diagnosing conditions likely to lead to disease. In practice, as will be shown hereafter, these Gut Metagenome Families constitute a subset of the targeted large families (BIG and MEGA) mentioned in the above paragraphs.

Targeting structurally uncharacterised large families for coarse-grained sampling of sequence space

The bioinformatics groups (BIG4) from all four PSI centres collaborated in developing a consensus strategy for target selection. Defining domain families is a complex issue, and a number of curated domain family resources such as Pfam (Finn et al., 2008) and TIGRFAMs (Haft et al., 2003) are now publicly available, which can facilitate research in this field. In order to benefit both from these existing domain family resources and from strategies better suited to target selection, we applied a mixed protocol to identify suitable sequence families for coarse-grained targets. A primary list of large structurally uncharacterised families was constructed using Pfam, which is one of the most comprehensive manually curated resources. Exclusion of families with fewer than 10 relatives (see Methods) or with features that might affect success in structure determination (e.g. trans-membrane regions etc.) resulted in a total of 1369 target Pfam families, corresponding to approximately 20% of sequences in Pfam families without structural representatives.
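The exclusion criteria described above amount to a simple filter over candidate families. The Python sketch below illustrates that filter under stated assumptions: the `Family` record, its field names and the example families are hypothetical, and a single boolean stands in for the various features that might hinder structure determination.

```python
from dataclasses import dataclass

MIN_RELATIVES = 10  # size threshold stated in the text


@dataclass(frozen=True)
class Family:
    name: str
    n_relatives: int
    has_structure: bool      # a representative structure already exists
    has_transmembrane: bool  # stand-in for features hindering structure determination


def is_target_candidate(family: Family) -> bool:
    """Keep large, structurally uncharacterised families without problem features."""
    return (
        not family.has_structure
        and family.n_relatives >= MIN_RELATIVES
        and not family.has_transmembrane
    )


# Hypothetical families, one per outcome.
families = [
    Family("famA", 250, has_structure=False, has_transmembrane=False),  # kept
    Family("famB", 9, has_structure=False, has_transmembrane=False),    # too small
    Family("famC", 40, has_structure=True, has_transmembrane=False),    # already characterised
    Family("famD", 120, has_structure=False, has_transmembrane=True),   # problem feature
]

selected = [f.name for f in families if is_target_candidate(f)]
print(selected)
```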

However, several problematic features of Pfam were identified, which originate from the fact that the aims of PSI efforts and the rules guiding Pfam classifications are similar but not identical. For example, the sequences clustered into a Pfam family sometimes represent a multi-domain family rather than a single domain family. A reverse problem happens in proteins that have been chopped into partial domains that are never found separately and may not constitute a proper domain and therefore cannot be solved experimentally. Since consensus approaches have historically been shown to be highly successful in bioinformatics, we attempted to solve these problems by using a collaborative approach involving several orthogonal methodologies for domain family definition, which would allow us to look for consensus families, i.e. families that were found by more than a single source. Therefore, the target list of Pfam families was supplemented by families identified using various automated protocols developed in the BIG4 groups described below:

  1. The Rost group identified domain families using the CLUP method (Liu et al., 2004), which applies an iterative domain chopping and comparison protocol to merge related sequences into families.

  2. In the Orengo group, the Gene3D database (Yeats et al., 2008) was used to identify NewFam domain families, which are clusters of domain sequences built from regions of genome sequences that cannot be assigned to CATH or Pfam domain families (Marsden et al., 2006).

  3. The Godzik group used an iterative protocol for building families from a broad range of protein sequence databases.

  4. The Fiser group analysed the Pfam-B database (automatically generated domain clusters obtained from PRODOM (Bru et al., 2005)) for structurally uncharacterised sequence families.

A combined target list of families identified by Pfam and by the BIG4 protocols was generated, and families found by more than one source were labelled (see Methods). Each centre used its own criteria to prioritise the families that it wished to target for structure determination, for example depending on its reagent genomes, and the families were then divided amongst the four large centres using a draft pick procedure. This draft pick assignment was iterated four times and, in total, 2357 families were distributed to the centres (see Table 2).
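The two steps in this paragraph, labelling families found by more than one source and dividing the list amongst centres by taking turns, can be sketched as follows. This is an illustration only: the exact rules of the pick procedure are not given in this section, so the turn order, the per-centre preference lists and all identifiers below are assumptions.

```python
from collections import Counter


def label_consensus(sources):
    """Return the families reported by more than one source.

    `sources` maps a source name to the set of family identifiers it found.
    """
    counts = Counter(fam for fams in sources.values() for fam in fams)
    return {fam for fam, n in counts.items() if n > 1}


def turn_based_pick(preferences, pool):
    """Allocate families by letting centres pick in turn.

    `preferences` maps a centre to its families in priority order; on each
    turn a centre takes its highest-ranked family that is still available.
    """
    remaining = set(pool)
    allocation = {centre: [] for centre in preferences}
    while remaining:
        picked = False
        for centre, ranked in preferences.items():
            choice = next((fam for fam in ranked if fam in remaining), None)
            if choice is None:
                continue  # this centre has no remaining family on its list
            allocation[centre].append(choice)
            remaining.remove(choice)
            picked = True
        if not picked:
            break  # leftover families appear on no centre's list
    return allocation


# Hypothetical sources and family identifiers.
sources = {
    "Pfam": {"F1", "F2", "F3"},
    "CLUP": {"F2", "F4"},
    "NewFam": {"F3", "F4", "F5"},
}
consensus = label_consensus(sources)

# Hypothetical priorities for two centres.
preferences = {
    "JCSG": ["F1", "F2", "F3", "F4"],
    "MCSG": ["F1", "F3", "F2", "F4"],
}
allocation = turn_based_pick(preferences, ["F1", "F2", "F3", "F4"])
print(sorted(consensus))
print(allocation)
```

Taking turns lets each centre favour families that suit its own pipeline while guaranteeing that no family is assigned twice, which is the redundancy the coordinated strategy was designed to avoid.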

Table 2.

Numbers of structurally uncharacterised large families (i.e. BIG families) allocated to PSI-2 large-scale centres.

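| Centre | 10/2005 | 04/2006 | 08/2006 | 11/2006 | Total |
|---|---|---|---|---|---|
| JCSG | 271 | 129 | 75 | 100 | 575 |
| MCSG | 332 | 68 | 75 | 129 | 604 |
| NESG | 337 | 63 | 75 | 129 | 604 |
| NYSGXRC | 329 | 71 | 75 | 99 | 574 |
| Total | 1269 | 331 | 300 | 457 | 2357 |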

Targeting subfamilies in very large and diverse families with incomplete structural coverage for fine-grained sampling of sequence space

The Gene3D database (Yeats et al., 2008) was exploited to identify the most highly populated domain families with known structures in the genomes. This resource comprises more than five million protein sequences, including sequences from 520 completed genomes and the UniProt (UniProt Consortium, 2008) and RefSeq (Pruitt et al., 2007) databases. Putative domains are identified by scanning sequences against Hidden Markov Models (HMMs) derived from the CATH and Pfam domain databases, using conservative thresholds that have been carefully benchmarked with structural data. As of August 2008, approximately 37% of residues in protein sequences from Gene3D can be assigned to families of known structure in CATH, with a further 48% that can be assigned to Pfam families. Furthermore, approximately 55% of protein sequences in Gene3D contain at least one domain that can be assigned to a family in CATH.

Figure 4 shows that the largest 200 domain families contain more than 290,000 modelling subfamilies. PSI-2 is unlikely to solve this number of structures over the next few years, and rational approaches are clearly needed to select representatives that are structurally and functionally diverse. Therefore, a large part of the first year of PSI-2 (July 2005–June 2006) was dedicated to designing a robust target selection strategy and to developing the clustering and analysis tools needed to improve the rational selection of targets within the very large and diverse families selected. For example, in the NESG and NYSGXRC, research into improved methods of aligning sequences and deriving homology models led to revised thresholds for clustering sequences into modelling subfamilies on the basis of predicted structural similarity.

Figure 4.

Number of modelling families in 200 very large and diverse CATH superfamilies.

Similarly, the MCSG consortium developed the GEMMA approach (Lee et al., submitted), which exploits HMM-HMM strategies to progressively merge subfamilies of functionally related domains to enable selection of functionally diverse representatives. For some superfamilies this approach can reduce the number of predicted functionally diverse subfamilies to target by more than ten-fold, making it more feasible to achieve structural coverage of these diverse subfamilies. An additional constraint when selecting representatives is the set of reagent genomes available to the centre for cloning, which restricts the choice of homologues for structure determination. A measure of success of this target selection strategy will be the degree of structural and functional novelty observed in the structures that are deposited in the PDB by the four centres during this second phase of PSI. This is reviewed below for the first three years of PSI-2.

For each of the most highly populated families in CATH that have been allocated to the PSI-2 large-scale centres, Table 3 gives the number of relatives identified in Gene3D, the number of different functional terms from the Gene Ontology (GO) (The Gene Ontology Consortium, 2000), the number of different modelling subfamilies it contains (where sequences are clustered into modelling subfamilies using a 30% sequence identity threshold), and the percentage of modelling subfamilies for which there is a solved structure. Table 3 also shows to which of the large-scale centres each family was allocated, as well as the date of allocation.
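The last of these quantities, the percentage of modelling subfamilies with a solved structure, can be computed as sketched below. The function name, the cluster identifiers and the data layout are assumptions for illustration. The toy input uses 231 clusters, the count reported for superfamily 3.30.530.40 in Table 3, to show that one structurally characterised cluster would reproduce the tabulated 0.433%; the underlying cluster memberships are not given in the text.

```python
def structural_coverage(clusters, solved):
    """Percentage of clusters with at least one member of known structure.

    `clusters` maps a cluster identifier to the set of its sequence
    identifiers; `solved` is the set of sequences with a solved structure.
    """
    if not clusters:
        return 0.0
    covered = sum(1 for members in clusters.values() if members & solved)
    return 100.0 * covered / len(clusters)


# Toy data: 231 single-member clusters, one of which has a solved structure.
toy_clusters = {f"cluster_{i}": {f"seq_{i}"} for i in range(231)}
toy_solved = {"seq_0"}

coverage = structural_coverage(toy_clusters, toy_solved)
print(round(coverage, 3))
```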

Table 3.

Very large and diverse (MEGA) families with incomplete structural coverage allocated to PSI-2 large-scale centres. For each MEGA superfamily, the table shows the name of the superfamily, the number of distinct sequences assigned to that superfamily in Gene3D v6.0, the number of distinct GO terms (biological process ontology), the number of sequence clusters at 30% sequence identity, the percentage of these sequence clusters for which a structure has already been solved, the PSI-2 centre to which the superfamily was allocated, and the allocation date. For SUPERMEGA superfamilies, PSI-2 centre and allocation date are not shown due to the particular allocation protocol used for these superfamilies (see text).

| CATH code | Superfamily name | Sequences | GO terms | s30 clusters | % s30 with structure | PSI-2 centre | Allocation date |
|---|---|---|---|---|---|---|---|
| 3.30.530.40 | Bh1534 unknown conserved protein | 1011 | 8 | 231 | 0.433 | NESG | 2007-06 |
| 3.30.70.900 | Dimeric alpha+beta barrel | 1450 | 14 | 276 | 2.536 | JCSG | 2007-06 |
| 3.40.109.10 | NADH Oxidase-like | 2094 | 14 | 248 | 2.823 | JCSG | 2007-12 |
| 2.30.110.10 | FMN-binding split barrel | 2433 | 25 | 275 | 5.818 | JCSG | 2007-03 |
| 3.10.110.10 | Ubiquitin Conjugating Enzyme-like | 2576 | 185 | 230 | 7.391 | NESG | 2007-06 |
| 3.20.20.120 | Enolase superfamily | 2653 | 35 | 192 | 8.854 | NYSGXRC | 2007-12 |
| 1.20.1260.10 | Ferritin-like | 2889 | 37 | 259 | 8.88 | JCSG | 2007-06 |
| 3.20.20.10 | PLP-binding barrel | 2961 | 41 | 177 | 4.52 | NYSGXRC | 2007-06 |
| 3.90.1200.10 | Protein kinase-like | 3174 | 62 | 504 | 0.595 | MCSG | 2007-03 |
| 3.40.1190.10 | MurD-like peptide ligases, catalytic domain | 3596 | 35 | 235 | 2.979 | JCSG | 2007-06 |
| 3.10.450.50 | NTF2-like | 3609 | 92 | 729 | 2.195 | JCSG | 2007-06 |
| 3.40.50.10490 | Glucose-6-phosphate isomerase-like | 4437 | 67 | 474 | 4.008 | JCSG | 2007-12 |
| 3.40.1190.20 | Ribokinase-like | 5156 | 52 | 461 | 3.905 | JCSG | 2007-03 |
| 2.120.10.30 | TolB, C-terminal domain-like | 5247 | 217 | 1159 | 0.777 | NESG | 2007-03 |
| 3.10.20.90 | Ubiquitin-like | 5327 | 376 | 1217 | 3.615 | NESG | 2007-06 |
| 3.90.79.10 | Nudix hydrolases | 5569 | 80 | 739 | 2.165 | NYSGXRC | 2007-12 |
| 1.10.540.10 | Acyl-CoA dehydrogenase C-terminal domain-like | 5634 | 60 | 457 | 2.407 | NYSGXRC | 2007-06 |
| 3.10.180.10 | Dihydroxybiphenyl Dioxygenase-like | 5960 | 43 | 863 | 2.665 | NYSGXRC | 2007-03 |
| 1.10.1040.10 | 6-phosphogluconate dehydrogenase C-terminal domain-like | 6348 | 61 | 427 | 2.81 | NYSGXRC | 2007-12 |
| 3.10.129.10 | Thioesterase/thiol ester dehydrase-isomerase | 6641 | 41 | 785 | 2.548 | NESG | 2007-12 |
| 3.30.450.40 | GAF domain-like | 6915 | 90 | 1508 | 0.729 | MCSG | 2007-03 |
| 3.60.15.10 | Metallo-hydrolase/oxidoreductase | 7353 | 65 | 863 | 1.622 | NESG | 2007-12 |
| 3.40.630.10 | Zn peptidases | 8338 | 106 | 595 | 3.866 | JCSG | 2007-03 |
| 3.40.50.880 | Class I glutamine amidotransferase-like | 9291 | 104 | 625 | 3.68 | NYSGXRC | 2007-12 |
| 3.20.20.140 | Metal-dependent hydrolases | 9684 | 126 | 799 | 3.379 | MCSG | 2007-12 |
| 3.30.930.10 | Class II aaRS and biotin synthetases | 9840 | 81 | 553 | 3.797 | MCSG | 2007-12 |
| 3.90.226.10 | ClpP/crotonase | 11852 | 110 | 862 | 2.784 | MCSG | 2007-12 |
| 3.40.50.1000 | HAD-like | 12691 | 239 | 1251 | 1.918 | MCSG | 2007-06 |
| 3.30.450.20 | PYP-like sensor domain (PAS domain) | 13173 | 220 | 4020 | 0.473 | MCSG | 2007-03 |
| 2.60.120.10 | Jelly Rolls | 13512 | 239 | 2197 | 2.003 | JCSG | 2007-12 |
| 3.20.20.80 | Glycosidases | 14373 | 177 | 1278 | 7.042 | NYSGXRC | 2007-06 |
| 3.40.630.30 | Acyl-CoA N-acyltransferases (Nat) | 14690 | 195 | 2472 | 1.214 | MCSG | 2007-06 |
| 2.40.50.140 | Nucleic acid-binding proteins | 17863 | 244 | 1735 | 4.438 | NESG | 2007-03 |
| 3.90.550.10 | Nucleotide-diphospho-sugar transferases | 17894 | 272 | 2195 | 1.185 | NESG | 2007-06 |
| 2.130.10.10 | YVTN repeat-like/Quinoprotein amine dehydrogenase | 19132 | 760 | 3740 | 0.535 | NESG | 2007-12 |
| 3.30.420.10 | Ribonuclease H-like | 20593 | 158 | 2101 | 1.475 | NYSGXRC | 2007-06 |
| 3.40.30.10 | Glutaredoxin-like | 22258 | 335 | 2481 | 4.434 | MCSG | 2007-12 |
| 3.40.50.620 | HUP domains | 23142 | 225 | 2032 | 3.346 | MCSG | 2007-06 |
| 3.40.640.10 | PLP-dependent transferases | 23500 | 287 | 1273 | 4.399 | JCSG | 2007-12 |
| 1.25.40.10 | TPR-like | 24833 | 503 | 6931 | 0.245 | NESG | 2007-12 |
| 3.50.50.60 | FAD/NAD(P)-binding domain | 26841 | 328 | 3376 | 1.777 | NESG | 2007-03 |
| 3.30.565.10 | ATPase domain of HSP90 chaperone-like | 26943 | 266 | 3128 | 0.607 | NYSGXRC | 2007-03 |
| 3.40.50.2300 | CheY-like | 27818 | 253 | 3599 | 1.334 | NYSGXRC | 2007-03 |
| 3.40.50.1820 | Alpha/Beta-hydrolases | 31099 | 394 | 4182 | 1.793 | MCSG | 2007-06 |
| 3.40.50.150 | SAM-dependent methyltransferases | 35362 | 299 | 4104 | 1.389 | SUPERMEGA | SUPERMEGA |
| 3.20.20.70 | Aldolase class I | 38574 | 341 | 2597 | 4.582 | SUPERMEGA | SUPERMEGA |
| 3.40.50.720 | NAD(P)-binding Rossmann-like domains | 70263 | 644 | 6463 | 3.249 | SUPERMEGA | SUPERMEGA |
| 3.40.50.300 | P-loop containing nucleotide triphosphate hydrolases | 184999 | 1711 | 18682 | 1.493 | SUPERMEGA | SUPERMEGA |

The largest four of these very large families, also called SUPERMEGA superfamilies, were not allocated to any individual centre but instead, each centre prioritised individual modelling subfamilies within these superfamilies, largely on the basis of features which suited their experimental pipelines (e.g. presence of homologues in the reagent genomes used by the centre) and functional assignments (e.g. biologically interesting GO terms for which no structures were currently known). Modelling subfamilies from these four largest families were then assigned to each centre using the draft pick protocol. It is worth noting the disproportionately larger size of the superfamily of P-loop containing nucleotide triphosphate hydrolases (CATH code 3.40.50.300), as compared with the other very large superfamilies.

Targeting families that are over-represented in the gut microbiome

Two rounds of identification of protein families over-represented in the human gut microbiome were performed. For both rounds, protein sequences found in the gut microbiome were first grouped into homologous clusters (see Methods for further details). Comparing the numbers of homologues from these clusters found in the gut microbiome and in other bacterial genomes allowed the identification of clusters that are significantly over-represented in the gut. The largest and most over-represented clusters were considered as potential targets. A subset of 1092 clusters from the first round and 136 clusters from the second round (defined by HMMs) were then selected as targets and equally divided amongst the four centres using the draft pick protocol. Many of these Gut Metagenome Families constitute a subset of the targeted large families (BIG and MEGA) mentioned in the above paragraphs; however, some represent novel families specific to the human gut environment.
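One simple way to express the comparison of homologue counts described above is a log ratio of relative frequencies. The Python sketch below is an assumption-laden illustration and not the procedure used here: the actual significance test is described in Methods, and the function name, the pseudocount and all counts are hypothetical.

```python
import math


def log2_enrichment(gut_count, gut_total, ref_count, ref_total, pseudocount=1.0):
    """log2 ratio of a cluster's frequency in the gut set versus a reference set.

    A pseudocount is added to numerators and denominators so that clusters
    absent from the reference genomes do not cause a division by zero.
    """
    gut_freq = (gut_count + pseudocount) / (gut_total + pseudocount)
    ref_freq = (ref_count + pseudocount) / (ref_total + pseudocount)
    return math.log2(gut_freq / ref_freq)


# Hypothetical counts: (homologues in gut sequences, homologues in reference genomes).
cluster_counts = {
    "cluster_1": (99, 9),  # about ten times more frequent in the gut set
    "cluster_2": (9, 9),   # equally frequent in both sets
}
GUT_TOTAL = 9999  # hypothetical number of gut sequences
REF_TOTAL = 9999  # hypothetical number of reference sequences

scores = {
    name: log2_enrichment(gut, GUT_TOTAL, ref, REF_TOTAL)
    for name, (gut, ref) in cluster_counts.items()
}
# Rank clusters from most to least over-represented in the gut.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A ratio of this kind only ranks clusters; deciding which are significantly over-represented would additionally require a statistical test and a correction for the number of clusters examined.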

Analysis of the Coverage of Genome Sequences by PSI-2 Structures and their Structural and Functional Novelty

It is possible to gauge the success of the structural genomics initiatives, in particular that of PSI-2, using a number of different measures. The total number of structures solved is an obvious preliminary indicator, but it must be considered with caution since it is not necessarily indicative of the actual impact and leverage of PSI-2, or of its success at meeting its objectives (Liu et al., 2007). One major aim of PSI-2 is to determine “novel” protein structures and, in that context, not all newly solved structures have the same value. For example, alternative structures of a given protein with different ligands can be crucial for better understanding the mechanism of that protein, but do not add structural novelty.

We consider two measures to evaluate the success of PSI-2 at determining novel protein structures. First, we measure the extent to which these structures are affecting the structural coverage of known proteomes. Ultimately, this issue relies on the definition of modelling subfamilies and on how the newly solved structures can be used to provide valuable structural information on their sequence relatives. Secondly, we directly measure the structural novelty of PSI-2 structures by comparing them with previously released structures using a normalised RMSD score.

Another means of assessing the success of the structural genomics initiatives and the potential value of this data to biologists is to consider the number of diverse functions which have been characterized experimentally and captured in public resources such as GO but for which there are no structural relatives. Solving representative structures for proteins possessing these functions will help in revealing the molecular mechanisms by which these proteins function and expand our understanding of functional space as well as structural space. For this reason, we also consider the number of functional groups that were previously uncharacterised structurally and for which PSI-2 has provided a first structural representative.

Total Number of Structures solved

Analyses were performed on all structures deposited in the PDB (Berman et al., 2000) by the four PSI-2 large-scale production centres from July 1st 2005 to July 1st 2008. Some of these analyses were conducted in collaboration with the PSI Structural Genomics Knowledgebase established at Rutgers University (http://kb.psi-structuralgenomics.org/KB/) (Berman et al., 2009). A total of 1600 structures were solved by the 4 centres in the first three years of PSI-2 and they amounted to 1502 distinct chains (~94%). This compares with a ratio of 61% of distinct chains to PDB entries (9597/15629) for the entire PDB (excluding PSI structures) over the same period of time. Of the 1502 distinct structures solved, 460 (~30%) were from BIG families of which 355 (~24%) were from Pfam families. During the first three years of PSI-2, 288 Pfam families had their first structure solved by PSI-2 large-scale centres, which is about 38% of all Pfam families (total of 748) for which a first structure was deposited in the PDB during the same period of time.

Genome Coverage

Previous analyses of structural coverage of known proteomes suggest that up to 30–40% of protein residues, and ~50% of domain sequences, can currently be assigned a structure by modelling (Liu et al., 2007;Marsden et al., 2007). This proportion varies with the sequence database used, and the prediction methods used to assign sequences to structural families (e.g. PSI-BLAST (Altschul et al., 1997), HMMs, profile-profile comparisons, threading etc). A non-negligible proportion of the remaining domain sequences in these proteomes belong to families that are problematic for high throughput structure determination, because they are membrane associated, intrinsically disordered or have regions of low complexity, and thus more appropriate targets for the specialized centres of PSI (Norvell et al., 2007). A significant proportion of the remaining structurally uncharacterised and non-problematic sequences were targeted by the expanded BIG list (2298 families). The remaining targets chosen by the four centres came from 48 MEGA families and 136 META families. In total, 193249 targets have been selected over the three years since the start of PSI-2.

Figure 5 shows the increase in structural coverage of fractions of proteins and residues from UniProt, obtained by solving structures since the start of PSI-2, and compares the contributions of structures from the entire PDB, PSI-2 only and PSI-2 large-scale centres only (see Methods). Altogether, the fraction of UniProt proteins (residues) that can be structurally modelled is now reaching 48% (42%). This represents an increase of about 10% (6%) over the past three years, with a contribution of more than 2% (1.3%) from PSI-2 structures. In terms of increase in structural coverage, the contribution of PSI-2 is practically entirely due to structures solved by the four large-scale centres. About 23% of the increase in structural coverage of proteins in UniProt (UniProt Consortium, 2008) is due to structures from large-scale centres. The contribution of these structures is about 19% when defining structural coverage at the residue level. These contributions are rather encouraging given that structures from PSI-2 large scale centres only account for around 13% of the distinct structures released in the PDB since July 1st 2005. This result is somewhat expected, particularly because targets have been specifically selected for the coarse-grained sampling of sequence space (i.e. BIG families) rather than to optimally increase modelling coverage. Since the data presented here only considers structures released within the first three years of PSI-2, the proportion of novel structural coverage due to PSI-2 may increase in the final 2 years as PSI-2 large-scale centres reach optimal productivity.

Figure 5.

Increase in the fraction of proteins (a) and residues (b) from UniProt (release 12.8), that can be structurally modelled using structures released in the PDB since the start of PSI-2. The black line shows the increase in structural coverage resulting from all structures released in the PDB, the green line shows the increase resulting from PSI-2 structures only, and the blue line shows the increase resulting exclusively from structures solved by the PSI-2 large-scale centres.

When considering specific proteomes, the contribution of the PSI-2 large-scale centres to the increase in structural coverage greatly depends on the type of organism. For example, there was a total of 7049 novel human proteins (~10% of the total number of human sequences in UniProt 12.8 – i.e. 72034 protein sequences) for which a structure could be modelled thanks to structures deposited in the PDB between July 1st 2005 and July 1st 2008, but only 231 (i.e. 3.3% of the structural coverage increase, and 0.3% of the human proteome) out of these were due to structures solved by the four large-scale centres (for residues, the fraction of the total increase in structural coverage that is due to large-scale centres is also 3.3%). In contrast, the contribution of the large-scale centres to novel structural coverage amounts to 37% for Escherichia coli over the same period of time, i.e. 206 out of 560 proteins for which a structure can now be modelled (respectively 5% and 13% of the total number of Escherichia coli sequences – i.e. 4381 protein sequences). For residues, the fraction of the total increase in structural coverage of Escherichia coli that is due to large-scale centres is 28.2%. These discrepancies between human and E. coli are somewhat expected given that large-scale centres have preferentially targeted prokaryotic proteins.
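The protein-level percentages quoted above follow directly from the counts given in the paragraph. The short Python check below recomputes them; the helper name `pct` and the dictionary keys are illustrative choices, not part of any PSI-2 tooling, and the residue-level figures are not recomputed because the underlying residue counts are not given in the text.

```python
def pct(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole

# Counts quoted in the text (UniProt 12.8; structures deposited
# between 2005-07-01 and 2008-07-01).
human_total, human_new, human_lsc = 72034, 7049, 231
ecoli_total, ecoli_new, ecoli_lsc = 4381, 560, 206

human = {
    "newly_modelled_of_proteome": pct(human_new, human_total),
    "lsc_share_of_increase": pct(human_lsc, human_new),
    "lsc_of_proteome": pct(human_lsc, human_total),
}
ecoli = {
    "newly_modelled_of_proteome": pct(ecoli_new, ecoli_total),
    "lsc_share_of_increase": pct(ecoli_lsc, ecoli_new),
    "lsc_of_proteome": pct(ecoli_lsc, ecoli_total),
}

for name, stats in (("human", human), ("E. coli", ecoli)):
    for key, value in stats.items():
        print(f"{name:8s} {key:28s} {value:5.1f}%")
```

Keeping the share of the increase separate from the share of the whole proteome makes the contrast explicit: the large-scale centres account for a much larger part of the newly modellable E. coli proteins than of the newly modellable human proteins.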

Structural novelty of PSI-2 structures

Whether a new structure is deemed structurally novel depends on the criteria used to recognize structural similarity. Recent analyses of homologous domains in the CATH database revealed a mean value of 5Å for the normalised RMSD following superposition of homologous domains to be an appropriate cut-off for defining structural similarity (see Methods for definition of the normalised RMSD) (Cuff et al., submitted). Relatives superposing with higher normalised RMSD values have been observed to be structurally divergent often due to significant structural embellishments to the cores of the structural domains (Reeves et al., 2006). Therefore a normalised RMSD cut-off of 5Å was applied to determine whether structures solved by PSI-2 and traditional structural biology were significantly structurally different from those previously deposited in the PDB (Berman et al., 2000). Since improved structural alignments can sometimes be obtained by aligning the constituent domains rather than complete multi-domain structures, all the structures were scanned against the CATH non-redundant domain library (CATH version 3.2).

Figure 6 shows that 28% of the domain structures solved by PSI-2 large scale centres are structurally novel when using these criteria. This compares with 3% of domains solved by non-Structural Genomics structural biology worldwide which are structurally novel. These results cover domain structures solved by the PSI-2 large-scale centres, whether or not the targets were selected as part of BIG families. Of the 365 distinct domain structures (less than 98% sequence identity) from BIG families that have been solved and classified in CATH, 155 (42%) were found to be structurally novel according to the normalised RMSD cut-off of 5Å. Encouragingly, a significant proportion of structures from MEGA families were also found to be structurally novel (15%), as computed over the total number of 282 distinct domains from MEGA families and solved by PSI-2 large-scale centres. This suggests that the strategies described above for selecting structurally diverse representatives from these families appear to be performing well.

Figure 6.

Structural novelty of structural domains solved by PSI-2 large-scale centres (‘LSC’) and traditional structural biology worldwide (excluding Structural Genomics structures) between June 2005 and June 2008. Only domains classified in CATH are considered in this plot.

We also evaluated structural novelty by counting the number of structures that were the first representative of their superfamily or fold in CATH. Of the 859 distinct domain structures solved by PSI-2 large-scale centres, which are classified in CATH, 102 structures comprise novel CATH superfamilies, and 28 comprise novel CATH folds. Unfortunately, equivalent numbers for non-structural genomics structural biology since June 2005 cannot be readily computed for comparison, because a specific effort to classify PSI-2 structures was made by curators for the most recent release of the CATH database (CATH v3.2). Besides, of the 365 distinct domain structures from BIG families, 75 (21%) were found to represent novel CATH superfamilies (including 21 that represented novel folds), whereas 290 (79%) were found to belong to previously existing CATH domain families among which 116 (32%) were assigned to MEGA superfamilies. These BIG families are therefore clearly diverse subfamilies of the CATH families, that were no longer recognizable by sequence based protocols but that showed clear structural similarity to relatives from previously known CATH superfamilies.

Number of Structurally Uncharacterised Functional Groups for which structures were solved

In order to assess how well structural genomics was contributing structures towards the aim of increasing the number of functional groups with a representative structure, the number of functional categories in the Gene Ontology (GO) for which PSI-2 or structural biology solved the first representative structure was assessed. Of the 1502 distinct structures solved by PSI-2 large-scale centres, 51% could be mapped to a functional category in the GO database (molecular function ontology). This contrasted with 81% of structures solved by non-structural genomics structural biology worldwide. Similar ratios were obtained when considering the GO biological process ontology. Thus a significant proportion of PSI-2 structures have been functionally annotated, suggesting a non-negligible leverage of structural information from PSI-2 in terms of functional data. More importantly, 2.2% of distinct structures (i.e. 33 structures) solved by PSI-2 large-scale centres represented the first structure solved for one of their GO terms, including 27 structures for molecular function terms and 12 for biological process terms, with 7 structures being the first representative for a term in both categories. These GO terms, which consist mostly of enzymatic functions, are listed together with their representative PSI-2 structure in Tables 4 and 5 for molecular function and biological process, respectively. For comparison, 6% of distinct structures (i.e. 374 out of 6080 structures) solved by non-Structural Genomics projects and released in the PDB between July 1st 2005 and July 1st 2008 represented the first structure solved for one of their GO terms. Thus, the proportion of structures being first representatives of a given function is of the same order of magnitude for structures from PSI-2 large-scale centres and those from standard structural biology. This is encouraging given that targeting novel functions was not an explicit aim for PSI-2.

Table 4.

Molecular Function Gene Ontology terms for which PSI-2 large-scale centres produced the first structural representative between 2005-07-01 and 2008-07-01. The Table shows the PDB ID of the structure, the GO term identifier and the corresponding GO term name.

PDB ID GO ID GO Name
2aa4 GO:0009384 N-acylmannosamine kinase
2ajt GO:0008733 L-arabinose isomerase
2ako GO:0004349 glutamate 5-kinase
2ap9 GO:0003991 acetylglutamate kinase
2awd GO:0009024 tagatose-6-phosphate kinase
2fpo GO:0008990 rRNA (guanine-N2-)-methyltransferase
2gfh GO:0050124 N-acylneuraminate-9-phosphatase
2ghr GO:0008899 homoserine O-succinyltransferase
2gok GO:0050480 imidazolonepropionase
2i09 GO:0008973 phosphopentomutase
2idb GO:0008694 3-octaprenyl-4-hydroxybenzoate carboxy-lyase
2jo6 GO:0008942 nitrite reductase [NAD(P)H]
2jzc GO:0004577 N-acetylglucosaminyldiphosphodolichol N-acetylglucosaminyltransferase
2ols GO:0008986 pyruvate - water dikinase
2p35 GO:0030798 trans-aconitate 2-methyltransferase
2ph5 GO:0047296 homospermidine synthase
2qez GO:0008851 ethanolamine ammonia-lyase
2qgn GO:0004811 tRNA isopentenyltransferase
2qiw GO:0008807 carboxyvinyl-carboxyphosphonate phosphorylmutase
2qrr GO:0015424 amino acid-transporting ATPase
2qrr GO:0048474 D-methionine transmembrane transporter
2qt3 GO:0018764 N-isopropylammelide isopropylaminohydrolase
2qyv GO:0008769 X-His dipeptidase
2r6h GO:0016655 oxidoreductase acting on NADH/NADPH, quinone (or similar) as acceptor
2raa GO:0019164 pyruvate synthase
3bp1 GO:0033739 queuine synthase
3cbw GO:0004567 beta-mannosidase
3cea GO:0050112 inositol 2-dehydrogenase

Table 5.

Biological Process Gene Ontology terms for which PSI-2 large-scale centres produced the first structural representative between 2005-07-01 and 2008-07-01. The Table shows the PDB ID of the structure, the GO term identifier and the corresponding GO term name.

PDB ID GO ID GO Name
2awd GO:0005988 lactose metabolic process
2fa1 GO:0015716 phosphonate transport
2g7u GO:0046278 protocatechuate metabolic process
2g9i GO:0009108 coenzyme biosynthetic process
2gfh GO:0046380 N-acetylneuraminate biosynthetic process
2ghr GO:0019281 methionine biosynthetic process from homoserine via O-succinyl-L-homoserine and cystathionine
2gok GO:0019556 histidine catabolic process to glutamate and formamide
2i09 GO:0043094 metabolic compound salvage
2js7 GO:0045351 type I interferon biosynthetic process
2jzc GO:0006488 dolichol-linked oligosaccharide biosynthetic process
2oyn GO:0009398 FMN biosynthetic process
2qrr GO:0048473 D-methionine transport

Functional Insights Gained From the META Structures Solved

Coverage of META families was initiated in year 3 of PSI-2, and insufficient data is available at this point to fully evaluate this target selection strategy. PSI centres solved a significant number of novel proteins from human gut microbes, including over 25 proteins involved in carbohydrate metabolism and first representatives of over 10 novel protein families first found in the human gut. These preliminary results highlight two dominant mechanisms of adaptation of microbes to the specific challenges of the gut environment, namely expansion and functional diversification of known protein families, and evolution of new specialized families (Ellrott et al., submitted).

Conclusion

The Protein Structure Initiative (PSI) is now more than half way through its second phase. An important stated aim of this effort has been to make structural information available for a large proportion of genome sequences. In order to achieve this, a strategy has been set up to select structural genomics targets in protein domain families of substantial size for which no structural information was yet available. These families have been referred to as BIG families. This target selection strategy, which is extensively presented here, has been made possible by the joint efforts of several bioinformatics groups associated with PSI-2. Early in the second phase of PSI, analyses made it clear that a large fraction of the BIG families that were targeted turned out to be remote homologues of previously known structural families. Genomic analyses also suggest that a significant proportion of genome sequences belong to a few universal families that are highly structurally and functionally divergent. It is clear that structural genomics can make a major contribution to biology by revealing the manner in which these families diverge structurally and how this mediates changes in molecular function, biological role and interaction partners. Therefore another important aim of PSI-2 has been to increase the number of representative structures from these families (referred to as MEGA families) in a way that reveals more comprehensively their considerable diversity and that contributes new structural information for the relatives within the superfamilies that clearly have different functional roles.

The results presented here suggest that during its first three years, PSI-2 has been successful at meeting several of its stated aims, by contributing significant numbers of structural representatives of novel structures and functions, and by contributing substantially to a general increase in the number of genome sequences that can be modelled structurally. We hope that this analysis, together with previous reports on the success of structural genomics (Todd et al., 2005;Chandonia et al., 2006;Watson et al., 2007) and more specific analyses (Todd et al., 2005;Watson et al., 2007;Allali-Hassani et al., 2007), will shed light on the capacity of the Protein Structure Initiative and other similar efforts world-wide to contribute valuable data for facing the new challenges in understanding biology at the molecular and cellular levels (Gerlt, 2007;Blundell, 2007).

Methods and Definitions

Definitions for target selection strategy

At the start of PSI-2, the PSI committee issued a statement publicizing the fact that PSI-2 would aim to ‘increase the number of large families for which a structure was known’. This can be described as coarse-grained coverage of protein structural space. However, it was also recognized that for some large and highly diverged families a single representative would not provide sufficient structural insight for the entire family and that in such cases, structures should be solved for several representatives. This process would be described as fine-grained coverage. Although these definitions appear intuitively obvious, practical use of the guidelines was initially hampered by the lack of universally accepted definitions. For instance, the term “protein family” is used by different authors to designate groups of proteins that share differing levels of similarity, so that coarse-grained coverage according to one author could correspond to fine-grained coverage according to another.

Various groups working on protein families and domain definitions (e.g. Pfam (Finn et al., 2008), TIGRFAMs (Haft et al., 2003)) have used different strategies and protocols to construct databases of domains and families. However, the BIG4 felt that none of the existing resources fully incorporated structural information into the families and domain definitions. Furthermore, in determining a sensible strategy for target selection for structural genomics, there are various experimental issues that have an important bearing on choice of a suitable approach. For example, whilst it may seem tempting to opt for a particular organism of biological significance such as yeast or human, there may be significant experimental difficulties with expressing proteins from this organism or restricting the selection strategy to a few organisms. In order to coordinate target selection, the BIG4 came up with the following working definitions of families:

Modelling Subfamily (MS)

This describes a group of closely related sequences in which any two targets share a “minimal similarity”. Modelling subfamilies were constructed by multi-linkage clustering, using a clustering threshold of 30% pair-wise sequence identity between any two members of the subfamily. This threshold was chosen as it ensures that once a single structure has been solved for the MS, there is a reasonable probability that homology models can be built for all other relatives with good accuracy. We anticipate that the precise definition of a modelling subfamily will probably change as modelling algorithms evolve and improve.
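The clustering rule above can be illustrated with a minimal Python sketch. It assumes that "multi-linkage clustering" corresponds to what is usually called complete-linkage clustering, i.e. every pair of members in a subfamily must reach the 30% identity threshold. The function name, the data layout and the toy identity values are hypothetical and are not taken from the PSI-2 pipelines.

```python
from itertools import combinations

def complete_linkage_clusters(ids, identity, threshold=30.0):
    """Group sequences so that every pair inside a cluster has
    pairwise identity >= threshold (complete linkage).

    identity maps frozenset({a, b}) -> percent identity; pairs that
    are absent are treated as 0% (an assumption of this sketch).
    """
    clusters = [{i} for i in ids]

    def linkage(c1, c2):
        # Complete linkage: the weakest pair decides the similarity.
        return min(identity.get(frozenset((a, b)), 0.0)
                   for a in c1 for b in c2)

    while len(clusters) > 1:
        # Pick the two clusters whose weakest pair is strongest.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[i], clusters[j]) < threshold:
            break
        clusters[i] |= clusters[j]
        del clusters[j]  # safe because i < j
    return clusters

# Hypothetical pairwise identities (percent) between four sequences.
toy_identity = {
    frozenset(("A", "B")): 45.0,
    frozenset(("A", "C")): 35.0,
    frozenset(("B", "C")): 25.0,
    frozenset(("C", "D")): 50.0,
}
subfamilies = complete_linkage_clusters(["A", "B", "C", "D"], toy_identity)
print(sorted(sorted(c) for c in subfamilies))
```

In this toy case A and C share 35% identity, yet they end up in different subfamilies because B and C fall below 30%. This is the behaviour implied by requiring the threshold between any two members, and it is what makes a single solved structure a plausible modelling template for every relative in the subfamily.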

BIG family (large families)

We refer to BIG families to describe groups of related proteins, with many relatives, identified using profile-based sequence similarity search strategies. Currently a minimum of 10 relatives is being employed to define a BIG family, though this may change in the future. We hypothesize that a BIG family could consist of multiple modelling subfamilies and that members of a BIG family may display non-negligible structural diversity. Since the primary focus of PSI-2 is to solve representatives of large families with unknown structures, BIG families were validated as targets by ensuring that they contained no relative with a known structure in the PDB. Standard bioinformatics approaches were used to eliminate families that could be problematic (as in PSI-1) (Marsden et al., 2008).

MEGA family (very large families)

Some domain families are extremely large (some are ten-fold or more larger than the average BIG family) and we can anticipate extreme structural divergence within them (Marsden et al., 2007). Multiple targets from these families would be needed to get even approximate models for all structural variants. We refer to such families as MEGA families. In practice, MEGA families were defined as the 200 most populated homologous superfamilies in CATH (H-level). Taken together, these 200 MEGA families cover at least 50% of domain sequences in genomes. Most MEGA families already have representatives of known structure, but an important goal of PSI-2 is to fully characterize structural (and functional) variability in these families.

META family

We use the term META family to refer to clusters of homologous sequences that are over-represented in metagenomic samples from a particular environment (microbiome). This term falls into a slightly different category than MEGA, BIG or modelling subfamily since it does not refer to the size of a family nor to the presence of already determined structures. PSI targets selected from META-families were usually chosen from fully sequenced microbial genomes, but metagenomic sequences were used to calculate their over-representation ratios and, thus, to identify META-families (see below).

Final selection of BIG families by mapping families from Pfam and from the different BIG protocols

The final target list of BIG families was defined by looking for a consensus between the families defined from Pfam and different protocols defined by the BIG4. Consensus mapping between the different BIG family resources was achieved as follows:

Each family was defined by a multiple alignment of the seed sequences. Relatives were then identified by profile-based scans of a non-redundant version of the UniProt database (UniProt Consortium, 2008). Two families were deemed to be equivalent if at least 70% of the sequences in the larger family could be matched to sequences in the smaller family, where sequences were identified as matching if they had the same UniProt ID and at least 60% of the residues in the larger sequence were equivalent to residues in the smaller sequence. Some manual inspection was undertaken to check the quality of these family assignments. Families identified by several approaches were eventually considered for assignment to PSI-2 large-scale centres.
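The equivalence criterion can be sketched in Python as follows. The sketch assumes that each family lists at most one region per UniProt ID, given as inclusive (start, end) residue positions, and that "equivalent residues" can be approximated by positional overlap of the two regions. The function names, the data layout and the accession labels are hypothetical.

```python
def region_length(region):
    start, end = region
    return end - start + 1

def regions_match(r1, r2, min_residue_overlap=0.60):
    """True if the overlap covers >= 60% of the longer region."""
    overlap = min(r1[1], r2[1]) - max(r1[0], r2[0]) + 1
    longer = max(region_length(r1), region_length(r2))
    return overlap > 0 and overlap / longer >= min_residue_overlap

def families_equivalent(fam_a, fam_b,
                        min_seq_fraction=0.70, min_residue_overlap=0.60):
    """fam_a and fam_b map UniProt ID -> (start, end) region."""
    larger, smaller = (fam_a, fam_b) if len(fam_a) >= len(fam_b) else (fam_b, fam_a)
    if not larger:
        return False
    matched = sum(
        1 for uid, region in larger.items()
        if uid in smaller
        and regions_match(region, smaller[uid], min_residue_overlap)
    )
    return matched / len(larger) >= min_seq_fraction

# Hypothetical families; the IDs are placeholders, not real accessions.
fam_x = {"ID1": (10, 110), "ID2": (5, 95), "ID3": (1, 100), "ID4": (20, 120)}
fam_y = {"ID1": (15, 105), "ID2": (1, 100), "ID3": (60, 160)}
fam_z = {"ID1": (12, 108), "ID2": (5, 95), "ID3": (1, 90), "ID4": (25, 118)}

print(families_equivalent(fam_x, fam_y))  # only 2 of 4 sequences match
print(families_equivalent(fam_x, fam_z))  # all 4 sequences match
```

Measuring the matched fraction against the larger family is the stricter choice: a small family that is merely contained in a much larger one is not reported as equivalent.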

Selection of META-families

The Godzik group performed two rounds of identification of protein families over-represented in the human gut microbiome, with the underlying aim of identifying protein families that are important for the human gut flora, unique to this environment, or significantly over-represented there. Modelling subfamilies were identified in the first round, and BIG families were identified in the second round.

1) identification of modelling subfamilies over-represented in human gut microbiome: Modelling META-subfamilies were defined as sequence clusters seeded with proteins from four bacteria isolated from human gut flora: Eubacterium rectale, Bacteroides vulgatus, Bacteroides thetaiotaomicron, and Bacteroides fragilis (made available by the Jeff Gordon laboratory, http://gordonlab.wustl.edu/). For each seed sequence, BLASTP (Altschul et al., 1990) hits were collected from two sets of sequences:

The ratio between the number of hits of a seed sequence found in the “GUT” and in “ALL” was used as a measure of over-representation of a modelling META-subfamily defined by that seed sequence. The top 20% most over-represented modelling subfamilies were distributed between the four large-scale centres using a random pick mechanism which ensured that all close homologues of any given seed sequence were assigned to the same centre.
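A minimal Python sketch of this ranking step is given below. The composition of the “GUT” and “ALL” sets is not reproduced here, so both are treated as plain per-seed hit counts. The function names, the toy counts and the decision to skip seeds with no hits in “ALL” are assumptions of the sketch, and the rule that keeps close homologues at the same centre is not modelled.

```python
def over_representation(gut_hits, all_hits):
    """Ratio of GUT hits to ALL hits for each seed sequence.

    Seeds without any hit in ALL are skipped (sketch assumption).
    """
    return {seed: gut_hits[seed] / all_hits[seed]
            for seed in gut_hits if all_hits.get(seed, 0) > 0}

def top_fraction(ratios, fraction=0.20):
    """Return the seeds in the top `fraction`, most over-represented first."""
    ranked = sorted(ratios, key=ratios.get, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]

# Hypothetical BLASTP hit counts for ten seed sequences.
gut_hits = {"s0": 40, "s1": 5, "s2": 12, "s3": 30, "s4": 2,
            "s5": 8, "s6": 9, "s7": 1, "s8": 6, "s9": 3}
all_hits = {"s0": 10, "s1": 50, "s2": 12, "s3": 5, "s4": 20,
            "s5": 16, "s6": 90, "s7": 10, "s8": 12, "s9": 30}

ratios = over_representation(gut_hits, all_hits)
selected = top_fraction(ratios)
print(selected)
```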

2) identification of novel BIG-families from human gut microbiome: the aim in this round was to identify BIG-families with no functional annotation that were over-represented in the human gut microbiome. BIG families were defined by Hidden Markov Models (HMMs) (Eddy, 1996).

Available sequences of human gut metagenomic samples were first collected (from datasets published by the Hattori lab (Kurokawa et al., 2007), and from the above-mentioned US metagenomic samples). Functionally annotated sequences were filtered out from these sets by removing all sequences with significant BLASTP hits (e-value lower than 0.001) to annotated sequences in KEGG (Kanehisa et al., 2000). The remaining sequences were clustered using PDB-Blast (Li et al., 2002) and an e-value equal to 0.001 as the clustering cut-off. The resulting clusters were expanded by collecting non-redundant homologues of all cluster members. Homologues were obtained using PSI-BLAST searches against a database that consists of the NR database and metagenomic datasets clustered at 85% sequence identity using CD-HIT (Li et al., 2006a). Multiple sequence alignments of these homologues were then constructed with CLUSTALW (Thompson et al., 1994), and were used to build HMMs (Eddy, 1996) using HMMBUILD. These HMMs, which represented BIG-families, were used to collect hits from two sets of sequences using HMMPFAM (both programs available from http://hmmer.janelia.org/):

The ratio between the number of hits in “GUT” and in “NHR” was used to define BIG families that were over-represented in the human gut microbiome. The most over-represented families were then distributed between the large-scale centres.

Structural and functional novelty, and Genome Coverage calculations

Lists of PSI-2 structures used for computing structural coverage of genomes were obtained directly from the large-scale centres, considering only structures deposited in the PDB (Berman et al., 2000) between July 1st 2005 and July 1st 2008. Corresponding lists of non-PSI structures were downloaded from the PDB website using identical date restrictions. Distinct structures have been defined as lists of structures sharing less than 98% pair-wise sequence identity, and were obtained by running CD-HIT with that cut-off and considering single representatives from all resulting CD-HIT clusters (Li et al., 2006b).

Increase in genome coverage in terms of structural modelling was computed by running PSI-BLAST against UniProt (release 12.8) for each PDB structure in turn, and by considering modelling subfamilies around each structure to decide on sequences for which the structure could be modelled.

Structural novelty was computed by considering domains from novel structures that have been classified in CATH release 3.2 (Greene et al., 2007). All domains in CATH v3.2 were structurally aligned against one another using SSAP (Orengo et al., 1996;Greene et al., 2007), and were clustered into structurally similar groups by complete clustering with a normalised RMSD cut-off of 5.0Å. The normalised RMSD score (normRMSD) is computed as follows:

$$\mathrm{normRMSD} = \mathrm{RMSD} \times \frac{\max(L_1, L_2)}{N_{\mathrm{mat}}}$$

Where RMSD is the root mean square deviation of the superposition, max(L1,L2) is the length in amino acids of the largest domain in the superposition, and Nmat is the number of aligned residue pairs (Kolodny et al., 2005). Domains that are assigned to the same cluster are considered structurally similar, and the structure from each cluster that was first deposited in the PDB is considered to be structurally novel. Fold and superfamily novelty was evaluated by considering the first structure in each CATH fold and CATH superfamily to have been deposited in the PDB.
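As a worked illustration, the score can be computed as in the Python sketch below, which takes the normalised RMSD to be the RMSD scaled by max(L1, L2)/Nmat, as defined above. The function names and the example values are hypothetical, and treating a score exactly equal to the 5.0Å cut-off as "similar" is an assumption of the sketch.

```python
def norm_rmsd(rmsd, len1, len2, n_aligned):
    """Normalised RMSD: RMSD scaled by the length of the larger
    domain over the number of aligned residue pairs."""
    if n_aligned <= 0:
        raise ValueError("n_aligned must be positive")
    return rmsd * max(len1, len2) / n_aligned

def structurally_similar(rmsd, len1, len2, n_aligned, cutoff=5.0):
    """Apply the 5.0 Angstrom cut-off (inclusive by assumption)."""
    return norm_rmsd(rmsd, len1, len2, n_aligned) <= cutoff

# Hypothetical superpositions: (RMSD in Angstrom, L1, L2, aligned pairs).
close_pair = (2.5, 150, 120, 100)   # 2.5 * 150 / 100
distant_pair = (3.0, 200, 90, 80)   # 3.0 * 200 / 80

print(norm_rmsd(*close_pair), structurally_similar(*close_pair))
print(norm_rmsd(*distant_pair), structurally_similar(*distant_pair))
```

The scaling penalises superpositions that align only a small part of the larger domain, so a low raw RMSD over a short aligned core does not by itself count as structural similarity.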

Functional novelty was evaluated by mapping PDB structures to GO terms using the PDB to GO mapping provided by the MSD at the EBI (Velankar et al., 2005).
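One way to picture this evaluation is the Python sketch below: each GO term is credited to the earliest-deposited structure annotated with it, and the structures of interest are those first representatives that fall within the PSI-2 time window. This is an assumed reading of the procedure rather than the exact protocol used. The record layout, the function names and all identifiers in the toy data are hypothetical placeholders, not real PDB entries or GO terms.

```python
from datetime import date

def first_representatives(annotations):
    """Map each GO term to the earliest-deposited structure carrying it.

    annotations: list of (pdb_id, deposition_date, set_of_go_terms).
    """
    first = {}
    for pdb_id, deposited, go_terms in annotations:
        for term in go_terms:
            if term not in first or deposited < first[term][0]:
                first[term] = (deposited, pdb_id)
    return {term: pdb_id for term, (_, pdb_id) in first.items()}

def novel_in_window(annotations, start, end):
    """Structures deposited in [start, end] that are the first
    representative of at least one GO term."""
    first = first_representatives(annotations)
    dates = {pdb_id: deposited for pdb_id, deposited, _ in annotations}
    return sorted({p for p in first.values() if start <= dates[p] <= end})

# Placeholder records (identifiers are illustrative only).
records = [
    ("old1", date(2001, 3, 1), {"GO:X"}),
    ("new1", date(2006, 5, 1), {"GO:X", "GO:Y"}),
    ("new2", date(2007, 2, 1), {"GO:Y", "GO:Z"}),
]

novel = novel_in_window(records, date(2005, 7, 1), date(2008, 7, 1))
print(novel)
```

A structure is counted once even if it is the first representative for several terms, which matches the way the text counts structures rather than terms.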

Results and statistics generated by the BIG4 groups, and presented in this article, are also available from the BIG4 website (http://psi-big4.org/).

Acknowledgments

This work was supported by a grant from the Protein Structure Initiative (PSI) of the National Institute of General Medical Sciences at the National Institutes of Health.

Abbreviations

JCSG

Joint Center for Structural Genomics

MCSG

Midwest Center for Structural Genomics

NESG

Northeast Structural Genomics Consortium

NYSGXRC

New York SGX Research Center for Structural Genomics

PSI

Protein Structure Initiative

References

  1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2. [DOI] [PubMed] [Google Scholar]
  2. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Aravind L, Anantharaman V, Koonin EV. Monophyly of class I aminoacyl tRNA synthetase, USPA, ETFP, photolyase, and PP-ATPase nucleotide-binding domains: implications for protein evolution in the RNA world. Proteins. 2002;48:1–14. doi: 10.1002/prot.10064.
  4. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235.
  5. Berman HM, Westbrook JD, Gabanyi MJ, Tao W, Shah R, Kouranov A, Schwede T, Arnold K, Kiefer F, Bordoli L, Kopp J, Podvinec M, Adams PD, Carter LG, Minor W, Nair R, LaBaer J. The protein structure initiative structural genomics knowledgebase. Nucleic Acids Res. 2009;37:D365–D368. doi: 10.1093/nar/gkn790.
  6. Blundell T. New dimensions of structural proteomics: exploring chemical and biological space. Structure. 2007;15:1342–1343. doi: 10.1016/j.str.2007.10.008.
  7. Bru C, Courcelle E, Carrere S, Beausse Y, Dalmar S, Kahn D. The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Res. 2005;33:D212–D215. doi: 10.1093/nar/gki034.
  8. Chandonia JM, Brenner SE. The impact of structural genomics: expectations and outcomes. Science. 2006;311:347–351. doi: 10.1126/science.1121018.
  9. Eddy SR. Hidden Markov models. Curr. Opin. Struct. Biol. 1996;6:361–365. doi: 10.1016/s0959-440x(96)80056-x.
  10. Fersht AR. From the first protein structures to our current knowledge of protein folding: delights and scepticisms. Nat. Rev. Mol. Cell Biol. 2008;9:650–654. doi: 10.1038/nrm2446.
  11. Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A. The Pfam protein families database. Nucleic Acids Res. 2008;36:D281–D288. doi: 10.1093/nar/gkm960.
  12. Gerlt JA. A Protein Structure (or Function ?) Initiative. Structure. 2007;15:1353–1356. doi: 10.1016/j.str.2007.10.003.
  13. Gerlt JA, Babbitt PC. Divergent evolution of enzymatic function: mechanistically diverse superfamilies and functionally distinct suprafamilies. Annu. Rev. Biochem. 2001;70:209–246. doi: 10.1146/annurev.biochem.70.1.209.
  14. Gill SR, Pop M, Deboy RT, Eckburg PB, Turnbaugh PJ, Samuel BS, Gordon JI, Relman DA, Fraser-Liggett CM, Nelson KE. Metagenomic analysis of the human distal gut microbiome. Science. 2006;312:1355–1359. doi: 10.1126/science.1124234.
  15. Goldstein RA. The structure of protein evolution and the evolution of protein structure. Curr. Opin. Struct. Biol. 2008;18:170–177. doi: 10.1016/j.sbi.2008.01.006.
  16. Greene LH, Lewis TE, Addou S, Cuff A, Dallman T, Dibley M, Redfern O, Pearl F, Nambudiry R, Reid A, Sillitoe I, Yeats C, Thornton JM, Orengo CA. The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic Acids Res. 2007;35:D291–D297. doi: 10.1093/nar/gkl959.
  17. Haft DH, Selengut JD, White O. The TIGRFAMs database of protein families. Nucleic Acids Res. 2003;31:371–373. doi: 10.1093/nar/gkg128.
  18. Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000;28:27–30. doi: 10.1093/nar/28.1.27.
  19. Kolodny R, Koehl P, Levitt M. Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J. Mol. Biol. 2005;346:1173–1188. doi: 10.1016/j.jmb.2004.12.032.
  20. Kurokawa K, Itoh T, Kuwahara T, Oshima K, Toh H, Toyoda A, Takami H, Morita H, Sharma VK, Srivastava TP, Taylor TD, Noguchi H, Mori H, Ogura Y, Ehrlich DS, Itoh K, Takagi T, Sakaki Y, Hayashi T, Hattori M. Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes. DNA Res. 2007;14:169–181. doi: 10.1093/dnares/dsm018.
  21. Allali-Hassani A, Pan PW, Dombrovski L, Najmanovich R, Tempel W, Dong A, Loppnau P, Martin F, Thornton J, Edwards AM, Bochkarev A, Plotnikov AN, Vedadi M, Arrowsmith CH. Structural and chemical profiling of the human cytosolic sulfotransferases. PLoS Biol. 2007;5:e97. doi: 10.1371/journal.pbio.0050097.
  22. Lee D, Grant A, Marsden RL, Orengo C. Identification and distribution of protein families in 120 completed genomes using Gene3D. Proteins. 2005;59:603–615. doi: 10.1002/prot.20409.
  23. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006b;22:1658–1659. doi: 10.1093/bioinformatics/btl158.
  24. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006a;22:1658–1659. doi: 10.1093/bioinformatics/btl158.
  25. Li W, Jaroszewski L, Godzik A. Sequence clustering strategies improve remote homology recognitions while reducing search times. Protein Eng. 2002;15:643–649. doi: 10.1093/protein/15.8.643.
  26. Liu J, Montelione GT, Rost B. Novel leverage of structural genomics. Nat. Biotechnol. 2007;25:849–851. doi: 10.1038/nbt0807-849.
  27. Liu J, Rost B. CHOP proteins into structural domain-like fragments. Proteins. 2004;55:678–688. doi: 10.1002/prot.20095.
  28. Marsden RL, Lee D, Maibaum M, Yeats C, Orengo CA. Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space. Nucleic Acids Res. 2006;34:1066–1080. doi: 10.1093/nar/gkj494.
  29. Marsden RL, Lewis TA, Orengo CA. Towards a comprehensive structural coverage of completed genomes: a structural genomics viewpoint. BMC Bioinformatics. 2007;8:86. doi: 10.1186/1471-2105-8-86.
  30. Marsden RL, Orengo CA. Target selection for structural genomics: an overview. Methods Mol. Biol. 2008;426:3–25. doi: 10.1007/978-1-60327-058-8_1.
  31. Moore AD, Bjorklund AK, Ekman D, Bornberg-Bauer E, Elofsson A. Arrangements in the modular evolution of proteins. Trends Biochem. Sci. 2008;33:444–451. doi: 10.1016/j.tibs.2008.05.008.
  32. Moult J. A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr. Opin. Struct. Biol. 2005;15:285–289. doi: 10.1016/j.sbi.2005.05.011.
  33. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 1995;247:536–540. doi: 10.1006/jmbi.1995.0159.
  34. Norvell JC, Berg JM. Update on the protein structure initiative. Structure. 2007;15:1519–1522. doi: 10.1016/j.str.2007.11.004.
  35. Orengo CA, Taylor WR. SSAP: sequential structure alignment program for protein structure comparison. Methods Enzymol. 1996;266:617–635. doi: 10.1016/s0076-6879(96)66038-8.
  36. Pegg SC, Brown SD, Ojha S, Seffernick J, Meng EC, Morris JH, Chang PJ, Huang CC, Ferrin TE, Babbitt PC. Leveraging enzyme structure-function relationships for functional inference and experimental design: the structure-function linkage database. Biochemistry. 2006;45:2545–2555. doi: 10.1021/bi052101l.
  37. Ponting CP, Russell RR. The natural history of protein domains. Annu. Rev. Biophys. Biomol. Struct. 2002;31:45–71. doi: 10.1146/annurev.biophys.31.082901.134314.
  38. Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007;35:D61–D65. doi: 10.1093/nar/gkl842.
  39. Ranea JA, Sillero A, Thornton JM, Orengo CA. Protein superfamily evolution and the last universal common ancestor (LUCA). J. Mol. Evol. 2006;63:513–525. doi: 10.1007/s00239-005-0289-7.
  40. Redfern OC, Dessailly B, Orengo CA. Exploring the structure and function paradigm. Curr. Opin. Struct. Biol. 2008;18:394–402. doi: 10.1016/j.sbi.2008.05.007.
  41. Reeves GA, Dallman TJ, Redfern OC, Akpor A, Orengo CA. Structural diversity of domain superfamilies in the CATH database. J. Mol. Biol. 2006;360:725–741. doi: 10.1016/j.jmb.2006.05.035.
  42. Riesenfeld CS, Schloss PD, Handelsman J. Metagenomics: genomic analysis of microbial communities. Annu. Rev. Genet. 2004;38:525–552. doi: 10.1146/annurev.genet.38.072902.091216.
  43. Sali A. 100,000 protein structures for the biologist. Nat. Struct. Biol. 1998;5:1029–1032. doi: 10.1038/4136.
  44. Shakhnovich BE, Koonin EV. Origins and impact of constraints in evolution of gene families. Genome Res. 2006;16:1529–1536. doi: 10.1101/gr.5346206.
  45. The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nat. Genet. 2000;25:25–29. doi: 10.1038/75556.
  46. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22:4673–4680. doi: 10.1093/nar/22.22.4673.
  47. Todd AE, Marsden RL, Thornton JM, Orengo CA. Progress of structural genomics initiatives: an analysis of solved target structures. J. Mol. Biol. 2005;348:1235–1260. doi: 10.1016/j.jmb.2005.03.037.
  48. Todd AE, Orengo CA, Thornton JM. Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol. 2001;307:1113–1143. doi: 10.1006/jmbi.2001.4513.
  49. UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res. 2008;36:D190–D195. doi: 10.1093/nar/gkm895.
  50. Velankar S, McNeil P, Mittard-Runte V, Suarez A, Barrell D, Apweiler R, Henrick K. E-MSD: an integrated data resource for bioinformatics. Nucleic Acids Res. 2005;33:D262–D265. doi: 10.1093/nar/gki058.
  51. Vitkup D, Melamud E, Moult J, Sander C. Completeness in structural genomics. Nat. Struct. Biol. 2001;8:559–566. doi: 10.1038/88640.
  52. Watson JD, Sanderson S, Ezersky A, Savchenko A, Edwards A, Orengo C, Joachimiak A, Laskowski RA, Thornton JM. Towards fully automated structure-based function prediction in structural genomics: a case study. J. Mol. Biol. 2007;367:1511–1522. doi: 10.1016/j.jmb.2007.01.063.
  53. Yeats C, Lees J, Reid A, Kellam P, Martin N, Liu X, Orengo C. Gene3D: comprehensive structural and functional annotation of genomes. Nucleic Acids Res. 2008;36:D414–D418. doi: 10.1093/nar/gkm1019.