Unexpected features of the dark proteome - PubMed

Proc Natl Acad Sci U S A. 2015 Dec 29;112(52):15898-903.

doi: 10.1073/pnas.1508380112. Epub 2015 Nov 17.

Nelson Perdigão et al. Proc Natl Acad Sci U S A. 2015.

Abstract

We surveyed the "dark" proteome, that is, regions of proteins never observed by experimental structure determination and inaccessible to homology modeling. For 546,000 Swiss-Prot proteins, we found that 44-54% of the proteome in eukaryotes and viruses was dark, compared with only ∼14% in archaea and bacteria. Surprisingly, most of the dark proteome could not be accounted for by conventional explanations, such as intrinsic disorder or transmembrane regions. Nearly half of the dark proteome comprised dark proteins, in which the entire sequence lacked similarity to any known structure. Dark proteins fulfill a wide variety of functions, but a subset showed distinct and largely unexpected features, such as association with secretion, specific tissues, the endoplasmic reticulum, disulfide bonding, and proteolytic cleavage. Dark proteins also had short sequence length, low evolutionary reuse, and few known interactions with other proteins. These results suggest new research directions in structural and computational biology.

Keywords: protein disorder; secreted proteins; structure prediction; transmembrane proteins; unknown unknowns.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.

Mapping the dark proteome. (A) For all proteins in Swiss-Prot, each residue was classified into one of four categories: (i) PDB regions—residues exactly matched to a PDB entry in Aquaria; (ii) gray regions—residues aligned to at least one PDB entry in Aquaria but always with amino acid substitutions (dark gray); (iii) dark regions—residues with no matching PDB entry in Aquaria; and (iv) dark proteins, where a single dark region spans the entire sequence. (B) We then calculated the total fraction of residues in each of the above four categories for all proteins in eukaryotes, bacteria, archaea, and viruses. The dark proteome (i.e., the fraction of residues in dark proteins or dark regions) varies from 13% (bacteria) to 54% (viruses).
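The residue classification in (A) and the pooled fractions in (B) can be sketched as follows. This is only an illustration, not the authors' pipeline: the function names and the input format (sets of residue positions with exact or substituted PDB matches, standing in for Aquaria alignments) are assumptions.

```python
from collections import Counter

def classify_residues(length, exact, substituted):
    """Label each residue of one protein as 'pdb', 'gray', or 'dark'.

    exact / substituted are sets of 0-based positions (hypothetical
    inputs standing in for Aquaria alignment results).
    """
    labels = []
    for pos in range(length):
        if pos in exact:            # exactly matched to a PDB entry
            labels.append("pdb")
        elif pos in substituted:    # aligned, but only with substitutions
            labels.append("gray")
        else:                       # no matching PDB entry
            labels.append("dark")
    return labels

def category_fractions(proteins):
    """Pool residues over proteins (non-empty label lists) into four categories."""
    counts = Counter(pdb=0, gray=0, dark_region=0, dark_protein=0)
    for labels in proteins:
        if labels and all(label == "dark" for label in labels):
            # One dark region spans the entire sequence: a dark protein.
            counts["dark_protein"] += len(labels)
            continue
        for label in labels:
            counts["dark_region" if label == "dark" else label] += 1
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

# Toy data: one partly characterized protein and one dark protein.
partly = classify_residues(10, exact={0, 1, 2, 3}, substituted={4, 5, 6})
fully_dark = classify_residues(10, exact=set(), substituted=set())
fractions = category_fractions([partly, fully_dark])
dark_proteome = fractions["dark_region"] + fractions["dark_protein"]
```

Here `dark_proteome` follows the legend's definition: the fraction of residues in dark regions or dark proteins.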

Fig. S1.

A more stringent definition of the dark proteome. Similar distributions are plotted as for Fig. 1B but using a more stringent definition of darkness (Defining Darkness More Stringently) that excludes any residue matching a structure in either Aquaria or PMP (2). The only change is a very slight reduction in darkness for eukaryotes.

Fig. 2.

Darkness vs. disorder, compositional bias, and transmembrane fraction for 178,692 eukaryotic proteins. Overall, these three factors explain only a small part of the dark proteome. Corresponding plots for bacteria, archaea, and viruses are in Fig. S3. In each 2D plot, dark proteins cluster on the line at darkness = 100%. Density plots A, B, D, and F are shown in more detail in Fig. S2. (A) The distribution of darkness was bimodal: 50% of proteins had ≤28% dark residues; 20% had 100% darkness. (B) The distribution of disorder was also bimodal: 50% of dark proteins had ≤10% disordered residues, whereas 4% had 100% disorder; for nondark proteins, 50% had ≤6% disorder, whereas 1% had 100% disorder. Median disorder was much less than median darkness (28%), implying that most of the dark proteome was not disordered. (C) Two-dimensional plot shows that darkness > disorder for most proteins (dotted line), implying that most disordered residues were dark and many dark residues were not disordered. (D) Compositional bias was 0% in most proteins and slightly more prevalent in dark proteins. (E) Two-dimensional plot shows that darkness > compositional bias for most proteins (dotted line), implying that most compositionally biased residues were dark and many dark residues were not compositionally biased. (F) Most dark proteins had no transmembrane residues (see Fig. S4 for details). (G) Two-dimensional plot shows that darkness > transmembrane fraction for many proteins (gray dotted line), implying that many dark residues were not transmembrane. Most proteins occur in the region where darkness + transmembrane ≤ 1 (orange dotted line), implying that dark and transmembrane regions were mostly disjoint.

Fig. S2.

On interpreting the density plots in this work. This figure shows the full range for the density plots from Fig. 2, for darkness (A), disorder (B), compositional bias (C), and transmembrane fraction (D) in eukaryotes. Because peaks occur at 0% and 100%, the kernel density method used to create these plots places some of the density <0% and >100%, which could not be shown in Fig. 2 and Fig. S3. For all density plots in this work, the density values (y axis) have been scaled, so that the total area under the curve equals 1. The density values therefore depend on the range of values on the x axis and will be large when x values have a small range (as shown here, where 0 ≤ x ≤ 1) and small when x values have a large range (e.g., Fig. 4B, where 0 ≤ x ≤ 150). See Density Plots for further details.
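Both effects described in this legend can be reproduced with a small kernel density sketch. The Gaussian kernel, the bandwidth of 0.05, and the toy data are assumptions chosen for illustration; the kernel and bandwidth actually used for the figures are not stated here.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a density function whose total area is 1 (Gaussian kernel assumed)."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)
    return density

def integrate(f, lo, hi, steps=2000):
    """Trapezoid-rule integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h

# Toy fractions with peaks at 0% and 100%, as in the darkness distribution.
values = [0.0] * 50 + [0.5] * 30 + [1.0] * 20
kde_unit = gaussian_kde(values, bandwidth=0.05)

total_area = integrate(kde_unit, -1.0, 2.0)   # whole curve
inside_area = integrate(kde_unit, 0.0, 1.0)   # the part visible on a 0-100% axis

# Same data on a wider x range (0 to 150), with the bandwidth scaled to match.
kde_wide = gaussian_kde([150.0 * v for v in values], bandwidth=150.0 * 0.05)
```

Half of each boundary peak falls outside the 0 to 1 interval, so `inside_area` is well below `total_area`. Stretching the x range by a factor of 150 divides every density value by 150, which is why y-axis values cannot be compared between plots with different x ranges.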

Fig. S3.

Darkness vs. disorder, compositional bias, and transmembrane fraction in bacteria, archaea, and viruses. This figure shows equivalent plots to those in Fig. 2 (see the legend of Fig. 2 for details on each part). (A) For 331,559 bacterial proteins, the distribution of darkness was bimodal: 50% of proteins had ≤4% dark residues; 7% had 100% darkness. The disorder distribution shows that, surprisingly, dark proteins had lower median disorder (0%) than nondark proteins (3%). The 2D plot of disorder vs. darkness is distinctly different from that in eukaryotes (Fig. 2C); most proteins occur in the region where darkness + disorder ≤ 1 (dotted line), implying that dark and disordered regions were mostly disjoint. The plots for compositional bias and transmembrane fraction are similar to eukaryotes (Fig. 2 E–G), implying that: most dark proteins had no compositional bias and no transmembrane regions; most compositionally biased residues were dark but most dark residues were not compositionally biased; and dark and transmembrane regions were mostly disjoint. (B) For 19,270 archaeal proteins, the distribution of darkness was bimodal: 50% of proteins had ≤4% dark residues; 8% had 100% darkness. The disorder distribution shows that, surprisingly, dark proteins had lower median disorder (0%) than nondark proteins (1%). The 2D plot of disorder vs. darkness is similar to bacteria, implying that dark and disordered regions were mostly disjoint. The plots for compositional bias and transmembrane fraction are similar to bacteria and eukaryotes (Fig. 2 E–G), implying that: most dark proteins had no compositional bias and no transmembrane regions; most compositionally biased residues were dark but most dark residues were not compositionally biased; and dark and transmembrane regions were mostly disjoint. (C) For 16,479 viral proteins, darkness was more evenly distributed than in archaea, bacteria, or eukaryotes: 50% of proteins had ≤65% darkness; 44% had 100% darkness. The disorder distribution shows that, surprisingly, dark proteins had lower median disorder (3%) than nondark proteins (5%). The 2D plot of disorder vs. darkness is distinctly different from eukaryotes (Fig. 2C), bacteria, and archaea; the almost random distribution implies that darkness had almost no relationship to disorder in viruses. The orange rectangle indicates a group of viral proteins regularly spaced along the horizontal direction, a pattern reoccurring several times on the plot; these are proteins from strains of the same virus that vary in the number of disordered residues. The relatively frequent occurrence of this pattern is consistent with the hypothesis that variation in disordered regions is a key aspect of viral strategies to hijack cell regulation (39). The plots for compositional bias and transmembrane fraction are similar to archaea, bacteria, and eukaryotes (Fig. 2 E–G), implying that: most dark proteins had no compositional bias and no transmembrane regions; most compositionally biased residues were dark but most dark residues were not compositionally biased; and dark and transmembrane regions were mostly disjoint.

Fig. S4.

Zoomed-in transmembrane distributions for dark vs. nondark proteins. (A) Zoomed-in view of Fig. 2F comparing the fraction of transmembrane residues found in dark and nondark eukaryotic proteins. A slightly higher proportion of dark proteins have >10% transmembrane residues, although interestingly a larger fraction of nondark proteins have ∼50% transmembrane residues. (B) Zoomed-in view of the transmembrane density plot in Fig. S3A comparing dark and nondark bacterial proteins. A much larger proportion of dark proteins have >10% transmembrane residues, with a pronounced peak at ∼55%. (C) Zoomed-in view of the transmembrane density plot in Fig. S3B comparing dark and nondark archaeal proteins. A much larger proportion of dark proteins have >10% transmembrane residues, with a broad peak at ∼45–60%. (D) Zoomed-in view of the transmembrane density plot in Fig. S3C comparing dark and nondark viral proteins. Overall, only a slightly higher proportion of dark proteins have >10% transmembrane residues, and the density of both dark and nondark proteins is much lower in this range than for eukaryotes, bacteria, or archaea.

Fig. S5.

Transmembrane fraction vs. darkness. In each histogram, proteins have been binned into six groups according to their darkness score (darkness = 0%, 0% < darkness < 25%, 25% ≤ darkness < 50%, 50% ≤ darkness < 75%, 75% ≤ darkness < 100%, and darkness = 100%). We then calculated the average fraction of transmembrane residues across all proteins in each bin. (A) Surprisingly, for eukaryotic proteins, the largest fraction of transmembrane residues was seen for proteins with 0% darkness, and the fraction tended to decrease with increasing darkness, although rising somewhat for dark proteins (100% darkness). (B) Bacterial proteins show nearly the opposite behavior: the smallest fraction of transmembrane residues was seen for proteins with 0% darkness and the largest for proteins with 100% darkness. Interestingly, however, there was a dip in transmembrane fraction for proteins with 75% ≤ darkness < 100%. (C) Archaeal proteins show a similar overall pattern to bacteria: the transmembrane fraction tended to increase with increasing darkness, although there was a dip in transmembrane fraction for proteins with 50% ≤ darkness < 100%. (D) Overall, viral proteins have much lower transmembrane fraction and relatively little dependency on darkness.
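The six-bin grouping and per-bin averaging described above can be sketched as below. The function names and the toy `(darkness, transmembrane fraction)` pairs are hypothetical; fractions are expressed on a 0 to 1 scale rather than as percentages.

```python
BIN_LABELS = [
    "darkness = 0%",
    "0% < darkness < 25%",
    "25% <= darkness < 50%",
    "50% <= darkness < 75%",
    "75% <= darkness < 100%",
    "darkness = 100%",
]

def darkness_bin(darkness):
    """Return the bin index (0..5) for a darkness fraction in [0, 1]."""
    if darkness == 0.0:
        return 0
    if darkness == 1.0:
        return 5
    if darkness < 0.25:
        return 1
    if darkness < 0.50:
        return 2
    if darkness < 0.75:
        return 3
    return 4

def mean_transmembrane_by_bin(proteins):
    """proteins: iterable of (darkness, transmembrane_fraction) pairs.

    Returns one mean per bin, or None for an empty bin.
    """
    sums = [0.0] * len(BIN_LABELS)
    counts = [0] * len(BIN_LABELS)
    for darkness, tm_fraction in proteins:
        index = darkness_bin(darkness)
        sums[index] += tm_fraction
        counts[index] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

# Toy input, not data from the study.
toy = [(0.0, 0.2), (0.0, 0.4), (0.1, 0.1), (0.25, 0.0),
       (0.6, 0.05), (0.99, 0.0), (1.0, 0.3)]
means = mean_transmembrane_by_bin(toy)
```

Treating exactly 0% and exactly 100% as their own bins keeps fully characterized proteins and dark proteins separate from the partially dark ones, which is what lets the histograms show the rise at 100% darkness.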

Fig. 3.

Known vs. unknown dark proteins. Each linear diagram (38) shows known dark proteins [i.e., those with ≥25% of residues disordered (magenta), compositionally biased (blue), transmembrane (green), or both disordered and compositionally biased (stripes)]. The remaining fraction (gray) are unknown unknowns (i.e., dark proteins predominately ordered, globular, and low in compositional bias). (A) In eukaryotes, high disorder accounted for most of the known dark proteins. Most dark proteins with high compositional bias were also highly disordered. (B and C) In bacteria and archaea, highly transmembrane proteins accounted for most of the known dark proteins (consistent with Figs. S4 and S5). (D) Viruses had the largest unknown unknown fraction and, like eukaryotes, had a large fraction of highly disordered dark proteins.
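The split into known and unknown dark proteins amounts to a threshold rule, sketched here. The 25% cutoff comes from the legend; the function name, label strings, and per-protein fractions are illustrative assumptions.

```python
THRESHOLD = 0.25  # a property "explains" a dark protein at >= 25% of residues

def categorize_dark_protein(disorder, bias, transmembrane, threshold=THRESHOLD):
    """Return the labels that apply to one dark protein.

    Inputs are per-protein residue fractions in [0, 1]. A protein with
    no label at or above the threshold is an 'unknown unknown'.
    """
    labels = set()
    if disorder >= threshold:
        labels.add("disordered")
    if bias >= threshold:
        labels.add("compositionally biased")
    if transmembrane >= threshold:
        labels.add("transmembrane")
    return labels or {"unknown unknown"}

# Toy examples, not data from the study.
both = categorize_dark_protein(disorder=0.60, bias=0.30, transmembrane=0.0)
membrane = categorize_dark_protein(disorder=0.05, bias=0.0, transmembrane=0.40)
unexplained = categorize_dark_protein(disorder=0.10, bias=0.0, transmembrane=0.0)
```

Returning a set rather than a single label preserves overlaps such as the striped "both disordered and compositionally biased" group in the diagrams.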

Fig. S6.

Disorder, compositional bias, and transmembrane fraction for nondark proteins. Each linear diagram (38) shows the fraction of nondark proteins with ≥25% of residues disordered (magenta), compositionally biased (blue), transmembrane (green), or both disordered and compositionally biased (stripes). The remaining fractions (gray) are nondark proteins predominately ordered, globular, and low in compositional bias; in Fig. S7, these proteins are compared with the corresponding dark proteins (gray fractions in Fig. 3). The figure shows data from eukaryotes (A), bacteria (B), archaea (C), and viruses (D). Note that in eukaryotic nondark proteins (A), the difference in gray fraction compared with dark proteins (Fig. 3A) is smaller than may be expected.

Fig. S7.

Amino acid composition in dark vs. nondark proteins. (A–D) We used linear discriminant analysis to examine differences in amino acid composition for dark and nondark proteins that were ordered, globular, and low in compositional bias (i.e., proteins corresponding to the gray regions in Fig. 3 and Fig. S6). In all cases, we found highly significant differences (Welch t test, P < 10^−15) along the first linear discriminant coefficient (LD1). On each box plot, the thick central vertical bar indicates the median value; the shaded region shows the interquartile range (estimated span of 50% of data); dotted lines show the interdecile range (estimated span of 99.3% of data). (E–H) Averaged across all organisms, the largest difference in amino acid composition was seen for cysteine, which increased by 25% (from 1.71 to 2.13% composition) in dark proteins; this finding is consistent with the observation that disulfide bonds, cysteine frameworks, and disulfide-rich knottins are overrepresented in dark proteins (Fig. 6). The next two largest differences were seen for phenylalanine and tryptophan, which increased in dark proteins by an average of 18% and 14%, respectively; these are reported to be the most increased amino acids in transmembrane vs. nontransmembrane proteins (30).
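The cysteine figure is a relative change in composition, not a difference in percentage points, as the short check below shows. The helper names and the example sequence are hypothetical; the 1.71 and 2.13 values are taken from the legend.

```python
from collections import Counter

def composition_percent(sequence):
    """Percentage of each amino acid (one-letter code) in a non-empty sequence."""
    counts = Counter(sequence)
    return {aa: 100.0 * n / len(sequence) for aa, n in counts.items()}

def relative_change_percent(before, after):
    """Change from `before` to `after`, as a percentage of `before`."""
    return 100.0 * (after - before) / before

# Cysteine composition: 1.71% in nondark vs. 2.13% in dark proteins.
cys_change = relative_change_percent(1.71, 2.13)  # about 24.6, reported as 25%

# Hypothetical short sequence, only to show the composition helper.
example = composition_percent("ACCA")
```

The absolute difference is only 0.42 percentage points; the 25% describes how much larger the cysteine share is relative to its nondark baseline.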

Fig. 4.

Length, interactions, and evolutionary reuse for dark vs. nondark eukaryotic proteins. In each case, dark proteins had significantly lower values overall compared with nondark proteins (signed Kolmogorov–Smirnov test, P ≪ 10^−4). Corresponding plots for bacteria, archaea, and viruses are in Fig. S8. (A) Dark proteins had shorter sequence length (median of 140 fewer amino acids, or 37% shorter). (B) Dark proteins had fewer interactions with other proteins. Note that the small peaks at ∼110 interactions arise from ribosomal proteins. (C) Dark proteins had lower evolutionary reuse. In A and C, note that to interpret the y-axis values as true density scores, x values must be transformed using log base 10 (i.e., 100 becomes 2, etc.).

Fig. S8.

Length, interactions, and evolutionary reuse for dark vs. nondark proteins from bacteria, archaea, and viruses. In each case, dark proteins had significantly lower values overall compared with nondark proteins (signed Kolmogorov–Smirnov test, P ≪ 10^−4). (A) For bacteria, dark proteins had shorter sequence length (median of 86 fewer amino acids, or 31% shorter), fewer interactions with other proteins, and lower evolutionary reuse. (B) For archaea, dark proteins had shorter sequence length (median of 87 fewer amino acids, or 34% shorter), fewer interactions with other proteins, and lower evolutionary reuse. The differences between dark and nondark archaeal proteins were generally greater than for bacteria. (C) For viruses, dark proteins had shorter sequence length (median of 198 fewer amino acids, or 50% shorter) and lower evolutionary reuse. The evolutionary reuse scores were much lower than for eukaryotes, bacteria, and archaea. No protein–protein interaction data were available for viruses in the resource used in this study (33). Note that the small peaks at ∼110 interactions arise from ribosomal proteins. Note also that, for the length and evolutionary reuse plots, to interpret the y-axis values as true density scores, the x values must be transformed using log base 10 (so 100 becomes 2, etc.).

Fig. 5.

Cellular locations over- and underrepresented in dark proteins. Pooling annotations for all eukaryotic proteins, we determined which subcellular compartments were enriched in dark proteins; these proteins were most strongly overrepresented in the extracellular space, followed by the endoplasmic reticulum and then the plasma membrane. Dark proteins were underrepresented among cytoplasmic proteins.

Fig. 6.

Functional annotations over- or underrepresented in dark proteins. Pooling annotations for all eukaryotic proteins, we used enrichment analysis to find biological functions associated with dark proteins (Dataset S2). The tree map shows all over- and underrepresented annotations (dark gray and blue, respectively) in eight functional categories; cell area indicates annotation significance [scaled to –log10(P), using the adjusted P value from Fisher’s exact test]. Dark proteins were overrepresented in many specific secretory tissues and underrepresented only in three “tissue” annotations: “Red blood cells,” “Ubiquitous,” and “Widely expressed” (text not shown). Dark proteins were also overrepresented in cysteine-rich domains and disulfide bonds (of all dark proteins with annotated posttranslational modifications, 16% had disulfide bonds compared with 6.4% for nondark proteins). Dark proteins were underrepresented in many “Catalytic site” and “Pathway” annotations, where inference often requires similarity to a PDB structure.
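The enrichment scoring behind the tree map can be illustrated with a one-sided Fisher's exact test on a 2×2 table, followed by the −log10(P) scaling. This is a minimal sketch under stated assumptions: the counts are invented to mirror the 16% vs. 6.4% disulfide proportions, the function name is hypothetical, and no multiple-testing adjustment is applied because the adjustment method is not stated here.

```python
import math

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact P value for overrepresentation.

    Table layout (rows: dark / nondark, columns: with / without annotation):
        [[a, b],
         [c, d]]
    Returns P(X >= a) under the hypergeometric null. Needs Python 3.8+.
    """
    row1 = a + b          # dark proteins
    col1 = a + c          # proteins carrying the annotation
    n = a + b + c + d     # all proteins
    upper = min(row1, col1)
    numerator = sum(
        math.comb(row1, k) * math.comb(n - row1, col1 - k)
        for k in range(a, upper + 1)
    )
    return numerator / math.comb(n, col1)

# Hypothetical counts: 16 of 100 dark proteins vs. 64 of 1,000 nondark
# proteins carry a disulfide bond annotation (16% vs. 6.4%).
p_value = fisher_exact_greater(16, 84, 64, 936)
cell_area = -math.log10(p_value)  # larger area means stronger evidence
```

Exact integer arithmetic with `math.comb` avoids floating-point underflow for the table sizes shown; for proteome-scale counts a library routine such as SciPy's `fisher_exact` would be the more practical choice.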

References

    1. Berman HM, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28(1):235–242.
    2. Haas J, et al. The Protein Model Portal--A comprehensive resource for protein structure and model information. Database (Oxford) 2013;2013:bat031.
    3. Petrey D, et al. Template-based prediction of protein function. Curr Opin Struct Biol. 2015;32:33–38.
    4. Chothia C. Proteins. One thousand families for the molecular biologist. Nature. 1992;357(6379):543–544.
    5. Holm L, Sander C. Mapping the protein universe. Science. 1996;273(5275):595–603.