pubmed.ncbi.nlm.nih.gov

High GC content causes orphan proteins to be intrinsically disordered - PubMed

Sun Jan 01 2017

Walter Basile et al. PLoS Comput Biol. 2017.

Abstract

De novo creation of protein-coding genes involves the formation of short ORFs from noncoding regions; some of these ORFs might then become fixed in the population. These orphan proteins need to, at the bare minimum, not cause serious harm to the organism, meaning that they should, for instance, not aggregate. Therefore, although the creation of short ORFs could be truly random, their fixation should be subject to some selective pressure. The selective forces acting on orphan proteins have been elusive, and contradictory results have been reported. In Drosophila, young proteins are more disordered than ancient ones, while the opposite trend is present in yeast. To the best of our knowledge, no valid explanation for this difference has been proposed. To solve this riddle, we studied the structural properties and age of proteins in 187 eukaryotic organisms. We find that, with the exception of length, there are only small differences in properties between proteins of different ages. However, when we take GC content into account, we find that it can explain the opposite trends observed for orphans in yeast (low GC) and Drosophila (high GC). GC content is correlated with codons coding for disorder-promoting amino acids. This leads us to propose that intrinsic disorder is not a strong determining factor for fixation of orphan proteins. Instead, these proteins largely resemble random proteins given a particular GC level. During evolution, the properties of a protein change faster than the GC level, causing the relationship between disorder and GC to gradually weaken.
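GC content is the central variable in the abstract. As a minimal sketch, assuming it is measured as the fraction of G and C among the unambiguous bases of a coding sequence (the function name `gc_content` and the example sequence are illustrative, not taken from the paper), it could be computed like this:

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the A, C, G and T bases of a sequence.

    Assumption: ambiguous symbols such as N are ignored rather than counted.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return gc / (gc + at)


# Toy coding sequence (made up for illustration): 7 of its 12 bases are G or C.
print(gc_content("ATGGCGCCGTAA"))
```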

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Overview of the proteins assigned to the four age groups: (a) the fraction of proteins belonging to each age group, (b) the average length, in amino acids, (c) the average GC content of the genes, (d) intrinsic disorder predicted by IUpred (long), (e) percentage of transmembrane residues, (f) percentage of residues in low-complexity regions, and percentage of residues predicted to be in (g) a coil, (h) a β-sheet and (i) a helix.

The difference between orphans and ancient is statistically significant for all the considered properties: the p-value of a rank-sum test is always < 10^-141.

Fig 2. For six selected species ((a,b) two strains of S. cerevisiae, (c) C. albicans, (d) D. melanogaster, (e) D. sechellia and (f) C. elegans), intrinsic disorder (% of amino acids predicted as disordered by IUpred long) is shown as violin plots for proteins in the different age groups.

Fig 3. Structural properties of proteins of different ages plotted against the GC content of the genome (coding regions).

For clarity, only the ancient (blue) and orphan (red) proteins are shown individually; fitted linear regression lines are also shown for genus orphans (pink) and intermediate proteins (light blue). The text box in each panel gives three values: the rank-sum p-value, from a rank-sum test of orphan versus ancient proteins on the y-axis property alone, and two correlation p-values, from linear regression tests for orphan and for ancient proteins.

Fig 4. Running averages of predicted structural properties against GC content: (a) disorder, predicted by IUpred (long); (b) low complexity, predicted by SEG; (c) percentage of transmembrane residues predicted by Scampi; (d,e,f) percentage of residues in secondary structure of type, respectively, coil, beta sheet and alpha helix.

For each property, colored lines represent proteins of different age: orphans (red), genus orphans (pink), intermediate (light blue) and ancient (blue). The black lines represent randomly generated proteins at different GC frequencies.
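The page does not describe how the randomly generated proteins were produced. One plausible sketch, assuming each nucleotide is drawn independently with P(G) = P(C) = gc/2 and P(A) = P(T) = (1 - gc)/2, and that stop codons are simply redrawn, is shown below. The function name `random_protein` and both assumptions are illustrative rather than the authors' procedure; the codon table is the standard genetic code.

```python
import random
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def random_protein(length: int, gc: float, rng: random.Random) -> str:
    """Translate random codons drawn at a target GC frequency (assumed model)."""
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be between 0 and 1")
    # Weights follow the order of BASES: T, C, A, G.
    weights = [(1 - gc) / 2, gc / 2, (1 - gc) / 2, gc / 2]
    residues = []
    while len(residues) < length:
        codon = "".join(rng.choices(BASES, weights=weights, k=3))
        aa = CODON_TABLE[codon]
        if aa != "*":  # assumption: stop codons are skipped and redrawn
            residues.append(aa)
    return "".join(residues)


rng = random.Random(0)
for gc in (0.3, 0.7):
    protein = random_protein(20000, gc, rng)
    counts = Counter(protein)
    # Compare the share of a few residues at low and high GC.
    print(gc, {aa: round(counts[aa] / len(protein), 3) for aa in "KGAP"})
```

Sequences generated this way could then be passed to the same predictors as the real proteins, which is what a random baseline of this kind would need.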

Fig 5. Running averages of structural properties computed from amino acid scales against GC content: (a) Intrinsic Disorder Propensity (TOP-IDP); (b) hydrophobicity (Hessa scale); (c,d,e,f) average propensity for secondary structure of type, respectively, turn, coil, beta sheet and alpha helix.

For each property, colored lines represent proteins of different age: orphans (red), genus orphans (pink), intermediate (light blue) and ancient (blue). The black lines represent randomly generated proteins at different GC frequencies.
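The captions do not state how the running averages were computed. A simple sketch, assuming a sliding window over proteins sorted by GC content (the window size, the function name `running_average` and the toy data are all illustrative), is:

```python
def running_average(points, window):
    """Sliding-window means over (gc, value) pairs sorted by GC content.

    Returns one (mean GC, mean value) pair per window of `window`
    consecutive points. This is an assumed scheme, not the paper's.
    """
    if window < 1 or window > len(points):
        raise ValueError("window must be between 1 and the number of points")
    ordered = sorted(points)
    averaged = []
    for start in range(len(ordered) - window + 1):
        chunk = ordered[start:start + window]
        mean_gc = sum(gc for gc, _ in chunk) / window
        mean_value = sum(value for _, value in chunk) / window
        averaged.append((mean_gc, mean_value))
    return averaged


# Toy (GC content, property value) pairs, made up purely for illustration.
toy = [(0.3, 10.0), (0.1, 20.0), (0.2, 30.0), (0.4, 40.0)]
for mean_gc, mean_value in running_average(toy, window=2):
    print(round(mean_gc, 2), round(mean_value, 2))
```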

Fig 6. The relationship of each amino acid frequency with the GC content and age of the protein.

A black line represents the expected values. The amino acids are sorted by the GC content in their codons.
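The caption does not say how the expected values were derived. If they correspond to random sequences at a given GC level, they can be computed exactly rather than sampled. The sketch below assumes independent nucleotides with P(G) = P(C) = gc/2 and P(A) = P(T) = (1 - gc)/2, and renormalises after excluding stop codons; the function name `expected_aa_frequencies` is illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def expected_aa_frequencies(gc: float) -> dict:
    """Expected amino acid frequencies in random codons at GC level `gc`."""
    base_prob = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    freq = {}
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
        if aa == "*":
            continue  # assumption: stop codons are excluded
        prob = base_prob[codon[0]] * base_prob[codon[1]] * base_prob[codon[2]]
        freq[aa] = freq.get(aa, 0.0) + prob
    total = sum(freq.values())
    return {aa: p / total for aa, p in freq.items()}


for gc in (0.3, 0.5, 0.7):
    f = expected_aa_frequencies(gc)
    print(gc, {aa: round(f[aa], 3) for aa in "GAPKNI"})
```

At gc = 0.5 every codon is equally likely, so each amino acid's expected frequency is its number of codons divided by the 61 sense codons.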

Fig 7. The fraction of GC in all codons encoding an amino acid is plotted as a dotted line and the values for the different propensity scales as filled bars: (a) TOP-IDP, (b) Hessa transmembrane scale, and (c-f) Koehl secondary structure preference scale.

For each scale the Pearson (R) correlation with GC is also shown.
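The dotted line can be approximated directly from the standard genetic code. The sketch below assumes that "fraction of GC in all codons encoding an amino acid" means the mean GC fraction over that amino acid's synonymous codons, weighted equally; the name `gc_fraction_by_amino_acid` is illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def gc_fraction_by_amino_acid() -> dict:
    """Mean GC fraction across the synonymous codons of each amino acid."""
    gc_bases = {}    # amino acid -> number of G/C bases over all its codons
    codon_count = {}  # amino acid -> number of codons
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
        if aa == "*":
            continue  # stop codons encode no amino acid
        gc_bases[aa] = gc_bases.get(aa, 0) + sum(base in "GC" for base in codon)
        codon_count[aa] = codon_count.get(aa, 0) + 1
    return {aa: gc_bases[aa] / (3 * codon_count[aa]) for aa in gc_bases}


# List amino acids from the most GC-poor to the most GC-rich codon sets.
for aa, frac in sorted(gc_fraction_by_amino_acid().items(), key=lambda kv: kv[1]):
    print(aa, round(frac, 3))
```

Given a propensity scale supplied as a dictionary keyed by amino acid, a Pearson correlation with these fractions could be computed with `statistics.correlation` (Python 3.10 or later). No scale values are reproduced here.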

Grants and funding

This work was supported by grants from the Swedish Research Council (http://www.vr.se/, VR-NT 2012-5046, VR-M 2010-3555) and the Swedish E-science Research Center (SeRC, www.e-science.se). Computational resources were provided by the Swedish National Infrastructure for Computing (SNIC, http://www.snic.vr.se/). SL was financed by Bioinformatics Infrastructure for Life Science (BILS, www.bils.se). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
