Young Genes are Highly Disordered as Predicted by the Preadaptation Hypothesis of De Novo Gene Birth

Benjamin A Wilson et al. Nat Ecol Evol. 2017 Jun.

Abstract

The phenomenon of de novo gene birth from junk DNA is surprising, because random polypeptides are expected to be toxic. There are two conflicting views about how de novo gene birth is nevertheless possible: the continuum hypothesis invokes a gradual gene birth process, while the preadaptation hypothesis predicts that young genes will show extreme levels of gene-like traits. We show that intrinsic structural disorder conforms to the predictions of the preadaptation hypothesis and falsifies the continuum hypothesis, with all genes having higher levels than translated junk DNA, but young genes having the highest level of all. Results are robust to homology detection bias, to the non-independence of multiple members of the same gene family, and to the false positive annotation of protein-coding genes.

Figures

Fig. 1. The continuum and preadaptation hypotheses make incompatible predictions about the properties of intergenic sequences relative to young vs. old genes.
Fig. 2. Young genes have higher ISD (black circles) than old genes.

This result from the analysis of 15,347 mouse genes is unchanged by correction for evolutionary rate, and only becomes stronger after correction for length (green squares). Back-transformed central tendency estimates +/- one standard error come from a linear mixed model, where gene family, phylostratum, and length are random, fixed, and quantitative terms respectively. Importantly, this means that we do not treat genes as independent data points, but instead take into account phylogenetic confounding, and use gene families as independent data points. Length-corrected ISD values are with respect to a standardized length of 179 amino acids. Both young genes and old genes have higher ISD than intergenic sequences (blue diamond) and repeat-masked intergenic sequences (light blue triangle). Phylostrata on the x-axis are labeled according to the clade in which the oldest detectable homolog of a gene can be found. To minimize homology detection bias, the oldest phylostrata have been condensed into a single Pre-vertebrate phylostratum.
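
The point about not treating genes as independent data points can be illustrated with a much cruder stand-in for the mixed model described above: average ISD within each gene family first, then average across families, so that a large family counts once rather than once per member. This is only a sketch of the idea, not the authors' linear mixed model, and the family names and ISD values below are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def naive_mean(records):
    """Mean ISD treating every gene as an independent data point."""
    return mean(isd for _family, isd in records)


def family_level_mean(records):
    """Mean of per-family mean ISD, so each gene family counts once."""
    by_family = defaultdict(list)
    for family, isd in records:
        by_family[family].append(isd)
    return mean(mean(values) for values in by_family.values())


# Hypothetical (gene family, ISD) pairs for one phylostratum.
# famA has four paralogs with similar ISD; famB and famC have one gene each.
records = [
    ("famA", 0.60), ("famA", 0.62), ("famA", 0.58), ("famA", 0.60),
    ("famB", 0.20),
    ("famC", 0.30),
]

print("naive mean:       ", round(naive_mean(records), 3))
print("family-level mean:", round(family_level_mean(records), 3))
```

In this toy data the four famA paralogs pull the naive mean upward, while the family-level mean weights the three families equally. A real mixed model estimates the family effect rather than simply averaging, and also handles the length covariate.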

Fig. 3. In agreement with many previous studies, young genes evolve faster (A) and are shorter (B).

These properties are directly causal for homology detection bias, and hence there is no way to produce bias-corrected values as for Fig. 2. However, the statistical insignificance of rate-correction in Fig. 2 suggests that homology detection bias is negligible. Back-transformed central tendency estimates +/- one standard error come from a linear mixed model, where gene family and phylostratum are random and fixed terms respectively.

Fig. 4. Elevated ISD can be broken down into contributions from amino acid composition and from exact amino acid order.

(A) ISD in real proteins (black circles) relative to amino acid scrambled controls (orange squares), and controls generated to have matched GC content (yellow diamonds), with error bars showing the back transform of the central tendency estimates +/- 1 standard error derived from mixed models as in Fig. 2. Excess ISD is driven primarily by amino acid composition, not GC content or precise amino acid order. (B) Paired comparisons show that the small excess in ISD relative to that predicted from amino acid composition is statistically significant (95% confidence intervals are shown) in all young genes except the very youngest, despite the broad confidence intervals in (A) that do not take into account the paired nature of the data.
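
A minimal sketch of the two ideas in this caption, under stated assumptions: a scrambled control keeps a protein's amino acid composition but randomizes residue order, and a paired comparison looks at the per-gene difference between the real and control scores instead of comparing two group means. The peptide, the ISD scores, and the function names are hypothetical; the caption does not specify the authors' exact procedure or ISD predictor, so none is implemented here.

```python
import random
from collections import Counter
from math import sqrt
from statistics import mean, stdev


def scramble(sequence, rng):
    """Return the sequence with residues shuffled; composition is unchanged."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def paired_differences(real_scores, control_scores):
    """Per-gene difference: real ISD minus its own scrambled control's ISD."""
    return [real - control for real, control in zip(real_scores, control_scores)]


def mean_and_se(values):
    """Mean and standard error of the mean for a list of numbers."""
    return mean(values), stdev(values) / sqrt(len(values))


rng = random.Random(0)  # fixed seed so the shuffle is reproducible
peptide = "MKTAYIAKQR"  # hypothetical sequence
control = scramble(peptide, rng)
print(control, Counter(control) == Counter(peptide))

# Hypothetical ISD scores for three genes and their scrambled controls.
real_isd = [0.42, 0.54, 0.33]
control_isd = [0.40, 0.50, 0.30]
print(mean_and_se(paired_differences(real_isd, control_isd)))
```

Differencing within each gene removes the large between-gene spread in ISD, which is why a small excess can be significant in the paired analysis even when the unpaired error bars overlap.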

Fig. 5. Putative evidence for the continuum hypothesis can be explained as a statistical artefact known as Simpson’s paradox.

(A) The continuum view posits the existence of “proto-genes” that have “characteristics intermediate between non-genic ORFs and genes”. Candidate proto-genes were classified on the basis of being annotated as ORFs and having detectable sequence homology in sister species (without necessarily retaining approximate ORF boundaries). Carvunis et al. (2012) claimed to show a continuum of properties as a function of conservation level, shown as a greyscale. (B) The same data can be explained without resorting to the existence of such intermediates. Sequence homology for ORFs that are not protein-coding genes (white circles) becomes more difficult to detect as a function of age, such that the proportion of true genes (black circles) increases with age, giving rise to the same observations as in A. The downward trend in ISD arises as an example of Simpson’s paradox. (C) By carefully excluding all non-genes, we see the true relationship between gene age and ISD, and compare it to intergenic control sequences that are definitely not protein-coding genes. Note that if true protein-coding genes were excluded in B (rather than excluding non-genes as in C), there would be no relationship with conservation levels.
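
The mixture argument in this caption can be sketched numerically. The values below are hypothetical and chosen only to show the mechanism: non-genes have the same low ISD at every conservation level, true genes have ISD that falls with age, and the fraction of true genes rises with age. Pooling the two classes then produces a trend opposite to the one seen among true genes alone.

```python
def pooled_mean(fraction_genes, isd_gene, isd_nongene):
    """Mean ISD of a mixture of true genes and non-genic ORFs."""
    return fraction_genes * isd_gene + (1 - fraction_genes) * isd_nongene


# Hypothetical values, ordered from youngest to oldest conservation level:
# (label, fraction of ORFs that are true genes, gene ISD, non-gene ISD)
levels = [
    ("young", 0.1, 0.40, 0.10),
    ("intermediate", 0.5, 0.35, 0.10),
    ("old", 0.9, 0.30, 0.10),
]

pooled = [pooled_mean(p, gene, nongene) for _label, p, gene, nongene in levels]
gene_only = [gene for _label, _p, gene, _nongene in levels]

for (label, _p, _gene, _nongene), pooled_isd, gene_isd in zip(levels, pooled, gene_only):
    print(f"{label:>12}: pooled={pooled_isd:.3f}  true genes only={gene_isd:.3f}")
```

With these toy numbers, pooled ISD rises with age while ISD among true genes falls with age, and the non-gene class on its own shows no relationship with conservation level, matching the qualitative pattern the caption describes.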

Fig. 6. Young yeast genes, like the young mouse genes in Fig. 2, have higher ISD.

(A) Back-transformed central tendency estimates +/- one standard error come from a linear mixed model, where gene family and phylostratum are random and fixed terms, respectively. Phylostrata are labeled according to the species most closely related to S. cerevisiae in which a homolog is still found, except for the “S. kudriavzevii” group, which includes younger genes found in at least two species. The analysis includes 5452 yeast genes that overlap with the genes used by Carvunis et al. (2012) with filtering indicated in Table 1. (B) Using the age classifications of Carvunis et al. (2012) (Table 1, 2nd column), and ignoring gene family, we reproduce the trend of low ISD in young “proto-genes” using our slightly different ISD measurement. Standard means +/- one standard error are reported for untransformed ISD estimates. This trend is insensitive to whether cysteines are included (black circles) or excluded (blue diamonds) from the protein primary sequence. This trend disappears when we screen out “proto-genes” that lack strong evidence for a functional protein product (light-blue squares), by excluding genes whose age we could not classify or which were unique to S. cerevisiae, and those classified as “dubious” in SGD (Table 1; last column). Correspondences between the ages assigned by the two phylostratigraphies are indicated with shaded triangles between the two figure parts.

References

    1. McLysaght A, Guerzoni D. New genes from non-coding sequence: the role of de novo protein-coding genes in eukaryotic evolutionary innovation. Phil Trans R Soc B. 2015;370:20140332.
    2. Monsellier E, Chiti F. Prevention of amyloid-like aggregation as a driving force of protein evolution. EMBO Rep. 2007;8:737–742.
    3. Carvunis A-R, et al. Proto-genes and de novo gene birth. Nature. 2012;487:370–374.
    4. Masel J. Cryptic genetic variation is enriched for potential adaptations. Genetics. 2006;172:1985–1991.
    5. Rajon E, Masel J. The evolution of molecular error rates and the consequences for evolvability. Proc Natl Acad Sci USA. 2011;108:1082–1087.
