Origins of De Novo Genes in Human and Chimpanzee - PubMed
Thu Jan 01 2015
Jorge Ruiz-Orera et al. PLoS Genet. 2015.
Abstract
The birth of new genes is an important motor of evolutionary innovation. Whereas many new genes arise by gene duplication, others originate at genomic regions that did not contain any genes or gene copies. Some of these newly expressed genes may acquire coding or non-coding functions and be preserved by natural selection. However, it remains unclear how prevalent de novo gene emergence is and which mechanisms underlie it. To obtain a comprehensive view of this process, we performed in-depth sequencing of the transcriptomes of four mammalian species (human, chimpanzee, macaque, and mouse) and compared the assembled transcripts with the corresponding syntenic genomic regions. This identified over five thousand new multiexonic transcriptional events in human and/or chimpanzee that are not observed in the other species. Using comparative genomics, we show that the expression of these transcripts is associated with the gain of regulatory motifs upstream of the transcription start site (TSS) and of U1 snRNP sites downstream of the TSS. In general, these transcripts show little evidence of purifying selection, suggesting that many of them are not functional. However, we find signatures of selection in a subset of de novo genes that have evidence of protein translation. Taken together, the data support a model in which frequently occurring new transcriptional events in the genome provide the raw material for the evolution of new proteins.
Conflict of interest statement
The authors have declared that no competing interests exist.
Figures

a) Percentage of annotated and novel genes and transcripts using strand-specific deep polyA+ RNA sequencing. Classification is based on the comparison to reference gene annotations in Ensembl v.75. 70.65% and 87.77% of annotated genes in human and mouse, respectively, are classified as protein-coding. Number of genes identified: human, 34,188; chimpanzee, 35,915; macaque, 34,427; mouse, 31,043. Number of transcripts identified: human, 99,670; chimpanzee, 102,262; macaque, 93,860; mouse, 85,688. b) Cumulative density of nucleotide length in annotated and novel assembled transcripts. c) Cumulative density of expression values on a logarithmic scale in annotated and novel assembled transcripts. Expression is measured in fragments per kilobase per million mapped reads (FPKM), selecting the maximum value across all samples.
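The FPKM measure in panel c can be illustrated with a minimal Python sketch. It assumes the standard definition (fragments divided by transcript length in kilobases and by millions of mapped fragments). The function names and the example counts are hypothetical and are not taken from the study.

```python
def fpkm(fragments: int, transcript_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


def max_fpkm_across_samples(fpkm_by_sample: dict) -> float:
    """Summarise a transcript by its maximum FPKM over all samples."""
    return max(fpkm_by_sample.values())


# Hypothetical counts for one transcript in two samples (illustration only).
example = {
    "brain": fpkm(fragments=50, transcript_length_bp=2_000, total_mapped_fragments=25_000_000),
    "testis": fpkm(fragments=500, transcript_length_bp=2_000, total_mapped_fragments=25_000_000),
}
print(max_fpkm_across_samples(example))
```

Taking the maximum across samples, as the caption describes, means that a transcript expressed strongly in a single tissue is not diluted by tissues where it is silent.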

a) Simplified phylogenetic tree indicating the nine species considered in this study. In all species we had RNA-Seq data from several tissues. Chimpanzee, human, macaque and mouse were the species for which we performed strand-specific deep polyA+ RNA sequencing. We indicate the branches in which de novo genes were defined, together with the number of genes. b) Categories of transcripts in de novo genes based on genomic location. Intergenic, transcripts that do not overlap any other gene; Overlapping antisense, transcripts that overlap exons from other genes in the opposite strand; Overlapping intronic, transcripts that overlap introns from other genes in the opposite strand, with no exonic overlap. c) Classification of de novo genes based on existing evidence in databases. Annotated, genes classified as annotated in Ensembl v.75; EST/nr, non-annotated genes with BLAST hits (10^-4) to expressed sequence tags (EST) and/or non-redundant protein (nr) sequences in the same species; Novel, rest of genes. d) Patterns of gene expression in four tissues. Brain refers to frontal cortex. Transcripts with FPKM > 0 in a tissue are considered as expressed in that tissue. In red boxes, fraction of transcripts whose expression is restricted to that tissue (τ > 0.85, see Methods). Chimp conserved, transcripts assembled in chimpanzee not classified as de novo. Human conserved, transcripts assembled in human not classified as de novo. e) Number of testis GTEx samples with expression of de novo and conserved genes. We considered all annotated genes with FPKM > 0 in at least one testis sample. Conserved, genes sampled from the total pool of annotated genes analyzed in GTEx with the same distribution of FPKM values as in annotated de novo genes (n = 200).
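The tissue-restriction threshold in panel d (τ > 0.85) can be made concrete with a short Python sketch. The caption defers the definition of τ to the Methods, so this assumes the commonly used tissue-specificity index, τ = Σ(1 − x_i / x_max) / (n − 1), where x_i is expression in tissue i. Whether values are log-transformed first, and the example numbers, are assumptions for illustration.

```python
def tau(expression_by_tissue: list) -> float:
    """Tissue-specificity index: 0 for uniform expression, 1 for a single tissue.

    Assumed formula: sum(1 - x_i / x_max) / (n - 1).
    """
    n = len(expression_by_tissue)
    x_max = max(expression_by_tissue)
    if n < 2 or x_max <= 0:
        raise ValueError("need at least two tissues and one expressed value")
    return sum(1 - x / x_max for x in expression_by_tissue) / (n - 1)


# Hypothetical FPKM values in four tissues (illustration only).
profile = [8.0, 2.0, 0.0, 0.0]
print(tau(profile), tau(profile) > 0.85)
```

Under this definition a transcript detected in one tissue only scores exactly 1, and a transcript with equal expression everywhere scores 0.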

a) Overrepresented transcription factor binding sites (TFBS) in the region -100 to 0 with respect to the transcription start site (TSS) in de novo genes. The region from -300 to +300 with respect to the TSS was analysed (n = 3,875). Color code relates to normalized values (highest value is yellow). b) Fine-grained motif density in the 200 bp upstream of the TSS. c) Comparison of motif density in genomic syntenic regions in macaque for de novo transcripts (n = 3,116) and conserved transcripts (n = 4,323, randomly taken human and chimpanzee annotated transcripts not classified as de novo). Significant differences between human/chimpanzee and macaque are indicated; Fisher test; *, p-value < 0.05; **, p-value < 0.01. d) Density of the main human transposable element (TE) families around the TSS of de novo and conserved transcripts. Regions -3 kb to +3 kb with respect to the TSS were analyzed. LTR frequency is higher in the region -100 to +100 in de novo genes when compared to conserved genes (Fisher test, p-value < 10^-18). e) Comparison of motif density in promoters with and without long terminal repeat (LTR) in the region -500 to 0 with respect to the TSS. Significant differences in motif density in the -100 bp window are indicated. f) Signatures of transcription elongation in de novo and conserved genes. Density of U1 and PAS motifs in the 500 bp regions upstream and downstream of the TSS. Comparison of U1 and PAS motif density in genomic syntenic regions in macaque for de novo transcripts (n = 3,116) and conserved transcripts (n = 4,323). There is an increase of U1 motifs in de novo transcripts when compared to macaque (indicated by a black arrow; Fisher test, p-value = 0.016 for the region +100 to +200).

a-d) ORF length and coding score for ORFs in different sequence types. De novo gene, longest ORF in de novo transcripts (n = 1,933). CodRNA (all), annotated coding sequences from Ensembl v.75 (n = 8,462). CodRNA (short), annotated coding sequences sampled so as to have the same transcript length distribution as de novo transcripts (n = 1,952). Intron, longest ORF in intronic sequences from annotated genes sampled so as to have the same transcript length distribution as de novo transcripts (n = 5,000). Proteogenomics, ORFs in de novo transcripts with peptide evidence by mass spectrometry. Ribosome profiling, ORFs in de novo transcripts with ribosome association evidence in brain. e) Example of hominoid-specific de novo gene with evidence of protein expression from proteogenomics, with RNA-Seq read profiles in two human samples. f) Example of hominoid-specific de novo gene with RNA-Seq and ribosome profiling read profiles. Predicted coding sequences are highlighted with red boxes and the putative encoded protein sequences displayed.
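Several categories in this caption rely on the longest ORF in a sequence. The following Python sketch shows one simple way to find it, assuming an ORF runs from an ATG to the first in-frame stop codon and that only the three forward frames are scanned (the transcripts come from strand-specific data). The exact ORF definition used in the study is not given in the caption, and the example sequence is made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(sequence: str) -> str:
    """Return the longest ATG-to-stop ORF in the three forward frames.

    Returns an empty string if no complete ORF is found.
    """
    seq = sequence.upper()
    best = ""
    for frame in range(3):
        start = None  # position of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


# Made-up sequence with a short ORF in one frame and a longer one in another.
print(longest_orf("CCATGAAATAGGATGCCCGGGTTTTGA"))
```

Comparing ORF lengths against length-matched intronic sequences, as in the caption, helps separate ORFs that arise by chance from those that are longer than expected.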
Similar articles
- Parallel patterns of evolution in the genomes and transcriptomes of humans and chimpanzees. Khaitovich P, Hellmann I, Enard W, Nowick K, Leinweber M, Franz H, Weiss G, Lachmann M, Pääbo S. Science. 2005 Sep 16;309(5742):1850-4. doi: 10.1126/science.1108296. PMID: 16141373
- Human-specific histone methylation signatures at transcription start sites in prefrontal neurons. Shulha HP, Crisci JL, Reshetov D, Tushir JS, Cheung I, Bharadwaj R, Chou HJ, Houston IB, Peter CJ, Mitchell AC, Yao WD, Myers RH, Chen JF, Preuss TM, Rogaev EI, Jensen JD, Weng Z, Akbarian S. PLoS Biol. 2012;10(11):e1001427. doi: 10.1371/journal.pbio.1001427. PMID: 23185133. Free PMC article.
- Shankar R, Chaurasia A, Ghosh B, Chekmenev D, Cheremushkin E, Kel A, Mukerji M. Mol Genet Genomics. 2007 Apr;277(4):441-55. doi: 10.1007/s00438-007-0210-8. PMID: 17375324
- Comparative primate genomics: the year of the chimpanzee. Ruvolo M. Curr Opin Genet Dev. 2004 Dec;14(6):650-6. doi: 10.1016/j.gde.2004.08.007. PMID: 15531160. Review.
- [Comparative studies on human and chimpanzee genomes]. Yoko K, Atsushi T, Hideki N, Asao F. Tanpakushitsu Kakusan Koso. 2005 Dec;50(16 Suppl):2072-7. PMID: 16411432. Review. Japanese. No abstract available.
Cited by
- Zheng EB, Zhao L. Elife. 2022 Sep 30;11:e78772. doi: 10.7554/eLife.78772. PMID: 36178469. Free PMC article.
- Zile K, Dessimoz C, Wurm Y, Masel J. Genome Biol Evol. 2020 Aug 1;12(8):1355-1366. doi: 10.1093/gbe/evaa127. PMID: 32589737. Free PMC article.
- Illuminating the Function of the Orphan Transporter, SLC22A10 in Humans and Other Primates. Yee SW, Ferrández-Peral L, Alentorn P, Fontsere C, Ceylan M, Koleske ML, Handin N, Artegoitia VM, Lara G, Chien HC, Zhou X, Dainat J, Zalevsky A, Sali A, Brand CM, Capra JA, Artursson P, Newman JW, Marques-Bonet T, Giacomini KM. Res Sq [Preprint]. 2023 Sep 14:rs.3.rs-3263845. doi: 10.21203/rs.3.rs-3263845/v1. PMID: 37790518. Free PMC article. Updated. Preprint.
- Illuminating the Function of the Orphan Transporter, SLC22A10 in Humans and Other Primates. Yee SW, Ferrández-Peral L, Alentorn P, Fontsere C, Ceylan M, Koleske ML, Handin N, Artegoitia VM, Lara G, Chien HC, Zhou X, Dainat J, Zalevsky A, Sali A, Brand CM, Capra JA, Artursson P, Newman JW, Marques-Bonet T, Giacomini KM. bioRxiv [Preprint]. 2023 Aug 12:2023.08.08.552553. doi: 10.1101/2023.08.08.552553. PMID: 37609337. Free PMC article. Updated. Preprint.
- Vegesna R, Tomaszkiewicz M, Ryder OA, Campos-Sánchez R, Medvedev P, DeGiorgio M, Makova KD. Genome Biol Evol. 2020 Jun 1;12(6):842-859. doi: 10.1093/gbe/evaa088. PMID: 32374870. Free PMC article.
Grants and funding
The main grant was BFU2012-36820 from the Spanish Government, which was co-funded by the European Regional Development Fund (FEDER). Another grant was from Instituto de Salud Carlos III, Gobierno de España, grant number PT13/0001. We also received funds from Agència de Gestió d'Ajuts Universitaris i de Recerca Generalitat de Catalunya, grant number 2014SGR1121. Another funding source was the European Molecular Biology Organization Young Investigators Program 2014 grant awarded to TMB. TMB was also supported by MICINN BFU2014-55090-P, BFU2015-7116-ERC and BFU2015-6215-ERC (www.mecd.gob.es). MA and TMB were supported by ICREA Institut Català de Recerca i Estudis Avançats, Generalitat de Catalunya. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.