A Comprehensive Analysis of Transcript-Supported De Novo Genes in Saccharomyces sensu stricto Yeasts

Sun Jan 01 2017

Tzu-Chiao Lu et al. Mol Biol Evol. 2017.

Abstract

Novel genes arising from random DNA sequences (de novo genes) have been suggested to be widespread in the genomes of different organisms. However, our knowledge about the origin and evolution of de novo genes is still limited. To systematically understand the general features of de novo genes, we established a robust pipeline to analyze >20,000 transcript-supported coding sequences (CDSs) from the budding yeast Saccharomyces cerevisiae. Our analysis pipeline combined phylogeny, synteny, and sequence alignment information to identify possible orthologs across 20 Saccharomycetaceae yeasts and discovered 4,340 S. cerevisiae-specific de novo genes and 8,871 S. sensu stricto-specific de novo genes. We further combined information on CDS positions and transcript structures to show that >65% of de novo genes arose from transcript isoforms of ancient genes, especially in the upstream and internal regions of ancient genes. Fourteen identified de novo genes with high transcript levels were chosen to verify their protein expression. Ten of them, including eight transcript isoform-associated CDSs, showed translation signals, and five proteins exhibited specific cytosolic localizations. Our results suggest that de novo genes frequently arise in the S. sensu stricto complex and have the potential to be quickly integrated into ancient cellular networks.

Keywords: S. sensu stricto yeast; de novo gene; novel gene; synteny analysis; transcript isoform; yeast evolution; yeast genomics.

© The Author 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

Figures

Fig. 1.

Identification of de novo genes in the Saccharomyces sensu stricto complex. (A) Data sources and analytical pipeline for identification of de novo genes. sgdCDSs, smORFs, and txCDSs (blue boxes) represent CDSs collected from SGD, ribosome profiling, and TIF-seq data, respectively. The pink box represents all collected CDSs after data filtering. The green box indicates genes without protein similarity in non-Saccharomycetaceae species (de novo candidate genes). Filled-green boxes represent S. cerevisiae-specific (age 0) or S. sensu stricto-specific (age 1) de novo genes. Numbers in red represent the analytical steps mentioned in the methods. Numbers in parentheses give the number of CDSs in each step or group. (B) The majority of CDSs overlap with ancient genes or localize in syntenic regions. (C) All possible CDSs in the overlapping or syntenic regions are annotated and compared with the de novo candidate genes. For de novo candidates overlapping with ScANCs, we included 3-kb proximal regions of ScANCs to detect possible orthologs. For de novo candidates with conserved synteny, we searched the entire syntenic region between two flanking ancient genes (ScANCa and ScANCb) for possible orthologs. This process was repeated for all 20 Saccharomycetaceae yeasts. (D) Age assignments of de novo candidates. Gray circles indicate phylogenetic nodes of Saccharomycetaceae yeasts determined by a previous study (Marcet-Houben and Gabaldon 2015). The red circle indicates the WGD event. Species belonging to each phylogenetic branch are listed. The numbers of de novo candidate genes with corresponding ages are shown on the right.
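The age assignment in panel (D) can be illustrated with a minimal sketch. It assumes that a candidate's age is the oldest phylogenetic node at which a possible ortholog was detected, which is one reading of the caption and not necessarily the authors' exact rule. Only the first two table entries follow the caption (age 0 for S. cerevisiae, age 1 for a sensu stricto species such as S. paradoxus); the remaining species labels and the function name `assign_age` are hypothetical placeholders.

```python
# Hypothetical species-to-age table. Only age 0 and age 1 are grounded in
# the caption; the "outgroup_*" labels are placeholders, not real species.
SPECIES_AGE = {
    "S. cerevisiae": 0,
    "S. paradoxus": 1,
    "outgroup_age2": 2,
    "outgroup_age3": 3,
}


def assign_age(species_with_ortholog):
    """Return the age of the oldest node at which an ortholog was found.

    Species missing from SPECIES_AGE are ignored. A candidate with no
    detected ortholog outside S. cerevisiae gets age 0.
    """
    ages = [SPECIES_AGE[s] for s in species_with_ortholog if s in SPECIES_AGE]
    return max(ages, default=0)


if __name__ == "__main__":
    print(assign_age([]))                                 # species-specific
    print(assign_age(["S. paradoxus"]))                   # sensu stricto-specific
    print(assign_age(["S. paradoxus", "outgroup_age3"]))  # older candidate
```

Under this assumed rule, only candidates that end up with age 0 or age 1 would correspond to the de novo genes counted in the filled-green boxes.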

Fig. 2.

Comparisons of gene age assignment between different studies. (A) Distribution of gene ages for different CDS sources. (B) Gene age comparisons between different genome-wide studies. In the sgdCDS group, we identified 47 age 0 genes and 50 age 1 genes, many of which were either not identified or assigned different ages in other studies. (C) The factors contributing to the exclusion of 755 de novo sgdCDSs identified in previous genome-wide studies. Also, see supplementary figures S1–S3, Supplementary Material online, for detailed comparisons.

Fig. 3.

De novo genes originate from nonconserved sequences. (A) The proportions of age 0 genes with identified homologous DNA sequences drop quickly with respect to species divergence in Saccharomyces sensu stricto species. (B) Only a small proportion of age 0 and age 1 genes have homologous DNA sequences in the syntenic regions of age 1 to age 6 and age 2 to age 6 species, respectively. Two different E value cutoffs (10⁻² and 10⁻⁴) for TBLASTN were applied, but the results were similar. (C) Sequence conservation is lower in regions containing de novo genes (Mann–Whitney test, P value <2.2e-16, ScANCs vs. age 0, or ScANCs vs. age 1). Conservation scores in different groups of CDSs were calculated using orthologous DNA sequences in S. sensu stricto yeasts.

Fig. 4.

Most de novo genes are associated with ancient genes. (A) A large proportion of de novo genes overlap with ancient genes. Ancient genes (ScANC) are defined as the Saccharomyces cerevisiae genes with pre-WGD ancestors. (B, C) Distribution of de novo genes with relative positions or distances to the nearest ScANCs. All CDSs with at least two transcripts are shown in (B) and only CDSs with >40 transcripts are shown in (C). Also, see supplementary figure S4, Supplementary Material online, for the distribution of de novo genes with longer lengths (at least 150 or 300 nt). (D) The association of de novo genes with ancient genes is not simply due to the compact nature of yeast genomes. Intergenic CDSs (intCDS) were used to represent the random origin of CDSs in the yeast genome. Distances to the nearest ScANC of non-ScANC-overlapping CDSs were compared among different groups of CDSs. The result showed that, in the sense strand, both age 0 and age 1 genes were much closer to ScANCs than intCDSs were (Mann–Whitney test, P value <2.2e-16).
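The distance measure compared in panel (D) can be sketched as follows. This is an illustration only: it assumes CDSs and ScANCs are given as `(start, end)` coordinate pairs on the same chromosome and strand, and it defines distance as the coordinate gap between two intervals. The function names are hypothetical, and the authors' exact distance definition is not stated in the caption.

```python
def interval_gap(a, b):
    """Coordinate gap between two (start, end) intervals; 0 if they overlap."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def nearest_ancient_distance(cds, ancient_genes):
    """Distance from one CDS to the closest ancient gene (ScANC).

    Returns None when no ancient genes are supplied.
    """
    if not ancient_genes:
        return None
    return min(interval_gap(cds, anc) for anc in ancient_genes)


if __name__ == "__main__":
    # Illustrative coordinates, not taken from the paper.
    scancs = [(300, 400), (10, 50)]
    print(nearest_ancient_distance((100, 200), scancs))
```

Collecting these distances separately for age 0 genes, age 1 genes, and intCDSs would give the three distributions that a rank-based test such as Mann–Whitney then compares.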

Fig. 5.

Transcript structures of de novo genes are often affected by ScANCs. (A) De novo genes whose transcripts are in the same direction as, and overlap with, those of an associated ScANC were further classified into “upstream,” “internal,” or “downstream” types depending on the relative positions of their initiation and stop codons with respect to those of the associated ScANC. The darker regions of the transcripts correspond to CDSs. Only de novo genes with >40 transcripts in TIF-seq data records are shown here. Also, see supplementary figures S4 and S5, Supplementary Material online, for the distribution of all de novo genes and examples of the different types of de novo genes. (B) Proximal de novo genes often terminate at the same sites as associated ScANCs. To avoid bias from low-expressing genes, only genes with >40 transcripts in TIF-seq data were analyzed. TSS, transcription start site. TTS, transcription termination site. (C) The upstream type of de novo genes is expressed at higher levels than other types (Mann–Whitney test, P value <2.2e-16 for all pairwise comparisons).
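The three-way classification in panel (A) might be expressed as a small function. The caption does not give the exact decision rules, so the ones below are assumptions: positions are measured along the transcript in the 5' to 3' direction, a de novo CDS whose start codon precedes the ScANC start codon is "upstream", one whose stop codon lies beyond the ScANC stop codon is "downstream", and everything else is "internal". The name `classify_de_novo` is hypothetical.

```python
def classify_de_novo(de_novo, ancient):
    """Classify a de novo CDS relative to its associated ScANC.

    Both arguments are (start_codon_pos, stop_codon_pos) pairs measured
    along the transcript in the 5'-to-3' direction. The rules are an
    assumed reading of the caption, not the authors' stated criteria.
    """
    dn_start, dn_stop = de_novo
    an_start, an_stop = ancient
    if dn_start < an_start:
        return "upstream"
    if dn_stop > an_stop:
        return "downstream"
    return "internal"


if __name__ == "__main__":
    scanc = (20, 200)  # illustrative coordinates only
    for cds in [(5, 50), (50, 120), (150, 260)]:
        print(cds, classify_de_novo(cds, scanc))
```

A CDS that starts upstream and also ends beyond the ScANC stop codon is labeled "upstream" here because the start-codon check runs first; a real pipeline would need an explicit rule for that case.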

Fig. 6.

Population data reveal different protein conservation levels between age 1 genes and age 0 genes. (A) The changes in de novo genes that are commonly observed among 93 different Saccharomyces cerevisiae strains. The structural changes included loss of start codons and CDS-length variation. Only length changes >15 bp were considered true variants. Sequence divergence was determined by protein sequence identity through pairwise alignments. Age 1 genes are more conserved than age 0 genes in terms of both structure and sequence. Also, see supplementary figure S6, Supplementary Material online, for the comparison of de novo genes and intergenic CDSs. (B) Different tests were applied to detect possible selection forces in different types of age 1 CDSs across species and populations. The internal type of de novo genes was significantly different from all other types in dN/dS (Mann–Whitney test, P value ≤3.86e-15). Only age 1 genes that carry S. paradoxus orthologs and are longer than 150 bp were selected for the analysis. (C) The internal type of age 1 genes was the least variable in structural changes. Different types of age 1 genes were analyzed for CDS structure conservation in S. cerevisiae and S. paradoxus populations.

Fig. 7.

Many highly expressed de novo genes are translated and exhibit specific protein localization patterns. (A) Ribosome profiling of overprinting CDSs indicates increased translation signals from the alternative frame in de novo CDS-containing regions. The frame with the highest ribosome signals in the ScANC-specific region is defined as frame 0. The other frame with enhanced signals in the overprinting region is assigned as frame 1. The remaining frame is frame 2. Frame-specific enrichment was analyzed in CDSs that are (a) totally overlapped (including the internal type of de novo genes) or (b) partially overlapped (including upstream and downstream types) with ScANCs. The results showed that ribosome signals in frame 1 were significantly enriched in de novo CDS-containing regions (Mann–Whitney test, ScANC+de novo vs. ScANC-specific, P value <2.2e-16 and =6.8e-16 for totally overlapped and partially overlapped genes, respectively). No enrichment was observed in frame 2 (Mann–Whitney test, ScANC+de novo vs. ScANC-specific, P value=1 and =0.98 for totally overlapped and partially overlapped genes, respectively). In the right panel, the ribosome profile of three different frames in each CDS is represented by a gray line. Also, see supplementary figure S7, Supplementary Material online, for examples of two overprinting genes. (B, C) Detection of full-length proteins of de novo genes by Western blot. To examine whether full-length proteins were translated from de novo genes, we selected different types of candidate genes (table 1) that were represented by >40 transcripts in the TIF-seq data to perform Western-blot analysis. These candidate genes were fused with GFP (B) or TAP (C) tags and were under the control of their own promoters. Among 14 candidates, 10 exhibited detectable fusion protein signals. Glucose-6-phosphate dehydrogenase (G6PDH) served as a loading control. (D) Some de novo proteins showed special localization patterns. The GFP-fusion protein of txCDS9817 represented a general cytosolic localization pattern. Five other GFP-fusion proteins were observed to localize in mitochondria, cell periphery, ER, or cytosolic puncta. Also, see supplementary figure S8, Supplementary Material online, for the protein features in de novo genes.
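The frame labeling described in panel (A) can be sketched in a few lines. The caption says frame 0 has the highest ribosome signal in the ScANC-specific region and frame 1 is the other frame with enhanced signal in the overprinting region. How "enhanced" was measured is not stated, so the sketch assumes it means the larger gain in the fraction of reads between the two regions. The function name, the input layout (three read counts per region), and the example counts are all hypothetical.

```python
def assign_frames(specific, overprint):
    """Label the three reading frames as frame 0, 1, and 2.

    specific, overprint: read counts per reading frame (length 3) in the
    ScANC-specific region and in the overprinting region. Both regions
    are assumed to have a non-zero total read count.
    Returns the original frame indices in the order (frame0, frame1, frame2).
    """
    # Frame 0: strongest signal where only the ancient gene is translated.
    frame0 = max(range(3), key=lambda i: specific[i])

    def gain(i):
        # Assumed measure of "enhanced": change in the fraction of reads.
        return overprint[i] / sum(overprint) - specific[i] / sum(specific)

    others = [i for i in range(3) if i != frame0]
    frame1 = max(others, key=gain)
    frame2 = next(i for i in others if i != frame1)
    return frame0, frame1, frame2


if __name__ == "__main__":
    # Illustrative counts only, not data from the paper.
    print(assign_frames([10, 80, 10], [40, 50, 10]))
```

With frames labeled this way, the per-gene frame 1 and frame 2 fractions in the two regions are the quantities on which an enrichment test like the one reported in the caption could be run.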
