Population diversity of ORFan genes in Escherichia coli
Guoqin Yu et al. Genome Biol Evol. 2012.
Abstract
The origin and evolution of "ORFans" (suspected genes without known relatives) remain unclear. Here, we take advantage of a unique opportunity to examine the population diversity of thousands of ORFans, based on a collection of 35 complete genomes of isolates of Escherichia coli and Shigella (which is included phylogenetically within E. coli). As expected from previous studies, ORFans are shorter and AT-richer in sequence than non-ORFans. We find that ORFans often are very narrowly distributed: the most common pattern is for an ORFan to be found in only one genome. We compared within-species population diversity of ORFan genes with that of two control groups of non-ORFan genes. Patterns of population variation suggest that most ORFans are not artifacts, but encode real genes whose protein-coding capacity is conserved, reflecting selection against nonsynonymous mutations. Nevertheless, nonsynonymous nucleotide diversity is higher than for non-ORFans, whereas synonymous diversity is roughly the same. In particular, there is a several-fold excess of ORFans in the highest decile of diversity relative to controls, which might be due to weaker purifying selection, positive selection, or a subclass of ORFans that are decaying.
Figures
![Fig. 1.](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dc61/3514957/72e103c05edf/evs081f1p.gif)
Clades used to define comparison groups with different phylogenetic depths. The t1 and t2 clades were chosen due to high bootstrap support (>96%) in a phylogeny of species computed as described (Materials and Methods) and available as supplementary figure S1, Supplementary Material online.
![Fig. 2.](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dc61/3514957/5bdd0d41594e/evs081f2p.gif)
Scheme for creating matching control clusters for each ORFan cluster. The pseudocode shown here describes the method used to generate customized control clusters. If ORFan_list is not equal to the intersection of t_list and ORFan_list, then the putative control cluster cannot be used because it does not have the right set of strains.
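The strain-set check described in this caption can be illustrated with a minimal Python sketch. It is not the authors' pseudocode: the function name `matching_controls`, the dictionary layout of the clusters, and the strain and gene labels are hypothetical. Trimming each accepted control cluster down to the ORFan's strains is also an assumption, made here so that controls share the ORFan's strain composition.

```python
def matching_controls(orfan_strains, candidate_clusters):
    """Return candidate control clusters usable for one ORFan cluster.

    orfan_strains: strains that carry the ORFan (the caption's ORFan_list).
    candidate_clusters: hypothetical layout, cluster name -> {strain: gene id};
        the keys of each inner dict play the role of t_list.
    """
    orfan_list = set(orfan_strains)
    controls = {}
    for name, members in candidate_clusters.items():
        t_list = set(members)
        # Caption's rule: reject the cluster unless
        # ORFan_list == intersection(t_list, ORFan_list).
        if orfan_list & t_list != orfan_list:
            continue
        # Assumption: keep only the genes from the ORFan's strains.
        controls[name] = {strain: members[strain] for strain in sorted(orfan_list)}
    return controls


candidates = {
    "clusterA": {"strain1": "geneA1", "strain2": "geneA2", "strain3": "geneA3"},
    "clusterB": {"strain1": "geneB1"},
}
usable = matching_controls(["strain1", "strain2"], candidates)
print(usable)
```

Here `clusterB` is rejected because it lacks `strain2`, while `clusterA` is kept and reduced to the two strains that carry the ORFan.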
![Fig. 3.](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dc61/3514957/aeac7f642c58/evs081f3p.gif)
ORFan composition as a function of genome size in 35 Escherichia coli strains. For each genome, counts are shown for three categories of putative protein-coding genes, along with regression lines. Two of the categories are mutually exclusive: each gene in a genome is either from a cluster (in the NCBI Protein Clusters database) that has a curated functional annotation (solid circle), or it is from a cluster annotated as “hypothetical protein” (plus symbols). The solid squares show the counts of ORFans, the vast majority of which are noncurated (see text). As genome size increases, the number of proteins with assigned functions remains nearly constant. The increase in genome size is not mainly attributable to ORFans, but is attributable to other genes for which functions are unknown.
![Fig. 4.](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dc61/3514957/eac7fe9e05a6/evs081f4p.gif)
Genic features of ORFans compared with non-ORFans. (A), average size (in base pairs). ORFans are shorter than non-ORFans. (B), average percent of GC at first (GC1), second (GC2), and third (GC3) position of codons. Except for GC2 in (B), the three classes of clusters differ significantly in genic features. ORFans have lower GC content at first and third positions of codons. Bars represent 95% confidence intervals. The letters denote significantly different results by the Wilcoxon test (results from Student’s t-test are the same).
![Fig. 5.](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dc61/3514957/f450440c7609/evs081f5p.gif)
Distribution of ORFan and non-ORFan genes among genomes of Escherichia coli strains. (A) The average number of E. coli strains per protein cluster in the ORFan and non-ORFan cluster groups; (B) the frequency distribution of number of E. coli strains per cluster used in the ORFan and non-ORFan comparison groups (this excludes ORFan clusters with only one member, which is the most common size of a cluster). ORFans typically have narrow distributions, while non-ORFans in the t2 comparison group are present in most genomes. Non-ORFans in the t1 group have an intermediate distribution. The letters in (A) denote significantly different results by the Wilcoxon test (results from Student’s t-test are the same).
![Fig. 6.](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dc61/3514957/5359e7246934/evs081f6p.gif)
Mean population statistics compared between ORFans and non-ORFans. This figure shows a comparison of means for pi (upper panel), dS (middle panel), and dN (lower panel), whereas the complete distributions are compared, via deciles, in figure 7. There are two different sets of mean values for ORFan clusters, because the comparisons with t1 and t2 use overlapping but nonidentical sets of ORFan clusters, due to the need to create matching controls with the same strain composition (see Materials and Methods). Although synonymous diversity is not much different, nonsynonymous diversity, as well as total diversity (Pi), is significantly different between ORFans and non-ORFans in the t2 control group. The letters denote significantly different results by the Wilcoxon test; Student’s t-test gives a slightly different result for dN, not shown, with no significant distinction between t1 and the ORFan clusters (i.e., the pattern is a–a–a–b).
![Fig. 7.](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dc61/3514957/5e0d96558083/evs081f7p.gif)
Distribution of diversity statistics compared between ORFans and non-ORFan controls. The three rows show the distributions for pi (upper), dS (middle) and dN (lower), for both t1 (left column) and t2 (right column) control sets. To understand the shape of the distribution of population statistics for ORFans, values for ORFans were gathered into decile bins defined by the non-ORFan control clusters, that is, each bin comprises 10% of the distribution of non-ORFan values. The value on the Y axis for the first decile bin in (A), for instance, represents the frequency with which the Pi value for an ORFan ranks in the top 10% of the values in its customized t1 control group. (C, E) The same comparison for dN and dS; (B, D, and F) the distribution of Pi, dN, and dS (respectively) relative to the t2 comparison group. The null expectation is a straight line at a value of 10%, with a slight anomaly at the low (right) end of the distribution due to zero values (in cases where zero values exceed 10% of the control distribution, zero values in the ORFan distribution will be placed in whichever bin is counted first, which in this case tends to leave a shortage in the last bin). The symmetric but slightly U-shaped distribution of dS values indicates that ORFans exhibit greater variance, but otherwise have the same distribution of synonymous differences as non-ORFans. However, the deviation from the distribution of dN (and Pi) values is asymmetric, with a 2-fold or more excess of ORFan clusters with diversity in the top 10% or 20% of the distribution relative to non-ORFan controls.
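The decile binning described in this caption can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the function names and example values are hypothetical. Ties are resolved by counting only the control values strictly below the ORFan value, which is an assumption and may place zero values differently from the procedure described in the caption.

```python
from bisect import bisect_left


def decile_bin(value, control_values):
    """Return the bin (1 = top 10% of controls, 10 = bottom 10%) for one value."""
    ordered = sorted(control_values)
    n = len(ordered)
    # Number of control values strictly below the ORFan value.
    below = bisect_left(ordered, value)
    # Integer arithmetic avoids floating-point rounding at bin edges.
    from_bottom = min(below * 10 // n, 9)  # 0 (lowest) .. 9 (highest)
    return 10 - from_bottom


def decile_frequencies(orfan_values, control_values):
    """Fraction of ORFan values falling in each control-defined decile bin."""
    counts = [0] * 10
    for value in orfan_values:
        counts[decile_bin(value, control_values) - 1] += 1
    return [count / len(orfan_values) for count in counts]


# Hypothetical diversity values: 100 evenly spread controls, 4 ORFans.
control = list(range(100))
orfans = [95, 96, 50, 5]
freqs = decile_frequencies(orfans, control)
print(freqs)  # first entry is the share of ORFans in the top control decile
```

Under the null expectation every entry of `freqs` would be close to 0.10. In this toy input half of the ORFan values land in the first bin, which is the kind of top-decile excess the caption reports for dN and Pi.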