Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome
Comparative Study
Genome Res. 2003 Dec;13(12):2541-58.
doi: 10.1101/gr.1429003.
- PMID: 14656962
- PMCID: PMC403796
Zhaolei Zhang et al. Genome Res. 2003 Dec.
Abstract
Processed pseudogenes were created by reverse-transcription of mRNAs; they provide snapshots of ancient genes existing millions of years ago in the genome. To find them in the present-day human, we developed a pipeline using features such as intron-absence, frame-disruption, polyadenylation, and truncation. This has enabled us to identify in recent genome drafts approximately 8000 processed pseudogenes (distributed from http://pseudogene.org). Overall, processed pseudogenes are very similar to their closest corresponding human gene, being 94% complete in coding regions, with sequence similarity of 75% for amino acids and 86% for nucleotides. Their chromosomal distribution appears random and dispersed, with the numbers on chromosomes proportional to length, suggesting sustained "bombardment" over evolution. However, it does vary with GC-content: Processed pseudogenes occur mostly in intermediate GC-content regions. This is similar to Alus but contrasts with functional genes and L1-repeats. Pseudogenes, moreover, have age profiles similar to Alus. The number of pseudogenes associated with a given gene follows a power-law relationship, with a few genes giving rise to many pseudogenes and most giving rise to few. The prevalence of processed pseudogenes agrees well with germ-line gene expression. Highly expressed ribosomal proteins account for approximately 20% of the total. Other notables include cyclophilin-A, keratin, GAPDH, and cytochrome c.
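
The power-law relationship mentioned in the abstract can be explored with a straight-line fit on log-log axes. The following is a minimal sketch, not the authors' method: the function name `fit_power_law` and the synthetic counts are assumptions made for illustration, and a log-log least-squares fit is only a rough estimator of a power-law exponent.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = c * x**(-alpha) by least squares on log-transformed values.

    Returns (alpha, c). All xs and ys must be positive.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)

# Synthetic illustration only (NOT data from the paper):
# number of genes that have k processed pseudogenes, generated from an
# exact power law with exponent 2 so the fit can be checked by eye.
ks = [1, 2, 4, 8, 16]
n_genes = [1000.0 * k ** -2.0 for k in ks]

alpha, c = fit_power_law(ks, n_genes)
print(f"alpha = {alpha:.3f}, c = {c:.1f}")
```

On real counts, zero-count bins would have to be dropped before taking logarithms, and a maximum-likelihood estimator is generally preferable to this regression.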
Figures

Number of genes and pseudogenes on each human chromosome. Shown in the figure are the Ensembl functional genes (known and novel), “True,” “Putative,” “Disrupted,” and ribosomal protein (RP) processed pseudogenes. The inset shows the total number of functional genes and processed pseudogenes in the entire genome.

Distribution of human processed pseudogenes among chromosomes. Each filled diamond ♦ represents a chromosome. (A) Correlation between chromosome length and number of processed pseudogenes on each chromosome (R = 0.92, P < 10^-10). (B) The processed pseudogene density on each chromosome is correlated with the chromosome GC content (R = 0.55, P < 10^-2).
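
The R values in this caption are correlation coefficients across chromosomes. As a rough illustration of how such a value is computed, here is a small Pearson correlation sketch. The chromosome lengths and pseudogene counts below are hypothetical toy numbers, not values from the paper, and the caption does not state which correlation measure the authors used, so Pearson is an assumption.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy values (NOT from the paper): chromosome length in Mb
# and number of processed pseudogenes found on each chromosome.
lengths_mb = [50.0, 100.0, 150.0, 200.0]
pseudogene_counts = [110, 190, 320, 400]

r = pearson_r(lengths_mb, pseudogene_counts)
print(f"R = {r:.3f}")
```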

Overall statistics of human processed pseudogenes. (A) Sequence completeness among human processed pseudogenes. Sequence completeness is defined as the ratio between the length of the predicted protein sequence from the pseudogene and the length of the closest matching protein sequence from SWISS-PROT or TrEMBL. (B) Distribution of the nucleotide sequence identity between the processed pseudogenes and the corresponding functional genes (coding region only). (C) Distribution of the number of frame disruptions among processed pseudogenes. Pseudogenes that have the same number of frame disruptions were grouped together and the numbers of frame disruptions (X-axis) were plotted versus the size of the group (Y-axis). The Y-axis is a log scale.
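
The completeness ratio defined in panel A and the grouping used in panel C are simple to express in code. This is a minimal sketch under stated assumptions: the function names and the example lengths and disruption counts are hypothetical, chosen only to illustrate the definitions in the caption.

```python
from collections import Counter

def sequence_completeness(pseudogene_protein_len, parent_protein_len):
    """Ratio of the predicted pseudogene protein length to the length of
    the closest matching functional protein (definition from panel A)."""
    if parent_protein_len <= 0:
        raise ValueError("parent protein length must be positive")
    return pseudogene_protein_len / parent_protein_len

def group_by_disruptions(disruption_counts):
    """Map each number of frame disruptions to how many pseudogenes have
    exactly that many (the grouping plotted in panel C)."""
    return dict(sorted(Counter(disruption_counts).items()))

# Hypothetical example: a 188-residue predicted sequence matched to a
# 200-residue protein.
completeness = sequence_completeness(188, 200)

# Hypothetical per-pseudogene frame disruption counts.
groups = group_by_disruptions([0, 2, 1, 2, 0, 0])
print(completeness, groups)
```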

Isochore distribution of the human processed pseudogenes (—♦—), in comparison with the Alu (shaded columns) and LINE1 (open columns) elements. The pseudogene density is in units of number per 10 Mb, and the Alu and LINE1 elements are in units of number per Mb.

(A) Occurrences of processed pseudogenes among human functional genes. Human genes that have the same number of processed pseudogenes are grouped together. For each group, the number of pseudogenes (X-axis) and the size of the group (Y-axis) are plotted together. For instance, as seen in the plot, 2299 human genes have between one and five processed pseudogenes. (B) Classification of the processed pseudogenes into GO functional categories. “Unclassified” are those pseudogenes that arose from functional genes that were not yet assigned to a GO category. Less populated categories are lumped together into “Others.”

Nucleotide sequence divergences of human processed pseudogenes in comparison with Alu and LINE1 elements. Pseudogenes and repeats were grouped into bins according to their nucleotide divergence from functional sequences.

The Ka/Ks ratios of the human processed pseudogenes. Ka/Ks is the ratio between the nonsynonymous rate of substitutions (Ka) and the synonymous rate of substitution (Ks). The human processed pseudogenes are divided into two groups according to whether they contain frame disruptions, and the fractions of the pseudogenes in each group are shown side by side for each Ka/Ks bin.
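
The comparison described in this caption amounts to computing Ka/Ks per pseudogene, splitting the pseudogenes by whether they contain frame disruptions, and reporting the fraction of each group that falls into each Ka/Ks bin. The sketch below illustrates that bookkeeping only. The bin width, function names, and ratio values are assumptions, and estimating Ka and Ks from aligned sequences is a separate step that is not shown. As general background, a sequence evolving without selective constraint is expected to have Ka/Ks near 1.

```python
from collections import Counter

def ka_ks(ka, ks):
    """Ratio of nonsynonymous (Ka) to synonymous (Ks) substitution rates."""
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio")
    return ka / ks

def bin_fractions(ratios, bin_width=0.25):
    """Fraction of ratios in each bin [i * bin_width, (i + 1) * bin_width).

    Keys are integer bin indices. Values that sit exactly on a bin edge
    can be affected by floating-point rounding.
    """
    if not ratios:
        return {}
    counts = Counter(int(r // bin_width) for r in ratios)
    total = len(ratios)
    return {i: counts[i] / total for i in sorted(counts)}

# Hypothetical Ka/Ks values (NOT from the paper) for the two groups.
with_disruptions = [0.9, 1.1, 0.8, 1.3]
without_disruptions = [0.3, 0.4, 0.9, 0.6]

print(bin_fractions(with_disruptions))
print(bin_fractions(without_disruptions))
```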

Effects of sequence identity and BLAST E-value cutoffs. For different combinations of sequence identity and BLAST E-value, the total numbers of processed pseudogenes and “putative” processed pseudogenes in the final sets are shown together. The cutoffs that were used in this study are underlined.

A flow chart showing procedures in searching for processed pseudogenes in the human genome. (RP) ribosomal proteins; (ΨG) pseudogene; (OR) olfactory receptor; (Numts) nuclear mitochondrial pseudogenes; (S-W) Smith-Waterman. The steps are as follows: (1) Six-frame TBLASTN run searching for SWISS-PROT/TrEMBL protein similarities in the human genome. (2) Remove overlaps with Ensembl functional gene annotations. (3) Merging, extension, and realignment. BLAST hits were merged and extended on both sides to match the length of query protein sequence and then realigned with the protein sequence. After this step, 44,478 pseudogene candidates were obtained. (4) Remove false positives, repeats, low complexity sequences, and potential functional gene candidates. A total of 19,927 pseudogene candidates were obtained at this step. In steps 5 and 6, processed pseudogenes, duplicated pseudogenes, and pseudogene fragments were separated according to sequence continuity and completeness. Two special types of pseudogene, ORs and Numts, were further removed from the pool, and processed pseudogenes were grouped into three classes, “True,” “Putative,” and “Disrupted.” See text for the definition of these three classes.
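
The merging part of step 3 can be pictured as interval merging over BLAST hits. The sketch below is an illustration under assumptions, not the authors' implementation: the `Hit` record, the `merge_hits` function, and the `max_gap` parameter are hypothetical, and a real pipeline would also need to track which query protein each hit came from and its query coordinates before extending and realigning.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    """A genomic hit; field order defines the sort order used below."""
    chrom: str
    strand: str
    start: int
    end: int

def merge_hits(hits, max_gap=0):
    """Merge hits on the same chromosome and strand that overlap or lie
    within max_gap bases of each other. max_gap is a hypothetical knob."""
    merged = []
    for hit in sorted(hits):
        if merged:
            last = merged[-1]
            same_locus = (last.chrom, last.strand) == (hit.chrom, hit.strand)
            if same_locus and hit.start <= last.end + max_gap:
                # Extend the previous interval instead of starting a new one.
                merged[-1] = last._replace(end=max(last.end, hit.end))
                continue
        merged.append(hit)
    return merged

# Hypothetical coordinates for illustration.
hits = [
    Hit("chr1", "+", 100, 200),
    Hit("chr1", "+", 150, 300),
    Hit("chr1", "-", 150, 300),
    Hit("chr1", "+", 400, 500),
]
print(merge_hits(hits))
print(merge_hits(hits, max_gap=100))
```

Sorting first keeps the merge to a single pass, since any hit that can join an interval must be adjacent to it in sorted order.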
Similar articles
- Harrison PM, Hegyi H, Balasubramanian S, Luscombe NM, Bertone P, Echols N, Johnson T, Gerstein M. Genome Res. 2002 Feb;12(2):272-80. doi: 10.1101/gr.207102. PMID: 11827946 Free PMC article.
- Comparative analysis of processed pseudogenes in the mouse and human genomes. Zhang Z, Carriero N, Gerstein M. Trends Genet. 2004 Feb;20(2):62-7. doi: 10.1016/j.tig.2003.12.005. PMID: 14746985 Review.
- Identification and analysis of over 2000 ribosomal protein pseudogenes in the human genome. Zhang Z, Harrison P, Gerstein M. Genome Res. 2002 Oct;12(10):1466-82. doi: 10.1101/gr.331902. PMID: 12368239 Free PMC article.
- Khelifi A, Meunier J, Duret L, Mouchiroud D. J Mol Evol. 2006 Jun;62(6):745-52. doi: 10.1007/s00239-005-0186-0. Epub 2006 Apr 28. PMID: 16752212
- Processed pseudogenes: characteristics and evolution. Vanin EF. Annu Rev Genet. 1985;19:253-72. doi: 10.1146/annurev.ge.19.120185.001345. PMID: 3909943 Review.
Cited by
- Chen S, Wu J, Zhang Y, Zhao Y, Xu W, Li Y, Xie J. Front Genet. 2021 May 28;12:668940. doi: 10.3389/fgene.2021.668940. eCollection 2021. PMID: 34122520 Free PMC article.
- Chen JM, Stenson PD, Cooper DN, Férec C. Hum Genet. 2005 Sep;117(5):411-27. doi: 10.1007/s00439-005-1321-0. Epub 2005 Jun 28. PMID: 15983781 Review.
- Global survey of chromatin accessibility using DNA microarrays. Weil MR, Widlak P, Minna JD, Garner HR. Genome Res. 2004 Jul;14(7):1374-81. doi: 10.1101/gr.1396104. PMID: 15231753 Free PMC article.
- Sloop GD, Pop G, Weidman JJ, St Cyr JA. Cureus. 2018 May 31;10(5):e2715. doi: 10.7759/cureus.2715. PMID: 30079281 Free PMC article. Review.
- Retrocopy contributions to the evolution of the human genome. Baertsch R, Diekhans M, Kent WJ, Haussler D, Brosius J. BMC Genomics. 2008 Oct 8;9:466. doi: 10.1186/1471-2164-9-466. PMID: 18842134 Free PMC article.
WEB SITE REFERENCES
- http://bioinfo.mbb.yale.edu/genome/pseudogene; pseudogene database.
- http://www.ebi.ac.uk/GOA/; GO annotation of SWISS-PROT/TrEMBL proteins.
- http://www.ebi.ac.uk/proteome; EBI nonredundant human proteome.
- http://www.ebi.ac.uk/swissprot/; SWISS-PROT human protein sequences.
- http://www.ebi.ac.uk/trembl/; TrEMBL human protein sequences.