Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome

Comparative Study

Genome Res. 2003 Dec;13(12):2541-58. doi: 10.1101/gr.1429003

PMID: 14656962
PMCID: PMC403796

Zhaolei Zhang et al. Genome Res. 2003 Dec.

Abstract

Processed pseudogenes were created by reverse-transcription of mRNAs; they provide snapshots of ancient genes existing millions of years ago in the genome. To find them in the present-day human, we developed a pipeline using features such as intron-absence, frame-disruption, polyadenylation, and truncation. This has enabled us to identify in recent genome drafts approximately 8000 processed pseudogenes (distributed from http://pseudogene.org). Overall, processed pseudogenes are very similar to their closest corresponding human gene, being 94% complete in coding regions, with sequence similarity of 75% for amino acids and 86% for nucleotides. Their chromosomal distribution appears random and dispersed, with the numbers on chromosomes proportional to length, suggesting sustained "bombardment" over evolution. However, it does vary with GC-content: Processed pseudogenes occur mostly in intermediate GC-content regions. This is similar to Alus but contrasts with functional genes and L1-repeats. Pseudogenes, moreover, have age profiles similar to Alus. The number of pseudogenes associated with a given gene follows a power-law relationship, with a few genes giving rise to many pseudogenes and most giving rise to few. The prevalence of processed pseudogenes agrees well with germ-line gene expression. Highly expressed ribosomal proteins account for approximately 20% of the total. Other notables include cyclophilin-A, keratin, GAPDH, and cytochrome c.

Figures

**Figure 1**
Number of genes and pseudogenes on each human chromosome. Shown in the figure are the Ensembl functional genes (known and novel), “True,” “Putative,” “Disrupted,” and ribosomal protein (RP) processed pseudogenes. The inset shows the total number of functional genes and processed pseudogenes in the entire genome.

**Figure 2**
Distribution of human processed pseudogenes among chromosomes. Each filled diamond ♦ represents a chromosome. (A) Correlation between chromosome length and number of processed pseudogenes on each chromosome (R = 0.92, P < 10^-10). (B) The processed pseudogene density on each chromosome is correlated with the chromosome GC content (R = 0.55, P < 10^-2).
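
The R values in Figure 2 are correlation coefficients. A minimal sketch of how such a value could be computed, assuming R denotes the Pearson coefficient (the caption does not say which statistic was used); the function name `pearson_r` and the numbers below are made-up placeholders, not data from the paper:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical placeholder values, NOT taken from the paper.
chrom_length_mb = [50.0, 100.0, 150.0, 200.0, 250.0]
pseudogene_count = [150, 310, 420, 640, 730]

# Panel A style: count versus chromosome length.
r_length = pearson_r(chrom_length_mb, pseudogene_count)

# Panel B style would correlate density (count per Mb) with GC content.
density_per_mb = [n / mb for n, mb in zip(pseudogene_count, chrom_length_mb)]
```

A strong positive `r_length` is what a count proportional to chromosome length would produce.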

**Figure 3**
Overall statistics of human processed pseudogenes. (A) Sequence completeness among human processed pseudogenes. Sequence completeness is defined as the ratio between the length of the predicted protein sequence from the pseudogene and the length of the closest matching protein sequence from SWISS-PROT or TrEMBL. (B) Distribution of the nucleotide sequence identity between the processed pseudogenes and the corresponding functional genes (coding region only). (C) Distribution of the number of frame disruptions among processed pseudogenes. Pseudogenes that have the same number of frame disruptions were grouped together and the numbers of frame disruptions (X-axis) were plotted versus the size of the group (Y-axis). The Y-axis is a log scale.
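
The completeness ratio defined in panel A can be written down directly. The sketch below follows that definition; the identity helper is an assumption (fraction of matching bases over gap-free aligned columns), since the caption does not spell out how identity was computed, and both function names are hypothetical:

```python
def sequence_completeness(pseudogene_protein_len, matched_protein_len):
    """Length of the protein predicted from the pseudogene divided by the
    length of its closest matching protein (the Figure 3A definition)."""
    if matched_protein_len <= 0:
        raise ValueError("matched protein length must be positive")
    return pseudogene_protein_len / matched_protein_len

def nucleotide_identity(aligned_a, aligned_b, gap="-"):
    """Assumed definition: identical bases / aligned columns without a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        raise ValueError("no gap-free columns to compare")
    matches = sum(1 for a, b in columns if a == b)
    return matches / len(columns)
```

For example, a pseudogene whose predicted protein covers 94 of 100 residues of its parent has a completeness of 0.94.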

**Figure 4**
Isochore distribution of the human processed pseudogenes (—♦—), in comparison with the Alu (shaded columns) and LINE1 (open columns) elements. The pseudogene density is in units of number per 10 Mb, and the Alu and LINE1 elements are in units of number per Mb.

**Figure 5**
(A) Occurrences of processed pseudogenes among human functional genes. Human genes that have the same number of processed pseudogenes are grouped together. For each group, the number of pseudogenes (X-axis) and the size of the group (Y-axis) are plotted together. For instance, as seen in the plot, 2299 human genes have between one and five processed pseudogenes. (B) Classification of the processed pseudogenes into GO functional categories. “Unclassified” are those pseudogenes that arose from functional genes that were not yet assigned to a GO category. Less populated categories are lumped together into “Others.”
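
The grouping in panel A amounts to a histogram of pseudogene counts per parent gene. A rough sketch, assuming the input is a list with one parent-gene label per pseudogene; the labels are hypothetical, and the log-log least-squares slope is only a crude way to gauge a power-law trend, not necessarily the fitting method used in the paper:

```python
import math
from collections import Counter

def occurrence_histogram(parent_of_each_pseudogene):
    """Map k -> number of genes that have exactly k processed pseudogenes."""
    per_gene = Counter(parent_of_each_pseudogene)
    return Counter(per_gene.values())

def loglog_slope(histogram):
    """Least-squares slope of log(group size) against log(k)."""
    points = [(math.log(k), math.log(n))
              for k, n in histogram.items() if k > 0 and n > 0]
    if len(points) < 2:
        raise ValueError("need at least two distinct group sizes")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx

# Hypothetical labels, NOT data from the paper: one gene with 4 pseudogenes,
# two genes with 2 each, four genes with 1 each.
parents = (["gene_a"] * 4 + ["gene_b"] * 2 + ["gene_c"] * 2
           + ["gene_d", "gene_e", "gene_f", "gene_g"])
hist = occurrence_histogram(parents)
slope = loglog_slope(hist)
```

A negative slope reflects the pattern described in the abstract: a few genes with many pseudogenes and many genes with few.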

**Figure 6**
Nucleotide sequence divergences of human processed pseudogenes in comparison with Alu and LINE1 elements. Pseudogenes and repeats were grouped into bins according to their nucleotide divergence from functional sequences.

**Figure 7**
The K_a/K_s ratios of the human processed pseudogenes. K_a/K_s is the ratio between the nonsynonymous rate of substitutions (K_a) and the synonymous rate of substitution (K_s). The human processed pseudogenes are divided into two groups according to whether they contain frame disruptions, and the fractions of the pseudogenes in each group are shown side by side for each K_a/K_s bin.

**Figure 8**
Effects of sequence identity and BLAST E-value cutoffs. For different combinations of sequence identity and BLAST E-value, the total numbers of processed pseudogenes and “putative” processed pseudogenes in the final sets are shown together. The cutoffs that were used in this study are underlined.

**Figure 9**
A flow chart showing procedures in searching for processed pseudogenes in the human genome. (RP) ribosomal proteins; (ΨG) pseudogene; (OR) olfactory receptor; (Numts) nuclear mitochondrial pseudogenes; (S-W) Smith-Waterman. The steps are as follows: (1) Six-frame TBLASTN run searching for SWISS-PROT/TrEMBL protein similarities in the human genome. (2) Remove overlaps with Ensembl functional gene annotations. (3) Merging, extension, and realignment. BLAST hits were merged and extended on both sides to match the length of query protein sequence and then realigned with the protein sequence. After this step, 44,478 pseudogene candidates were obtained. (4) Remove false positives, repeats, low complexity sequences, and potential functional gene candidates. A total of 19,927 pseudogene candidates were obtained at this step. In steps 5 and 6, processed pseudogenes, duplicated pseudogenes, and pseudogene fragments were separated according to sequence continuity and completeness. Two special types of pseudogene, ORs and Numts, were further removed from the pool, and processed pseudogenes were grouped into three classes, “True,” “Putative,” and “Disrupted.” See text for the definition of these three classes.
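
Step 3 of the flow chart (merging and extension) can be illustrated with a small interval-merging sketch. This is not the authors' implementation: the `Hit` fields, the `max_gap` threshold, and the 3-nucleotides-per-residue padding are all assumptions made for illustration, and only the plus strand is handled:

```python
from dataclasses import dataclass, replace

@dataclass
class Hit:
    chrom: str
    strand: str
    query: str      # protein identifier
    g_start: int    # genomic start, 0-based inclusive
    g_end: int      # genomic end, exclusive
    q_start: int    # first query residue covered, 0-based
    q_end: int      # last query residue covered, exclusive

def merge_hits(hits, max_gap=60):
    """Merge hits to the same query on the same chromosome and strand
    when they lie within max_gap nucleotides (hypothetical threshold)."""
    merged = []
    ordered = sorted(hits, key=lambda h: (h.chrom, h.strand, h.query, h.g_start))
    for h in ordered:
        last = merged[-1] if merged else None
        same_locus = (last is not None
                      and (last.chrom, last.strand, last.query)
                      == (h.chrom, h.strand, h.query)
                      and h.g_start - last.g_end <= max_gap)
        if same_locus:
            last.g_end = max(last.g_end, h.g_end)
            last.q_start = min(last.q_start, h.q_start)
            last.q_end = max(last.q_end, h.q_end)
        else:
            merged.append(replace(h))  # copy so the input is not mutated
    return merged

def extend_to_query_length(hit, query_len):
    """Pad the genomic span by 3 nt per uncovered query residue."""
    if hit.strand != "+":
        raise NotImplementedError("sketch covers the plus strand only")
    left = 3 * hit.q_start
    right = 3 * (query_len - hit.q_end)
    return replace(hit, g_start=max(0, hit.g_start - left),
                   g_end=hit.g_end + right, q_start=0, q_end=query_len)
```

The extended span would then be realigned against the full query protein, as the caption describes; that realignment step is not shown here.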

Cited by

Genome-Wide Analysis of Coding and Non-coding RNA Reveals a Conserved miR164-NAC-mRNA Regulatory Pathway for Disease Defense in Populus.
Chen S, Wu J, Zhang Y, Zhao Y, Xu W, Li Y, Xie J. Front Genet. 2021 May 28;12:668940. doi: 10.3389/fgene.2021.668940. PMID: 34122520.
A systematic analysis of LINE-1 endonuclease-dependent retrotranspositional events causing human genetic disease.
Chen JM, Stenson PD, Cooper DN, Férec C. Hum Genet. 2005 Sep;117(5):411-27. doi: 10.1007/s00439-005-1321-0. PMID: 15983781. Review.
Global survey of chromatin accessibility using DNA microarrays.
Weil MR, Widlak P, Minna JD, Garner HR. Genome Res. 2004 Jul;14(7):1374-81. doi: 10.1101/gr.1396104. PMID: 15231753.
Apolipoprotein(a) is the Product of a Pseudogene: Implications for the Pathophysiology of Lipoprotein(a).
Sloop GD, Pop G, Weidman JJ, St Cyr JA. Cureus. 2018 May 31;10(5):e2715. doi: 10.7759/cureus.2715. PMID: 30079281. Review.
Retrocopy contributions to the evolution of the human genome.
Baertsch R, Diekhans M, Kent WJ, Haussler D, Brosius J. BMC Genomics. 2008 Oct 8;9:466. doi: 10.1186/1471-2164-9-466. PMID: 18842134.

WEB SITE REFERENCES

1. http://bioinfo.mbb.yale.edu/genome/pseudogene; pseudogene database.
1. http://www.ebi.ac.uk/GOA/; GO annotation of SWISS-PROT/TrEmbl proteins.
1. http://www.ebi.ac.uk/proteome; EBI nonredundant human proteome.
1. http://www.ebi.ac.uk/swissprot/; SWISS-PROT human protein sequences.
1. http://www.ebi.ac.uk/trembl/; TrEMBL human protein sequences.
