pubmed.ncbi.nlm.nih.gov

Molecular fossils in the human genome: identification and analysis of the pseudogenes in chromosomes 21 and 22 - PubMed

Comparative Study

Molecular fossils in the human genome: identification and analysis of the pseudogenes in chromosomes 21 and 22

Paul M Harrison et al. Genome Res. 2002 Feb.

Abstract

We have developed an initial approach for annotating and surveying pseudogenes in the human genome. We search human genomic DNA for regions that are similar to known protein sequences and contain obvious disablements (i.e., mid-sequence stop codons or frameshifts), while ensuring minimal overlap with annotations of known genes. Pseudogenes can be divided into "processed" and "nonprocessed"; the former are reverse transcribed from mRNA (and therefore have no intron structure), whereas the latter presumably arise from genomic duplications. We annotate putative processed pseudogenes based on whether there is a continuous span of homology that is >70% of the length of the closest matching human protein (i.e., with introns removed), or whether there is evidence of polyadenylation. We have applied our approach to chromosomes 21 and 22, the first parts of the human genome completely sequenced, finding 190 new pseudogene annotations beyond the 264 reported by the sequencing centers. In total, on chromosomes 21 and 22, there are 189 processed pseudogenes, 195 nonprocessed pseudogenes, and, additionally, 70 pseudogenic immunoglobulin gene segments. (Detailed assignments are available at http://bioinfo.mbb.yale.edu/genome/pseudogene or http://genecensus.org/pseudogene.) By extrapolation, we predict that there could be up to approximately 20,000 pseudogenes in the whole human genome, with a little more than half of them processed. We have determined the main populations and clusters of pseudogenes on chromosomes 21 and 22. There are notable excesses of pseudogenes relative to genes near the centromeres of both chromosomes, indicating the existence of pseudogenic "hot-spots" in the genome. We have looked at the distribution of InterPro families and Gene Ontology (GO) functional categories in our pseudogenes. 
Overall, the families in both processed and nonprocessed pseudogene populations follow a power-law distribution similar to that found for the occurrence of gene families, with a few big families and many small ones. The processed population is, in particular, enriched in highly expressed ribosomal-protein sequences (approximately 20%), which appear fairly evenly distributed across the chromosomes. We compared processed pseudogenes of different evolutionary ages, observing a high degree of similarity between "ancient" and "modern" subpopulations. This may be attributable to the consistently high expression of ribosomal proteins over evolutionary time. Finally, we find that the chromosome 22 pseudogene population is dominated by immunoglobulin segments, which have a greater rate of disablement per amino acid than the other pseudogene populations and are also substantially more diverged.
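The rule for annotating putative processed pseudogenes stated in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, its parameters, and the idea of passing in a precomputed polyadenylation flag are hypothetical and not taken from the authors' pipeline. The final lines check that the counts quoted in the abstract are mutually consistent.

```python
def is_putative_processed(span_length, protein_length, has_polya_evidence,
                          min_coverage=0.70):
    """Return True if a pseudogene looks processed under the abstract's rule.

    span_length: length (in residues) of the continuous homology span.
    protein_length: length of the closest matching human protein.
    has_polya_evidence: hypothetical flag, assumed to be computed elsewhere.
    """
    coverage = span_length / protein_length
    # Either criterion is sufficient: >70% continuous coverage, or poly(A) evidence.
    return coverage > min_coverage or has_polya_evidence


# Counts quoted in the abstract for chromosomes 21 and 22.
COUNTS = {"processed": 189, "nonprocessed": 195, "immunoglobulin_segments": 70}
total = sum(COUNTS.values())

# New annotations (190) plus those reported by the sequencing centers (264).
assert total == 190 + 264
```

For example, a 240-residue continuous span against a 300-residue protein gives 80% coverage and would be flagged as processed even without polyadenylation evidence.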

Figures

**Figure 1**
Flow diagram of the scheme for assigning pseudogenes. Ovals denote sources of data, and boxes denote operations. The term “ψg” denotes “pseudogene.” The steps are as follows (described in detail in the text): (1) Six-frame BLAST. Comparison of the SWISSPROT database to chromosome 21 and 22 genomic sequences using BLAST (Altschul et al. 1997) to find potential pseudogenic protein homologies (with stop codons in them). (2) FASTA realignment. Realignment with the FASTA package (Pearson et al. 1997) of the top-matching sequence for the potential pseudogene to find the longest protein-homology fragment that has >1 disablement (frameshift or premature stop codon). (3) Minimize overlap with known genes. Overlap of putative pseudogenes with known human genes (from the Sanger center annotations for chromosome 22 and the Riken center annotations for 21) is minimized by choosing a suitable margin at the ends of pseudogenic homologies within which to ignore disablements. (4) Merge with existing ψg annotations. Pseudogene annotations from the Sanger and Riken centers are merged with our own, and any of our annotations that duplicate them are deleted. (5) Date by finding closest matching Ensembl protein. For each pseudogene, the closest matching Ensembl human protein was found so that the pseudogene could be approximately dated. This was realigned to the genomic DNA sequence (backward step denoted by dotted arrow) and used as a replacement if it produced a longer pseudogene. (6) Assess for processing. All pseudogenes were then assessed for processing by searching for evidence of polyadenylation or extensive spans of protein homology in the absence of exon structure.
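The idea behind step (1), translating genomic DNA in all six reading frames and looking for stop codons inside a homologous region, can be sketched as follows. This is a toy illustration, not the BLAST or FASTA procedure used in the paper: it uses the standard genetic code, all function names are hypothetical, and it detects only mid-sequence stop codons (finding frameshifts requires an alignment against the matching protein, which is not shown).

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate DNA codon by codon; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame(dna):
    """Return translations keyed by frame: 1..3 forward, -1..-3 reverse."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


def internal_stops(protein):
    """Count stop codons ('*') that are not the final residue."""
    return protein[:-1].count("*")
```

A translated fragment such as `M*G` has one internal stop, whereas `MK*` has none because its only stop is terminal.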

**Figure 2**
The relationship between the number of InterPro families for pseudogenes and their sizes shows a power–law behavior. The number of InterPro families is plotted vs. the size of a family on a log–log scale. “All pseudogenes” (processed and nonprocessed combined) are plotted with a filled diamond, processed pseudogenes with a cross, and nonprocessed pseudogenes with an unfilled circle. The straight line indicates the best least-squares linear fit to all points for all pseudogenes, except for the outliers that are labeled in the plot. This is indicative of a power–law relationship between the size of a protein family and the number of families that have this size.
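A least-squares line on a log–log plot, as described in this caption, corresponds to fitting log10(number of families) = intercept + slope × log10(family size). The sketch below shows that calculation in plain Python. The data are synthetic, chosen to follow an exact power law, and are not the values from the paper; the function name is hypothetical.

```python
import math


def fit_power_law(sizes, family_counts):
    """Ordinary least-squares fit of log10(count) against log10(size).

    Returns (slope, intercept); the slope is the power-law exponent.
    """
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(c) for c in family_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic example: number of families = 100 * size**-2 (not data from the paper).
sizes = [1, 2, 5, 10]
family_counts = [100, 25, 4, 1]
slope, intercept = fit_power_law(sizes, family_counts)
```

With these synthetic points the fitted slope is close to -2 and the intercept close to 2, which recovers the exponent and prefactor used to generate them.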

**Figure 3**
Distribution of genes and pseudogenes on chromosomes 21 and 22 into GO functional categories. For each table within the figure, the total is split up into the number for chromosome 21 plus number for chromosome 22 and is sorted in decreasing order of this total. Each GO functional category is given a different color. The table at top *left* lists the top five GO functional categories for genes. The table at top *right* lists the top five GO categories for pseudogenes (processed and nonprocessed combined). This is split up into processed and nonprocessed; processed pseudogenes are further divided into ancient and modern sets, as described in the text. The number of immunoglobulin pseudogenic fragments is lumped in with the nonprocessed population. The GO functional categories in the figure are as follows (with GO numbers): (1) “Transcription factor,” GO:0003700. For processed and nonprocessed pseudogenes these all arise from zinc finger C2H2 type. (2) “Other DNA-binding,” all DNA-binding (GO:0003677) except “Transcription factor.” The proteins found for processed pseudogenes in this class are as follows: (IPR000910) HMG1/2 (high mobility group) box; (IPR000210) BTB/POZ domain; (IPR000637) HMG-I and HMG-Y DNA-binding domain (A + T-hook); (IPR002119) Histone H2A; (IPR000079) High mobility group proteins HMG14 and HMG17; (IPR001514) RNA polymerases D/30 to 40 Kd subunits; and (IPR001005) Myb DNA binding domain. (3) “Nucleotide-binding,” GO:0000166. (4) “Nucleic-acid binding,” GO:0003676 (this class arises from motifs or domains that cannot be classified specifically as “DNA-binding” or “RNA-binding”). (5) “Kinase,” GO:0016301. (6) “Ribosomal protein,” GO:0003735. (7) “Receptor,” GO:0004930. (8) “Transferase,” GO:0016740. (9) “RNA binding,” GO:0003723. 
The proteins found for processed pseudogenes in this class are as follows: (IPR001014) Ribosomal L23 protein (also classed as ribosomal protein); (IPR001410) DEAD/DEAH box helicase; (IPR002942) S4 domain; (IPR001892) Ribosomal protein S13 (twice, also classed as ribosomal protein). The protein found for nonprocessed pseudogenes in this class is: (IPR001965) PHD finger. (10) “Oxidoreductase,” GO:0016491. (11) “Cell cycle regulator,” GO:0003750.

**Figure 4**
Distribution of pseudogene and gene densities for chromosomes 21 and 22. On the *left* are panels for chromosome 21, and on the *right* are panels for chromosome 22. For each chromosome, the panels are genes predicted by GenomeScan and the genome sequencing centers (Riken for chromosome 21 and Sanger for 22) (*top*), nonprocessed pseudogenes (*middle*), and processed pseudogenes (*bottom*). Each bin is named x for the interval x to x + 5 Mb. The first bin contains ∼300,000 bases that are beyond the centromere (containing two genes and six pseudogenes). The final bin ends at the end of the telomere. The bins for the pseudogenic hot-spots referred to in the text are asterisked. For processed pseudogenes, we have added a representation of the distribution of ribosomal–protein processed pseudogenes along the chromosomes at the bottom of the panels, with a dot for each ribosomal–protein pseudogene at its approximate position along the chromosome.
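The binning convention in this caption (a bin named x covers x to x + 5 Mb) can be illustrated with a short Python sketch. The positions below are hypothetical, the function name is invented for illustration, and treating each bin as half-open (start included, end excluded) is an assumption, since the caption does not say how boundary positions are handled.

```python
from collections import Counter

BIN_SIZE_MB = 5
BASES_PER_MB = 1_000_000


def bin_positions(positions, bin_size_mb=BIN_SIZE_MB):
    """Count features per bin; keys are bin names in Mb (0, 5, 10, ...)."""
    bin_size = bin_size_mb * BASES_PER_MB
    # Assumed half-open bins: a position exactly at x Mb falls in the bin named x.
    counts = Counter((pos // bin_size) * bin_size_mb for pos in positions)
    return dict(sorted(counts.items()))


# Hypothetical start coordinates (in bases) of annotated pseudogenes.
hypothetical_positions = [120_000, 4_900_000, 5_000_000, 12_300_000, 14_999_999]
density = bin_positions(hypothetical_positions)
```

Here `density` maps the bin named 0 to two features, the bin named 5 to one, and the bin named 10 to two.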

**Figure 5**
Distribution of the percent identity to the closest-matching Ensembl human proteins for processed, nonprocessed, and immunoglobulin gene segment pseudogenes. The bin named x contains every value y such that x < y < (x + 10)%. The panels are processed pseudogenes (*top*), nonprocessed pseudogenes (*middle*), and immunoglobulin pseudogenic gene segments (*bottom*).

Cited by

RNAdb--a comprehensive mammalian noncoding RNA database.
Pang KC, Stephen S, Engström PG, Tajul-Arifin K, Chen W, Wahlestedt C, Lenhard B, Hayashizaki Y, Mattick JS. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D125-30. doi: 10.1093/nar/gki089. PMID: 15608161.
Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome.
Zhang Z, Harrison PM, Liu Y, Gerstein M. Genome Res. 2003 Dec;13(12):2541-58. doi: 10.1101/gr.1429003. PMID: 14656962.
Small, but surprisingly repetitive genomes: transposon expansion and not polyploidy has driven a doubling in genome size in a metazoan species complex.
Blommaert J, Riss S, Hecox-Lea B, Mark Welch DB, Stelzer CP. BMC Genomics. 2019 Jun 7;20(1):466. doi: 10.1186/s12864-019-5859-y. PMID: 31174483.
Active Alu element "A-tails": size does matter.
Roy-Engel AM, Salem AH, Oyeniran OO, Deininger L, Hedges DJ, Kilroy GE, Batzer MA, Deininger PL. Genome Res. 2002 Sep;12(9):1333-44. doi: 10.1101/gr.384802. PMID: 12213770.
Identification and analysis of over 2000 ribosomal protein pseudogenes in the human genome.
Zhang Z, Harrison P, Gerstein M. Genome Res. 2002 Oct;12(10):1466-82. doi: 10.1101/gr.331902. PMID: 12368239.

References

1. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
2. Andersson SG, Zomorodipour A, Andersson JO, Sicheritz-Ponten T, Alsmark UC, Podowski RM, Naslund AK, Eriksson AS, Winkler HH, Kurland CG. The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature. 1998;396:133–140.
3. Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning MD, et al. InterPro—an integrated documentation resource for protein families, domains and functional sites. Bioinformatics. 2000;16:1145–1150.
4. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25:25–29.
5. Bairoch A, Apweiler R. The SWISSPROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000;28:45–48.
