Comparative Study

Identification of novel human genes evolutionarily conserved in Caenorhabditis elegans by comparative proteomics

C H Lai et al. Genome Res. 2000 May.

Abstract

Modern biomedical research greatly benefits from large-scale genome-sequencing projects ranging from studies of viruses, bacteria, and yeast to multicellular organisms such as Caenorhabditis elegans. Comparative genomic studies offer a vast array of prospects for identification and functional annotation of human ortholog genes. We present a novel comparative proteomic approach for assembling human gene contigs and assisting gene discovery. The C. elegans proteome was used as an alignment template to assist in novel human gene identification from human EST nucleotide databases. Among the available 18,452 C. elegans protein sequences, our results indicate that at least 83% (15,344 sequences) of the C. elegans proteome has human homologous genes, with 7,954 C. elegans protein records matching known human gene transcripts. Only 11% or less of the C. elegans proteome consists of nematode-specific genes. We found that the remaining 7,390 sequences might lead to discoveries of novel human genes, and over 150 putative full-length human gene transcripts were assembled upon further database analyses. [The sequence data described in this paper have been submitted to the

Figures

Figure 1

Illustration of CGI gene identification by error correction and gap closure. (a and b) A C. elegans protein sequence of 264 amino acids (AF022982) was used in an initial search against the HGI database. Two different reading frames (frame-1 and frame-2) in THC195430 matched the C. elegans query. Positions of amino acid residues on AF022982 are listed in bold, and positions of nucleotide sequences of THC195430 are underlined. Note the overlapping nucleotide sequence at position 502 in both reading frames. (b) The TBLASTN results around position 502. A high level of similarity was observed in both reading frames. (c) Upon further sequence analysis on dbEST, one dbEST entry (X84715) was used to correct the reading frame by inserting a carboxy nucleotide at position 502, as indicated by the arrowhead. A continuous translatable reading frame was generated following the correction. (d) CGI gene identification by a gap closure procedure. In these cases, two separate THC entries were linked by a C. elegans protein scaffold, and the gap sequences were determined by performing RT-PCR experiments with primers designed from each THC entry.
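
The reading-frame correction described in panel (c) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the DNA string, the insertion index, and the helper names `translate` and `insert_base` are all hypothetical and chosen only to show how a single missing nucleotide shifts the translated frame, and how inserting it restores one continuous open reading frame. The standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str, frame: int = 0) -> str:
    """Translate a DNA string in the given frame (0, 1, or 2); trailing partial codons are dropped."""
    seq = dna.upper()[frame:]
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


def insert_base(dna: str, index: int, base: str) -> str:
    """Return a copy of dna with `base` inserted before the 0-based position `index`."""
    return dna[:index] + base + dna[index:]


# Hypothetical toy contig with one nucleotide missing (NOT the THC195430 sequence).
damaged = "ATGAAAACCGCTATATTGCGAAACAGCGT"

# Downstream of the deletion, the frame-0 translation no longer matches the template protein.
shifted = translate(damaged)

# Inserting the missing base (index 11 is specific to this toy example) restores the frame.
corrected = insert_base(damaged, 11, "C")
restored = translate(corrected)

print(shifted)
print(restored)
```

In this toy case the first four residues agree before and after the correction, while everything downstream of the insertion point changes, which mirrors the situation in which two reading frames of one contig each match part of the same protein query.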

Figure 2

Illustration of the IBMS CGI gene database. Following the determination of CGI genes, a more comprehensive analysis was performed: dbEST searches to obtain the possible full-length nucleotide sequences; determination of the optimum open reading frame; searches of the UniGene-human database for chromosome localization, tissue distribution, and EST matches; and a final BLAST analysis to confirm novelty and protein family annotation. All information was then stored in a customized FileMaker Pro-based IBMS CGI gene database. Only the basic layout is shown here. Separate easy-viewing layouts store the nucleotide and translated protein information, the original C. elegans query protein information (possible ortholog gene), the full-length contig assembly information, and the final BLASTP search results.

Figure 3

Simplified output list from the IBMS database of the CGI genes used for tissue blot analysis: CGI-1, CGI-2, CGI-7, CGI-13, CGI-17, CGI-19, and CGI-27. The full list (CGI-1 to CGI-151) can be obtained as the IBMS database for CGI genes by request or via anonymous ftp at 140.109.41.19.

Figure 4

Analyses of human CGI proteins and their original C. elegans protein queries. (a) Protein length. (b) Matched areas of sequence length. (c) Similarity percentage. Analyses were performed for protein length and BLAST results of CGI-1 to CGI-151 (excluding CGI-71, which is almost identical to CGI-8 with three bases inserted). The average length of human CGI genes is 304 amino acids (106∼933). C. elegans proteins have an average length of 312 residues (107∼1160). Matched areas from BLAST analysis results averaged 255 residues (85∼692). The average homology percentage is 41% (20%∼71%) and the average similarity percentage is 59% (34%∼87%).

Figure 5

RNA master blot analysis of CGI genes in human tissues. Master tissue blots were hybridized with cloned RT-PCR-amplified fragments of human CGI genes as indicated. (a) CGI-7. (b) CGI-17. (c) CGI-27. The exposure time was 3 days for (a), 7 days for (b), and 20 hours for (c). The tissue distribution on each blot row, from left to right in order (1–8), was: A, whole brain; amygdala; caudate nucleus; cerebellum; cerebral cortex; frontal lobe; hippocampus; medulla oblongata. B, occipital pole; putamen; substantia nigra; temporal lobe; thalamus; subthalamic nucleus; spinal cord. C, heart; aorta; skeletal muscle; colon; bladder; uterus; prostate; stomach. D, testis; ovary; pancreas; pituitary gland; adrenal gland; thyroid gland; salivary gland; mammary gland. E, kidney; liver; small intestine; spleen; thymus; peripheral leukocyte; lymph node; bone marrow. F, appendix; lung; trachea; placenta. G, fetal brain; fetal heart; fetal kidney; fetal liver; fetal spleen; fetal thymus; fetal lung. H, yeast total RNA; yeast tRNA; E. coli rRNA; E. coli DNA; Poly r(A); human C0t DNA; human DNA; human DNA.

Figure 6

Northern blot analysis of CGI genes in human tissues. Multiple tissue blots were hybridized with cloned RT-PCR-amplified fragments of human CGI genes as indicated. (a) CGI-2. (b) CGI-19. (c) CGI-1. (d) CGI-13. Blots (a) and (c) were exposed for 4 days, whereas (b) and (d) were exposed for 10 days. Approximately 2 μg of poly(A) RNA from each tissue was loaded per lane. Tissues are indicated above each lane (BR: Brain, HE: Heart, S.Mu: Skeletal Muscle, CO: Colon, TH or THYM: Thymus, SP: Spleen, KI: Kidney, LI: Liver, S.IN: Small Intestine, PL: Placenta, LU: Lung, PBL: Peripheral Blood Leukocyte, PA: Pancreas, PR: Prostate, TE: Testis, OV: Ovary, ST: Stomach, THYR: Thyroid, S.CO: Spinal Cord, LY.N: Lymph Node, TR: Trachea, AD.G: Adrenal Gland, B.MA: Bone Marrow), and the marker sizes are 9.5 kb, 7.5 kb, 4.4 kb, 2.4 kb, and 1.35 kb.
