Comparative Study
Genome annotation assessment in Drosophila melanogaster
M G Reese et al. Genome Res. 2000 Apr.
Abstract
Computational methods for automated genome annotation are critical to our community's ability to make full use of the large volume of genomic sequence being generated and released. To explore the accuracy of these automated feature prediction tools in the genomes of higher organisms, we evaluated their performance on a large, well-characterized sequence contig from the Adh region of Drosophila melanogaster. This experiment, known as the Genome Annotation Assessment Project (GASP), was launched in May 1999. Twelve groups, applying state-of-the-art tools, contributed predictions for features including gene structure, protein homologies, promoter sites, and repeat elements. We evaluated these predictions using two standards, one based on previously unreleased high-quality full-length cDNA sequences and a second based on the set of annotations generated as part of an in-depth study of the region by a group of Drosophila experts. Although these standard sets only approximate the unknown distribution of features in this region, we believe that when taken in context the results of an evaluation based on them are meaningful. The results were presented as a tutorial at the conference on Intelligent Systems in Molecular Biology (ISMB-99) in August 1999. Over 95% of the coding nucleotides in the region were correctly identified by the majority of the gene finders, and the correct intron/exon structures were predicted for >40% of the genes. Homology-based annotation techniques recognized and associated functions with almost half of the genes in the region; the remainder were only identified by the ab initio techniques. This experiment also presents the first assessment of promoter prediction techniques for a significant number of genes in a large contiguous region. We discovered that the promoter predictors' high false-positive rates make their predictions difficult to use. 
Integrating gene finding and cDNA/EST alignments with promoter predictions decreases the number of false-positive classifications but discovers less than one-third of the promoters in the region. We believe that by establishing standards for evaluating genomic annotations and by assessing the performance of existing automated genome annotation tools, this experiment establishes a baseline that contributes to the value of ongoing large-scale annotation projects and should guide further research in genome informatics.
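The nucleotide-level and gene-structure figures quoted above can be illustrated with a small sketch. It is not the GASP evaluation code: the abstract does not define its metrics, so the definitions here are assumptions, following the usual gene-finding convention in which sensitivity is the fraction of annotated coding bases that are predicted and specificity is the fraction of predicted coding bases that are annotated. The function names and the toy coordinates are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# An exon as (start, end), 1-based and inclusive. This coordinate
# convention is an assumption made for the example.
Interval = Tuple[int, int]


def covered_bases(exons: Iterable[Interval]) -> Set[int]:
    """Return the set of sequence positions covered by the exons."""
    bases: Set[int] = set()
    for start, end in exons:
        bases.update(range(start, end + 1))
    return bases


def base_level_accuracy(
    predicted: List[Interval], annotated: List[Interval]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the nucleotide level.

    sensitivity = correctly predicted bases / annotated bases
    specificity = correctly predicted bases / predicted bases
    """
    pred = covered_bases(predicted)
    true = covered_bases(annotated)
    shared = len(pred & true)
    sensitivity = shared / len(true) if true else 0.0
    specificity = shared / len(pred) if pred else 0.0
    return sensitivity, specificity


def exact_structure_match(
    predicted: List[Interval], annotated: List[Interval]
) -> bool:
    """True only if every exon boundary agrees with the annotation."""
    return sorted(predicted) == sorted(annotated)


# Toy gene: two annotated exons; the second predicted exon is shifted.
annotated = [(100, 199), (300, 399)]
predicted = [(100, 199), (310, 419)]

sn, sp = base_level_accuracy(predicted, annotated)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")  # sensitivity=0.95 specificity=0.90
print(exact_structure_match(predicted, annotated))   # False
```

The toy gene shows why the two kinds of figures can diverge: a prediction can recover 95% of the coding bases and still fail the exact intron/exon structure test because a single exon boundary is misplaced.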
Figures
![Figure 1](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4128/310877/dafb371e5a72/gr.22f1_T4TT.gif)
Screen shot from the CloneCurator program (Harris et al. 1999), featuring the genome annotations of all 12 groups for the 2.9-Mb Adh region. The main panel shows the computational annotations on the forward (above axis) and reverse (below axis) sequence strands. Genes located on the top half of each map are transcribed from distal to proximal (with respect to the telomere of chromosome arm 2L); those on the bottom are transcribed from proximal to distal. Displayed immediately below the axis are the two repeat-finding results, followed by the reference sets from Ashburner et al. (1999b; std1 and std3), the 12 submissions of gene-finding programs, the two protein homology programs, and finally, farthest from the axis, the four promoter recognition programs. (Left) The color-coded legend for the programs and the number of predictions made by each program.
![Figure 2](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4128/310877/d2999ce6ccf0/gr.22f2a_T1TT.gif)
(A) Annotations for the following known genes described in Ashburner et al. (1999b) are shown for the region from 2,735,000 to 2,775,000 (from the left to the right of the map): crp (partial, reverse (r)), DS02740.4 (forward (f)), DS02740.5 (f), l(2)35Fb (f), heix (r), DS02740.8 (f), DS02740.9 (r), DS02740.10 (f), anon-35Fa (r), Sed5 (f), cni (r), fzy (f), cact (r). (B) Annotations for the following known gene described in Ashburner et al. (1999b) are shown for the region from 600,000 to 635,000 (left to right): DS01759.1 (r).
![Figure 3](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4128/310877/96ce60626b2f/gr.22f3a_T1TT.gif)
(A) Annotations for the following known genes described in Ashburner et al. (1999b) are shown for the region from 1,109,500 to 1,112,500 (forward strand only) (left to right): Adh, Adhr. (B) Annotations for the following known genes described in Ashburner et al. (1999b) are shown for the region from 1,090,000 to 1,180,000 (left to right): osp (r), Adh (f), Adhr (f), DS09219.1 (r), DS07721.1 (f). (C) Annotations for the following known gene described in Ashburner et al. (1999b) are shown for the region from 2,617,500 to 2,640,000 (forward strand only) (left to right): Ca-α1D. (D) Annotations for the following known genes described in Ashburner et al. (1999b) are shown for the region from 2,894,000 to 2,904,000 (forward strand only) (left to right): idgf1, idgf2, idgf3.
Comment in
- Ashburner M. A biologist's view of the Drosophila genome annotation assessment project. Genome Res. 2000 Apr;10(4):391-3. doi: 10.1101/gr.10.4.391. PMID: 10779478. Review.
References
- Agarwal P, States DJ. Comparative accuracy of methods for protein sequence similarity search. Bioinformatics. 1998;14:40–47.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–410.
- Ashburner M. A biologist's view of the Drosophila genome annotation assessment. Genome Res. 2000 (this issue).
- Ashburner M, Bork P, Durbin R, Guigó R, Hubbard TJ. GASP1 assessment meeting. Heidelberg, Germany: EMBL; 1999a.