Génie: literature-based gene prioritization at multi genomic scale

Nucleic Acids Res. 2011 Jul;39(Web Server issue):W455-61.

doi: 10.1093/nar/gkr246. Epub 2011 May 23.

Jean-Fred Fontaine et al. Nucleic Acids Res. 2011 Jul.

Abstract

Biomedical literature is traditionally used as a way to inform scientists of the relevance of genes in relation to a research topic. However, many genes, especially from poorly studied organisms, are not discussed in the literature. Moreover, a manual and comprehensive summarization of the literature attached to the genes of an organism is generally impossible because of the large number of genes and abstracts involved. We introduce the novel Génie algorithm, which overcomes these problems by evaluating the literature attached to all genes in a genome and to their orthologs according to a selected topic. Génie showed high precision (up to 100%) and the best performance in comparison to other algorithms in most of the benchmarks, especially when high sensitivity was required. Moreover, the prioritization of zebrafish genes involved in heart development, using human and mouse orthologs, showed high enrichment in differentially expressed genes from microarray experiments. The Génie web server supports hundreds of species and millions of genes, and offers novel functionalities. Typical run times below a minute, even when analyzing the human genome with hundreds of thousands of literature records, allow the use of Génie in routine lab work.

Availability: http://cbdm.mdc-berlin.de/tools/genie/.

Figures

Figure 1.

Flow chart of the Génie web tool and algorithm. As an example, a user could query human genes related to a disease or a molecular pathway using chicken and rat orthologs. Usage of orthology information is optional. Data are extracted from four NCBI databases: Taxonomy, Gene, MEDLINE and HomoloGene. As the retrieved literature associated with the topic may not be complete, it is used to train a text mining classifier that will select relevant gene literature. The output gene list (human genes in the given example) is ranked using Fisher’s statistics.
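
The ranking step described in the Figure 1 caption can be illustrated with a minimal Python sketch. The caption says only that genes are ranked "using Fisher's statistics"; the one-sided Fisher's exact test on per-gene counts of relevant abstracts used here is an assumption, not a confirmed description of Génie's scoring. The function names, gene labels and abstract IDs are all hypothetical, and the set of relevant abstracts stands in for the output of the text mining classifier.

```python
from math import comb


def fisher_right_tail(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric(N, K, n): one-sided Fisher's exact test.

    k: relevant abstracts linked to the gene
    n: all abstracts linked to the gene
    K: relevant abstracts in the whole collection
    N: all abstracts in the whole collection
    """
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


def pool_orthologs(target_genes, abstracts_by_gene, orthologs):
    """Merge each target gene's abstracts with those of its orthologs (optional step)."""
    pooled = {}
    for gene in target_genes:
        merged = set(abstracts_by_gene.get(gene, ()))
        for other in orthologs.get(gene, ()):
            merged |= set(abstracts_by_gene.get(other, ()))
        pooled[gene] = merged
    return pooled


def rank_genes(gene_abstracts, relevant, n_abstracts):
    """Return (gene, p-value) pairs sorted from most to least enriched."""
    scored = []
    for gene, abstracts in gene_abstracts.items():
        k = len(abstracts & relevant)
        p = fisher_right_tail(k, len(abstracts), len(relevant), n_abstracts)
        scored.append((gene, p))
    return sorted(scored, key=lambda pair: pair[1])


# Toy data (made up): 20 abstracts in total, 5 of them judged relevant.
abstracts_by_gene = {
    "zf_geneA": {1, 2, 3},         # hypothetical zebrafish gene
    "hs_geneA": {4, 5, 6},         # hypothetical human ortholog of zf_geneA
    "zf_geneB": {10, 11, 12, 13},  # hypothetical zebrafish gene, no orthologs
}
orthologs = {"zf_geneA": ["hs_geneA"]}
relevant = {1, 2, 3, 4, 5}

pooled = pool_orthologs(["zf_geneA", "zf_geneB"], abstracts_by_gene, orthologs)
for gene, p in rank_genes(pooled, relevant, n_abstracts=20):
    print(gene, p)
```

In this toy run, pooling the ortholog's abstracts gives the first gene five relevant abstracts out of six, so it ranks above the gene with none. Skipping `pool_orthologs` corresponds to the optional use of orthology information mentioned in the caption.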

Figure 2.

Benchmarks. (a) Génie confidence scores versus log2-fold expression changes for all up-regulated probes (at least 2-fold expression change) in a zebrafish microarray data set between hearts from 3-day-old zebrafish embryos and whole body tissue. All probes with a positive confidence score were selected by Génie using orthology to zebrafish, mice and humans (red diamonds and black crosses). Probes also selected by Génie using only zebrafish-related abstracts are plotted with black crosses. Genes not selected by Génie have a score equal to zero (blue circles). The scores and gene expression fold changes for each gene are available as Supplementary Table S6. (b) Precision when predicting differentially expressed genes using gene ranks given by Génie. From the zebrafish microarray data analysis, differentially expressed genes are selected by an FDR < 0.01 and a minimum 2-fold expression change between heart and body samples. (c) These precision–recall plots show the performance of Génie (red curves), Fable (blue curves) and PolySearch (black curves) when ranking genes from eight randomly chosen KEGG pathways. The three tools were used with default parameters. PolySearch returned no results for two pathways: drug metabolism cytochrome P450 and fructose mannose metabolism (see Supplementary Methods). Génie was run without using orthology expansion of the literature.
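
The evaluation behind panels (b) and (c) can be sketched in Python as follows: pick a set of positive genes, then compute precision and recall at every rank cutoff of a ranked gene list. This is a generic illustration, not the authors' benchmarking code. The function names and toy values are hypothetical, and treating the "minimum 2-fold expression change" as |log2 fold change| >= 1 in either direction is an assumption.

```python
def select_positives(stats, max_fdr=0.01, min_abs_log2fc=1.0):
    """Genes with FDR below the cutoff and at least a 2-fold change.

    stats: dict gene -> (fdr, log2_fold_change). Treating the 2-fold
    requirement as |log2 FC| >= 1 in either direction is an assumption.
    """
    return {
        gene
        for gene, (fdr, log2fc) in stats.items()
        if fdr < max_fdr and abs(log2fc) >= min_abs_log2fc
    }


def precision_recall_points(ranked_genes, positives):
    """Return one (recall, precision) pair per rank cutoff k = 1..len(ranked_genes)."""
    positives = set(positives)
    if not positives:
        raise ValueError("need at least one positive gene")
    points, hits = [], 0
    for k, gene in enumerate(ranked_genes, start=1):
        if gene in positives:
            hits += 1
        # recall = hits / all positives; precision = hits / genes seen so far
        points.append((hits / len(positives), hits / k))
    return points


# Toy data (made up): gene -> (FDR, log2 fold change).
stats = {
    "g1": (0.001, 2.3),
    "g2": (0.2, 1.5),     # fails the FDR cutoff
    "g3": (0.005, -1.2),
    "g4": (0.001, 0.4),   # fails the fold-change cutoff
}
positives = select_positives(stats)
ranked = ["g1", "g2", "g3", "g4"]  # hypothetical ranking, best first
for recall, precision in precision_recall_points(ranked, positives):
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

Plotting precision against recall for each tool's ranking would give curves of the kind described for panel (c); the precision values at successive rank cutoffs correspond to the quantity reported in panel (b).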
