Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research

  • Thu Jan 01 2015

Àlex Bravo et al. BMC Bioinformatics. 2015.

Abstract

Background: Current biomedical research needs to leverage and exploit the large amount of information reported in scientific publications. Automated text mining approaches, in particular those aimed at finding relationships between entities, are key for identification of actionable knowledge from free text repositories. We present the BeFree system aimed at identifying relationships between biomedical entities with a special focus on genes and their associated diseases.

Results: By exploiting morpho-syntactic information of the text, BeFree is able to identify gene-disease, drug-disease and drug-target associations with state-of-the-art performance. The application of BeFree to real-case scenarios shows its effectiveness in extracting information relevant for translational research. We show the value of the gene-disease associations extracted by BeFree through a number of analyses and integration with other data sources. BeFree succeeds in identifying genes associated with a major cause of morbidity worldwide, depression, which are not present in other public resources. Moreover, large-scale extraction and analysis of gene-disease associations, and integration with current biomedical knowledge, provided interesting insights on the kind of information that can be found in the literature, and raised challenges regarding data prioritization and curation. We found that only a small proportion of the gene-disease associations discovered by using BeFree is collected in expert-curated databases. Thus, there is a pressing need to find alternative strategies to manual curation, in order to review, prioritize and curate text-mining data and incorporate it into domain-specific databases. We present our strategy for data prioritization and discuss its implications for supporting biomedical research and applications.

Conclusions: BeFree is a novel text mining system that performs competitively for the identification of gene-disease, drug-disease and drug-target associations. Our analyses show that mining only a small fraction of MEDLINE results in a large dataset of gene-disease associations, and only a small proportion of this dataset (2%) is actually recorded in curated resources, raising several issues on data prioritization and curation. We propose joint analysis of text-mined data with expert-curated data as a suitable approach to both assess data quality and highlight novel and interesting information.

Figures

Figure 1

Global and local context kernels to represent a gene-disease association. a) The sentence extracted from a MEDLINE abstract (PMID:22337703) expresses the association between the disease MDD (Major Depressive Disorder) and the genes EHD3 and FREM3. We focus on the association between EHD3 and MDD to illustrate the features considered in each kernel. b and c) The local context kernel (K_LC) uses orthographic and shallow linguistic features (POS, lemma, stem) of the tokens located to the left and right (window size of 2) of the candidate entities (EHD3 and MDD). d) The global context kernel (K_GC) is based on the assumption that an association between two entities (in this case EHD3 and MDD) is more likely to be expressed within one of three patterns (fore-between, between, between-after). In this example the association between EHD3 and MDD is expressed in the between pattern. e) In the global context kernel (K_GC) we consider both trigrams and sparse bigrams in each pattern.
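The context features described in this caption can be illustrated with a minimal Python sketch. It is not the BeFree implementation: the function names, the example sentence, the fore/after window of 3 tokens, and the reading of "sparse bigram" as two tokens with one token skipped between them are all assumptions made for illustration. Only the local window size of 2 and the three pattern names come from the caption.

```python
from typing import Dict, List, Tuple


def local_context(tokens: List[str], idx: int, window: int = 2) -> Dict[str, List[str]]:
    """Return the tokens up to `window` positions left and right of tokens[idx]."""
    left = tokens[max(0, idx - window):idx]
    right = tokens[idx + 1:idx + 1 + window]
    return {"left": left, "right": right}


def global_patterns(tokens: List[str], i: int, j: int, window: int = 3) -> Dict[str, List[str]]:
    """Split a sentence into the fore-between, between and between-after patterns.

    `i` and `j` are the token positions of the two candidate entities.
    The `window` bound on the fore and after parts is an assumption.
    """
    first, second = sorted((i, j))
    between = tokens[first + 1:second]
    fore = tokens[max(0, first - window):first]
    after = tokens[second + 1:second + 1 + window]
    return {
        "fore-between": fore + between,
        "between": between,
        "between-after": between + after,
    }


def trigrams(tokens: List[str]) -> List[Tuple[str, ...]]:
    """All runs of three consecutive tokens."""
    return [tuple(tokens[k:k + 3]) for k in range(len(tokens) - 2)]


def sparse_bigrams(tokens: List[str]) -> List[Tuple[str, str]]:
    """Token pairs with exactly one skipped token between them (assumed definition)."""
    return [(tokens[k], tokens[k + 2]) for k in range(len(tokens) - 2)]


# Hypothetical sentence, not the one from PMID:22337703.
TOKENS = ["Variants", "in", "EHD3", "were", "associated", "with", "MDD", "in", "women"]
GENE_IDX, DISEASE_IDX = 2, 6

if __name__ == "__main__":
    print(local_context(TOKENS, GENE_IDX))
    patterns = global_patterns(TOKENS, GENE_IDX, DISEASE_IDX)
    for name, span in patterns.items():
        print(name, trigrams(span), sparse_bigrams(span))
```

In a kernel-based classifier, feature sets such as these would typically be turned into vectors and compared pairwise; the sketch stops at feature extraction and leaves out POS tags, lemmas and stems, which require a linguistic tagger.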

Figure 2

Dependency graph representation of a gene-disease association. a) Dependency graph representation of the sentence. Solid lines represent the shortest path between the two candidates. The token “associated” is the Least Common Subsumer (LCS) node of both candidates. b) Subgraph representing the shortest path between EHD3 and MDD, where syntactic dependencies are represented as edges and tokens as nodes. c) The e-walk and v-walk features for the node “association” and the syntactic (token, stem, lemma, POS) and semantic features (role) considered in the K_DEP kernel.
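The shortest path between two candidate entities in a dependency graph can be found with a breadth-first search, as in the sketch below. This is an illustrative example only, not the method used by BeFree: the edge list is hand-written for a hypothetical sentence rather than produced by a parser, and the dependency labels and the function name `shortest_path` are assumptions.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# (head, dependent, label) triples for a hypothetical parse of
# "EHD3 was associated with MDD in women". Labels are illustrative.
EDGES: List[Tuple[str, str, str]] = [
    ("associated", "EHD3", "nsubjpass"),
    ("associated", "was", "auxpass"),
    ("associated", "MDD", "prep_with"),
    ("MDD", "women", "prep_in"),
]


def shortest_path(edges: List[Tuple[str, str, str]], source: str, target: str) -> List[str]:
    """Breadth-first search over the dependency graph, treated as undirected."""
    adjacency: Dict[str, List[str]] = {}
    for head, dependent, _label in edges:
        adjacency.setdefault(head, []).append(dependent)
        adjacency.setdefault(dependent, []).append(head)

    # Map each visited node to the node it was reached from.
    previous: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for neighbour in adjacency.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)

    if target not in previous:
        return []  # the two tokens are not connected

    # Walk back from the target to the source, then reverse.
    path: List[str] = []
    current: Optional[str] = target
    while current is not None:
        path.append(current)
        current = previous[current]
    return path[::-1]


if __name__ == "__main__":
    print(shortest_path(EDGES, "EHD3", "MDD"))
```

In this toy tree the path runs through "associated", which is the head of both candidates and so plays the role the caption assigns to the LCS node.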

Figure 3

Depression genes identified by BeFree and their overlap with genes available in other repositories. Venn diagram showing the overlap between the depression genes identified by BeFree trained on the GAD or EU-ADR corpora and the depression genes present in DisGeNET.

Figure 4

Number of gene-disease associations as a function of the number of PMIDs that support each association.

Figure 5

Number of gene-disease associations reported by only one PMID in each calendar year. In red we show the number of associations present in DisGeNET.

Figure 6

Number of gene-disease associations reported by only one PMID in journals classified by their Impact Factor. In red we show the number of associations present in DisGeNET.

Figure 7

Distribution of the number of gene-disease associations reported per MEDLINE abstract.

Figure 8

Decision Tree Workflow for selection of BeFree dataset on gene-disease associations.

Figure 9

Overlap of the gene-disease associations identified by BeFree with the associations available in DisGeNET curated and predicted sources. DisGeNET information coming from expert-curated sources such as UniProt is classified as curated, whereas information coming from animal models such as mouse is classified as predicted. For more information see http://www.disgenet.org/.

Figure 10

DisGeNET score vs number of supporting publications for the gene-disease associations identified by BeFree. The selected examples discussed in the text are: 1) TP53-Malignant Neoplasm; 2) BRCA1-Breast Carcinoma; 3) ESR1-Breast Carcinoma; 4) ERBB2-Breast Carcinoma; 5) BRCA1-Ovarian Carcinoma; 7) APP-Alzheimer disease; 8) CFTR-Cystic Fibrosis.

Figure 11

Distribution of diseases according to the MeSH disease classification in the BeFree and DisGeNET datasets. Note that more than 40% of diseases in BeFree do not contain a MeSH disease class.

Figure 12

Frequency distribution of the number of associated diseases per gene.

Figure 13

Frequency distribution of the number of associated genes per disease.

Figure 14

Distribution of disease proteins according to the Panther Protein classification. Data from Panther (http://www.pantherdb.org/) was used to annotate disease proteins from BeFree and DisGeNET. Note that more than 37% of proteins in BeFree cannot be classified according to Panther.
