Quantitative assessment of dictionary-based protein named entity tagging

Hongfang Liu et al. J Am Med Inform Assoc. 2006 Sep-Oct.

Abstract

Objective: Natural language processing (NLP) approaches have been explored to manage and mine information recorded in biological literature. A critical step for biological literature mining is biological named entity tagging (BNET) that identifies names mentioned in text and normalizes them with entries in biological databases. The aim of this study was to provide quantitative assessment of the complexity of BNET on protein entities through BioThesaurus, a thesaurus of gene/protein names for UniProt knowledgebase (UniProtKB) entries that was acquired using online resources.

Methods: We evaluated the complexity from several perspectives: ambiguity (i.e., the number of genes/proteins represented by one name), synonymy (i.e., the number of names associated with the same gene/protein), and coverage (i.e., the percentage of gene/protein names in text included in the thesaurus). We also normalized the names in BioThesaurus, so each measure was obtained twice: once before normalization and once after.
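As a rough illustration of the three measures defined above, the following Python sketch computes average ambiguity, average synonymy, and coverage over a toy name-to-entry mapping. The function names `thesaurus_metrics` and `simple_normalize`, the entry identifiers, and the names are all hypothetical. The abstract does not describe the actual normalization procedure or how the averages were computed, so this is only one plausible reading.

```python
import re
from collections import defaultdict


def simple_normalize(name):
    # Hypothetical normalizer (an assumption, not the paper's rule):
    # lowercase the name and drop every non-alphanumeric character.
    return re.sub(r"[^a-z0-9]", "", name.lower())


def thesaurus_metrics(pairs, text_names, normalize=lambda s: s):
    """Return (average ambiguity, average synonymy, coverage in percent).

    pairs: iterable of (name, entry_id) tuples from the thesaurus.
    text_names: names observed in text, used for the coverage measure.
    normalize: function applied to every name before counting.
    """
    name_to_entries = defaultdict(set)
    entry_to_names = defaultdict(set)
    for name, entry in pairs:
        key = normalize(name)
        name_to_entries[key].add(entry)
        entry_to_names[entry].add(key)

    # Ambiguity: entries per name, averaged over distinct names.
    avg_ambiguity = sum(len(e) for e in name_to_entries.values()) / len(name_to_entries)
    # Synonymy: names per entry, averaged over distinct entries.
    avg_synonymy = sum(len(n) for n in entry_to_names.values()) / len(entry_to_names)
    # Coverage: share of text names that appear in the thesaurus.
    covered = sum(1 for n in text_names if normalize(n) in name_to_entries)
    coverage = 100.0 * covered / len(text_names)
    return avg_ambiguity, avg_synonymy, coverage


# Toy data with made-up names and entry identifiers.
pairs = [("ABC-1", "E1"), ("abc1", "E1"), ("ABC-1", "E2"), ("XYZ protein", "E1")]
text_names = ["ABC-1", "Abc 1", "XYZ protein", "QRS"]

raw = thesaurus_metrics(pairs, text_names)
normalized = thesaurus_metrics(pairs, text_names, normalize=simple_normalize)
print(raw)
print(normalized)
```

In this toy example normalization merges "ABC-1" and "abc1" into one name, so synonymy falls while coverage rises. The abstract reports the same direction of change for synonymy.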

Results: The current version of BioThesaurus has over 2.6 million names or 2.1 million normalized names covering more than 1.8 million UniProtKB entries. The average synonymy is 3.53 (2.86 after normalization), ambiguity is 2.31 before normalization and 2.32 after, while the coverage is 94.0% based on the BioCreAtive data set comprising MEDLINE abstracts containing genes/proteins.

Conclusion: The study indicated that names for genes/proteins are highly ambiguous and there are usually multiple names for the same gene or protein. It also demonstrated that most gene/protein names appearing in text can be found in BioThesaurus.


Figures

Figure 1

The construction of BioThesaurus. Annotation fields from GenPept, PSD, RefSeq, Entrez Gene, Swiss-Prot and TrEMBL were extracted and associated with iProClass entries. Other databases were also included, among them several model organism databases, HUGO, and ENZYME. Terms obtained from the annotation fields comprised the Raw Dictionary. An automatic curation process was performed using the UMLS. We also manually inspected highly ambiguous entries in the raw dictionary and removed nonsensical terms. After curation, we obtained BioThesaurus, in which terms are associated with entities from iProClass. BioThesaurus can be used for extensive information retrieval, for investigating relationships among entities sharing the same name, for biological named entity tagging, and as a gateway for protein information exploration.

Figure 2

Synonymous protein names from multiple data sources (UniProtKB: O00151). Names/synonyms of the protein entry are listed by rank of frequency (in parentheses), that is, the number of unique sources (Source Attribute) from which each name is derived. Names with higher frequency may be in more common use than those with lower frequency. Textual variants give the name as it appeared in its source.

Figure 3

Protein name ambiguities for the query “CLIM1” in BioThesaurus. The query name “CLIM1” corresponds to eight UniProtKB entries (ambiguity of 8); each is displayed with its corresponding UniRef clusters (90 and 50), as well as PIRSF families and Pfam domains. The UniRef cluster and PIRSF family information can be used to estimate the name ambiguity with homologous proteins disregarded. In this case, the ambiguity drops to 4 based on UniRef90 or UniRef50, or to 3 based on PIRSF. In fact, the eight proteins belong to three functionally heterogeneous groups of proteins.
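The drop in ambiguity described in this caption can be sketched as counting distinct groups rather than distinct entries. The Python snippet below is a minimal illustration under stated assumptions: the function `collapsed_ambiguity`, the entry identifiers, and the cluster and family labels are hypothetical placeholders chosen to mirror the 8, 4, and 3 counts in the caption, not the actual CLIM1 data.

```python
def collapsed_ambiguity(entries, group_of):
    """Count distinct groups among the entries that share one name.

    entries: entry identifiers associated with the query name.
    group_of: mapping from entry identifier to a cluster or family label.
    Entries missing from group_of count as their own group (an assumption).
    """
    return len({group_of.get(entry, entry) for entry in entries})


# Eight made-up entries for a single query name.
entries = ["E1", "E2", "E3", "E4", "E5", "E6", "E7", "E8"]

# Made-up sequence-cluster labels (standing in for UniRef-style clusters).
cluster_of = {"E1": "C1", "E2": "C1", "E3": "C1", "E4": "C2",
              "E5": "C2", "E6": "C3", "E7": "C4", "E8": "C4"}

# Made-up family labels (standing in for PIRSF-style families).
family_of = {"E1": "F1", "E2": "F1", "E3": "F1", "E4": "F1",
             "E5": "F1", "E6": "F2", "E7": "F3", "E8": "F3"}

entry_level = len(set(entries))
cluster_level = collapsed_ambiguity(entries, cluster_of)
family_level = collapsed_ambiguity(entries, family_of)
print(entry_level, cluster_level, family_level)
```

Grouping by a coarser classification can only keep the count the same or lower it, which is why family-level ambiguity is never higher than cluster-level ambiguity when families contain whole clusters.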
