BioCreAtIvE: the Critical Assessment of Information Extraction systems in Biology
(BioCreAtIvE) challenge evaluation is a community-wide effort to evaluate text mining and
information extraction systems applied to the biological domain.
Curation (Biology): in this context, curation of biological databases means
the manual extraction of biological information from the literature by a
domain expert. The aim is to transform information contained in free text (scientific
literature) into information stored as structured database
records (biological databases).
EBI: European Bioinformatics Institute (EMBL-EBI). Among other research groups, the EBI hosts the GOA-EBI group for
annotation of gene products with GO terms, the IntAct team for protein-protein interaction
annotation and the Rebholz team for biomedical text mining.
F-measure (balanced F-score): the harmonic mean of precision and recall:
F = 2 × precision × recall / (precision + recall). It is a commonly used performance measure in
information retrieval (IR).
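As a minimal sketch of the formula above, the following Python function computes the balanced F-score from a precision value and a recall value. The name `f_measure` is hypothetical and chosen only for illustration. Returning 0.0 when both inputs are zero is an assumed convention to avoid division by zero; the definition itself does not cover that case.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Assumed convention (not part of the definition): avoid dividing by zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because the harmonic mean is pulled toward the smaller of the two values, a system with perfect precision but low recall (or the reverse) still receives a low F-measure.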
GO: the Gene Ontology (GO) is an initiative
to provide a set of controlled vocabulary terms for describing gene and gene product attributes.
These terms are used to annotate gene products in a consistent way. The three main GO categories are Cellular
Component, Molecular Function and Biological Process.
GOA: Gene Ontology Annotation (GOA) is a
project run by the EBI to provide assignments of gene products to Gene Ontology (GO) terms.
HUPO: Human Proteome Organisation (HUPO).
IMEx: the IMEx consortium is a group of protein interaction data providers
that share the curation effort and exchange molecular interaction data
records, using an XML format following the PSI-MI standard for molecular
interactions. Its partners comprise IntAct, MINT, BIND, DIP and MPact.
Information extraction (IE): IE systems perform natural language text analysis in
order to identify information related to pre-defined types of entities (e.g. genes or proteins),
relationships, facts or events.
Information retrieval (IR): ...
IntAct: IntAct provides a freely available, open source database system and analysis tools for protein interaction
data. The interactions stored in IntAct are derived from literature curation or direct user submissions.
It distributes software developed within the IntAct project and controlled vocabularies for the
MINT: the Molecular INTeraction database (MINT) is an initiative of the University of Rome (Tor Vergata)
to store data on functional interactions between proteins, focusing on experimentally verified
interactions. It considers both direct and indirect relationships and hosts a team of expert
curators who extract interaction information from the literature. Refer to
Zanzoni et al. (2002).
PMID: the PubMed database identifier (PMID) is a unique identifier for each
PubMed citation, e.g. 11911893.
Precision: the number of answers the system got right divided by the number of answers the system gave.
Protein-protein interaction: molecular interactions of proteins.
Although there are many different types of interactions, often protein interactions are considered as physical
PSI-MI: the Proteomics Standards Initiative Molecular Interaction (PSI-MI) XML format
is a community standard for the representation of protein interaction data followed by several
Refer to Hermjakob et al. (2004).
PubMed: PubMed is a literature database developed by the NCBI and available via the NCBI Entrez
retrieval system. It is currently the most important literature database for the life sciences
and contains over 15 million citations.
Recall: the number of answers the system got right divided by the number of possible right answers.
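The precision and recall definitions in this glossary can be illustrated with a short Python sketch. It assumes that system answers and correct answers are each represented as a set of comparable items. The function name `precision_recall` and the example labels are hypothetical, and returning 0.0 for an empty set is an assumed convention rather than part of either definition.

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for a set of system answers against a gold standard."""
    # Answers the system got right: those present in both sets.
    correct = len(predicted & gold)
    # Precision: right answers divided by all answers the system gave.
    precision = correct / len(predicted) if predicted else 0.0
    # Recall: right answers divided by all possible right answers.
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: the system returns four answers, two of which are
# among the three correct ones.
predicted = {"a", "b", "c", "d"}
gold = {"a", "b", "e"}
p, r = precision_recall(predicted, gold)
```

In this example the system gave four answers and got two right, so precision is 2/4; there were three possible right answers, so recall is 2/3.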