Human gene/protein normalization
Premise
Systems will be required to return the EntrezGene (formerly LocusLink)
identifiers corresponding to the human genes and direct gene
products appearing in a given MEDLINE abstract. This has relevance to
improving document indexing and retrieval, and to linking text mentions to
database identifiers in support of more sophisticated information extraction
tasks. It is similar to Task 1B of BioCreAtIvE I [1].
System Input
Participating groups will be given
a master list of human EntrezGene identifiers with some common gene and
protein names (synonyms) for each identifier in the master list. For the
evaluation task, the input is a collection of plain text abstracts.
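As an illustration of how the master list might be used, the following is a minimal sketch of a dictionary-lookup baseline. The lexicon file layout is not specified here, so the sketch assumes one identifier per line followed by its tab-separated synonyms; the function names `load_lexicon` and `find_candidates` are hypothetical, and exact case-insensitive matching is only one of many possible strategies.

```python
import re
from collections import defaultdict


def load_lexicon(lines):
    """Map each lower-cased name to the set of identifiers that list it.

    Assumed (not documented) layout: ID <TAB> name1 <TAB> name2 ...
    """
    name_to_ids = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip lines with no synonyms
        gene_id, names = fields[0], fields[1:]
        for name in names:
            if name:
                name_to_ids[name.lower()].add(gene_id)
    return name_to_ids


def find_candidates(abstract, name_to_ids):
    """Return {gene_id: first matching excerpt} for names found in the text."""
    found = {}
    for name, ids in name_to_ids.items():
        # Require that the name is not embedded in a longer word.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        match = re.search(pattern, abstract, flags=re.IGNORECASE)
        if match:
            for gene_id in ids:
                # Keep only one excerpt per identifier.
                found.setdefault(gene_id, match.group(0))
    return found


lexicon = load_lexicon(["987\tfoobar\tFBR1", "654\tbaz"])
candidates = find_candidates("We show that FooBar binds DNA.", lexicon)
```

A baseline like this will over-generate on ambiguous names (one name listed under several identifiers) and miss spelling variants, which is one reason groups may want to prune or extend the lexicon.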
System Output
For each abstract, the system will return a list of the
EntrezGene identifiers and corresponding text excerpts for each human gene
or gene product mentioned in the abstract. The excerpt required is a single
mention of the gene's 'name' found in the abstract. Even if a gene is
mentioned several different places in an abstract with alternate names being
used, only a single excerpt/mention is to be returned by the system. If desired, groups may also include a fourth column containing a confidence measure that ranges from 0 (no confidence) to 1 (absolute confidence). This is not part of the main evaluation; it is included as an option at the request of some participants.
The return format is a single file with one entry per line and the fields delimited by tabs. The columns are: PUBMED ID, EntrezGene (LocusLink) ID, Mention Text, and optionally Confidence. There should be no column headers or line numbers.
Although the hand-annotated training file contains multiple text excerpts for each identifier, these are meant only to aid training; a participating system is expected to return just one (any one of the set would be 'correct', although getting the right text is not the main part of the evaluation). An example line with made-up identifiers follows:
123456 987 foobar
If interested in the optional confidence numbers:
123456 987 foobar .87
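The return format described above can be sketched as a small formatter. This is an illustrative helper, not part of the distributed tools: the names `format_entry` and `format_results` are hypothetical, and writing the confidence with two decimals is an assumption (the example above writes `.87`).

```python
def format_entry(pmid, gene_id, mention, confidence=None):
    """Build one tab-delimited result line (no trailing newline)."""
    fields = [str(pmid), str(gene_id), str(mention)]
    # A tab or newline inside a field would break the column layout.
    if any("\t" in f or "\n" in f for f in fields):
        raise ValueError("fields must not contain tabs or newlines")
    if confidence is not None:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must lie between 0 and 1")
        fields.append(f"{confidence:.2f}")  # assumed precision
    return "\t".join(fields)


def format_results(entries):
    """Join entries into file contents: one entry per line, no header."""
    return "".join(format_entry(*entry) + "\n" for entry in entries)


plain = format_entry(123456, 987, "foobar")
scored = format_entry(123456, 987, "foobar", 0.87)
```

Each entry passed to `format_results` is a tuple of three or four values, matching the three required columns and the optional confidence column.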
Evaluation
System performance will be evaluated on how well the generated
EntrezGene identifier list corresponds to one generated by human annotators.
To better understand which variables affect system performance, we will also try to examine how various features (e.g. term length, or the variation of annotated terms from those in the lexicon) impact performance. We are releasing a preliminary scoring script with the distributed data, but we will score the main evaluation on the single task of returning the list of gene identifiers for each abstract. If participants have alternate techniques for understanding system performance (such as using the optional confidence scores, which were not part of the original scoring script), we will try to include them as appropriate. In this way we hope both to identify optimal techniques for achieving the main task and to increase our understanding of what factors impact performance.
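For groups that want a quick local check during development, the comparison of a system's identifier lists against the annotators' lists can be sketched as follows. The metric is an assumption: this sketch computes precision, recall and F-measure over (abstract, identifier) pairs, which may differ from what the distributed scoring script does. The function name `score` and the dictionary layout are hypothetical.

```python
def score(gold, predicted):
    """Compare identifier sets per abstract.

    Both arguments map a PubMed ID to a set of gene identifiers.
    Returns (precision, recall, f_measure) over (pmid, gene_id) pairs.
    """
    gold_pairs = {(pmid, g) for pmid, ids in gold.items() for g in ids}
    pred_pairs = {(pmid, g) for pmid, ids in predicted.items() for g in ids}
    true_positives = len(gold_pairs & pred_pairs)
    precision = true_positives / len(pred_pairs) if pred_pairs else 0.0
    recall = true_positives / len(gold_pairs) if gold_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


gold = {"1": {"10", "20"}, "2": {"30"}}
predicted = {"1": {"10"}, "2": {"30", "40"}}
precision, recall, f_measure = score(gold, predicted)
```

Only the identifiers are compared here; the mention text and the optional confidence column are ignored, in line with the main evaluation being scored on the identifier list alone.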
Data Selection and Annotation
Abstracts were selected from those annotated
by EBI's Human GOA [2] group, since this selection is assumed to be enriched
in mentions of human genes and gene products. A small group of annotators
trained in molecular biology searched through the abstract text (and title),
identifying mentions of genes and gene products using UniProt and the NCBI
Gene interface for identifying the corresponding EntrezGene identifier.
Inter-annotator agreement was measured at over 90%. We will release a
hand-annotated training/development set of 281 abstracts, and we anticipate
another 250-275 to be used in the evaluation. We have also
compiled a lexicon for the human EntrezGene identifiers using common
gene/protein name sources, which will be released along with the training
data. Participating groups may wish to compile their own lexical resources
or discover ways to prune the provided lexicon. Five thousand abstracts
from the GOA annotation set will be released along with the EntrezGene
identifiers that correspond to the EBI GOA annotations. These have
been derived using the UniProt-to-EntrezGene mapping of
PIR [3] and may provide useful, if noisy, training data. However, there are a
number of limitations with this dataset, since most genes/proteins
mentioned are not recorded, and the annotations, which were made to UniProt,
do not completely map into EntrezGene. Participants are requested not to
download or use the EBI human GOA annotations on their own.
Funding
The MITRE contribution to this work is based upon work supported by
the National Science Foundation under Grant No. 0640153. Any opinions,
findings and conclusions or recommendations expressed in this material are
those of the author(s) and do not necessarily reflect the views of the
National Science Foundation (NSF).
References
[1] Hirschman, L., et al., Overview of BioCreAtIvE task 1B:
normalized gene lists. BMC Bioinformatics, 2005. 6 Suppl 1: p. S11.
[2] Camon, E., et al., The Gene Ontology Annotation (GOA)
Database--an integrated resource of GO annotations to the UniProt
Knowledgebase. In Silico Biol, 2004. 4(1): p. 5-6.
[3] Barker, W.C., et al., The protein information resource (PIR).
Nucleic Acids Res, 2000. 28(1): p. 41-4.