Gene Mention Tagging
The Gene Mention Tagging task is concerned with the named entity extraction
of gene and gene product mentions in text.
Premise
Systems will be required to return the start and end indices
corresponding to all the genes and gene products mentioned in a given
MEDLINE sentence. This named entity task is a crucial first step for
information extraction of relationships between genes and gene products.
System Input
The input file will consist of ASCII sentences, one per
line. Each sentence will be preceded on the same line by a sentence
identifier.
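As a rough illustration of reading this input, the sketch below splits one line into its identifier and sentence. The text above does not name the separator, so the code assumes the identifier is the first whitespace-delimited token; the function name parse_input_line and the identifier S0001 are hypothetical.

```python
def parse_input_line(line):
    """Split one input line into (sentence_id, sentence_text).

    Assumption: the sentence identifier is the first whitespace-delimited
    token on the line and the rest of the line is the sentence.
    """
    parts = line.strip().split(None, 1)
    if not parts:
        raise ValueError("empty input line")
    sentence_id = parts[0]
    # A line holding only an identifier yields an empty sentence.
    text = parts[1] if len(parts) > 1 else ""
    return sentence_id, text


if __name__ == "__main__":
    # Hypothetical identifier and sentence, for illustration only.
    print(parse_input_line("S0001 The p53 gene is mutated.\n"))
```

Stripping surrounding whitespace is harmless here because the offsets used in the output count only non-whitespace characters.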
System Output
Each system must output an ASCII list of reported gene
name mentions, one per line, and formatted as:
sentence-identifier-1|start-offset-1 end-offset-1|optional text...
sentence-identifier-1|start-offset-2 end-offset-2|optional text...
sentence-identifier-1|start-offset-3 end-offset-3|optional text...
sentence-identifier-2|start-offset-1 end-offset-1|optional text...
sentence-identifier-3|start-offset-1 end-offset-1|optional text...
.
.
.
The sentence-identifier is that of the sentence containing the mention. Multiple
mentions from the same sentence should appear on separate lines. A
sentence is not required to have any mentions. The start-offset is the
number of non-whitespace characters in the sentence preceding the first
character of the mention, and the end-offset is the number of
non-whitespace characters in the sentence preceding the last character
of the mention. If you put anything after the vertical bar following the
end-offset, it will be ignored by the evaluator.
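The offset definition can be sketched as follows. This is a minimal illustration, assuming the system already knows each mention's ordinary 0-based character positions in the sentence (with the end position pointing at the last character of the mention). The function names and the identifier S0001 are hypothetical.

```python
def non_whitespace_before(sentence, index):
    """Count the non-whitespace characters in sentence[:index]."""
    return sum(1 for ch in sentence[:index] if not ch.isspace())


def mention_offsets(sentence, char_start, char_end):
    """Convert 0-based character positions to task offsets.

    char_end is the position of the LAST character of the mention
    (inclusive), matching the end-offset definition in the text.
    """
    return (non_whitespace_before(sentence, char_start),
            non_whitespace_before(sentence, char_end))


def format_mention(sentence_id, sentence, char_start, char_end):
    """Build one output line: identifier|start end|optional text."""
    start, end = mention_offsets(sentence, char_start, char_end)
    mention_text = sentence[char_start:char_end + 1]
    return f"{sentence_id}|{start} {end}|{mention_text}"


if __name__ == "__main__":
    sentence = "The p53 gene is mutated."
    # "p53" occupies character positions 4..6 in this sentence.
    print(format_mention("S0001", sentence, 4, 6))  # S0001|3 5|p53
```

In the example, three non-whitespace characters ("The") precede the "p" of "p53", and five precede its final "3", giving the offsets 3 and 5.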
Evaluation
System performance will be scored automatically by how well
the generated gene/gene product list corresponds to one generated by
human annotators. Acceptable alternatives to the gold standard names,
also generated by human annotators, will count as true positives.
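The role of acceptable alternatives can be illustrated with a simplified scoring sketch. It is not the official evaluator. It assumes exact matching on (sentence identifier, start offset, end offset), that each alternative is tied to one gold standard mention, and that each gold mention can be credited only once. The text does not say which metrics are reported; precision, recall and F-measure are assumed here as the usual choice for this kind of task.

```python
def score(predicted, gold, alternatives):
    """Score predictions against gold mentions and their alternatives.

    predicted, gold: sets of (sentence_id, start, end) tuples.
    alternatives: dict mapping a gold mention to a set of acceptable
    alternative mentions (an assumed data layout, for illustration).
    """
    # Map every acceptable span back to the gold mention it stands for.
    accepted = {g: g for g in gold}
    for gold_mention, alts in alternatives.items():
        for alt in alts:
            accepted[alt] = gold_mention

    matched = set()
    tp = fp = 0
    for mention in sorted(predicted):
        gold_mention = accepted.get(mention)
        if gold_mention is not None and gold_mention not in matched:
            matched.add(gold_mention)  # credit each gold mention once
            tp += 1
        else:
            fp += 1
    fn = len(gold) - len(matched)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f": f_measure}
```

With this layout, a prediction that hits an alternative span counts exactly as if it had hit the gold mention itself.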
Data Selection and Annotation: Sentences were selected at random from
MEDLINE; half of the sentences are likely to contain genes and gene
products, based on similarity to sentences with known gene names. A small
group of annotators trained in biochemistry, molecular biology and
genetics searched through each sentence, identifying mentions of genes
and gene products, along with acceptable alternatives.
To date, 20,000 sentences have been annotated. 15,000 sentences were
used previously in BioCreative, and will be released as training data.
Gene Mention (GM) Task registration
To receive the test data, we request that you send the following
information to: biocreative-organizer@lists.sourceforge.net
A) E-mail contact
B) Phone contact
C) List of team members and their institutions
D) Tasks which you plan to participate in
If you have already sent this information for the PPI
task and this is the SAME TEAM, please say so in your message.
We will acknowledge receipt and will issue
a unique USER ID which will be used to
identify results from different teams.
Please register BEFORE OCTOBER 15, so we can send you the data!
On Oct. 15, we will notify the email
contact with information about how to get the
test data.
If you do not hear from us on Oct. 15, please
contact the organizers.
Please use a contact email address capable
of receiving zipped file attachments
(.zip/.gz) of at least 500 KB, as this
address will be our primary means of contacting participants.
By requesting the test data, you also agree to the
guidelines for participation/submission.
Submission Guidelines
Participants are requested to halt all system development after
they obtain the test data.
Participants should email their GM submissions to the mailing list
biocreative-gm-sub-2006@lists.sourceforge.net
as a .txt attachment.
These are due Oct 15 (PPI subtask 1) or Oct 22 (all other tasks/subtasks).
By submitting results, the groups agree to
have their submission made public in an anonymous form at the end of the
evaluation (e.g. as was done with the BioCreAtIvE 1 Task 2 submissions).
By requesting the test data, you are committed to the submission of
results for that task or sub-task. If, for some reason,
after receiving the test data, you are unable to submit results for a
given task or subtask, you should notify the organizers promptly, and
provide an email explaining why you have been unable to
submit; we also ask that you provide a commitment to delete your copy of
the test data.
System Description
You must submit a short system description questionnaire (1-2 pages)
by Oct 31; it must be submitted to receive scores. The description should
give an overview of the approach used; please follow the template below.
If you wish, the description may be anonymous; the description will be
linked by user ID to the results for the tasks, to be distributed at the
workshop.
Groups will receive their scores and the gold standard data (by mid Dec)
at the contact email address they provided. We will provide each group
with its scores only - the full set of results will be made available
at the BioCreAtIvE workshop and in the associated Proceedings.
Groups are requested not to publish results of their system
on the gold standard data until after the workshop.
Submission File Naming
Naming your submission files in the same format helps us keep
everything organized.
The format is TeamId_BC2_Task(_Subtask)_Run.txt.
For example, Team 60 submitting 3 runs
(the max for any task/subtask) to the
GM task:
T60_BC2_GM_1.txt
T60_BC2_GM_2.txt
T60_BC2_GM_3.txt
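A small helper can generate names in this format. This is a sketch, assuming the team identifier is "T" followed by the team number, as in the example above. The function name and the form of the optional subtask label are hypothetical.

```python
MAX_RUNS = 3  # the maximum number of runs for any task/subtask


def submission_filename(team_number, task, run, subtask=None):
    """Build a name of the form TeamId_BC2_Task(_Subtask)_Run.txt."""
    if not 1 <= run <= MAX_RUNS:
        raise ValueError(f"run must be between 1 and {MAX_RUNS}")
    parts = [f"T{team_number}", "BC2", task]
    if subtask is not None:
        parts.append(str(subtask))  # subtask label format is an assumption
    parts.append(str(run))
    return "_".join(parts) + ".txt"


if __name__ == "__main__":
    for run in range(1, MAX_RUNS + 1):
        print(submission_filename(60, "GM", run))
```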
System Description Template/Questionnaire
Please note that any information provided will be made publicly
available, so if you wish to remain anonymous
you do not need to be specific about
proprietary system components (e.g. simply note things like "proprietary
gene lexicon"). However, the research community benefits when
participants are as explicit as possible in these descriptions, and
complete disclosure is encouraged. If some information only pertains to
a particular run, please note this.
1- Team identifier:.......
2- Which task does this describe (GN, GM or PPI):........
3- Please identify/describe any machine learning techniques used:..........
4- Please identify/describe any NLP techniques/components used:........
5- Please identify/describe any external (marked up text) training data
used:.........
6- Please identify/describe any external lexical resources (terminology
lists) used:........
7- Please describe any rule sets used:.........
8- If your system interacts with or uses data from any biological
database(s), please describe:..........
9- Please identify/describe any other relevant resources used to
train/develop your system:.........
10- Please describe the general data flow in your system:..........
11- Other information of interest:.........
GM Test Set Submission Format
We want to remind participants in the GM task that you are responsible for
submitting result data in a valid format, as described in the file README.GM.
In order to verify that your result data is valid, you should run your system
on the training data and evaluate the output with the perl script alt_eval.perl
(in the train subdirectory, described in the file train/README).
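As a quick sanity check before running the evaluation script, and not a replacement for it, a result file can be screened line by line. The sketch below assumes the line layout identifier|start end|optional text, that identifiers contain no whitespace or vertical bars, and that the trailing bar and text may be omitted. None of these details is confirmed by README.GM here, and the function names are hypothetical.

```python
import re

# identifier|start end, optionally followed by |free text
LINE_RE = re.compile(
    r"^(?P<sid>[^|\s]+)\|(?P<start>\d+) (?P<end>\d+)(?:\|.*)?$"
)


def check_line(line):
    """Return None if the line looks valid, else a short error message."""
    line = line.rstrip("\r\n")
    if not line.isascii():
        return "non-ASCII character"
    match = LINE_RE.match(line)
    if match is None:
        return "does not match identifier|start end|text"
    if int(match["start"]) > int(match["end"]):
        return "start-offset is greater than end-offset"
    return None


def check_file(path):
    """Collect (line_number, message) pairs for every suspect line."""
    problems = []
    with open(path, encoding="ascii", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            message = check_line(line)
            if message is not None:
                problems.append((number, message))
    return problems
```

A check like this only catches formatting slips; whether the offsets point at the right characters still has to be confirmed with the evaluation script on the training data.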
GM Test Set Sentence Identifiers
Additionally, systems should not make any assumptions about the contents or
meaning of sentence identifiers in the test set. When you receive test data
for the final run, sentence identifiers will be randomly assigned strings.
We do not plan to release source information for the test sentences until
after the evaluation is complete. (This statement is not meant to imply any
other limits on resources or methods that may be used.)
Last update of this page: 12 October 2006