BioCreAtIvE - Critical Assessment for Information Extraction in Biology

Gene Mention Tagging

The Gene Mention Tagging task concerns the named entity extraction of gene and gene product mentions in text.

Systems will be required to return the start and end indices corresponding to all the genes and gene products mentioned in a given MEDLINE sentence. This named entity task is a crucial first step for information extraction of relationships between genes and gene products.

System Input
The input file will consist of ASCII sentences, one per line. Each sentence will be preceded on the same line by a sentence identifier.

System Output
Each system must output an ASCII list of reported gene name mentions, one per line, formatted as:

sentence-identifier-1|start-offset-1 end-offset-1|optional text...
sentence-identifier-1|start-offset-2 end-offset-2|optional text...
sentence-identifier-1|start-offset-3 end-offset-3|optional text...
sentence-identifier-2|start-offset-1 end-offset-1|optional text...
sentence-identifier-3|start-offset-1 end-offset-1|optional text...

The sentence-identifier is that of the sentence containing the mention. Multiple mentions from the same sentence should appear on separate lines. A sentence is not required to have any mentions. The start-offset is the number of non-whitespace characters in the sentence preceding the first character of the mention, and the end-offset is the number of non-whitespace characters in the sentence preceding the last character of the mention. Anything you put after the vertical bar following the end-offset will be ignored by the evaluator.
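
Because the offsets count only non-whitespace characters, they differ from ordinary string indices. The following Python sketch shows one way to convert an ordinary inclusive character span into the offsets defined above and to format an output line. The function names, the sentence identifier "S1" and the example sentence are illustrative assumptions and are not part of the task distribution.

```python
from __future__ import annotations


def mention_offsets(sentence: str, char_start: int, char_end: int) -> tuple[int, int]:
    """Convert inclusive character indices into whitespace-free offsets.

    char_start and char_end are ordinary 0-based string indices of the
    first and last character of the mention.
    """
    if not (0 <= char_start <= char_end < len(sentence)):
        raise ValueError("mention span lies outside the sentence")
    if sentence[char_start].isspace() or sentence[char_end].isspace():
        raise ValueError("a mention must start and end on a non-whitespace character")
    # Count the non-whitespace characters that precede each boundary character.
    start = sum(1 for c in sentence[:char_start] if not c.isspace())
    end = sum(1 for c in sentence[:char_end] if not c.isspace())
    return start, end


def format_mention(sentence_id: str, sentence: str, char_start: int, char_end: int) -> str:
    """Build one output line: identifier|start end|optional text."""
    start, end = mention_offsets(sentence, char_start, char_end)
    # The mention text after the second bar is optional and ignored by the evaluator.
    return f"{sentence_id}|{start} {end}|{sentence[char_start:char_end + 1]}"


if __name__ == "__main__":
    example = "The p53 protein binds DNA."  # made-up sentence, for illustration only
    print(format_mention("S1", example, 4, 14))  # prints: S1|3 12|p53 protein
```

In the example, three non-whitespace characters ("The") precede the "p" of "p53", and twelve precede the final "n" of "protein", giving the offsets 3 and 12.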

System performance will be scored automatically by how well the generated gene/gene product list corresponds to one generated by human annotators. Acceptable alternatives to the gold standard names, also generated by human annotators, will count as true positives.
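
As a rough illustration of how acceptable alternatives can count as true positives, the Python sketch below treats each annotated mention as a group holding the gold span plus its alternatives, and credits at most one prediction per group. This is only a sketch under stated assumptions: the exact-match rule, the one-credit-per-group rule and the precision/recall/F1 summary are assumptions for illustration, and the official scoring is whatever the alt_eval.perl script distributed with the training data implements.

```python
from typing import Dict, Iterable, List, Set, Tuple

Span = Tuple[str, int, int]  # (sentence identifier, start offset, end offset)


def score(predicted: Iterable[Span], gold_groups: List[Set[Span]]) -> Dict[str, float]:
    """Score predictions against gold mentions and their alternatives.

    gold_groups holds one set per annotated mention: the gold span together
    with any acceptable alternative spans (an assumed data layout).
    """
    matched = [False] * len(gold_groups)
    tp = fp = 0
    for span in sorted(set(predicted)):  # duplicate predictions are collapsed
        for i, group in enumerate(gold_groups):
            if not matched[i] and span in group:
                matched[i] = True  # each annotated mention is credited once
                tp += 1
                break
        else:
            fp += 1  # no unmatched gold mention accepts this span
    fn = matched.count(False)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    gold = [{("S1", 3, 12), ("S1", 3, 5)},  # gold span plus one alternative
            {("S2", 0, 3)}]
    system = [("S1", 3, 5), ("S3", 1, 2)]
    print(score(system, gold))
```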
Data Selection and Annotation: Sentences were selected at random from MEDLINE; half of the sentences are likely to contain genes and gene products, based on their similarity to sentences with known gene names. A small group of annotators trained in biochemistry, molecular biology and genetics searched through each sentence, identifying mentions of genes and gene products, along with acceptable alternatives.
To date 20,000 sentences have been annotated. 15,000 sentences were used previously in BioCreative, and will be released as training data.

Gene Mention (GM) Task registration
To receive the test data, we request that you send the following information to:

A) E-mail contact
B) Phone contact
C) List of team members and their institutions
D) Tasks which you plan to participate in

If you have already sent this information for the PPI task and this is the SAME TEAM, please note this in your message.
We will acknowledge receipt and will issue a unique USER ID which will be used to identify results from different teams.
Please register BEFORE OCTOBER 15, so we can send you the data!
On Oct. 15, we will notify the email contact with information about how to get the test data.
If you do not hear from us on Oct. 15, please contact the organizers.
Please use a contact email address capable of receiving zipped file attachments (.zip/.gz) of at least 500 KB, as this address will be our primary means of contacting participants.
By requesting the test data, you also agree to the guidelines for participation/submission.

Submission Guidelines
Participants are requested to halt all system development after they obtain the test data.
Participants should email their GM submissions to the mailing list:
as a .txt attachment.
These are due Oct 15 (PPI subtask 1) or Oct 22 (all other tasks/subtasks).
By submitting results, the groups agree to have their submission made public in an anonymous form at the end of the evaluation (e.g. as was done with the BioCreAtIvE 1 Task 2 submissions).
By requesting the test data, you are committed to the submission of results for that task or sub-task. If, for some reason, after receiving the test data, you are unable to submit results for a given task or subtask, you should notify the organizers promptly, and provide an email explaining why you have been unable to submit; we also ask that you provide a commitment to delete your copy of the test data.

System Description
You must submit a short system description questionnaire (1-2 pages) by Oct 31 in order to receive scores. The description should give an overview of the approach used; please follow the template below. If you wish, the description may be anonymous. The description will be linked by user ID to the results for the tasks, to be distributed at the workshop.
Groups will receive their scores and the gold standard data (by mid Dec) at the contact email address they provided. We will provide each group with its scores only - the full set of results will be made available at the BioCreAtIvE workshop and in the associated Proceedings.
Groups are requested not to publish results of their system on the gold standard data until after the workshop.

Submission File Naming
Please name your submission files in the same format, so that we can keep everything much more organized.
The format is TeamId_BC2_Task(_Subtask)_Run.txt.

For example, Team 60 submitting 3 runs (the max for any task/subtask) to the GM task:

System Description Template/Questionnaire
Please note that any information provided will be made publicly available, so if you wish to remain anonymous you do not need to be specific about proprietary system components (e.g. simply note things like "proprietary gene lexicon"). However, the research community benefits when participants are as explicit as possible in these descriptions, and complete disclosure is encouraged. If some information pertains only to a particular run, please note this.

1- Team identifier:.......
2- Which task does this describe (GN, GM or PPI):........
3- Please identify/describe any machine learning techniques used:..........
4- Please identify/describe any NLP techniques/components used:........
5- Please identify/describe any external (marked up text) training data used:.........
6- Please identify/describe any external lexical resources (terminology lists) used:........
7- Please describe any rule sets used:.........
8- If your system interacts with or uses data from any biological database(s), please describe:..........
9- Please identify/describe any other relevant resources used to train/develop your system:.........
10- Please describe the general data flow in your system:..........
11- Other information of interest:.........

GM Test Set Submission Format
We want to remind participants in the GM task that they are responsible for submitting result data in a valid format, as described in the file README.GM. To verify that your result data is valid, you should run your system on the training data and evaluate the output with the Perl script alt_eval.perl (in the train subdirectory, described in the file train/README).
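
Before running the official script, a quick local sanity check of the line format may catch simple mistakes. The Python sketch below is not a substitute for alt_eval.perl or README.GM. It assumes that sentence identifiers contain no whitespace or vertical bars, that the second vertical bar may be omitted when no optional text follows, and that blank lines can be skipped; the function name and the example data are hypothetical.

```python
import re
from typing import Dict, Iterable, List

# identifier|start end, optionally followed by |free text (assumed pattern)
MENTION_LINE = re.compile(r"^([^|\s]+)\|(\d+) (\d+)(\|.*)?$")


def find_format_problems(lines: Iterable[str], sentences: Dict[str, str]) -> List[str]:
    """Return a human-readable problem report for a list of result lines.

    sentences maps each sentence identifier to its sentence text.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines are harmless and skipped
        match = MENTION_LINE.match(line)
        if match is None:
            problems.append(f"line {number}: not in 'identifier|start end|text' form")
            continue
        sentence_id = match.group(1)
        start, end = int(match.group(2)), int(match.group(3))
        if sentence_id not in sentences:
            problems.append(f"line {number}: unknown sentence identifier {sentence_id!r}")
            continue
        # Offsets index into the sentence with all whitespace removed.
        length = sum(1 for c in sentences[sentence_id] if not c.isspace())
        if start > end or end >= length:
            problems.append(f"line {number}: offsets {start} {end} out of range")
    return problems


if __name__ == "__main__":
    known = {"S1": "The p53 protein binds DNA."}  # made-up example data
    for problem in find_format_problems(["S1|3 12|p53 protein", "S1|12 3|"], known):
        print(problem)
```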

GM Test Set Sentence Identifiers
Additionally, systems should not make any assumptions about the contents or meaning of sentence identifiers in the test set. When you receive test data for the final run, sentence identifiers will be randomly assigned strings. We do not plan to release source information for the test sentences until after the evaluation is complete. (This statement is not meant to imply any other limits on resources or methods that may be used.)

Last update of this page: 12 October 2006


© by Martin Krallinger 2006