Protein Interaction Sentences Sub-task 3 (ISS)

A description of this sub-task is available in pdf format.

Premise

System Input

System output

Evaluation

Tentative release dates

Training data

Test data

Data Selection

Data set format

Prediction submission format

Number of runs

Training data release


1 Premise  
In practice, protein-protein interaction information for a given pair of 
proteins might be mentioned several times throughout a full text article. 
To produce a protein interaction summary, for instance, it is useful to 
select the most relevant sentence expressing interaction information for 
a given protein pair. Natural language text passages describing a given 
interaction are also useful for human interpretation. Therefore, in this 
sub-task, the interaction sentence sub-task (ISS), we ask participants to 
provide, for each protein interaction pair, a ranked list of at most 5 evidence 
passages describing their interaction. Please refer to the ISS relevant question
page for a set of questions and answers related to this task.

 


2 System Input 
A collection of full text articles which contain protein interaction information 
curated by IntAct and MINT.

 

3 System output   
For each protein interaction pair, a ranked list of at most 5 text passages 
(each containing at most 3 sentences) describing the interaction 
has to be returned.

 

4 Evaluation

For the evaluation, a pooling method will be used, as follows: all the interaction 
evidence passages (sentences) from all the systems for each document are collected.
Then a curator will categorize these into relevant and irrelevant sentences. Pooling 
also eliminates duplicates. This way the "best sentences" don't have to be exhaustively 
pre-selected; it also means that there will be limited training data available. 
The predictions will be evaluated in terms of:

a) Percentage of interaction relevant sentences with respect to the 
     total number of predicted (submitted) sentences.

b) Mean reciprocal rank (MRR) of the ranked list of interaction evidence passages 
     with respect to the manually chosen best interaction sentence.
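
The two measures above can be sketched in Python as follows. This is only an 
illustration, not the organizers' evaluation script: the function names, the 
input layout (one relevance judgement per submitted passage; the 1-based rank 
of the curator's best sentence per interaction pair, or None when it was not 
returned) and the choice to score a missing best sentence as 0 are assumptions.

```python
from typing import Optional, Sequence


def relevant_fraction(judgements: Sequence[bool]) -> float:
    """Measure (a): share of submitted passages judged relevant."""
    if not judgements:
        return 0.0
    return sum(judgements) / len(judgements)


def mean_reciprocal_rank(best_ranks: Sequence[Optional[int]]) -> float:
    """Measure (b): mean of 1/rank of the best passage per interaction pair.

    Assumption: a pair whose best passage was not returned (None)
    contributes 0 to the mean.
    """
    if not best_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in best_ranks if rank is not None)
    return total / len(best_ranks)
```

For example, if the best sentence is found at ranks 1, 2 and 4 for three pairs 
and is missing for a fourth, the sketch averages 1, 1/2, 1/4 and 0.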


5 Tentative release dates
The test set of this subtask will be released after the due date of the result submission of PPI subtask 1 
(detection of protein interaction curation relevant articles). 

Training set PPI subtasks 1-4:     July and September 2006
Test set PPI subtask ISS:          October 15, 2006
Test set prediction due for ISS:   October 22, 2006


6 Training data 

Both MINT and IntAct are producing a collection of manually extracted sentences 
derived from full text articles which describe their interactions (contain 
evidence of the interaction). These sentences were mainly extracted by cutting 
and pasting the phrase in the text which indicates that the interaction occurs. 
In principle the text should not be altered in any way and only obvious phrases 
should be provided. This means that an interaction which is only apparent from a 
table or figure should not be entered, although figure and table legends 
may be used. Phrases, full sentences and sentence passages can all be 
included. There are cases where several alternative evidence sentences are 
provided for a given interaction.

Examples of extracted interaction evidence phrases:

(1) Human Mitochondrial DNA Polymerase {gamma} Forms a Heterotrimer
(2) Using a biochemical approach to search for such co-regulatory factors, we 
    identified hGCN5, TRRAP, and hMSH2/6 as BRCA1-interacting proteins. 
(3) SRp30c specifically binds to both ESE3 and ESE4

ADDITIONAL RESOURCES:
A collection of additional resources will also be provided, consisting of 
interaction-related sentences from: 
(1) the Anne Lise Veuthey corpus, 
(2) the Christine Brun corpus, 
(3) the Prodisen interaction corpus, and 
(4) the GeneRif interaction sentences. 
Note that these interaction evidence sentences do not necessarily follow the 
annotation criteria of MINT and IntAct, so they may also contain genetic 
interactions, which are not exhaustively annotated by these two databases.


7 Test data 
The interaction databases MINT and IntAct are holding back a set of curated 
records to produce the test set for the BioCreAtIvE contest. The test data set 
will consist of articles belonging to this collection of previously manually 
curated records. For those articles the database curators have manually 
extracted the evidence passages. These passages consist of phrases, sentences 
and sentence passages which indicate the actual protein interaction. In 
principle, this evidence passage should be the most obvious text segment 
indicating the interaction according to the database curator.
The ranked list of predicted evidence text passages returned by the participants
is also compared to the manually curated ones by the expert annotators.
A total of 300 publications are expected to be part of this test set collection.
Note that the test set of this task will be released after the due date for 
subtask 1 (detection of protein interaction curation relevant articles).



8 Data Selection
Note that the text passages referring to the protein interactions are not 
restricted to abstracts but in practice can be derived from any part of the 
full text article, including figure and table legends.


9 Submission format
Each run of predictions has to be provided as a single file in an XML-like 
format, containing all the submitted interaction evidence sentences for the 
interaction pairs extracted from an article.

A sample prediction entry for the correct submission format is shown below: 


<ENTRY>
<PPI_SUB_TASK_ID> BC2_PPI_ISS </PPI_SUB_TASK_ID>
<TEAM_ID> T1_BC2_PPI </TEAM_ID>
<RUN_NR> 1 </RUN_NR>
<PMID> 10924507 </PMID>
<INTERACTION_PAIR>
<INTERACTOR_1> DHX9_HUMAN </INTERACTOR_1>
<INTERACTOR_2> NXF1_HUMAN </INTERACTOR_2>
</INTERACTION_PAIR>
<SENTENCE_RANK> 1 </SENTENCE_RANK>
<SENTENCE_PASSAGE>
Specific Interaction between RNA Helicase A and Tap, Two Cellular Proteins That Bind to the Constitutive Transport Element of Type D Retrovirus
</SENTENCE_PASSAGE>
</ENTRY>

Where:
1) ENTRY:            corresponds to a single evidence passage prediction
2) PPI_SUB_TASK_ID:  the identifier of the interaction sentence sub-task, i.e. BC2_PPI_ISS
3) TEAM_ID:          the identifier of the team (as provided to each participating team)
4) RUN_NR:           the number of the submission run (maximum of three runs)
5) PMID:             corresponds to the PubMed identifier of the article, e.g. 10924507
6) INTERACTOR_1:     corresponds to the UniProt ID (or accession number) of the interactor protein 1,
                     e.g. DHX9_HUMAN (ATP-dependent RNA helicase A, DHX9)
7) INTERACTOR_2:     corresponds to the UniProt ID (or accession number) of the interactor protein 2,
                     e.g. NXF1_HUMAN (Tip-associating protein, TAP)
8) SENTENCE_RANK:    corresponds to the sentence rank (1 to 5)
9) SENTENCE_PASSAGE: corresponds to the actual interaction evidence text passage (maximum 3 sentences).
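
As an illustration of the fields listed above, the following Python sketch 
writes one ENTRY in the layout of the sample. The helper name format_entry and 
its argument names are hypothetical; only the tags and the limits on RUN_NR 
(1 to 3) and SENTENCE_RANK (1 to 5) come from this page. The 3-sentence limit 
is not checked here, since that would require assuming a sentence splitter.

```python
ENTRY_TEMPLATE = """<ENTRY>
<PPI_SUB_TASK_ID> BC2_PPI_ISS </PPI_SUB_TASK_ID>
<TEAM_ID> {team_id} </TEAM_ID>
<RUN_NR> {run_nr} </RUN_NR>
<PMID> {pmid} </PMID>
<INTERACTION_PAIR>
<INTERACTOR_1> {interactor_1} </INTERACTOR_1>
<INTERACTOR_2> {interactor_2} </INTERACTOR_2>
</INTERACTION_PAIR>
<SENTENCE_RANK> {rank} </SENTENCE_RANK>
<SENTENCE_PASSAGE>
{passage}
</SENTENCE_PASSAGE>
</ENTRY>
"""


def format_entry(team_id, run_nr, pmid, interactor_1, interactor_2, rank, passage):
    """Return one prediction entry as text (hypothetical helper)."""
    if not 1 <= run_nr <= 3:
        raise ValueError("RUN_NR must be 1, 2 or 3")
    if not 1 <= rank <= 5:
        raise ValueError("SENTENCE_RANK must be between 1 and 5")
    # The passage is written verbatim (only surrounding whitespace is
    # trimmed), because it has to match the article text.
    return ENTRY_TEMPLATE.format(
        team_id=team_id,
        run_nr=run_nr,
        pmid=pmid,
        interactor_1=interactor_1,
        interactor_2=interactor_2,
        rank=rank,
        passage=passage.strip(),
    )
```

Calling format_entry("T1_BC2_PPI", 1, "10924507", "DHX9_HUMAN", "NXF1_HUMAN", 1, 
passage) with the sample passage as the last argument reproduces the sample entry; 
all entries of a run would then be concatenated into a single file.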


NOTE 1: the predicted interaction sentences must come from the test set HTML 
full text articles, meaning that you should ensure that the predicted text segments 
can be directly matched to the HTML documents! We cannot guarantee evaluation 
of predictions which cannot be matched directly to the full text HTML articles.
NOTE 2: the XML-like tags are case sensitive, meaning that they must be written in capital letters as shown in the example.
NOTE 3: be sure to use your own team ID, which was sent to each team's contact e-mail address
after registration.
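
One way to check the requirement in NOTE 1 before submitting is sketched below 
in Python, using only the standard library. How the organizers actually match 
passages to the HTML is not specified on this page, so the approach here 
(extract the visible text, collapse whitespace, test for a substring) and the 
name passage_in_html are assumptions.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def _normalize(text):
    # Collapse all runs of whitespace (including line breaks) to one space.
    return re.sub(r"\s+", " ", text).strip()


def passage_in_html(passage, html):
    """True if the passage occurs in the article text (assumed matching rule)."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    visible = _normalize("".join(collector.chunks))
    return _normalize(passage) in visible
```

Text nodes are joined without a separator so that inline markup such as italics 
inside a sentence does not break the match; a passage spanning two block 
elements with no whitespace between them would therefore not be found.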


Be sure that your prediction is compliant with this simple output format.
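
A simple self-check of a prediction file could look like the Python sketch 
below. It is not an official validator: the function names parse_entries and 
check_entry are hypothetical, and the checks cover only what this page states 
(upper-case tags, sub-task identifier, run number 1 to 3, sentence rank 1 to 5).

```python
import re

ENTRY_RE = re.compile(r"<ENTRY>(.*?)</ENTRY>", re.DOTALL)
FIELDS = (
    "PPI_SUB_TASK_ID", "TEAM_ID", "RUN_NR", "PMID",
    "INTERACTOR_1", "INTERACTOR_2", "SENTENCE_RANK", "SENTENCE_PASSAGE",
)


def parse_entries(text):
    """Return one dict per <ENTRY>; matching is case sensitive."""
    entries = []
    for match in ENTRY_RE.finditer(text):
        body = match.group(1)
        entry = {}
        for field in FIELDS:
            found = re.search(rf"<{field}>(.*?)</{field}>", body, re.DOTALL)
            entry[field] = found.group(1).strip() if found else None
        entries.append(entry)
    return entries


def _in_range(value, low, high):
    return value.isdigit() and low <= int(value) <= high


def check_entry(entry):
    """Return a list of problems found in one parsed entry (empty if none)."""
    problems = [f"missing {field}" for field in FIELDS if not entry.get(field)]
    task = entry.get("PPI_SUB_TASK_ID")
    if task and task != "BC2_PPI_ISS":
        problems.append("PPI_SUB_TASK_ID should be BC2_PPI_ISS")
    run = entry.get("RUN_NR")
    if run and not _in_range(run, 1, 3):
        problems.append("RUN_NR must be an integer from 1 to 3")
    rank = entry.get("SENTENCE_RANK")
    if rank and not _in_range(rank, 1, 5):
        problems.append("SENTENCE_RANK must be an integer from 1 to 5")
    return problems
```

Running parse_entries on the file contents and then check_entry on each result 
gives a list of problems per entry; a file whose tags are written in lower case 
yields no entries at all, which signals a violation of NOTE 2.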



10 Number of runs
For this sub-task, each participating team can submit up to three runs.



11 Useful Links
1) IntAct
2) MINT
3) MI ontology
4) MI ontology browser
5) PSI-MI 2.5 format
6) UniProt
7) UniProt download
8) HTML text extraction tools include: HTML2Text, Tidy, Simpy, HTMLParser, NekoHTML, CyberNeko,
etc.



12 Training data release
People who intend to participate in the protein-protein interaction (PPI) task of the 
second BioCreAtIvE challenge should send the following information:

1) Team contact e-mail (one per team).
2) Tentative list of participant team members (name and e-mail).
3) Institutions.

to: mkrallinger@cnio.es

Last update of this page: 21 September 2006
