Protein Interaction Article Sub-task 1 (IAS)

A description of this sub-task is available in pdf format.

Premise

System Input

System output

Evaluation

Tentative release dates

Training data

Test data

Data Selection

Data set format

Prediction submission format

Number of runs

Training data release


1 Premise  
In practice, before deciding to curate an article for protein interactions, 
or detecting protein interaction descriptions in sentences, it is necessary 
to identify those articles which actually contain relevant information 
relative to protein interaction annotation. Although this aspect is relevant 
for subsequent steps, it has often been neglected by previously published 
protein-interaction extraction systems. Thus this sub-task will be concerned 
with the classification of whether a given article contains protein 
interaction information. (You can check out the PPI-IAS relevant Q and A page).

 


2 System Input 
The participants will be given a collection of PubMed article abstracts 
which have been analyzed whether they are useful to derive protein 
interaction annotations according to the 
curation standards used by the IntAct and MINT databases. 
Although the curation was done using the corresponding full text articles, 
we will only be able to provide PubMed abstracts for this sub-task. 
This reflects more the actual textual data which is freely available, as 
in practice there are still serious limitations to obtain large collections 
of updated full text articles (for many curated biomedical journals). 
It is also a way to explore the limits of abstract-based detection of 
annotation relevant articles.

 

3 System output   

For a given collection of articles (abstracts), participants will need to 
return a ranked list of articles (identifiers) based on their relevance 
for protein interaction annotation.

 

4 Evaluation

Although in principle the possibility exists to consider additional 
evaluation metrics (e.g. mean average precision, utility measure, 
confidence weighted score, etc.), we are thinking of evaluating the 
participating systems using the AROC (area under the receiver operating 
characteristic curve) measure based on the ranked predicted collections.


5 Tentative release dates

To avoid the possibility that participants will exploit the overlap between 
test set articles used in subtask 1 and the subtasks 2-4 for detecting the 
protein interaction annotation relevant articles, we will have two test set 
release and result submission dates for the protein interaction task.

First the test set for the interaction article detection subtask will be 
released. After receiving  the results of this sub-task, the test set for 
the other subtasks will be released.

Tentative time schedule:

Training set PPI subtasks 1-4:	  June 2006
Test set PPI IAS subtask :         CHANGED TO October 8, 2006 
PPI IAS prediction due :             October 13, 2006 
Test set PPI subtasks 2-4:         October 15, 2006



6 Training data 
The training data was derived from the content of the IntAct and MINT 
databases. The data files of both databases are freely accessible for 
download and are compliant with the HUPO PSI Molecular Interaction Format. 
(We recommend that you should have a detailed look at this format, 
especially for the other subtasks).

The training collections for this subtask (referred as PPI-IAS) are of three 
basic types:

1) True Positives (TP): collection of PubMed article abstracts which are 
relevant for protein interaction curation in the sense of the annotation 
process and guidelines used by the MINT and IntAct databases (refer to 
the annotation manuals of MINT and IntAct). This means that the articles 
corresponding to these abstracts must contain information which meets the 
curation standards used to extract protein interaction information. Both 
databases have basically the same annotation standards and formats. Thus 
they agree on the curation model, which has been assured using a curator 
agreement study on 5 full text articles. (Note that interaction types 
corresponding to genetic interactions are currently not curated by MINT 
and IntAct!)
2) True Negatives (TN): consists in articles which have been classified 
by domain expert curators from these two databases as not relevant for 
protein interaction curation after extensive full text analysis.
3) Likely True Positives (TP*):  consists of a collection of PubMed 
identifiers of articles  which have been used for protein interaction 
annotation by other interaction databases (namely
BIND, HPRD, MPACT and GRID). Note that this additional collection is a 
NOISY data set and thus not part of the ordinary TP collection, as these 
databases might have different annotation standards compared to MINT and 
IntAct (e.g. regarding the curation of genetic interactions). Thus be 
careful when using this additional collection!

Note that in principle there are no restrictions to use additional data 
collections for purpose of system training. We recommend not to use for 
training purpose articles with more than 20 annotated interactions, as in 
the test set no large scale interaction experiment articles will be used.



7 Test data 
The test set collection will consist of a collection of PubMed article 
abstracts in a format compliant with the training collection format. 
We expect the test collection will have over 600 articles in total 
(including TP and TN) cases. The participating systems will have to 
return, given this test collection, two non-overlapping collections of 
ranked lists, one for the articles predicted as being relevant for 
curation, and one of the articles predicted to be non-relevant (see system 
output section). The predictions will then be compared to previously 
analyzed manual classification (according to whether they are relevant 
for protein interaction curation), done by  MINT and IntAct database 
curators. The actual classification is hold back for the test set.


8 Data Selection

Given the exhaustive journal curation strategy used by MINT and IntAct, 
there should be no bias of initial article selection. Note that these 
databases are not organism specific, so they curate proteins from a number 
of  model organisms.



9 Data set format
The training data for this subtask will be provided in an XML-like format, 
which is easy to parse. 

For a sample entry, please see the example below:

<ENTRY>
<CURATION_RELEVANCE>
1
</CURATION_RELEVANCE>

<PPI_DATABASE>
MINT
</PPI_DATABASE>

<PMID>
10022833
</PMID>
<TITLE>
Socs1 binds to multiple signalling proteins and suppresses steel
factor-dependent proliferation.
</TITLE>
<SOURCE>

EMBO J. 1999 Feb 15;18(4):904-15.
</SOURCE>
<ABSTRACT>

We have identified Socs1 as a downstream component of the Kit receptor tyrosine kinase signalling pathway. We show that 
the expression of Socs1 mRNA is rapidly increased in primary bone marrow-derived mast cells following exposure to Steel 
factor, and Socs1 inducibly binds to the Kit receptor tyrosine kinase via its Src homology 2 (SH2) domain. Previous 
studies have shown that Socs1 suppresses cytokine-mediated differentiation in M1 cells inhibiting Janus family kinases. 
In contrast, constitutive expression of Socs1 suppresses the mitogenic potential of Kit while maintaining Steel 
factor-dependent cell survival signals. Unlike Janus kinases, Socs1 does not inhibit the catalytic activity of the 
Kit tyrosine kinase. In order to define the mechanism by which Socs1-mediated suppression of Kit-dependent mitogenesis 
occurs, we demonstrate that Socs1 binds to the signalling proteins Grb-2 and the Rho-family guanine nucleotide exchange 
factors Vav. We show that Grb2 binds Socs1 via its SH3 domains to putative diproline determinants located in the 
N-terminus of Socs1, and Socs1 binds to the N-terminal regulatory region of Vav. These data suggest that Socs1 is an 
inducible switch which modulates proliferative signals in favour of cell survival signals and functions as an adaptor 
protein in receptor tyrosine kinase signalling pathways.
</ABSTRACT>
</ENTRY>

The curation relevance (CURATION_RELEVANCE  tag) corresponds to whether the 
article is useful for protein interaction curation (1) or is not useful for 
protein interaction curation (0).

In case of the test set, we will provide the data basically in the same format 
as the training data, but the following differences in two fields: 


<CURATION_RELEVANCE>

NONE
</CURATION_RELEVANCE>
<PPI_DATABASE>

NONE

</PPI_DATABASE>

Where NONE corresponds to the fields which are held back for the  test set.


10 Prediction submission format

Regarding the output format, it should consist of tab separated columns with the following information:

<team_id>	<run_id>	<sub_task_id>	<type>	<rank>	<pmid>

where 
team_id:		corresponds to the assigned team identifier (provided to each team), e.g.  T1_BC2_PPI

run_id:	     		  corresponds to the run id (max. of three runs per team), e.g. 1
sub_task_id:  	      the identifier of this subtask, i.e. 'BC2_PPI_IAS'
type: 			  prediction of relevance for protein-protein interaction: 'T' or 'F'
rank: 			  corresponds to the rank of the prediction, must start with 1, up to the total number of 
			     articles in the returned subset.
pmid:	 		  the PubMed identifier of the prediction.

Be sure that your prediction is compliant with this simple output format.




11 Number of runs

For this sub-task, each participating team can submit up to three runs .




12 IAS test set prediction check list

For the test set prediction check list refer to the IAS checklist
page.



13 Training data release
People who intend to participate at the protein-protein interaction (PPI) task of the second BioCreAtIvE challenge should send the following information:
1) Team contact e-mail (one per team).
2) Tentative list of participant team members (name and e-mail).
3) Institutions.

to: mkrallinger@cnio.es

Last time schedule update: 06 October 2006.

[up][home]