1. Is it correct that ONLY the UniProt IDs of the 2 proteins will be taken
into consideration for scoring?
Yes, that is correct (for the baseline evaluation).
2. Does the directionality of the interaction matter? In practice it does
matter, of course, but I am asking only whether "A / interaction / B" will
be considered the same as "B / interaction / A" for scoring purposes.
In many cases, the directionality is not easy to determine.
You are right that in practice the directionality might be biologically important, but in
the case of the evaluation we will not take this aspect into account for scoring purposes.
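For illustration, one simple way to make the comparison order-insensitive is to normalize
each pair to a canonical form before scoring. A minimal Python sketch (illustrative only,
not taken from the official scoring script; the UniProt IDs are just examples):

    def canonical_pair(uniprot_a, uniprot_b):
        """Return an order-independent representation of an interaction pair,
        so that (A, B) and (B, A) count as the same prediction."""
        return tuple(sorted((uniprot_a, uniprot_b)))

    # Both orderings map to the same canonical pair.
    assert canonical_pair("P04637", "Q00987") == canonical_pair("Q00987", "P04637")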
3. Will the scoring be done manually or automatically? In the latter case,
would it be possible to have the scoring scripts?
Several basic strategies could in principle be used for the actual mapping. The most
straightforward one is to match the predicted UniProt ID pairs against the annotated
UniProt ID pairs. We will provide a scoring script for this basic evaluation type using
the matrix model.
(We will also carry out additional analysis of the results.)
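As an illustration of this basic evaluation type, the following Python sketch (not the
official scoring script) compares the set of predicted UniProt ID pairs for one article
against the annotated pairs, treating each interaction as an unordered pair:

    def score_article(predicted_pairs, annotated_pairs):
        """Compute precision, recall and F-score for one article, treating
        each interaction as an unordered pair of UniProt IDs."""
        predicted = {tuple(sorted(p)) for p in predicted_pairs}
        annotated = {tuple(sorted(p)) for p in annotated_pairs}
        true_positives = len(predicted & annotated)
        precision = true_positives / len(predicted) if predicted else 0.0
        recall = true_positives / len(annotated) if annotated else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        return precision, recall, f_score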
4. Are the participating systems expected to report
an interaction more than once?
No, participating systems have to report a given interaction ONLY ONCE, and they will also
be evaluated using a non-redundant set of interactions for a given article.
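For example, a simple way to ensure that each interaction is reported only once per article
is to key predictions on the PubMed ID plus the unordered UniProt ID pair. A small Python
sketch (illustrative only; the record layout is an assumption, not the submission format):

    def deduplicate(predictions):
        """Collapse repeated predictions so that each (PubMed ID, unordered
        UniProt ID pair) combination is kept only once."""
        unique = set()
        for pmid, uniprot_a, uniprot_b in predictions:
            unique.add((pmid, tuple(sorted((uniprot_a, uniprot_b)))))
        return unique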
5. I have a question about the BioCreAtIvE PPI IPS submission format. Is it
correct to assume that a new ENTRY tag is required for each article
processed?
No, the ENTRY tag delimits each single predicted protein pair. I have prepared a baseline
evaluation script together with dummy submissions in the correct format, and I will send
the script together with the readme and data file. Note that we prefer the protein
identifier to be the UniProt ID rather than the accession number. I have attached a sample
prediction run in the correct format (together with the corresponding baseline evaluation
file).
6. The scoring script takes just one file as input, with the interactions for all
PubMed IDs included in that single file. Is that the expected format?
Yes, this is the expected format.
7. What would the impact be of not submitting the interactions in a ranked way?
Note that for the baseline we use the scoring as provided in the script, but we will also
carry out further analysis of the results in which the rank is taken into account and in
which we look at the biological meaning of the predicted pairs compared to the annotated
ones.
8. Although in general a text mining system might be expected to
be able to cope with different formats, for the scope of the evaluation
it would be sensible (in my opinion) to limit the variability of the
possible source formats. Otherwise you might end up evaluating
the quality of HTML to text conversion.
You are right, in practice text mining systems should be robust enough to cope with many
different journals so that they can be used in real scenarios. Nevertheless, from the
experience of people who do full-text mining on HTML articles, the different
journals/formats/publishers are indeed a critical issue. In the test set, less variability
in formats is to be expected, and thus a smaller bias related to format parsing should be
encountered as well.
9. Our Biology RAs are interested in whether the organisms are all the same
(e.g., yeast) or mixed, also whether the methods used are very diverse.
This is actually a very important question, and I was wondering why I had not received it before.
For the detection of the correct UniProt ID of the interaction partners, the organism source is
a crucial aspect (due to inter-organism gene name ambiguity). There will be no specific selection
of organism for the interacting proteins; this means that in principle the interaction partners
can come from any organism (although in practice certain organisms might be referred to more
often in papers simply because scientists are more interested in them, e.g. model organisms).
This makes the task certainly more difficult and challenging, but also more realistic and more
useful. We actually want the contest to promote the construction of really useful systems that
are not restricted to artificial scenarios.
10. In which format will we get the full text articles?
In the same formats as the training set articles, i.e. HTML, PDF, and full-text articles
automatically converted to plain text from HTML (HTML2TEXT) and from PDF (PDF2TEXT).
11. Will there be different / additional journals in the test set which have not been in
the training collection?
No, so if your system is able to process the training set journal format, it should work the
same on the test set articles.
12. Should each run be submitted in a separate file?
Yes, each run should be submitted as a separate file.
13. Is there a naming convention of the submitted result predictions?
Yes, the submission file names should be built from the following components:
1) Team number (e.g. T05),
2) BC2 (for BioCreative 2),
3) PPI (task identifier, Protein-Protein Interaction),
4) Subtask identifier (e.g. IPS for the Interaction Pair Sub-task), and
5) Run number (e.g. 1, 2 or 3)
The sample prediction files of the three runs of team 5 would be:
T05_BC2_PPI_IPS_1.txt
T05_BC2_PPI_IPS_2.txt
T05_BC2_PPI_IPS_3.txt
The naming convention for each sub-task result submission will be announced again
together with the test set release.
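For illustration, the file names (and a corresponding check) can be produced
programmatically; the following Python sketch assumes only the pattern shown in the
examples above:

    import re

    def submission_filename(team, run, task="PPI", subtask="IPS"):
        """Build a submission file name such as T05_BC2_PPI_IPS_1.txt."""
        return "T%02d_BC2_%s_%s_%d.txt" % (team, task, subtask, run)

    # The three runs of team 5:
    for run in (1, 2, 3):
        print(submission_filename(5, run))
    # T05_BC2_PPI_IPS_1.txt, T05_BC2_PPI_IPS_2.txt, T05_BC2_PPI_IPS_3.txt

    # A simple check for incoming file names (pattern inferred from the examples above).
    NAME_PATTERN = re.compile(r"^T\d{2}_BC2_PPI_[A-Z]+_[123]\.txt$")
    assert NAME_PATTERN.match(submission_filename(5, 1))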
Last update of this page: 18 September 2006