BioCreAtIvE - Critical Assessment for Information Extraction in Biology

Protein-Protein Interaction Task: Questions and Answers

1 How are the articles chosen by the interaction databases? 
In general, two main article selection strategies are used by MINT and IntAct. 
One is based on exhaustive full curation of all the articles in a predefined 
collection of peer-reviewed journals. The other is topic-based, for example 
according to pathways, protein types, or species. For this competition we 
consider the first type of article selection.

2 How many protein mentions of interacting proteins cannot be mapped
from the articles to a protein identifier by the database curators?
Practically all the proteins can be mapped to database identifiers, although 
the difficulty and the time required for the manual mapping can vary 
considerably. In fewer than 5 percent of cases this is not possible; those 
cases are not entered in the database. In some cases, if the UniProt ID is 
not available for the given organism, we infer the identifier from another 
organism. A comment then reports this "abuse".

3 How do curators deal with organism-source ambiguity of a given protein
mention?
They use all kinds of information provided in the article to unambiguously 
identify the organism source of the proteins. The curators sometimes have to 
use the cell lines described in the article as a clue for organism-source 
disambiguation (e.g. through the CABRI database).

4 Are figures considered by the database curators to derive their annotations
in the case of MINT and IntAct?
Yes. They are used, as they often provide experimental evidence information. 
Both figures and figure legends may be used for annotation purposes. In the 
case of the BioCreative contest test set, interactions that were apparent 
only from a table or figure were not used.

5 Are tables considered by the database curators to derive their annotations in the 
case of MINT and IntAct?
Yes. For some large-scale interaction experiments (depending on the interaction 
detection method), tables are in fact often used to extract annotations.

6 What kind of article document is used by the curators to read and detect the 
interaction annotations?
For regular database annotation, mainly HTML and PDF files are used, in both 
electronic and printed form.

7 How are the protein interaction evidence sentences extracted?
After carefully reading the whole article, including legends and additional materials, 
the curators mainly cut and paste the best evidence sentence for a given protein interaction.

8 Is the extracted evidence sentence for a given protein interaction pair the 
overall best?
This depends, of course, on the curator's interpretation, and there may be cases 
where several sentences are equally good evidence passages. For some interaction 
pairs, several sentences expressing the protein interaction have been extracted.

9 Are there cases where, in a given phrase or sentence, evidence is provided for more 
than one protein interaction pair?
Yes, there are cases where a given text passage contains interaction evidence for several 
protein interaction pairs.

10 Is the additional material section considered for regular annotation?
Yes, the curators use everything provided for a given publication to extract 
their annotations confidently. When the curators take the additional material 
section into consideration, this is flagged.

11 Is it possible that multiple methods for detecting protein interactions 
are used in a given article?
Yes, this can certainly happen. Note that not all the proteins in a given article 
might be studied with all the mentioned protein interaction detection methods. For 
instance, proteins A, B, C and D could be studied with interaction detection method X, 
but only A and B are subsequently studied with interaction detection method Y.

12 Is the annotation of protein interactions in the case of these two databases 
organism-dependent?
In principle, no. These databases curate interactions for any organism and are not 
restricted to a single model organism or to human proteins.

13 Are there cases where the protein interaction is between two proteins from 
different organisms (e.g. protein A from mouse and protein B from human)?
Yes, although such cases are not very common.

14 Is there a size limit of the evidence sentence for protein interactions?
Most of the evidence sentences extracted by the annotators have fewer than 250 characters.

15 Which character encoding will be used for mapping the predicted evidence sentences 
to the curated evidence sentences?
In principle we expect to use Unicode character encoding.
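
As an illustration of what such a mapping step might look like, here is a minimal 
Python sketch that Unicode-normalizes a predicted and a curated evidence sentence 
before comparing them. The choice of NFKC normalization and the whitespace folding 
are assumptions for illustration, not part of the official evaluation procedure.

    import re
    import unicodedata

    def normalize(sentence: str) -> str:
        """Fold a sentence into a canonical form for comparison.

        NFKC normalization and whitespace collapsing are assumptions;
        the official evaluation may use a different scheme.
        """
        s = unicodedata.normalize("NFKC", sentence)
        return re.sub(r"\s+", " ", s).strip()

    def sentences_match(predicted: str, curated: str) -> bool:
        """True if the two evidence sentences agree after normalization."""
        return normalize(predicted) == normalize(curated)

    # A non-breaking space (U+00A0) in the prediction no longer breaks the match:
    print(sentences_match("Protein\u00a0A binds protein B.",
                          "Protein A binds protein B."))  # True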

16 Are there cases of large-scale protein interaction experiments in the test set articles?
No; most of the test set articles have fewer than 30 interactions.

17 Should I consider the very large-scale experiment articles in the training set?
We recommend NOT using them: the test set contains no large-scale experiment 
papers, so using them could bias your system. As a cut-off on the total number of 
interactions per article (for the training set), we recommend using those articles 
which have fewer than 21 interactions.
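
A minimal sketch of applying this cut-off, assuming the training annotations are 
available as a mapping from article identifier to its list of curated interaction 
pairs (the data structure and identifiers below are hypothetical; adapt them to 
however you load the training files):

    # Hypothetical structure: article ID -> list of curated interaction pairs.
    training_annotations = {
        "PMID:1000001": [("P12345", "Q67890"), ("P12345", "P54321")],
        "PMID:1000002": [("O11111", "O22222")] * 40,  # a large-scale article
    }

    MAX_INTERACTIONS = 20  # "fewer than 21" interactions per article

    filtered = {
        pmid: pairs
        for pmid, pairs in training_annotations.items()
        if len(pairs) <= MAX_INTERACTIONS
    }

    print(sorted(filtered))  # ['PMID:1000001']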

18 How should I deal with the mapping between splice variants and the master entry of UniProt 
(normalization step)?
You should not worry about the splice variant case and the mapping to UniProt master entries. 
This is not a very common problem (fewer than approximately 5 percent of cases), and for the 
test set it will be handled by the evaluation group.
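
For reference, UniProt names splice isoforms by appending a dash and a number to the 
master accession (e.g. P12345-2), so collapsing an isoform identifier to its master 
entry is a simple string operation. The sketch below assumes identifiers follow that 
convention.

    def master_accession(uniprot_id: str) -> str:
        """Collapse a UniProt splice-isoform ID (e.g. 'P12345-2')
        to its master accession ('P12345').

        Assumes the standard accession-dash-isoform-number convention;
        IDs without a dash are returned unchanged.
        """
        return uniprot_id.split("-", 1)[0]

    print(master_accession("P12345-2"))  # P12345
    print(master_accession("Q9Y6K9"))    # Q9Y6K9 (no isoform suffix)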

19 Did the two interaction databases, MINT and IntAct, perform an inter-curator agreement study?
Yes, they performed a comparative annotation study to ensure that both databases were following 
the same curation standards and data model. This study was done on 5 full-text articles related 
to yeast proteins.

20 Are there cases where the article authors actually use terms incorrectly (incorrect 
terminology usage)?
Yes, but only in a few cases. We call this wrong (confused) term usage 'jargon term usage by 
authors'. We estimate that fewer than 2 percent of cases are of this kind. An example would 
be using 'pull down' instead of 'co-immunoprecipitation' to refer to an experiment. This 
sometimes happens because of incorrect terminology usage in sub-domains such as virology. 
Such experiments are mapped by the curators to the correct controlled vocabulary term, based 
on the experiment description in the article and the citation reference of the method used 
in the article. There are no such cases in the test set.

21 Could there be a term overlap (the same term used for different concepts within the 
controlled vocabulary hierarchy)?
There can be an overlap between the synonyms of some concepts of the controlled vocabulary, 
but this is very rare.
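
If you want to detect such collisions in your own copy of the vocabulary, a minimal 
sketch (the concept-to-synonym table below is hypothetical) is to invert the mapping 
and flag every synonym that points to more than one concept:

    from collections import defaultdict

    # Hypothetical concept -> synonyms table extracted from the vocabulary.
    concept_synonyms = {
        "MI:0006": ["anti bait coip", "anti bait coimmunoprecipitation"],
        "MI:0007": ["anti tag coip", "anti bait coip"],  # shares a synonym
    }

    synonym_to_concepts = defaultdict(set)
    for concept, synonyms in concept_synonyms.items():
        for synonym in synonyms:
            synonym_to_concepts[synonym.lower()].add(concept)

    # Report every synonym attached to more than one concept.
    for synonym, concepts in synonym_to_concepts.items():
        if len(concepts) > 1:
            print(f"ambiguous synonym {synonym!r}: {sorted(concepts)}")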

22 Which spelling of the controlled vocabulary terms is used (e.g. US spelling or UK 
spelling)?
In the case of the Gene Ontology, US spelling is used. For MI we are not completely sure about this.

23 Do the curators sometimes take into account the references provided in an article for 
the interaction detection experiment?
Yes, there are cases where the reference for the experimental method used to detect the protein 
interaction is taken into account (back reference). Note that for concepts in MI an external 
reference (PMID) is provided, corresponding to the article describing the method.

24 Can I also use additional resources besides the provided training data?
Yes, certainly. You can use any additional data resource available. You should nevertheless 
specify them in the system description paper for the evaluation workshop.

25 What is the level of expertise of the database curators of MINT and IntAct?
They have a Ph.D. or at least a Master's degree in Molecular Biology or related disciplines, 
and are highly trained and experienced curators.

26 How long does it take for a curator to annotate an article?
This varies a lot depending on the database, the journals, and the articles. On average, 
a curator annotates between 1 and 4 papers per day.

27 What format do MINT and IntAct use for their annotation entries?
They use a standard called the PSI-MI format. You should review this standard format for 
protein interaction annotation. Refer to Hermjakob et al. (2004), PMID:14755292, and 
the latest version of the standard, described at: http://psidev.sourceforge.net/mi/rel25
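
As a rough illustration of working with this format, the sketch below walks a PSI-MI 
XML 2.5 file and prints the participant references of each interaction. It assumes the 
entrySet > entry > interactionList > interaction > participantList > participant > 
interactorRef layout and strips namespaces for brevity; real files may embed interactor 
elements inline instead of references, so treat this as a starting point only.

    import xml.etree.ElementTree as ET

    def local(tag: str) -> str:
        """Strip any XML namespace prefix from a tag name."""
        return tag.rsplit("}", 1)[-1]

    def list_interactions(path: str) -> None:
        """Print the interactorRef values of each interaction element."""
        tree = ET.parse(path)
        for elem in tree.iter():
            if local(elem.tag) != "interaction":
                continue
            refs = [node.text for node in elem.iter()
                    if local(node.tag) == "interactorRef"]
            print("interaction participants:", refs)

    # list_interactions("example_psimi25.xml")  # hypothetical file path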

28 Do the curators extract interactions between a protein and a protein family?
No, the extracted interactions are based on individual proteins that can be mapped to 
database entries.

29 What are common naming ambiguities/difficulties encountered for the interaction partner 
proteins?
In addition to the difficulty of linking a protein name to the corresponding organism source, 
other aspects that complicate the linking process are the ambiguity between protein names and 
protein family names, and the fact that authors often refer to nucleic acid regions using the 
same name as for proteins.

30 How frequently is the data contained in the interaction databases updated?
The IntAct database is updated weekly. However, each entry is probably only updated about 
twice per year, normally with maintenance updates to the syntax rather than to the content 
of the entry.

31 How do the curators deal with cases where the authors refer to a protein by the name 
of a homologous protein?
There are some cases where the authors do not use the official or common name of a given 
protein, or where the corresponding database entry is not complete enough and does not cover 
the protein name mentioned by the author. In these cases the curators sometimes use a 
bioinformatics approach, based on protein sequence similarity searches against the homologous 
protein that does carry the name the author uses in the article. Example: 
the author mentions 'murine protein ZZZ', but no protein ZZZ is found for mouse in the 
protein database, while a human protein ZZZ does exist. Using sequence similarity searches, 
the curators then retrieve a mouse protein that shares significant similarity with the 
human ZZZ protein. Based on the sequence similarity, the database record of this protein, 
and the description of the protein in the article, the expert curator can determine whether 
they are the same protein.
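
As a rough illustration of the sequence-similarity step (not the curators' actual 
pipeline), the sketch below shells out to NCBI's blastp; the query file path and the 
database name are placeholders, and the e-value threshold is an arbitrary choice.

    import subprocess

    def similar_proteins(query_fasta: str, db: str = "mouse_proteins"):
        """Run a blastp similarity search (NCBI BLAST+) and return
        (subject ID, percent identity, e-value) tuples.

        Paths and the database name are placeholders; this only
        illustrates the kind of search a curator might run.
        """
        result = subprocess.run(
            ["blastp", "-query", query_fasta, "-db", db,
             "-outfmt", "6 sseqid pident evalue", "-evalue", "1e-10"],
            capture_output=True, text=True, check=True,
        )
        hits = []
        for line in result.stdout.splitlines():
            sseqid, pident, evalue = line.split("\t")
            hits.append((sseqid, float(pident), float(evalue)))
        return hits

    # hits = similar_proteins("human_ZZZ.fasta")  # hypothetical query file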

32 What kind of protein-protein interactions are curated in MINT and IntAct?
The interaction type is given as an attribute of the interaction. According to PSI-MI 2.5, 
MINT and IntAct curate colocalisations and physical interactions (and all their children). 
Generally, physical interactions with experimental evidence shown in the paper are curated.

33 Are symmetric or asymmetric relations considered in the case of the protein interactions?
Both are considered; the experimental roles of the proteins can be asymmetric.

34 Are all the interaction types annotated?
The interaction type is given as an attribute of the interaction. Generally, physical 
interactions with experimental evidence shown in the paper are curated. You should be 
careful with genetic interactions. In some cases the genetic interactions mentioned in 
articles are not curated, because they are not trustworthy and the interaction is not 
direct (e.g. one protein activates another protein, but through a signaling cascade with 
intermediate proteins in between). As a rule, genetic interactions are not curated.

35 Will the test set collection follow the annotation standards used by the IntAct/MINT databases?
Yes, it will follow their annotation standards.

36 Can I also use additional resources other than those provided by the BioCreative 
organizers to develop/construct my system?
Participating teams are not restricted to the provided training sets for developing their 
systems in the case of the Protein-Protein Interaction (PPI) task, so this is not a 'closed' 
task restricted to a particular training collection. Nevertheless, we will ask participants 
who submit results for the test set predictions to provide a short system description, 
including a mention of the additional resources they used, in order to allow comparative 
evaluation and to see which approaches are successful.
Last update of this page: 20 September 2006



© by Martin Krallinger 2006