1 How are the articles chosen by the interaction databases?
In general, there are two main article selection strategies used by MINT and
IntAct. One is based on exhaustive full curation of all the articles in a
predefined collection of peer-reviewed journals. The other is topic-based, for
example according to pathways, protein types, or species. For this competition
we consider the first type of article selection.
2 How many protein mentions of interacting proteins cannot be mapped
from the articles to a protein identifier by the database curators?
Practically all the proteins can be mapped to database identifiers,
although the difficulty or time required for the manual mapping may vary
considerably. In less than 5 percent of cases this is not possible; those
cases are not entered in the database. In some cases, if a UniProt ID is not
available for the given organism, we infer the identifier from another
organism. A comment reporting this "abuse" is then added.
3 How do curators deal with organism source ambiguity of a given protein
mention?
They use all kinds of information provided in the article to unambiguously
identify the organism source of the proteins. The curators sometimes have to
use the cell lines described in the article to obtain a clue for organism
source disambiguation (e.g. through the CABRI database).
4 Are figures considered by the database curators to derive their annotations
in the case of MINT and IntAct?
Yes, they are used, as they often provide experimental evidence information.
Both figures and figure legends might be used for annotation purposes. For the
BioCreative contest test set, interactions that were only apparent from a
table or figure were not used.
5 Are tables considered by the database curators to derive their annotations in the
case of MINT and IntAct?
Yes. For some large-scale interaction experiments (and depending on the
interaction detection method), tables are in fact often used to extract annotations.
6 What kind of article document is used by the curators to read and detect the
interaction annotations?
For regular database annotation, mainly HTML and PDF files are used, both in
electronic and printed form.
7 How are the protein interaction evidence sentences extracted?
After carefully reading the whole article, including legends and additional materials,
the curators mainly cut and paste the best evidence sentence for a given protein interaction.
8 Is the extracted evidence sentence for a given protein interaction pair the
overall best?
This depends of course on the curator's interpretation, and there may be cases
where several sentences are equally good evidence passages. For some interaction
pairs, several sentences expressing the protein interaction have been extracted.
9 Are there cases where, in a given phrase or sentence, evidence is provided for more
than one protein interaction pair?
Yes, there are cases where a given text passage contains interaction evidence for several
protein interaction pairs.
10 Is the additional material section considered for regular annotation?
Yes, the curators use everything provided for a given publication to confidently
extract their annotations. They sometimes take into consideration the additional
material section; in these cases this is flagged.
11 Is it possible that in a given article multiple methods for detecting protein interaction
are used?
Yes, this can certainly happen. Note that not all the proteins in a given article might be
studied with all the mentioned protein interaction detection methods. For instance, proteins
A, B, C and D could be studied with interaction detection method X, but only A and B
are subsequently studied with method Y.
12 Is the annotation of protein interactions in the case of these two databases
organism dependent?
In principle, no. These databases curate interactions for any organism and are not
restricted to a single model organism or to human proteins.
13 Are there cases where the protein interaction is between two proteins from
different organisms (e.g. protein A from mouse and protein B from human)?
Yes; although this is not very common, such cases do exist.
14 Is there a size limit on the evidence sentences for protein interactions?
Most of the evidence sentences extracted by the annotators are shorter than 250 characters.
15 Which character encoding will be used for mapping the predicted evidence sentences
to the curated evidence sentences?
In principle we expect to use the Unicode character encoding.
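As a rough illustration only, a predicted evidence sentence could be matched
against the curated ones after Unicode normalization. The following Python
sketch is an assumption about how such a comparison might look, not the
official evaluation procedure:

    import unicodedata

    def normalize(sentence):
        # NFC-normalize and collapse whitespace before comparison
        # (an assumption; the official matching procedure may differ).
        return " ".join(unicodedata.normalize("NFC", sentence).split())

    def matches(predicted, curated_sentences):
        target = normalize(predicted)
        return any(normalize(c) == target for c in curated_sentences)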
16 Are there cases of large-scale protein interaction experiments in the test set articles?
No; most of the articles in the test set have fewer than 30 interactions.
17 Should I consider the very large scale experiment articles in the training set?
We recommend NOT using them: since there are no large-scale experiment papers
in the test set, using them could bias your system. As a cut-off for the total
number of interactions per article (for the training set), we recommend using
those articles which have fewer than 21 interactions, as sketched below.
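For illustration, the recommended cut-off could be applied as follows; the
mapping from article ID to interaction count is a hypothetical input, not part
of the provided data format:

    def filter_training_articles(interaction_counts):
        # interaction_counts: dict mapping article ID -> number of
        # curated interactions (hypothetical structure).
        # Keep articles with fewer than 21 interactions, as recommended.
        return [article_id for article_id, n in interaction_counts.items()
                if n < 21]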
18 How should I deal with the mapping between splice variants and the master entry of UniProt
(normalization step)?
You should not worry about the splice variant case and the mapping to UniProt master entries.
This is not a very common problem (less than approximately 5 percent of cases) and, for
the test set, it will be handled by the evaluation group.
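For background, UniProt identifies splice variants by appending a dash and an
isoform number to the master accession (e.g. P12345-2). A trivial sketch of
collapsing such an identifier to its master entry:

    def master_accession(accession):
        # "P12345-2" (isoform) -> "P12345" (master entry);
        # accessions without a dash are returned unchanged.
        return accession.split("-", 1)[0]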
19 Did the two interaction databases, MINT and IntAct, perform a curator agreement study?
Yes, they performed a comparative annotation study to ensure that both databases were
following the same curation standards and data model. This study was done on 5 full-text
articles related to yeast proteins.
20 Are there cases where the article authors actually use terms incorrectly (incorrect
terminology usage)?
Yes, but only in a few cases. We call this wrong (confused) term usage 'jargon term
usage by authors', and we estimate that it affects less than 2 percent of cases. An
example would be the use of 'pull down' instead of 'co-immunoprecipitation' to refer
to an experiment. This sometimes happens due to incorrect terminology usage encountered
in sub-domains such as virology. Such experiments are mapped by the curators to the
correct controlled vocabulary term, based on the experiment description in the article
and the citation reference of the method used. There are no such cases in the test set.
21 Could there be a term overlap (the same term used for different concepts within the
controlled vocabulary hierarchy)?
There can be an overlap between the synonyms of some concepts of the controlled
vocabulary, but this is very rare.
22 Which spelling is used for the controlled vocabulary terms (e.g. US spelling or UK
spelling)?
In the case of the Gene Ontology, US spelling is used. For PSI-MI we are not completely
sure about this.
23 Do the curators sometimes take into account the references provided in an article for
the interaction detection experiment?
Yes, there are cases where the reference for the experimental method used to detect the
protein interaction is taken into account (back reference). Note that for concepts in
PSI-MI an external reference (PMID) is provided, corresponding to the article describing
the method.
24 Can I also use additional resources besides the provided training data?
Yes, sure. You can use any additional data resource available. You should nevertheless
specify them in the system description paper of the evaluation workshop.
25 What is the level of expertise of the database curators of MINT and IntAct?
They have a Ph.D. or at least a Master's degree in Molecular Biology or related
disciplines and are highly trained and experienced curators.
26 How long does it take for a curator to annotate an article?
This varies a lot depending on the database, the journals and the articles concerned.
On average, throughput ranges from 1 to 4 papers per curator per day.
27 Which is the format used by MINT and IntAct for their annotation entries?
They use a standard called the PSI-MI format. You should review this standard format
for protein interaction annotation. Refer to Hermjakob et al. (2004), PMID:14755292,
and the latest version of the standard, described at: http://psidev.sourceforge.net/mi/rel25
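As a rough illustration only, interaction records in a PSI-MI 2.5 XML file
could be walked with the Python standard library; the element names below
follow our reading of the schema and should be checked against the official
documentation:

    import xml.etree.ElementTree as ET

    def local_name(tag):
        # Strip the XML namespace prefix, whatever it is.
        return tag.rsplit("}", 1)[-1]

    def interaction_participants(path):
        # Yield, per <interaction>, the list of interactor references
        # (element names assumed from the PSI-MI 2.5 schema).
        tree = ET.parse(path)
        for elem in tree.iter():
            if local_name(elem.tag) == "interaction":
                yield [e.text for e in elem.iter()
                       if local_name(e.tag) == "interactorRef"]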
28 Do the curators extract interactions between a protein and a protein family?
No, the extracted interactions are based on individual proteins which can be mapped to
database entries.
29 What are common naming ambiguities/difficulties encountered for the interaction partner
proteins?
In addition to the difficulty of linking a protein name to the corresponding organism
source, other aspects that complicate the linking process are: ambiguity between protein
names and protein family names, and the fact that authors often refer to nucleic acid
regions using the same name as for the proteins.
30 What is the frequency of update of the data contained in the interaction databases?
The IntAct database is updated weekly. However, each entry is probably only updated about
twice per year, normally with maintenance updates of the syntax rather than of the content
of the entry.
31 How do the curators deal with cases where the authors call the protein using homologous
protein naming?
There are some cases where the authors do not use the official or common name of a given
protein, or where the corresponding database entry is not complete enough and does not
cover the protein name mentioned by the author. In these cases the curators sometimes use
a bioinformatics approach based on protein sequence similarity searches against the
homologous protein that does carry the name the author uses in the article. Example:
the author mentions 'murine protein ZZZ', but no protein ZZZ is found for mouse in the
protein database, while a human protein ZZZ does exist. Using sequence similarity
searches, the curators then retrieve a mouse protein which shares significant similarity
with the human ZZZ protein. Based on the sequence similarity, the database record of this
protein, and the description of the protein in the article, the expert curator can decide
whether they are the same protein.
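For illustration, such a similarity search could be run with the NCBI BLAST+
command line tool; the query file and database names below are hypothetical,
and the resulting hits would still be inspected manually by a curator:

    import subprocess

    def candidate_homologues(query_fasta="human_ZZZ.fasta",
                             database="mouse_proteome"):
        # Run blastp with tabular output (qseqid sseqid pident length
        # mismatch gapopen qstart qend sstart send evalue bitscore).
        result = subprocess.run(
            ["blastp", "-query", query_fasta, "-db", database,
             "-outfmt", "6", "-evalue", "1e-5", "-max_target_seqs", "5"],
            capture_output=True, text=True, check=True)
        return [line.split("\t") for line in result.stdout.splitlines()]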
32 What kind of protein-protein interactions are curated in MINT and IntAct?
The interaction type is given as an attribute of the interaction. According to PSI-MI 2.5,
MINT and IntAct curate colocalisations and physical interactions (and all their children).
Generally, physical interactions with experimental evidence shown in the paper are curated.
33 Are symmetric or asymmetric relations considered in the case of the protein interactions?
Both are considered; the experimental roles of the proteins can be asymmetric.
34 Are all the interaction types annotated?
The interaction type is given as an attribute of the interaction. Generally, physical
interactions with experimental evidence shown in the paper are curated. You should be
careful with genetic interactions: in some cases the genetic interactions mentioned in
articles are not curated, because they are not trustworthy and the interaction is not
direct (e.g. one protein activates another protein, but through a signalling cascade
with intermediate proteins in between). As a rule, genetic interactions are not curated.
35 Will the test set collection follow the annotation standards used by the IntAct/MINT databases?
Yes, they will follow their annotation standards.
36 Can I also use additional resources other than those provided by the BioCreative
organizers to develop/construct my system?
Participating teams are not restricted to the provided training sets when developing their
systems for the Protein-Protein Interaction (PPI) task, so this is not a 'closed' task
restricted to a particular training collection. Nevertheless, we will ask participants who
submit results for the test set predictions to provide a short system description, including
a mention of any additional resources they used, in order to allow comparative evaluation
and to see which approaches are successful.
Last update of this page: 20 September 2006