Protein-Protein Interaction Task
This task is organized as a collaboration between the IntAct and
MINT protein
interaction databases and the CNIO Structural Bioinformatics and Biocomputing
group.
A) Background
The study of protein interactions is one of the most pressing biological
problems. Characterizing protein interaction partners is crucial to
understanding not only the functional role of individual proteins but also
the organization of entire biological processes.
The development of high throughput experimental technologies, such as yeast
two-hybrid screening [1] or affinity purification coupled with mass
spectrometry [2], is now making it possible to study protein interactions on
a much larger scale by means of bioinformatics approaches [3-5]. One limitation
of these large-scale experiments is their accuracy. Protein interaction
databases have been developed [6-7] to integrate protein
interaction information from these disparate sources, e.g., high throughput
methods as well as carefully experimentally characterized individual protein
interactions.
Databases such as IntAct and MINT provide interaction
information in the form of well-structured database records in standard
formats, constituting a useful resource for both biologists and
bioinformaticians.
Because the molecular biology literature provides detailed descriptions of
protein interaction experiments specifying the individual interaction partners,
as well as the corresponding interaction types, it has been exploited as
a resource to derive protein interaction records for interaction
databases. Due to the rapid growth of the biomedical literature
and the increasing number of newly discovered proteins, it is becoming
difficult for the interaction database curators to keep up with the literature
by manually detecting and curating protein interaction information.
This is motivating the implementation of information extraction and text
mining techniques to automatically extract protein interaction
information from free text. A number of approaches have
been published (see [8-33] for some of the strategies), and an
initial challenge evaluation has been carried out [34]. Nevertheless, a large-scale
evaluation of different methods applied to existing protein
interaction databases is still missing. To produce high quality training and
test data collections, as well as to set up community-wide experiments
which can result in relevant and useful systems,
collaboration with experts in protein interaction databases is crucial.
B) Introduction to the protein-protein interaction (PPI) extraction task
One of the main limitations for the development and evaluation of
methods that extract protein-protein interactions from text is the lack of
Gold Standard training data sets. This makes it cumbersome to compare
existing automated extraction methods, as most results are reported using
author-specific evaluation data sets; furthermore, some systems have only
been evaluated using article abstracts.
In practice, biologists who search for protein interactions are not
limited to abstracts, but consult full text articles to derive protein
interaction information. The type of protein interaction and the
experimental method used to determine whether two proteins interact are
also important information preserved in expert curated databases.
For BioCreAtIvE II, the protein-protein interaction task focuses on the prediction of
protein interactions from full text articles. The second BioCreAtIvE
challenge is gathering expert database curators with experience in protein
interaction annotation together with experts in evaluating information
extraction systems adapted to the biology domain.
Among the main goals posed in this task are:
- (1) To determine the state of the art in extraction of protein-protein
interaction;
- (2) To produce useful resources for training and testing protein interaction
extraction systems;
- (3) To learn which approaches are successful and practical;
- (4) To monitor interesting new approaches;
- (5) To provide the biology community with useful tools to extract
protein-protein interactions from texts.
This second BioCreAtIvE challenge provides the opportunity for
participating systems to take advantage of the underlying collaboration
with domain experts, addressing a practical task. The
training and test data sets are characterized by in-depth annotations of
protein interactions of full text articles. These annotations include
all the manually registered mentions of the interacting proteins from
the full texts, as provided by the database curators.
C) Protein-protein interaction task description
Reflecting the steps that database curators follow when extracting annotations,
several sub-tasks are posed. Each participant is free to take part in any (or
all) of the proposed sub-tasks. (See also the PPI-relevant Q & A page.)
Protein Interaction Article Sub-task 1 (IAS)
In practice, before detecting protein interaction descriptions in sentences,
it is necessary to select those articles which contain information
relevant to protein interactions. Although this step is critical for
subsequent steps, it has often been neglected by previously published
protein interaction extraction systems. This sub-task is therefore concerned
with classifying whether a given article contains protein interaction
information.
For a detailed description of this subtask, refer to the Protein Interaction Article Sub-task 1 (IAS)
page.
Participants will need to return a ranked list of articles (identifiers) based
on their relevance for protein interaction annotation. To evaluate the
participating systems, the AROC (area under the receiver operating
characteristic curve) measure will be used, based on the ranked predicted
collections. (Initially we had also considered using additional evaluation
metrics, e.g. the utility measure [35].)
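As a rough illustration of the AROC measure (a sketch only, not the official evaluation script), the area under the ROC curve of a ranked list without ties equals the fraction of (relevant, non-relevant) article pairs that the ranking puts in the right order. The function name and the toy identifiers below are hypothetical.

```python
def aroc(ranked_ids, relevant_ids):
    """Area under the ROC curve for a ranked list without ties.

    ranked_ids:   article identifiers, most relevant first.
    relevant_ids: identifiers that curators consider relevant.
    """
    relevant_ids = set(relevant_ids)
    n_pos = sum(1 for article in ranked_ids if article in relevant_ids)
    n_neg = len(ranked_ids) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one relevant and one non-relevant article")

    correct_pairs = 0
    negatives_below = n_neg  # non-relevant articles ranked below the current position
    for article in ranked_ids:
        if article in relevant_ids:
            # Every non-relevant article ranked lower is a correctly ordered pair.
            correct_pairs += negatives_below
        else:
            negatives_below -= 1
    return correct_pairs / (n_pos * n_neg)


# Toy example with made-up identifiers: "A" and "C" are the relevant articles.
print(aroc(["A", "B", "C", "D"], {"A", "C"}))  # 0.75
```

A perfect ranking gives 1.0, while a ranking that places every non-relevant article first gives 0.0.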
The training collection will contain:
- a) TP: (True Positives) a collection of PubMed article abstracts which are
relevant for protein interaction curation.
- b) TN: (True Negatives) a collection of articles which have been classified
by domain expert curators from the IntAct and MINT databases as not relevant for
protein interaction curation.
- c) *TP: (likely True Positives) a collection of PubMed
identifiers of articles which have been used for protein interaction
annotation by other interaction databases (namely BIND, HPRD, MPACT and GRID).
Protein Interaction Pairs Sub-task 2 (IPS)
This sub-task concerns the identification of protein-protein interaction pairs from
full text articles. As training data, participants will receive a collection of articles
with the associated interaction pairs extracted from these articles, as well as the
corresponding gene mention symbols. For the test set, participants
have to provide, for each article, a ranked list of
protein-protein interaction pairs. The evaluation will be in terms of
precision and recall of the predicted protein interaction pairs for each
article.
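The per-article precision and recall mentioned above can be sketched as follows. This is a minimal illustration, assuming that an interaction pair is unordered (A-B equals B-A) and that a prediction is correct only if both identifiers match the curated pair; these matching rules, the function names and the placeholder identifiers are assumptions, not part of the task definition.

```python
def normalize(pairs):
    """Turn (a, b) tuples into unordered pairs so that A-B equals B-A."""
    return {frozenset(pair) for pair in pairs}


def precision_recall(predicted_pairs, gold_pairs):
    """Precision and recall of predicted interaction pairs for one article."""
    predicted = normalize(predicted_pairs)
    gold = normalize(gold_pairs)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Placeholder identifiers, for illustration only.
predicted = [("P1", "P2"), ("P3", "P2"), ("P4", "P5")]
gold = [("P2", "P1"), ("P2", "P3"), ("P6", "P7"), ("P8", "P9")]
precision, recall = precision_recall(predicted, gold)
```

Here two of the three predicted pairs are in the curated set of four, so precision is 2/3 and recall is 2/4.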
For a detailed description of this subtask, refer to the Protein
Interaction Pairs Sub-task 2 (IPS)
page.
Protein Interaction Sentences Sub-task 3 (ISS)
In practice, protein-protein interaction information for a given pair of
proteins might be mentioned several times throughout a full text article.
To produce a protein interaction summary, for instance, it is useful to
select the most relevant sentence expressing interaction information for
a given pair. Therefore one of the sub-tasks will ask participants to
provide, for each protein interaction pair, a ranked list of at most 5 text passages
(containing at most 3 sentences per passage) describing their interaction.
For the evaluation, pooling methods will be used, as follows: all the
sentences from all the systems for each document are collected. We will evaluate
according to two aspects: a) the percentage of interaction-relevant sentences with
respect to the total number of predicted (submitted) sentences and b) the mean
reciprocal rank (MRR) of the ranked list of interaction evidence passages with
respect to the manually chosen best interaction sentence. Point b) is the most important
evaluation criterion.
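The mean reciprocal rank in point b) can be sketched as below. This is an illustrative sketch under stated assumptions: a passage is counted as a hit when it contains the curator-chosen best sentence as a substring, and a pair with no hit contributes 0. The matching rule, the function names and the toy data are hypothetical.

```python
def reciprocal_rank(ranked_passages, best_sentence):
    """1/rank of the first passage containing the best sentence, else 0."""
    for rank, passage in enumerate(ranked_passages, start=1):
        if best_sentence in passage:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(submissions, gold):
    """Average reciprocal rank over all curated interaction pairs.

    submissions: dict mapping a pair to its ranked list of passages.
    gold:        dict mapping a pair to the manually chosen best sentence.
    """
    scores = [
        reciprocal_rank(submissions.get(pair, []), best_sentence)
        for pair, best_sentence in gold.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: hit at rank 1, hit at rank 3, and one pair with no submission.
gold = {("A", "B"): "A binds B.", ("C", "D"): "C interacts with D.", ("E", "F"): "E binds F."}
submissions = {
    ("A", "B"): ["We show that A binds B. This was confirmed.", "Other text."],
    ("C", "D"): ["Unrelated.", "Also unrelated.", "C interacts with D."],
}
mrr = mean_reciprocal_rank(submissions, gold)  # (1 + 1/3 + 0) / 3
```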
For a detailed description of this subtask, refer to the
Protein Interaction Sentences Sub-task 3 (ISS)
page.
Protein Interaction Method Sub-task 4 (IMS)
For annotation purposes, as well as to judge the quality of protein
interactions, it is important to know how protein interactions have been
determined experimentally. In the case of protein-protein interaction
annotation, considerable effort has been made to develop a controlled
vocabulary of interaction methods. This sub-task refers to the
identification of the type of experiment which was used to confirm a given
protein-protein interaction. The experimental method description has to
be mapped to a previously provided controlled hierarchical vocabulary
of experimental methods [36]. In this case the evaluation will be
measured by the mean reciprocal rank of correctly identified interaction
methods (correct MI identifiers) for each protein-protein interaction
pair compared to the previously manually annotated interaction detection methods.
This hierarchical controlled vocabulary
is available at MI.
For a detailed description of this subtask, refer to the Protein Interaction Method Sub-task 4 (IMS)
page.
D) General Data set considerations
For the training and test data sets, the annotation strategy followed by the IntAct and MINT
databases has been considered. Note that proteins are uniquely
identified by their UniProt ID. Although IntAct and MINT
annotation is done down to isoform level, the BioCreAtIvE
competition mapping is done to UniProt "master" entries. A UniProt 'light' version will be
distributed to the participants. Note that only entries contained in this release will be considered
for evaluation, to avoid the problem of obsolete identifiers.
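One possible way to apply these two rules before submitting predictions is sketched below. It assumes that isoform-level identifiers follow the usual UniProt convention of a master accession plus a numeric suffix (for example P12345-2), and that the distributed 'light' release has already been loaded into a set of accessions; the function names and the placeholder accessions are hypothetical.

```python
def to_master(accession):
    """Map an isoform accession such as 'P12345-2' to its master entry 'P12345'."""
    return accession.split("-", 1)[0]


def filter_to_release(accessions, release):
    """Keep master accessions that are present in the release, without duplicates.

    accessions: predicted accessions, possibly at isoform level.
    release:    set of accessions contained in the distributed release.
    """
    kept = []
    for accession in accessions:
        master = to_master(accession)
        if master in release and master not in kept:
            kept.append(master)
    return kept


# Placeholder accessions, for illustration only.
release = {"P12345", "Q99999"}
predicted = ["P12345-2", "P12345", "O00000", "Q99999-1"]
usable = filter_to_release(predicted, release)  # ['P12345', 'Q99999']
```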
E) Additional resources
Additional useful data collections, such as protein interaction sentences
derived from PubMed abstracts and links to other interaction-relevant
resources will be provided as well.
References
[1] Uetz, P., et al. (2000) A comprehensive analysis of protein-protein
interactions in Saccharomyces cerevisiae. Nature, 403, 623-627
[2] Gavin, A.C., et al. (2002) Functional organization of the yeast
proteome by systematic analysis of protein complexes. Nature, 415,
141-147
[3] Valencia, A. & Pazos, F. (2003) Prediction of protein-protein
interactions from evolutionary information. Methods Biochem Anal., 44,
411-426
[4] Enright, A.J., et al. (1999) Protein interaction maps for complete
genomes based on gene fusion events. Nature, 402, 86-90
[5] Jansen, R., et al. (2003) A Bayesian networks approach for predicting
protein-protein interactions from genomic data. Science, 302, 449-453
[6] Hermjakob, H., et al. (2004) IntAct: an open source molecular
interaction database. Nucleic Acids Res., 32, D452-D455
[7] Zanzoni, A., et al. (2002) MINT: a Molecular INTeraction database.
FEBS Lett., 513, 135-140
[8] Blaschke, C. and Valencia, A. (2001) The potential use of SUISEKI
as a protein interaction discovery tool. Genome Inform Ser Workshop Genome
Inform., 12, 123-134
[9] Marcotte,E.M., et al (2001) Mining literature for protein-protein
interactions. Bioinformatics, 17, 359-363
[10] Proux,D., et al (2000) A pragmatic information extraction strategy
for gathering data on genetic interactions. Proc Int Conf Intell Syst
Mol Biol, 8, 279-285
[11] Ono,T., et al (2001) Automated extraction of information on
protein-protein interactions from the biological literature.
Bioinformatics, 17, 155-161
[12] Rindflesch,T.C., et al (1999) Mining molecular binding terminology
from biomedical text. Proc AMIA Symp., 127-131
[13] Hatzivassiloglou,V. and Weng,W. (2002) Learning anchor verbs for
biological interaction patterns from published text articles. Int J Med
Inf., 67, 19-32
[14] Hoffmann,R. and Valencia,A. (2003) Protein interaction: same
network, different hubs. Trends Genet., 19, 681-683
[15] Donaldson,I., et al (2003) PreBIND and Textomy--mining the
biomedical literature for protein-protein interactions using a support
vector machine. BMC Bioinformatics, 4, 11
[16] Sekimizu,T., et al (1998) Identifying the Interaction between
Genes and Gene Products Based on Frequently Seen Verbs in Medline Abstracts.
Genome Inform Ser Workshop Genome Inform., 9, 62-71
[17] Daraselia,N., et al (2004) Extracting human protein interactions
from MEDLINE using a full-sentence parser. Bioinformatics, 20, 604-611
[18] Rzhetsky,A., et al (2004) GeneWays: a system for extracting,
analyzing, visualizing, and integrating molecular pathway data. J
Biomed Inform., 37, 43-53
[19] Hu,Z.Z., et al (2004) iProLINK: an integrated protein resource
for literature mining. Comput Biol Chem., 28, 409-416
[20] Koike,A. and Takagi,T. (2005) PRIME: automatically extracted
PRotein Interactions and Molecular Information databasE. In Silico
Biol., 5, 9-20
[21] Domedel-Puig,N. and Wernisch,L. (2005) Applying GIFT, a Gene
Interactions Finder in Text, to fly literature. Bioinformatics, 21,
3582-3583
[22] Blaschke,C. and Valencia,A. (2002) The frame-based module of the
Suiseki information extraction system. IEEE Intelligent Systems., 17,
14-20
[23] Katrenko,S., et al (2005) Learning Biological Interactions from
Medline Abstracts. Proc of ICML05 workshop
[24] Hao,Y., et al (2005) Discovering patterns to extract
protein-protein interactions from the literature: Part II. Bioinformatics, 21,
3294-3300
[25] Ahmed,S.T., et al (2005) IntEx: A Syntactic Role Driven
Protein-Protein Interaction extractor for Bio-Medical Text. Proc
workshop ACL-05/ISMB-05, 54-61
[26] Huang,M., et al (2004) Discovering patterns to extract
protein-protein interactions from full texts. Bioinformatics, 20,
3604-3612
[27] Koike,A., et al (2003) Kinase pathway database: an integrated
protein-kinase and NLP-based protein-interaction resource. Genome Res.,
13, 1231-1243
[28] Humphreys,K., et al (2000) Two applications of information
extraction to biological science journal articles: enzyme interactions
and protein structures. Pac Symp Biocomput, 505-516
[29] Blaschke,C., et al (1999) Automatic extraction of biological
information from scientific text: protein-protein interactions. Proc
Int Conf Intell Syst Mol Biol., 60-67
[30] Rindflesch,T.C., et al (2000) EDGAR: extraction of drugs, genes
and relations from the biomedical literature. Pac Symp Biocomput, 517-528
[31] Blaschke,C. and Valencia,A. (2001) Can bibliographic pointers for
known biological data be found automatically? Protein interactions as a
case study. Comp. Funct. Genom., 2, 196-206
[32] Sugiyama,K., et al (2003) Extracting Information on
Protein-Protein Interactions from Biological Literature Based on
Machine Learning Approaches. Genome Informatics, 14, 699-700
[33] Friedman,C., et al (2001) GENIES: a natural-language processing
system for the extraction of molecular pathways from journal articles.
Bioinformatics, 17, S74-S82
[34] Nedellec,C. (2005) Learning Language in Logic - Genic Interaction
Extraction Challenge. Proc LLL05 workshop
[35] Hersh, W., et al (2004) TREC 2004 Genomics Track Overview
http://ir.ohsu.edu/genomics/
[36] http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=MI&termId=MI%3A0001&termName=interaction%20detection%20method
[37] http://psidev.sourceforge.net/mi/controlledVocab/psi-mi.def.html#MI:0026
[38] http://cvs.sourceforge.net/viewcvs.py/psidev/psi/mi/controlledVocab/
[39] http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=MI&termId=MI%3A0001&termName=interaction%
Last update of this page: 17 September 2006