*BioCreAtIvE - Bio-NLP corpora*

BioCreAtIvE - Critical Assessment for Information Extraction in Biology

Biology corpora

MedTag: A collection of biomedical annotations (MEDLINE abstracts): the AbGene corpus of sentences annotated with gene and protein named entities, the MedPost corpus of part-of-speech-tagged sentences, and the GENETAG corpus for named entity identification used for BioCreAtIvE I.

TREC Genomics Track: A set of data collections provided by the TREC Genomics Track, useful for developing and evaluating retrieval and text categorization strategies in the biomedical domain.

BioCreative corpus: Dataset produced by the BioCreative assessment: text passages relevant to GO annotations of human proteins.

GENIA corpus: Annotated corpus of literature related to the MeSH terms: Human, Blood Cells, and Transcription Factors.

Yapex corpus: Training and test data for the protein tagger (NER) YAPEX.

PASBio: Predicate-argument structures of biomedical literature.

LLL05 dataset: Information extraction data set of protein/gene interactions from the Genic Interaction Extraction Challenge.

IEPA corpus: The Interaction Extraction Performance Assessment corpus

BioText Data: Dataset for extracting relations between disease and treatment entities.

BioText NC Semantics Dataset: Dataset of Noun Compound Semantics used in experiments described in articles

PennBioIE: UPenn Biomedical Information Extraction datasets of annotated PubMed abstracts: CYP450 domain and oncology domain

Medstract corpus: Biomedical annotation corpus useful for acronym definition and coreference resolution

OHSUMED text collection: Document collection used for the TREC-9 contest.

BMC corpus: Open access corpus of full text articles provided by BioMed Central.

FetchProt corpus: Full text journal articles from the biological domain analyzed for experiments on proteins.

PDG Bio-sentence splitter corpus: Small collection of text data sets derived from PubMed abstracts to develop and assess sentence splitting tools.

Bio1 corpus: Annotated corpus in the same field as GENIA, but annotated against a small top-level ontology.
