 
 
  
Biology corpora
 
 
 
      
    	
 
          MedTag: 
         A collection of biomedical annotations of MEDLINE abstracts: the AbGene corpus of
	 sentences annotated with gene and protein named entities, the MedPost corpus
	 of part-of-speech-tagged sentences, and the GENETAG corpus for named
	 entity identification used for BioCreAtIvE I. 
 
          TREC Genomics Track: 
         A set of data collections provided by the TREC Genomics Track, useful for developing and evaluating retrieval and text categorization strategies in the biomedical domain. 
 
          BioCreative corpus: 
          Dataset produced by the BioCreative assessment: text passages relevant for GO annotations of human proteins. 
 
          GENIA corpus: 
          Annotated corpus of literature related to the MeSH terms: Human, Blood Cells, and Transcription Factors.
 
          Yapex corpus: 
          Training and test data for the protein tagger (NER) YAPEX.
 
          PASBio: 
          Predicate-argument structures of biomedical literature.
 
          LLL05 dataset: 
         Genic Interaction Extraction Challenge: information extraction data set of protein/gene interactions.
 
          IEPA corpus: 
          The Interaction Extraction Performance Assessment corpus 
 
          BioText Data: 
          Dataset for extracting relations between disease and treatment entities.
 
          BioText NC Semantics Dataset: 
          Dataset of Noun Compound Semantics used in experiments described in articles 
 
          PennBioIE: 
          UPenn Biomedical Information Extraction datasets of annotated PubMed abstracts covering the CYP450 and oncology domains.
 
          Medstract corpus: 
          Biomedical annotation corpus useful for acronym definition and coreference resolution 
 
          OHSUMED text collection: 
         Document collection used for the TREC-9 contest.
 
          BMC corpus: 
         Open access corpus of full text articles provided by BioMed Central.
 
          FetchProt corpus: 
         Full text journal articles from the biological domain analyzed for experiments on proteins.
 
          PDG Bio-sentence splitter corpus: 
         Small collection of text data sets derived from PubMed abstracts for developing and assessing sentence splitting tools.
 
          Bio1 corpus: 
         Annotated corpus in the same field as GENIA, but annotated against a small top-level ontology.
 
 
 