The BIOCREATIVE Task in SEER. Outline: Background for biomedical information extraction and BIOCREATIVE; BIOCREATIVE NER Task; Stanford-Edinburgh System.


1 The BIOCREATIVE Task in SEER

2 Outline
Background for biomedical information extraction and BIOCREATIVE
BIOCREATIVE NER Task
Stanford-Edinburgh System
Problems

3 Terms and Resources
Gene: An ordered sequence of nucleotides that encodes a product such as a protein.
Protein: Gene products; composed of chains of amino acids; have sophisticated structures; kinases, enzymes, etc. are types of proteins.
Nucleotide: Thousands of nucleotides link to form a DNA/RNA molecule.
Molecular Biology: Branch of biology studying all of the above.
MEDLINE: The primary research database of the biomedical community, from nursing to drugs to genetics.
Gene Databases: FlyBase, MGI (mouse), Saccharomyces Genome Database (yeast).
Other Databases: Swiss-Prot (amino acid sequences of proteins), GenBank (nucleotide sequences of genes).

4 Biotechnology Information Explosion (source: David Landsman, NCBI presentation)

5 NER in the Biomedical Domain
Many types of entities can be studied in the biomedical domain (drug names, chemicals).
Much research has focused on molecular biological entities, particularly genes and proteins.

6 Gene Names
Genes and gene products are constantly being discovered and new names invented.
Nomenclatures exist but vary from organism to organism.
Diverse:
– bride of frizzled disco, cheap date, broken heart
– REP2, RFM
Ambiguous:
– With other genes
– Acronyms
– With proteins, where genes and their products are often referred to by the same name (the 1st gene in LocusLink is officially alpha-1-B-glycoprotein)

7 Varying Tasks, Results and Evaluation Methods
F-Score | Evaluation Corpus | Publication
0.92/Gene | 750 sentences from FlyBase in which each gene is referred to by its official, single-word name; only sentences containing at least 2 gene mentions found in the dictionary were kept; all articles concern Drosophila melanogaster | Proux et al …
…/Protein | 30 abstracts on SH3 protein | Fukuda et al 1998 (KeX)
0.92/Protein | SWISSPROT annotations on Transpath database | Hanisch et al …
…/DNA, 0.72/Protein | 100 MEDLINE abstracts | Nobata et al …
…/Protein | 99 MEDLINE abstracts | Eriksson et al 2002 (Yapex)
0.76/Protein, 0.03/RNA | 100 MEDLINE abstracts | Collier et al …
… – 24 classes | GENIA corpus | Kazama et al …
…/Protein Molecule | GENIA corpus | Yamamoto et al 2003

8 BIOCREATIVE Motivations
Seeking to be the MUC (Message Understanding Conference) of the biomedical information extraction field.

9 The BIOCREATIVE NER Task
Given a single sentence from an abstract, identify all mentions of genes (or proteins where there is ambiguity).
In November, the task was changed to identifying all mentions of genes and proteins, without distinguishing between them.

10 The BIOCREATIVE NER Data
Data Set | Sentences | Words | Genes
Training | … | … | …
Development | 2500 | 70,… | …
Evaluation | … | … | …
Data consisted of MEDLINE abstracts annotated for the single NE GENE.

11 The BIOCREATIVE NER Evaluation Method
Only exact matches to the gold standard (which includes alternate correct boundaries for several cases) are counted as correct.
Genes detected with incorrect boundaries are doubly penalized: once as a false negative and once as a false positive.
Example:
chloramphenicol acetyl transferase reporter gene (FN)
transferase reporter gene (FP)
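The double penalty described on this slide can be illustrated with a minimal sketch of exact-match scoring. The span representation (token offsets with an exclusive end) and the function name `score_exact_match` are assumptions made for illustration, and the sketch ignores the alternate correct boundaries that the official gold standard allows.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical span representation: (start_token, end_token), end exclusive.
Span = Tuple[int, int]


def score_exact_match(gold: Iterable[Span], predicted: Iterable[Span]) -> Dict[str, float]:
    """Score predictions against gold spans, counting only exact boundary matches."""
    gold_set, pred_set = set(gold), set(predicted)
    tp = len(gold_set & pred_set)   # identical boundaries
    fp = len(pred_set - gold_set)   # predicted, but not an exact gold span
    fn = len(gold_set - pred_set)   # gold span that was not reproduced exactly
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f_score": f_score}


# Gold: "chloramphenicol acetyl transferase reporter gene" = tokens 0..4.
# Predicted: "transferase reporter gene" = tokens 2..4.
result = score_exact_match(gold=[(0, 5)], predicted=[(2, 5)])
print(result["fp"], result["fn"])  # one boundary error costs one FP and one FN
```

A single boundary error therefore lowers both precision and recall, which is why boundary quality weighs so heavily under this metric.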

12 Outline
Background for BIOCREATIVE and biomedical information extraction
BIOCREATIVE NER Task
Stanford-Edinburgh System
Problems

13 Baseline System
Maximum Entropy Tagger in Java
Based on Klein et al (2003) CoNLL submission
Baseline Performance: Precision 0.79, Recall 0.74, F-Score 0.76
Efforts were mostly in trying different features, including different POS taggers, NP-chunking, parsing, gazetteers, Web, abbreviations, word shapes, tokenization…
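Word shapes are one of the feature families named on this slide. The slides do not say how the system computed them, so the following is only a sketch of one common formulation: map uppercase letters to `X`, lowercase letters to `x`, digits to `d`, and optionally collapse repeated symbols. The function name `word_shape` and the `collapse` parameter are hypothetical.

```python
def word_shape(token: str, collapse: bool = True) -> str:
    """Map a token to a coarse orthographic shape (an assumed formulation)."""
    shape = []
    for ch in token:
        if ch.isupper():
            symbol = "X"
        elif ch.islower():
            symbol = "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch  # keep punctuation such as '-' or '/' unchanged
        # When collapsing, skip a symbol that repeats the previous one.
        if collapse and shape and shape[-1] == symbol:
            continue
        shape.append(symbol)
    return "".join(shape)


for tok in ["rLHR-t653", "REP2", "neurofascin", "Y-8004"]:
    print(tok, word_shape(tok), word_shape(tok, collapse=False))
```

Under this formulation a lowercase gene name such as neurofascin receives the same shape as any ordinary lowercase word, so a shape feature alone cannot separate the two.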

14 Feature Set

15 Features – External
Gazetteers: 1,731,581 entries, adapted from LocusLink, Gene Ontology and BIOCREATIVE data
ABGENE: A transformation-based NE tagger based on gazetteers and pattern matching
GENIA: Biomedical corpus using a different tag set consisting of 37 Named Entities
Web Test: Initial tagger output submitted to the Web in patterns such as "X gene"

16 Postprocessing
Discarded results with mismatched parentheses.
Different boundaries were detected when searching the sentence forwards versus backwards.
Unioned the results of both; in cases where boundary disagreements meant that one detected gene was contained in the other, we kept the shorter gene.
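The postprocessing steps on this slide might be sketched as follows. This is an illustration under assumptions, not the system's actual code: spans are token offsets with an exclusive end, only round parentheses are checked, the parenthesis filter is applied before the containment rule, and the names `parens_balanced` and `postprocess` are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical span representation: (start_token, end_token), end exclusive.
Span = Tuple[int, int]


def parens_balanced(tokens: Sequence[str]) -> bool:
    """Return True if round parentheses in the tokens open and close correctly."""
    depth = 0
    for tok in tokens:
        for ch in tok:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:  # a ')' appeared before its '('
                    return False
    return depth == 0


def postprocess(tokens: Sequence[str],
                forward: Sequence[Span],
                backward: Sequence[Span]) -> List[Span]:
    """Union forward and backward results, then apply the two slide rules."""
    # Rule 1: discard candidates with mismatched parentheses.
    candidates = {s for s in set(forward) | set(backward)
                  if parens_balanced(tokens[s[0]:s[1]])}
    # Rule 2: if one span contains another, keep only the shorter one.
    kept = [s for s in candidates
            if not any(t != s and s[0] <= t[0] and t[1] <= s[1]
                       for t in candidates)]
    return sorted(kept)


tokens = "the rat LH / CG receptor ( rLHR ) was used".split()
forward = [(1, 6), (7, 8)]    # "rat LH / CG receptor", "rLHR"
backward = [(2, 6), (6, 8)]   # "LH / CG receptor", "( rLHR"
print(postprocess(tokens, forward, backward))
```

In this example "( rLHR" is dropped for its unmatched parenthesis, and "rat LH / CG receptor" is dropped because it contains the shorter "LH / CG receptor".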

17 Final System and Results
System | Precision | Recall | F-Score
Closed | … | … | …
Open | … | … | …
Preliminary Best-Closed | … | … | …
Preliminary Best-Open | … | … | …
Trained on training+development data (1000 sentences)
1,247,775 features

18 Outline
Background for BIOCREATIVE and biomedical information extraction
BIOCREATIVE NER Task
Stanford-Edinburgh System
Problems

19 Performance Discrepancy
C&C | Precision | Recall | F-Score
CoNLL | … | … | …
BIOCREATIVE | … | … | …

Klein et al | Precision | Recall | F-Score
CoNLL | … | … | …
BIOCREATIVE | … | … | …

20 Gene Entity Pitfalls
Language is complex:
Stably transfected human kidney 293 cells expressing the wild type rat LH / CG receptor ( rLHR ) or receptors with C-terminal tails truncated at residues 653, 631, or 628 (designated rLHR-t653, rLHR-t631, and rLHR-t628 ) were used to probe the importance of this region on the regulation of hormonal responsiveness.
Gene names are frequently uncapitalized:
The chick axon-associated surface glycoprotein neurofascin is implicated in axonal growth and fasciculation as revealed by antibody perturbation experiments.
Looking weird is not indicative:
A newly synthesized anti-inflammatory agent, Y-8004 demonstrated a greater inhibition than did indomethacin ( IM ) on inflammatory response such as ultraviolet erythema in guinea pigs, carrageenin edema, evans blue and carrageenin-induced pleuritis and acetic acid-induced peritonitis in rats.

21 Boundary Problems
Gene names can be long and complex.
37% of our false positives and 39% of false negatives were boundary problems.
Gold: chloramphenicol acetyl transferase reporter gene
chloramphenicol acetyl transferase reporter gene deletion
Gold: estrogen receptor
estrogen receptor ligand
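The slides do not state how an error was classified as a boundary problem. One plausible reading, sketched here purely as an assumption, is that a false positive (or false negative) counts as a boundary problem when it overlaps some gold (or predicted) span without matching it exactly. The function names below are hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical span representation: (start_token, end_token), end exclusive.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """True if two half-open token spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def boundary_error_rates(gold: Iterable[Span],
                         predicted: Iterable[Span]) -> Tuple[float, float]:
    """Return (share of FPs, share of FNs) that overlap a span on the other side."""
    gold_set, pred_set = set(gold), set(predicted)
    false_pos = pred_set - gold_set
    false_neg = gold_set - pred_set
    # Assumed definition: an error is a boundary problem if it overlaps
    # any span on the other side without being an exact match.
    fp_boundary = [p for p in false_pos if any(overlaps(p, g) for g in gold_set)]
    fn_boundary = [g for g in false_neg if any(overlaps(g, p) for p in pred_set)]
    fp_rate = len(fp_boundary) / len(false_pos) if false_pos else 0.0
    fn_rate = len(fn_boundary) / len(false_neg) if false_neg else 0.0
    return fp_rate, fn_rate


gold = [(0, 5), (10, 12), (20, 21)]
predicted = [(0, 6), (10, 12), (30, 31)]  # (0, 6) overshoots gold (0, 5) by one token
print(boundary_error_rates(gold, predicted))
```

Here one of the two false positives and one of the two false negatives come from the same overshooting span, mirroring the "reporter gene deletion" case.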

22 Interannotator Agreement
MUC-7 interannotator agreement was measured at 97% F-Score.
Demetriou and Gaizauskas: interannotator agreement for biomedical terms at 89% F-Score.
Hirschman measured interannotator agreement for gene names at 87% F-Score.

