SMBM Talks SMBM, Cambridge, April 11-13 (Edinburgh May 2) NLP for Biomedical Text Mining.


1 SMBM Talks SMBM, Cambridge, April 11-13 (Edinburgh May 2) NLP for Biomedical Text Mining

2 Resources and Tools for Biomedical Text Mining
Junichi Tsujii (U of Tokyo)
Keywords: GENIA corpus; annotation
Main point: progress in text mining depends on integrating the growing GENIA annotation (e.g. coreference) with lexical resources for domain knowledge (ontologies) and with software development.
Take home message: see main point above

3 annotated corpus
- POS
- NER
- coreference (670 abstracts, Singapore)
- interaction (biological events; cooperation with CNRS)
- parse trees (1.5 million GENIA abstracts parsed in 10 days using a 100 PC cluster)
ontology
- top nodes: substance; source; other
software development
- POS tagger
- NER tagger
- parser
- IR system (Medusa)
- IE system (event extraction: relation gene/disease)

4 POS tagger
- MaxEnt model (Kazama and Tsujii 2003, 2005)
- Trained on WSJ (>39,000 sent.) and GENIA (18,500 sent.)
- Accuracy (train: WSJ, GENIA, WSJ+GENIA; test: WSJ, GENIA): 97.0 84.3 75.2 96.9 98.1
NER tagger
- combines a rule-based and statistical approach
- on BioNLP: 70.8% (?) -- our system got 70.1%

5 HPSG-based parser (Enju)
- see Miyao et al. ACL05
- available on website
- XML output
- dependency relations
- predicate-argument accuracy: PTB: prec=88.3%, rec=87.2%; GENIA: lower...
gene/disease relation extraction
- pred/arg works better than bag of words or local context (gives best precision)

6 Recognising noun phrases in biomedical text: an evaluation of lab prototypes and a commercial chunker
J. Wermter, J. Fluck, J. Stroetgen, S. Geissler, U. Hahn (U. Jena, Temis)
Keywords: chunking, portability
Main point: take several existing chunkers trained on (or developed for) newspaper text and evaluate their performance on biomedical data (beta version of GENIA syntactic annotation).
Take home messages:
- overall performance drop (~3-6 points) for ML systems when shifting to bio domain
- no significant difference between statistical and rule-based systems

7 Three statistical chunkers:
- YamCha (support vector machine)
- Tbl (transformation-based error-driven learning)
- BoSS (boundaries predictor by combining observed probabilities of NP boundaries and POS patterns in trainset)
One rule-based commercial system: Temis
1. Uses words rather than GENIA POS tags
2. Computes morphological information (XeLDA toolkit)
3. HMM POS tagger disambiguates chain of POS tags
- hand-coded grammar had to be modified (on PTB)
- tagset had to be translated (not straightforward)

8 Training and Test Sets
Train
- sections 15-18 of Penn Treebank (over 200,000 tokens, POS-tagged and IOB-chunked)
Test
- GENIA treebank (beta version): 200 MedLine abstracts with syntactic annotation
- the GENIA treebank was automatically converted into the IOB format
- just under 45,000 tokens
- ~11,000 = devtest for setting Temis’ IOB output
- ~34,000 = actual test set
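The slide says the GENIA treebank was converted automatically into the IOB format but does not describe the conversion. A minimal sketch of the idea, assuming the noun phrases are already available as token spans and assuming the IOB2 variant (B-NP on the first token of every chunk); the function name and the example sentence are hypothetical:

```python
def spans_to_iob(tokens, np_spans):
    """Tag each token as B-NP (starts a chunk), I-NP (inside one) or O.

    np_spans holds (start, end) token offsets, end exclusive, for
    non-overlapping base noun phrases (an assumption of this sketch).
    """
    tags = ["O"] * len(tokens)
    for start, end in np_spans:
        tags[start] = "B-NP"
        for i in range(start + 1, end):
            tags[i] = "I-NP"
    return list(zip(tokens, tags))


# Hypothetical example sentence with two base noun phrases.
tokens = ["The", "IL-2", "gene", "is", "expressed", "in", "T", "cells"]
np_spans = [(0, 3), (6, 8)]
for token, tag in spans_to_iob(tokens, np_spans):
    print(token, tag)
```

Chunkers trained on the Penn Treebank sections can then be scored token by token, or chunk by chunk, against tags produced this way.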

9 Results and Errors
          PTB Corpus             GENIA Corpus
          Rec    Prec   F        Rec    Prec   F
YamCha    94.29  94.15  94.22    89.00  89.30  89.15
BoSS      89.92  90.10  90.01    86.46  86.84  86.65
Tbl       92.27  91.80  92.03    86.31  85.49  85.90
Temis     86.94  86.29  86.61    87.14  85.34  86.23
After domain adaptations
Temis     91.24  90.59  90.91
BoSS      87.25  89.19  88.21
Errors
- Coordination
- bracketed elements...
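Each result on this slide is a recall, precision, F triple. Assuming F is the balanced F-score (the harmonic mean of recall and precision), the third figure can be recomputed from the first two. A small sketch using the two "after domain adaptations" rows; the function name is hypothetical:

```python
def f_score(recall, precision):
    """Balanced F-score: harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Rows reported on the slide after domain adaptation (Rec, Prec, F).
print(round(f_score(91.24, 90.59), 2))  # Temis: 90.91
print(round(f_score(87.25, 89.19), 2))  # BoSS: 88.21
```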

10 Automatic Term List Generation for Entity Tagging
Ted Sandler, Andrew Schein, and Lyle Ungar (CS, UPenn)
Keywords: NER, automatic gazetteer creation
Main point: term lists can be obtained automatically and, when integrated in a NER (gene) tagger (CRF), boost its performance to a level comparable with hand-modelled lists
Take home messages:
- unsupervised gazetteer creation is feasible and useful
- supervised methods for obtaining terms outperform unsupervised methods

11 4 related methods for generating term lists; they differ wrt (see table):
- word representation
- clustering algorithms to partition the words
- choice of feature weighting
Overall Approach
- choose set of vocabulary items (nouns) to partition into classes
- choose set of useful syntactic relations: frequent, informative, relatively noise-free
- parse corpus to extract relations and collect statistics
- use clustering algorithm to partition the vocabulary
- resulting partitions are term lists

12 Representation of the base vocabulary
- vector space, where each item is represented by the set of syntactic configurations it occurs in
- affinity matrix, where each item is represented by its similarities to the other items in the vocabulary
Weighting Schemes
- Pearson’s chi-square test
- Generalized Likelihood Ratio (G-square; Dunning 1993; better with sparse data)
- first better at “common sense” generalisations; second better at domain-specific generalisations
Clustering Algorithms
- kmeans clustering for words in vector space (high recall)
- agglomerative clustering for data in affinity matrix (high prec)
Corpus
- 15,000 sentences from BioCreative + 1,800,547 Medline abs parsed using Minipar; vocabulary = 7782 single token nouns
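The two weighting schemes can be compared on a single word/context pair. The sketch below assumes the statistics are computed on a 2x2 contingency table of co-occurrence counts (word vs. not word, syntactic context vs. not context) with non-zero row and column totals; the slides do not say how the authors set up the counts, and the function names and example counts are hypothetical:

```python
import math


def expected_counts(table):
    """Expected cell counts under independence for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]


def pearson_chi_square(table):
    """Pearson's chi-square: sum of (observed - expected)^2 / expected."""
    exp = expected_counts(table)
    return sum((table[i][j] - exp[i][j]) ** 2 / exp[i][j]
               for i in range(2) for j in range(2))


def g_square(table):
    """Log-likelihood ratio G^2 = 2 * sum(observed * ln(observed / expected)).

    Cells with an observed count of zero contribute nothing to the sum.
    """
    exp = expected_counts(table)
    return 2 * sum(table[i][j] * math.log(table[i][j] / exp[i][j])
                   for i in range(2) for j in range(2) if table[i][j] > 0)


# Hypothetical counts: rows = (word, other words),
# columns = (in this syntactic context, in other contexts).
counts = [[30, 70], [120, 9780]]
print(pearson_chi_square(counts), g_square(counts))
```

Either score could then serve as the weight of that context in the word's vector before clustering.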

13 NER (Gene) Tagging
- McDonald and Pereira’s CRF tagger
- automatically generated 2,164 overlapping term lists incorporated as features in the model
- binary feature (0/1) for each term list (in=1; not=0)
- baseline tagger without lists
- tagger augmented with hand-compiled lists of genes (57,563)
- tagger augmented with large list of genes obtained via supervised learning (Tanabe and Wilbur Gene.Lexicon: 1,145,913)
TRAIN/TEST: 5-fold Xvalidation on 394,661 words of BioCreative (1/5 for training and 4/5 for testing)
               prec    rec     f-score
Baseline       0.698   0.613   0.653
Unsupervised   0.705   0.622   0.661
Supervised     0.709   0.621   0.662
Manual         0.716   0.631   0.671
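The binary term-list features can be sketched as a membership lookup: one 0/1 feature per list, set to 1 when the token is in that list. This is only an illustration; the list names, the example terms, and the lowercasing step are assumptions, and the slides do not say how the features were wired into the CRF tagger:

```python
def term_list_features(token, term_lists):
    """Return one binary feature per term list (in=1; not=0)."""
    word = token.lower()  # case folding is an assumption of this sketch
    return {"in_" + name: int(word in terms)
            for name, terms in term_lists.items()}


# Hypothetical, overlapping term lists produced by clustering.
term_lists = {
    "list_0017": {"kinase", "phosphatase"},
    "list_0042": {"kinase", "receptor"},
    "list_0310": {"patient", "cohort"},
}
print(term_list_features("Kinase", term_lists))
```

Because the lists overlap, a single token can switch on several features at once.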

14 Protein-Protein Interaction Extraction: A Supervised Learning Approach
J. Xiao, J. Su, G. Zhou, C. Tan (Institute for Infocomm Research, Singapore)
Keywords: relation extraction
Main point: a MaxEnt approach to protein-protein relation extraction that exploits simple local features performs better than co-occurrence and rule-based approaches, achieving nearly 94% recall and 88% precision on 303 MedLine abstracts.
Take home message: supervised learning with shallow features works well for protein-protein interaction extraction

15 Task: extract pairs of interacting proteins
- no direction
- perfect NER (manual annotation)
Procedure
- tokenisation and morphological analysis
- POS tagging
- NER
- sentence analysis (parsing)
- coreference resolution (including abbreviations and aliases)
- MaxEnt classifier

16 Features
Words
- all words that appear in the two protein names
- words in between the two protein names
- previous/next words in an n-word window (unordered)
Overlap
- number of protein names in between the 2 protein names
Keywords
- occurrence of a word from the keyword list in the surroundings
Chunks
- all heads of base phrases in between the 2 protein names
- all heads surrounding the protein name pair
- all phrase types between the 2 protein names
Parse Tree
Dependency Tree
- dependency between the two proteins
Pair of heads of protein names
Pair of abbreviations of two proteins
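The shallow features (words, overlap, keywords) could be extracted for one candidate protein pair roughly as follows. This is a sketch under stated assumptions, not the authors' implementation: protein names are given as token spans, the feature names, window size, and keyword list are hypothetical, and the chunk, parse-tree, and dependency features are left out:

```python
def pair_features(tokens, proteins, i, j, keywords, window=2):
    """Shallow features for the protein pair (proteins[i], proteins[j]).

    proteins holds (start, end) token spans, end exclusive, and
    proteins[i] is assumed to precede proteins[j] in the sentence.
    """
    (s1, e1), (s2, e2) = proteins[i], proteins[j]
    between = tokens[e1:s2]
    before = tokens[max(0, s1 - window):s1]
    after = tokens[e2:e2 + window]

    feats = {}
    for w in tokens[s1:e1] + tokens[s2:e2]:
        feats["name=" + w.lower()] = 1       # words inside the two names
    for w in between:
        feats["between=" + w.lower()] = 1    # words between the names
    for w in before + after:
        feats["window=" + w.lower()] = 1     # unordered context window
    # Overlap: number of other protein names lying between the pair.
    feats["overlap"] = sum(1 for s, e in proteins if s >= e1 and e <= s2)
    # Keyword: any keyword-list word in the surroundings of the pair.
    feats["keyword"] = int(any(w.lower() in keywords
                               for w in before + between + after))
    return feats


# Hypothetical sentence with three annotated protein names.
tokens = ["IL-2", "binds", "the", "IL-2R", "and", "activates", "STAT5", "rapidly"]
proteins = [(0, 1), (3, 4), (6, 7)]
keywords = {"binds", "activates", "interacts"}
feats = pair_features(tokens, proteins, 0, 2, keywords)
print(feats["overlap"], feats["keyword"])
```

A dictionary of this shape is the usual input to a MaxEnt (logistic regression) classifier after vectorisation.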

17 Experiment and Results
- corpus: IEPA (Iowa University): 303 Medline abstracts, 633 positive instances, 1080 negative instances
- POS tagger trained on GENIA using an HMM model
- Collins’ parser
- 10-fold Xvalidation
- best result: rec=93.9%; prec=88%; f=90.9
GOOD Features
- words (esp. surrounding)
- chunks
- pairs of protein heads
- pairs of abbreviations
- keywords (so and so)
NOTSOGOOD Features
- overlap
- parse trees
- dependency relations

18 Challenges of Information Mining in a Pharmaceutical Environment
Philippe Sanseau (GlaxoSmithKline, UK)
Main point:
Q: How do you see the role of NLP in your field?
A: Excuse me, could someone explain what NLP is, please.
Take home question: are NLP and pharmaceutical communities on the same track?

