Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo.

Presentation transcript:

Natural Language Tools and Resources for Biomedical Information Extraction Yoshimasa Tsuruoka Tsujii laboratory University of Tokyo

Outline NLP resources for bioNLP –GENIA corpus NLP tools –Machine learning Maximum entropy modeling for feature forest Maximum entropy modeling with inequality constraints –Part-of-speech tagger –Chunker (shallow parser) –HPSG Parser Applications of NLP –Extracting disease-gene relationships from MEDLINE abstracts

Application of NLP to the Biomedical domain Plenty of text –MEDLINE database: 12 million abstracts –Need for effective IE and IR Domain knowledge –Gene ontology, KEGG, UMLS, ICD, … Other information sources –Molecular databases: DNA sequences, motifs, diseases, molecular interactions, etc.

Developing NLP resources Resources for NLP research –Domain knowledge –Training data for ML-based techniques –Test data for evaluating the transferability of a system GENIA resources –Ontology –Corpus

GENIA corpus 4,000 MEDLINE abstracts –Selected by MeSH Terms (Human, Blood cells, Transcription factors) XML format Contents –Named-entity (Kim et al 2003) –Part-of-speech (Tateisi et al 2004) –Parse tree –Co-reference (Institute of Infocomm Research, Singapore)

GENIA part-of-speech corpus Each token is annotated with its part-of-speech tag. Size –2,000 abstracts –20,544 sentences –501,054 words (about half the size of Penn Treebank)
Example: The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS …

GENIA named-entity corpus Terms are annotated based on the semantic classes in the GENIA ontology. Size –2,000 abstracts –Number of terms: 92,723 –Vocabulary size: 36,568
Example: The peri-kappa B site mediates human immunodeficiency virus type 2 enhancer activation in monocytes … (semantic classes marked in the example: DNA, virus, cell_type)

GENIA treebank Based on the standard of the Penn Treebank. Size –500 abstracts –(1,500 abstracts by the end of this summer)
Example: CD3-epsilon expression is controlled by a downstream T lymphocyte-specific enhancer element
[Figure: parse tree of the example sentence, with nodes NP, ADJP, NP, PP, VP, S]

Event annotation
Few known genes (IL-2, members of the IL-8 family, interferon-gamma) are induced in T cells only through the combined effect of phorbol myristic acetate (PMA) and a Ca(2+)-ionophore, and expression of only these genes can be fully suppressed by Cyclosporin A (CyA).
[Figure, built up over three slides: event annotation of the sentence above. Roles: Target, Interaction, Agent, Location. Annotated entities: T cell; IL-2, IL-8 family, interferon-gamma (IFN-γ); PMA, Ca(2+)-ionophore (Ca(2+)-i); CyA.]

GENIA corpus Used in more than 240 institutions –Japan (28), Asia (54), North America (63), Europe (62), etc. De facto standard for evaluating biomedical named-entity recognition systems –BioNLP workshop at Coling 2004 Named-entity recognition shared task –Institute for Infocomm Research (Singapore), –Stanford University (USA), –University of Edinburgh (UK), –University of Wisconsin-Madison (USA), –Pohang University of Science and Technology (Korea), –University of Alberta (Canada), –University Duisburg-Essen (Germany), –Korea University (Korea), –National Taiwan University (Taiwan)

NLP tools Biomedical text mining –Huge amount of text. Machine learning –Training set can be very large. –Efficient training algorithms. Taggers (and parsers) –Decoding should be fast.

Machine learning Supervised learning –learns the rules for classifying samples into predefined classes by seeing a large number of training samples with class labels. Algorithms –Naïve Bayes, Decision Tree, SVMs, AdaBoost, Perceptron, Random forests, Maximum Entropy, etc...

Maximum entropy learning Log-linear modeling Maximum likelihood estimation –determines the parameters so that they maximize the likelihood of the training data [Equation labels: feature function, feature weight]
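The labels "feature function" and "feature weight" refer to the terms of a log-linear model. For reference, the standard conditional form is sketched below; the symbols (x for the observation, y for the class, f_i for feature functions, λ_i for feature weights, Z for the normalizer) are chosen here for illustration and are not taken from the slide.

```latex
p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big),
\qquad
Z_\lambda(x) = \sum_{y'} \exp\Big( \sum_i \lambda_i f_i(x, y') \Big)

\hat{\lambda} = \arg\max_{\lambda} \sum_{k} \log p_\lambda(y_k \mid x_k)
```

The second line is maximum likelihood estimation over training pairs (x_k, y_k), without any regularization term.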

Maximum entropy modeling with inequality constraints (Kazama and Tsujii 2003) Advantages over the standard ME modeling. –Good regularization effects (as good as Gaussian prior) –Sparse solution C++ implementation –offers fast training. –can be used as a library. –can incorporate the model into your source code. The C++ library is used in many NLP programs (e.g. POS tagger, chunkers, IE modules)

Part-of-speech tagging A PoS tagger annotates each token with its part-of-speech tag.
Example: The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS …

Chunking (shallow parsing) A chunker (shallow parser) segments a sentence into non-recursive phrases.
Example: [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September].

Chunking (shallow parsing) Chunking tasks can be converted into a standard tagging task.
Example: He/B-NP reckons/B-VP the/B-NP current/I-NP account/I-NP deficit/I-NP will/B-VP narrow/I-VP to/B-PP only/B-NP #/I-NP 1.8/I-NP billion/I-NP in/B-PP September/B-NP .
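The conversion from chunks to per-token tags can be sketched as follows. This is a minimal Python illustration, not the author's chunker: the function name `chunks_to_iob`, the `(label, tokens)` input format, and the hyphenated tag spelling (`B-NP`, `I-NP`, `O`) are assumptions made for the example.

```python
def chunks_to_iob(chunks):
    """Convert (label, tokens) chunks into per-token IOB tags.

    A label of None marks tokens outside any chunk; they get the tag "O".
    """
    tagged = []
    for label, tokens in chunks:
        for position, token in enumerate(tokens):
            if label is None:
                tag = "O"                # outside any chunk
            elif position == 0:
                tag = "B-" + label       # first token of a chunk
            else:
                tag = "I-" + label       # token inside a chunk
            tagged.append((token, tag))
    return tagged


# The example sentence from the slide, written as a list of chunks.
sentence = [
    ("NP", ["He"]),
    ("VP", ["reckons"]),
    ("NP", ["the", "current", "account", "deficit"]),
    ("VP", ["will", "narrow"]),
    ("PP", ["to"]),
    ("NP", ["only", "#", "1.8", "billion"]),
    ("PP", ["in"]),
    ("NP", ["September"]),
    (None, ["."]),
]
iob = chunks_to_iob(sentence)
```

Once every token carries one tag, any sequence tagger (for instance a PoS-style tagger) can be trained on the result.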

Sequential Classification Approaches Sequence tagging tasks –Find the tag sequence that maximizes the probability of the tags given the observation (e.g. words) –Left-to-right decomposition (with the first-order Markov assumption) –Right-to-left decomposition (with the first-order Markov assumption) –Each factor of the decomposition is a classification problem
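The two decompositions can be written out in standard notation. The symbols below (t_i for the i-th tag, o for the observation, n for the sentence length) are assumptions chosen for illustration and may differ from the notation on the original slide.

```latex
% Left-to-right decomposition, first-order Markov assumption
P(t_1 \dots t_n \mid o) \approx \prod_{i=1}^{n} p(t_i \mid t_{i-1}, o)

% Right-to-left decomposition, first-order Markov assumption
P(t_1 \dots t_n \mid o) \approx \prod_{i=1}^{n} p(t_i \mid t_{i+1}, o)
```

Each local factor p(t_i | ·, o) can be estimated by a classifier such as a maximum entropy model, which is what turns sequence tagging into a series of classification problems.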

Bidirectional Inference Possible decomposition structures [Figure: four decomposition structures, (a) to (d), over the tags t1, t2, t3] Bidirectional inference algorithm (Tsuruoka et al.) –We can find the best structure and tag sequences in polynomial time

State-of-the-art PoS taggers
Tagging speed and accuracy on Penn Treebank

Tagger                   Tagging speed     Accuracy
Dependency Net (2003)    Very slow         97.24
Perceptron (2002)        ?                 97.11
SVM (2003)               Fast              97.05
HMM (2000)               Extremely fast    96.48
Bidirectional MEMM       Very fast         97.10

State-of-the-art Chunkers
Chunking speed and accuracy on Penn Treebank

Chunker                  Tagging speed     Accuracy
Perceptron (2003)        ?                 93.74
SVM + voting (2003)      Slow?             93.91
SVM (2000)               Fast              93.48
Bidirectional MEMM       Very fast         93.70

Named-entity recognition Recognizing named entities in text Similar to chunking –IOB tagging Named entities in the biomedical domain are long. –Sliding window
Example: The peri-kappa B site mediates human immunodeficiency virus type 2 enhancer activation in monocytes … (entity classes marked in the example: DNA, virus, cell_type)

A sliding window approach to biomedical NE recognition We want to use rich features on a term. Enumerate all sub-word sequences in a sentence. Classify them into semantic classes. [Figure: candidate windows over the words W1 W2 W3 W4]
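The enumerate-then-classify idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `max_len` cap on window length, and the toy lexicon standing in for a trained classifier are all hypothetical and are not part of the system described in the slides.

```python
def enumerate_spans(tokens, max_len):
    """Return every contiguous window of 1..max_len tokens as (start, end, words)."""
    spans = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            spans.append((start, end, tokens[start:end]))
    return spans


def classify_spans(tokens, classify, max_len=5):
    """Run a classifier on every window and keep those given a semantic class."""
    entities = []
    for start, end, words in enumerate_spans(tokens, max_len):
        label = classify(words)          # None means "not a named entity"
        if label is not None:
            entities.append((start, end, label))
    return entities


# Hypothetical stand-in for a trained classifier: a tiny lookup table.
TOY_LEXICON = {("monocytes",): "cell_type"}


def toy_classifier(words):
    return TOY_LEXICON.get(tuple(words))


windows = enumerate_spans(["W1", "W2", "W3", "W4"], max_len=4)
found = classify_spans("enhancer activation in monocytes".split(), toy_classifier)
```

Because each window is classified as a whole, features can look at the entire candidate term, which is harder to do in token-by-token IOB tagging. In practice overlapping candidates also need to be resolved, which this sketch leaves out.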

Accuracy of biomedical NE recognition
Shared task at Coling 2004 BioNLP workshop
Columns: recall, precision, F-score
Systems compared: SVM+HMM (Zho 2004), Sliding window, MEMM (Fin 2004), CRF (Set 2004)

HPSG parsing HPSG –A few schemas –Many lexical entries –Deep syntactic analysis Grammar –Corpus-based grammar construction (Miyao et al 2004) Parser –Beam search (Tsuruoka et al.)
[Figure: HPSG analysis of "Mary walked slowly". The lexical entries carry HEAD, SUBJ, COMPS and MOD features (HEAD: noun, HEAD: verb, and HEAD: adv with MOD: verb), and the phrases are built with the head-modifier schema and the subject-head schema.]

Phrase structure
Example: The/DT company/NN is/VBZ run/VBN by/IN him/PRP
[Figure: phrase-structure tree of the example, with nodes dt, np, vp, vp, pp, np, np, pp, vp, s]

Predicate-argument structure
Example: The/DT company/NN is/VBZ run/VBN by/IN him/PRP
[Figure: the same tree (nodes dt, np, vp, vp, pp, np, np, pp, vp, s) with predicate-argument links labeled arg1, arg2 and mod]

IR search engine using predicate-argument structures

Feature forest model (Miyao and Tsujii 2002) A maximum entropy model is defined for the entire tree structure –e.g. HPSG parse trees Exponentially many trees are represented with a packed forest of polynomial size. The probability of each tree is estimated without unpacking the feature forest.
[Figure: a feature forest with nodes S, NP 1, NP 2, VP 1, VP 2, annotated with the number of trees and the forest size]

Automatic Generation of Spelling Variants
Each generated variant is associated with its generation probability.
Input to the variant generator: NF-Kappa B
Generated variants:
NF-Kappa B (1.0)
NF Kappa B (0.9)
NF kappa B (0.6)
NF kappaB (0.5)
NFkappaB (0.3)
:

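Generation Algorithm
Recursive generation: each new variant is derived from an already generated variant by one operation, and its probability is the probability of the source variant multiplied by the operation probability, P(variant) = P(source) × P_op.
Example: T cell (1.0) → T-cell (0.5), T cells (0.2) → T-cells (0.1)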
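The recursion can be sketched in Python. This is a minimal illustration, not the author's generator: the two operations and their probabilities are hypothetical values picked so that the output matches the "T cell" example, and the probability threshold used to stop the recursion is an assumption.

```python
def generate_variants(term, operations, threshold=0.05, prob=1.0, best=None):
    """Recursively apply operations; a variant's probability is its source's
    probability times the operation probability. Keeps the best probability
    seen for each variant and stops below the threshold."""
    if best is None:
        best = {}
    if prob < threshold or prob <= best.get(term, 0.0):
        return best                      # too unlikely, or already reached
    best[term] = prob
    for apply_op, p_op in operations:
        variant = apply_op(term)         # None means the rule does not apply
        if variant is not None:
            generate_variants(variant, operations, threshold, prob * p_op, best)
    return best


def hyphenate(term):
    """Hypothetical substitution rule: replace the first space with a hyphen."""
    return term.replace(" ", "-", 1) if " " in term else None


def pluralize(term):
    """Hypothetical insertion rule: append a plural 's'."""
    return term + "s" if not term.endswith("s") else None


# Operation probabilities are made-up values for illustration.
OPERATIONS = [(hyphenate, 0.5), (pluralize, 0.2)]
variants = generate_variants("T cell", OPERATIONS)
```

With these toy rules, "T-cells" is reachable by two paths (hyphenate then pluralize, or the reverse), and both give the same product of 0.5 and 0.2.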

Learning Operation Rules Operations for generating variants –Substitution –Deletion –Insertion Context –Character-level context: preceding (following) two characters Operation Probability

Example of variant generation (1)
Columns: generation probability, generated variant, frequency
antiinflammatory effect (input, generation probability 1.0)
anti-inflammatory effect
antiinflammatory effects
Antiinflammatory effect
antiinflammatory-effect
anti-inflammatory effects (frequency 23)
:

Example of variant generation (2)
Columns: generation probability, generated variant, frequency
tumour necrosis factor alpha (input, generation probability 1.0)
tumor necrosis factor alpha
tumour necrosis factor-alpha
Tumour necrosis factor alpha
tumor necrosis factor alpha
Tumor necrosis factor alpha (frequency 8)
:

Domain Adaptation Newspaper articles are widely used as training data for machine learning-based NLP systems. Domain Adaptability –Part-of-speech tagging –HPSG parsing

Tagging errors by TnT tagger (Brants 2000)
… and/CC membrane/NN potential/NN after/IN mitogen/NN binding/JJ .
… two/CD factors/NNS , which/WDT bind/NN to/TO the/DT same/JJ kappa/NN B/NN enhancers/NNS …
… by/IN analysing/VBG the/DT Ag/VBG amino/JJ acid/NN sequence/NN .
… to/TO contain/VB more/RBR T-cell/JJ determinants/NNS than/IN …
Stimulation/NN of/IN interferon/JJ beta/JJ gene/NN transcription/NN in/IN vitro/NN by/IN

Accuracy of TnT tagger on the GENIA corpus
Ignoring unessential errors

Condition                  Accuracy
TnT (original)             84.4%
NNP = NN, NNPS = NNS       90.0%
LS = NN                    91.3%
JJ = NN                    94.9%

About 94% in practice
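Scoring with some tag confusions ignored amounts to mapping the merged tags to one label before comparing. The Python sketch below illustrates this on made-up data: the gold and predicted tag lists are hypothetical and do not come from the GENIA corpus, and treating the merges as cumulative is an assumption about how the figures above were computed.

```python
def accuracy(gold, predicted, merge=None):
    """Token accuracy after mapping tags through an optional merge table."""
    merge = merge or {}

    def normalize(tag):
        return merge.get(tag, tag)       # merged tags count as the same tag

    correct = sum(normalize(g) == normalize(p) for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical toy data, four tokens.
gold = ["NN", "NN", "NNS", "IN"]
predicted = ["NNP", "JJ", "NNPS", "IN"]

strict = accuracy(gold, predicted)
proper_merged = accuracy(gold, predicted, {"NNP": "NN", "NNPS": "NNS"})
all_merged = accuracy(gold, predicted, {"NNP": "NN", "NNPS": "NNS", "JJ": "NN"})
```

Each additional merge can only keep the score the same or raise it, which matches the increasing figures in the table.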

GENIA tagger An MEMM tagger trained on the WSJ and GENIA corpora. The tagger works well on both types of text.
[Table: accuracy on WSJ and on GENIA for taggers trained on WSJ, GENIA and WSJ+GENIA]

Parsing MEDLINE with the HPSG parser
Parsing accuracy on the GENIA Treebank
Columns: #sentences, LP / LR, UP / UR
All sentences: 1, / / 85.1
Covered sentences: 1, / / 88.4

Extracting Disease-Gene Associations from MEDLINE abstracts These results suggested that targeted disruption of Cyp19 caused anovulation and precocious depletion of ovarian follicles Furthermore, AML cells with methylated p15(INAK4B) tended to express higher levels of DNMT1 and 3B.

Text 1.5 million MEDLINE abstracts –Selected by MeSH Terms Disease Category AND (Amino Acids, Peptides, and Proteins OR Genetic Structures) Parsing –All the sentences were parsed by the HPSG parser –Using a PC cluster (100 processors with GXP) –Time: 10 days

Training data All foals with OLWS were homozygous for the Ile118Lys EDNRB mutation, and adults that were homozygous were not found. Dominant radial drusen and Arg345Trp EFEMP1 mutation. The 5 year overall survival (OS) and event-free survival (EFS) were 94 and 90 +/- 8%, respectively, with a median follow-up of 48 months. These data may indicate that formation of parathyroid adenoma in young patients is related to a mechanism involving EGFR. All co-occurrences are classified into relevant or irrelevant by a domain expert.

Predicate-argument features (1) Dedifferentiation of adenoid cystic carcinoma: report of a case implicating p53 gene mutation. [Figure labels: X, gene/disease, ARG2]

Predicate-argument features (2) These results suggested that targeted disruption of Cyp19 caused anovulation and precocious depletion of ovarian follicles. Furthermore, AML cells with methylated p15(INAK4B) tended to express higher levels of DNMT1 and 3B. [Figure labels: X, disease/gene, ARG2, ARG1, gene/disease]

Extraction accuracy
Training/test data: 2,253 sentences, 10-fold cross validation
Columns: features, recall, precision, f-score
Feature sets compared: N/A, bag of words, local context, predicate-argument structures

DGA explorer

Summary The GENIA corpus –Part-of-speech: 2000 abstracts –Named-entities: 2000 abstracts –Parse tree: 500 abstracts Machine learning –Maximum entropy modeling Inequality constraints Feature forests –Bidirectional inference for sequence tagging NLP tools –Part-of-speech tagger: 97.11% –Chunker: 93.7% –HPSG parser: 87.5% –Term variant generation Extracting disease-gene associations from MEDLINE

Software and resources Machine learning packages –Maximum entropy with inequality constraints –Maximum entropy for feature forests Taggers and Parsers –PoS tagger –Chunker –Named-entity tagger –HPSG parser GENIA resource –Named-entity corpus –Part-of-speech corpus –Tree corpus –Co-reference corpus (Singapore Univ.) –HPSG parsed results (100,000 MEDLINE abstracts)