Part-of-speech tagging and chunking with log-linear models University of Manchester National Centre for Text Mining (NaCTeM) Yoshimasa Tsuruoka.

Outline
– POS tagging and chunking for English: Conditional Markov Models (CMMs), Dependency Networks, Bidirectional CMMs
– Maximum entropy learning
– Conditional Random Fields (CRFs)
– Domain adaptation of a tagger

Part-of-speech tagging The tagger assigns a part-of-speech tag to each word in the sentence: The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS …

Algorithms for part-of-speech tagging Tagging speed and accuracy on WSJ:
  Method                     Speed      Accuracy
  Dependency Net (2003)      Slow       97.24
  SVM (2004)                 Fast       97.16
  Perceptron (2002)          ?          97.11
  Bidirectional CMM (2005)   Fast       97.10
  HMM (2000)                 Very fast  96.7*
  CMM (1998)                 Fast       96.6*
  * evaluated on a different portion of WSJ

Chunking (shallow parsing) A chunker (shallow parser) segments a sentence into non-recursive phrases: [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September].

Chunking (shallow parsing) Chunking tasks can be converted into a standard tagging task: He/B-NP reckons/B-VP the/B-NP current/I-NP account/I-NP deficit/I-NP will/B-VP narrow/I-VP to/B-PP only/B-NP #/I-NP 1.8/I-NP billion/I-NP in/B-PP September/B-NP . Different approaches exist as well: sliding window, semi-Markov CRF, …
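The conversion from chunks to B/I tags is mechanical; here is a minimal sketch, assuming chunks are given as (label, start, end) spans — the function name and span format are illustrative, not from the slides.

```python
# Sketch: convert chunk spans to BIO tags.
def chunks_to_bio(tokens, chunks):
    """tokens: list of words; chunks: list of (label, start, end), end exclusive."""
    tags = ["O"] * len(tokens)
    for label, start, end in chunks:
        tags[start] = "B-" + label          # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # tokens inside the chunk
    return tags

tokens = ["He", "reckons", "the", "current", "account", "deficit"]
chunks = [("NP", 0, 1), ("VP", 1, 2), ("NP", 2, 6)]
print(list(zip(tokens, chunks_to_bio(tokens, chunks))))
# [('He', 'B-NP'), ('reckons', 'B-VP'), ('the', 'B-NP'), ('current', 'I-NP'), ...]
```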

Algorithms for chunking Chunking speed and accuracy on the Penn Treebank:
  Method                     Speed  Accuracy
  SVM + voting (2001)        Slow?  93.91
  Perceptron (2003)          ?      93.74
  Bidirectional CMM (2005)   Fast   93.70
  SVM (2000)                 Fast   93.48

Conditional Markov Models (CMMs) Left-to-right decomposition (with the first-order Markov assumption): P(t_1 … t_n | o) = Π_i P(t_i | t_{i-1}, o). [figure: a chain t1 → t2 → t3, each tag conditioned on the observation o]

POS tagging with CMMs [Ratnaparkhi 1996; etc.] Left-to-right decomposition: the local classifier uses information about the preceding tag. Example: tagging "He runs fast" left to right, where "He" has already been tagged PRP and the classifier is now predicting the tag of "runs".
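A minimal sketch of this left-to-right decoding, assuming a hypothetical local classifier `local_prob` that returns a tag distribution from features; the greedy choice here is a simplification (real taggers typically keep a beam or run Viterbi over the local distributions).

```python
# Sketch of left-to-right CMM decoding with a greedy choice at each position.
# `local_prob(features)` is an assumed classifier returning {tag: probability}.
def tag_left_to_right(words, local_prob):
    tags = []
    for i, word in enumerate(words):
        feats = {
            "w": word,
            "w-1": words[i - 1] if i > 0 else "<BOS>",
            "t-1": tags[i - 1] if i > 0 else "<BOS>",  # preceding tag (first-order Markov)
        }
        dist = local_prob(feats)
        tags.append(max(dist, key=dist.get))           # pick the locally best tag
    return tags
```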

Examples of the features for local classification (e.g., tagging "runs" in "He/PRP runs fast"):
  Word unigram:      w_i, w_{i-1}, w_{i+1}
  Word bigram:       w_{i-1} w_i, w_i w_{i+1}
  Previous tag:      t_{i-1}
  Tag/word:          t_{i-1} w_i
  Prefix/suffix:     up to length 10
  Lexical features:  hyphen, number, etc.
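A sketch of these feature templates as code; the feature-name strings are illustrative, not the exact ones used in the original tagger.

```python
# Sketch of the feature templates above, for tagging position i.
def extract_features(words, prev_tag, i):
    w = words[i]
    w_prev = words[i - 1] if i > 0 else "<BOS>"
    w_next = words[i + 1] if i + 1 < len(words) else "<EOS>"
    feats = [
        "w=" + w, "w-1=" + w_prev, "w+1=" + w_next,                # word unigrams
        "w-1w=" + w_prev + "_" + w, "ww+1=" + w + "_" + w_next,    # word bigrams
        "t-1=" + prev_tag,                                         # previous tag
        "t-1w=" + prev_tag + "_" + w,                              # tag/word
    ]
    feats += ["pre=" + w[:k] for k in range(1, min(10, len(w)) + 1)]   # prefixes
    feats += ["suf=" + w[-k:] for k in range(1, min(10, len(w)) + 1)]  # suffixes
    if "-" in w:
        feats.append("has_hyphen")                                 # lexical features
    if any(c.isdigit() for c in w):
        feats.append("has_digit")
    return feats
```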

POS tagging with a Dependency Network [Toutanova et al. 2003] Uses information about the following tag as well: the following tag becomes a feature in the local classification model. The product of the local scores is no longer a proper probability. [figure: a chain t1, t2, t3 with dependencies in both directions]

POS tagging with a Cyclic Dependency Network [Toutanova et al. 2003] Training cost is small – almost equal to that of CMMs. Decoding can be performed with dynamic programming, but it is still expensive. Collusion: the model can lock onto conditionally consistent but jointly unlikely sequences.

Bidirectional CMMs [Tsuruoka and Tsujii, 2005] The chain t1, t2, t3 admits several possible decomposition structures (structures (a)–(d) in the slide, differing in the direction of the conditioning). With bidirectional CMMs, we can find the "best" structure and tag sequence in polynomial time.

Maximum entropy learning Log-linear modeling with feature functions and feature weights.
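The formulas on this slide appeared as images; the standard conditional log-linear (maximum entropy) form they refer to is:

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_i \lambda_i f_i(x, y') \Big)
```

where the f_i are the feature functions and the λ_i are the feature weights.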

Maximum entropy learning Maximum likelihood estimation: find the parameters that maximize the (log-)likelihood of the training data. Smoothing: Gaussian prior [Berger et al, 1996]; inequality constraints [Kazama and Tsujii, 2005].
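With the Gaussian prior, the objective being maximized takes the usual penalized form (a standard reconstruction, since the slide's formula was an image):

```latex
L(\lambda) = \sum_{(x, y) \in D} \log p(y \mid x; \lambda) \;-\; \sum_i \frac{\lambda_i^2}{2\sigma^2}
```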

Parameter estimation Algorithms for maximum entropy: GIS [Darroch and Ratcliff, 1972], IIS [Della Pietra et al., 1997]. General-purpose algorithms for numerical optimization: BFGS [Nocedal and Wright, 1999], LMVM [Benson and More, 2001]. You need to provide the objective function and its gradient, which require the likelihood of the training samples and the model expectation of each feature.
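The gradient has the familiar "empirical count minus model expectation" form. Below is a minimal sketch for a single-token maximum entropy classifier (prior term omitted); the function and variable names are illustrative.

```python
import math
from collections import defaultdict

# Sketch: log-likelihood and its gradient for a toy maxent classifier.
# data: list of (features, gold_label); features: list of feature strings.
# d(loglik)/d(weight[f, y]) = empirical count of (f, y) - model expectation of (f, y).
def loglik_and_gradient(data, labels, weights):
    loglik = 0.0
    grad = defaultdict(float)
    for feats, gold in data:
        scores = {y: sum(weights.get((f, y), 0.0) for f in feats) for y in labels}
        z = sum(math.exp(s) for s in scores.values())
        probs = {y: math.exp(s) / z for y, s in scores.items()}
        loglik += math.log(probs[gold])
        for f in feats:
            grad[(f, gold)] += 1.0             # empirical count
            for y in labels:
                grad[(f, y)] -= probs[y]       # model expectation
    return loglik, grad
```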

Computing likelihood and model expectation Example: two possible tags ("Noun" and "Verb") and two types of features ("word" and "suffix"), for the word "opened" in "He opened it", comparing the cases tag = Noun and tag = Verb.
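A worked toy instance of that example; the weights below are made up for illustration, since the numbers on the original slide are not recoverable.

```python
import math

# Toy instance: two tags (Noun, Verb), two feature types (word, suffix).
weights = {
    ("word=opened", "Verb"): 1.5,
    ("word=opened", "Noun"): 0.2,
    ("suffix=ed", "Verb"): 1.0,
    ("suffix=ed", "Noun"): -0.3,
}
feats = ["word=opened", "suffix=ed"]
scores = {y: sum(weights.get((f, y), 0.0) for f in feats) for y in ("Noun", "Verb")}
z = sum(math.exp(s) for s in scores.values())
print({y: math.exp(s) / z for y, s in scores.items()})
# With these weights, P(Verb | "opened") is much larger than P(Noun | "opened").
```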

Conditional Random Fields (CRFs) A single log-linear model over the whole sentence. One can use exactly the same techniques as in maximum entropy learning to estimate the parameters. However, the number of classes (possible tag sequences) is huge, so it is impossible in practice to do this naively.

Conditional Random Fields (CRFs) Solution: restrict the types of features; then a dynamic programming algorithm drastically reduces the amount of computation. Features you can use (in first-order CRFs): features defined on a single tag, and features defined on an adjacent pair of tags.
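With that restriction, the sentence-level model can be written in the standard first-order CRF form (a reconstruction, since the slide's formula was an image):

```latex
p(t_1 \ldots t_n \mid w) = \frac{1}{Z(w)} \exp\Big( \sum_{j=1}^{n} \sum_i \lambda_i f_i(t_{j-1}, t_j, w, j) \Big)
```

where Z(w) sums over all tag sequences — the quantity the dynamic program on the following slides computes.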

Features Feature weights are associated with states and edges. [figure: a Noun/Verb lattice over "He has opened it"] Example features: W0=He & Tag=Noun (a state feature); Tag_left=Noun & Tag_right=Noun (an edge feature).

A naive way of calculating Z(x) Enumerate every tag sequence over the lattice for "He has opened it" (with the two tags Noun and Verb, 2^4 = 16 sequences), compute the unnormalized score of each, and sum them (Sum = 67.5 on the slide).

Dynamic programming Results of intermediate computation can be reused. [figure: forward computation over the Noun/Verb lattice for "He has opened it"]
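A minimal sketch of this forward computation of Z(x), assuming the unnormalized state and edge scores are given; the data-structure layout is illustrative.

```python
# Sketch of the forward algorithm for Z(x) in a first-order CRF.
# state_score[j][t]: exp(state feature weights) for tag t at position j.
# edge_score[tp][t]: exp(edge feature weights) for the adjacent tag pair (tp, t).
def partition_function(n, tags, state_score, edge_score):
    alpha = {t: state_score[0][t] for t in tags}          # forward scores at position 0
    for j in range(1, n):
        alpha = {
            t: state_score[j][t] * sum(alpha[tp] * edge_score[tp][t] for tp in tags)
            for t in tags
        }
    return sum(alpha.values())   # sums over |tags|^n sequences in O(n * |tags|^2) time

tags = ("Noun", "Verb")
ones = [{t: 1.0 for t in tags} for _ in range(4)]
edge = {t: {u: 1.0 for u in tags} for t in tags}
print(partition_function(4, tags, ones, edge))  # 16.0, i.e. the 2**4 sequences, as in the naive sum
```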

Maximum entropy learning and Conditional Random Fields Maximum entropy learning: log-linear modeling + MLE; parameter estimation requires the likelihood of each sample and the model expectation of each feature. Conditional Random Fields: log-linear modeling over the whole sentence; features are defined on states and edges; dynamic programming keeps the computation tractable.

Named Entity Recognition We have shown that [interleukin-1]_protein ([IL-1]_protein) and [IL-2]_protein control [IL-2 receptor alpha (IL-2R alpha) gene]_DNA transcription in [CD4-CD8- murine T lymphocyte precursors]_cell_line.

Algorithms for Biomedical Named Entity Recognition Recall, precision, and F-score on the shared task data of the COLING 2004 BioNLP workshop, comparing:
  SVM+HMM (2004)
  Semi-Markov CRF [Okanohara et al., 2006]
  Sliding window MEMM (2004)
  CRF (2004)

Domain adaptation Large training data is available for general domains (e.g. Penn Treebank WSJ). NLP tools trained on general-domain data are less accurate on biomedical text. Developing domain-specific training data requires considerable human effort.

Tagging errors made by a tagger trained on WSJ Accuracy of the tagger on the GENIA POS corpus: 84.4%
  … and/CC membrane/NN potential/NN after/IN mitogen/NN binding/JJ .
  … two/CD factors/NNS , which/WDT bind/NN to/TO the/DT same/JJ kappa/NN B/NN enhancers/NNS …
  … by/IN analysing/VBG the/DT Ag/VBG amino/JJ acid/NN sequence/NN .
  … to/TO contain/VB more/RBR T-cell/JJ determinants/NNS than/IN …
  Stimulation/NN of/IN interferon/JJ beta/JJ gene/NN transcription/NN in/IN vitro/NN by/IN

Re-training of maximum entropy models The taggers are trained as maximum entropy models, which can be adapted to target domains by re-training with domain-specific data. The feature functions are given by the developer; the model parameters are re-estimated.

Methods for domain adaptation Combined training data: a model is trained from scratch on the original data together with the domain-specific data. Reference distribution: the original model is used as the reference probability distribution of a domain-specific model.
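In the usual formulation of the reference-distribution approach (stated here as a reconstruction, since the slides give no formula), the original model q_0 multiplies a log-linear correction whose weights λ_i are the only parameters re-estimated on the domain-specific data:

```latex
p(y \mid x) = \frac{q_0(y \mid x)\, \exp\big( \sum_i \lambda_i f_i(x, y) \big)}
                   {\sum_{y'} q_0(y' \mid x)\, \exp\big( \sum_i \lambda_i f_i(x, y') \big)}
```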

Adaptation of the part-of-speech tagger Relationships among the training and test data are evaluated for the following corpora:
– WSJ: Penn Treebank WSJ
– GENIA: GENIA POS corpus [Kim et al., 2003]; 2,000 MEDLINE abstracts selected with the MeSH terms Human, Blood cells, and Transcription factors
– PennBioIE: Penn BioIE corpus [Kulick et al., 2004]; 1,100 MEDLINE abstracts about inhibition of the cytochrome P450 family of enzymes and 1,157 MEDLINE abstracts about the molecular genetics of cancer
– Fly: 200 MEDLINE abstracts on Drosophila melanogaster

Training and test sets
Training sets:
  Corpus      # tokens   # sentences
  WSJ         912,344    38,219
  GENIA       450,492    18,508
  PennBioIE   641,838    29,422
  Fly         1,024
Test sets:
  Corpus      # tokens   # sentences
  WSJ         129,654    5,462
  GENIA       50,562     2,036
  PennBioIE   70,713     3,270
  Fly         7,615      326

Experimental results Accuracy on the WSJ, GENIA, PennBioIE, and Fly test sets, plus training time (sec.), comparing: a model trained on WSJ+GENIA+PennBioIE, a model trained on Fly only (93.91), combined training data, and the reference distribution.

Corpus size vs. accuracy (combined training data)

Corpus size vs. accuracy (reference distribution)

Summary POS tagging: MEMM-like approaches achieve good performance with reasonable computational cost; CRFs seem too computationally expensive at present. Chunking: CRFs yield good performance for NP chunking; semi-Markov CRFs are promising, but we need to somehow reduce their computational cost. Domain adaptation: one can easily use information about the original domain as the reference distribution.

References
A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. (1996). A maximum entropy approach to natural language processing. Computational Linguistics.
Adwait Ratnaparkhi. (1996). A Maximum Entropy Part-Of-Speech Tagger. Proceedings of EMNLP.
Thorsten Brants. (2000). TnT – A Statistical Part-Of-Speech Tagger. Proceedings of ANLP.
Taku Kudo and Yuji Matsumoto. (2001). Chunking with Support Vector Machines. Proceedings of NAACL.
John Lafferty, Andrew McCallum, and Fernando Pereira. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of ICML.
Michael Collins. (2002). Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. Proceedings of EMNLP.
Fei Sha and Fernando Pereira. (2003). Shallow Parsing with Conditional Random Fields. Proceedings of HLT-NAACL.
K. Toutanova, D. Klein, C. Manning, and Y. Singer. (2003). Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. Proceedings of HLT-NAACL.

References
Xavier Carreras and Lluís Márquez. (2003). Phrase recognition by filtering and ranking with perceptrons. Proceedings of RANLP.
Jesús Giménez and Lluís Márquez. (2004). SVMTool: A general POS tagger generator based on Support Vector Machines. Proceedings of LREC.
Sunita Sarawagi and William W. Cohen. (2004). Semi-Markov Conditional Random Fields for Information Extraction. Proceedings of NIPS.
Yoshimasa Tsuruoka and Jun'ichi Tsujii. (2005). Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data. Proceedings of HLT/EMNLP.
Yuka Tateisi, Yoshimasa Tsuruoka and Jun'ichi Tsujii. (2006). Subdomain adaptation of a POS tagger with a small corpus. Proceedings of the HLT-NAACL BioNLP Workshop.
Daisuke Okanohara, Yusuke Miyao, Yoshimasa Tsuruoka, and Jun'ichi Tsujii. (2006). Improving the Scalability of Semi-Markov Conditional Random Fields for Named Entity Recognition. Proceedings of COLING/ACL 2006.