Part of Speech Tagging in Context
Michele Banko, Robert Moore
Presented by Alex Cheng, Ling 575, Winter 08

Overview
- Comparison of previous methods
- Using context from both sides
- Lexicon construction
- Sequential EM for tag sequence and lexical probabilities
- Discussion questions

Previous methods
- Trigram model: P(t_i | t_{i-1}, t_{i-2}) (toy example below)
- Kupiec (1992): divides the lexicon into word classes
  - Words within the same equivalence class possess the same set of possible POS tags
- Brill (1995): unsupervised transformation-based learning (UTBL)
  - Uses information from the distribution of unambiguously tagged data to make labeling decisions
  - Considers both left and right context
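To make the trigram transition model concrete, here is a minimal sketch (not any of the papers' code) of a maximum-likelihood estimate of P(t_i | t_{i-2}, t_{i-1}) from toy tagged sequences; the tag names and data are purely illustrative, and no smoothing is applied.

```python
# Minimal sketch: MLE of the trigram tag transition model from tagged sentences.
from collections import defaultdict

def trigram_transitions(tag_sequences):
    """tag_sequences: list of tag lists, e.g. [["DT", "NN", "VBZ"], ...]."""
    tri = defaultdict(int)   # counts of (t_{i-2}, t_{i-1}, t_i)
    bi = defaultdict(int)    # counts of (t_{i-2}, t_{i-1})
    for tags in tag_sequences:
        padded = ["<s>", "<s>"] + tags          # start-of-sentence padding
        for i in range(2, len(padded)):
            ctx = (padded[i - 2], padded[i - 1])
            tri[ctx + (padded[i],)] += 1
            bi[ctx] += 1
    # P(t_i | t_{i-2}, t_{i-1}) = count(t_{i-2}, t_{i-1}, t_i) / count(t_{i-2}, t_{i-1})
    return {k: v / bi[k[:2]] for k, v in tri.items()}

probs = trigram_transitions([["DT", "NN", "VBZ"], ["DT", "JJ", "NN"]])
print(probs[("<s>", "DT", "NN")])   # 0.5 on this toy data
```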

Toutanova (2003): conditional Markov model
- Supervised learning method
- Increases accuracy from 96.10% to 96.55%
Lafferty (2001)
- Compared HMMs with MEMMs and CRFs

Contextualized HMM
- Estimates the probability of a word w_i based on t_{i-1}, t_i, and t_{i+1}
- Leads to higher dimensionality in the parameters
- Smoothed with a standard absolute discounting scheme (sketched below)
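A minimal sketch (not Banko & Moore's implementation) of the contextualized emission P(w_i | t_{i-1}, t_i, t_{i+1}) with absolute discounting, backing off to the ordinary emission P(w_i | t_i). The class and parameter names are my own, and the discount constant D is an assumption, not a value from the paper.

```python
from collections import defaultdict

class ContextEmission:
    def __init__(self, discount=0.5):
        self.D = discount
        self.ctx_word = defaultdict(int)   # counts of ((t_prev, t, t_next), w)
        self.ctx = defaultdict(int)        # counts of (t_prev, t, t_next)
        self.tag_word = defaultdict(int)   # counts of (t, w)
        self.tag = defaultdict(int)        # counts of t

    def observe(self, t_prev, t, t_next, w, count=1):
        self.ctx_word[((t_prev, t, t_next), w)] += count
        self.ctx[(t_prev, t, t_next)] += count
        self.tag_word[(t, w)] += count
        self.tag[t] += count

    def prob(self, t_prev, t, t_next, w):
        # backoff distribution P(w | t)
        backoff = self.tag_word[(t, w)] / self.tag[t] if self.tag[t] else 0.0
        ctx = (t_prev, t, t_next)
        c = self.ctx[ctx]
        if c == 0:
            return backoff
        c_w = self.ctx_word[(ctx, w)]
        # number of distinct words seen in this tag context
        types = sum(1 for (k, _w), v in self.ctx_word.items() if k == ctx and v > 0)
        lam = self.D * types / c              # mass reserved for the backoff model
        return max(c_w - self.D, 0.0) / c + lam * backoff
```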

Lexicon construction
- Lexicons provided for both training and testing
- Initialize with a uniform distribution over all possible tags for each word (see the sketch below)
- Experiments with using word classes, as in the Kupiec model
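A small sketch of the uniform initialization, assuming a lexicon that maps each word to its set of allowed tags; the example words and tags are illustrative only.

```python
def uniform_emissions(lexicon):
    """lexicon: dict word -> set of allowed tags.
    Returns P(t | w) initialized uniformly over each word's possible tags;
    an HMM needs P(w | t), which can be obtained by renormalizing per tag."""
    return {w: {t: 1.0 / len(tags) for t in tags} for w, tags in lexicon.items()}

lex = {"the": {"DT"}, "can": {"MD", "NN", "VB"}}
print(uniform_emissions(lex)["can"]["MD"])   # 0.3333...
```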

Problems
- Limiting the possible tags per lexicon entry
  - Tags that appear less than X% of the time for a given word are omitted (example below)
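A hypothetical sketch of this pruning step: drop from a word's lexicon entry any tag it carries less than some fraction of the time in tagged data. The threshold parameter stands in for the slide's "X%" and is not a value taken from the paper.

```python
from collections import Counter, defaultdict

def prune_lexicon(tagged_corpus, threshold):
    """tagged_corpus: iterable of (word, tag) pairs; threshold: e.g. 0.1 for 10%."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    lexicon = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())
        kept = {t for t, c in tag_counts.items() if c / total >= threshold}
        lexicon[word] = kept or set(tag_counts)   # never leave an entry empty
    return lexicon
```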

HMM Model Training
- Extract non-ambiguous tag sequences
  - Use these n-grams and their counts to bias the initial estimate of the HMM's state transitions
- Sequential training (sketched below)
  - First train the transition probabilities, keeping the lexical probabilities constant
  - Then train the lexical probabilities, keeping the transition probabilities constant
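A minimal numpy sketch of the sequential-training idea (not the authors' code): run EM, but in alternating phases update only the transition matrix A while holding the emission matrix B fixed, then only B while holding A fixed. Scaling and log-space handling are omitted for brevity, so this is only suitable for short toy sequences; the initial distribution pi is held fixed for simplicity.

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """obs: int array of word ids. Returns gamma (state posteriors) and
    xi (pairwise posteriors) for one sequence (unscaled, toy-sized only)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi[t] /= xi[t].sum()
    return gamma, xi

def sequential_em(sequences, pi, A, B, n_words, phases=("A", "B"), iters=5):
    """Phase "A": re-estimate transitions only; phase "B": emissions only."""
    for phase in phases:
        for _ in range(iters):
            A_num = np.zeros_like(A); B_num = np.zeros((A.shape[0], n_words))
            for obs in sequences:
                gamma, xi = forward_backward(obs, pi, A, B)
                A_num += xi.sum(axis=0)
                for t, w in enumerate(obs):
                    B_num[:, w] += gamma[t]
            if phase == "A":
                A = A_num / (A_num.sum(axis=1, keepdims=True) + 1e-12)
            else:
                B = B_num / (B_num.sum(axis=1, keepdims=True) + 1e-12)
    return A, B
```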

Discussion
- Sequential training of the HMM trains the parameters separately. Is there any theoretical significance? What is the computational cost?
- What are the effects if we model the tag context differently, using P(t_i | t_{i-1}, t_{i+1})?

Improved Estimation for Unsupervised POS Tagging
Qin Iris Wang, Dale Schuurmans
Presented by Alex Cheng, Ling 575, Winter 08

Overview
- Focus on parameter estimation
  - Considers only simple models with limited context (a standard bigram HMM)
- Constraint on marginal tag probabilities
- Smoothing of lexical parameters using word similarities
- Discussion questions

Parameter Estimation
- Banko and Moore (2004) reduce the error rate from 22.8% to 4.1% by reducing the set of possible tags for each word
  - Requires tagged data to find the artificially reduced lexicon
- EM is guaranteed to converge to a local maximum, but HMMs tend to have many local maxima
  - As a result, the quality of the learned parameters may depend more on the initial parameter estimates than on the EM procedure itself

Estimation problems
- Using the standard model:
  - Tag -> tag: uniform over all tags
  - Tag -> word: uniform over all possible tags for the word (as specified in the complete lexicon)
- The estimated transition probabilities are quite poor
  - e.g., 'a' is always tagged LS
- The estimated lexical probabilities are also quite poor
  - Each parameter b_{t,w1}, b_{t,w2} is treated as independent
  - EM tends to over-fit the lexical model and ignore similarity between words

Marginally Constrained HMMs
- Constrains the tag -> tag probabilities
- Maintain a specific marginal distribution over the tags (see the sketch below)
  - Assumes a target distribution over tags (raw tag frequencies), which can be obtained from tagged data or approximated (see Toutanova, 2003)
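A rough sketch of imposing the marginal constraint, i.e. adjusting the transition matrix A so that the implied marginal over next tags matches a target distribution q: sum over t' of q(t') * A[t', t] = q(t) for every tag t. The alternating column-rescaling and row-renormalization used here (Sinkhorn-style) is one simple way to approximately project onto this constraint set; it is not necessarily Wang and Schuurmans' exact procedure.

```python
import numpy as np

def constrain_marginal(A, q, n_iter=50):
    """A: row-stochastic transition matrix, A[i, j] = P(t_j | t_i);
    q: target marginal tag distribution (sums to 1)."""
    A = A.copy()
    for _ in range(n_iter):
        implied = q @ A                        # current marginal over next tags
        A *= (q / np.maximum(implied, 1e-12))  # rescale columns toward the target
        A /= A.sum(axis=1, keepdims=True)      # restore row-stochasticity
    return A
```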

Similarity-based Smoothing
- Applies to the tag -> word probabilities
- Uses a feature vector f for each word w, consisting of the contexts (left and right words) of w (see the sketch below)
- Took the 100,000 most frequent words as features
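A hedged sketch of the general idea only (the details differ from the paper): build a context-count feature vector for each word from its left and right neighbours, then smooth each word's lexical probabilities toward those of similar words, weighted by cosine similarity. The function names, the mixing weight alpha, and `top_features` (standing in for the 100,000 most frequent context words) are all assumptions.

```python
import numpy as np
from collections import defaultdict

def context_vectors(sentences, top_features):
    """Count left/right neighbour occurrences restricted to top_features."""
    feat_idx = {f: i for i, f in enumerate(top_features)}
    vecs = defaultdict(lambda: np.zeros(len(top_features)))
    for sent in sentences:
        for i, w in enumerate(sent):
            for ctx in (sent[i - 1] if i > 0 else None,
                        sent[i + 1] if i + 1 < len(sent) else None):
                if ctx in feat_idx:
                    vecs[w][feat_idx[ctx]] += 1
    return vecs

def smooth_emissions(p_w_given_t, vecs, alpha=0.5):
    """p_w_given_t: dict tag -> dict word -> prob. Mixes each P(w | t) with a
    cosine-similarity-weighted average over other words.
    Note: the result may need per-tag renormalization to sum to 1."""
    words = list(vecs)
    M = np.array([vecs[w] for w in words])
    norms = np.linalg.norm(M, axis=1) + 1e-12
    sim = (M @ M.T) / np.outer(norms, norms)     # cosine similarities
    np.fill_diagonal(sim, 0.0)
    sim /= sim.sum(axis=1, keepdims=True) + 1e-12
    smoothed = {}
    for t, dist in p_w_given_t.items():
        p = np.array([dist.get(w, 0.0) for w in words])
        smoothed[t] = dict(zip(words, (1 - alpha) * p + alpha * (sim @ p)))
    return smoothed
```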

Result

Discussion
- Compared to Banko and Moore, are the methods used here "more or less" unsupervised?
  - Banko and Moore use lexicon ablation
  - Here, we use the raw frequency of tags