Tagging – more details

Reading:
D Jurafsky & J H Martin (2000) Speech and Language Processing, Ch 8
R Dale et al (2000) Handbook of Natural Language Processing, Ch 17
C D Manning & H Schütze (1999) Foundations of Statistical Natural Language Processing, Ch 10

POS tagging - overview
What is a “tagger”?
Tagsets
How to build a tagger and how a tagger works
– Supervised vs unsupervised learning
– Rule-based vs stochastic
– And some details

What is a tagger?
There is a lack of distinction between …
– Software which allows you to create something you can then use to tag input text, e.g. “Brill’s tagger”
– The result of running such software, e.g. a tagger for English (based on such-and-such a corpus)
Taggers (even rule-based ones) are almost invariably trained on a given corpus
“Tagging” is usually understood to mean “POS tagging”, but you can have other types of tags (e.g. semantic tags)

Tagging vs. parsing
Once a tagger is “trained”, the process consists of straightforward look-up, plus local context (and sometimes morphology)
A tagger will attempt to assign a tag to unknown words, and to disambiguate homographs
The “tagset” (list of categories) is usually larger than a parser’s set of word categories, and makes more distinctions

Tagset
Parsing usually has basic word categories, whereas tagging makes more subtle distinctions
E.g. noun sg vs pl vs genitive, common vs proper, +is, +has, … and all combinations
A parser may use only a small set of basic categories, whereas a tagger may use many more (for example, the Penn Treebank tagset has 45 tags)

Simple taggers
A default tagger has one tag per word, and assigns it on the basis of dictionary lookup
– Tags may indicate ambiguity but not resolve it, e.g. nvb for noun-or-verb
Words may be assigned different tags with associated probabilities
– The tagger will assign the most probable tag unless there is some way to identify when a less probable tag is in fact correct
Tag sequences may be defined by regular expressions, and assigned probabilities (including 0 for illegal sequences)
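As an illustration, here is a minimal sketch of such a dictionary-lookup default tagger; the lookup table, tag names and fallback tag are invented for the example, not taken from any particular system.

```python
# Minimal default tagger: assigns each word its single most probable tag
# from a lookup table, with a fallback tag for unknown words.
# The table below is a toy example, not derived from a real corpus.

MOST_LIKELY_TAG = {
    "the": "DET",
    "dog": "NOUN",
    "barks": "VERB",
    "can": "MODAL",   # ambiguous in reality (noun/verb/modal); the table keeps only one tag
}

def default_tag(words, fallback="NOUN"):
    """Tag each word by dictionary lookup; unknown words get the fallback tag."""
    return [(w, MOST_LIKELY_TAG.get(w.lower(), fallback)) for w in words]

print(default_tag("The dog barks".split()))
# [('The', 'DET'), ('dog', 'NOUN'), ('barks', 'VERB')]
```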

What probabilities do we have to learn?
(a) Individual word probabilities: the probability that a given tag t is appropriate for a given word w
– Easy (in principle) to learn from a training corpus, e.g. P(t|w) = C(w,t) / C(w)
– Problem of “sparse data”: add a small amount to each count, so we get no zeros

(b) Tag sequence probability: the probability that a given tag sequence t_1,t_2,…,t_n is appropriate for a given word sequence w_1,w_2,…,w_n
– P(t_1,t_2,…,t_n | w_1,w_2,…,w_n) = ???
– Too hard to calculate for the entire sequence, since the chain rule
  P(t_1,t_2,…,t_n) = P(t_1) · P(t_2|t_1) · P(t_3|t_1,t_2) · … · P(t_n|t_1,…,t_{n-1})
  conditions each tag on all of its predecessors
– A subsequence is more tractable; a history of 1 or 2 tags should be enough:
  Bigram model: P(t_i | t_1,…,t_{i-1}) ≈ P(t_i | t_{i-1})
  Trigram model: P(t_i | t_1,…,t_{i-1}) ≈ P(t_i | t_{i-2}, t_{i-1})
  N-gram model: P(t_i | t_1,…,t_{i-1}) ≈ P(t_i | t_{i-N+1},…,t_{i-1})
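Combining (a) and (b) gives the usual bigram HMM tagging objective. The following is the standard textbook formulation (cf. Jurafsky & Martin, Ch 8), reconstructed here rather than copied from the original slides:

```latex
% Bigram HMM tagging: choose the tag sequence that maximizes the product of
% lexical (emission) probabilities and tag-transition probabilities.
\hat{t}_{1:n}
  = \operatorname*{argmax}_{t_1,\dots,t_n} P(t_1,\dots,t_n \mid w_1,\dots,w_n)
  \approx \operatorname*{argmax}_{t_1,\dots,t_n}
      \prod_{i=1}^{n} P(w_i \mid t_i)\; P(t_i \mid t_{i-1})
```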

More complex taggers
Bigram taggers assign tags on the basis of sequences of two words (usually assigning the tag of word n on the basis of the tag assigned to word n−1)
An nth-order tagger assigns tags on the basis of sequences of n words
As the value of n increases, so does the complexity of the statistical calculation involved in comparing probability combinations

History
– Brown Corpus created (EN-US), 1 million words
– Greene and Rubin, rule based – 70%
– Brown Corpus tagged
– LOB Corpus created (EN-UK), 1 million words
– HMM tagging (CLAWS) – 93%–95%
– LOB Corpus tagged
– DeRose/Church, efficient HMM, sparse data – 95%+
– Penn Treebank Corpus (WSJ, 4.5M)
– Transformation-based tagging (Eric Brill), rule based – 95%+
– British National Corpus (tagged by CLAWS)
– POS tagging separated from other NLP
– Tree-based statistics (Helmut Schmid) – 96%+
– Neural network – 96%+
– Trigram tagger (Kempe) – 96%+
– Combined methods – 98%+

How do they work?
Tagger must be “trained”
Many different techniques, but typically …
– Small “training corpus” hand-tagged
– Tagging rules learned automatically
– Rules define most likely sequence of tags
– Rules based on
  – Internal evidence (morphology)
  – External evidence (context)

Rule-based taggers
Earliest type of tagging: two stages
– Stage 1: look up the word in a lexicon to give a list of potential POSs
– Stage 2: apply rules which certify or disallow tag sequences
Rules were originally handwritten; more recently, Machine Learning methods can be used (cf transformation-based learning, below)
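A toy sketch of the two-stage idea, assuming an invented mini-lexicon and a single handwritten constraint; a real rule-based tagger would have a full lexicon and many such rules.

```python
# Two-stage rule-based tagging sketch: stage 1 looks up candidate tags in a
# lexicon, stage 2 filters out tag sequences that a (handwritten) rule
# disallows. Lexicon, tag names and the single rule are invented examples.
from itertools import product

LEXICON = {"the": ["DET"], "can": ["MODAL", "NOUN", "VERB"], "rusted": ["VERB"]}

def disallowed(seq):
    # Example handwritten constraint: a modal may not directly follow a determiner.
    return any(a == "DET" and b == "MODAL" for a, b in zip(seq, seq[1:]))

def rule_based_tag(words):
    candidates = [LEXICON.get(w, ["NOUN"]) for w in words]               # stage 1
    return [seq for seq in product(*candidates) if not disallowed(seq)]  # stage 2

print(rule_based_tag(["the", "can", "rusted"]))
# [('DET', 'NOUN', 'VERB'), ('DET', 'VERB', 'VERB')]
```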

Stochastic taggers
Nowadays, pretty much all taggers are statistics-based, and have been since the 1980s (or even earlier: some primitive algorithms were already published in the 1960s and 70s)
The most common approach is based on Hidden Markov Models (also found in speech processing, etc.)

(Hidden) Markov Models
The probability calculations imply a Markov model: we assume that the probability of a tag depends only on the previous tag (or a short sequence of previous tags)
(Informally) Markov models are the class of probabilistic models that assume we can predict the future without taking too much account of the past
Markov chains can be modelled by finite state automata: the next state in a Markov chain always depends on a finite history of previous states
The model is “hidden” because the states themselves (the tags) are not directly observed; we only see the words they emit

Three stages of HMM training
– Estimating likelihoods on the basis of a corpus: forward-backward algorithm
– “Decoding”: applying the process to a given input: Viterbi algorithm
– Learning (training): Baum-Welch algorithm or iterative Viterbi

Forward-backward algorithm
Denote by A_t(s) the probability of the first t words together with tag s at position t:
  A_t(s) = P(w_1 … w_t, tag at t = s)
Claim: A_{t+1}(s) = Σ_q A_t(q) · P(s|q) · P(w_{t+1}|s)
Therefore we can calculate all A_t(s) in time O(L·T^n).
Similarly, by going backwards, we can get:
  B_t(s) = P(w_{t+1} … w_L | tag at t = s) = Σ_q P(q|s) · P(w_{t+1}|q) · B_{t+1}(q)
Multiplying, we get:
  A_t(s) · B_t(s) = P(w_1 … w_L, tag at t = s)
Note that summing this over all states s at a time t gives the likelihood of w_1 … w_L.
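A small Python sketch of the forward pass (the backward pass is symmetric); all model parameters here are invented toy values, not estimates from a real corpus.

```python
# Forward algorithm sketch for a toy HMM tagger: alpha[t][s] is the total
# probability of the first t+1 words with state s at position t.
# States, start, trans and emit are invented toy parameters.

states = ["DET", "NOUN", "VERB"]
start = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans = {  # trans[prev][cur] = P(cur | prev)
    "DET":  {"DET": 0.01, "NOUN": 0.89, "VERB": 0.10},
    "NOUN": {"DET": 0.10, "NOUN": 0.30, "VERB": 0.60},
    "VERB": {"DET": 0.50, "NOUN": 0.30, "VERB": 0.20},
}
emit = {  # emit[state][word] = P(word | state); unseen words get a small floor
    "DET":  {"the": 0.7},
    "NOUN": {"dog": 0.01, "barks": 0.001},
    "VERB": {"barks": 0.01, "dog": 0.0001},
}
FLOOR = 1e-6

def e(s, w):
    return emit[s].get(w, FLOOR)

def forward(words):
    alpha = [{s: start[s] * e(s, words[0]) for s in states}]
    for w in words[1:]:
        prev = alpha[-1]
        alpha.append({s: sum(prev[q] * trans[q][s] for q in states) * e(s, w)
                      for s in states})
    return alpha, sum(alpha[-1].values())   # second value = P(w_1 ... w_L)

alpha, likelihood = forward(["the", "dog", "barks"])
print(likelihood)
```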

Viterbi algorithm (aka dynamic programming) (see J&M p177ff)
Denote by Q_t(s) the probability of the best tag sequence for w_1 … w_t that ends in state s.
Claim: Q_{t+1}(s) = max_q Q_t(q) · P(s|q) · P(w_{t+1}|s)
(Otherwise, appending s to a better prefix would give a path better than Q_{t+1}(s).)
Therefore, checking all possible states q at time t, multiplying by the transition probability between q and s and the emission probability of w_{t+1} given s, and taking the maximum, gives Q_{t+1}(s).
We also need to store, for each state, the previous state on the best path to Q_t(s).
Find the maximal finish state, and reconstruct the path.
Time O(L·T^n) instead of T^L.
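A matching Viterbi sketch; the model is passed in as plain dictionaries of the same shape as the toy HMM in the forward-algorithm sketch above.

```python
# Viterbi decoding sketch: best[t][s] is the probability of the best tag
# sequence for the first t+1 words that ends in state s; back[t][s] remembers
# the predecessor state so the best path can be reconstructed.

def viterbi(words, states, start, trans, emit, floor=1e-6):
    def e(s, w):
        return emit[s].get(w, floor)

    best = [{s: start[s] * e(s, words[0]) for s in states}]
    back = [{}]
    for w in words[1:]:
        prev = best[-1]
        cur, ptr = {}, {}
        for s in states:
            q_best = max(states, key=lambda q: prev[q] * trans[q][s])
            cur[s] = prev[q_best] * trans[q_best][s] * e(s, w)
            ptr[s] = q_best
        best.append(cur)
        back.append(ptr)

    # Reconstruct the best path from the maximal final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return list(reversed(path)), best[-1][last]
```

Called with the toy parameters from the forward sketch, `viterbi("the dog barks".split(), states, start, trans, emit)` returns `(['DET', 'NOUN', 'VERB'], p)` for some small path probability p.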

Baum-Welch algorithm
– Start with an initial HMM
– Using the forward-backward probabilities, calculate for each time i the probability that each hidden state was used at time i, given the observations
– Re-estimate the HMM parameters from these expected counts
– Continue until convergence
– Can be shown to improve the likelihood at every iteration
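For reference, the re-estimation step in the standard notation (with α and β as in the forward-backward slide, a_{qs} for the transition probability and b_s(w) for the lexical/emission probability); this is the usual textbook formulation rather than anything shown on the slides:

```latex
% E-step: expected state occupancy and transition counts.
\gamma_t(s) = \frac{\alpha_t(s)\,\beta_t(s)}{P(w_1 \dots w_L)}
\qquad
\xi_t(q,s) = \frac{\alpha_t(q)\, a_{qs}\, b_s(w_{t+1})\, \beta_{t+1}(s)}{P(w_1 \dots w_L)}

% M-step: re-estimate transition and emission probabilities.
\hat{a}_{qs} = \frac{\sum_{t=1}^{L-1} \xi_t(q,s)}{\sum_{t=1}^{L-1} \gamma_t(q)}
\qquad
\hat{b}_s(w) = \frac{\sum_{t : w_t = w} \gamma_t(s)}{\sum_{t=1}^{L} \gamma_t(s)}
```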

Unsupervised learning
– We have an untagged corpus
– We may also have partial information, such as a set of tags, a dictionary, knowledge of tag transitions, etc.
– Use Baum-Welch to estimate both the context probabilities and the lexical probabilities

Supervised learning
– Use a tagged corpus
– Count the frequencies of tag–word pairs t,w: C(t,w)
– Estimate (Maximum Likelihood Estimate): P(w|t) = C(t,w) / C(t)
– Count the frequencies of tag n-grams C(t_1…t_n)
– Estimate (Maximum Likelihood Estimate): P(t_n | t_1…t_{n-1}) = C(t_1…t_n) / C(t_1…t_{n-1})
– What about small counts? Zero counts?
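A small sketch of these counts and estimates; the two-sentence “corpus” and the tag names are invented purely for illustration (real training data would be something like the tagged Brown corpus or the Penn Treebank).

```python
# Supervised MLE estimation sketch from a tiny hand-tagged corpus.
from collections import Counter

tagged_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("can", "MODAL"), ("bark", "VERB")],
]

word_tag = Counter()    # C(t, w)
tag_count = Counter()   # C(t)
tag_bigram = Counter()  # C(t1, t2), with <s> as a sentence-start marker

for sentence in tagged_corpus:
    tags = ["<s>"] + [t for _, t in sentence]
    for w, t in sentence:
        word_tag[(t, w)] += 1
        tag_count[t] += 1
    for t1, t2 in zip(tags, tags[1:]):
        tag_bigram[(t1, t2)] += 1

def p_word_given_tag(w, t):
    # Lexical probability P(w | t) = C(t, w) / C(t)
    return word_tag[(t, w)] / tag_count[t]

def p_tag_given_prev(t, prev):
    # Context probability P(t | prev) = C(prev, t) / C(prev, *)
    total = sum(c for (p, _), c in tag_bigram.items() if p == prev)
    return tag_bigram[(prev, t)] / total

print(p_word_given_tag("dog", "NOUN"), p_tag_given_prev("NOUN", "DET"))
# 1.0 1.0 on this toy corpus; any unseen pair gets probability 0 (hence smoothing)
```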

Sparse Training Data - Smoothing
Adding a bias: P(w|t) = (C(t,w) + λ) / (C(t) + λ·V), where V is the vocabulary size and λ a small constant
– Compensates for estimation error (a Bayesian approach)
– Has a larger effect on low-count words
– Solves the zero-count word problem
Generalized smoothing: P(w|t) = μ·P_MLE(w|t) + (1−μ)·P_prior(w)
– Reduces to the simple bias when the prior is uniform (P_prior = 1/V) and μ = C(t) / (C(t) + λ·V)
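A minimal add-λ smoothing sketch, assuming count tables of the same shape as the Counters in the previous sketch; lam and vocab_size are free parameters of the example, not values from any particular tagger.

```python
# Add-lambda (Lidstone) smoothing sketch for the lexical probabilities:
# a small constant lam is added to every count so that unseen (tag, word)
# pairs get a non-zero probability.

def p_word_given_tag_smoothed(w, t, word_tag, tag_count, vocab_size, lam=0.5):
    return (word_tag[(t, w)] + lam) / (tag_count[t] + lam * vocab_size)
```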

Decision-tree tagging
Not all n-grams are created equal:
– Some n-grams contain redundant information that could be expressed well enough with fewer tags
– Some n-grams are too sparse
Decision Tree (Schmid, 1994)

Decision Trees
– Each node is a binary test of tag t_{i-k}; the leaves store probabilities for t_i
– All HMM algorithms can still be used
Learning:
– Build the tree from root to leaves
– Choose tests for nodes that maximize information gain
– Stop when a branch is too sparse
– Finally, prune the tree
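A toy sketch of such a tree, assuming one binary test on the previous tag and invented leaf probabilities; a real learned tree (as in Schmid’s TreeTagger) would be much deeper and estimated from data.

```python
# Minimal sketch of a decision-tree context model: each internal node tests
# the value of a previous tag, and each leaf stores a probability
# distribution over the current tag. Structure and numbers are invented.

TREE = {
    "test": ("t_{i-1}", "DET"),      # binary test: is the previous tag DET?
    "yes": {"leaf": {"NOUN": 0.75, "ADJ": 0.20, "VERB": 0.05}},
    "no":  {"leaf": {"VERB": 0.40, "NOUN": 0.35, "DET": 0.25}},
}

def tree_prob(node, prev_tags, tag):
    """Walk the tree using the previous tags and return P(tag | context)."""
    while "leaf" not in node:
        pos, value = node["test"]          # e.g. ("t_{i-1}", "DET")
        k = 1 if pos == "t_{i-1}" else 2   # how far back to look
        node = node["yes"] if len(prev_tags) >= k and prev_tags[-k] == value else node["no"]
    return node["leaf"].get(tag, 0.0)

print(tree_prob(TREE, ["DET"], "NOUN"))  # 0.75
```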

Transformation-based learning
Eric Brill (1993)
– Start from an initial tagging, and apply a series of transformations
– The transformations are learned as well, from the training data
– Captures the tagging data in far fewer parameters than stochastic models
– The transformations learned have linguistic meaning

Transformation-based learning
Examples: change tag a to b when:
– The preceding (following) word is tagged z
– The word two before (after) is tagged z
– One of the two preceding (following) words is tagged z
– The preceding word is tagged z and the following word is tagged w
– The preceding (following) word is W
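A sketch of applying one rule of the first kind (“change tag a to b when the preceding word is tagged z”); the example sentence and tag names are illustrative, though the NOUN-to-VERB-after-TO change is the classic example from Brill’s work.

```python
# Applying one Brill-style transformation: change tag a to b wherever the
# previous word's tag is prev_tag. A real tagger applies an ordered list of
# learned rules to the output of an initial (e.g. most-frequent-tag) tagger.

def apply_rule(tagged, a, b, prev_tag):
    """Return a new tagging with a -> b wherever the previous tag is prev_tag."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == a and out[i - 1][1] == prev_tag:
            out[i] = (word, b)
    return out

initial = [("to", "TO"), ("race", "NOUN")]          # initial tagger's guess
print(apply_rule(initial, "NOUN", "VERB", "TO"))    # NOUN -> VERB after TO
# [('to', 'TO'), ('race', 'VERB')]
```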

Transformation-based Tagger: Learning
– Start with an initial tagging
– Score the possible transformations by comparing their result to the “truth”
– Choose the transformation that maximizes the score
– Repeat the last two steps
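A greedy learning sketch restricted to the single rule family “change tag a to b when the previous tag is z”; it reuses apply_rule from the previous sketch and, for simplicity, treats the training data as one flat sequence of (word, tag) pairs with a parallel gold-standard tagging.

```python
# Greedy transformation learning sketch: score every candidate rule by how
# much it improves agreement with the gold tagging, pick the best one, apply
# it, and repeat until no rule helps. Brute force over the tag set.
from itertools import product

def score(current, gold):
    """Number of positions where the current tagging matches the gold tagging."""
    return sum(c[1] == g[1] for c, g in zip(current, gold))

def learn_rules(current, gold, tagset, max_rules=10):
    rules = []
    for _ in range(max_rules):
        candidates = [(a, b, z) for a, b, z in product(tagset, repeat=3) if a != b]
        best = max(candidates,
                   key=lambda r: score(apply_rule(current, *r), gold))
        gain = score(apply_rule(current, *best), gold) - score(current, gold)
        if gain <= 0:           # stop when no transformation improves the tagging
            break
        rules.append(best)
        current = apply_rule(current, *best)
    return rules, current
```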