
1 I256 Applied Natural Language Processing, Fall 2009. Lecture 6: Introduction to Graphical Models and Part-of-Speech Tagging. Barbara Rosario

2 Graphical Models
Within the Machine Learning framework: probability theory plus graph theory.
Widely used in:
–NLP
–Speech recognition
–Error-correcting codes
–Systems diagnosis
–Computer vision
–Filtering (Kalman filters)
–Bioinformatics

3 (Quick intro to) Graphical Models
Nodes are random variables.
[Figure: a directed graph over nodes A, B, C, D with local probabilities P(A), P(D), P(B|A), P(C|A,D)]
Edges are annotated with conditional probabilities.
Absence of an edge between nodes implies conditional independence.
A "probabilistic database".

4 Graphical Models
[Figure: the same graph over A, B, C, D]
Define a joint probability distribution: P(X1, ..., XN) = ∏i P(Xi | Par(Xi))
P(A,B,C,D) = P(A) P(D) P(B|A) P(C|A,D)
Learning
–Given data, estimate the parameters P(A), P(D), P(B|A), P(C|A,D)

5 Graphical Models
[Figure: the same graph over A, B, C, D]
Define a joint probability distribution: P(X1, ..., XN) = ∏i P(Xi | Par(Xi))
P(A,B,C,D) = P(A) P(D) P(B|A) P(C|A,D)
Learning
–Given data, estimate P(A), P(B|A), P(D), P(C|A,D)
Inference: compute conditional probabilities, e.g., P(A | B, D) or P(C | D)
Inference = probabilistic queries
General inference algorithms exist (e.g., Junction Tree)
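
To make the factorization and the inference step concrete, here is a minimal Python sketch (not from the lecture); every probability value in it is invented for illustration.

```python
# Made-up probability tables for the graph above; all numbers are illustrative only.
P_A = {True: 0.3, False: 0.7}                      # P(A)
P_D = {True: 0.6, False: 0.4}                      # P(D)
P_B_given_A = {True: {True: 0.9, False: 0.1},      # P(B | A)
               False: {True: 0.2, False: 0.8}}
P_C_given_AD = {(True, True):  {True: 0.5, False: 0.5},   # P(C | A, D)
                (True, False): {True: 0.7, False: 0.3},
                (False, True): {True: 0.1, False: 0.9},
                (False, False): {True: 0.4, False: 0.6}}

def joint(a, b, c, d):
    """P(A=a, B=b, C=c, D=d) via the factorization P(A) P(D) P(B|A) P(C|A,D)."""
    return P_A[a] * P_D[d] * P_B_given_A[a][b] * P_C_given_AD[(a, d)][c]

# An inference query by brute-force marginalization: P(A=True | B=True, D=True).
vals = [True, False]
numerator = sum(joint(True, True, c, True) for c in vals)
denominator = sum(joint(a, True, c, True) for a in vals for c in vals)
print(numerator / denominator)
```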

6 Naïve Bayes models
A simple graphical model in which the xi depend on Y.
Naïve Bayes assumption: all xi are independent given Y.
Currently used for text classification and spam detection.
[Figure: Y with children x1, x2, x3]

7 Naïve Bayes models
Naïve Bayes for document classification.
[Figure: topic with children w1, w2, ..., wn]
Inference task: P(topic | w1, w2, ..., wn)

8 Naïve Bayes for WSD (word sense disambiguation)
[Figure: sense sk with children v1, v2, v3]
Recall the general joint probability distribution: P(X1, ..., XN) = ∏i P(Xi | Par(Xi))
P(sk, v1, v2, v3) = P(sk) ∏i P(vi | Par(vi)) = P(sk) P(v1 | sk) P(v2 | sk) P(v3 | sk)

9 Naïve Bayes for WSD
[Figure: sense sk with children v1, v2, v3]
P(sk, v1, v2, v3) = P(sk) ∏i P(vi | Par(vi)) = P(sk) P(v1 | sk) P(v2 | sk) P(v3 | sk)
Estimation (training): given data, estimate P(sk), P(v1 | sk), P(v2 | sk), P(v3 | sk)

10 Naïve Bayes for WSD
[Figure: sense sk with children v1, v2, v3]
P(sk, v1, v2, v3) = P(sk) ∏i P(vi | Par(vi)) = P(sk) P(v1 | sk) P(v2 | sk) P(v3 | sk)
Estimation (training): given data, estimate P(sk), P(v1 | sk), P(v2 | sk), P(v3 | sk)
Inference (testing): compute conditional probabilities of interest, e.g., P(sk | v1, v2, v3)
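
As an illustration of both steps, here is a small sketch (again not the lecture's code) that estimates P(sk) and P(vi | sk) by counting over toy labeled data and then computes P(sk | v1, v2, v3) up to normalization; the sense labels, context words, smoothing constant, and vocabulary size are all made up.

```python
import math
from collections import Counter, defaultdict

# Toy labeled data: (sense label, context words), invented for illustration.
data = [
    ("financial", ["money", "loan", "interest"]),
    ("financial", ["account", "money", "deposit"]),
    ("river",     ["water", "shore", "fishing"]),
    ("river",     ["boat", "water", "mud"]),
]

# Estimation (training): relative-frequency estimates of P(s_k) and P(v | s_k).
sense_counts = Counter(s for s, _ in data)
word_counts = defaultdict(Counter)
for s, words in data:
    word_counts[s].update(words)

def p_sense(s):
    return sense_counts[s] / sum(sense_counts.values())

def p_word_given_sense(v, s, alpha=1.0, vocab=20):
    # Add-alpha smoothing so unseen context words don't zero out the product.
    return (word_counts[s][v] + alpha) / (sum(word_counts[s].values()) + alpha * vocab)

# Inference (testing): P(s_k | v1, v2, v3) is proportional to P(s_k) * prod_i P(v_i | s_k).
def posterior(context):
    scores = {s: p_sense(s) * math.prod(p_word_given_sense(v, s) for v in context)
              for s in sense_counts}
    z = sum(scores.values())
    return {s: score / z for s, score in scores.items()}

print(posterior(["money", "interest", "water"]))
```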

11 Graphical Models
Given a graphical model:
–Do estimation (find the parameters from data)
–Do inference (compute conditional probabilities)
How do I choose the model structure (i.e., the edges)?

12 How to choose the model structure?
[Figure: four candidate graph structures over sk and v1, v2, v3, differing in which edges are present]

13 Model structure
Learn it: structure learning
–Difficult, and needs a lot of data
Use knowledge of the domain and of the relationships between the variables
–Heuristics
–The fewer dependencies (edges) we can have, the "better"
Sparsity: more edges mean more parameters, which need more data (next class...)
–The direction of the arrows also matters
[Figure: a denser structure over sk, v1, v2, v3, which would require estimating P(v3 | sk, v1, v2)]

14 Generative vs. discriminative
Generative:
[Figure: sk with children v1, v2, v3]
P(sk, v1, v2, v3) = P(sk) ∏i P(vi | Par(vi)) = P(sk) P(v1 | sk) P(v2 | sk) P(v3 | sk)
Estimation (training): given data, estimate P(sk), P(v1 | sk), P(v2 | sk) and P(v3 | sk)
Inference (testing): compute P(sk | v1, v2, v3) (there are algorithms to find these conditional probabilities, not covered here)
Discriminative:
[Figure: v1, v2, v3 with child sk]
P(sk, v1, v2, v3) = P(v1) P(v2) P(v3) P(sk | v1, v2, v3)
The conditional probability of interest, P(sk | v1, v2, v3), is "ready", i.e., modeled directly.
Estimation (training): given data, estimate P(v1), P(v2), P(v3), and P(sk | v1, v2, v3)
In short: the generative model does inference to find the probability of interest; the discriminative model models the probability of interest directly.

15 Generative vs. discriminative
Don't worry: you can use both kinds of models (if you are interested, let me know). But in short:
–If the Naïve Bayes assumption made by the generative method is not met (the conditional independencies do not hold), the discriminative method can have an edge
–But the generative model may converge faster
–Generative learning can sometimes be more efficient than discriminative learning, at least when the number of features is large compared to the number of samples

16 Graphical Models
Graphical models provide a convenient framework for visualizing conditional independence, and come with general inference algorithms.
Next, we'll see a GM (the Hidden Markov Model) for POS tagging.

17 Part-of-speech (English) From Dan Klein’s cs 288 slides

18 Terminology (modified from Diane Litman's version of Steve Bird's notes)
Tagging
–The process of associating labels with each token in a text
Tags
–The labels
–Syntactic word classes
Tag set
–The collection of tags used

19 Example
Typically a tagged text is a sequence of whitespace-separated base/tag tokens:
These/DT findings/NNS should/MD be/VB useful/JJ for/IN therapeutic/JJ strategies/NNS and/CC the/DT development/NN of/IN immunosuppressants/NNS targeting/VBG the/DT CD28/NN costimulatory/NN pathway/NN ./.
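
In NLTK, such base/tag strings can be turned into (word, tag) pairs; a quick sketch:

```python
import nltk

tagged_text = ("These/DT findings/NNS should/MD be/VB useful/JJ for/IN "
               "therapeutic/JJ strategies/NNS and/CC the/DT development/NN "
               "of/IN immunosuppressants/NNS")
# str2tuple splits a base/tag token into a (word, tag) pair.
pairs = [nltk.tag.str2tuple(t) for t in tagged_text.split()]
print(pairs[:3])  # [('These', 'DT'), ('findings', 'NNS'), ('should', 'MD')]
```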

20 Part-of-speech (English) From Dan Klein’s cs 288 slides

21 POS tagging vs. WSD
Similar tasks: assign a POS tag vs. assign a word sense
–You should butter your toast
–Bread and butter
Using a word as a noun or as a verb involves a different meaning, as in WSD.
In practice the two problems, POS tagging and WSD, have been kept distinct, because of their different nature and because the methods used are different:
–Nearby structure is most useful for POS (e.g., is the preceding word a determiner?) but of little use for word sense
–Conversely, quite distant content words are very effective for determining the semantic sense, but not the POS

22 Part-of-Speech Ambiguity (from Dan Klein's cs 288 slides)
[Figure: the same word occurring as a particle, a preposition, and an adverb]

23 Part-of-Speech Ambiguity Words that are highly ambiguous as to their part of speech tag

24 Sources of information
Syntagmatic: the tags of the other words
–AT JJ NN is common
–AT JJ VBP is impossible (or unlikely)
Lexical: look at the words themselves
–"The" → AT
–"Flour" → more likely to be a noun than a verb
–A tagger that always chooses the most common tag for each word is about 90% correct (often used as a baseline); see the sketch below
Most taggers use both sources.
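
A sketch of that baseline in the spirit of the NLTK book's lookup tagger: pick each word's most frequent tag and back off to NN for unseen words. The corpus choice and the in-sample evaluation are just for illustration, so the exact number will vary.

```python
import nltk
from nltk.corpus import brown

tagged_words = brown.tagged_words(categories='news')
# Most frequent tag for each word form seen in the corpus.
cfd = nltk.ConditionalFreqDist(tagged_words)
most_likely = {w: cfd[w].max() for w in cfd.conditions()}

baseline = nltk.UnigramTagger(model=most_likely,
                              backoff=nltk.DefaultTagger('NN'))  # NN for unseen words
# In-sample evaluation, so the number flatters the baseline somewhat.
print(baseline.evaluate(brown.tagged_sents(categories='news')))
```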

25 What does Tagging do? (modified from Diane Litman's version of Steve Bird's notes)
1. Collapses distinctions: lexical identity may be discarded, e.g., all personal pronouns tagged with PRP
2. Introduces distinctions: ambiguities may be resolved, e.g., "deal" tagged with NN or VB
3. Helps in classification and prediction

26 Why POS? (modified from Diane Litman's version of Steve Bird's notes)
A word's POS tells us a lot about the word and its neighbors:
–Limits the range of meanings (deal), of pronunciation for text-to-speech (OBject vs. obJECT, REcord vs. reCORD), or both (wind)
–Helps in stemming: saw[v] → see, saw[n] → saw
–Limits the range of following words
–Can help select nouns from a document for summarization
–Basis for partial parsing (chunk parsing)

27 Why POS? From Dan Klein’s cs 288 slides

28 Choosing a tagset (slide modified from Massimo Poesio's)
The choice of tagset greatly affects the difficulty of the problem.
Need to strike a balance between:
–Getting better information about context
–Making it possible for classifiers to do their job

29 Some of the best-known tagsets (slide modified from Massimo Poesio's)
–Brown corpus: 87 tags (more when tags are combined)
–Penn Treebank: 45 tags
–Lancaster UCREL C5 (used to tag the BNC): 61 tags
–Lancaster C7: 145 tags!

30 NLTK Whenever a corpus contains tagged text, the NLTK corpus interface will have a tagged_words() method.
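
For example (the printed outputs are indicative; the corpora may first need to be fetched with nltk.download):

```python
from nltk.corpus import brown, treebank

print(brown.tagged_words()[:5])
# e.g. [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL')]
print(treebank.tagged_words()[:5])
# e.g. [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS')]
```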

31 Tagging methods
–Hand-coded
–Statistical taggers: n-gram tagging, HMM, (maximum entropy)
–Brill (transformation-based) tagger

32 Hand-coded Tagger The Regular Expression Tagger
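
The slide's code is not in the transcript; here is a sketch along the lines of the NLTK book's regular-expression tagger, where the pattern list is just one illustrative choice:

```python
import nltk

patterns = [
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past
    (r'.*es$', 'VBZ'),                 # 3rd singular present
    (r'.*ould$', 'MD'),                # modals
    (r'.*\'s$', 'NN$'),                # possessive nouns
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN'),                     # default: tag everything else as a noun
]
regexp_tagger = nltk.RegexpTagger(patterns)
print(regexp_tagger.tag("The little cat sat on the mat".split()))
```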

33 Unigram Tagger Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token. –For example, it will assign the tag JJ to any occurrence of the word frequent, since frequent is used as an adjective (e.g. a frequent word) more often than it is used as a verb (e.g. I frequent this cafe).

34 Unigram Tagger
We train a UnigramTagger by specifying tagged sentence data as a parameter when we initialize the tagger. The training process involves inspecting the tag of each word and storing the most likely tag for each word in a dictionary kept inside the tagger.
We must be careful not to test the tagger on the same data it was trained on. A tagger that simply memorized its training data and made no attempt to construct a general model would get a perfect score, but would be useless for tagging new text. Instead, we should split the data, training on 90% and testing on the remaining 10% (or 75% and 25%), and calculate performance on previously unseen text.
–Note: this is the general procedure for learning systems. A sketch follows below.
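
A sketch of that procedure with NLTK (the news category and the 90/10 split are one possible choice):

```python
import nltk
from nltk.corpus import brown

tagged_sents = brown.tagged_sents(categories='news')
size = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:size], tagged_sents[size:]

unigram_tagger = nltk.UnigramTagger(train_sents)   # training: store most likely tag per word
print(unigram_tagger.evaluate(test_sents))         # accuracy on previously unseen text
```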

35 N-Gram Tagging
An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens.
A 1-gram tagger is another term for a unigram tagger: the context used to tag a token is just the text of the token itself. 2-gram taggers are also called bigram taggers, and 3-gram taggers are called trigram taggers.
[Figure: the context used by a trigram tagger]

36 N-Gram Tagging Why not 10-gram taggers?

37 N-Gram Tagging Why not 10-gram taggers? As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data. This is known as the sparse data problem, and is quite pervasive in NLP. As a consequence, there is a trade-off between the accuracy and the coverage of our results (and this is related to the precision/recall trade-off) Next week: sparsity
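
One standard way to soften this trade-off in NLTK, and a preview of the back-off idea from the last slide of this lecture, is to chain taggers so that an n-gram tagger falls back to simpler models when its context was never seen in training; a sketch:

```python
import nltk
from nltk.corpus import brown

tagged_sents = brown.tagged_sents(categories='news')
size = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:size], tagged_sents[size:]

t0 = nltk.DefaultTagger('NN')                       # last resort: tag everything NN
t1 = nltk.UnigramTagger(train_sents, backoff=t0)    # per-word most likely tag
t2 = nltk.BigramTagger(train_sents, backoff=t1)     # use the previous tag when the context is known

print(nltk.BigramTagger(train_sents).evaluate(test_sents))  # poor coverage on its own
print(t2.evaluate(test_sents))                              # much better with back-off
```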

38 Markov Model Tagger
Bigram tagger. Assumptions:
–Words are independent of each other
–A word's identity depends only on its tag
–A tag depends only on the previous tag
What does a GM with these assumptions look like?

39 Markov Model Tagger
[Figure: HMM chain of tags t1 → t2 → ... → tn, with each tag ti emitting word wi]

40 Markov Model Tagger: Training
For all tags ti:
–For all tags tj: compute C(ti, tj) = number of occurrences of ti followed by tj
For all tags ti:
–For all words wj: compute C(wj, ti) = number of occurrences of wj that are labeled with tag ti
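
A Python sketch of that training loop, collecting the counts and turning them into relative-frequency (maximum-likelihood) estimates; the Brown news section is an arbitrary choice of training data:

```python
from collections import Counter
from nltk.corpus import brown

transition = Counter()    # C(t_prev, t): tag t_prev followed by tag t
emission = Counter()      # C(w, t): word w labeled with tag t
prev_totals = Counter()   # C(t_prev): occurrences of t_prev as a context
tag_totals = Counter()    # C(t): occurrences of tag t

for sent in brown.tagged_sents(categories='news'):
    prev = '<s>'                      # sentence-start pseudo-tag
    for word, tag in sent:
        transition[(prev, tag)] += 1
        emission[(word, tag)] += 1
        prev_totals[prev] += 1
        tag_totals[tag] += 1
        prev = tag

# Maximum-likelihood (relative-frequency) estimates from the counts.
def p_tag_given_prev(tag, prev):
    return transition[(prev, tag)] / prev_totals[prev] if prev_totals[prev] else 0.0

def p_word_given_tag(word, tag):
    return emission[(word, tag)] / tag_totals[tag] if tag_totals[tag] else 0.0

print(p_word_given_tag('the', 'AT'), p_tag_given_prev('NN', 'AT'))
```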

41 Markov Model Tagger
Goal:
–Find the optimal tag sequence for a given sentence
–Use the Viterbi algorithm (a sketch follows below)
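
A compact sketch of Viterbi decoding for a bigram HMM tagger, reusing the estimates from the previous sketch. Smoothing is omitted, so words or transitions never seen in training get probability zero; a real tagger would smooth or back off.

```python
def viterbi(words, tags, p_trans, p_emit, start='<s>'):
    """Most probable tag sequence for `words` under a bigram HMM.
    p_trans(tag, prev) = P(tag | prev); p_emit(word, tag) = P(word | tag)."""
    # best[i][t] = (probability of the best path ending in tag t at position i, backpointer)
    best = [{t: (p_trans(t, start) * p_emit(words[0], t), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            prob, prev = max(
                (best[i - 1][p][0] * p_trans(t, p) * p_emit(words[i], t), p)
                for p in tags)
            column[t] = (prob, prev)
        best.append(column)
    # Read off the best final tag, then follow backpointers to recover the path.
    last = max(best[-1], key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))

# Usage with the estimates from the previous sketch (tags seen in training).
tagset = list(tag_totals)
print(viterbi("the jury said it was a problem".split(), tagset,
              p_tag_given_prev, p_word_given_tag))
```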

42 Sequence free tagging? From Dan Klein’s cs 288 slides

43 Sequence-free tagging? Solution: maximum entropy sequence models (MEMMs, maximum entropy Markov models; CRFs, conditional random fields). From Dan Klein's cs 288 slides

44 Rule-Based Tagger (modified from Diane Litman's version of Steve Bird's notes)
The linguistic complaint:
–Where is the linguistic knowledge of a tagger? Just massive tables of numbers.
–Aren't there any linguistic insights that could emerge from the data?
–We could thus use handcrafted sets of rules to tag input sentences; for example, if a word follows a determiner, tag it as a noun.

45 The Brill tagger (transformation-based tagger) (slide modified from Massimo Poesio's)
An example of transformation-based learning
–Basic idea: do a quick job first (using frequency), then revise it using contextual rules
Very popular (freely available, works fairly well)
–Probably the most widely used tagger (especially outside NLP)
–... but not the most accurate: 96.6% / 82.0%
A supervised method: requires a tagged corpus

46 Brill Tagging: In more detail
Start with simple (less accurate) rules... learn better ones from a tagged corpus:
–Tag each word initially with its most likely POS
–Examine the set of candidate transformations to see which most improves tagging decisions compared to the tagged corpus
–Re-tag the corpus using the best transformation
–Repeat until, e.g., performance doesn't improve
–Result: a tagging procedure (an ordered list of transformations) that can be applied to new, untagged text

47 An example (slide modified from Massimo Poesio's)
Examples:
–They are expected to race tomorrow.
–The race for outer space.
Tagging algorithm:
1. Tag all uses of "race" as NN (the most likely tag in the Brown corpus):
They are expected to race/NN tomorrow; the race/NN for outer space
2. Use a transformation rule to replace the tag NN with VB for all uses of "race" preceded by the tag TO:
They are expected to race/VB tomorrow; the race/NN for outer space
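
A toy sketch of applying that single transformation to an initially tagged sentence; it illustrates the rule itself, not NLTK's Brill tagger implementation:

```python
def apply_rule(tagged):
    """Replace NN with VB whenever the preceding tag is TO."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == 'NN' and out[i - 1][1] == 'TO':
            out[i] = (word, 'VB')
    return out

initial = [('They', 'PRP'), ('are', 'VBP'), ('expected', 'VBN'),
           ('to', 'TO'), ('race', 'NN'), ('tomorrow', 'NN')]
print(apply_rule(initial))
# ... ('to', 'TO'), ('race', 'VB'), ('tomorrow', 'NN')]
```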

48 What gets learned? [from Brill 95]
[Figure: tag-triggered transformations and morphology-triggered transformations]
Rules are linguistically interpretable.

49 Tagging accuracies (overview) From Dan Klein’s cs 288 slides

50 Tagging accuracies From Dan Klein’s cs 288 slides

51 Tagging accuracies
Taggers are already pretty good on WSJ (Wall Street Journal) text... What we need are taggers that work on other text, for example in technical domains!
Performance depends on several factors:
–The amount of training data
–The tag set (the larger, the harder the task)
–The difference between the training and testing corpora
–Unknown words

52 Common Errors From Dan Klein’s cs 288 slides

53 Next week
What happens when the counts we need are zero, because a word or context was never seen in training? Sparsity.
Methods to deal with it
–For example, back-off: if the higher-order estimate is unavailable, use a lower-order estimate instead (a sketch follows below)
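
A minimal sketch of the back-off idea for the tagger's transition probabilities, reusing the counts from the Markov-model training sketch above: if the bigram context was never seen, fall back to the unigram estimate. Real back-off schemes (e.g., Katz back-off) also discount and renormalize, which is skipped here.

```python
def p_tag_backoff(tag, prev):
    """P(tag | prev) with a crude back-off: bigram relative frequency if the
    bigram was seen in training, otherwise the unigram frequency of the tag."""
    if prev_totals[prev] > 0 and transition[(prev, tag)] > 0:
        return transition[(prev, tag)] / prev_totals[prev]
    # Back off to the unigram estimate P(tag).
    return tag_totals[tag] / sum(tag_totals.values())

print(p_tag_backoff('NN', 'AT'), p_tag_backoff('NN', 'UNSEEN-TAG'))
```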

54 Administrivia
Assignment 2 is out
–Due September 22
–Grades and "best" solutions to assignment 1 coming soon
Reading for next class
–Chapter 6 of Statistical NLP

