Allen’s Chapter 7 J&M’s Chapters 8 and 12

Statistical Methods Allen’s Chapter 7 J&M’s Chapters 8 and 12

Statistical Methods
Large data sets (corpora) of natural language make it possible to use statistical methods that were not feasible before.
The Brown Corpus includes about a million words annotated with part-of-speech (POS) tags.
The Penn Treebank contains full syntactic annotations.

Part of Speech Tagging
Determining the most likely category of each word in a sentence that contains ambiguous words.
Example: finding the POS of words that can be either nouns or verbs.
We need two random variables:
C, which ranges over the POS categories {N, V}
W, which ranges over all possible words

Part of Speech Tagging (Cont.)
We don’t have the true probabilities, but we can estimate them from large data sets.
Suppose there is a corpus in which flies occurs 1000 times: 400 times in a noun sense and 600 times in a verb sense.
Dividing each count by the number of words in the corpus gives P(flies), P(flies & N) and P(flies & V).
The corpus size cancels in the conditional probability:
P(V | flies) = P(V & flies) / P(flies) = 600 / 1000 = 0.6
So on 60% of occasions flies is a verb.

Estimating Probabilities
We want to use probabilities to predict future events.
For example, we use P(V | flies) = 0.6 to predict that the next “flies” is more likely to be a verb.
Estimating a probability by relative frequency in this way is called maximum likelihood estimation (MLE).
Generally, the larger the data set, the more accurate the estimate.

Estimating Probabilities (Cont.)
Estimating the probability of heads when tossing a coin (true value 0.5)
Acceptable margin of error: ( )
The more trials performed, the more accurate the estimate:
2 trials: 50% chance of reaching an acceptable result
3 trials: 75% chance
4 trials: 87.5% chance
8 trials: 93% chance
12 trials: 95% chance
…
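The chances listed above can be checked with a short binomial calculation. This is only a sketch: the margin of error is missing from the slide, so the code assumes that an estimate counts as acceptable when it falls between 0.25 and 0.75 inclusive, and the function name chance_within_margin is made up for this example.

```python
from math import comb

def chance_within_margin(trials, low=0.25, high=0.75, p=0.5):
    """Probability that the observed fraction of heads lies in [low, high].

    low and high are an assumed margin; the slide does not state it.
    """
    total = 0.0
    for heads in range(trials + 1):
        if low <= heads / trials <= high:
            # Binomial probability of seeing exactly `heads` heads.
            total += comb(trials, heads) * p**heads * (1 - p)**(trials - heads)
    return total

for n in (2, 3, 4, 8, 12):
    print(n, round(chance_within_margin(n), 3))
```

Under this assumption the figures for 2, 3, 4 and 8 trials come out as 50%, 75%, 87.5% and about 93%. For 12 trials the same assumption gives about 96%, slightly above the 95% quoted, so the original margin may have been defined a little differently.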

Estimating tossing a coin outcome

So the larger the data set the better, but there is the problem of sparse data.
The Brown Corpus contains about a million words but far fewer distinct words, so one would expect each word to occur about 20 times on average.
In fact, a very large number of those words occur fewer than 5 times.

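For a random variable X with values $x_i$, probabilities are estimated from values $V_i$ computed by counting the number of times $X = x_i$:
$P(X = x_i) \approx V_i / \sum_j V_j$
Maximum likelihood estimation (MLE) uses $V_i = |x_i|$, the number of times $x_i$ was observed.
Expected likelihood estimation (ELE) uses $V_i = |x_i| + 0.5$.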

MLE vs ELE
Suppose a word w doesn’t occur in the corpus.
We want to estimate the probability of w occurring in each of 40 classes L1 … L40.
We have a random variable X, where X = xi only if w appears in word category Li.
By MLE, P(Li | w) is undefined because the divisor is zero.
By ELE, P(Li | w) ≈ 0.5 / 20 = 0.025.
Suppose w occurs 5 times (4 times as a noun and once as a verb).
By MLE, P(N | w) = 4/5 = 0.8.
By ELE, P(N | w) = 4.5/25 = 0.18.
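The two estimators differ only in the constant added to each count, which the following sketch makes explicit. The function name estimate and the use of L1 and L2 to stand in for the noun and verb classes are assumptions made for this example.

```python
def estimate(counts, add=0.0):
    """Return P(x_i) = V_i / sum(V), with V_i = count_i + add.

    add=0.0 gives MLE, add=0.5 gives ELE.
    Returns None when the divisor is zero (the estimate is undefined).
    """
    total = sum(c + add for c in counts.values())
    if total == 0:
        return None
    return {k: (c + add) / total for k, c in counts.items()}

categories = [f"L{i}" for i in range(1, 41)]
unseen = {c: 0 for c in categories}   # w never occurs in the corpus
seen = dict(unseen, L1=4, L2=1)       # w occurs 4 times as L1 (N), once as L2 (V)

print(estimate(unseen))               # MLE for an unseen word: undefined
print(estimate(unseen, 0.5)["L1"])    # ELE for an unseen word
print(estimate(seen)["L1"])           # MLE after 5 observations
print(estimate(seen, 0.5)["L1"])      # ELE after 5 observations
```

With only 5 observations spread over 40 classes, the 0.5 added to every class dominates the ELE estimate, which is why it falls so far below the MLE value.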

Evaluation
The data set is divided into:
Training set (80%-90% of the data)
Test set (10%-20%)
Cross-validation:
Repeatedly remove a different part of the corpus as the test set,
train on the remainder of the corpus,
then evaluate on the held-out test set.
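The cross-validation loop can be sketched as a generator of training and test splits. The function name and the choice to hold out every k-th sentence are assumptions made for this example; contiguous blocks would work equally well.

```python
def cross_validation_splits(sentences, folds=10):
    """Yield (training, test) pairs, holding out a different slice each time."""
    for k in range(folds):
        # Every sentence whose index is congruent to k goes to the test set.
        test = sentences[k::folds]
        train = [s for i, s in enumerate(sentences) if i % folds != k]
        yield train, test

corpus = list(range(20))  # stand-in for 20 tagged sentences
for train, test in cross_validation_splits(corpus, folds=5):
    print(len(train), len(test))
```

With 5 folds each test set holds 20% of the data, which matches the proportions given above.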

Part of speech tagging
Simplest algorithm: choose the interpretation that occurs most frequently.
“flies” in the sample corpus was a verb 60% of the time.
The success rate of this algorithm is 90%.
Over 50% of the words appearing in most corpora are unambiguous.
To improve the success rate, use the tags before or after the word under examination.
If “flies” is preceded by the word “the”, it is almost certainly a noun.

Part of speech tagging (Cont.)
Bayes’ rule: $P(A \mid B) = P(A) \, P(B \mid A) / P(B)$
Given a sequence of words $w_1 \dots w_t$, find a sequence of lexical categories $C_1 \dots C_t$ such that $P(C_1 \dots C_t \mid w_1 \dots w_t)$ is maximized.
By Bayes’ rule this equals $P(C_1 \dots C_t) \, P(w_1 \dots w_t \mid C_1 \dots C_t) / P(w_1 \dots w_t)$.
The denominator is the same for every category sequence, so the problem reduces to finding $C_1 \dots C_t$ such that $P(C_1 \dots C_t) \, P(w_1 \dots w_t \mid C_1 \dots C_t)$ is maximized.
These probabilities can be estimated by making some independence assumptions.

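Using information about:
The previous word category: bigram
The two previous word categories: trigram
The n-1 previous word categories: n-gram
Using the bigram model:
$P(C_1 \dots C_t) \approx \prod_{i=1}^{t} P(C_i \mid C_{i-1})$
For example, $P(\text{ART N V N}) \approx P(\text{ART} \mid \varnothing) \, P(\text{N} \mid \text{ART}) \, P(\text{V} \mid \text{N}) \, P(\text{N} \mid \text{V})$, where $\varnothing$ marks the start of the sentence.
$P(w_1 \dots w_t \mid C_1 \dots C_t) \approx \prod_{i=1}^{t} P(w_i \mid C_i)$
We are looking for a sequence $C_1 \dots C_t$ such that $\prod_{i=1}^{t} P(C_i \mid C_{i-1}) \, P(w_i \mid C_i)$ is maximized.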

The information needed by this new formula can be extracted from the corpus:
P(Ci = V | Ci-1 = N) = Count(N at position i-1 and V at position i) / Count(N at position i-1) (Fig. 7-4)
P(the | ART) = Count(times the is an ART) / Count(times an ART occurs) (Fig. 7-6)
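Both kinds of counts can be collected in a single pass over a tagged corpus. The sketch below assumes the corpus is a list of sentences, each a list of (word, tag) pairs; the function name estimate_tables, the start marker and the two toy sentences are made up for this example.

```python
from collections import Counter

def estimate_tables(tagged_sentences, start="<s>"):
    """Count-based estimates of P(C_i | C_{i-1}) and P(w | C)."""
    tag_count = Counter()        # times each category occurs
    prev_count = Counter()       # times each category occurs at position i-1
    bigram_count = Counter()     # (previous category, category) pairs
    word_tag_count = Counter()   # (word, category) pairs
    for sentence in tagged_sentences:
        prev = start
        for word, tag in sentence:
            bigram_count[(prev, tag)] += 1
            prev_count[prev] += 1
            word_tag_count[(word, tag)] += 1
            tag_count[tag] += 1
            prev = tag
    bigram = {k: n / prev_count[k[0]] for k, n in bigram_count.items()}
    lexical = {k: n / tag_count[k[1]] for k, n in word_tag_count.items()}
    return bigram, lexical

toy = [
    [("the", "ART"), ("flies", "N"), ("like", "V"), ("flowers", "N")],
    [("flies", "N"), ("like", "V"), ("a", "ART"), ("flower", "N")],
]
bigram, lexical = estimate_tables(toy)
print(bigram[("N", "V")], lexical[("flies", "N")])
```

Note that the divisor for a bigram counts only the occurrences of a category that are followed by another word, which matches Count(N at position i-1) above.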

Using an Artificial corpus
An artificial corpus was generated with 300 sentences over the categories ART, N, V and P.
It contains 1998 words: 833 nouns, 300 verbs, 558 articles and 307 prepositions.
To deal with the problem of sparse data, a small minimum probability is assumed.

Bigram probabilities from the generated corpus

Word counts in the generated corpus
           N     V   ART     P   TOTAL
flies     21    23     0     0      44
fruit     49     5     1     0      55
like      10    30     0    21      61
a          1     0   201     0     202
the        1     0   300     2     303
flower    53    15     0     0      68
flowers   42    16     0     0      58
birds     64     1     0     0      65
others   592   210    56   284    1142
TOTAL    833   300   558   307    1998

How do we find the sequence $C_1 \dots C_t$ that maximizes $\prod_{i=1}^{t} P(C_i \mid C_{i-1}) \, P(w_i \mid C_i)$?
Brute-force search: enumerate all possible sequences.
With N categories and T words, there are $N^T$ sequences.
Under the bigram model, the probability of $w_i$ being in category $C_i$ depends only on $C_{i-1}$.
The process can be modeled by a special form of probabilistic finite state machine (Fig. 7-7).

Markov Chain
Probability of a sequence of 4 words having the categories ART N V N:
0.71 * 1 * 0.43 * 0.35 = 0.107
The representation is accurate only if the probability of a category occurring depends only on the one category before it.
This is called the Markov assumption.
The network is called a Markov chain.

Hidden Markov Model (HMM)
A Markov network can be extended to include the lexical-generation probabilities, too.
Each node can have an output probability for each of its possible outputs.
The output probabilities are exactly the lexical-generation probabilities shown in Fig. 7-6.
A Markov network with output probabilities is called a Hidden Markov Model (HMM).

The word hidden indicates that, for a specific sequence of words, it is not clear what state the Markov model is in.
For instance, the word “flies” could be generated from state N with a probability of 0.025, or from state V with a probability of 0.076.
As a result, it is no longer trivial to compute the probability of a sequence of words from the network.

The probability that the sequence N V ART N generates the output Flies like a flower is computed as follows.
The probability of the path N V ART N is 0.29 * 0.43 * 0.65 * 1 = 0.081.
The probability of the output being Flies like a flower is P(flies | N) * P(like | V) * P(a | ART) * P(flower | N) = 0.025 * 0.1 * 0.36 * 0.06 = $5.4 \times 10^{-5}$.
The likelihood that the HMM would generate the sentence along this path is $0.081 \times 5.4 \times 10^{-5} \approx 4.4 \times 10^{-6}$.
Therefore, the probability of a sentence $w_1 \dots w_t$ being generated together with a category sequence $C_1 \dots C_t$ is $\prod_{i=1}^{t} P(C_i \mid C_{i-1}) \, P(w_i \mid C_i)$.
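The calculation for N V ART N and Flies like a flower can be reproduced in a few lines. The tables below contain only the probabilities quoted in this deck; the dictionary layout, the start marker and the function names are assumptions made for this example.

```python
from math import prod

START = "<s>"
# Bigram (transition) probabilities quoted in the text.
bigram = {(START, "N"): 0.29, ("N", "V"): 0.43, ("V", "ART"): 0.65, ("ART", "N"): 1.0}
# Lexical-generation (output) probabilities quoted in the text.
lexical = {("flies", "N"): 0.025, ("like", "V"): 0.1,
           ("a", "ART"): 0.36, ("flower", "N"): 0.06}

def path_probability(tags):
    """Product of P(C_i | C_{i-1}) along the category sequence."""
    return prod(bigram[(prev, tag)] for prev, tag in zip([START] + tags, tags))

def output_probability(words, tags):
    """Product of P(w_i | C_i) for each word and its category."""
    return prod(lexical[(w, t)] for w, t in zip(words, tags))

words = ["flies", "like", "a", "flower"]
tags = ["N", "V", "ART", "N"]
joint = path_probability(tags) * output_probability(words, tags)
print(path_probability(tags), output_probability(words, tags), joint)
```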

Markov Chain

Viterbi Algorithm

Flies like a flower

Flies like a flower
Brute-force search takes $N^T$ steps.
The Viterbi algorithm takes $K \cdot T \cdot N^2$ steps, for some constant K.
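A minimal sketch of the Viterbi search for Flies like a flower follows. It is not the deck's own pseudocode (that figure is not reproduced here). It uses only the probabilities quoted in the text; every transition that is not quoted gets a hypothetical floor of 0.0001, and unlisted word/category pairs are treated as impossible.

```python
START = "<s>"
TAGS = ["N", "V", "ART", "P"]
FLOOR = 0.0001  # hypothetical value for transitions not quoted in the text

bigram = {(START, "ART"): 0.71, (START, "N"): 0.29, ("ART", "N"): 1.0,
          ("N", "V"): 0.43, ("V", "N"): 0.35, ("V", "ART"): 0.65}
lexical = {("flies", "N"): 0.025, ("flies", "V"): 0.076, ("like", "V"): 0.1,
           ("a", "ART"): 0.36, ("flower", "N"): 0.06}

def viterbi(words):
    """Return (probability, tags) of the most likely category sequence."""
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {t: (bigram.get((START, t), FLOOR) * lexical.get((words[0], t), 0.0), [t])
            for t in TAGS}
    for word in words[1:]:
        step = {}
        for tag in TAGS:
            emit = lexical.get((word, tag), 0.0)
            # Keep only the best predecessor for each tag: N * N work per word.
            step[tag] = max(
                ((p * bigram.get((prev, tag), FLOOR) * emit, path + [tag])
                 for prev, (p, path) in best.items()),
                key=lambda item: item[0],
            )
        best = step
    return max(best.values(), key=lambda item: item[0])

probability, tags = viterbi(["flies", "like", "a", "flower"])
print(tags, probability)
```

Because each word only needs the best path into each of the N categories from the previous word, the work grows as T * N * N rather than as N to the power T.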

Getting Reliable Statistics (smoothing)
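Suppose we have 40 categories.
To collect unigrams, at least 40 samples, one for each category, are needed.
For bigrams, 1600 samples are needed.
For trigrams, 64,000 samples are needed.
For 4-grams, 2,560,000 samples are needed.
When the data is too sparse for the longer n-grams, the estimates can be interpolated:
$P(C_i \mid C_1 \dots C_{i-1}) = \lambda_1 P(C_i) + \lambda_2 P(C_i \mid C_{i-1}) + \lambda_3 P(C_i \mid C_{i-2} C_{i-1})$, with $\lambda_1 + \lambda_2 + \lambda_3 = 1$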
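The sample counts and the interpolation formula can be sketched as follows. The weights (0.1, 0.3, 0.6) are hypothetical; the slide only requires that they sum to 1, and in practice they would be tuned on held-out data.

```python
CATEGORIES = 40
# Minimum number of samples needed to see every n-gram of categories once.
samples_needed = {n: CATEGORIES ** n for n in range(1, 5)}
print(samples_needed)

def interpolated(p_unigram, p_bigram, p_trigram, lambdas=(0.1, 0.3, 0.6)):
    """Weighted mix of unigram, bigram and trigram estimates."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9, "weights must sum to 1"
    return l1 * p_unigram + l2 * p_bigram + l3 * p_trigram

# A trigram that was never observed still gets a nonzero estimate.
print(interpolated(p_unigram=0.2, p_bigram=0.5, p_trigram=0.0))
```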

Statistical Parsing
Corpus-based methods offer new ways to control parsers.
We could use statistical methods to identify the common structures of a language.
We can choose the most likely interpretation when a sentence is ambiguous.
This might lead to much more efficient parsers that are almost deterministic.

Statistical Parsing
What is the input of a statistical parser?
The input is the output of a POS tagging algorithm.
If the POS tags are accurate, lexical ambiguity is removed.
But if the tagging is wrong, the parser cannot find the correct interpretation, or may find a valid but implausible interpretation.
With 95% per-word accuracy, the chance of correctly tagging every word of an 8-word sentence is $0.95^8 \approx 0.66$, and of a 15-word sentence $0.95^{15} \approx 0.46$.
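The sentence-level figures follow from raising the per-word accuracy to the sentence length, assuming that tagging errors on different words are independent. The function name is made up for this example.

```python
def sentence_accuracy(per_word_accuracy, length):
    """Chance that every word in a sentence is tagged correctly,
    assuming errors on different words are independent."""
    return per_word_accuracy ** length

for n in (8, 15):
    print(n, round(sentence_accuracy(0.95, n), 2))
```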

Obtaining Lexical probabilities
A better approach is:
computing the probability that each word appears in each of its possible lexical categories, and
combining these probabilities with some method of assigning probabilities to rule use in the grammar.
The context-independent probability that the lexical category of a word w is $L_j$ can be estimated by:
$P(L_j \mid w) = \text{count}(L_j \,\&\, w) / \sum_{i=1}^{N} \text{count}(L_i \,\&\, w)$

Context-independent lexical categories
$P(L_j \mid w) = \text{count}(L_j \,\&\, w) / \sum_{i=1}^{N} \text{count}(L_i \,\&\, w)$
P(ART | the) = 300 / 303 = 0.99
P(N | flies) = 21 / 44 = 0.48

Context dependent lexical probabilities
A better estimate can be obtained by computing how likely it is that category $L_i$ occurs at position t, over all category sequences for the input $w_1 \dots w_t$.
Instead of just finding the sequence with the maximum probability, we add up the probabilities of all sequences that end in $w_t/L_i$.
The probability that flies is a noun in the sentence The flies like flowers is calculated by adding the probabilities of all sequences that end with flies as a noun.

Context-dependent lexical probabilities
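Similarly, there are three nonzero sequences ending with flies as a V, with a total value of $1.13 \times 10^{-5}$.
$P(\text{The flies}) = 9.58 \times 10^{-3} + 1.13 \times 10^{-5} = 9.591 \times 10^{-3}$
$P(\text{flies/N} \mid \text{The flies}) = P(\text{flies/N} \,\&\, \text{The flies}) / P(\text{The flies}) = 9.58 \times 10^{-3} / 9.591 \times 10^{-3} \approx 0.9988$
$P(\text{flies/V} \mid \text{The flies}) = P(\text{flies/V} \,\&\, \text{The flies}) / P(\text{The flies}) = 1.13 \times 10^{-5} / 9.591 \times 10^{-3} \approx 0.0012$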

Forward Probabilities
The probability of producing the words $w_1 \dots w_t$ and ending in state $w_t/L_i$ is called the forward probability $\alpha_i(t)$ and is defined as:
$\alpha_i(t) = P(w_t/L_i \,\&\, w_1 \dots w_t)$
In the flies like flowers, $\alpha_2(3)$ is the sum of the values computed for all sequences ending in a V (the second category) in position 3, for the input the flies like.
$P(w_t/L_i \mid w_1 \dots w_t) = P(w_t/L_i \,\&\, w_1 \dots w_t) / P(w_1 \dots w_t) \approx \alpha_i(t) / \sum_{j=1}^{N} \alpha_j(t)$
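The forward probabilities can be computed left to right, summing over predecessors where the Viterbi search would take a maximum. The sketch below makes several assumptions: transitions not quoted in the deck get a hypothetical floor of 0.0001, lexical probabilities are derived from the word counts of the generated corpus, and the single leftover count in the row for the is assigned to N.

```python
START = "<s>"
TAGS = ["N", "V", "ART", "P"]
FLOOR = 0.0001  # hypothetical value for transitions not quoted in the text

bigram = {(START, "ART"): 0.71, (START, "N"): 0.29, ("ART", "N"): 1.0,
          ("N", "V"): 0.43, ("V", "N"): 0.35, ("V", "ART"): 0.65}
# count(word & category) / count(category), from the word-count table.
lexical = {("the", "ART"): 300 / 558, ("the", "P"): 2 / 307,
           ("the", "N"): 1 / 833,  # assumption: leftover count goes to N
           ("flies", "N"): 21 / 833, ("flies", "V"): 23 / 300}

def forward(words):
    """Return alpha_i(t) for the last word: P(w_t/L_i & w_1 ... w_t)."""
    alpha = {t: bigram.get((START, t), FLOOR) * lexical.get((words[0], t), 0.0)
             for t in TAGS}
    for word in words[1:]:
        alpha = {t: lexical.get((word, t), 0.0)
                    * sum(alpha[prev] * bigram.get((prev, t), FLOOR) for prev in TAGS)
                 for t in TAGS}
    return alpha

def lexical_posteriors(words):
    """Normalize the forward probabilities into P(w_t/L_i | w_1 ... w_t)."""
    alpha = forward(words)
    total = sum(alpha.values())
    return {t: a / total for t, a in alpha.items()}

print(lexical_posteriors(["the", "flies"]))
```

With these assumed tables, flies after the comes out as a noun with probability above 0.99. The exact digits depend on the floor value and on rounding, so they should not be expected to match the figures in the text exactly.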

Forward Probabilities

Context dependent lexical Probabilities

Backward Probability
The backward probability $\beta_j(t)$ is the probability of producing the sequence $w_t \dots w_T$ beginning from the state $w_t/L_j$.
$P(w_t/L_i) \approx \alpha_i(t) \, \beta_i(t) / \sum_{j=1}^{N} \alpha_j(t) \, \beta_j(t)$

Probabilistic Context-free Grammars
CFGs can be generalized to probabilistic context-free grammars (PCFGs).
We need some statistics on rule use.
The simplest approach is to count the number of times each rule is used in a corpus of parsed sentences.
If category C has rules $R_1 \dots R_m$, then
$P(R_j \mid C) = \text{count}(\text{times } R_j \text{ used}) / \sum_{i=1}^{m} \text{count}(\text{times } R_i \text{ used})$
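The rule statistics can be gathered by counting rule applications grouped by their left-hand side. The sketch assumes the parsed corpus has already been flattened into (left-hand side, right-hand side) pairs; the function name and the four example rule uses are made up.

```python
from collections import Counter

def rule_probabilities(rule_uses):
    """Estimate P(R_j | C) from (lhs, rhs) pairs, one per rule application."""
    counts = Counter(rule_uses)
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    # Each rule's count divided by the total for rules with the same category.
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

uses = [("NP", ("ART", "N")), ("NP", ("ART", "N")), ("NP", ("N",)),
        ("VP", ("V", "NP"))]
probs = rule_probabilities(uses)
print(probs[("NP", ("ART", "N"))])
```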

Probabilistic Context-free Grammars

Independence assumption
An algorithm similar to the Viterbi algorithm can be developed to find the most probable parse tree for an input.
Certain independence assumptions must be made.
The probability of a constituent being derived by a rule $R_j$ is assumed to be independent of how the constituent is used as a subconstituent.
For example, the probabilities of the NP rules are the same whether the NP is the subject, the object of a verb, or the object of a preposition.
This assumption is not valid: a subject NP is much more likely to be a pronoun than an object NP.

Inside Probability
The probability that a constituent C generates a sequence of words $w_i, w_{i+1}, \dots, w_j$ (written as $w_{i,j}$) is called the inside probability and is denoted $P(w_{i,j} \mid C)$.
It is called the inside probability because it assigns a probability to the word sequence inside the constituent.

Inside Probabilities
How are inside probabilities derived?
For lexical categories, they are the same as the lexical-generation probabilities.
P(flower | N) is the inside probability that the constituent N is realized as the word flower (0.06 in Fig. 7-6).
Using the lexical-generation probabilities, the inside probabilities of non-lexical constituents can be computed.

Probabilistic chart parsing
In parsing, we are interested in finding the most likely parse rather than the overall probability of a given sentence.
We can use a chart parser for this purpose.
When entering an entry E of category C using rule i with n subconstituents corresponding to entries E1 … En, then
P(E) = P(Rule i | C) * P(E1) * … * P(En)
For lexical categories, it is better to use forward probabilities rather than lexical-generation probabilities.
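The chart-entry formula is a single product, sketched below for an NP built from a and flower. The rule probability of 0.55 for NP → ART N is a hypothetical value chosen for illustration; the two lexical probabilities are the ones quoted earlier in the deck.

```python
from math import prod

def constituent_probability(rule_probability, sub_probabilities):
    """P(E) = P(Rule i | C) * P(E1) * ... * P(En)."""
    return rule_probability * prod(sub_probabilities)

p_art = 0.36   # P(a | ART), quoted in the text
p_noun = 0.06  # P(flower | N), quoted in the text
p_rule = 0.55  # hypothetical P(NP -> ART N | NP)

print(constituent_probability(p_rule, [p_art, p_noun]))
```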

A flower

Probabilistic Parsing
This technique identifies the correct parse only 50% of the time.
The reason is that the independence assumption is too radical.
One of the crucial issues is the handling of lexical items.
A context-free model does not consider lexical preferences.
The parser prefers to attach a PP to the V rather than to the NP, and so fails to find the correct structure for sentences in which the PP should be attached to the NP.

Best-First Parsing
Explore higher-probability constituents first.
Much of the search space, containing the lower-probability constituents, is never explored.
The chart parser’s agenda is organized as a priority queue.
The arc extension algorithm needs to be modified.
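An agenda organized as a priority queue can be sketched with Python's heapq module. The class name Agenda and the constituent labels are made up for this example. heapq is a min-heap, so probabilities are stored negated to pop the most probable constituent first.

```python
import heapq
from itertools import count

class Agenda:
    """Priority queue that returns the highest-probability constituent first."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker so constituents are never compared

    def add(self, probability, constituent):
        heapq.heappush(self._heap, (-probability, next(self._order), constituent))

    def pop(self):
        neg_probability, _, constituent = heapq.heappop(self._heap)
        return -neg_probability, constituent

    def __bool__(self):
        return bool(self._heap)

agenda = Agenda()
agenda.add(0.1, "NP1")
agenda.add(0.5, "N1")
agenda.add(0.3, "VP1")
while agenda:
    print(agenda.pop())
```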

New arc extension for Prob. Chart Parser

The man put a bird in the house
The best-first parser finds the correct parse after generating 65 constituents.
The standard bottom-up parser generates 158 constituents in total, and 106 constituents before it finds the first answer.
So best-first parsing is a significant improvement.

Best First Parsing
It finds the most probable interpretation first.
The probability of a constituent is always lower than or equal to the probability of any of its subconstituents.
Suppose S2 with probability p2 is found after S1 with probability p1. Then p2 cannot be higher than p1; otherwise the subconstituents of S2 would all have probabilities higher than p1, so they would have been found before S1, and thus S2 would have been found sooner, too.

Problem of multiplication
In practice, with large grammars, probabilities drop quickly because of the repeated multiplications.
Other scoring functions can be used, for example:
Score(C) = MIN(Score(C → C1 … Cn), Score(C1), …, Score(Cn))
But MIN leads to only 39% correct results.

Context-dependent probabilistic parsing
The best-first algorithm improves efficiency, but has no effect on accuracy.
Computing rule probabilities based on some context-dependent lexical information can improve accuracy.
The first word of a constituent is often its head word.
So we compute the probability of rules based on the first word of constituents: P(R | C, w)

Context-dependent probabilistic parsing
P(R | C, w) = Count(times R is used for category C starting with w) / Count(times category C starts with w)
Singular nouns rarely occur alone as a noun phrase (NP → N).
Plural nouns rarely act as a modifying noun (NP → N N).
Context-dependent rules also encode verb preferences for subcategorizations.

Rule probabilities based on the first word of constituents

Context-Dependent Parser Accuracy

The man put the bird in the house
P(VP  V NP PP | VP, put) = 0.93 * 0.99 * 0.76 * 0.76 = 0.54 P(VP V NP | VP, put) =

The man likes the bird in the house
P(VP → V NP PP | VP, like) = 0.1
P(VP → V NP | VP, like) = 0.054

Context-dependent rules
The accuracy of the parser is still only 66%.
Make the rule probabilities relative to a larger fragment of the input (bigram, trigram, …).
Use other important words, such as prepositions.
The more selective the lexical categories, the more predictive the estimates can be (provided that there is enough data).
Other closed-class words such as articles, quantifiers and conjunctions can also be used (i.e., treated individually).
But what about open-class words such as verbs and nouns? One option is to cluster similar words.

Handling Unknown Words
An unknown word will disrupt the parse.
Suppose we have a trigram model of the data.
If w3 in the sequence of words w1 w2 w3 is unknown, and w1 and w2 are of categories C1 and C2, pick the category C for w3 such that P(C | C1 C2) is maximized.
For instance, if C2 is ART, then C will probably be a NOUN (or an ADJECTIVE).
Morphology can also help: unknown words ending in -ing are likely to be VERBs, and those ending in -ly are likely to be ADVERBs.
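The two heuristics can be combined in a small guessing function. This is a sketch under stated assumptions: the function name, the order in which the heuristics are tried (morphology first, then the trigram model) and the toy trigram values are all made up for this example.

```python
def guess_category(word, c1, c2, trigram, categories):
    """Guess a category for an unknown word.

    Tries the suffix heuristics first, then picks the category C that
    maximizes P(C | C1 C2) in the given trigram table.
    """
    if word.endswith("ing"):
        return "V"
    if word.endswith("ly"):
        return "ADV"
    return max(categories, key=lambda c: trigram.get((c1, c2, c), 0.0))

categories = ["N", "V", "ART", "P", "ADJ", "ADV"]
# Hypothetical trigram estimates P(C | V ART).
trigram = {("V", "ART", "N"): 0.8, ("V", "ART", "ADJ"): 0.2}

print(guess_category("blorf", "V", "ART", trigram, categories))
print(guess_category("blorfing", "V", "ART", trigram, categories))
```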
