
1 Basic Probability and Statistics CIS 8590 – Fall 2008 NLP

2 Outline Basic concepts in probability theory Bayes’ rule Random variables and distributions

3 Definition of Probability Experiment: toss a coin twice. Sample space: the set of possible outcomes of an experiment – S = {HH, HT, TH, TT}. Event: a subset of possible outcomes – A = {HH}, B = {HT, TH}. Probability of an event: a number Pr(A) assigned to an event A – Axiom 1: Pr(A) ≥ 0 – Axiom 2: Pr(S) = 1 – Axiom 3: for every sequence of pairwise disjoint events A_1, A_2, …, Pr(∪_i A_i) = Σ_i Pr(A_i) – Example: Pr(A) = n(A)/N (relative frequency; frequentist statistics)
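As a minimal sketch (my own, not from the slides) of the frequentist reading Pr(A) = n(A)/N, here is the two-toss sample space enumerated in Python:

```python
from itertools import product

# Sample space for tossing a coin twice: S = {HH, HT, TH, TT}
S = ["".join(toss) for toss in product("HT", repeat=2)]

A = {"HH"}            # event A: two heads
B = {"HT", "TH"}      # event B: exactly one head

# With equally likely outcomes, Pr(E) = n(E) / N
def prob(event, space):
    return sum(1 for outcome in space if outcome in event) / len(space)

print(prob(A, S))  # 0.25
print(prob(B, S))  # 0.5
```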

4 Joint Probability For events A and B, joint probability Pr(A,B) stands for the probability that both events happen. Example: A={HH}, B={HT, TH}, what is the joint probability Pr(A,B)?

5 Independence Two events A and B are independent in case Pr(AB) = Pr(A)Pr(B). A set of events {A_i} is (mutually) independent in case Pr(A_{i1} A_{i2} … A_{ik}) = Pr(A_{i1}) Pr(A_{i2}) … Pr(A_{ik}) for every finite subcollection.

6 Independence Two events A and B are independent in case Pr(AB) = Pr(A)Pr(B). A set of events {A_i} is independent in case the product rule holds for every finite subcollection. Example: Drug test
            Women    Men
  Success     200   1800
  Failure    1800    200
A = {Patient is a woman}, B = {Drug fails}. Is event A independent of event B?

7 Independence Consider the experiment of tossing a coin twice. Example I: – A = {HT, HH}, B = {HT} – Is event A independent of event B? Example II: – A = {HT}, B = {TH} – Is event A independent of event B? Disjointness does not imply independence. If A is independent of B, and B is independent of C, will A be independent of C?

8 Conditioning If A and B are events with Pr(A) > 0, the conditional probability of B given A is Pr(B|A) = Pr(AB) / Pr(A).

9 Conditioning If A and B are events with Pr(A) > 0, the conditional probability of B given A is Pr(B|A) = Pr(AB) / Pr(A). Example: Drug test
            Women    Men
  Success     200   1800
  Failure    1800    200
A = {Patient is a woman}, B = {Drug fails}. Pr(B|A) = ? Pr(A|B) = ?

10 Conditioning If A and B are events with Pr(A) > 0, Pr(B|A) = Pr(AB) / Pr(A). Example: Drug test (same table as above), A = {Patient is a woman}, B = {Drug fails}. Pr(B|A) = ? Pr(A|B) = ? If A is independent of B, what is the relationship between Pr(A|B) and Pr(A)?
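A small sketch (not part of the slides) that answers these questions directly from the 2x2 table of counts:

```python
# Drug-test contingency table (counts)
counts = {
    ("success", "woman"): 200,  ("success", "man"): 1800,
    ("failure", "woman"): 1800, ("failure", "man"): 200,
}
N = sum(counts.values())  # 4000

pr_A  = sum(v for (res, sex), v in counts.items() if sex == "woman") / N    # Pr(patient is a woman)
pr_B  = sum(v for (res, sex), v in counts.items() if res == "failure") / N  # Pr(drug fails)
pr_AB = counts[("failure", "woman")] / N                                    # Pr(A, B)

print(pr_AB / pr_A)                       # Pr(B|A) = 1800/2000 = 0.9
print(pr_AB / pr_B)                       # Pr(A|B) = 1800/2000 = 0.9
print(abs(pr_AB - pr_A * pr_B) < 1e-12)   # False: A and B are not independent
```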

11 Chain Rule of Probability Chain Rule (1): Pr(A, B) = Pr(A) Pr(B|A). Chain Rule (general): Pr(A_1, …, A_n) = Pr(A_1) Pr(A_2|A_1) Pr(A_3|A_1, A_2) … Pr(A_n|A_1, …, A_n-1). Exercise: prove Chain Rule (1).

12 Marginalization 1. Pr(A) = Σ_B Pr(A, B), summing over all values of B. 2. Pr(A) = Σ_B Pr(A|B) Pr(B). Exercise: prove these rules!

13 Conditional Independence Events A and B are conditionally independent given C in case Pr(AB|C) = Pr(A|C) Pr(B|C). A set of events {A_i} is conditionally independent given C in case Pr(A_{i1} … A_{ik} | C) = Pr(A_{i1}|C) … Pr(A_{ik}|C) for every finite subcollection.

14 Conditional Independence (cont’d) Example: there are three events A, B, C – Pr(A) = Pr(B) = Pr(C) = 1/5 – Pr(A,C) = Pr(B,C) = 1/25, Pr(A,B) = 1/10 – Pr(A,B,C) = 1/125 – Are A and B independent? – Are A and B conditionally independent given C? Independence does not imply conditional independence (nor vice versa).

15 Outline Important concepts in probability theory Bayes’ rule Random variables and distributions

16 Bayes’ Rule Given two events A and B with Pr(A) > 0, Pr(B|A) = Pr(A|B) Pr(B) / Pr(A). Example:
  Pr(W|R)      R     ¬R
    W         0.7    0.4
   ¬W         0.3    0.6
R: it is a rainy day, W: the grass is wet, Pr(R) = 0.8. Pr(R|W) = ?
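A quick sketch of this calculation (my own illustration), using marginalization for the denominator:

```python
# Bayes' rule for the rain / wet-grass example
pr_R = 0.8                       # prior Pr(R)
pr_W_given_R = 0.7               # Pr(W | R)
pr_W_given_notR = 0.4            # Pr(W | not R)

# Pr(W) by marginalizing over R
pr_W = pr_W_given_R * pr_R + pr_W_given_notR * (1 - pr_R)   # 0.64

# Bayes' rule: Pr(R | W) = Pr(W | R) Pr(R) / Pr(W)
pr_R_given_W = pr_W_given_R * pr_R / pr_W
print(pr_R_given_W)  # 0.875
```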

17 Bayes’ Rule (same table as above) R: it rains, W: the grass is wet. The model gives us the information Pr(W|R); inference runs the other way, from W back to R, to obtain Pr(R|W).

18 Bayes’ Rule R: it rains, W: the grass is wet. More generally, with a hypothesis H and evidence E: the information we model is Pr(E|H) (the likelihood), inference computes Pr(H|E) (the posterior), and Pr(H) is the prior: posterior ∝ likelihood × prior.

19 Outline Important concepts in probability theory Bayes’ rule Random variable and probability distribution

20 Random Variable and Distribution A random variable X is a numerical outcome of a random experiment. The distribution of a random variable is the collection of possible outcomes along with their probabilities: – Discrete case: Pr(X = x) for each possible value x, with Σ_x Pr(X = x) = 1 – Continuous case: a density p(x) ≥ 0 with ∫ p(x) dx = 1

21 Random Variable: Example Let S be the set of all sequences of three rolls of a die. Let X be the sum of the number of dots on the three rolls. What are the possible values for X? Pr(X = 5) = ?, Pr(X = 10) = ?
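A brute-force sketch (not in the slides) that answers these by enumerating all 6^3 = 216 equally likely outcomes:

```python
from itertools import product

rolls = list(product(range(1, 7), repeat=3))   # all 216 outcomes of three die rolls

def pr_sum_equals(k):
    return sum(1 for r in rolls if sum(r) == k) / len(rolls)

print(pr_sum_equals(5))    # 6/216  ≈ 0.0278
print(pr_sum_equals(10))   # 27/216 = 0.125
```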

22 Expectation For a random variable X ~ Pr(X = x), its expectation is E[X] = Σ_x x Pr(X = x). – In an empirical sample x_1, x_2, …, x_N, E[X] ≈ (1/N) Σ_i x_i. Continuous case: E[X] = ∫ x p(x) dx. Expectation of a sum of random variables: E[X + Y] = E[X] + E[Y].

23 Expectation: Example Let S be the set of all sequences of three rolls of a die. Let X be the sum of the number of dots on the three rolls. What is E[X]? Now let X be the product of the number of dots on the three rolls. What is E[X]?
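Continuing the enumeration sketch above (again my own illustration, not from the deck):

```python
from itertools import product
from math import prod

rolls = list(product(range(1, 7), repeat=3))

e_sum = sum(sum(r) for r in rolls) / len(rolls)     # 10.5   = 3 * 3.5
e_prod = sum(prod(r) for r in rolls) / len(rolls)   # 42.875 = 3.5 ** 3 (rolls are independent)

print(e_sum, e_prod)
```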

24 Variance The variance of a random variable X is the expectation of (X − E[X])^2: Var(X) = E[(X − E[X])^2] = E[X^2] − (E[X])^2.

25 Normal (Gaussian) Distribution X ~ N(μ, σ^2), with E[X] = μ and Var(X) = σ^2. If X_1 ~ N(μ_1, σ_1^2) and X_2 ~ N(μ_2, σ_2^2) are independent, what is the distribution of X = X_1 + X_2?

26 Sequence Labeling

27 Outline Graphical Models Hidden Markov Models – Probability of a sequence – Viterbi (or decoding) – Baum-Welch Conditional Random Fields

28 Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging: The/DT cat/NN sat/VBD on/IN the/DT mat/NN ./.

29 Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. Another example, partial parsing (aka chunking): The/B-NP cat/I-NP sat/B-VP on/B-PP the/B-NP mat/I-NP

30 Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. Another example, relation extraction: The/B-Arg cat/I-Arg sat/B-Rel on/I-Rel the/B-Arg mat/I-Arg

31 Graphical Models Example Bayes net: HasAnthrax is the parent of HasCough, HasFever, HasDifficultyBreathing, and HasWideMediastinum (this one happens to be a “Naïve Bayes” model). Example Markov Random Field: an undirected graph over variables X_1, …, X_6.

32 Graphical Models: Pieces A graphical model pairs a graph with a probability distribution: vertices (aka ‘nodes’) correspond to random variables (one for each vertex in the graph); edges between vertices are directed (Bayes Nets) or undirected (Markov Random Fields); each part of the graph (a subset of the RVs) has a local function describing its probability, and a formula combines the local functions into an overall probability.

33 A Bayesian Network A Bayesian network is made up of: 1. a Directed Acyclic Graph (DAG), here A → B, B → C, B → D; 2. a set of tables, one for each node in the graph:
  P(A):     P(A=false) = 0.6,  P(A=true) = 0.4
  P(B|A):   A=false: P(B=false) = 0.01, P(B=true) = 0.99;  A=true: P(B=false) = 0.7,  P(B=true) = 0.3
  P(C|B):   B=false: P(C=false) = 0.4,  P(C=true) = 0.6;   B=true: P(C=false) = 0.9,  P(C=true) = 0.1
  P(D|B):   B=false: P(D=false) = 0.02, P(D=true) = 0.98;  B=true: P(D=false) = 0.05, P(D=true) = 0.95

34 A Directed Acyclic Graph (Weng-Keen Wong, Oregon State University ©2005) Each node in the graph is a random variable. A node X is a parent of another node Y if there is an arrow from node X to node Y, e.g. A is a parent of B. Informally, an arrow from node X to node Y means X has a direct influence on Y.

35 A Set of Tables for Each Node Each node X_i has a conditional probability distribution P(X_i | Parents(X_i)) that quantifies the effect of its parents on the node. The parameters of the network are the probabilities in these conditional probability tables (CPTs), shown on the previous slide.
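As a rough sketch (mine, not from the slides) of how these CPTs define a joint distribution via P(A, B, C, D) = P(A) P(B|A) P(C|B) P(D|B):

```python
# CPTs for the A -> B, B -> C, B -> D network, indexed by parent value(s)
p_A = {True: 0.4, False: 0.6}
p_B_given_A = {False: {False: 0.01, True: 0.99}, True: {False: 0.7,  True: 0.3}}
p_C_given_B = {False: {False: 0.4,  True: 0.6},  True: {False: 0.9,  True: 0.1}}
p_D_given_B = {False: {False: 0.02, True: 0.98}, True: {False: 0.05, True: 0.95}}

def joint(a, b, c, d):
    """P(A=a, B=b, C=c, D=d) as a product of local CPT entries."""
    return p_A[a] * p_B_given_A[a][b] * p_C_given_B[b][c] * p_D_given_B[b][d]

# Example: P(A=true, B=true, C=false, D=true)
print(joint(True, True, False, True))   # 0.4 * 0.3 * 0.9 * 0.95 = 0.1026
```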

36 A Set of Tables for Each Node (Weng-Keen Wong, Oregon State University ©2005) Conditional probability distribution for C given B:
  B=false: P(C=false|B) = 0.4, P(C=true|B) = 0.6
  B=true:  P(C=false|B) = 0.9, P(C=true|B) = 0.1
For a given combination of values of the parents (B in this example), the entries for P(C=true|B) and P(C=false|B) must add up to 1, e.g. P(C=true|B=false) + P(C=false|B=false) = 1. If you have a Boolean variable with k Boolean parents, this table has 2^(k+1) probabilities (but only 2^k need to be stored).

37 Inference (Weng-Keen Wong, Oregon State University ©2005) An example of a query would be: P(HasAnthrax = true | HasFever = true, HasCough = true). Note: even though HasDifficultyBreathing and HasWideMediastinum are in the Bayesian network, they are not given values in the query (i.e. they appear neither as query variables nor as evidence variables). They are treated as unobserved variables.

38 The Bad News (Weng-Keen Wong, Oregon State University ©2005) Exact inference is feasible in small to medium-sized networks. Exact inference in large networks takes a very long time, so we resort to approximate inference techniques, which are much faster and give pretty good results.

39 An Example R: it rains, W: the grass is wet, U: people bring umbrellas. Graph: R → W, R → U, so W and U are conditionally independent given R: Pr(UW|R) = Pr(U|R) Pr(W|R) and Pr(UW|¬R) = Pr(U|¬R) Pr(W|¬R). Pr(R) = 0.8.
  Pr(W|R)      R     ¬R         Pr(U|R)      R     ¬R
    W         0.7    0.4           U         0.9    0.2
   ¬W         0.3    0.6          ¬U         0.1    0.8
Pr(U|W) = ?
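A sketch (my own) of answering Pr(U|W) by summing over R and using the conditional independence of U and W given R:

```python
pr_R = 0.8
pr_W = {True: 0.7, False: 0.4}   # Pr(W=true | R)
pr_U = {True: 0.9, False: 0.2}   # Pr(U=true | R)

# Joint Pr(U=true, W=true) = sum_R Pr(R) Pr(W|R) Pr(U|R)   (U and W independent given R)
pr_UW = sum((pr_R if r else 1 - pr_R) * pr_W[r] * pr_U[r] for r in (True, False))

# Marginal Pr(W=true)
pr_W_marg = sum((pr_R if r else 1 - pr_R) * pr_W[r] for r in (True, False))

print(pr_UW / pr_W_marg)   # Pr(U|W) = (0.8*0.7*0.9 + 0.2*0.4*0.2) / 0.64 = 0.8125
```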

42 Outline Graphical Models Hidden Markov Models – Probability of a sequence (decoding) – Viterbi (Best hidden layer sequence) – Supervised parameter estimation – Baum-Welch Conditional Random Fields

43 The Hidden Markov Model A dynamic Bayes net (dynamic because its size can change). The graph is a chain of hidden nodes A_1 → A_2 → … → A_n, with each hidden node A_i pointing to an observed node O_i. The O_i nodes are called observed nodes; the A_i nodes are called hidden nodes.

44 HMMs and Language Processing HMMs have been used in a variety of applications, but especially: – speech recognition (hidden nodes are text words, observations are spoken words) – part-of-speech tagging (hidden nodes are parts of speech, observations are words)

45 HMM Independence Assumptions HMMs assume that: A_i is independent of A_1 through A_i-2, given A_i-1; O_i is independent of all other nodes, given A_i; P(A_i | A_i-1) and P(O_i | A_i) do not depend on i. These are not very realistic assumptions about language, but HMMs are often good enough, and very convenient.

46 HMM Formula An HMM predicts that the probability of observing a sequence o = (o_1, …, o_T) with a particular set of hidden states a = (a_1, …, a_T) is: P(o, a) = P(a_1) P(o_1|a_1) ∏_{t=2..T} P(a_t|a_t-1) P(o_t|a_t). To calculate it, we need: – P(a_1) for all values of a_1 – P(o|a) for all values of o and a – P(a|a_prev) for all values of a and a_prev
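A minimal sketch of evaluating this product (my own illustration; the dict-based tables pi, trans, and emit and their toy values are invented for the example):

```python
def hmm_joint_prob(obs, states, pi, trans, emit):
    """P(o, a) = P(a1) P(o1|a1) * prod_t P(a_t | a_{t-1}) P(o_t | a_t)."""
    p = pi[states[0]] * emit[states[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= trans[states[t - 1]][states[t]] * emit[states[t]][obs[t]]
    return p

# Toy tables (made up for illustration)
pi = {"DT": 0.6, "NN": 0.4}
trans = {"DT": {"DT": 0.1, "NN": 0.9}, "NN": {"DT": 0.4, "NN": 0.6}}
emit = {"DT": {"the": 0.8, "cat": 0.2}, "NN": {"the": 0.1, "cat": 0.9}}

print(hmm_joint_prob(["the", "cat"], ["DT", "NN"], pi, trans, emit))  # 0.6 * 0.8 * 0.9 * 0.9
```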

47 HMM: Pieces 1) A set of states S = {s_1, …, s_N}: the values which hidden nodes may take. 2) A vocabulary, or set of observation symbols, V = {v_1, …, v_M}: the values which an observed node may take. 3) Initial probabilities P(s_i) for all i, written as a vector of N initial probabilities called π. 4) Transition probabilities P(s_j | s_i) for all i, j, written as an N×N ‘transition matrix’ A. 5) Observation probabilities P(v_j | s_i) for all i, j, written as an M×N ‘observation matrix’ B.

48 HMM for POS Tagging 1) S = {DT, NN, VB, IN, …}, the set of all POS tags. 2) V = the set of all words in English. 3) Initial probabilities π_i: the probability that POS tag s_i can start a sentence. 4) Transition probabilities A_ij: the probability that one tag follows another. 5) Observation probabilities B_ij: the probability that a tag generates a particular word.

49 Outline Graphical Models Hidden Markov Models – Probability of a sequence – Viterbi (or decoding) – Supervised parameter estimation – Baum-Welch Conditional Random Fields

50 What’s the probability of a sentence? Suppose I asked you, ‘What’s the probability of seeing a sentence w1, …, wT on the web?’ If we have an HMM model of English, we can use it to estimate the probability.

51 Conditional Probability of a Sentence If we knew the hidden states that generated each word in the sentence, it would be easy: P(w_1, …, w_T | a_1, …, a_T) = ∏_{t=1..T} P(w_t | a_t).

52 Probability of a Sentence Via marginalization, we have: P(w_1, …, w_T) = Σ_{a_1, …, a_T} P(w_1, …, w_T, a_1, …, a_T). Unfortunately, if there are N possible values for each a_i (s_1 through s_N), then there are N^T possible values for a_1, …, a_T.

53 Forward Procedure The special structure of the HMM gives us an efficient solution using dynamic programming. Intuition: the probability of the first t observations is the same for all possible length-(t+1) state sequences. Define: α_i(t) = P(o_1, …, o_t, a_t = s_i).

54 Forward Procedure (recursion, developed step by step on slides 54–61): α_j(1) = π_j B_j(o_1), and α_j(t+1) = [Σ_i α_i(t) A_ij] B_j(o_t+1).

62 Backward Procedure β_i(t) = P(o_t+1, …, o_T | a_t = s_i): the probability of the remaining observations given the state at time t, computed recursively from t = T down to t = 1.

63 Decoding Solution Any of the following gives P(o_1, …, o_T): Forward procedure: P(o) = Σ_i α_i(T). Backward procedure: P(o) = Σ_i π_i B_i(o_1) β_i(1). Combination: P(o) = Σ_i α_i(t) β_i(t), for any t.
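A compact sketch of the forward procedure (my own illustration; pi, A, B are numpy arrays, and note I index B as states × symbols for convenience, whereas the slides describe an M×N matrix):

```python
import numpy as np

def forward(obs, pi, A, B):
    """alpha[t, i] = P(o_1..o_{t+1}, a_{t+1} = s_i); returns (alpha, P(o))."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # alpha_i(1) = pi_i * B_i(o_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # sum over previous states
    return alpha, alpha[-1].sum()

# Tiny made-up model: 2 states, 2 observation symbols
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # B[i, k] = P(v_k | s_i)
_, prob = forward([0, 1, 0], pi, A, B)
print(prob)
```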

64 Outline Graphical Models Hidden Markov Models – Probability of a sequence – Viterbi (or decoding) – Supervised parameter estimation – Baum-Welch Conditional Random Fields

65 Best State Sequence Find the state sequence that best explains the observations, argmax_a P(a | o): the Viterbi algorithm.

66 Viterbi Algorithm Define δ_j(t) = max_{a_1, …, a_t-1} P(o_1, …, o_t-1, a_1, …, a_t-1, a_t = s_j, o_t): the probability of the state sequence which maximizes the probability of seeing the observations up to time t−1, landing in state j, and seeing the observation at time t.

67 Viterbi Algorithm Recursive computation: δ_j(t+1) = max_i δ_i(t) A_ij B_j(o_t+1), with backpointer ψ_j(t+1) = argmax_i δ_i(t) A_ij.

68 Viterbi Algorithm Compute the most likely state sequence by working backwards: â_T = argmax_i δ_i(T), then â_t = ψ_{â_(t+1)}(t+1) for t = T−1, …, 1.
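A runnable sketch of this recursion (mine; same assumed pi/A/B array layout as the forward sketch above):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state sequence for the observation indices in obs."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)          # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A     # scores[i, j] = delta_i(t-1) * A_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Backtrack from the best final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 1, 0], pi, A, B))            # [0, 1, 0]
```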

69 Outline Graphical Models Hidden Markov Models – Probability of a sequence – Viterbi (or decoding) – Supervised parameter estimation – Baum-Welch Conditional Random Fields

70 Supervised Parameter Estimation Given an observation sequence and its states, find the HMM model (π, A, and B) that is most likely to produce the sequence. For example, POS-tagged data from the Penn Treebank.
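With labeled data this reduces to relative-frequency counting; here is a hedged sketch (my own helper, smoothing omitted):

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences):
    """tagged_sentences: list of [(word, tag), ...]. Returns (pi, A, B) as normalized count dicts."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sentences:
        start[sent[0][1]] += 1
        for (w, t) in sent:
            emit[t][w] += 1
        for (_, t1), (_, t2) in zip(sent, sent[1:]):
            trans[t1][t2] += 1

    def normalize(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    pi = normalize(start)
    A = {t: normalize(c) for t, c in trans.items()}
    B = {t: normalize(c) for t, c in emit.items()}
    return pi, A, B

data = [[("the", "DT"), ("cat", "NN"), ("sat", "VBD")],
        [("the", "DT"), ("mat", "NN")]]
print(estimate_hmm(data)[0])   # {'DT': 1.0}
```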

71 Bayesian Parameter Estimation

72 Outline Graphical Models Hidden Markov Models – Probability of a sequence – Viterbi (or decoding) – Supervised parameter estimation – Baum-Welch: Unsupervised parameter estimation Conditional Random Fields

73 Unsupervised Parameter Estimation Given an observation sequence (but no states), find the model that is most likely to produce that sequence. There is no analytic method; instead, given a model and an observation sequence, we iteratively update the model parameters to better fit the observations.

74 Parameter Estimation Define the probability of traversing an arc at time t: ξ_t(i, j) = P(a_t = s_i, a_t+1 = s_j | o) = α_i(t) A_ij B_j(o_t+1) β_j(t+1) / P(o), and the probability of being in state i at time t: γ_i(t) = P(a_t = s_i | o) = Σ_j ξ_t(i, j).

75 Parameter Estimation Now we can compute new estimates of the model parameters: π̂_i = γ_i(1), Â_ij = Σ_t ξ_t(i, j) / Σ_t γ_i(t), and B̂_i(k) = Σ_{t: o_t = v_k} γ_i(t) / Σ_t γ_i(t).
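A sketch of one such re-estimation step (my own, assuming alpha and beta arrays produced by the forward and backward procedures in the layout used above):

```python
import numpy as np

def reestimate(obs, A, B, alpha, beta):
    """One Baum-Welch update given forward (alpha) and backward (beta) tables."""
    T, N = alpha.shape
    prob_o = alpha[-1].sum()
    # xi[t, i, j] = P(a_t = s_i, a_{t+1} = s_j | o)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :] / prob_o
    gamma = alpha * beta / prob_o                      # gamma[t, i] = P(a_t = s_i | o)

    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        mask = np.array(obs) == k
        new_B[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B
```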

76 Parameter Estimation Guarantee: P(o_1:T | A, B, π) ≤ P(o_1:T | Â, B̂, π̂). In other words, by repeating this procedure, we can gradually improve how well the HMM fits the unlabeled data. There is no guarantee that this will converge to the best possible HMM, however.

77 The Most Important Thing We can use the special structure of this model to do a lot of neat math and solve problems that are otherwise not solvable.

78 Outline Graphical Models Hidden Markov Models – Probability of a sequence – Viterbi (or decoding) – Supervised parameter estimation – Baum-Welch: Unsupervised parameter estimation Conditional Random Fields

79 Discriminative (Conditional) Models HMMs are called generative models: if you want them to, they can tell you P(sentence) by marginalizing over P(sentence, labels). Most often, though, people are interested in the conditional probability P(labels | sentence). Discriminative (also called conditional) models directly represent P(labels | sentence): you can’t find out P(sentence) or P(sentence, labels), but for P(labels | sentence) they tend to be more accurate.

80 Discriminative vs. Generative
  Decoding (probability of a sentence): HMM uses the forward or backward algorithm, linear in sentence length; a CRF can’t do it.
  Finding the optimal label sequence: HMM uses Viterbi, linear in sentence length; a CRF uses a Viterbi-style algorithm, also linear in sentence length.
  Supervised parameter estimation: HMM uses Bayesian learning, easy and fast (linear time); a CRF uses convex optimization, which can be quite slow.
  Unsupervised parameter estimation: HMM uses Baum-Welch (non-convex optimization), slow but doable; for a CRF it is very difficult and requires making extra assumptions.
  Feature functions: an HMM is restricted to parents and children in the graph (restrictive!); a CRF allows arbitrary functions of a latent state and any portion of the observed nodes.

81 “Features” that depend on many positions in x The score of a parse depends on all of x at each position. We can still run Viterbi because the label at position i only “looks” at the label at position i−1 and the constant sequence x. (Figure: an HMM chain, where each state sees only its local x_i, versus a CRF chain, where each state can see the whole observed sequence x.)

82 Feature Functions CRFs define the conditional probability P(labels | sentence) in terms of feature functions. As an example, f(labels, sentence) = the number of times that a label ‘NN’ is preceded by ‘DT’ with corresponding word ‘that’ in the sentence. As a designer of a CRF model, you can use any real-valued function of the labels and sentence, so long as it uses at most 2 consecutive labels at a time. In other words, you cannot say: f(labels, sentence) = the number of times ‘DT’ is followed by ‘Adj’ followed by ‘NN’.
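A small sketch (mine) of the example feature function written as Python code, counting over adjacent label pairs:

```python
def f_dt_that_nn(labels, sentence):
    """Count positions where label 'NN' is preceded by label 'DT' whose word is 'that'."""
    return sum(
        1
        for i in range(1, len(labels))
        if labels[i] == "NN" and labels[i - 1] == "DT" and sentence[i - 1] == "that"
    )

print(f_dt_that_nn(["DT", "NN", "VBD"], ["that", "cat", "sat"]))  # 1
```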

83 The CRF Equation Let F = (f_1, …, f_K) be a vector of the feature functions in a CRF model. A CRF model consists of F and a real-valued vector θ = (θ_1, …, θ_K) of weights, one for each feature function. Let o = (o_1, …, o_T) be an observed sentence, let A = (A_1, …, A_T) be the label variables, and let a be a configuration of the label variables. Then P(A = a | o) = exp(θ · F(a, o)) / Σ_{a′} exp(θ · F(a′, o)).

84 CRF Equation, standard format Note that the denominator depends on o, but not on a (it marginalizes over a). To make this observation clear, we typically write P(A = a | o) = (1/Z(o)) exp(θ · F(a, o)), where Z, the “partition function”, is given by Z(o) = Σ_{a′} exp(θ · F(a′, o)).

85 Finding the Best Sequence The best sequence is given by argmax_a P(a | o) = argmax_a θ · F(a, o); the partition function Z(o) does not affect the argmax.

86 Viterbi-like Algorithm Define the score of the state sequence which maximizes the probability of landing in state j at time t, given the full observation sequence.

87 Viterbi-like Algorithm Recursive computation, analogous to the HMM Viterbi recursion, with per-position feature scores θ · F in place of transition and emission probabilities.

88 Viterbi-like Algorithm Compute the most likely state sequence by working backwards.

89 Conditional Training Given a set of observations o_i and the correct labels a_i for each, optimize θ to find argmax_θ ∏_i P(a_i | o_i; θ). To simplify the math, we usually solve the equivalent problem argmax_θ Σ_i log P(a_i | o_i; θ).

90 Convex Optimization Like any function optimization task, we could optimize by taking the derivative and finding its roots; however, the roots are hard to find. Instead, common practice is to use a numerical optimization procedure called “quasi-Newton optimization”. In particular, the “Limited-memory Broyden-Fletcher-Goldfarb-Shanno” (L-BFGS) method is most popular. I won’t explain L-BFGS, but I will explain what you need in order to use it.

91 Using L-BFGS L-BFGS is an iterative procedure that gradually improves an estimate for θ until it reaches a local optimum. To use it, you will need: – an initial estimate for θ (often just the zero vector) – the current conditional log-likelihood of the data – the current gradient vector of the log-likelihood with respect to θ

92 Using L-BFGS Supply L-BFGS with these 3 things, and it will return an improved setting for θ. Repeat as necessary, until the conditional log-likelihood doesn’t change much between iterations. Since the negative log-likelihood is convex, this will converge to a global optimum (although it may take a while).
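As a hedged illustration of this workflow (using SciPy’s L-BFGS-B implementation; the toy quadratic objective below merely stands in for the negative conditional log-likelihood and its gradient):

```python
import numpy as np
from scipy.optimize import minimize

# Stand-in objective: replace with the negative conditional log-likelihood of the CRF
def neg_log_likelihood(theta):
    return 0.5 * np.sum((theta - np.array([1.0, -2.0])) ** 2)

def gradient(theta):
    return theta - np.array([1.0, -2.0])

theta0 = np.zeros(2)                       # initial estimate (often the zero vector)
result = minimize(neg_log_likelihood, theta0, jac=gradient, method="L-BFGS-B")
print(result.x)                            # approximately [1.0, -2.0]
```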

93 Calculating the Gradient For each weight θ_k, the gradient of the conditional log-likelihood is ∂L/∂θ_k = Σ_i [ f_k(a_i, o_i) − Σ_a P(a | o_i; θ) f_k(a, o_i) ], i.e. the observed feature count minus its expected count under the current model; the expectation can be computed with forward-backward style dynamic programming.


