# Probability and Maximum Likelihood

## Presentation transcript:

Probability and Maximum Likelihood

How are we doing on the pass sequence?

- This fit is pretty good, but…
- The red line doesn't reveal different levels of uncertainty in predictions
- Cross validation reduced the training data, so the red line isn't as accurate as it should be
- Choosing a particular M and w seems wrong: we should hedge our bets

[Figure: fitted red line over the hand-labeled horizontal coordinate, t]

Probability Theory: example of a random experiment

- We poll 60 users who are using one of two search engines and record the following:

[Figure: grid with horizontal axis X = 0, 1, …, 8 (number of good hits returned by search engine) and one row per search engine (two search engines); each point corresponds to one of 60 users]

Probability Theory: random variables

- X and Y are called random variables
- Each has its own sample space: $S_X = \{0,1,2,3,4,5,6,7,8\}$ and $S_Y = \{1,2\}$

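Probability Theory: probability

- $P(X=i, Y=j)$ is the probability (relative frequency) of observing X = i and Y = j
- $P(X,Y)$ refers to the whole table of probabilities
- Properties: $0 \le P \le 1$, $\sum P = 1$

Table of $P(X=i, Y=j)$, one row per search engine. Each entry is a count of users; divide by 60 to get the probability.

| X | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| First row | 3 | 6 | 8 | 8 | 5 | 3 | 1 | 0 | 0 |
| Second row | 0 | 0 | 0 | 1 | 4 | 5 | 8 | 6 | 2 |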

Probability Theory: marginal probability

- $P(X=i)$ is the marginal probability that X = i, i.e., the probability that X = i, ignoring Y

[Figure: the table over X = 0, 1, …, 8 with its margins labeled $P(X)$ and $P(Y)$]

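Probability Theory: marginal probability

- $P(X=i)$ is the marginal probability that X = i, i.e., the probability that X = i, ignoring Y
- From the table: $P(X=i) = \sum_j P(X=i, Y=j)$ (SUM RULE)
- Note that $\sum_i P(X=i) = 1$ and $\sum_j P(Y=j) = 1$

Table of $P(X=i, Y=j)$ with its marginals. Every entry is a count; divide by 60 to get the probability.

| X | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | P(Y=j) |
|---|---|---|---|---|---|---|---|---|---|---|
| First row | 3 | 6 | 8 | 8 | 5 | 3 | 1 | 0 | 0 | 34 |
| Second row | 0 | 0 | 0 | 1 | 4 | 5 | 8 | 6 | 2 | 26 |
| P(X=i) | 3 | 6 | 8 | 9 | 9 | 8 | 9 | 6 | 2 | |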
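
The sum rule can be checked directly on the poll counts. The following is a minimal sketch in Python, assuming the first row of the table is labeled Y = 1 and the second Y = 2 (the row labels are an assumption); the names `counts`, `joint`, `p_x` and `p_y` are illustrative.

```python
from fractions import Fraction

# Counts out of 60 users, read off the table for X = 0..8.
# Assumption: the first row is Y = 1 and the second row is Y = 2.
counts = {
    1: [3, 6, 8, 8, 5, 3, 1, 0, 0],
    2: [0, 0, 0, 1, 4, 5, 8, 6, 2],
}
total = sum(sum(row) for row in counts.values())  # 60 users in all

# Joint probability table P(X=i, Y=j) as exact fractions.
joint = {(i, j): Fraction(c, total)
         for j, row in counts.items()
         for i, c in enumerate(row)}

# Sum rule: sum the joint table over the variable being ignored.
p_x = {i: sum(joint[(i, j)] for j in counts) for i in range(9)}
p_y = {j: sum(joint[(i, j)] for i in range(9)) for j in counts}

print("P(X=i):", [str(p) for p in p_x.values()])
print("P(Y=j):", {j: str(p) for j, p in p_y.items()})
```

Exact fractions are used so that both marginals sum to exactly 1 with no rounding error.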

Probability Theory: conditional probability

- $P(X=i \mid Y=j)$ is the probability that X = i, given that Y = j
- From the table: $P(X=i \mid Y=j) = P(X=i, Y=j) / P(Y=j)$

[Figure: the row $P(X \mid Y=1)$ over X = 0, 1, …, 8, shown next to $P(Y=1)$]

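Probability Theory: conditional probability

- How about the opposite conditional probability, $P(Y=j \mid X=i)$?
- $P(Y=j \mid X=i) = P(X=i, Y=j) / P(X=i)$
- Note that $\sum_j P(Y=j \mid X=i) = 1$

Table of $P(Y=j \mid X=i)$: each column of the joint table (counts over 60) divided by the matching column of $P(X=i)$ (counts 3, 6, 8, 9, 9, 8, 9, 6, 2 over 60).

| X | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| First row | 3/3 | 6/6 | 8/8 | 8/9 | 5/9 | 3/8 | 1/9 | 0/6 | 0/2 |
| Second row | 0/3 | 0/6 | 0/8 | 1/9 | 4/9 | 5/8 | 8/9 | 6/6 | 2/2 |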
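
Both conditionals can be computed from the same counts. This is a small sketch, again assuming the first row of counts is Y = 1 and the second is Y = 2; the function names `p_x_given_y` and `p_y_given_x` are hypothetical helpers, not part of the slides.

```python
from fractions import Fraction

# Counts out of 60 users for X = 0..8.
# Assumption: the first row is Y = 1 and the second row is Y = 2.
counts = {
    1: [3, 6, 8, 8, 5, 3, 1, 0, 0],
    2: [0, 0, 0, 1, 4, 5, 8, 6, 2],
}

def p_x_given_y(i, j):
    """P(X=i | Y=j) = P(X=i, Y=j) / P(Y=j); the factor 1/60 cancels."""
    return Fraction(counts[j][i], sum(counts[j]))

def p_y_given_x(j, i):
    """P(Y=j | X=i) = P(X=i, Y=j) / P(X=i); the factor 1/60 cancels."""
    column_total = sum(row[i] for row in counts.values())
    return Fraction(counts[j][i], column_total)

# Each conditional distribution sums to 1 over the variable on the left.
for i in range(9):
    print(i, [str(p_y_given_x(j, i)) for j in counts])
```

Dividing by a row total conditions on Y, while dividing by a column total conditions on X, which is why the two conditionals differ.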

Summary of types of probability

- Joint probability: $P(X,Y)$
- Marginal probability (ignore other variable): $P(X)$ and $P(Y)$
- Conditional probability (condition on the other variable having a certain value): $P(X \mid Y)$ and $P(Y \mid X)$

Probability Theory: constructing joint probability

- Suppose we know
  - the probability that the user will pick each search engine, $P(Y=j)$, and
  - for each search engine, the probability of each number of good hits, $P(X=i \mid Y=j)$
- Can we construct the joint probability, $P(X=i, Y=j)$?
- Yes. Rearranging $P(X=i \mid Y=j) = P(X=i, Y=j) / P(Y=j)$ we get $P(X=i, Y=j) = P(X=i \mid Y=j)\, P(Y=j)$ (PRODUCT RULE)
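
As a worked instance of the product rule, take the poll counts and assume, for illustration only, that the row containing 34 of the 60 users is Y = 1. That row has 8 users with X = 3, so:

```latex
\begin{align*}
P(Y=1) &= \frac{34}{60}, \qquad P(X=3 \mid Y=1) = \frac{8}{34}, \\
P(X=3, Y=1) &= P(X=3 \mid Y=1)\, P(Y=1) = \frac{8}{34} \cdot \frac{34}{60} = \frac{8}{60}.
\end{align*}
```

The result matches the corresponding entry of the joint table, 8 users out of 60.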

Summary of computational rules

- SUM RULE: $P(X) = \sum_Y P(X,Y)$ and $P(Y) = \sum_X P(X,Y)$
  - Notation: we simplify $P(X=i, Y=j)$ to $P(X,Y)$ for clarity
- PRODUCT RULE: $P(X,Y) = P(X \mid Y)\, P(Y)$ and $P(X,Y) = P(Y \mid X)\, P(X)$

Ordinal variables

- In our example, X has a natural order 0…8
  - X is a number of hits, and
  - for the ordering of the columns in the table, nearby values of X have similar probabilities
- Y does not have a natural order

[Figure: the table over X = 0, 1, …, 8]

Probabilities for real numbers

- Can't we treat real numbers as IEEE doubles with $2^{64}$ possible values?
- Hah, hah. No!
- How about quantizing real variables to a reasonable number of values?
- Sometimes works, but…
  - We need to carefully account for ordinality
  - Doing so can lead to cumbersome mathematics

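Probability theory for real numbers

- Quantize X using bins of width $\Delta$
- Then, $X \in \{\dots, -2\Delta, -\Delta, 0, \Delta, 2\Delta, \dots\}$
- Define $P_Q(X=x)$ = probability that $x \le X \le x + \Delta$
- Problem: $P_Q(X=x)$ depends on the choice of $\Delta$
- Solution: let $\Delta \to 0$
- Problem: in that case, $P_Q(X=x) \to 0$
- Solution: define a probability density $P(x) = \lim_{\Delta \to 0} P_Q(X=x) / \Delta = \lim_{\Delta \to 0} (\text{probability that } x \le X \le x + \Delta) / \Delta$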
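
The limit can be watched numerically. The sketch below uses an exponential distribution with rate 1 purely as an illustrative assumption (it does not appear in the slides), because its cumulative distribution and density have simple closed forms; `cdf` and `p_quantized` are hypothetical helper names.

```python
import math

def cdf(x):
    """CDF of an exponential distribution with rate 1 (illustrative choice)."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0

def p_quantized(x, delta):
    """P_Q(X=x): probability that X falls in the bin from x to x + delta."""
    return cdf(x + delta) - cdf(x)

x = 1.0
for delta in (1.0, 0.1, 0.01, 0.001):
    pq = p_quantized(x, delta)
    # P_Q itself shrinks toward 0, but the ratio P_Q / delta settles down.
    print(f"delta={delta:<6} P_Q={pq:.6f}  P_Q/delta={pq / delta:.6f}")

# For this distribution the density at x is exp(-x).
print("density at x:", math.exp(-x))
```

As the bin width shrinks, the bin probability goes to zero while the ratio approaches the density, which is the motivation for defining the density as a limit.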

Probability theory for real numbers: probability density

- Suppose $P(x)$ is a probability density
- Properties
  - $P(x) \ge 0$
  - It is NOT necessary that $P(x) \le 1$
  - $\int_x P(x)\,dx = 1$
- Probabilities of intervals: $P(a \le X \le b) = \int_{x=a}^{b} P(x)\,dx$
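
A density larger than 1 is easy to exhibit. The following sketch assumes a uniform density on the interval [0, 0.5] (an illustrative choice, not from the slides) and integrates it with a simple midpoint rule; `density` and `integrate` are hypothetical helper names.

```python
def density(x):
    """Uniform density on [0, 0.5]: equals 2 inside the interval, 0 outside."""
    return 2.0 if 0.0 <= x <= 0.5 else 0.0

def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (k + 0.5) * width) for k in range(steps)) * width

total = integrate(density, -1.0, 1.0)   # integral over the whole support
prob = integrate(density, 0.1, 0.3)     # P(0.1 <= X <= 0.3)

print("density at 0.25:", density(0.25))
print("total probability:", total)
print("P(0.1 <= X <= 0.3):", prob)
```

The density takes the value 2 everywhere on its support, yet the total probability is still 1 because the support has width 0.5.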

Probability theory for real numbers: joint, marginal and conditional densities

- Suppose $P(x,y)$ is a joint probability density
  - $\int_x \int_y P(x,y)\,dx\,dy = 1$
  - $P((X,Y) \in R) = \iint_R P(x,y)\,dx\,dy$
- Marginal density: $P(x) = \int_y P(x,y)\,dy$
- Conditional density: $P(x \mid y) = P(x,y) / P(y)$

[Figure: a region R in the (x, y) plane]

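The Gaussian distribution

- $\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$
- $\mu$ is the mean and $\sigma$ is the standard deviation
- The precision is $1/\sigma^2$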

Mean and variance

- The mean of X is $E[X] = \sum_X X\, P(X)$ or $E[X] = \int_x x\, P(x)\,dx$
- The variance of X is $\mathrm{VAR}(X) = \sum_X (X - E[X])^2 P(X)$ or $\mathrm{VAR}(X) = \int_x (x - E[X])^2 P(x)\,dx$
- The std dev of X is $\mathrm{STD}(X) = \sqrt{\mathrm{VAR}(X)}$
- The covariance of X and Y is $\mathrm{COV}(X,Y) = \sum_X \sum_Y (X - E[X])(Y - E[Y])\, P(X,Y)$ or $\mathrm{COV}(X,Y) = \int_x \int_y (x - E[X])(y - E[Y])\, P(x,y)\,dx\,dy$
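
The discrete forms of these definitions can be applied to the search-engine poll. This is a sketch under the assumption that the first row of counts is Y = 1 and the second is Y = 2; the covariance value depends on that labeling, and all variable names are illustrative.

```python
from fractions import Fraction
import math

# Counts out of 60 users for X = 0..8.
# Assumption: the first row is Y = 1 and the second row is Y = 2.
counts = {
    1: [3, 6, 8, 8, 5, 3, 1, 0, 0],
    2: [0, 0, 0, 1, 4, 5, 8, 6, 2],
}
total = sum(sum(row) for row in counts.values())
joint = {(x, y): Fraction(c, total)
         for y, row in counts.items()
         for x, c in enumerate(row)}

# Mean, variance, standard deviation and covariance from their definitions.
mean_x = sum(x * p for (x, y), p in joint.items())
mean_y = sum(y * p for (x, y), p in joint.items())
var_x = sum((x - mean_x) ** 2 * p for (x, y), p in joint.items())
std_x = math.sqrt(var_x)
cov_xy = sum((x - mean_x) * (y - mean_y) * p for (x, y), p in joint.items())

print("E[X] =", float(mean_x))
print("VAR(X) =", float(var_x), " STD(X) =", std_x)
print("COV(X,Y) =", float(cov_xy))
```

Summing over the joint table gives the same mean and variance for X as summing over the marginal P(X), since the sum rule is applied implicitly.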

Mean and variance of the Gaussian

- $E[X] = \mu$
- $\mathrm{VAR}(X) = \sigma^2$
- $\mathrm{STD}(X) = \sigma$

How can we use probability as a framework for machine learning?

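Maximum likelihood estimation

- Say we have a density $P(x \mid \theta)$ with parameter $\theta$
- The likelihood of a set of independent and identically drawn (IDD) data $\mathbf{x} = (x_1, \dots, x_N)$ is $P(\mathbf{x} \mid \theta) = \prod_{n=1}^{N} P(x_n \mid \theta)$
- The log-likelihood is $L = \ln P(\mathbf{x} \mid \theta) = \sum_{n=1}^{N} \ln P(x_n \mid \theta)$
- The maximum likelihood (ML) estimate of $\theta$ is $\theta_{ML} = \arg\max_\theta L = \arg\max_\theta \sum_{n=1}^{N} \ln P(x_n \mid \theta)$
- Example: For Gaussian likelihood $P(x \mid \theta) = \mathcal{N}(x \mid \mu, \sigma^2)$, $L = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)$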
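
For the Gaussian example, the maximizing parameters have a well-known closed form: the sample mean and the average squared deviation (dividing by N, not N - 1). The sketch below evaluates the Gaussian log-likelihood on a small made-up data set and checks that nearby parameter values score lower; the data values and function names are hypothetical.

```python
import math

def gaussian_log_likelihood(data, mu, var):
    """Log-likelihood L of i.i.d. data under a Gaussian with mean mu, variance var."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return (-squared_error / (2 * var)
            - 0.5 * n * math.log(var)
            - 0.5 * n * math.log(2 * math.pi))

def gaussian_ml_estimate(data):
    """Closed-form maximizers of L: sample mean and (1/N) sum of squared deviations."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return mu, var

data = [2.1, 1.7, 3.0, 2.4, 1.9, 2.8]  # hypothetical observations
mu_ml, var_ml = gaussian_ml_estimate(data)
best = gaussian_log_likelihood(data, mu_ml, var_ml)

print("mu_ML =", mu_ml, " var_ML =", var_ml, " L =", best)
# Moving either parameter away from the estimate lowers the log-likelihood.
print(gaussian_log_likelihood(data, mu_ml + 0.1, var_ml) < best)
print(gaussian_log_likelihood(data, mu_ml, var_ml * 1.2) < best)
```

Working with the log-likelihood rather than the likelihood turns the product over data points into a sum, which is easier to differentiate and numerically safer.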

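Comments on notation from now on

- Instead of $\sum_j P(X=i, Y=j)$, we write $\sum_X P(X,Y)$
- $P()$ and $p()$ are used interchangeably
- Discrete and continuous variables are treated the same, so $\sum_X$, $\int_X$, $\sum_x$ and $\int_x$ are interchangeable
- $\theta_{ML}$ and $\theta^{ML}$ are interchangeable
- $\arg\max_\theta f(\theta)$ is the value of $\theta$ that maximizes $f(\theta)$
- In the context of data $x_1, \dots, x_N$, the symbols $x$, $X$, $\mathbf{x}$ and $\mathbf{X}$ refer to the entire set of data
- $\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$
- $\log() = \ln()$ and $\exp(x) = e^x$
- $p_{\text{context}}(x)$ and $p(x \mid \text{context})$ are interchangeable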

Questions?

How are we doing on the pass sequence?

- This fit is pretty good, but…
- The red line doesn't reveal different levels of uncertainty in predictions
- Cross validation reduced the training data, so the red line isn't as accurate as it should be
- Choosing a particular M and w seems wrong: we should hedge our bets
- No progress! But…

[Figure: fitted red line over the hand-labeled horizontal coordinate, t]

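Maximum likelihood estimation

- Example: For Gaussian likelihood $P(x \mid \theta) = \mathcal{N}(x \mid \mu, \sigma^2)$, $L = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)$
- Objective of regression: minimize the error $E(\mathbf{w}) = \frac{1}{2} \sum_n (t_n - y(x_n, \mathbf{w}))^2$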
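
One way to read the two formulas side by side is the following sketch. It assumes, as a modeling choice not spelled out on the slide, that each target $t_n$ is Gaussian with mean $y(x_n, \mathbf{w})$ and a fixed variance $\sigma^2$. Under that assumption the log-likelihood of the targets is:

```latex
\begin{align*}
L(\mathbf{w})
  &= \sum_{n=1}^{N} \ln \mathcal{N}\!\left(t_n \mid y(x_n, \mathbf{w}), \sigma^2\right) \\
  &= -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \left(t_n - y(x_n, \mathbf{w})\right)^2
     - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi) \\
  &= -\frac{1}{\sigma^2} E(\mathbf{w}) - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi).
\end{align*}
```

Only the first term depends on $\mathbf{w}$, and it enters with a negative sign, so under this assumption maximizing $L$ over $\mathbf{w}$ picks the same $\mathbf{w}$ as minimizing $E(\mathbf{w})$.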
