Probability and Maximum Likelihood

How are we doing on the pass sequence? This fit is pretty good, but… [Figure: red fitted curve over the hand-labeled horizontal coordinate, t] The red line doesn't reveal different levels of uncertainty in predictions. Cross-validation reduced the training data, so the red line isn't as accurate as it should be. Choosing a particular M and w seems wrong – we should hedge our bets.

Probability Theory – Example of a random experiment: We poll 60 users who are using one of two search engines and record the following. [Figure: scatter plot in which each point corresponds to one of the 60 users; vertical axis: the two search engines; horizontal axis: number of good hits returned by the search engine]

Probability Theory – Random variables: X and Y are called random variables. Each has its own sample space: S_X = {0,1,2,3,4,5,6,7,8} and S_Y = {1,2}.

Probability Theory – Probability: P(X=i,Y=j) is the probability (relative frequency) of observing X = i and Y = j. P(X,Y) refers to the whole table of probabilities. Properties: 0 ≤ P(X=i,Y=j) ≤ 1 and Σ_i Σ_j P(X=i,Y=j) = 1.
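
Below is a minimal Python sketch of how such a joint table could be represented in code; the counts are made up for illustration, since the slide's actual table is not reproduced in this transcript.

```python
import numpy as np

# Hypothetical counts for the 60-user poll (illustrative only, not the slide's data).
# Rows: search engine Y in {1, 2}; columns: number of good hits X in {0, ..., 8}.
counts = np.array([
    [2, 4, 6, 5, 4, 3, 2, 2, 2],   # users of engine Y = 1
    [1, 2, 3, 4, 5, 6, 5, 3, 1],   # users of engine Y = 2
])
assert counts.sum() == 60

# Joint probability table P(X=i, Y=j) as relative frequencies.
P = counts / counts.sum()

# Properties: every entry lies in [0, 1] and the whole table sums to 1.
assert np.all((P >= 0) & (P <= 1))
assert np.isclose(P.sum(), 1.0)
```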

Probability Theory – Marginal probability: P(X=i) is the marginal probability that X = i, i.e., the probability that X = i, ignoring Y.

Probability Theory – Marginal probability (SUM RULE): P(X=i) is the marginal probability that X = i, i.e., the probability that X = i, ignoring Y. From the table: P(X=i) = Σ_j P(X=i,Y=j). Note that Σ_i P(X=i) = 1 and Σ_j P(Y=j) = 1.
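
A small self-contained sketch of the sum rule on the same hypothetical table as above:

```python
import numpy as np

# Hypothetical joint table P(X=i, Y=j) (rows: Y = 1, 2; columns: X = 0..8).
P = np.array([[2, 4, 6, 5, 4, 3, 2, 2, 2],
              [1, 2, 3, 4, 5, 6, 5, 3, 1]]) / 60.0

P_X = P.sum(axis=0)   # sum rule over Y: P(X=i) = sum_j P(X=i, Y=j)
P_Y = P.sum(axis=1)   # sum rule over X: P(Y=j) = sum_i P(X=i, Y=j)

# Each marginal is itself a distribution: it sums to 1.
assert np.isclose(P_X.sum(), 1.0) and np.isclose(P_Y.sum(), 1.0)
```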

Probability Theory – Conditional probability: P(X=i|Y=j) is the probability that X = i, given that Y = j. From the table: P(X=i|Y=j) = P(X=i,Y=j) / P(Y=j).

Probability Theory – Conditional probability: How about the opposite conditional probability, P(Y=j|X=i)? P(Y=j|X=i) = P(X=i,Y=j) / P(X=i). Note that Σ_j P(Y=j|X=i) = 1.

Summary of types of probability Joint probability: P(X,Y) Marginal probability (ignore other variable): P(X) and P(Y) Conditional probability (condition on the other variable having a certain value): P(X|Y) and P(Y|X)

Probability Theory – Constructing the joint probability: Suppose we know the probability that the user will pick each search engine, P(Y=j), and, for each search engine, the probability of each number of good hits, P(X=i|Y=j). Can we construct the joint probability, P(X=i,Y=j)? Yes. Rearranging P(X=i|Y=j) = P(X=i,Y=j) / P(Y=j) we get P(X=i,Y=j) = P(X=i|Y=j) P(Y=j) (PRODUCT RULE).
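
A sketch of the conditional table and the product rule on the same hypothetical numbers:

```python
import numpy as np

# Same hypothetical joint table as above (rows: Y = 1, 2; columns: X = 0..8).
P = np.array([[2, 4, 6, 5, 4, 3, 2, 2, 2],
              [1, 2, 3, 4, 5, 6, 5, 3, 1]]) / 60.0
P_Y = P.sum(axis=1)

P_X_given_Y = P / P_Y[:, None]        # P(X=i | Y=j) = P(X=i, Y=j) / P(Y=j)
rebuilt = P_X_given_Y * P_Y[:, None]  # product rule: P(X=i, Y=j) = P(X=i|Y=j) P(Y=j)

assert np.allclose(rebuilt, P)        # the joint table is recovered exactly
```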

Summary of computational rules – SUM RULE: P(X) = Σ_Y P(X,Y) and P(Y) = Σ_X P(X,Y). (Notation: we simplify P(X=i,Y=j) to P(X,Y) for clarity.) PRODUCT RULE: P(X,Y) = P(X|Y) P(Y) and P(X,Y) = P(Y|X) P(X).

Ordinal variables – In our example, X has a natural order 0…8: X is a number of hits, and, for the ordering of the columns in the table, nearby values of X have similar probabilities. Y does not have a natural order.

Probabilities for real numbers – Can't we treat real numbers as IEEE doubles with 2^64 possible values? Hah, hah. No! How about quantizing real variables to a reasonable number of values? Sometimes works, but we need to carefully account for ordinality, and doing so can lead to cumbersome mathematics.

Probability theory for real numbers – Quantize X using bins of width Δ. Then X ∈ {…, -2Δ, -Δ, 0, Δ, 2Δ, …}. Define P_Q(X=x) = probability that x ≤ X ≤ x+Δ. Problem: P_Q(X=x) depends on the choice of Δ. Solution: let Δ → 0. Problem: in that case, P_Q(X=x) → 0. Solution: define a probability density P(x) = lim_{Δ→0} P_Q(X=x) / Δ = lim_{Δ→0} (probability that x ≤ X ≤ x+Δ) / Δ.
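
A quick numerical illustration of this limiting argument, using samples from a standard Gaussian (the Gaussian is an assumed example, not something specified on this slide):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(1_000_000)   # draws from a standard Gaussian

x = 0.5                                    # point at which to estimate the density
for delta in (1.0, 0.1, 0.01):
    # P_Q(X = x): fraction of samples falling in the bin from x to x + delta
    p_q = np.mean((samples >= x) & (samples < x + delta))
    print(delta, p_q / delta)              # dividing by the bin width approximates P(x)

# True Gaussian density at x = 0.5, for comparison (~0.352):
print(np.exp(-x**2 / 2) / np.sqrt(2 * np.pi))
```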

Probability theory for real numbers – Probability density: suppose P(x) is a probability density. Properties: P(x) ≥ 0; it is NOT necessary that P(x) ≤ 1; ∫_x P(x) dx = 1. Probabilities of intervals: P(a ≤ X ≤ b) = ∫_{x=a}^{b} P(x) dx.
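
A sketch checking these properties numerically, again with the standard Gaussian density as the assumed example:

```python
import numpy as np

def p(x):
    # Standard Gaussian density. Note that a density can exceed 1 (e.g. a very
    # narrow Gaussian); only the integral over x has to equal 1.
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

xs = np.linspace(-10, 10, 200_001)
print(np.trapz(p(xs), xs))        # ∫ p(x) dx ≈ 1

a, b = -1.0, 1.0
xs_ab = np.linspace(a, b, 20_001)
print(np.trapz(p(xs_ab), xs_ab))  # P(-1 ≤ X ≤ 1) ≈ 0.683
```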

Probability theory for real numbers – Joint, marginal and conditional densities: suppose P(x,y) is a joint probability density. Then ∫_x ∫_y P(x,y) dx dy = 1, and P((X,Y) ∈ R) = ∫_R P(x,y) dx dy. Marginal density: P(x) = ∫_y P(x,y) dy. Conditional density: P(x|y) = P(x,y) / P(y).

The Gaussian distribution: N(x|μ,σ²) = (1/√(2πσ²)) exp(−(x−μ)²/(2σ²)), where μ is the mean and σ is the standard deviation.

The Gaussian distribution: the same density can be written in terms of the precision β = 1/σ², where σ is the standard deviation.
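
A minimal sketch of the Gaussian density in both parameterizations (the function names are illustrative, not from the slides):

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """N(x | mu, sigma^2) = exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def gaussian_pdf_precision(x, mu=0.0, beta=1.0):
    """Same density parameterized by the precision beta = 1 / sigma^2."""
    return np.sqrt(beta / (2 * np.pi)) * np.exp(-0.5 * beta * (x - mu) ** 2)

print(gaussian_pdf(0.0))             # ≈ 0.3989
print(gaussian_pdf_precision(0.0))   # same value, since beta = 1/sigma^2 = 1
```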

Mean and variance – The mean of X is E[X] = Σ_X X P(X) or E[X] = ∫_x x P(x) dx. The variance of X is VAR(X) = Σ_X (X − E[X])² P(X) or VAR(X) = ∫_x (x − E[X])² P(x) dx. The std dev of X is STD(X) = √VAR(X). The covariance of X and Y is COV(X,Y) = Σ_X Σ_Y (X − E[X])(Y − E[Y]) P(X,Y) or COV(X,Y) = ∫_x ∫_y (x − E[X])(y − E[Y]) P(x,y) dx dy.
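
These sums are easy to compute for the hypothetical table used earlier (X takes the values 0..8, Y the labels 1 and 2; the numbers are still illustrative):

```python
import numpy as np

# Hypothetical joint table again (rows: Y in {1, 2}; columns: X in {0, ..., 8}).
P = np.array([[2, 4, 6, 5, 4, 3, 2, 2, 2],
              [1, 2, 3, 4, 5, 6, 5, 3, 1]]) / 60.0
x_vals, y_vals = np.arange(9), np.array([1, 2])
P_X, P_Y = P.sum(axis=0), P.sum(axis=1)

E_X = np.sum(x_vals * P_X)                    # mean of X
VAR_X = np.sum((x_vals - E_X) ** 2 * P_X)     # variance of X
STD_X = np.sqrt(VAR_X)                        # standard deviation of X
E_Y = np.sum(y_vals * P_Y)
# Covariance: sum of (X - E[X])(Y - E[Y]) weighted by the joint table.
COV_XY = np.sum(P * np.outer(y_vals - E_Y, x_vals - E_X))
```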

Mean and variance of the Gaussian – E[X] = μ, VAR(X) = σ², STD(X) = σ.

How can we use probability as a framework for machine learning?

Maximum likelihood estimation – Say we have a density P(x|θ) with parameter θ. The likelihood of a set of independent and identically distributed (IID) data x = (x_1,…,x_N) is P(x|θ) = ∏_{n=1}^N P(x_n|θ). The log-likelihood is L = ln P(x|θ) = Σ_{n=1}^N ln P(x_n|θ). The maximum likelihood (ML) estimate of θ is θ_ML = argmax_θ L = argmax_θ Σ_{n=1}^N ln P(x_n|θ). Example: for a Gaussian likelihood P(x|θ) = N(x|μ,σ²), L = −(N/2) ln(2πσ²) − (1/(2σ²)) Σ_{n=1}^N (x_n − μ)².
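
A sketch of ML estimation in the Gaussian case, comparing the closed-form estimates with a direct numerical maximization of the log-likelihood; the data are synthetic, and the use of scipy here is an implementation choice, not something from the slides:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=500)   # synthetic IID data
N = len(x)

# Closed-form ML estimates for a Gaussian: sample mean and (biased) sample variance.
mu_ml = x.mean()
var_ml = np.mean((x - mu_ml) ** 2)

# The same answer by numerically maximizing the log-likelihood L(mu, sigma^2).
def neg_log_likelihood(params):
    mu, log_var = params                       # optimize log-variance so it stays positive
    var = np.exp(log_var)
    return 0.5 * N * np.log(2 * np.pi * var) + np.sum((x - mu) ** 2) / (2 * var)

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
print(mu_ml, var_ml)
print(res.x[0], np.exp(res.x[1]))              # matches the closed-form estimates
```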

Comments on notation from now on – Instead of Σ_j P(X=i,Y=j), we write Σ_Y P(X,Y). P() and p() are used interchangeably. Discrete and continuous variables are treated the same, so Σ_X, Σ_x, ∫_X dX and ∫_x dx are interchangeable. θ_ML and θ̂_ML are interchangeable. argmax_θ f(θ) is the value of θ that maximizes f(θ). In the context of data x_1,…,x_N, the symbols x, X, 𝐱 and 𝐗 refer to the entire set of data. N(x|μ,σ²) = (1/√(2πσ²)) exp(−(x−μ)²/(2σ²)). log() = ln() and exp(x) = e^x. p_context(x) and p(x|context) are interchangeable.

Questions?

How are we doing on the pass sequence? This fit is pretty good, but… [Figure: red fitted curve over the hand-labeled horizontal coordinate, t] The red line doesn't reveal different levels of uncertainty in predictions. Cross-validation reduced the training data, so the red line isn't as accurate as it should be. Choosing a particular M and w seems wrong – we should hedge our bets. No progress! But…

Maximum likelihood estimation – Example: for a Gaussian likelihood P(x|θ) = N(x|μ,σ²), the log-likelihood is L = −(N/2) ln(2πσ²) − (1/(2σ²)) Σ_{n=1}^N (x_n − μ)². Objective of regression: minimize the error E(w) = ½ Σ_n (t_n − y(x_n,w))².
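
A short derivation (standard, and consistent with the slide) of why the two objectives coincide, assuming each target t_n is modeled as Gaussian with mean y(x_n, w) and fixed variance σ²:

```latex
% Model the targets as Gaussian around the prediction:
% p(t_n | x_n, w) = N(t_n | y(x_n, w), sigma^2).
\ln p(\mathbf{t}\mid \mathbf{x}, w)
  = \sum_{n=1}^{N} \ln \mathcal{N}\!\left(t_n \mid y(x_n, w), \sigma^2\right)
  = -\frac{N}{2}\ln(2\pi\sigma^2)
    - \frac{1}{2\sigma^2}\sum_{n=1}^{N}\bigl(t_n - y(x_n, w)\bigr)^2 .
% The first term does not depend on w, so maximizing the log-likelihood over w
% is exactly the same as minimizing the sum-of-squares error
% E(w) = (1/2) * sum_n (t_n - y(x_n, w))^2 .
```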