1 Probabilistic ML algorithm: Naïve Bayes and Maximum Likelihood

2 There is uncertainty in the world. Rather than only following the most likely explanation for a given situation, keeping an open mind and considering other possible explanations has proven essential for systems that must work in a real-world environment. The language for describing uncertainty, probability theory, tremendously simplifies reasoning in such worlds. Many machine learning problems can be formulated in probabilistic terms:
–the target of a ML system is to learn a classification function f(x): x → y
–given an unseen instance x, assign a category label y to it using the learned function f()
–the corresponding probabilistic formulation is: learn the conditional probability p(y|x) of class label y given the observation x
–a label can then be selected as argmax_y p(y|x)
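As a small illustration of the argmax rule above (not from the original slides), here is a minimal Python sketch assuming the conditional distribution p(y|x) for one instance is already available as a dictionary; the labels and numbers are made up.

```python
# Hypothetical posterior p(y | x) for one observed instance x
# (labels and numbers are illustrative only).
posterior = {"positive": 0.7, "negative": 0.3}

# Classification rule: pick the label y that maximizes p(y | x).
predicted_label = max(posterior, key=posterior.get)
print(predicted_label)  # -> positive
```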

BASIC PROBABILITY NOTIONS YOU NEED 3

4 Axioms of Probability Theory
All probabilities are between 0 and 1.
A true proposition has probability 1, a false one has probability 0: P(true) = 1, P(false) = 0.
The probability of a disjunction is: P(A ∨ B) = P(A) + P(B) - P(A ∧ B)

5 Conditional Probability
P(A | B) is the probability of A given B. It assumes that B is all and only the information known.
Defined by: P(A | B) = P(A ∧ B) / P(B)

6 Statistical Independence
A and B are independent iff: P(A ∧ B) = P(A) P(B)
Therefore, if A and B are independent: P(A | B) = P(A ∧ B) / P(B) = P(A), and likewise P(B | A) = P(B).
These two constraints are logically equivalent.
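A quick numeric sanity check of the two equivalent formulations, using made-up probabilities for two events that are independent by construction:

```python
# Illustrative numbers: A = "fair coin lands heads", B = "fair die rolls a six".
p_a = 0.5        # P(A)
p_b = 1 / 6      # P(B)
p_a_and_b = p_a * p_b          # independence: P(A and B) = P(A) P(B)

# Equivalent formulation: conditioning on B does not change the probability of A.
p_a_given_b = p_a_and_b / p_b  # P(A | B) = P(A and B) / P(B)
assert abs(p_a_given_b - p_a) < 1e-12
```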

7 Univariate and Multivariate distributions
A distribution is univariate when there is only one random variable, e.g. if the instance vectors x in X are described by a single feature.
It is multivariate if many random variables are involved, e.g. x = (x_1 … x_d). Each feature of x is then described by a random variable and we can estimate P(X_i = v_ij), where v_ij, j = 1..m, are the possible values for feature i.

8 Probabilistic Classification
Let Y be the random variable for the class C, which takes values {y_1, y_2, …, y_m} (|C| = m possible classifications for our instances).
Let X be a random variable describing an instance as a vector of values for d features, and let v_jk be a possible value for feature X_j.
For our classification task we need to compute the (multivariate) conditional probabilities: P(Y=y_i | X=x_k = (X_k1, …, X_kd)) for i = 1…m (e.g. P(Y=positive | x_k = <color=red, shape=circle>)).
The objective is to classify a new unseen x_k by estimating the probability of each possible classification y_i, given the observed feature values of the instance to be classified.
To estimate P(Y=y_i | X=x_k) we use a learning set D of pairs (x_i, C(x_i)).

9 How can we compute P(Y=y_i | X=x_k)?
Example: x_k = (color=red, shape=circle), y_1 = positive, y_2 = negative.
We need to compute P(Y=positive | color=red, shape=circle) and P(Y=negative | color=red, shape=circle), i.e. ratios such as P(color=red, shape=circle, Y=y_i) / P(color=red, shape=circle).
So we need to compute joint probabilities. The joint probability distribution for a set of random variables X_1, …, X_d gives the probability of every combination of values (a d-dimensional array with k^d entries if all d variables are discrete with k values each; furthermore, for each variable, Σ_{j=1..k} P(X_i=v_ij) = 1).

10 Joint probability tables
The probability of any conjunction (assignment of values to some subset of variables) can be calculated by summing the appropriate subset of values from the joint distribution. If all joint probabilities are known, all conditional probabilities can be calculated.
[Two joint probability tables from the slide, one for Class=positive and one for Class=negative: rows color ∈ {red, blue}, columns shape ∈ {circle, square}; each cell gives a joint probability, e.g. Pr(shape=circle, color=blue, C=+).]
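A minimal sketch of how conditional probabilities are read off a joint table. Since the numeric entries of the slide's tables did not survive (apart from 0.20), the values below are hypothetical placeholders that merely sum to 1:

```python
# Hypothetical joint distribution P(color, shape, class); placeholder numbers only.
joint = {
    ("red",  "circle", "+"): 0.20, ("red",  "square", "+"): 0.02,
    ("blue", "circle", "+"): 0.02, ("blue", "square", "+"): 0.01,
    ("red",  "circle", "-"): 0.05, ("red",  "square", "-"): 0.30,
    ("blue", "circle", "-"): 0.20, ("blue", "square", "-"): 0.20,
}
assert abs(sum(joint.values()) - 1.0) < 1e-9

# Any conditional is a ratio of sums over the joint table, e.g. P(class=+ | blue, circle):
p_blue_circle = sum(v for (c, s, y), v in joint.items() if c == "blue" and s == "circle")
p_pos_given_blue_circle = joint[("blue", "circle", "+")] / p_blue_circle
print(round(p_pos_given_blue_circle, 3))
```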

11 Probabilistic Classification (2)
However, given no other assumptions, this requires tables giving the probability of each category for each possible instance (combination of feature values) in the instance space, which is impossible to estimate accurately from a reasonably-sized training set D, e.g. P(Y=y_i | X_1=v_1k, X_2=v_2k, …, X_d=v_dk) for all y_i and v_jk.
Assuming that Y and all X_i are binary valued and we have d features, we need 2^d entries to specify P(Y=pos | X=x_k) for each of the 2^d possible x_k, since:
–P(Y=neg | X=x_k) = 1 - P(Y=pos | X=x_k)
–compared to 2^(d+1) - 1 entries for the joint distribution P(Y, X_1, X_2, …, X_d)

12 Example: dimension of joint probability tables
X = (X_1, X_2, …, X_4) with X_i ∈ {0,1} and Y ∈ {0,1} (all binary). Number of features: d = 4; number of class values: m = 2.
Consider the instance x_k = (0,1,0,0). We need to estimate Pr(Y=0 | X_1=0, X_2=1, X_3=0, X_4=0).
If P(Y=0 | (0,1,0,0)) > P(Y=1 | (0,1,0,0)) = 1 - P(Y=0 | (0,1,0,0)) then the class is 0, else the class is 1.
Overall, 2^4 = 16 estimates are needed for our probabilistic classifier. For large m and d this is not feasible, even with large learning sets.

13 Solution: Maximum Likelihood Learning
Let M be a probabilistic formulation (MODEL) of our classification task p(y_i | x_k). We know the structure of M (its parametric form) but not the values of its probabilistic parameters Θ (e.g., in the example, the μ_i and α_i). This is as for algebraic models, e.g. the perceptron: we know that the relation between x and y is linear, but we don't know the coefficients.
We can consider our observations (x_k, y_i) as "generated" by the (unknown) distribution induced by M.
Goal: after observing several examples x_1 … x_N, estimate the model parameters Θ that generated the observed data.
Stated more intuitively: learn the values of the model parameters that MAXIMIZE the likelihood of observing the evidence provided by the learning set.

14 Maximum Likelihood Estimation (MLE)
The likelihood of the observed data, given the model parameters Θ, is the conditional probability that the model M with parameters Θ produces the observations x_1, …, x_|D|:
L(Θ) = P(x_1 … x_|D| | Θ, M) = P(E | Θ, M)
In other terms, in MLE we seek the model parameters Θ that maximize the likelihood of observing our EVIDENCE E, represented by the learning set D.

15 What are these parameters?
The definition of Θ is quite general. It can be a set of coefficients in a probabilistic formulation:
p(y|x) = f(α_1, α_2, …, α_n, x), with Θ = (α_1, α_2, …, α_n)
Or a set of probabilities, e.g.:
P(y|x) expressed in terms of P(X_1=v_1j), P(X_2=v_2k), …, P(X_d=v_dn), with Θ = (P(X_1=v_1j), P(X_2=v_2k), …, P(X_d=v_dn))
Note: p is a probability distribution, which is the general formulation since it also applies to continuous random variables; P is the probability of observing specific values.

16 Maximum Likelihood Estimation (MLE)
In MLE we seek the model parameters Θ that maximize the likelihood: it is thus an OPTIMIZATION problem.
The MLE principle is applicable to a wide variety of ML problems, from speech recognition to natural language processing, computational biology, etc.
We will start with the simplest example: estimating the bias of a thumbtack (we got bored with coins).

17 Example: Binomial Experiment
When tossed, the thumbtack can land in one of two positions: Head (H) or Tail (T).
We denote by θ the (unknown) probability P(H).
Estimation task: given a sequence of toss samples x_1 … x_m (our evidence E), we want to estimate the probabilities P(H) = θ and P(T) = 1 - θ.
θ is the model parameter; in this case the problem is univariate.

18 Statistical Parameter Fitting (general definition for the multivariate case)
Consider instances D: x_1, x_2, …, x_m such that:
–the set of values that x can take is known
–each x_i is sampled from the same distribution
–each x_i is sampled independently of the rest
i.e. the x_i are i.i.d. samples.
The task is to find the vector of parameters Θ that generated the given data E. This parameter vector Θ can then be used to predict future data.

19 The Likelihood Function
How good is a particular Θ? It depends on how likely it is to generate the observed data. Where do the observed data come from? In the thumbtack example, we can toss the thumbtack several times and observe a sequence of results.
The likelihood of the sequence H, T, T, H, H is:
L(θ) = θ · (1-θ) · (1-θ) · θ · θ = θ^3 (1-θ)^2
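A small sketch (assuming NumPy is available) that evaluates L(θ) = θ^3 (1-θ)^2 on a grid and locates its maximum, which lands at the MLE discussed on the next slides:

```python
import numpy as np

# Likelihood of the sequence H, T, T, H, H as a function of theta = P(H).
theta = np.linspace(0.0, 1.0, 1001)
likelihood = theta**3 * (1 - theta)**2     # three heads, two tails

# The grid maximum approximates the MLE N_H / (N_H + N_T) = 3/5.
print(theta[np.argmax(likelihood)])        # ~0.6
```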

20 Sufficient Statistics
To compute the likelihood in the thumbtack example we only need N_H and N_T (the number of heads and the number of tails in the sequence of tosses): N_H and N_T are sufficient statistics for the binomial distribution.
A sufficient statistic is a function of the data whose value contains all the information needed to compute any estimate of the parameter.

21 Maximum Likelihood Estimation
MLE Principle: choose the parameters that maximize the likelihood function. MLE is one of the most commonly used estimators in statistics.
One usually maximizes the log-likelihood function, defined as ℓ_E(Θ) = ln L_E(Θ).

22 Example: MLE in Binomial Data
(i.e., Y is boolean, so we only need P(Y=1|X), since P(Y=0|X) = 1 - P(Y=1|X))
The likelihood is L(θ) = θ^N_H (1-θ)^N_T. Taking the derivative of the log-likelihood and equating it to 0, we get θ̂ = N_H / (N_H + N_T).
Example: (N_H, N_T) = (3, 2); the MLE estimate is 3/5 = 0.6 (which coincides with what one would expect).
Remember: to maximize or minimize a function you take its derivative and set it to zero.
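For completeness, the derivation the slide alludes to, written out as the standard log-likelihood argument:

```latex
\ell(\theta) = \ln L(\theta) = N_H \ln \theta + N_T \ln(1-\theta),
\qquad
\frac{d\ell}{d\theta} = \frac{N_H}{\theta} - \frac{N_T}{1-\theta} = 0
\;\Longrightarrow\;
\hat{\theta} = \frac{N_H}{N_H + N_T}
```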

23 From Binomial to Multinomial
Now suppose Y can take the values 1, 2, …, K (for example, a die has K = 6 sides). We want to learn the parameters θ_1, θ_2, …, θ_K (the vector Θ of probabilities P(Y=y_k)).
Sufficient statistics: N_1, N_2, …, N_K, the number of times each outcome is observed.
The optimization problem is: maximize L(Θ) = Π_{k=1..K} θ_k^N_k subject to Σ_{k=1..K} θ_k = 1.

24 Maximizing the log-likelihood with constraints
We consider the log-likelihood and represent the constrained optimization problem with a Lagrangian (again: remember SVMs).
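A sketch of the constrained maximization, reconstructed in standard form since the slide's equations are not reproduced in the transcript:

```latex
\max_{\theta_1,\dots,\theta_K} \sum_{k=1}^{K} N_k \ln \theta_k
\quad \text{s.t.} \quad \sum_{k=1}^{K} \theta_k = 1,
\qquad
\Lambda(\theta,\lambda) = \sum_{k} N_k \ln \theta_k + \lambda \Big(1 - \sum_{k} \theta_k\Big)

\frac{\partial \Lambda}{\partial \theta_k} = \frac{N_k}{\theta_k} - \lambda = 0
\;\Rightarrow\; \theta_k = \frac{N_k}{\lambda},
\qquad
\sum_k \theta_k = 1 \;\Rightarrow\; \lambda = \sum_k N_k = N,
\qquad
\hat{\theta}_k = \frac{N_k}{N}
```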

25 Another example: protein sequences
Let x_1 x_2 … x_n be a protein sequence. We want to learn the parameters θ_1, θ_2, …, θ_20 corresponding to the probabilities of the 20 amino acids. N_1, N_2, …, N_20 are the number of times each amino acid is observed in the sequence.
Likelihood function: L(Θ) = Π_{i=1..20} θ_i^N_i
MLE: θ̂_i = N_i / Σ_{j=1..20} N_j
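A minimal Python sketch of this multinomial MLE on a hypothetical toy sequence (count each amino acid, then normalize):

```python
from collections import Counter

# Hypothetical short protein sequence (one-letter amino acid codes).
sequence = "MKVLAAGLLALLA"

# Sufficient statistics: N_i, the count of each amino acid in the sequence.
counts = Counter(sequence)

# MLE: theta_i = N_i / sum_j N_j
total = sum(counts.values())
theta = {aa: n / total for aa, n in counts.items()}
print(theta["L"])   # relative frequency of leucine in this toy sequence
```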

NAIVE BAYES CLASSIFIER: A ML ALGORITHM BASED ON MLE PRINCIPLE 26

27 Naïve Bayes
We are given evidence E represented by an annotated learning set D of classified instances (x_k, y_i). We would like to estimate the probabilities p(Y=y_i | x) for unobserved instances, given the evidence E.
In probability calculus, estimating P(a|b) is often easier than estimating P(b|a), so we use Bayes' theorem to invert conditional probabilities.

28 Bayes Theorem
Y = classification, E = evidence (the data):
P(Y | E) = P(E | Y) P(Y) / P(E)
Simple proof from the definition of conditional probability:
P(Y | E) = P(Y ∧ E) / P(E)   (def. cond. prob.)
P(E | Y) = P(Y ∧ E) / P(Y)   (def. cond. prob.), hence P(Y ∧ E) = P(E | Y) P(Y)
Substituting gives P(Y | E) = P(E | Y) P(Y) / P(E). QED

29 Naïve Bayes Model (1)
For each classification value y_i we have (applying Bayes):
P(Y=y_i | X=x_k) = P(X=x_k | Y=y_i) P(Y=y_i) / P(X=x_k)
P(Y=y_i) and P(X=x_k) are the prior probabilities of the class and of the instance; the class priors can be estimated from the learning set D, since the categories are complete and disjoint.

30 Naïve Bayes Model (2)
We assume the values v_jk of the different features X_j to be statistically independent given the class. v_jk is the k-th value of feature j, where j = 1, 2, …, d and k = 1, …, K_j (for binary features, k = 0 or 1), e.g.:
P(x = (color=blue, shape=circle, dimension=big) | Y=y_i) = P(color=blue | y_i) P(shape=circle | y_i) P(dimension=big | y_i)
and furthermore, for each feature X_j and class y_i, Σ_k P(X_j=v_jk | Y=y_i) = 1.

31 Estimating the model parameters Θ
The parameters are:
θ_1: P(Y=y_i) for all i, and
θ_2: P(X_j=v_jk | Y=y_i) for all i, j and k
The likelihood function is: L(Θ) = Π_i P(Y=y_i)^N_i · Π_{i,j,k} P(X_j=v_jk | Y=y_i)^N_jki
where N_i is the number of times we see class y_i in the dataset, and N_jki is the number of times we see an instance with X_j=v_jk and Y=y_i.

32 Estimating the model parameters (2)
The MLE problem is therefore: maximize L(Θ) subject to:
Σ_i P(Y=y_i) = 1 and, for all j and i, Σ_k P(X_j=v_jk | Y=y_i) = 1
Since we take logarithms, this becomes: maximize Σ_i N_i ln P(Y=y_i) + Σ_{i,j,k} N_jki ln P(X_j=v_jk | Y=y_i) under the same constraints.

33 Estimating the model parameters (3)
To maximize the likelihood we find the values that set the derivative (of the Lagrangian) to zero, obtaining:
P(Y=y_i) = N_i / |D| and P(X_j=v_jk | Y=y_i) = N_jki / N_i
I.e. θ_2 can be estimated as the ratio between the number of times feature X_j takes value v_jk when Y=y_i and the total number of examples in D for which Y=y_i.
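A minimal Python sketch of these counts-and-ratios estimates; the toy training set mirrors the four-example dataset used later in the probability-estimation example:

```python
from collections import Counter, defaultdict

# Toy training set D: (features, class) pairs in the style of the running example.
D = [
    ({"size": "small", "color": "red",  "shape": "circle"},   "positive"),
    ({"size": "large", "color": "red",  "shape": "circle"},   "positive"),
    ({"size": "small", "color": "red",  "shape": "triangle"}, "negative"),
    ({"size": "large", "color": "blue", "shape": "circle"},   "negative"),
]

# theta_1: class priors P(Y = y_i) = N_i / |D|
class_counts = Counter(y for _, y in D)
priors = {y: n / len(D) for y, n in class_counts.items()}

# theta_2: P(X_j = v_jk | Y = y_i) = N_jki / N_i
feature_counts = defaultdict(Counter)        # (class, feature) -> Counter over values
for x, y in D:
    for feature, value in x.items():
        feature_counts[(y, feature)][value] += 1
likelihoods = {
    key: {v: n / class_counts[key[0]] for v, n in counter.items()}
    for key, counter in feature_counts.items()
}

print(priors["positive"])                         # 0.5
print(likelihoods[("positive", "color")]["red"])  # 1.0
```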

34 How do we predict the category of an unseen instance with Naïve Bayes?
ŷ = argmax_{y_i} P(Y=y_i | X=x_k) = argmax_{y_i} P(Y=y_i) · Π_j P(X_j=v_jk | Y=y_i) / P(X=x_k)
Note that since the denominator P(X=x_k) is common to all the conditional probabilities, it does not affect the ranking: there is no need to compute it.

35 Naïve Bayes Generative Model
[Diagram: a generative model with a Category node (Positive/Negative) and, for each class, one "urn" per feature, Size (sm, med, lg), Color (red, blue, grn) and Shape (circ, sqr, tri), from which feature values are drawn. Here K=3 values per feature, |C|=2 classes, d=3 features.]

36 Naïve Bayes Inference Problem
[Diagram: the same generative model, now asked to classify the unseen instance <lg, red, circ>.]
I estimate on the learning set the probability of extracting lg, red, circ from the red (positive) or blue (negative) urns.

HOW? 37

38 Naïve Bayes Example
Training set estimates (8 positive and 8 negative examples):

Probability        Y=positive   Y=negative
P(Y)               0.5          0.5
P(small | Y)       3/8          3/8
P(medium | Y)      3/8          2/8
P(large | Y)       2/8          3/8
P(red | Y)         5/8          2/8
P(blue | Y)        2/8          3/8
P(green | Y)       1/8          3/8
P(square | Y)      1/8          3/8
P(triangle | Y)    2/8          3/8
P(circle | Y)      5/8          2/8

E.g., having 3 small instances out of the 8 in the positive "size" urn gives P(size=small | positive) = 3/8 = 0.375.

39 Naïve Bayes Example (continued)
Test instance: X = <medium, red, circle>

Probability        positive   negative
P(Y)               0.5        0.5
P(medium | Y)      3/8        2/8
P(red | Y)         5/8        2/8
P(circle | Y)      5/8        2/8

P(positive | X) = P(positive) · P(medium | positive) · P(red | positive) · P(circle | positive) / P(X) = 0.5 · 3/8 · 5/8 · 5/8 / P(X) = 0.0732 / P(X)
P(negative | X) = P(negative) · P(medium | negative) · P(red | negative) · P(circle | negative) / P(X) = 0.5 · 2/8 · 2/8 · 2/8 / P(X) = 0.0078 / P(X)
P(positive | X) > P(negative | X), therefore the instance is classified as positive.
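A short Python sketch reproducing the numbers above from the estimated probabilities (unnormalized scores, since P(X) cancels in the comparison):

```python
# Estimated probabilities from the table above.
params = {
    "positive": {"prior": 0.5, "medium": 3 / 8, "red": 5 / 8, "circle": 5 / 8},
    "negative": {"prior": 0.5, "medium": 2 / 8, "red": 2 / 8, "circle": 2 / 8},
}

# Unnormalized posterior scores for the test instance <medium, red, circle>.
scores = {
    y: p["prior"] * p["medium"] * p["red"] * p["circle"]
    for y, p in params.items()
}
print(round(scores["positive"], 4))   # 0.0732
print(round(scores["negative"], 4))   # 0.0078
print(max(scores, key=scores.get))    # positive
```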

40 Naïve Bayes summary
Classify any new instance x_k = (x_1, …, x_n) as:
y_NB = argmax_{y_i} P(y_i) Π_t P(x_t | y_i)
To do this based on training examples, estimate the parameters from the training examples in D:
–for each target value (hypothesis) y_i of the classification variable: P̂(y_i) = (number of examples of class y_i) / |D|
–for each attribute value a_t of each instance: P̂(a_t | y_i) = (number of examples of class y_i with value a_t) / (number of examples of class y_i)

41 Estimating Probabilities
Normally, as in the previous example, probabilities are estimated from the observed frequencies in the training data. If D contains N_i examples of category y_i, and N_jki of these N_i examples have the k-th value v_jk for feature X_j, then:
P(X_j=v_jk | Y=y_i) = N_jki / N_i
However, estimating such probabilities from small training sets is error-prone. If, due only to chance, a rare feature X_j is always false in the training data, then P(X_j=true | Y=y_i) = 0 for every y_i. If X_j=true then occurs in a test example X, the result is that P(X | Y=y_i) = 0 and P(Y=y_i | X) = 0 for every y_i.

42 Probability Estimation Example
Training set:

Ex   Size    Color   Shape      Category
1    small   red     circle     positive
2    large   red     circle     positive
3    small   red     triangle   negative
4    large   blue    circle     negative

Estimated probabilities:

Probability        positive   negative
P(Y)               0.5        0.5
P(small | Y)       0.5        0.5
P(medium | Y)      0.0        0.0
P(large | Y)       0.5        0.5
P(red | Y)         1.0        0.5
P(blue | Y)        0.0        0.5
P(green | Y)       0.0        0.0
P(square | Y)      0.0        0.0
P(triangle | Y)    0.0        0.5
P(circle | Y)      1.0        0.5

Test instance X = <medium, red, circle>:
P(positive | X) = 0.5 · 0.0 · 1.0 · 1.0 / P(X) = 0
P(negative | X) = 0.5 · 0.0 · 0.5 · 0.5 / P(X) = 0

43 Smoothing
To account for estimation from small samples, probability estimates are adjusted, or smoothed. Laplace smoothing using an m-estimate assumes that each feature value is given a prior probability p that is assumed to have been observed in a "virtual" sample of size m:
P(X_j=v_jk | Y=y_i) = (N_jki + m·p) / (N_i + m)
For binary features, p is simply assumed to be 0.5.
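A small Python sketch of the m-estimate described above; with m = 1 and p = 1/3 it reproduces the numbers on the next slide:

```python
def m_estimate(n_jki, n_i, p, m):
    """Smoothed estimate of P(X_j = v_jk | Y = y_i) with prior p and virtual sample size m."""
    return (n_jki + m * p) / (n_i + m)

# 10 positive examples: 4 small, 0 medium, 6 large; m = 1, p = 1/3 (three possible sizes).
for count in (4, 0, 6):
    print(round(m_estimate(count, 10, 1 / 3, 1), 3))   # 0.394, 0.03, 0.576
```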

44 Laplace Smoothing Example
Assume the training set contains 10 positive examples:
–4 small
–0 medium
–6 large
Estimate the parameters as follows (with m = 1 and p = 1/3):
–P(small | positive) = (4 + 1/3) / (10 + 1) = 0.394
–P(medium | positive) = (0 + 1/3) / (10 + 1) = 0.03
–P(large | positive) = (6 + 1/3) / (10 + 1) = 0.576
–P(small or medium or large | positive) = 1.0

45 Continuous Attributes
If X_j is a continuous feature rather than a discrete one, we need another way to calculate P(X_j | Y). Assume that X_j has a Gaussian distribution whose mean and variance depend on Y.
During training, for each combination of a continuous feature X_j and a class value y_i, estimate a mean μ_ji and a standard deviation σ_ji from the values of feature X_j over the instances of class y_i in D.
During testing, estimate P(X_j | Y=y_i) for a given example using the Gaussian distribution defined by μ_ji and σ_ji.
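A minimal Python sketch of this Gaussian treatment for one (hypothetical) continuous feature and one class; the feature values are made up:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, used as P(X_j = x | Y = y_i)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical continuous feature values observed for class y_i in the training data.
values = [4.1, 5.0, 5.3, 4.7, 5.9]
mu = sum(values) / len(values)
sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))

# At test time, plug the observed feature value into the fitted Gaussian.
print(gaussian_pdf(5.2, mu, sigma))
```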

46 Comments on Naïve Bayes
Tends to work well despite the strong assumption of conditional independence. Experiments show it to be quite competitive with other classification methods on standard UCI datasets.
Although it does not produce accurate probability estimates when its independence assumptions are violated, it may still pick the correct maximum-probability class in many cases.
–Able to learn conjunctive concepts in any case
Does not perform any search of the hypothesis space; it directly constructs a hypothesis from parameter estimates that are easily calculated from the training data.
–Strong bias
Not guaranteed to be consistent with the training data. Typically handles noise well, since it does not even try to fit the training data completely.