1 CS 2750: Machine Learning Probability Review Prof. Adriana Kovashka University of Pittsburgh February 29, 2016

2 Plan for today and next two classes: Probability review, Density estimation, Naïve Bayes and Bayesian Belief Networks

3 Procedural View Training Stage: – Raw Data → x (Feature Extraction) – Training Data {(x, y)} → f (Learning) Testing Stage: – Raw Data → x (Feature Extraction) – Test Data x → f(x) (Apply function, Evaluate error) (C) Dhruv Batra
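A minimal sketch of this two-stage pipeline in Python; the helper names and the trivial majority-class learner are illustrative assumptions, not anything from the slides:

def extract_features(raw):                      # Raw Data -> x
    return (len(raw), raw.count("a"))

def learn(train_pairs):                         # {(x, y)} -> f; a stand-in majority-class learner
    labels = [y for _, y in train_pairs]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority                   # f ignores x here; a real learner would not

# Training stage
train_raw = [("banana", 1), ("apple", 1), ("kiwi", 0)]
f = learn([(extract_features(r), y) for r, y in train_raw])

# Testing stage: apply f, evaluate error
test_raw = [("mango", 1), ("plum", 0)]
errors = sum(f(extract_features(r)) != y for r, y in test_raw)
print("test error rate:", errors / len(test_raw))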

4 Statistical Estimation View Probabilities to the rescue: – x and y are random variables – D = (x1,y1), (x2,y2), …, (xN,yN) ~ P(X,Y) IID: Independent and Identically Distributed – Both training & testing data sampled IID from P(X,Y) – Learn on training set – Have some hope of generalizing to test set (C) Dhruv Batra

5 Probability A is a non-deterministic event – Can think of A as a boolean-valued variable Examples: – A = your next patient has cancer – A = Rafael Nadal wins US Open 2016 (C) Dhruv Batra

6 Interpreting Probabilities What does P(A) mean? Frequentist View – limit as N → ∞ of #(A is true)/N – the limiting frequency of a repeating non-deterministic event Bayesian View – P(A) is your “belief” about A Market Design View – P(A) tells you how much you would bet (C) Dhruv Batra
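A quick simulation of the frequentist reading (a sketch; the event and its true probability are made up): estimate P(A) by the relative frequency of A over more and more independent trials.

import random

random.seed(0)
p_true = 0.3                                   # the P(A) we pretend not to know
for n in (100, 10_000, 1_000_000):
    freq = sum(random.random() < p_true for _ in range(n)) / n
    print(n, freq)                             # #(A is true)/N settles near 0.3 as N grows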

7 Axioms of Probability Theory All probabilities are between 0 and 1. A true proposition has probability 1, a false proposition has probability 0: P(true) = 1, P(false) = 0. The probability of a disjunction is: P(A v B) = P(A) + P(B) – P(A ^ B) Slide credit: Ray Mooney

8 Interpreting the Axioms 0 <= P(A) <= 1 P(false) = 0 P(true) = 1 P(A v B) = P(A) + P(B) – P(A ^ B) (C) Dhruv Batra, Image Credit: Andrew Moore
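A brute-force numerical check of the disjunction axiom on a toy sample space (two fair dice; the events A and B are made up for illustration):

from itertools import product

outcomes = list(product(range(1, 7), repeat=2))     # 36 equally likely worlds

def P(event):
    return len(event) / len(outcomes)

A = {w for w in outcomes if w[0] == 6}              # first die shows a 6
B = {w for w in outcomes if w[0] + w[1] >= 10}      # the sum is at least 10
print(P(A | B))                                     # 0.25
print(P(A) + P(B) - P(A & B))                       # also 0.25, as the axiom requires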

12 Joint Distribution The joint probability distribution for a set of random variables X1,…,Xn gives the probability of every combination of values: P(X1,…,Xn). If all variables are discrete with v values each, this is an n-dimensional array with v^n entries, and the v^n entries must sum to 1. The probability of any conjunction (an assignment of values to some subset of the variables) can be calculated by summing the appropriate subset of values from the joint distribution; therefore, all conditional probabilities can also be calculated. Example joint distribution over category, color, and shape:
               circle   square
positive  red   0.20     0.02
          blue  0.02     0.01
negative  red   0.05     0.30
          blue  0.20     0.20
Slide credit: Ray Mooney
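The same joint table in code, sketching how "summing the appropriate subset" gives marginals and conditionals (the 0.20 for the negative blue square cell is implied by the sum-to-one constraint):

joint = {  # (category, color, shape) -> probability
    ("positive", "red",  "circle"): 0.20, ("positive", "red",  "square"): 0.02,
    ("positive", "blue", "circle"): 0.02, ("positive", "blue", "square"): 0.01,
    ("negative", "red",  "circle"): 0.05, ("negative", "red",  "square"): 0.30,
    ("negative", "blue", "circle"): 0.20, ("negative", "blue", "square"): 0.20,
}
assert abs(sum(joint.values()) - 1.0) < 1e-9

# Marginal: P(red) = sum over all assignments to the remaining variables
p_red = sum(p for (cat, color, shape), p in joint.items() if color == "red")

# Conditional: P(positive | red, circle) = P(positive, red, circle) / P(red, circle)
p_red_circle = sum(p for (cat, color, shape), p in joint.items()
                   if color == "red" and shape == "circle")
print(p_red)                                                 # 0.57
print(joint[("positive", "red", "circle")] / p_red_circle)   # 0.8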

13 Marginal Distributions Sum rule: P(x) = Σ_y P(x, y) (C) Dhruv Batra, Slide Credit: Erik Sudderth

14 Conditional Probabilities P(A | B) = P(A ^ B) / P(B) – in the worlds where B is true, the fraction where A is also true. Example: – H: “Have a headache” – F: “Coming down with Flu” (C) Dhruv Batra

15 Conditional Probabilities P(Y=y | X=x): What do you believe about Y=y, if I tell you X=x? P(Rafael Nadal wins US Open 2016)? What if I tell you: – He has won the US Open twice – Novak Djokovic is ranked 1; just won Australian Open (C) Dhruv Batra

16 Conditional Distributions Product rule: P(x, y) = P(y | x) P(x) (C) Dhruv Batra, Slide Credit: Erik Sudderth

17 Conditional Probabilities Figures from Bishop

18 Chain Rule Generalized product rule: P(X1, X2, …, Xn) = P(X1) P(X2 | X1) P(X3 | X1, X2) … P(Xn | X1, …, Xn-1) Example: P(A, B, C) = P(A) P(B | A) P(C | A, B) Equations from Wikipedia

19 Independence A and B are independent iff: P(A | B) = P(A) and P(B | A) = P(B) (these two constraints are logically equivalent). Therefore, if A and B are independent: P(A | B) = P(A ^ B) / P(B) = P(A), so P(A ^ B) = P(A) P(B). Slide credit: Ray Mooney

20 Independence Marginal: P satisfies (X ⊥ Y) if and only if – P(X=x, Y=y) = P(X=x) P(Y=y), ∀ x ∈ Val(X), y ∈ Val(Y) Conditional: P satisfies (X ⊥ Y | Z) if and only if – P(X, Y | Z) = P(X | Z) P(Y | Z), ∀ x ∈ Val(X), y ∈ Val(Y), z ∈ Val(Z) (C) Dhruv Batra
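A small sketch of testing the marginal-independence condition numerically on a made-up joint over color and shape:

# Toy joint P(color, shape); the numbers are made up for illustration
joint = {("red", "circle"): 0.20, ("red", "square"): 0.30,
         ("blue", "circle"): 0.30, ("blue", "square"): 0.20}

p_color, p_shape = {}, {}
for (c, s), p in joint.items():               # marginalize out the other variable
    p_color[c] = p_color.get(c, 0.0) + p
    p_shape[s] = p_shape.get(s, 0.0) + p

independent = all(abs(joint[(c, s)] - p_color[c] * p_shape[s]) < 1e-12
                  for (c, s) in joint)
print(independent)    # False: P(red, circle) = 0.20, but P(red) P(circle) = 0.25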

21 Independent Random Variables (C) Dhruv Batra, Slide Credit: Erik Sudderth

22 Other Concepts Expectation: E[f] = Σ_x p(x) f(x) Variance: var[f] = E[(f(x) – E[f(x)])^2] = E[f(x)^2] – E[f(x)]^2 Covariance: cov[x, y] = E[(x – E[x])(y – E[y])] = E[xy] – E[x] E[y] Equations from Bishop
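A sketch of estimating these quantities from samples with NumPy (the distribution parameters are chosen arbitrarily for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)    # E[x] = 2, var[x] = 9
y = 0.5 * x + rng.normal(size=100_000)              # y depends on x, so cov[x, y] > 0

print(x.mean())            # sample estimate of E[x], close to 2
print(x.var())             # close to 9
print(np.cov(x, y)[0, 1])  # close to 0.5 * var[x] = 4.5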

23 Central Limit Theorem Let {X1, …, Xn} be a random sample of size n, that is, a sequence of independent and identically distributed random variables drawn from distributions of expected values given by µ and finite variances given by σ^2. Suppose we are interested in the sample average Sn of these random variables. By the law of large numbers, the sample averages converge in probability and almost surely to the expected value µ as n → ∞. […] For large enough n, the distribution of Sn is close to the normal distribution with mean µ and variance σ^2/n. The usefulness of the theorem is that the distribution of √n(Sn − µ) approaches normality regardless of the shape of the distribution of the individual Xi's. Text from Wikipedia
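A small simulation of the statement above (skewed exponential draws with µ = 1 and σ^2 = 1; the sample sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n = 50
draws = rng.exponential(scale=1.0, size=(100_000, n))   # 100,000 independent samples of size n
sample_means = draws.mean(axis=1)                       # S_n for each sample

print(sample_means.mean())   # close to mu = 1
print(sample_means.var())    # close to sigma^2 / n = 0.02
# A histogram of sample_means looks close to Gaussian, despite the skewed X_i.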

24 Entropy Entropy of a random variable X: H(X) = – Σ_x p(x) log p(x), a measure of the uncertainty in X. (C) Dhruv Batra, Slide Credit: Sam Roweis

25 KL-Divergence / Relative Entropy KL(p || q) = Σ_x p(x) log ( p(x) / q(x) ), a non-negative, asymmetric measure of how different distribution q is from distribution p. (C) Dhruv Batra, Slide Credit: Sam Roweis
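A minimal sketch computing both quantities for two small discrete distributions (natural log; the distributions are made up):

import math

def entropy(p):                       # H(p) = -sum_x p(x) log p(x)
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):                         # KL(p || q) = sum_x p(x) log(p(x) / q(x))
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(entropy(p), entropy(q))         # the uniform q has the larger entropy
print(kl(p, q), kl(q, p))             # both nonnegative, and not symmetric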

26 Bayes Theorem P(H | E) = P(E | H) P(H) / P(E) Simple proof from the definition of conditional probability: P(H | E) = P(H ^ E) / P(E) (def. cond. prob.) and P(E | H) = P(H ^ E) / P(H) (def. cond. prob.), so P(H ^ E) = P(E | H) P(H), and therefore P(H | E) = P(E | H) P(H) / P(E). QED. Adapted from Ray Mooney
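A worked numeric instance of the theorem (all numbers invented for illustration): a test that is 99% sensitive and 95% specific, for a condition with 1% prevalence.

p_disease = 0.01                  # prior P(H)
p_pos_given_disease = 0.99        # likelihood P(E | H)
p_pos_given_healthy = 0.05        # false positive rate, 1 - specificity

# Evidence P(E) by the law of total probability
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior P(H | E) = P(E | H) P(H) / P(E)
print(p_pos_given_disease * p_disease / p_pos)   # about 0.17, far below the 0.99 sensitivity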

27 Probabilistic Classification Let Y be the random variable for the class, which takes values {y1, y2, …, ym}. Let X be the random variable describing an instance consisting of a vector of values for n features; let xk be a possible value for X and xij a possible value for Xi. For classification, we need to compute P(Y=yi | X=xk) for i = 1…m. However, given no other assumptions, this requires a table giving the probability of each category for each possible instance in the instance space, which is impossible to accurately estimate from a reasonably-sized training set. – Assuming Y and all Xi are binary, we need 2^n entries to specify P(Y=pos | X=xk) for each of the 2^n possible xk's, since P(Y=neg | X=xk) = 1 – P(Y=pos | X=xk) – Compared to 2^(n+1) – 1 entries for the joint distribution P(Y, X1, X2, …, Xn) Slide credit: Ray Mooney
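To make the scale concrete (n = 20 binary features, a value picked only for illustration):

n = 20
print(2**n)            # 1,048,576 entries to specify P(Y=pos | X=x_k) for every x_k
print(2**(n + 1) - 1)  # 2,097,151 entries for the full joint P(Y, X_1, ..., X_n)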

28 Bayesian Categorization Determine the category of xk by determining, for each yi, the posterior P(Y=yi | X=xk) = P(Y=yi) P(X=xk | Y=yi) / P(X=xk) (posterior = prior × likelihood / evidence). P(X=xk) can be determined since the categories are complete and disjoint: P(X=xk) = Σ_i P(Y=yi) P(X=xk | Y=yi). Adapted from Ray Mooney

29 Bayesian Categorization (cont.) Need to know: – Priors: P(Y=yi) – Conditionals (likelihood): P(X=xk | Y=yi) P(Y=yi) are easily estimated from data: – If ni of the examples in D are in class yi, then P(Y=yi) = ni / |D| Too many possible instances (e.g. 2^n for binary features) to estimate all P(X=xk | Y=yi). Need to make some sort of independence assumption about the features to make learning tractable (more details later). Adapted from Ray Mooney
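A minimal sketch of the prior estimate P(Y=yi) = ni / |D| from a list of training labels (the labels are made up):

from collections import Counter

labels = ["pos", "neg", "neg", "pos", "neg", "neg", "neg", "pos"]   # class of each example in D
counts = Counter(labels)
priors = {y: n / len(labels) for y, n in counts.items()}
print(priors)   # {'pos': 0.375, 'neg': 0.625}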

30 Likelihood / Prior / Posterior A hypothesis is denoted as h; it is one member of the hypothesis space H. A set of training examples is denoted as D, a collection of (x, y) pairs for training. Pr(h) – the prior probability of the hypothesis – without observing any training data, what's the probability that h is the target function we want? Slide content from Rebecca Hwa

31 Likelihood / Prior / Posterior Pr(D) – the prior probability of the observed data – the chance of getting the particular set of training examples D. Pr(h|D) – the posterior probability of h – what is the probability that h is the target given that we've observed D? Pr(D|h) – the probability of getting D if h were true (a.k.a. the likelihood of the data). Bayes' theorem ties these together: Pr(h|D) = Pr(D|h) Pr(h) / Pr(D) Slide content from Rebecca Hwa

32 MAP vs MLE Estimation Maximum-a-posteriori (MAP) estimation: – h_MAP = argmax_h Pr(h|D) = argmax_h Pr(D|h) Pr(h) / Pr(D) = argmax_h Pr(D|h) Pr(h) Maximum likelihood estimation (MLE): – h_ML = argmax_h Pr(D|h) Slide content from Rebecca Hwa
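A toy contrast between the two estimates over a tiny hypothesis space (three candidate coin biases; the data and the prior, which strongly doubts an extreme bias, are made up for illustration):

heads, tails = 9, 1                                  # observed data D
prior = {0.3: 0.45, 0.5: 0.54, 0.9: 0.01}            # h (coin bias) -> Pr(h)

def likelihood(h):                                   # Pr(D | h), dropping the binomial constant
    return h**heads * (1 - h)**tails

h_ml = max(prior, key=likelihood)
h_map = max(prior, key=lambda h: likelihood(h) * prior[h])
print(h_ml, h_map)   # MLE picks 0.9; MAP picks 0.5 because the prior down-weights 0.9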

