
1 Bayesian Learning, Part 1 of (probably) 4 Reading: DH&S, Ch. 2.{1-5}, 3.{1-4}

2 Administrivia Finalizing the straw poll... Reminder: I’m out of town, Mar 1-3

3 Blatant advertisement! 2nd CS UNM Student Conference (CSUSC 2006) March 3, 2006 http://www.cs.unm.edu/~csgsa/conference/ See cool work going on in CS Learn what constitutes good research Support your fellow grad students Free food...

4 Reading #2 Kolter, J.Z., & Maloof, M.A. (2005). “Using additive expert ensembles to cope with concept drift.” In Proceedings of the Twenty-second International Conference on Machine Learning (ICML-2005), 449–456. New York, NY: ACM Press. http://www.cs.georgetown.edu/~maloof/pubs/icml05.php Due: Tues, Mar 7

5 Yesterday, today, and... Last time: Finish up SVMs (informal notes) Discussion of R1 This time: Intro to statistical/generative modeling The Bayesian viewpoint (maybe) Maximum likelihood (ML) estimation Next time: ML for sure...

6 ML trivia of the day... Which data mining techniques have you used in a successfully deployed application? http://www.kdnuggets.com/

7 Assumptions “Assume makes an a** out of U and ME”... Bull**** Assumptions are unavoidable It is not possible to have an assumption-free learning algorithm Must always have some assumption about how the data works Makes learning faster, more accurate, more robust

8 Example assumptions Decision tree: Axis orthogonality Impurity-based splitting Greedy search ok Accuracy (0/1 loss) objective function

9 Example assumptions k-NN: Distance function/metric Accuracy objective Data drawn from probability distribution k controls “smoothness” of prob. estimate

10 Example assumptions Linear discriminant (hyperplane classifier) via MSE: Data is linearly separable Squared-error cost

11 Example assumptions Support vector machines Data is (close to) linearly separable...... in some high-dimensional projection of input space Interesting nonlinearities can be captured by kernel functions Max margin objective function

12 Specifying assumptions Bayesian learning assumes: Data were generated by some stochastic process Can write down (some) mathematical form for that process CDF/PDF/PMF Mathematical form needs to be parameterized Have some “prior beliefs” about those params

13 Specifying assumptions Makes strong assumptions about form (distribution) of data Essentially, an attempt to make assumptions explicit and to divorce them from learning algorithm In practice, not a single learning algorithm, but a recipe for generating problem-specific algs. Will work well to the extent that these assumptions are right

14 Example F ={height, weight} Ω={male, female} Q1: Any guesses about individual distributions of height/weight by class? What probability function (PDF)? Q2: What about the joint distribution? Q3: What about the means of each? Reasonable guess for the upper/lower bounds on the means?

15 Some actual data* * Actual synthesized data, anyway...

16 General idea Find probability distribution that describes classes of data Find decision surface in terms of those probability distributions

17 H/W data as PDFs

18 Or, if you prefer...

19 General idea Find probability distribution that describes classes of data Find decision surface in terms of those probability distributions What would be a good rule?

20 Recall: Bayes optimality For 0/1 loss (accuracy), we showed that the optimal decision is (Lecture 7, Feb 7): Equivalently, it’s sometimes useful to use the log odds ratio test:
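
The equations on this slide were images and don't survive in the transcript; in standard notation, the 0/1-loss decision rule and the two-class log odds ratio test it refers to are presumably of this form:

```latex
% Bayes-optimal decision for 0/1 loss: pick the class with the largest posterior.
\hat{\omega}(x) = \arg\max_{\omega_i \in \Omega} \Pr[\omega_i \mid x]

% For two classes, the equivalent log odds ratio test:
\log \frac{\Pr[\omega_1 \mid x]}{\Pr[\omega_2 \mid x]} \;\gtrless\; 0
\quad\Longleftrightarrow\quad
\log \frac{p(x \mid \omega_1)\,\Pr[\omega_1]}{p(x \mid \omega_2)\,\Pr[\omega_2]} \;\gtrless\; 0
```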

21 Recall: Bayes optimality In pictures:

22 Bayesian learning process So where do the probability distributions come from? The art of Bayesian data modeling is: Deciding what probability models to use Figuring out how to find the parameters In Bayesian learning, the “learning” is (almost) all in finding the parameters
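
As a concrete illustration of this recipe (not code from the course), here is a minimal sketch on made-up height/weight data: choose a probability model (a Gaussian per class), "learn" by estimating its parameters, then classify with the 0/1-loss rule from slide 20. All numbers and class names are hypothetical.

```python
# Minimal sketch of the generative recipe: per-class Gaussian model,
# parameter estimation as the "learning", Bayes rule for classification.
# The data below is synthesized purely for illustration.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
# Hypothetical (height cm, weight kg) samples for two classes.
male = rng.multivariate_normal([178, 80], [[49, 20], [20, 100]], size=200)
female = rng.multivariate_normal([165, 63], [[42, 15], [15, 80]], size=200)

classes = {"male": male, "female": female}
n_total = sum(len(X) for X in classes.values())

# "Learning" = finding the parameters: per-class mean vector, covariance, prior.
params = {
    c: (X.mean(axis=0), np.cov(X, rowvar=False), len(X) / n_total)
    for c, X in classes.items()
}

def classify(x):
    # Bayes rule for 0/1 loss: pick the class maximizing p(x | class) * P(class).
    scores = {
        c: multivariate_normal(mean=mu, cov=cov).pdf(x) * prior
        for c, (mu, cov, prior) in params.items()
    }
    return max(scores, key=scores.get)

print(classify([180, 85]))  # most likely "male" under these made-up parameters
print(classify([160, 55]))  # most likely "female"
```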

23 Back to the H/W data

24 Gaussian (a.k.a. normal or bell curve) is a reasonable assumption for this data Other distributions better for other data Can make reasonable guesses about means Probably not -3 kg or 2 million lightyears Assumptions like these are called model assumptions (Gaussian) and parameter priors (means) How do we incorporate these into learning? Prior knowledge
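
A standard way to write how these two kinds of prior knowledge enter (this formula is not on the slide, just the usual statement): the model assumption fixes the form of the likelihood p(D | θ), the parameter prior is p(θ), and Bayes' rule over the parameters combines them.

```latex
% Posterior over parameters = likelihood (model assumption) times prior
% (parameter prior), normalized by the evidence.
p(\theta \mid D) \;=\; \frac{p(D \mid \theta)\, p(\theta)}{p(D)}
\;\propto\; p(D \mid \theta)\, p(\theta)
```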

25 5 minutes of math... Our friend the Gaussian distribution In 1 dimension: Mean: μ Std deviation: σ Both parameters scalar Usually, we talk about variance rather than std dev: σ²
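
The density formula itself was an image and is lost in this transcript; the standard 1-d Gaussian with mean μ and variance σ² is:

```latex
% 1-dimensional Gaussian density, parameterized by mean and variance.
p(x \mid \mu, \sigma^2) \;=\; \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
```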

26 Gaussian: the pretty picture

27 Gaussian: the pretty picture Location parameter: μ

28 Gaussian: the pretty picture Scale parameter: σ

29 5 minutes of math... In d dimensions: Where: Mean vector: μ Covariance matrix: Σ Determinant of covariance: |Σ|
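
Again, the formula is missing here; the standard d-dimensional Gaussian density, in terms of the mean vector μ, covariance matrix Σ, and its determinant |Σ|, is:

```latex
% d-dimensional Gaussian density.
p(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) \;=\;
  \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}}
  \exp\!\left(-\tfrac{1}{2}\,(\mathbf{x} - \boldsymbol{\mu})^{\!\top}
  \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)
```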

30 Exercise: For the 1-d Gaussian: Given two classes, with means μ1 and μ2 and std devs σ1 and σ2 Find a description of the decision point if the std devs are the same, but diff means And if the means are the same, but std devs are diff For the d-dim Gaussian, What shapes are the isopotentials? Why? Repeat above exercise for d-dim Gaussian
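
A sketch of the first part of the exercise, assuming equal priors (the slide doesn't state the priors): with σ1 = σ2 = σ, set the two class densities equal and solve for the crossing point.

```latex
% Equal variances, equal priors: decision point is where the densities cross.
\frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu_1)^2}{2\sigma^2}}
  \;=\; \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu_2)^2}{2\sigma^2}}
\;\Longrightarrow\; (x-\mu_1)^2 = (x-\mu_2)^2
\;\Longrightarrow\; x^{*} = \frac{\mu_1 + \mu_2}{2}
```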

