
1 Bayesian Learning, Part 1 of (probably) 4 Reading: DH&S, Ch. 2.{1-5}, 3.{1-4}

2 Administrivia Finalizing the straw poll... Reminder: I’m out of town, Mar 1-3

3 Blatant advertisement! 2nd CS UNM Student Conference (CSUSC 2006) March 3, 2006 http://www.cs.unm.edu/~csgsa/conference/ See cool work going on in CS Learn what constitutes good research Support your fellow grad students Free food...

4 Reading #2 Kolter, J.Z., & Maloof, M.A. (2005). “Using additive expert ensembles to cope with concept drift.” In Proceedings of the Twenty-second International Conference on Machine Learning (ICML-2005), 449–456. New York, NY: ACM Press. http://www.cs.georgetown.edu/~maloof/pubs/icml05.php Due: Tues, Mar 7

5 Yesterday, today, and... Last time: Finish up SVMs (informal notes) Discussion of R1 This time: Intro to statistical/generative modeling The Bayesian viewpoint (maybe) Maximum likelihood (ML) estimation Next time: ML for sure...

6 ML trivia of the day... Which data mining techniques have you used in a successfully deployed application? http://www.kdnuggets.com/

7 Assumptions “Assume makes an a** out of U and ME”... Bull**** Assumptions are unavoidable It is not possible to have an assumption-free learning algorithm Must always have some assumption about how the data works Makes learning faster, more accurate, more robust

8 Example assumptions Decision tree: Axis orthogonality Impurity-based splitting Greedy search ok Accuracy (0/1 loss) objective function

9 Example assumptions k-NN: Distance function/metric Accuracy objective Data drawn from probability distribution k controls “smoothness” of prob. estimate

10 Example assumptions Linear discriminant (hyperplane classifier) via MSE: Data is linearly separable Squared-error cost

11 Example assumptions Support vector machines Data is (close to) linearly separable...... in some high-dimensional projection of input space Interesting nonlinearities can be captured by kernel functions Max margin objective function

12 Specifying assumptions Bayesian learning assumes: Data were generated by some stochastic process Can write down (some) mathematical form for that process CDF/PDF/PMF Mathematical form needs to be parameterized Have some “prior beliefs” about those params

13 Specifying assumptions Makes strong assumptions about form (distribution) of data Essentially, an attempt to make assumptions explicit and to divorce them from learning algorithm In practice, not a single learning algorithm, but a recipe for generating problem-specific algs. Will work well to the extent that these assumptions are right

14 Example F ={height, weight} Ω={male, female} Q1: Any guesses about individual distributions of height/weight by class? What probability function (PDF)? Q2: What about the joint distribution? Q3: What about the means of each? Reasonable guess for the upper/lower bounds on the means?

15 Some actual data* * Actual synthesized data, anyway...

16 General idea Find probability distribution that describes classes of data Find decision surface in terms of those probability distributions

17 H/W data as PDFs

18 Or, if you prefer...

19 General idea Find probability distribution that describes classes of data Find decision surface in terms of those probability distributions What would be a good rule?

20 Recall: Bayes optimality For 0/1 loss (accuracy), we showed that the optimal decision is (Lecture 7, Feb 7): Equivalently, it’s sometimes useful to use the log odds ratio test:
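
The equations on this slide were images and don't survive in the transcript; in standard notation, the 0/1-loss decision rule and the two-class log odds ratio test it refers to are presumably of this form:

```latex
% Bayes-optimal decision for 0/1 loss: pick the class with the largest posterior.
\hat{\omega}(x) = \arg\max_{\omega_i \in \Omega} \Pr[\omega_i \mid x]

% For two classes, the equivalent log odds ratio test:
\log \frac{\Pr[\omega_1 \mid x]}{\Pr[\omega_2 \mid x]} \;\gtrless\; 0
\quad\Longleftrightarrow\quad
\log \frac{p(x \mid \omega_1)\,\Pr[\omega_1]}{p(x \mid \omega_2)\,\Pr[\omega_2]} \;\gtrless\; 0
```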

21 Recall: Bayes optimality In pictures:

22 Bayesian learning process So where do the probability distributions come from? The art of Bayesian data modeling is: Deciding what probability models to use Figuring out how to find the parameters In Bayesian learning, the “learning” is (almost) all in finding the parameters
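
As a concrete illustration of this recipe (not code from the course), here is a minimal sketch on made-up height/weight data: choose a probability model (a Gaussian per class), "learn" by estimating its parameters, then classify with the 0/1-loss rule from slide 20. All numbers and class names are hypothetical.

```python
# Minimal sketch of the generative recipe: per-class Gaussian model,
# parameter estimation as the "learning", Bayes rule for classification.
# The data below is synthesized purely for illustration.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
# Hypothetical (height cm, weight kg) samples for two classes.
male = rng.multivariate_normal([178, 80], [[49, 20], [20, 100]], size=200)
female = rng.multivariate_normal([165, 63], [[42, 15], [15, 80]], size=200)

classes = {"male": male, "female": female}
n_total = sum(len(X) for X in classes.values())

# "Learning" = finding the parameters: per-class mean vector, covariance, prior.
params = {
    c: (X.mean(axis=0), np.cov(X, rowvar=False), len(X) / n_total)
    for c, X in classes.items()
}

def classify(x):
    # Bayes rule for 0/1 loss: pick the class maximizing p(x | class) * P(class).
    scores = {
        c: multivariate_normal(mean=mu, cov=cov).pdf(x) * prior
        for c, (mu, cov, prior) in params.items()
    }
    return max(scores, key=scores.get)

print(classify([180, 85]))  # most likely "male" under these made-up parameters
print(classify([160, 55]))  # most likely "female"
```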

23 Back to the H/W data

24 Gaussian (a.k.a. normal or bell curve) is a reasonable assumption for this data Other distributions better for other data Can make reasonable guesses about means Probably not -3 kg or 2 million lightyears Assumptions like these are called model assumptions (Gaussian) and parameter priors (means) How do we incorporate these into learning? Prior knowledge
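
A standard way to write how these two kinds of prior knowledge enter (this formula is not on the slide, just the usual statement): the model assumption fixes the form of the likelihood p(D | θ), the parameter prior is p(θ), and Bayes' rule over the parameters combines them.

```latex
% Posterior over parameters = likelihood (model assumption) times prior
% (parameter prior), normalized by the evidence.
p(\theta \mid D) \;=\; \frac{p(D \mid \theta)\, p(\theta)}{p(D)}
\;\propto\; p(D \mid \theta)\, p(\theta)
```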

25 5 minutes of math... Our friend the Gaussian distribution In 1 dimension: Mean: μ Std deviation: σ Both parameters scalar Usually, we talk about variance rather than std dev: σ²
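
The density formula itself was an image and is lost in this transcript; the standard 1-d Gaussian with mean μ and variance σ² is:

```latex
% 1-dimensional Gaussian density, parameterized by mean and variance.
p(x \mid \mu, \sigma^2) \;=\; \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
```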

26 Gaussian: the pretty picture

27 Gaussian: the pretty picture Location parameter: μ

28 Gaussian: the pretty picture Scale parameter: σ

29 5 minutes of math... In d dimensions: Where: Mean vector: μ Covariance matrix: Σ Determinant of covariance: |Σ|
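
Again, the formula is missing here; the standard d-dimensional Gaussian density, in terms of the mean vector μ, covariance matrix Σ, and its determinant |Σ|, is:

```latex
% d-dimensional Gaussian density.
p(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) \;=\;
  \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}}
  \exp\!\left(-\tfrac{1}{2}\,(\mathbf{x} - \boldsymbol{\mu})^{\!\top}
  \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)
```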

30 Exercise: For the 1-d Gaussian: Given two classes, with means μ1 and μ2 and std devs σ1 and σ2 Find a description of the decision point if the std devs are the same, but diff means And if the means are the same, but std devs are diff For the d-dim Gaussian, What shapes are the isopotentials? Why? Repeat above exercise for d-dim Gaussian
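
A sketch of the first part of the exercise, assuming equal priors (the slide doesn't state the priors): with σ1 = σ2 = σ, set the two class densities equal and solve for the crossing point.

```latex
% Equal variances, equal priors: decision point is where the densities cross.
\frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu_1)^2}{2\sigma^2}}
  \;=\; \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu_2)^2}{2\sigma^2}}
\;\Longrightarrow\; (x-\mu_1)^2 = (x-\mu_2)^2
\;\Longrightarrow\; x^{*} = \frac{\mu_1 + \mu_2}{2}
```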

