Crash Course on Machine Learning

Crash Course on Machine Learning. Several slides from Luke Zettlemoyer, Carlos Guestrin and Ben Taskar.

Typical Paradigms of Recognition: Feature Computation → Model

Visual Recognition. Identification: Is this your car? (classification)

Visual Recognition. Verification: Is this a car? (classification)

Visual Recognition. Classification: Is there a car in this picture?

Visual Recognition. Detection: Where is the car in this picture? (classification / structure learning)

Visual Recognition. Activity Recognition: What is he doing? (classification)

Visual Recognition. Pose Estimation (regression / structure learning)

Visual Recognition. Object Categorization: sky, person, tree, horse, car, bicycle, road (structure learning)

Visual Recognition. Segmentation: sky, tree, car, person (classification / structure learning)

What kind of problems?
Classification: generative vs. discriminative; supervised, unsupervised, semi-supervised, weakly supervised; linear, nonlinear; ensemble methods; probabilistic.
Regression: linear regression; structured output regression.
Structure Learning: graphical models; margin-based approaches.

Let’s play with probability for a bit Remembering simple stuff

Thumbtack & Probabilities. P(Heads) = θ, P(Tails) = 1 − θ. Flips are i.i.d.: independent events, identically distributed according to a Binomial distribution. For a sequence D of H heads and T tails, D = {x_i | i = 1…n}, P(D | θ) = Π_i P(x_i | θ) = θ^H (1 − θ)^T.

Maximum Likelihood Estimation. Data: an observed set D with H heads and T tails. Hypothesis: Binomial distribution. Learning: finding θ is an optimization problem. What's the objective function? MLE: choose θ to maximize the probability of D: θ̂ = arg max_θ P(D | θ) = arg max_θ ln P(D | θ).

Parameter learning. Set the derivative to zero and solve: d/dθ ln P(D | θ) = d/dθ [H ln θ + T ln(1 − θ)] = H/θ − T/(1 − θ) = 0, which gives θ̂_MLE = H / (H + T).
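A minimal sketch of this estimate in Python (the flip sequence below is made up for illustration; this is not code from the lecture):

```python
# Toy illustration: MLE for the thumbtack parameter theta.
flips = ["H", "H", "T", "H", "T"]  # hypothetical observed flips

H = sum(1 for f in flips if f == "H")
T = len(flips) - H

# Closed form from setting the derivative of the log-likelihood to zero.
theta_mle = H / (H + T)
print(f"H={H}, T={T}, theta_MLE={theta_mle:.3f}")  # 3/5 = 0.600
```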

But, how many flips do I need? 3 heads and 2 tails: θ̂ = 3/5, I can prove it! What if I flipped 30 heads and 20 tails? Same answer, I can prove it! Which is better? Umm… the more the merrier???

A bound (from Hoeffding's inequality). For N = H + T flips and true parameter θ*, for any ε > 0: P(|θ̂ − θ*| ≥ ε) ≤ 2 e^(−2Nε²). Plotted against N, the probability of a mistake decays exponentially. Hoeffding's inequality bounds the probability that a sum of bounded random variables deviates from its expectation.
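A quick simulation, not from the slides, that illustrates the exponential decay: estimate how often |θ̂ − θ*| ≥ ε over many simulated thumbtacks as N grows, and compare against the Hoeffding bound (θ*, ε, and the trial count below are arbitrary choices):

```python
# Hypothetical simulation: empirical P(|theta_hat - theta*| >= eps)
# versus the Hoeffding bound 2 * exp(-2 * N * eps^2).
import math
import random

random.seed(0)
theta_star, eps, trials = 0.5, 0.1, 5000

for N in [10, 50, 100, 500]:
    mistakes = 0
    for _ in range(trials):
        heads = sum(random.random() < theta_star for _ in range(N))
        theta_hat = heads / N
        mistakes += abs(theta_hat - theta_star) >= eps
    bound = 2 * math.exp(-2 * N * eps ** 2)
    print(f"N={N:4d}  empirical={mistakes / trials:.4f}  bound={bound:.4f}")
```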

What if I have prior beliefs? Wait, I know that the thumbtack is "close" to 50-50. What can you do for me now? Rather than estimating a single θ, we obtain a distribution over possible values of θ: a prior distribution in the beginning, updated to a posterior after observing flips (e.g., {tails, tails}).

How to use the prior. Use Bayes' rule: P(θ | D) = P(D | θ) P(θ) / P(D), i.e. posterior = data likelihood × prior / normalization. Or equivalently: P(θ | D) ∝ P(D | θ) P(θ). Also, for uniform priors, P(θ) ∝ 1 and the MAP objective reduces to the MLE objective.

Beta prior distribution, P(θ): P(θ) = θ^(β_H − 1) (1 − θ)^(β_T − 1) / B(β_H, β_T), i.e. Beta(β_H, β_T). Likelihood function: P(D | θ) = θ^H (1 − θ)^T. Posterior: P(θ | D) ∝ θ^(H + β_H − 1) (1 − θ)^(T + β_T − 1), i.e. Beta(β_H + H, β_T + T).

MAP for the Beta distribution. MAP: use the most likely parameter: θ̂_MAP = arg max_θ P(θ | D) = (H + β_H − 1) / (H + T + β_H + β_T − 2).
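A minimal sketch contrasting the MLE with the Beta-MAP estimate; the counts and prior pseudo-counts below are illustrative, not from the slides:

```python
# Hypothetical example: MLE vs. MAP with a Beta(beta_H, beta_T) prior.
H, T = 3, 2               # observed flips: 3 heads, 2 tails
beta_H, beta_T = 50, 50   # illustrative "close to 50-50" prior pseudo-counts

theta_mle = H / (H + T)
theta_map = (H + beta_H - 1) / (H + T + beta_H + beta_T - 2)

print(f"MLE: {theta_mle:.3f}")  # 0.600, driven entirely by the 5 flips
print(f"MAP: {theta_map:.3f}")  # ~0.505, pulled toward the 50-50 prior
```

With more data the likelihood dominates the prior and the two estimates converge.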

What about continuous variables?

We like Gaussians because: affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian, i.e. if X ~ N(μ, σ²) and Y = aX + b then Y ~ N(aμ + b, a²σ²); sums of independent Gaussians are Gaussian, i.e. if X ~ N(μ_X, σ²_X), Y ~ N(μ_Y, σ²_Y), and Z = X + Y then Z ~ N(μ_X + μ_Y, σ²_X + σ²_Y); and they are easy to differentiate.
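A quick Monte Carlo sanity check of these two closure properties (a simulation sketch under arbitrary parameter choices, not part of the slides):

```python
# Hypothetical check of the affine and sum properties of Gaussians.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
mu, sigma, a, b = 2.0, 3.0, 0.5, 1.0

x = rng.normal(mu, sigma, n)
y = a * x + b                 # should be ~ N(a*mu + b, a^2 * sigma^2)
print(y.mean(), y.var())      # approx 2.0 and 2.25

x2 = rng.normal(1.0, 2.0, n)  # independent of x
z = x + x2                    # should be ~ N(2 + 1, 9 + 4)
print(z.mean(), z.var())      # approx 3.0 and 13.0
```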

Learning a Gaussian. Collect a bunch of data, hopefully i.i.d. samples, e.g. exam scores (x_1 = 85, x_2 = 95, x_3 = 100, …). Learn the parameters: mean μ and variance σ².

MLE for a Gaussian. Probability of the i.i.d. samples D = {x_1, …, x_N}: P(D | μ, σ) = (1 / (σ√(2π)))^N Π_i exp(−(x_i − μ)² / (2σ²)). Log-likelihood of the data: ln P(D | μ, σ) = −N ln(σ√(2π)) − Σ_i (x_i − μ)² / (2σ²).

MLE for the mean of a Gaussian. What's the MLE for the mean? Setting the derivative with respect to μ to zero gives μ̂_MLE = (1/N) Σ_i x_i.

MLE for the variance. Again, set the derivative to zero and solve: σ̂²_MLE = (1/N) Σ_i (x_i − μ̂_MLE)².

Learning Gaussian parameters. MLE: μ̂_MLE = (1/N) Σ_i x_i and σ̂²_MLE = (1/N) Σ_i (x_i − μ̂_MLE)². (The MLE for the variance is biased; the unbiased estimator divides by N − 1 instead of N.)
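A small numpy sketch of these estimators (the scores are illustrative numbers, not data from the slides):

```python
# Hypothetical example: MLE for Gaussian mean and variance from i.i.d. samples.
import numpy as np

scores = np.array([85.0, 95.0, 100.0, 89.0, 99.0])  # toy exam scores

mu_mle = scores.mean()                     # (1/N) * sum(x_i)
var_mle = ((scores - mu_mle) ** 2).mean()  # (1/N) * sum((x_i - mu)^2), biased
var_unbiased = scores.var(ddof=1)          # divides by N - 1 instead

print(mu_mle, var_mle, var_unbiased)
```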

MAP with conjugate priors. Prior for the mean: a Gaussian prior. Prior for the variance: a Wishart distribution.

Supervised Learning: find f. Given: a training set {(x_i, y_i) | i = 1…n}. Find: a good approximation to f : X → Y. What is x? What is y?

Simple Example: Digit Recognition. Input: images / pixel grids. Output: a digit 0-9. Setup: get a large collection of example images, each labeled with a digit (note: someone has to hand-label all this data!). We want to learn to predict the labels of new, future digit images. Features: ? (e.g., just use the raw pixels)

Let's take a probabilistic approach!!! Can we directly estimate the data distribution P(X,Y)? How do we represent these distributions, and how many parameters do we need? Prior P(Y): suppose Y is composed of k classes, so k − 1 parameters. Likelihood P(X|Y): suppose X is composed of n binary features, so 2^n − 1 parameters per class.
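A tiny illustration of why estimating the full likelihood directly is hopeless; the class and feature counts below are just example numbers:

```python
# Hypothetical parameter counts: full P(X|Y) vs. the Naive Bayes factorization.
k = 10   # e.g., digit classes 0-9
n = 400  # e.g., a 20x20 grid of binary pixel features

prior_params = k - 1                # P(Y)
full_likelihood = k * (2 ** n - 1)  # P(X|Y) with no independence assumption
naive_bayes_likelihood = k * n      # one Bernoulli per feature per class

print(f"full P(X|Y): {full_likelihood:.3e} parameters")
print(f"Naive Bayes (next slides): {naive_bayes_likelihood} parameters")
```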

Conditional Independence. X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z: P(X | Y, Z) = P(X | Z). E.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning). Equivalent to: P(X, Y | Z) = P(X | Z) P(Y | Z).

Naïve Bayes. Naïve Bayes assumption: features are independent given the class, e.g. P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y). More generally: P(X_1, …, X_n | Y) = Π_i P(X_i | Y).

The Naïve Bayes Classifier. Given: a prior P(Y); n conditionally independent features X_1, …, X_n given the class Y; for each X_i, a likelihood P(X_i | Y). Decision rule: y* = arg max_y P(y) Π_i P(x_i | y). (As a graphical model, Y is the parent of X_1, X_2, …, X_n.)

A Digit Recognizer Input: pixel grids Output: a digit 0-9

Naïve Bayes for Digits (Binary Inputs). Simple version: one feature F_ij for each grid position <i,j>. Possible feature values are on / off, based on whether the intensity is more or less than 0.5 in the underlying image. Each input image maps to a feature vector (one binary value per grid position). Here: lots of features, each binary valued. Naïve Bayes model: P(Y | F) ∝ P(Y) Π_{i,j} P(F_ij | Y). Are the features independent given the class? What do we need to learn?

Example Distributions: tables of P(F_ij = on | Y = 1…9) for a few example grid positions (the values shown on the original slide are not reproduced here).

MLE for the parameters of NB. Given a dataset, let Count(A = a, B = b) be the number of examples where A = a and B = b. The MLE for discrete NB is simply: Prior: P(Y = y) = Count(Y = y) / Σ_{y'} Count(Y = y'). Likelihood: P(X_i = x | Y = y) = Count(X_i = x, Y = y) / Σ_{x'} Count(X_i = x', Y = y).
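A compact sketch turning these counts into a working classifier for binary features, with log-space scoring; the toy dataset is invented for illustration and this is not the lecture's code:

```python
# Hypothetical sketch: discrete Naive Bayes with MLE counts (no smoothing yet).
import math
from collections import Counter, defaultdict

# Toy dataset: (binary feature vector, class label).
data = [([1, 0, 1], "a"), ([1, 1, 1], "a"), ([0, 0, 1], "b"), ([0, 1, 0], "b")]

class_counts = Counter(y for _, y in data)  # Count(Y = y)
feat_counts = defaultdict(Counter)          # Count(X_i = v, Y = y)
for x, y in data:
    for i, v in enumerate(x):
        feat_counts[y][(i, v)] += 1

def predict(x):
    n_total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for y, cy in class_counts.items():
        score = math.log(cy / n_total)       # log prior
        for i, v in enumerate(x):
            p = feat_counts[y][(i, v)] / cy  # MLE likelihood
            score += math.log(p) if p > 0 else -math.inf
        if score > best_score:
            best, best_score = y, score
    return best

print(predict([1, 0, 1]))  # -> "a" for this toy data
```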

Violating the NB assumption. Usually, features are not conditionally independent: P(X_1, …, X_n | Y) ≠ Π_i P(X_i | Y). Nevertheless, NB often performs well even when the assumption is violated; [Domingos & Pazzani '96] discuss some conditions for good performance.

Smoothing. With unsmoothed MLE counts, feature values seen rarely or never with a class get extreme probability estimates, so a single such feature can dominate the decision and one class ends up winning by default ("2 wins!!"). Laplace smoothing adds pseudo-counts to every count to avoid this. Does this happen in vision?
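A hedged one-line change to the counting sketch above (it reuses that sketch's `feat_counts` and `class_counts`) showing Laplace / add-one smoothing; `n_values` is the number of values a feature can take, 2 for binary features:

```python
# Hypothetical add-one (Laplace) smoothing for the likelihood estimate above.
def smoothed_likelihood(y, i, v, alpha=1, n_values=2):
    # (count + alpha) / (class count + alpha * number of feature values)
    return (feat_counts[y][(i, v)] + alpha) / (class_counts[y] + alpha * n_values)
```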

NB & Bag of words model

What about real-valued features? What if we have continuous X_i? E.g., character recognition: X_i is the intensity of the i-th pixel. Gaussian Naïve Bayes (GNB): P(X_i = x | Y = y_k) = (1 / (σ_ik √(2π))) exp(−(x − μ_ik)² / (2σ_ik²)). Sometimes we assume the variance is independent of Y (i.e., σ_i), or independent of X_i (i.e., σ_k), or both (i.e., σ).

Estimating Parameters. Maximum likelihood estimates, where x_i^j is the i-th feature of the j-th training example and δ(x) = 1 if x is true, else 0: Mean: μ̂_ik = Σ_j δ(y^j = y_k) x_i^j / Σ_j δ(y^j = y_k). Variance: σ̂²_ik = Σ_j δ(y^j = y_k) (x_i^j − μ̂_ik)² / Σ_j δ(y^j = y_k).
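A brief numpy sketch of these per-class, per-feature estimates (the arrays are toy values invented for illustration):

```python
# Hypothetical sketch: Gaussian Naive Bayes parameter estimation (MLE).
import numpy as np

X = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.7], [0.1, 0.9]])  # toy features
y = np.array([0, 0, 1, 1])                                      # toy labels

for k in np.unique(y):
    Xk = X[y == k]
    mu_k = Xk.mean(axis=0)  # mu_ik: per-feature mean for class k
    var_k = Xk.var(axis=0)  # sigma^2_ik: per-feature MLE variance (divides by N_k)
    print(f"class {k}: mu={mu_k}, var={var_k}")
```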

What you need to know about Naïve Bayes: the Naïve Bayes classifier (what the assumption is, why we use it, how we learn it); why Bayesian estimation is important; the bag-of-words model; Gaussian NB (features are still conditionally independent, and each feature has a Gaussian distribution given the class); optimal decisions using the Bayes classifier.

Another probabilistic approach!!! Naïve Bayes directly estimates the data distribution P(X,Y), which is challenging due to the size of the distribution, so it makes the Naïve Bayes assumption: we only need P(X_i | Y). But wait, we classify according to max_Y P(Y | X). Why not learn P(Y | X) directly?