Crash Course on Machine Learning

Presentation on theme: "Crash Course on Machine Learning"— Presentation transcript:

1 Crash Course on Machine Learning
Several slides adapted from Luke Zettlemoyer, Carlos Guestrin, and Ben Taskar.

2 Typical Paradigms of Recognition
Feature Computation → Model

3 Visual Recognition: Identification (Classification). Is this your car?

4 Visual Recognition: Verification (Classification). Is this a car?

5 Visual Recognition: Classification. Is there a car in this picture?

6 Visual Recognition: Detection (Classification, Structure Learning). Where is the car in this picture?

7 Visual Recognition: Activity Recognition (Classification). What is he doing?

8 Visual Recognition: Pose Estimation (Regression, Structure Learning).

9 Visual Recognition: Object Categorization (Structure Learning). Sky, Person, Tree, Horse, Car, Bicycle, Road.

10 Visual Recognition: Segmentation (Classification, Structure Learning). Sky, Tree, Car, Person.

11 What kind of problems?
Classification: generative vs. discriminative; supervised, unsupervised, semi-supervised, or weakly supervised; linear or nonlinear; ensemble methods; probabilistic.
Regression: linear regression; structured output regression.
Structure Learning: graphical models; margin-based approaches.

15 Let's play with probability for a bit: remembering the simple stuff.

16 Thumbtack & Probabilities
P(Heads) = θ, P(Tails) = 1 − θ. Flips are i.i.d.: independent events, identically distributed according to a binomial distribution. For a sequence D with H heads and T tails, D = {x_i | i = 1…n} and P(D | θ) = Π_i P(x_i | θ).

17 Maximum Likelihood Estimation
Data: observed set D with H heads and T tails. Hypothesis: binomial distribution. Learning: finding θ is an optimization problem. What's the objective function? MLE: choose θ to maximize the probability of D.

18 Parameter learning: set the derivative of the log-likelihood to zero and solve (see the worked steps below).
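For concreteness, here is a minimal sketch of the derivation the slide leaves to the board, using H and T for the head and tail counts:

```latex
\log P(D \mid \theta) = H \log\theta + T \log(1-\theta)
\qquad
\frac{\partial}{\partial\theta}\log P(D \mid \theta)
  = \frac{H}{\theta} - \frac{T}{1-\theta} = 0
\;\;\Rightarrow\;\;
\hat{\theta}_{\mathrm{MLE}} = \frac{H}{H+T}
```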

19 But, how many flips do I need?
3 heads and 2 tails: θ = 3/5, I can prove it! What if I flipped 30 heads and 20 tails? Same answer, I can prove it! What's better? Umm… the more the merrier???

20 A bound (from Hoeffding's inequality)
For N = H + T, let θ* be the true parameter. For any ε > 0, the probability of a mistake larger than ε decays exponentially in N. Hoeffding's inequality provides a bound on the probability that a sum of random variables deviates from its expectation.

21 What if I have prior beliefs?
Wait, I know that the thumbtack is "close" to a particular value. What can you do for me now? Rather than estimating a single θ, we obtain a distribution over possible values of θ: a prior in the beginning, updated after observations. Observe flips, e.g. {tails, tails}.

22 How to use the prior
Use Bayes rule: the posterior is proportional to the data likelihood times the prior, up to a normalization term. Also, for uniform priors the prior term drops out and MAP reduces to the MLE objective.
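In symbols, this is the standard Bayes-rule statement the slide describes:

```latex
P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}
\;\;\propto\;\; P(D \mid \theta)\,P(\theta)
```

With a uniform prior P(θ), maximizing the posterior is the same as maximizing the likelihood, which is why MAP reduces to MLE in that case.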

23 Beta prior distribution – P(θ)
Likelihood function and posterior:
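A standard way to write the Beta prior, Bernoulli likelihood, and resulting posterior (using β_H and β_T for the prior hyperparameters, our notation):

```latex
P(\theta) = \frac{\theta^{\beta_H - 1}(1-\theta)^{\beta_T - 1}}{B(\beta_H, \beta_T)},
\qquad
P(D \mid \theta) = \theta^{H}(1-\theta)^{T}
\qquad
P(\theta \mid D) \;\propto\; \theta^{H+\beta_H-1}(1-\theta)^{T+\beta_T-1}
\;\;\Rightarrow\;\;
\theta \mid D \sim \mathrm{Beta}(H+\beta_H,\; T+\beta_T)
```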

24 MAP for Beta distribution
MAP: use the most likely parameter, i.e. the mode of the posterior:
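Under the Beta posterior above, the mode gives the usual closed form (same β_H, β_T notation as before):

```latex
\hat{\theta}_{\mathrm{MAP}}
= \arg\max_{\theta} P(\theta \mid D)
= \frac{H + \beta_H - 1}{H + T + \beta_H + \beta_T - 2}
```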

25 What about continuous variables?

26 We like Gaussians because…
An affine transformation (multiplying by a scalar and adding a constant) of a Gaussian is Gaussian: X ~ N(μ, σ²) and Y = aX + b imply Y ~ N(aμ + b, a²σ²). A sum of independent Gaussians is Gaussian: X ~ N(μ_X, σ²_X), Y ~ N(μ_Y, σ²_Y), and Z = X + Y imply Z ~ N(μ_X + μ_Y, σ²_X + σ²_Y). They are easy to differentiate.

27 Learning a Gaussian
Collect a bunch of data, hopefully i.i.d. samples, e.g. exam scores x_i such as 85, 95, 100, …, 89. Learn the parameters: mean μ and variance σ².

28 MLE for Gaussian
Probability of i.i.d. samples D = {x_1, …, x_N}, and the log-likelihood of the data (written out below):
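In standard form, these are:

```latex
P(D \mid \mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sigma\sqrt{2\pi}}
  \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right),
\qquad
\ln P(D \mid \mu, \sigma) = -N \ln\!\left(\sigma\sqrt{2\pi}\right)
  - \sum_{i=1}^{N} \frac{(x_i - \mu)^2}{2\sigma^2}
```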

29 MLE for mean of a Gaussian
What’s MLE for mean?

30 MLE for variance Again, set derivative to zero:

31 Learning Gaussian parameters
MLE:
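The resulting estimates, obtained by setting the derivatives from the previous two slides to zero:

```latex
\hat{\mu}_{\mathrm{MLE}} = \frac{1}{N}\sum_{i=1}^{N} x_i,
\qquad
\hat{\sigma}^{2}_{\mathrm{MLE}} = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \hat{\mu}\right)^{2}
```

Note that the MLE for the variance is biased; the unbiased estimate divides by N − 1 instead of N.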

32 MAP: conjugate priors
Prior for the mean: Gaussian. Prior for the variance: Wishart distribution.

33 Supervised Learning: find f
Given: a training set {(x_i, y_i) | i = 1…n}. Find: a good approximation to f : X → Y. What is x? What is y?

34 Simple Example: Digit Recognition
Input: images / pixel grids. Output: a digit 0–9. Setup: get a large collection of example images, each labeled with a digit (note: someone has to hand-label all this data!). We want to learn to predict the labels of new, future digit images. Features: ? "Screw you, I want to use pixels :D" ??

35 Let's take a probabilistic approach!!!
Can we directly estimate the data distribution P(X,Y)? How do we represent these distributions? How many parameters? Prior P(Y): suppose Y is composed of k classes. Likelihood P(X|Y): suppose X is composed of n binary features.
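A quick count, filling in the answer the slide asks for: the prior needs k − 1 numbers, and the full likelihood table needs, for each class, one probability per joint setting of the n binary features:

```latex
\underbrace{k - 1}_{P(Y)} \;+\; \underbrace{k\,(2^{n} - 1)}_{P(X \mid Y)}
\;\text{ parameters}
```

For even a modest binary image (say n = 900 pixels) this is astronomically large, which is what motivates the Naïve Bayes assumption on the following slides.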

36 Conditional Independence
X is conditionally independent of Y given Z if the probability distribution of X is independent of the value of Y, given the value of Z; e.g., P(X | Y, Z) = P(X | Z). Equivalent to: P(X, Y | Z) = P(X | Z) P(Y | Z).

37 Naïve Bayes
Naïve Bayes assumption: features are independent given the class, e.g. P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y). More generally: P(X_1, …, X_n | Y) = Π_i P(X_i | Y).

38 The Naïve Bayes Classifier
Given: a prior P(Y); n features X_1, …, X_n that are conditionally independent given the class Y; and, for each X_i, a likelihood P(X_i | Y). Decision rule: pick the class that maximizes the posterior (graphical model: Y with children X_1, X_2, …, X_n).
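Written out, the decision rule is the usual MAP rule under the Naïve Bayes factorization:

```latex
y^{*} = \arg\max_{y}\; P(Y = y)\,\prod_{i=1}^{n} P(X_i = x_i \mid Y = y)
```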

39 A Digit Recognizer Input: pixel grids Output: a digit 0-9

40 Naïve Bayes for Digits (Binary Inputs)
Simple version: one feature F_ij for each grid position <i,j>. Possible feature values are on / off, based on whether the intensity is more or less than 0.5 in the underlying image. Each input maps to a feature vector. Here we have lots of features, each binary-valued. Naïve Bayes model: are the features independent given the class? What do we need to learn?

41 Example Distributions
Tables (shown on the slide) of the prior P(Y = digit), roughly 0.1 per digit, and of the conditional probabilities P(F_ij = on | Y = digit) for a couple of example pixel features, with values ranging from about 0.01 to 0.90 across the digits.

42 MLE for the parameters of NB
Given a dataset, let Count(A = a, B = b) be the number of examples where A = a and B = b. The MLE for discrete NB is simply counting. Prior: P(Y = y) = Count(Y = y) / N. Likelihood: P(X_i = x | Y = y) = Count(X_i = x, Y = y) / Count(Y = y).
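As a concrete illustration, here is a minimal sketch of a binary-feature Naïve Bayes trained by counting, in the spirit of this slide; the class name, array layout, and the alpha smoothing knob (anticipating the smoothing slide below) are our own assumptions, not the course's code:

```python
import numpy as np

class BernoulliNaiveBayes:
    """Minimal sketch: discrete NB with binary features, trained by counting."""

    def __init__(self, alpha=1.0):
        # alpha = 0 gives the pure counting MLE from this slide;
        # alpha > 0 adds the (Laplace-style) smoothing discussed two slides later.
        self.alpha = alpha

    def fit(self, X, y):
        # X: (num_examples, num_features) array of 0/1 features, y: (num_examples,) labels
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        n_classes, n_features = len(self.classes_), X.shape[1]
        self.log_prior_ = np.zeros(n_classes)
        self.log_on_ = np.zeros((n_classes, n_features))   # log P(X_i = 1 | Y)
        self.log_off_ = np.zeros((n_classes, n_features))  # log P(X_i = 0 | Y)
        for k, c in enumerate(self.classes_):
            Xc = X[y == c]
            # Prior: P(Y = c) = Count(Y = c) / N
            self.log_prior_[k] = np.log(len(Xc) / len(X))
            # Likelihood: P(X_i = 1 | Y = c) = Count(X_i = 1, Y = c) / Count(Y = c)
            p_on = (Xc.sum(axis=0) + self.alpha) / (len(Xc) + 2 * self.alpha)
            self.log_on_[k] = np.log(p_on)
            self.log_off_[k] = np.log(1.0 - p_on)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Decision rule: argmax_y  log P(y) + sum_i log P(x_i | y)
        scores = (self.log_prior_
                  + X @ self.log_on_.T
                  + (1.0 - X) @ self.log_off_.T)
        return self.classes_[np.argmax(scores, axis=1)]
```

Usage would then be something like clf = BernoulliNaiveBayes().fit(X, y) on binarized pixel features, followed by clf.predict(X_new); working in log space avoids underflow when multiplying many small per-pixel probabilities.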

43 Violating the NB assumption
Usually features are not conditionally independent: P(X_1, …, X_n | Y) ≠ Π_i P(X_i | Y). Nevertheless, NB often performs well even when the assumption is violated; [Domingos & Pazzani '96] discuss some conditions for good performance.

44 Smoothing 2 wins!! Does this happen in vision?
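The slide contrasts two estimators; presumably the winner is a Laplace-style (add-one) smoothed estimate rather than the raw counting MLE. For reference, the standard add-one form for a binary feature is:

```latex
\hat{P}(X_i = x \mid Y = y)
= \frac{\mathrm{Count}(X_i = x,\, Y = y) + 1}{\mathrm{Count}(Y = y) + 2}
```

This is what the alpha = 1 setting in the sketch above implements, and it avoids assigning zero probability to feature values never seen for a class.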

45 NB & Bag of words model

46 What about real features? What if we have continuous X_i?
E.g., character recognition: X_i is the i-th pixel. Gaussian Naïve Bayes (GNB): model each P(X_i | Y) as a Gaussian. Sometimes we assume the variance is independent of Y (i.e., σ_i), or independent of X_i (i.e., σ_k), or both (i.e., σ).
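The GNB class-conditional likelihood, in its usual form (one mean and variance per feature i and class k):

```latex
P(X_i = x \mid Y = y_k)
= \frac{1}{\sigma_{ik}\sqrt{2\pi}}
  \exp\!\left(-\frac{(x - \mu_{ik})^{2}}{2\sigma_{ik}^{2}}\right)
```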

47 Estimating Parameters
Maximum likelihood estimates for the mean and variance, where the superscript j denotes the j-th training example and δ(x) = 1 if x is true, else 0:
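In the standard per-class, per-feature form:

```latex
\hat{\mu}_{ik} = \frac{\sum_{j}\delta\!\left(y^{j} = y_k\right)\, x_i^{j}}
                      {\sum_{j}\delta\!\left(y^{j} = y_k\right)},
\qquad
\hat{\sigma}_{ik}^{2} = \frac{\sum_{j}\delta\!\left(y^{j} = y_k\right)\left(x_i^{j} - \hat{\mu}_{ik}\right)^{2}}
                             {\sum_{j}\delta\!\left(y^{j} = y_k\right)}
```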

48 What you need to know about Naïve Bayes
The Naïve Bayes classifier: what the assumption is, why we use it, how we learn it, and why Bayesian estimation is important. The bag-of-words model. Gaussian NB: features are still conditionally independent, but each feature has a Gaussian distribution given the class. Optimal decision using the Bayes classifier.

49 Another probabilistic approach!!!
Naïve Bayes directly estimates the data distribution P(X,Y), which is challenging due to the size of the distribution, so we make the Naïve Bayes assumption and only need P(X_i | Y). But wait, we classify according to max_Y P(Y|X). Why not learn P(Y|X) directly?

