
1 Classification III Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer, Rob Pless, Kilian Weinberger, Deva Ramanan

2 Announcements Pick up your midterm from the TAs if you haven’t gotten it yet. Assignment 4 is due today.

3 Discriminant Function A discriminant function can be an arbitrary function of x, such as: nearest neighbor, decision trees, linear functions.

4 Linear classifier Find a linear function to separate the classes: f(x) = sgn(w_1 x_1 + w_2 x_2 + … + w_D x_D) = sgn(w · x)

5 Perceptron [Diagram: inputs x_1, x_2, x_3, …, x_D with weights w_1, w_2, w_3, …, w_D feeding a single output unit.] Output: sgn(w · x + b). Can incorporate the bias as a component of the weight vector by always including a feature with value set to 1.

6 Loose inspiration: Human neurons

7 Perceptron training algorithm Initialize weights. Cycle through training examples in multiple passes (epochs). For each training example: if classified correctly, do nothing; if classified incorrectly, update weights.

8 Perceptron update rule For a misclassified training example (x, y) with label y ∈ {−1, +1}, update w ← w + α y x, where α is the learning rate; a sketch is given below.
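A minimal sketch of slides 7–8 in Python, assuming ±1 labels and a learning rate α; the function name and defaults are illustrative, not from the slides:

```python
import numpy as np

def train_perceptron(X, y, epochs=10, alpha=1.0):
    """Perceptron training. X: (n, D) features, y: (n,) labels in {-1, +1}."""
    X = np.hstack([X, np.ones((len(X), 1))])   # bias as an always-1 feature
    w = np.zeros(X.shape[1])                   # initialize weights to zero
    for _ in range(epochs):                    # multiple passes over the data
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:             # misclassified (or on boundary)
                w += alpha * yi * xi           # update rule: w <- w + a*y*x
    return w
```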

9 Implementation details Bias (add feature dimension with value fixed to 1) vs. no bias Initialization of weights: all zeros vs. random Number of epochs (passes through the training data) Order of cycling through training examples

10 Multi-class perceptrons Keep one weight vector w_c per class and predict the class whose score w_c · x is highest; on a mistake, add x to the weight vector of the true class and subtract it from the weight vector of the predicted class (sketch below).
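A hedged sketch of the standard multi-class update just described; shapes and names are assumptions, not from the slides:

```python
import numpy as np

def train_multiclass_perceptron(X, y, num_classes, epochs=10):
    """X: (n, D) features, y: (n,) integer class labels in [0, num_classes)."""
    W = np.zeros((num_classes, X.shape[1]))    # one weight vector per class
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = np.argmax(W @ xi)           # highest-scoring class wins
            if pred != yi:                     # on a mistake:
                W[yi] += xi                    #   boost the true class
                W[pred] -= xi                  #   penalize the predicted class
    return W
```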

11 Differentiable perceptron [Diagram: inputs x_1, x_2, x_3, …, x_d with weights w_1, w_2, w_3, …, w_d feeding a single unit.] Output: σ(w · x + b), where σ(t) = 1 / (1 + e^(−t)) is the sigmoid function.

12 Update rule for differentiable perceptron Define the total classification error (loss) on the training set: E(w) = Σ_i (y_i − σ(w · x_i))². Update weights by gradient descent: w ← w − η ∇E(w). For a single training point (x, y), the update is: w ← w + η (y − σ(w · x)) σ′(w · x) x.
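A minimal sketch of one gradient step under the squared loss above; the learning rate and function names are illustrative:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sgd_step(w, x, y, eta=0.1):
    """One gradient-descent update for a single point (x, y), y in {0, 1}."""
    s = sigmoid(w @ x)                   # current prediction sigma(w.x)
    grad = -(y - s) * s * (1 - s) * x    # gradient of (y - s)^2 (factor 2 folded into eta)
    return w - eta * grad                # step against the gradient
```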

13 Multi-Layer Neural Network Can learn nonlinear functions. Training: find network weights w to minimize the error between true and estimated labels of training examples, e.g. E(w) = Σ_i (y_i − f(x_i; w))². Minimization can be done by gradient descent provided f is differentiable; this training method is called back-propagation.
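A compact back-propagation sketch for a one-hidden-layer network under the squared loss; the sigmoid activations, shapes, and learning rate are illustrative assumptions:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def backprop_step(W1, W2, x, y, eta=0.1):
    """One gradient step for x -> h = sigmoid(W1 x) -> out = sigmoid(W2 h)."""
    h = sigmoid(W1 @ x)                        # forward pass, hidden layer
    out = sigmoid(W2 @ h)                      # forward pass, output layer
    d_out = -(y - out) * out * (1 - out)       # error signal at the output
    d_h = (W2.T @ d_out) * h * (1 - h)         # error propagated back to hidden
    W2 -= eta * np.outer(d_out, h)             # gradient steps on both layers
    W1 -= eta * np.outer(d_h, x)
    return W1, W2
```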

14 Deep convolutional neural networks Zeiler, M., and Fergus, R. Visualizing and Understanding Convolutional Networks, tech report, 2013. Krizhevsky, A., Sutskever, I., and Hinton, G.E. ImageNet classification with deep convolutional neural networks. NIPS, 2012.

15 Demo from Berkeley http://decaf.berkeleyvision.org/

16 Demo in the browser! https://www.jetpac.com/deepbelief

17 Linear classifier Find a linear function to separate the classes: f(x) = sgn(w_1 x_1 + w_2 x_2 + … + w_D x_D) = sgn(w · x)

18 Linear Discriminant Function f(x) is a linear function: f(x) = w^T x + b. The decision boundary w^T x + b = 0 is a hyperplane in the feature space: points with w^T x + b > 0 are classified as +1, and points with w^T x + b < 0 as −1.

19 Linear Discriminant Function How would you classify these points (labeled +1 and −1 in the x_1–x_2 plane) using a linear discriminant function in order to minimize the error rate? There are an infinite number of answers; slides 19–22 sketch several candidate boundaries. Which one is the best?

23 Large Margin Linear Classifier The linear discriminant function (classifier) with the maximum margin is the best. The margin is defined as the width that the boundary could be increased by before hitting a data point (a “safe zone” around the boundary). Why is it the best? Strong generalization ability. This classifier is the linear SVM.

24 Large Margin Linear Classifier [Figure: decision boundary w^T x + b = 0 with the margin bounded by the hyperplanes w^T x + b = 1 and w^T x + b = −1; the positive points x+ and negative points x− lying on these two hyperplanes are the support vectors.]

25 Support vector machines Find the hyperplane that maximizes the margin between the positive and negative examples. Distance between a point x and the hyperplane: |w^T x + b| / ||w||. For support vectors, |w^T x + b| = 1. Therefore, the margin is 2 / ||w||. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998.
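The margin computation restated as a short worked derivation (standard notation, consistent with the statements above):

```latex
% Distance from a point x to the hyperplane w^T x + b = 0:
d(x) = \frac{|w^\top x + b|}{\lVert w \rVert}
% Support vectors lie on the margin hyperplanes:
|w^\top x_i + b| = 1
% Hence the margin, the distance between the two margin hyperplanes, is:
\text{margin} = \frac{2}{\lVert w \rVert}
```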

26 Finding the maximum margin hyperplane 1. Maximize the margin 2 / ||w||. 2. Correctly classify all training data: y_i (w^T x_i + b) ≥ 1. Together these give a quadratic optimization problem (written out below). C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998.
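The resulting quadratic program, a standard formulation that follows directly from the two conditions above:

```latex
\min_{w,\,b} \ \frac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w^\top x_i + b) \ge 1 \quad \text{for all } i
```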

27 Solving the Optimization Problem The linear discriminant function is: f(x) = Σ_{i ∈ SV} α_i y_i (x_i · x) + b, where the sum runs over the support vectors. Notice it relies on a dot product between the test point x and the support vectors x_i.

28 Linear separability

29 Non-linear SVMs: Feature Space General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x). Slide courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
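As a concrete illustration (not from the slide), the classic quadratic feature map makes a circularly separable 2-D dataset linearly separable in 3-D, and its dot products reduce to a simple kernel:

```latex
\varphi\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right)
= \begin{bmatrix} x_1^2 \\ \sqrt{2}\,x_1 x_2 \\ x_2^2 \end{bmatrix},
\qquad
\varphi(x)^\top \varphi(z) = (x^\top z)^2
```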

30 Nonlinear SVMs: The Kernel Trick With this mapping, our discriminant function becomes: f(x) = Σ_{i ∈ SV} α_i y_i φ(x_i) · φ(x) + b. There is no need to know this mapping explicitly, because we only use the dot product of feature vectors in both training and testing. A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space: K(x_i, x_j) = φ(x_i) · φ(x_j).

31 Nonlinear SVMs: The Kernel Trick Examples of commonly used kernel functions: Linear kernel: K(x_i, x_j) = x_i^T x_j. Polynomial kernel: K(x_i, x_j) = (1 + x_i^T x_j)^p. Gaussian (Radial Basis Function, RBF) kernel: K(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²)). Sigmoid kernel: K(x_i, x_j) = tanh(β₀ x_i^T x_j + β₁).
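A small sketch of these four kernels in Python; the parameter defaults are illustrative, not prescribed by the slide:

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=2):
    return (1 + xi @ xj) ** p

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)
```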

32 Support Vector Machine: Algorithm 1. Choose a kernel function. 2. Choose a value for C and any other parameters (e.g. σ). 3. Solve the quadratic programming problem (many software packages available). 4. Classify held-out validation instances using the learned model. 5. Select the best learned model based on validation accuracy. 6. Classify test instances using the final selected model. A sketch of this pipeline follows.
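One way to realize this recipe with scikit-learn (a sketch; the dataset, parameter grid, and split sizes are assumptions, not from the slide):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trval, y_trval, test_size=0.25, random_state=0)

best_model, best_acc = None, -1.0
for C in [0.1, 1, 10]:                                   # step 2: candidate values for C
    for gamma in [0.01, 0.1, 1]:                         # gamma plays the role of 1/(2*sigma^2)
        model = SVC(kernel="rbf", C=C, gamma=gamma)      # steps 1 and 3: kernel + QP solver
        model.fit(X_train, y_train)
        acc = model.score(X_val, y_val)                  # step 4: classify validation instances
        if acc > best_acc:                               # step 5: keep the best model
            best_model, best_acc = model, acc

print("test accuracy:", best_model.score(X_test, y_test))  # step 6: final test evaluation
```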

33 Some Issues Choice of kernel: the Gaussian or polynomial kernel is the default; if ineffective, more elaborate kernels are needed; domain experts can give assistance in formulating appropriate similarity measures. Choice of kernel parameters (e.g. σ in the Gaussian kernel): in the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters. This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt

34 Summary: Support Vector Machine 1. Large margin classifier: better generalization ability and less over-fitting. 2. The kernel trick: map data points to a higher-dimensional space in order to make them linearly separable; since only the dot product is needed, we do not need to represent the mapping explicitly.

35 SVMs in Computer Vision

36 Detection We slide a window over the image, extract features x for each window, and classify each window as positive or negative: y = F(x), with +1 = positive (object) and −1 = negative. A sketch of this loop follows.
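A minimal sketch of the sliding-window loop, assuming a feature extractor and a trained classifier are supplied (both are hypothetical placeholders, e.g. a HOG extractor and a linear SVM score):

```python
def sliding_window_detect(image, classify, extract_features,
                          window=(64, 128), stride=8):
    """Slide a window over `image` (a 2-D or 3-D numpy array) and return
    the (row, col) positions of windows classified as positive."""
    H, W = image.shape[:2]
    wh, ww = window
    detections = []
    for r in range(0, H - wh + 1, stride):
        for c in range(0, W - ww + 1, stride):
            x = extract_features(image[r:r + wh, c:c + ww])  # features per window
            if classify(x) > 0:                              # pos/neg decision
                detections.append((r, c))
    return detections
```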

37 Sliding Window Detection

38 Representation [image-only slide]


