CSCI 347 / CS 4206: Data Mining Module 07: Implementations Topic 03: Linear Models.


Extending Linear Classification ● Linear classifiers can’t model nonlinear class boundaries ● Simple trick: map the attributes into a new space consisting of combinations of attribute values ● E.g.: all products of n factors that can be constructed from the attributes ● Example with two attributes a1, a2 and n = 3: x = w1 a1³ + w2 a1² a2 + w3 a1 a2² + w4 a2³ (a small sketch of this expansion follows below)
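A minimal sketch of the expansion step, assuming plain Python (illustrative code, not part of the original slides): it enumerates all products of exactly n factors that can be built from a set of attribute values.

```python
# Illustrative sketch: build all products of exactly n factors from the attributes.
from itertools import combinations_with_replacement
from math import prod

def expand(values, n):
    """Return all products of exactly n factors drawn (with repetition) from values."""
    return [prod(combo) for combo in combinations_with_replacement(values, n)]

# Two attributes, n = 3: the four terms a1^3, a1^2*a2, a1*a2^2, a2^3
print(expand([2.0, 3.0], 3))   # -> [8.0, 12.0, 18.0, 27.0]
```

A linear model learned on these expanded values corresponds to a nonlinear (here cubic) boundary in the original two-attribute space.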

Problems with this Approach ● 1st problem: speed ● 10 attributes and n = 5 ⇒ more than 2000 coefficients (the count is worked out below) ● Use linear regression with attribute selection ● Run time is cubic in the number of attributes ● 2nd problem: overfitting ● Number of coefficients is large relative to the number of training instances ● Curse of dimensionality kicks in
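Where the “more than 2000 coefficients” figure comes from: the number of distinct products of exactly 5 factors chosen, with repetition, from 10 attributes is

```latex
\binom{10 + 5 - 1}{5} = \binom{14}{5} = 2002 .
```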

Support Vector Machines ● Support vector machines are algorithms for learning linear classifiers ● Resilient to overfitting because they learn a particular linear decision boundary: the maximum margin hyperplane ● Fast in the nonlinear case too ● They use a mathematical trick to avoid creating “pseudo-attributes” ● The nonlinear space is created implicitly

The Maximum Margin Hyperplane ● The instances closest to the maximum margin hyperplane are called support vectors
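In the usual SVM formulation (standard notation, not necessarily the slide's), the maximum margin hyperplane can be written in terms of the support vectors alone:

```latex
x = b + \sum_{i \in \mathrm{SV}} \alpha_i \, y_i \, \mathbf{a}(i) \cdot \mathbf{a}
```

where the sum runs over the support vectors a(i), y_i is the class value (+1 or −1), α_i ≥ 0 and b are numeric parameters found by the learning algorithm, and a is the instance being classified.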

Support Vectors ● The support vectors define the maximum margin hyperplane ● All other instances can be deleted without changing its position and orientation

Nonlinear SVMs ● Overfitting is not a problem because the maximum margin hyperplane is stable ● There are usually few support vectors relative to the size of the training set ● Computation time is still an issue ● Each time the dot product is computed, all the “pseudo attributes” must be included

A Mathematical Trick ● Avoid computing the “pseudo attributes” ● Compute the dot product before doing the nonlinear mapping ● Example: x = b + Σ over support vectors of αᵢ yᵢ (a(i)·a)ⁿ ● This corresponds to a map into the instance space spanned by all products of n attributes (a small sanity check follows below)
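A quick sanity check of the trick, using toy numbers of my own (not from the slides): for the degree-2 polynomial kernel, taking the dot product first and then squaring gives the same value as explicitly mapping into the space of all degree-2 products and taking the dot product there.

```python
# Illustrative check: kernel value (x.y)^2 equals the dot product in the expanded space.
import numpy as np

def phi(v):
    """Explicit map of a 2-D vector onto all products of exactly 2 factors."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

kernel_value = np.dot(x, y) ** 2       # dot product first, then the power
mapped_value = np.dot(phi(x), phi(y))  # map first, then the dot product

print(kernel_value, mapped_value)      # both print 16.0
```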

Other Kernel Functions ● The mapping is called a “kernel function” ● Polynomial kernel: K(x, y) = (x·y)ⁿ ● We can use others ● Only requirement: K must be expressible as a dot product K(x, y) = φ(x)·φ(y) in some feature space ● Examples: see the standard kernels listed below
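Two widely used kernels besides the polynomial one are the radial basis function (Gaussian) kernel and the sigmoid kernel; these are standard forms given for reference, not necessarily the exact examples shown on the original slide:

```latex
K_{\mathrm{RBF}}(\mathbf{x}, \mathbf{y}) = \exp\!\left(-\frac{\lVert \mathbf{x} - \mathbf{y} \rVert^{2}}{2\sigma^{2}}\right),
\qquad
K_{\mathrm{sig}}(\mathbf{x}, \mathbf{y}) = \tanh\!\left(\kappa\, \mathbf{x} \cdot \mathbf{y} + c\right)
```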

Noise ● Have assumed that the data is separable (in the original or transformed space) ● Can apply SVMs to noisy data by introducing a “noise” parameter C ● C bounds the influence of any one training instance on the decision boundary ● Corresponding constraint: 0 ≤ αᵢ ≤ C ● Still a quadratic optimization problem ● Have to determine C by experimentation
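A minimal sketch of fitting a soft-margin SVM with the noise parameter C, assuming scikit-learn is available (the library, data, and parameter values are illustrative choices, not part of the slides):

```python
# Illustrative only: C controls how much any one training instance can influence
# the boundary; smaller C tolerates more noise, larger C fits the data harder.
from sklearn.svm import SVC

X = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]   # tiny toy data set
y = [0, 0, 1, 1]

clf = SVC(kernel="poly", degree=3, C=1.0)
clf.fit(X, y)
print(clf.support_)   # indices of the training instances kept as support vectors
```

In practice C is chosen by experimentation, e.g. by cross-validation over a grid of values.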

Sparse Data ● SVM algorithms speed up dramatically if the data is sparse (i.e. many values are 0) ● Why? Because they compute lots and lots of dot products ● Sparse data ⇒ dot products can be computed very efficiently ● Iterate only over non-zero values ● SVMs can process sparse datasets with 10,000s of attributes
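A sketch of the sparse-data speed-up, assuming vectors stored as dictionaries of their non-zero entries (my own illustration, not from the slides):

```python
def sparse_dot(a, b):
    """Dot product of two vectors stored as {index: value} dicts of non-zero entries."""
    if len(a) > len(b):          # iterate over the vector with fewer non-zeros
        a, b = b, a
    return sum(value * b.get(index, 0.0) for index, value in a.items())

# Vectors with tens of thousands of attributes, almost all of them zero:
x = {2: 1.5, 40: -0.5, 9001: 2.0}
y = {2: 4.0, 17: 3.0, 9001: 1.0}
print(sparse_dot(x, y))          # 1.5*4.0 + 2.0*1.0 = 8.0
```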

Applications ● Machine vision: e.g. face identification ● Outperforms alternative approaches (1.5% error) ● Handwritten digit recognition: USPS data ● Comparable to best alternative (0.8% error) ● Bioinformatics: e.g. prediction of protein secondary structure ● Text classification ● The SVM technique can be modified for numeric prediction problems

Support Vector Regression ● Maximum margin hyperplane only applies to classification ● However, the idea of support vectors and kernel functions can be used for regression ● Basic method is the same as in linear regression: want to minimize error ● Difference A: ignore errors smaller than ε and use absolute error instead of squared error ● Difference B: simultaneously aim to maximize the flatness of the function ● User-specified parameter ε defines a “tube”
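The “ignore errors smaller than ε” idea is usually written as the ε-insensitive loss (standard formulation, not copied from the slide):

```latex
L_{\varepsilon}\bigl(y, f(\mathbf{x})\bigr) =
\begin{cases}
0 & \text{if } \lvert y - f(\mathbf{x}) \rvert \le \varepsilon \\
\lvert y - f(\mathbf{x}) \rvert - \varepsilon & \text{otherwise}
\end{cases}
```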

More on SVM Regression ● If there are tubes that enclose all the training points, the flattest of them is used ● E.g.: the mean is used if 2ε > range of target values ● Model can be written as x = b + Σ over support vectors of αᵢ (a(i)·a) ● Support vectors: points on or outside the tube ● The dot product can be replaced by a kernel function ● Note: coefficients αᵢ may be negative ● No tube that encloses all training points? ● Requires a trade-off between error and flatness ● Controlled by upper limit C on the absolute value of the coefficients αᵢ

Examples ● (figure: regression fits for ε = 2, ε = 1, and ε = 0.5)

The Kernel Perceptron ● Can use the “kernel trick” to make a non-linear classifier using the perceptron rule ● Observation: the weight vector is modified by adding or subtracting training instances, so predictions can be expressed through dot products with those instances (a compact sketch follows below)
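A compact kernel perceptron sketch (my own illustration of the idea, not code from the slides): instead of keeping an explicit weight vector, keep a count αⱼ of how often each training instance was added or subtracted on a mistake, and classify by summing kernel evaluations against those instances.

```python
import numpy as np

def poly_kernel(x, z, n=2):
    """Degree-n polynomial kernel, (x . z)^n."""
    return np.dot(x, z) ** n

def train_kernel_perceptron(X, y, kernel=poly_kernel, epochs=10):
    """X: array of instances; y: class values in {-1, +1}."""
    alpha = np.zeros(len(X))                      # per-instance update counts
    for _ in range(epochs):
        for i in range(len(X)):
            score = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(len(X)))
            if y[i] * score <= 0:                 # mistake: "add or subtract" X[i]
                alpha[i] += 1
    return alpha

def predict(x, X, y, alpha, kernel=poly_kernel):
    score = sum(alpha[j] * y[j] * kernel(X[j], x) for j in range(len(X)))
    return 1 if score > 0 else -1
```

Because the instances enter only through kernel evaluations, swapping in a different kernel gives a non-linear decision boundary without ever constructing the “pseudo attributes”.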

Comments on Kernel Perceptron ● Finds a separating hyperplane in the space created by the kernel function (if one exists) ● But: doesn't find the maximum-margin hyperplane ● Easy to implement, supports incremental learning ● Linear and logistic regression can also be upgraded using the kernel trick ● But: the solution is not “sparse”: every training instance contributes to the solution ● The perceptron can be made more stable by using all weight vectors encountered during learning, not just the last one (voted perceptron) ● Weight vectors vote on the prediction (each vote weighted by the number of successful classifications since inception)

Multilayer Perceptrons ● Using kernels is only one way to build a nonlinear classifier based on perceptrons ● Can create a network of perceptrons to approximate arbitrary target concepts ● A multilayer perceptron is an example of an artificial neural network ● Consists of: input layer, hidden layer(s), and output layer ● The structure of the MLP is usually found by experimentation ● Parameters can be found using backpropagation

Backpropagation ● How to learn the weights given the network structure? ● Cannot simply use the perceptron learning rule because we have hidden layer(s) ● Function we are trying to minimize: error ● Can use a general function minimization technique called gradient descent ● Need a differentiable activation function: use the sigmoid function instead of the threshold function ● Need a differentiable error function: can't use zero-one loss, but can use squared error (a bare-bones sketch follows below)
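A bare-bones backpropagation sketch, assuming NumPy, one hidden layer, sigmoid activations, and squared error (an illustration of the procedure, not the slides' own notation; the data set and learning rate are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR inputs
t = np.array([[0], [1], [1], [0]], dtype=float)               # XOR targets

W1, b1 = rng.normal(scale=0.5, size=(2, 4)), np.zeros(4)      # input -> hidden
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros(1)      # hidden -> output
eta = 0.5                                                     # learning rate

for _ in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    o = sigmoid(h @ W2 + b2)
    # backward pass: squared-error gradient propagated through the sigmoids
    delta_o = (o - t) * o * (1 - o)
    delta_h = (delta_o @ W2.T) * h * (1 - h)
    # gradient-descent updates
    W2 -= eta * h.T @ delta_o;  b2 -= eta * delta_o.sum(axis=0)
    W1 -= eta * X.T @ delta_h;  b1 -= eta * delta_h.sum(axis=0)

print(np.round(o, 2))   # typically close to [[0], [1], [1], [0]] after training
```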

The Two Activation Functions ● The step (threshold) function used by the perceptron vs. the sigmoid function used for backpropagation (figure)

Gradient Descent Example ● Function: x² + 1 ● Derivative: 2x ● Learning rate: 0.1 ● Start value: 4 ● Can only find a local minimum!
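The slide's example written out as a few lines of Python (the number of steps shown is an arbitrary choice):

```python
# Gradient descent on f(x) = x^2 + 1 with derivative 2x, learning rate 0.1, start x = 4.
def f(x):
    return x * x + 1

x, rate = 4.0, 0.1
for step in range(1, 6):
    x -= rate * (2 * x)                 # move against the gradient
    print(step, round(x, 4), round(f(x), 4))
# x: 3.2, 2.56, 2.048, 1.6384, 1.3107 ... shrinking toward the minimum at x = 0
```

On this convex function the local minimum is also the global one; in general, gradient descent can get stuck in a local minimum.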

Remarks ● Same process works for multiple hidden layers and multiple output units (e.g. for multiple classes) ● Can update weights after all training instances have been processed or incrementally: batch learning vs. stochastic backpropagation ● Weights are initialized to small random values ● How to avoid overfitting? ● Early stopping: use a validation set to check when to stop ● Weight decay: add a penalty term to the error function ● How to speed up learning? ● Momentum: re-use a proportion of the old weight change ● Use an optimization method that employs the 2nd derivative
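Standard formulations of the momentum and weight-decay tricks named above, given here for reference (the notation is mine, not the slide's):

```latex
\underbrace{\Delta w^{(t)} = -\eta \,\frac{\partial E}{\partial w} + \mu\, \Delta w^{(t-1)}}_{\text{momentum}}
\qquad\qquad
\underbrace{E' = E + \frac{\lambda}{2} \sum_{i} w_i^{2}}_{\text{weight decay}}
```

where η is the learning rate, μ the momentum coefficient, and λ the weight-decay penalty.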

The Mystery Noise ● What is that?!?