CS685 : Special Topics in Data Mining, UKY The UNIVERSITY of KENTUCKY Classification - SVM CS 685: Special Topics in Data Mining Jinze Liu.

Presentation transcript:

CS685 : Special Topics in Data Mining, UKY The UNIVERSITY of KENTUCKY Classification - SVM CS 685: Special Topics in Data Mining Jinze Liu

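CS685: Special Topics in Data Mining, UKY. Copyright © 2001, 2003, Andrew W. Moore
Linear Classifiers: an input $x$ is passed through $f$ to give an estimated label $y^{est}$, with $f(x, w, b) = \operatorname{sign}(w \cdot x - b)$. In the figure, one marker denotes class +1 and the other denotes class -1. How would you classify this data?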

Linear Classifiers: with $f(x, w, b) = \operatorname{sign}(w \cdot x - b)$, any of these separating lines would be fine... but which is best?
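
The decision rule on this slide can be sketched in a few lines of plain Python. This is only an illustration: the helper names `dot` and `linear_classify`, the tie-break at the boundary, and the example weights are assumptions, not part of the slides.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def linear_classify(x, w, b):
    """Return +1 or -1 according to sign(w . x - b), the slide's convention."""
    # Points exactly on the boundary are sent to +1 here (an arbitrary tie-break).
    return 1 if dot(w, x) - b >= 0 else -1


# Hypothetical weights and points, chosen only for illustration.
w, b = [1.0, 1.0], 1.0
print(linear_classify([2.0, 2.0], w, b))  # w . x - b = 3, positive side
print(linear_classify([0.0, 0.0], w, b))  # w . x - b = -1, negative side
```

Many different `(w, b)` pairs classify the same training points correctly, which is exactly why the slide asks which one is best.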

Classifier Margin: define the margin of a linear classifier as the width by which the boundary could be widened before hitting a datapoint.

Maximum Margin: the maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM). Support vectors are those datapoints that the margin pushes up against.

Why Maximum Margin?
1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
3. LOOCV (leave-one-out cross-validation) is easy, since the model is immune to removal of any non-support-vector datapoints.
4. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very, very well.

Specifying a line and margin: how do we represent this mathematically, and in $m$ input dimensions? [Figure: the plus-plane, the classifier boundary, and the minus-plane, separating the "Predict Class = +1" zone from the "Predict Class = -1" zone.]

Specifying a line and margin: Plus-plane = $\{x : w \cdot x + b = +1\}$; Minus-plane = $\{x : w \cdot x + b = -1\}$. Classify as +1 if $w \cdot x + b \ge 1$, as -1 if $w \cdot x + b \le -1$, and "universe explodes" if $-1 < w \cdot x + b < 1$. [Figure: the parallel lines $w \cdot x + b = 1$ (plus-plane), $w \cdot x + b = 0$ (classifier boundary), and $w \cdot x + b = -1$ (minus-plane).]
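
As a small illustration of the three cases above, the sketch below returns +1, -1, or `None` for a point that falls strictly between the two planes. The function name `zone` and the example numbers are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def zone(x, w, b):
    """Return +1 or -1 for the two prediction zones, or None inside the margin."""
    s = dot(w, x) + b
    if s >= 1:
        return 1       # on or beyond the plus-plane
    if s <= -1:
        return -1      # on or beyond the minus-plane
    return None        # strictly between the planes ("universe explodes")


# Hypothetical boundary x1 = 0, with planes at x1 = +1 and x1 = -1.
w, b = [1.0, 0.0], 0.0
print(zone([2.0, 0.0], w, b))   # plus zone
print(zone([-3.0, 0.0], w, b))  # minus zone
print(zone([0.5, 0.0], w, b))   # inside the margin
```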

Computing the margin width: Plus-plane = $\{x : w \cdot x + b = +1\}$; Minus-plane = $\{x : w \cdot x + b = -1\}$; $M$ = margin width. How do we compute $M$ in terms of $w$ and $b$? Claim: the vector $w$ is perpendicular to the plus-plane. Why? Let $u$ and $v$ be two vectors on the plus-plane. What is $w \cdot (u - v)$? Since $w \cdot u + b = 1$ and $w \cdot v + b = 1$, subtracting gives $w \cdot (u - v) = 0$, so $w$ is perpendicular to every direction lying in the plane. And so of course the vector $w$ is also perpendicular to the minus-plane.

Computing the margin width: the vector $w$ is perpendicular to the plus-plane. Let $x^{-}$ be any point on the minus-plane, and let $x^{+}$ be the closest plus-plane point to $x^{-}$. Both can be any location in $\mathbb{R}^m$, not necessarily a datapoint.

Computing the margin width: claim: $x^{+} = x^{-} + \lambda w$ for some value of $\lambda$. Why? The line from $x^{-}$ to $x^{+}$ is perpendicular to the planes, so to get from $x^{-}$ to $x^{+}$ we travel some distance in direction $w$.

Computing the margin width: what we know: $w \cdot x^{+} + b = +1$; $w \cdot x^{-} + b = -1$; $x^{+} = x^{-} + \lambda w$; $|x^{+} - x^{-}| = M$. It's now easy to get $M$ in terms of $w$ and $b$: $w \cdot (x^{-} + \lambda w) + b = 1 \Rightarrow w \cdot x^{-} + b + \lambda\, w \cdot w = 1 \Rightarrow -1 + \lambda\, w \cdot w = 1 \Rightarrow \lambda = \frac{2}{w \cdot w}$.

Computing the margin width: with $x^{+} = x^{-} + \lambda w$ and $\lambda = \frac{2}{w \cdot w}$, the margin width is $M = |x^{+} - x^{-}| = |\lambda w| = \lambda |w| = \lambda \sqrt{w \cdot w} = \frac{2 \sqrt{w \cdot w}}{w \cdot w} = \frac{2}{\sqrt{w \cdot w}}$.
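
The derivation can be checked numerically. The sketch below picks a hypothetical `w`, `b`, and a point on the minus-plane, steps by $\lambda w$ with $\lambda = 2 / (w \cdot w)$, and confirms that the result lands on the plus-plane at distance $2 / \sqrt{w \cdot w}$. All names and numbers are assumptions made for the example.

```python
import math


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def margin_width(w):
    """Margin width M = 2 / sqrt(w . w)."""
    return 2.0 / math.sqrt(dot(w, w))


# Hypothetical weight vector and offset.
w, b = [3.0, 4.0], 1.0

# A point on the minus-plane: w . x + b = -1.
x_minus = [-2.0 / 3.0, 0.0]

# Step along w by lambda = 2 / (w . w) to reach the plus-plane.
lam = 2.0 / dot(w, w)
x_plus = [xm + lam * wi for xm, wi in zip(x_minus, w)]

# Euclidean distance between the two points.
step = math.sqrt(sum((p - m) ** 2 for p, m in zip(x_plus, x_minus)))

print(dot(w, x_plus) + b)      # approximately +1: x_plus is on the plus-plane
print(step, margin_width(w))   # both approximately 0.4
```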

Learning the Maximum Margin Classifier: $M$ = margin width = $\frac{2}{\sqrt{w \cdot w}}$. Given a guess of $w$ and $b$ we can compute whether all data points are in the correct half-planes, and compute the width of the margin. So now we just need to write a program to search the space of $w$'s and $b$'s to find the widest margin that matches all the datapoints. How? Gradient descent? Simulated annealing? Matrix inversion? EM? Newton's method?

Learning via Quadratic Programming: quadratic programming (QP) is a well-studied class of optimization problems, with dedicated algorithms for solving them: maximize a quadratic function of some real-valued variables subject to linear constraints.

Quadratic Programming: find the argument that optimizes a quadratic criterion, subject to $n$ additional linear inequality constraints and $e$ additional linear equality constraints. [The equations on this slide were not preserved in the transcript.] There exist algorithms for finding such constrained quadratic optima much more efficiently and reliably than gradient ascent. (But they are very fiddly... you probably don't want to write one yourself.)
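
For orientation, a generic quadratic program of the kind described here is commonly written as follows. This is a standard textbook form, not a recovery of the slide's own equations; the symbols $u$, $c$, $d$, $R$, $a_i$, and $b_i$ are assumptions introduced for this sketch.

```latex
\begin{aligned}
\max_{u \in \mathbb{R}^{m}} \quad & c + d^{\top} u + \tfrac{1}{2}\, u^{\top} R\, u
  && \text{(quadratic criterion)} \\
\text{subject to} \quad & a_i^{\top} u \le b_i, \qquad i = 1, \dots, n
  && \text{($n$ linear inequality constraints)} \\
& a_j^{\top} u = b_j, \qquad j = n+1, \dots, n+e
  && \text{($e$ linear equality constraints)}
\end{aligned}
```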

Learning the Maximum Margin Classifier: $M = \frac{2}{\sqrt{w \cdot w}}$. Given a guess of $w$, $b$ we can compute whether all data points are in the correct half-planes, and compute the margin width. Assume $R$ datapoints, each $(x_k, y_k)$ where $y_k = \pm 1$. What should our quadratic optimization criterion be? Minimize $w \cdot w$. How many constraints will we have? $R$. What should they be? $w \cdot x_k + b \ge 1$ if $y_k = 1$; $w \cdot x_k + b \le -1$ if $y_k = -1$.
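
The sketch below does not solve the QP. It only evaluates the criterion and the $R$ constraints for a few hand-picked candidates, which is the "given a guess of w, b" step on this slide. The dataset, the candidate values, and the function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def is_feasible(w, b, data):
    """True when every (x_k, y_k) satisfies y_k * (w . x_k + b) >= 1.

    This single test covers both slide constraints:
    w . x_k + b >= 1 for y_k = +1 and w . x_k + b <= -1 for y_k = -1.
    """
    return all(y * (dot(w, x) + b) >= 1 for x, y in data)


def criterion(w):
    """The quantity to minimize, w . w (a smaller value means a wider margin)."""
    return dot(w, w)


# Hypothetical, linearly separable toy data.
data = [([2.0, 2.0], 1), ([3.0, 3.0], 1), ([0.0, 0.0], -1), ([-1.0, 0.0], -1)]

candidates = [
    ([0.5, 0.5], -1.0),    # feasible, w . w = 0.5
    ([1.0, 1.0], -2.0),    # feasible, w . w = 2.0
    ([0.25, 0.25], -0.5),  # violates the constraint for the point (2, 2)
]

feasible = [(w, b) for w, b in candidates if is_feasible(w, b, data)]
best_w, best_b = min(feasible, key=lambda wb: criterion(wb[0]))
print(best_w, best_b)
```

A real solver searches over all `(w, b)` rather than a short list, but the feasibility test and the criterion are the same.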

Uh-oh! [Figure: datapoints labeled +1 and -1.] This is going to be a problem! What should we do? Idea 1: find minimum $w \cdot w$, while minimizing the number of training set errors. Problemette: two things to minimize makes for an ill-defined optimization.

Uh-oh! This is going to be a problem! What should we do? Idea 1.1: minimize $w \cdot w + C \cdot (\#\text{train errors})$, where $C$ is a tradeoff parameter. There's a serious practical problem that's about to make us reject this approach. Can you guess what it is? It can't be expressed as a quadratic programming problem, so solving it may be too slow. (Also, it doesn't distinguish between disastrous errors and near misses.) So... any other ideas?

Uh-oh! This is going to be a problem! What should we do? Idea 2.0: minimize $w \cdot w + C \cdot (\text{distance of error points to their correct place})$.

Learning Maximum Margin with Noise: $M = \frac{2}{\sqrt{w \cdot w}}$. Given a guess of $w$, $b$ we can compute the sum of distances of points to their correct zones, and compute the margin width. Assume $R$ datapoints, each $(x_k, y_k)$ where $y_k = \pm 1$. [Figure: points on the wrong side of their plane, at slack distances such as $\varepsilon_2$, $\varepsilon_7$, $\varepsilon_{11}$.] What should our quadratic optimization criterion be? Minimize $w \cdot w + C \sum_{k=1}^{R} \varepsilon_k$. How many constraints will we have? $R$. What should they be? $w \cdot x_k + b \ge 1 - \varepsilon_k$ if $y_k = 1$; $w \cdot x_k + b \le -1 + \varepsilon_k$ if $y_k = -1$. Our original (noiseless data) QP had $m + 1$ variables: $w_1, w_2, \dots, w_m$, and $b$. Our new (noisy data) QP has $m + 1 + R$ variables: $w_1, w_2, \dots, w_m$, $b$, $\varepsilon_1, \dots, \varepsilon_R$. Here $m$ = number of input dimensions and $R$ = number of records.

Learning Maximum Margin with Noise: the QP so far is to minimize $w \cdot w + C \sum_{k=1}^{R} \varepsilon_k$ subject to $R$ constraints: $w \cdot x_k + b \ge 1 - \varepsilon_k$ if $y_k = 1$, and $w \cdot x_k + b \le -1 + \varepsilon_k$ if $y_k = -1$. There's a bug in this QP. Can you spot it?

Learning Maximum Margin with Noise: $M = \frac{2}{\sqrt{w \cdot w}}$. Given a guess of $w$, $b$ we can compute the sum of distances of points to their correct zones, and compute the margin width. Assume $R$ datapoints, each $(x_k, y_k)$ where $y_k = \pm 1$. What should our quadratic optimization criterion be? Minimize $w \cdot w + C \sum_{k=1}^{R} \varepsilon_k$. How many constraints will we have? $2R$. What should they be? $w \cdot x_k + b \ge 1 - \varepsilon_k$ if $y_k = 1$; $w \cdot x_k + b \le -1 + \varepsilon_k$ if $y_k = -1$; and $\varepsilon_k \ge 0$ for all $k$.
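
For a fixed guess of `w` and `b`, the smallest slack that satisfies the constraints above is $\varepsilon_k = \max(0,\ 1 - y_k (w \cdot x_k + b))$. The sketch below evaluates that slack and a criterion of the form $w \cdot w + C \sum_k \varepsilon_k$. The dataset, the value of `C`, and the function names are hypothetical, and nothing is being optimized here.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def slack(w, b, x, y):
    """Smallest epsilon >= 0 with y * (w . x + b) >= 1 - epsilon."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))


def soft_margin_criterion(w, b, data, C):
    """w . w + C * (sum of slacks), evaluated for one guess of (w, b)."""
    return dot(w, w) + C * sum(slack(w, b, x, y) for x, y in data)


# Hypothetical data: the last +1 point sits on the wrong side of the boundary.
data = [
    ([2.0, 2.0], 1),
    ([3.0, 3.0], 1),
    ([0.0, 0.0], -1),
    ([-1.0, 0.0], -1),
    ([0.5, 0.5], 1),   # noisy point
]
w, b, C = [0.5, 0.5], -1.0, 1.0

print([slack(w, b, x, y) for x, y in data])   # only the noisy point has slack
print(soft_margin_criterion(w, b, data, C))
```

Points on the correct side of their plane contribute zero slack, so only the violators are charged, in proportion to how far they are from their correct zone.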

An Equivalent QP: maximize a quadratic function of multipliers $\alpha_k$, one per datapoint, subject to linear constraints; then define $w$ and $b$ from the solution as a sum over datapoints; then classify with $f(x, w, b) = \operatorname{sign}(w \cdot x - b)$. [The equations on this slide were not preserved in the transcript.] Datapoints with $\alpha_k > 0$ will be the support vectors, so this sum only needs to be over the support vectors. Why did I tell you about this equivalent QP? It's a formulation that QP packages can optimize more quickly. And because of further jaw-dropping developments you're about to learn.
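
For reference, the standard textbook dual of the soft-margin problem is usually written as below. This is a sketch under stated assumptions rather than a recovery of the slide: it assumes the primal objective $\tfrac{1}{2}\, w \cdot w + C \sum_k \varepsilon_k$ and a classifier of the form $\operatorname{sign}(w \cdot x + b)$, so the sign of $b$ is flipped relative to the $\operatorname{sign}(w \cdot x - b)$ convention, and the symbol $Q_{kl}$ is introduced here for convenience.

```latex
\begin{aligned}
\max_{\alpha} \quad & \sum_{k=1}^{R} \alpha_k
  - \tfrac{1}{2} \sum_{k=1}^{R} \sum_{l=1}^{R} \alpha_k \alpha_l Q_{kl},
  \qquad Q_{kl} = y_k y_l \,(x_k \cdot x_l) \\
\text{subject to} \quad & 0 \le \alpha_k \le C \ \ \text{for all } k,
  \qquad \sum_{k=1}^{R} \alpha_k y_k = 0 \\
\text{then define} \quad & w = \sum_{k=1}^{R} \alpha_k y_k x_k,
  \qquad b = y_K - w \cdot x_K \ \ \text{for any } K \text{ with } 0 < \alpha_K < C
\end{aligned}
```

Only datapoints with $\alpha_k > 0$ contribute to the sum defining $w$, which is the sense in which the sum runs over the support vectors alone.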

Estimate the Margin: what is the distance expression for a point $x$ to the line $w \cdot x + b = 0$?

Estimate the Margin: what is the expression for the margin of the boundary $w \cdot x + b = 0$?
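
The slide's own answers were not preserved, but the distance from a point $x$ to the hyperplane $w \cdot x + b = 0$ is the standard $|w \cdot x + b| / \sqrt{w \cdot w}$. The sketch below computes it and, as an assumption about what "margin" means on this slide, takes the margin of a boundary to be the smallest such distance over the dataset. Names and numbers are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def distance_to_boundary(x, w, b):
    """Perpendicular distance from x to the hyperplane w . x + b = 0."""
    return abs(dot(w, x) + b) / math.sqrt(dot(w, w))


def dataset_margin(data, w, b):
    """Smallest distance from any datapoint to the boundary (assumed definition)."""
    return min(distance_to_boundary(x, w, b) for x, _ in data)


# Hypothetical boundary and two labeled points.
w, b = [3.0, 4.0], -5.0
data = [([3.0, 4.0], 1), ([0.0, 0.0], -1)]

print(distance_to_boundary([3.0, 4.0], w, b))  # |9 + 16 - 5| / 5
print(dataset_margin(data, w, b))              # the closer point decides
```

Maximizing this minimum distance over `w` and `b` is the max-min structure that the following slides refer to.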

Maximize Margin: choose the boundary $w \cdot x + b = 0$ with the largest margin. [The equation on this slide was not preserved in the transcript.] Min-max problem → game problem.

Maximize Margin: strategy: [the equations on this slide were not preserved in the transcript.]

Suppose we're in 1-dimension: what would SVMs do with this data? [Figure: datapoints on a line around $x = 0$.]

Suppose we're in 1-dimension: not a big surprise. [Figure: the positive "plane" and the negative "plane" are marked on the line around $x = 0$.]

Harder 1-dimensional dataset: that's wiped the smirk off SVM's face. What can be done about this? [Figure: datapoints on a line around $x = 0$.]

Harder 1-dimensional dataset: remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too.

Common SVM basis functions: $z_k$ = (polynomial terms of $x_k$ of degree 1 to $q$); $z_k$ = (radial basis functions of $x_k$); $z_k$ = (sigmoid functions of $x_k$).
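
To make the basis-function idea concrete, the sketch below uses a hypothetical 1-dimensional dataset in which one class sits on both sides of the other. No single threshold on $x$ separates it, but after mapping each point to the degree-2 polynomial terms $z = (x, x^2)$ a linear rule does. The data, the hand-picked weights, and the function names are all assumptions for illustration.

```python
def separable_1d(points):
    """points: list of (x, label) with distinct x values.

    A 1-D linear classifier is a threshold, so the labels read in order of
    increasing x may change at most once.
    """
    labels = [y for _, y in sorted(points)]
    changes = sum(1 for a, c in zip(labels, labels[1:]) if a != c)
    return changes <= 1


def lift(x):
    """Polynomial basis of degree 1 to 2: z = (x, x^2)."""
    return (x, x * x)


def classify_lifted(x, w, b):
    """Linear rule sign(w . z + b) applied in the lifted space."""
    z = lift(x)
    return 1 if w[0] * z[0] + w[1] * z[1] + b >= 0 else -1


# Hypothetical "harder" dataset: +1 on the outside, -1 in the middle.
points = [(-3, 1), (-2, 1), (-1, -1), (0, -1), (1, -1), (2, 1), (3, 1)]

print(separable_1d(points))                  # no threshold on x works
w, b = (0.0, 1.0), -2.5                      # hand-picked: x^2 >= 2.5 means +1
print(all(classify_lifted(x, w, b) == y for x, y in points))
```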

SVM Performance: anecdotally they work very, very well indeed. Example: they are currently the best-known classifier on a well-studied handwritten-character recognition benchmark. There has been a lot of excitement and religious fervor about SVMs. Despite this, some practitioners (including your lecturer) are a little skeptical.

Doing multi-class classification: SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2). What can be done? Answer: with output arity N, learn N SVMs.
- SVM 1 learns "Output == 1" vs "Output != 1"
- SVM 2 learns "Output == 2" vs "Output != 2"
- ...
- SVM N learns "Output == N" vs "Output != N"
Then to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
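
The prediction step of this one-vs-rest scheme can be sketched as follows. The sketch assumes each of the N binary SVMs has already been trained and is summarized by a linear pair `(w, b)` scored as `w . x + b`; the three models below are made-up numbers, and "furthest into the positive region" is taken to mean the largest raw score.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def one_vs_rest_predict(x, models):
    """models maps class label -> (w, b); return the label with the largest score."""
    return max(models, key=lambda label: dot(models[label][0], x) + models[label][1])


# Hypothetical, already-trained binary models for classes 1, 2, and 3.
models = {
    1: ([1.0, 0.0], 0.0),    # "Output == 1" vs "Output != 1"
    2: ([0.0, 1.0], 0.0),    # "Output == 2" vs "Output != 2"
    3: ([-1.0, -1.0], 0.0),  # "Output == 3" vs "Output != 3"
}

print(one_vs_rest_predict([2.0, 1.0], models))
print(one_vs_rest_predict([0.0, 3.0], models))
print(one_vs_rest_predict([-2.0, -2.0], models))
```

Comparing raw scores across separately trained SVMs is a simplification, since the scores are not calibrated against each other.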

SVM Related Links
C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2), 1998.
SVM light: software (in C).
Book: N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press.

Classification - CBA. CS 685: Special Topics in Data Mining, Spring 2008. Jinze Liu, The University of Kentucky.

Association Rules: itemset $X = \{x_1, \dots, x_k\}$. Find all the rules $X \rightarrow Y$ with minimum support and confidence. Support, $s$, is the probability that a transaction contains $X \cup Y$. Confidence, $c$, is the conditional probability that a transaction having $X$ also contains $Y$. Let $sup_{min}$ = 50% and $conf_{min}$ = 50%. Association rules: a → c (60%, 100%); c → a (60%, 75%). [Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both.]
Transaction-id | Items bought
100 | f, a, c, d, g, i, m, p
200 | a, b, c, f, l, m, o
300 | b, f, h, j, o
400 | b, c, k, s, p
500 | a, f, c, e, l, p, m, n
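
The support and confidence figures quoted for the rules between items a and c can be recomputed directly from the five transactions in the table. The function names below are hypothetical; the data is the table as given.

```python
# The five transactions from the table, keyed by transaction id.
transactions = {
    100: {"f", "a", "c", "d", "g", "i", "m", "p"},
    200: {"a", "b", "c", "f", "l", "m", "o"},
    300: {"b", "f", "h", "j", "o"},
    400: {"b", "c", "k", "s", "p"},
    500: {"a", "f", "c", "e", "l", "p", "m", "n"},
}


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)


def support(lhs, rhs):
    """Fraction of transactions containing all items of lhs and rhs."""
    return count(lhs | rhs) / len(transactions)


def confidence(lhs, rhs):
    """Fraction of the transactions containing lhs that also contain rhs."""
    return count(lhs | rhs) / count(lhs)


print(support({"a"}, {"c"}), confidence({"a"}, {"c"}))  # rule a -> c
print(support({"c"}, {"a"}), confidence({"c"}, {"a"}))  # rule c -> a
```

Item a occurs in three transactions and c in four, and all three transactions with a also contain c, which gives 60% support for both rules, 100% confidence for a → c, and 75% for c → a.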

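Classification based on Association: classification rule mining versus association rule mining.
Aim: classification rule mining seeks a small set of rules to use as a classifier; association rule mining seeks all rules that satisfy minsup and minconf.
Syntax: classification rules have the form $X \rightarrow y$ (a single class label on the right); association rules have the form $X \rightarrow Y$.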

Why & How to Integrate: both classification rule mining and association rule mining are indispensable to practical applications. The integration is done by focusing on a special subset of association rules whose right-hand sides are restricted to the classification class attribute. These are the CARs: class association rules.

CBA: Three Steps
1. Discretize continuous attributes, if any.
2. Generate all class association rules (CARs).
3. Build a classifier based on the generated CARs.

Our Objectives: to generate the complete set of CARs that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf) constraints, and to build a classifier from the CARs.

Rule Generator: Basic Concepts
Ruleitem <condset, y>: condset is a set of items and y is a class label. Each ruleitem represents a rule: condset -> y.
condsupCount: the number of cases in D that contain the condset.
rulesupCount: the number of cases in D that contain the condset and are labeled with class y.
Support = (rulesupCount / |D|) * 100%
Confidence = (rulesupCount / condsupCount) * 100%

RG: Basic Concepts (Cont.)
Frequent ruleitem: a ruleitem is frequent if its support is above minsup.
Accurate rule: a rule is accurate if its confidence is above minconf.
Possible rule: for all ruleitems that have the same condset, the ruleitem with the highest confidence is the possible rule (PR) of this set of ruleitems.
The set of class association rules (CARs) consists of all the possible rules (PRs) that are both frequent and accurate.

RG: An Example. A ruleitem: <{(A,1), (B,1)}, (class,1)>. Assume that the support count of the condset (condsupCount) is 3, the support count of this ruleitem (rulesupCount) is 2, and |D| = 10. Then for the rule (A,1), (B,1) -> (class,1): support = (rulesupCount / |D|) * 100% = 20%, and confidence = (rulesupCount / condsupCount) * 100% = 66.7%.
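
The arithmetic in this example follows directly from the two definitions. A minimal sketch, with hypothetical function names:

```python
def rule_support(rulesup_count, n_cases):
    """Support in percent: (rulesupCount / |D|) * 100."""
    return 100.0 * rulesup_count / n_cases


def rule_confidence(rulesup_count, condsup_count):
    """Confidence in percent: (rulesupCount / condsupCount) * 100."""
    return 100.0 * rulesup_count / condsup_count


# Values from the example: rulesupCount = 2, condsupCount = 3, |D| = 10.
print(round(rule_support(2, 10), 1))     # 20.0
print(round(rule_confidence(2, 3), 1))   # 66.7
```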

RG: The Algorithm
    // count the item and class occurrences to determine the frequent 1-ruleitems, then prune
    F_1 = {large 1-ruleitems};
    CAR_1 = genRules(F_1);
    prCAR_1 = pruneRules(CAR_1);
    for (k = 2; F_{k-1} ≠ Ø; k++) do
        // generate the candidate ruleitems C_k using the frequent ruleitems F_{k-1}
        C_k = candidateGen(F_{k-1});
        for each data case d ∈ D do            // scan the database
            // find all the ruleitems in C_k whose condsets are supported by d
            C_d = ruleSubset(C_k, d);
            for each candidate c ∈ C_d do
                // update the support counts of the candidates in C_k
                c.condsupCount++;
                if d.class = c.class then c.rulesupCount++;
            end
        end

RG: The Algorithm (cont.)
        // select the new frequent ruleitems to form F_k
        F_k = {c ∈ C_k | c.rulesupCount ≥ minsup};
        // select the ruleitems that are both accurate and frequent
        CAR_k = genRules(F_k);
        prCAR_k = pruneRules(CAR_k);
    end
    CARs = ∪_k CAR_k;
    prCARs = ∪_k prCAR_k;
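
The level-wise loop above can be sketched in Python as follows. This is a simplified illustration, not the CBA implementation: candidates are tracked as condsets rather than full ruleitems, `pruneRules` is omitted, minsup and minconf are fractions rather than counts, and the toy cases and the name `generate_cars` are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def generate_cars(cases, minsup, minconf):
    """cases: list of (frozenset of items, class label).

    Returns {condset: (label, support, confidence)} holding, for each condset,
    the highest-confidence rule that is both frequent and accurate.
    """
    n = len(cases)
    cars = {}
    # Level 1 candidates: every single-item condset seen in the data.
    candidates = {frozenset([item]) for items, _ in cases for item in items}
    k = 1
    while candidates:
        cond_count = defaultdict(int)   # condsupCount per condset
        rule_count = defaultdict(int)   # rulesupCount per (condset, class)
        for items, label in cases:      # one database scan per level
            for cond in candidates:
                if cond <= items:
                    cond_count[cond] += 1
                    rule_count[(cond, label)] += 1
        frequent = set()
        for (cond, label), rc in rule_count.items():
            if rc / n >= minsup:        # frequent ruleitem
                frequent.add(cond)
                conf = rc / cond_count[cond]
                best = cars.get(cond)
                # Keep only the highest-confidence rule per condset.
                if conf >= minconf and (best is None or conf > best[2]):
                    cars[cond] = (label, rc / n, conf)
        # Candidate generation: join frequent condsets into condsets of size k + 1.
        k += 1
        candidates = {a | c for a, c in combinations(frequent, 2) if len(a | c) == k}
    return cars


# Hypothetical toy cases over two attributes A and B.
cases = [
    (frozenset({("A", 1), ("B", 1)}), 1),
    (frozenset({("A", 1), ("B", 1)}), 1),
    (frozenset({("A", 1), ("B", 0)}), 0),
    (frozenset({("A", 0), ("B", 1)}), 0),
    (frozenset({("A", 0), ("B", 0)}), 0),
]
cars = generate_cars(cases, minsup=0.4, minconf=0.6)
for cond, (label, sup, conf) in cars.items():
    print(sorted(cond), "->", label, sup, conf)
```

Each pass over the data counts condsupCount and rulesupCount for the current candidates, mirroring the inner loops of the pseudocode, and the loop stops once no frequent condsets remain to join.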