
1. Lecture notes for Stat 231: Pattern Recognition and Machine Learning. A.L. Yuille. Fall 2004.
Linear Separation and Margins. Non-Separable Data and Slack Variables. Duality and Support Vectors.
Reading: 5.10, A.3, 5.11 of Duda, Hart, Stork; or better, 12.1 and 12.2 of Hastie, Tibshirani, Friedman.

2. Separation by Hyperplanes
Data: {(x_i, y_i)} with labels y_i in {-1, +1}. Hyperplane: w·x + b = 0. Linear classifier: f(x) = sign(w·x + b).
By simple geometry, the signed distance of a point x to the plane is (w·x + b)/||w||: the line through x perpendicular to the plane is x(t) = x - t w/||w||, and it hits the plane when w·x(t) + b = 0, i.e. at t = (w·x + b)/||w||, which implies that the signed distance is (w·x + b)/||w||.
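
A minimal sketch of the signed-distance formula, assuming NumPy; the hyperplane (w, b) and the test point below are made-up values, not from the slides.

```python
import numpy as np

def signed_distance(x, w, b):
    """Signed distance from point x to the hyperplane {z : w.z + b = 0}."""
    return (np.dot(w, x) + b) / np.linalg.norm(w)

w = np.array([3.0, 4.0])   # hypothetical normal vector, ||w|| = 5
b = -5.0                   # hypothetical offset
print(signed_distance(np.array([2.0, 1.0]), w, b))   # (6 + 4 - 5) / 5 = 1.0
```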

3. Margin, Support Vectors
We introduce two new concepts: (I) the margin, and (II) support vectors. These will enable us to understand performance in the non-separable case. Technical methods: quadratic optimization with linear constraints, Lagrange multipliers, and duality. Margins will also be important when studying generalization. Everything in this lecture can be extended beyond hyperplanes (next lecture).

4. Margin for Separable Data
Assume there is a separating hyperplane: y_i (w·x_i + b) > 0 for all data points i. We seek the classifier with the biggest margin M:
max_{w,b} M subject to y_i (w·x_i + b)/||w|| >= M for all i,
which, after fixing the scale so that M ||w|| = 1, is equivalent to minimizing ||w|| subject to y_i (w·x_i + b) >= 1 for all i.
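
A short sketch of how the margin of a candidate hyperplane would be computed on labelled data; the helper name is mine, not from the slides.

```python
import numpy as np

def margin(X, y, w, b):
    """Geometric margin of the hyperplane (w, b) on data X with labels y in {-1, +1}.

    Positive only if (w, b) separates the data; maximizing this over (w, b)
    gives the maximum-margin classifier.
    """
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)
```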

5. Margin: Non-Separable Data
Use the concept of margin to define an optimality criterion for non-separable data. For the data samples, define slack variables z_j >= 0. Seek the hyperplane that maximizes the margin while allowing a limited amount K of misclassified data (slack). One criterion:
minimize (1/2)||w||^2 subject to y_j (w·x_j + b) >= 1 - z_j, z_j >= 0, and sum_j z_j <= K.
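
As a sketch of what the slack variables measure, here is how the z_j for a given hyperplane would be computed under the scaling y_j (w·x_j + b) >= 1 - z_j; again the helper name is illustrative.

```python
import numpy as np

def slacks(X, y, w, b):
    """Slack z_j = max(0, 1 - y_j (w.x_j + b)) for each data point.

    z_j = 0: on the correct side of the margin; 0 < z_j <= 1: inside the
    margin but still correctly classified; z_j > 1: misclassified.
    """
    return np.maximum(0.0, 1.0 - y * (X @ w + b))
```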

6. Margin: Non-Separable Data
If z_j = 0, then data point j is correctly classified by the hyperplane and lies on or outside the margin. If z_j > 0, it is the proportional amount by which the data point is on the wrong side of its margin (z_j > 1 means it is misclassified). From this criterion, the points closest to the hyperplane are the ones that most influence its form (more details later). These are the data points that are hardest to classify; they will become the support vectors. By contrast, data points far from the hyperplane are less important. This differs from probability estimation, where every data point contributes to the estimate.

7. Quadratic Programming
Remove the restriction sum_j z_j <= K and instead penalize the slack in the criterion:
L_p = (1/2)||w||^2 + C sum_j z_j, subject to y_j (w·x_j + b) >= 1 - z_j and z_j >= 0.
This is a quadratic primal problem with linear constraints (unique solution). It can be formulated using Lagrange multipliers alpha_j >= 0 and mu_j >= 0 for the two sets of constraints. Variables: w, b, the z_j, and the multipliers alpha_j, mu_j.
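
This C-penalized primal is what standard software solves; a minimal sketch using scikit-learn's SVC with a linear kernel (the synthetic data below is only for illustration, not from the slides).

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data (for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 1.0, (50, 2)), rng.normal(+1.5, 1.0, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # larger C penalizes slack more heavily
w, b = clf.coef_[0], clf.intercept_[0]        # the learned hyperplane
print(w, b, len(clf.support_))                # and its number of support vectors
```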

8. Quadratic Programming
Extremizing (differentiating) L_p with respect to w, b, and z_j respectively yields:
w = sum_j alpha_j y_j x_j,   0 = sum_j alpha_j y_j,   alpha_j = C - mu_j.
The solution is w = sum_j alpha_j y_j x_j. The solution only depends on the support vectors: the data points with alpha_j > 0.
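
Continuing the scikit-learn sketch above: the fitted model stores alpha_j * y_j for its support vectors in dual_coef_, so the relation w = sum_j alpha_j y_j x_j can be checked directly.

```python
# dual_coef_[0] holds alpha_j * y_j for the support vectors only, so the sum
# over all data reduces to a sum over the support vectors.
w_from_dual = clf.dual_coef_[0] @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_[0]))   # True: same hyperplane
```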

9. Duality
Any quadratic optimization problem L_p with linear constraints can be reformulated as a dual problem L_d. The variables of the dual problem are the Lagrange parameters of the primal problem; in this case, the alpha_j. Linear algebra gives:
L_d = sum_j alpha_j - (1/2) sum_j sum_k alpha_j alpha_k y_j y_k (x_j·x_k),
subject to 0 <= alpha_j <= C and sum_j alpha_j y_j = 0.
Standard packages exist to solve the primal and the (easier) dual problem.
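
A sketch of solving this dual directly with a generic QP package (cvxopt here, which is my choice, not the slides'; any quadratic-programming solver would do). Maximizing L_d is the same as minimizing (1/2) a^T P a - 1^T a with P_jk = y_j y_k (x_j·x_k).

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y, C):
    """Return the dual variables alpha for a linear soft-margin SVM."""
    n = len(y)
    yf = y.astype(float)
    P = matrix(np.outer(yf, yf) * (X @ X.T))            # P_jk = y_j y_k x_j.x_k
    q = matrix(-np.ones(n))                             # maximize sum_j alpha_j
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))      # 0 <= alpha_j <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(yf.reshape(1, -1))                       # sum_j alpha_j y_j = 0
    b = matrix(0.0)
    return np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()
```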

Primal to Dual
To obtain the dual formulation, rewrite the primal as
L_p = (1/2)||w||^2 + C sum_j z_j - sum_j alpha_j [y_j (w·x_j + b) - (1 - z_j)] - sum_j mu_j z_j.
Extremize with respect to w, b, and z_j, and substitute the resulting conditions back into L_p. All terms cancel except
L_d = sum_j alpha_j - (1/2) sum_j sum_k alpha_j alpha_k y_j y_k (x_j·x_k).
Setting alpha_j = C - mu_j, with alpha_j, mu_j >= 0, gives the constraint 0 <= alpha_j <= C.

10. Support Vectors
The form of the solution is w = sum_j alpha_j y_j x_j. Only the points with alpha_j > 0 contribute; these are the Support Vectors. Two types: (1) those on the margin, for which z_j = 0 and 0 < alpha_j < C; (2) those past the margin, for which z_j > 0 and alpha_j = C. These facts follow from the Karush-Kuhn-Tucker conditions.
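
Continuing the sketches above, the two types of support vector can be read off from the dual variables (the tolerance is an arbitrary numerical cutoff).

```python
C = 1.0
alpha = svm_dual(X, y, C)                # from the dual-QP sketch above
tol = 1e-6
support = alpha > tol                    # support vectors: alpha_j > 0
on_margin = support & (alpha < C - tol)  # type (1): z_j = 0, 0 < alpha_j < C
past_margin = alpha >= C - tol           # type (2): z_j > 0, alpha_j = C
```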

11. Karush-Kuhn-Tucker
KKT conditions (complementary slackness between Lagrange multipliers and constraints):
alpha_j [y_j (w·x_j + b) - (1 - z_j)] = 0,   mu_j z_j = 0,   y_j (w·x_j + b) - (1 - z_j) >= 0.
Use any margin point (0 < alpha_j < C, hence z_j = 0) to solve for b: b = y_j - w·x_j.
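
A sketch of using a margin point to recover b, continuing the variables above: any point with 0 < alpha_j < C satisfies y_j (w·x_j + b) = 1 exactly.

```python
w = (alpha * y) @ X                      # w = sum_j alpha_j y_j x_j
j = np.flatnonzero(on_margin)[0]         # pick any margin support vector
b = y[j] - X[j] @ w                      # from y_j (w.x_j + b) = 1 with y_j = +/-1
```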

12. Perceptron and Margins
The Perceptron rule can be re-interpreted in terms of the margin and given a formulation in dual space.
Perceptron convergence: the critical quantity for the convergence of the Perceptron is R/m, where R is the radius of the smallest ball containing the data and m is the margin of a separating hyperplane. Define the ratio R/m. Then the number of Perceptron errors in one sweep is bounded above by (R/m)^2.
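
A sketch of computing this bound, reusing the margin() helper from earlier; here I take R as the radius of the smallest origin-centred ball containing the data (a simplifying assumption), and the bound only applies when the data is linearly separable.

```python
R = np.max(np.linalg.norm(X, axis=1))    # radius of origin-centred ball holding the data
m = margin(X, y, w, b)                   # margin of a candidate separating hyperplane
if m > 0:                                # the bound only makes sense for separable data
    print("Perceptron mistake bound:", (R / m) ** 2)
```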

13. Perceptron in Dual Space
The Perceptron learning algorithm works by adding misclassified data examples to the weights. Set the initial weights to zero. The weight hypothesis will then always be of the form w = sum_j alpha_j y_j x_j, where alpha_j counts how many times example j has been misclassified.
Perceptron rule in dual space: update rule for the alpha_i. If data point i is misclassified, i.e. y_i sum_j alpha_j y_j (x_j·x_i) <= 0, then set alpha_i -> alpha_i + 1.
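
A minimal sketch of the Perceptron rule in dual space, assuming labels in {-1, +1}; the bias term is omitted for brevity.

```python
import numpy as np

def dual_perceptron(X, y, sweeps=10):
    """Dual-space Perceptron: alpha_j counts the mistakes made on example j.

    The implicit weight vector is w = sum_j alpha_j y_j x_j, so the classifier
    only ever touches the data through inner products x_j.x_i.
    """
    alpha = np.zeros(len(y))
    for _ in range(sweeps):
        for i in range(len(y)):
            if y[i] * np.sum(alpha * y * (X @ X[i])) <= 0:   # misclassified
                alpha[i] += 1.0
    return alpha   # recover w as (alpha * y) @ X
```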

14. Summary
Linear separability and margins.
Slack variables z_j: formulate the criterion for the non-separable case.
Quadratic optimization with linear constraints: primal problem L_p and dual problem L_d (standard techniques for solution).
Dual Perceptron rule (separable case only).
Solution of the form w = sum_j alpha_j y_j x_j; the dual variables alpha_j determine the support vectors.
Support vectors: the hard-to-classify data (no analog in probability estimation).

