Presentation on theme: "Mehdi Ghayoumi MSB rm 132 Ofc hr: Thur, 11-12 a Machine Learning."— Presentation transcript:

1 Mehdi Ghayoumi, MSB rm 132, mghayoum@kent.edu, Office hours: Thur 11-12 am. Machine Learning

2 Announcements: HW1 Feb 11 and HW2 Mar 11. Programming HW1 Feb 25 and Programming HW2 Apr 1. Quiz 1 Mar 11 and Quiz 2 Apr 8. First project report Mar 16 and 18; second project report Apr 13 and 15. Apr 29: exam review. May 4: final exam.

3 Machine Learning Announcements: Project presentation this Monday, Mar 2, from 2 to 4, Room 160 MSB.

4 Machine Learning

5

6 Regularization: ridge regression, Lasso, group Lasso, graph-guided group Lasso, (weighted) total variation, ℓ1-ℓ2 mixed norms, elastic net, fused Lasso, ℓp and ℓ0 regularization, sparse coding, hierarchical sparse coding, nuclear norm, trace norm, K-SVD, basis pursuit denoising, quadratic basis pursuit denoising, sparse portfolio optimization, low-rank approximation, overlapped Schatten 1-norm, latent Schatten 1-norm, tensor completion, tensor denoising, composite penalties, sparse multicomposite problems, sparse PCA, structured sparsity;

7 Machine Learning Optimization: forward-backward splitting, alternating directions method of multipliers (ADMM), alternating proximal algorithms, gradient projection, stochastic generalized gradient, bilevel optimization, non-convex proximal splitting, block coordinate descent algorithm, parallel coordinate descent algorithm, stochastic gradient descent, semidefinite programming, robust iterative hard thresholding, alternating nonnegative least squares, hierarchical alternating least squares, conditional gradient algorithm, Frank-Wolfe algorithm, hybrid conditional gradient-smoothing algorithm;

8 Machine Learning Kernels and support vector machines: support estimation, kernel PCA, subspace learning, reproducing kernel Hilbert spaces, output kernel learning, multi-task learning, support vector machines, least squares support vector machines, Lagrange duality, primal and dual representations, feature map, Mercer kernel, Nyström approximation, Fenchel duality, kernel ridge regression, kernel mean embedding, causality, importance reweighting, domain adaptation, multilayer support vector machine, Gaussian process regression, kernel recursive least-squares, dictionary learning, quantized kernel LMS;

9 Machine Learning

10 SVM is related to statistical learning theory. SVM was first introduced in 1992. SVM became popular because of its success in handwritten digit recognition. SVM is now regarded as an important example of "kernel methods", one of the key areas in machine learning.

11 An SVM is an abstract learning machine which will learn from a training data set and attempt to generalize and make correct predictions on novel data.

12 (Figure: a separating hyperplane with margin ρ; r is the distance from a point to it.) The distance from example x_i to the separator is r = y_i (w^T x_i + b) / ||w||.
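A one-line check of this distance formula in numpy, with a made-up separator and point (none of the numbers come from the slides):

```python
import numpy as np

# Made-up separator and labeled point, purely for illustration.
w, b = np.array([3.0, 4.0]), -1.0
x_i, y_i = np.array([2.0, 1.0]), 1

# r = y_i (w^T x_i + b) / ||w||
r = y_i * (w @ x_i + b) / np.linalg.norm(w)
print(r)   # (3*2 + 4*1 - 1) / 5 = 1.8
```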

13 Let the training set {(x_i, y_i)}, i = 1..n, with x_i ∈ R^d and y_i ∈ {-1, 1}, be separated by a hyperplane with margin ρ. Then for each training example (x_i, y_i): w^T x_i + b ≤ -ρ/2 if y_i = -1, and w^T x_i + b ≥ ρ/2 if y_i = 1, which combine to y_i (w^T x_i + b) ≥ ρ/2.

14 For every support vector x_s the above inequality is an equality. After rescaling w and b by ρ/2 in the equality, we obtain that the distance between each x_s and the hyperplane is r = y_s (w^T x_s + b) / ||w|| = 1 / ||w||. Then the margin can be expressed through the (rescaled) w and b as ρ = 2r = 2 / ||w||.

15 Then we can formulate the quadratic optimization problem: Find w and b such that ρ = 2 / ||w|| is maximized, and for all (x_i, y_i), i = 1..n: y_i (w^T x_i + b) ≥ 1.

16 Which can be reformulated as: Find w and b such that Φ(w) = ||w||^2 = w^T w is minimized, and for all (x_i, y_i), i = 1..n: y_i (w^T x_i + b) ≥ 1.
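A minimal numerical sketch of this hard-margin primal, assuming the cvxpy package and a hypothetical linearly separable toy dataset (X, y, and the variable names are illustration only, not from the slides):

```python
import numpy as np
import cvxpy as cp

# Hypothetical linearly separable toy data (not from the slides).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

w = cp.Variable(d)
b = cp.Variable()

# Minimize Phi(w) = w^T w subject to y_i (w^T x_i + b) >= 1 for all i.
objective = cp.Minimize(cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value, "margin =", 2 / np.linalg.norm(w.value))
```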

17 Need to optimize a quadratic function subject to linear constraints. The solution involves constructing a dual problem where a Lagrange multiplier α_i is associated with every inequality constraint in the primal problem: Find α_1 … α_n such that Q(α) = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j is maximized and (1) Σ_i α_i y_i = 0, (2) α_i ≥ 0 for all i.

18 We will call this the primal formulation: L(w, b, α) = ½ w^T w - Σ_i α_i [y_i (w^T x_i + b) - 1], where the α_i are Lagrange multipliers, and thus α_i ≥ 0. At the minimum, we can take the derivatives with respect to b and w and set these to zero: ∂L/∂b = 0 gives Σ_i α_i y_i = 0, and ∂L/∂w = 0 gives w = Σ_i α_i y_i x_i.

19 Substituting these back into the Lagrangian, we get the dual formulation, also known as the Wolfe dual: Q(α) = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j, which must be maximized with respect to the α_i subject to the constraints Σ_i α_i y_i = 0 and α_i ≥ 0.

20 Given a solution α_1 … α_n to the dual problem, the solution to the primal is w = Σ_i α_i y_i x_i and b = y_k - Σ_i α_i y_i x_i^T x_k for any α_k > 0. Each non-zero α_i indicates that the corresponding x_i is a support vector. Then the classifying function is f(x) = Σ_i α_i y_i x_i^T x + b. Keep in mind that solving the optimization problem involved computing the inner products x_i^T x_j between all training points.
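A minimal sketch of solving this dual numerically and then recovering w, b, and f(x) as above, again assuming cvxpy and the same hypothetical separable toy data as the primal sketch (nothing here is from the slides):

```python
import numpy as np
import cvxpy as cp

# Hypothetical toy data (same as the primal sketch above, not from the slides).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = X.shape[0]

alpha = cp.Variable(n)

# Q(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i^T x_j.
# The double sum equals ||sum_i alpha_i y_i x_i||^2, which keeps the problem convex for cvxpy.
Q = cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y))
constraints = [alpha >= 0, cp.sum(cp.multiply(y, alpha)) == 0]
cp.Problem(cp.Maximize(Q), constraints).solve()

a = alpha.value
w = (a * y) @ X                       # w = sum_i alpha_i y_i x_i
k = int(np.argmax(a))                 # an index with alpha_k > 0 (a support vector)
b = y[k] - (a * y) @ (X @ X[k])       # b = y_k - sum_i alpha_i y_i x_i^T x_k

f = lambda x: (a * y) @ (X @ x) + b   # f(x) = sum_i alpha_i y_i x_i^T x + b
print("w =", w, "b =", b, "f([2, 2]) =", f(np.array([2.0, 2.0])))
```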

21 What if the training set is not linearly separable? Slack variables ξ_i can be added to allow misclassification of difficult or noisy examples; the resulting margin is called a soft margin.

22 The old formulation: Find w and b such that Φ(w) = w^T w is minimized, and for all (x_i, y_i), i = 1..n: y_i (w^T x_i + b) ≥ 1. The modified formulation incorporates slack variables: Find w, b, and ξ_i such that Φ(w) = w^T w + C Σ_i ξ_i is minimized, and for all (x_i, y_i), i = 1..n: y_i (w^T x_i + b) ≥ 1 - ξ_i, with ξ_i ≥ 0. The parameter C can be viewed as a way to control overfitting.
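A minimal sketch of this soft-margin primal, assuming cvxpy, a hypothetical non-separable toy dataset, and an arbitrary choice of C (none of these come from the slides):

```python
import numpy as np
import cvxpy as cp

# Hypothetical non-separable toy data (not from the slides): the point (2.2, 2.2)
# with label -1 lies between two +1 points, so no hyperplane separates the classes.
X = np.array([[2.0, 2.0], [3.0, 3.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, 0.0], [2.2, 2.2]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n, d = X.shape
C = 1.0                                   # trade-off between margin size and total slack

w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(n)

# Minimize w^T w + C * sum_i xi_i subject to y_i (w^T x_i + b) >= 1 - xi_i, xi_i >= 0.
objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value, "total slack =", xi.value.sum())
```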

23 The dual problem is identical to the separable case: Find α_1 … α_n such that Q(α) = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j is maximized and (1) Σ_i α_i y_i = 0, (2) 0 ≤ α_i ≤ C for all i. Again, the x_i with non-zero α_i will be support vectors. The solution to the dual problem is w = Σ_i α_i y_i x_i, b = y_k (1 - ξ_k) - Σ_i α_i y_i x_i^T x_k for any k s.t. α_k > 0, and f(x) = Σ_i α_i y_i x_i^T x + b.
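For comparison, a sketch using scikit-learn's linear-kernel SVC, an off-the-shelf solver for this same soft-margin dual (the toy data and C value are again hypothetical, not from the slides); it exposes the support vectors and the products α_i y_i directly:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data (same as the soft-margin sketch above, not from the slides).
X = np.array([[2.0, 2.0], [3.0, 3.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, 0.0], [2.2, 2.2]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("support vectors:\n", clf.support_vectors_)   # the x_i with non-zero alpha_i
print("alpha_i * y_i:", clf.dual_coef_)             # signed dual coefficients
print("w:", clf.coef_, "b:", clf.intercept_)        # recovered primal solution
print("prediction for [1, 1]:", clf.predict([[1.0, 1.0]]))
```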

24 Vapnik has proved the following: the class of optimal linear separators has VC dimension h bounded as h ≤ min(⌈D²/ρ²⌉, m_0) + 1, where ρ is the margin, D is the diameter of the smallest sphere that can enclose all of the training examples, and m_0 is the dimensionality.
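A small worked example of this bound with made-up numbers (D, ρ, and m_0 below are illustrative values, not from the slides):

```python
import math

# Illustrative values, not from the slides: enclosing-sphere diameter, margin, dimensionality.
D, rho, m0 = 10.0, 2.0, 50

h_bound = min(math.ceil(D**2 / rho**2), m0) + 1
print(h_bound)   # min(ceil(100 / 4), 50) + 1 = 26
```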

25 (Figure: 1-D datasets plotted along an x axis, with the hard case mapped to a second dimension x².) Datasets that are linearly separable with some noise work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space:

26 General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x).
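A tiny illustration of such a feature map, using a hypothetical 1-D dataset and the quadratic map φ(x) = (x, x²); the specific map and data are assumptions for the example, not from the slides:

```python
import numpy as np

# Hypothetical 1-D dataset: class +1 sits between the two groups of class -1,
# so no single threshold on x separates the labels.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, 1, -1, -1])

# Quadratic feature map phi(x) = (x, x^2): in this 2-D space the classes
# are separated by the horizontal line x^2 = 2.5.
phi = np.column_stack([x, x**2])

predictions = np.where(phi[:, 1] < 2.5, 1, -1)
print(predictions)                       # matches y: the mapped data is linearly separable
print(np.array_equal(predictions, y))    # True
```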

27 Thank you!

