1 Tópicos Especiais em Aprendizagem
Reinaldo Bianchi Centro Universitário da FEI 2010

2 Lecture 4, Part A

3 Objectives of this lecture - Present two more Statistical Machine Learning techniques: - SVM and kernel methods. - K-means and RANSAC. Today's material: - Chapters 12 and 13 of Hastie. - Wikipedia.

4 Support Vector Machines
Support vector machines (máquinas de vetores de suporte)

5 Introduction Vapnik’s philosophy, 1990’s
“If you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve a more general one as an intermediate step. It is possible that the available information is sufficient for a direct solution but is insufficient for solving a more general intermediate problem.”

6 Introduction Vapnik’s philosophy, 1990’s
“In contrast to classical methods of statistics where in order to control performance one decreases the dimensionality of a feature space, the Support Vector Machine (SVM) dramatically increases dimensionality and relies on the so-called large margin factor.”

7 Introduction What is SVM?
SVM is primarily a two-class* classifier that maximises the width of the margin between classes, that is, the empty area around the decision boundary defined by the distance to the nearest training patterns. * It is feasible to extend the system to more than 2 classes

8 For example If {xi} is a set of points in n-dimensional space with corresponding classes {yi : yi ∈ {–1, +1}}, then the training algorithm attempts to place a hyperplane between the points where yi = +1 and the points where yi = –1. Once this has been achieved, a new pattern x can be classified by testing which side of the hyperplane it lies on.
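A minimal MATLAB sketch of this decision rule (the variable names w, b and x below are hypothetical, not from the slides):
% Hypothetical sketch: classify a new pattern by the side of the hyperplane it falls on.
w = [2 -1];                   % weight vector defining the hyperplane (w . x) + b = 0
b = 0.5;                      % bias term
x = [0.3 0.8];                % new pattern to classify
y_hat = sign(dot(w, x) + b);  % returns +1 or -1, i.e. the predicted class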

9 Geometric Idea (figure: the optimal separating hyperplane and its margin)

10 Linear SVM
Let's start with the simplest case: linear machines trained on separable data. Suppose the training data (x1,y1), …, (xN,yN), xi ∈ Rn, yi ∈ {–1, +1}, can be separated by a hyperplane (w . x) + b = 0. We say that this set is separated by the optimal hyperplane if it is separated without error and the distance from the closest point to the hyperplane is maximal.

11 The Separating Hyperplane
What would happen if all the points not on these two boundary hyperplanes (H1, H2) were removed?

12 The Separating Hyperplane

13 The Separating Hyperplane (cont.)
To describe the separating hyperplane, let us suppose that all the training data satisfy the following constraints: (w . xi) + b ≥ +1 for yi = +1, and (w . xi) + b ≤ –1 for yi = –1, which can be combined into a single set of inequalities: yi((w . xi) + b) – 1 ≥ 0 for i = 1, …, N.

14 The Separating Hyperplane (cont.)
Now consider the points for which the equality in the previous equations holds. These points lie either on H1: (w . xi) + b = 1 or H2: (w . xi) + b = – 1 with normal w and perpendicular distance from the origin respectively |1 – b|/||w|| and | – 1 – b|/||w||. Therefore the shortest distance from the separating hyperplane to the closest positive (negative) example is 1/||w|| and the margin is simply 2/||w||.
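As a quick check of the statement above (standard geometry, restated here rather than transcribed from the slide), the distance from a point x to the hyperplane (w . x) + b = 0 and the resulting margin are:
\[
d(x) = \frac{|(w \cdot x) + b|}{\lVert w \rVert},
\qquad
d_{+} = d_{-} = \frac{1}{\lVert w \rVert} \ \text{(for points on } H_1, H_2\text{)},
\qquad
\text{margin} = d_{+} + d_{-} = \frac{2}{\lVert w \rVert}.
\]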

15 The Separating Hyperplane (cont.)
Thus we can find the pair of hyperplanes which gives the maximum margin by minimising ||w||, subject to the constraints that H1 and H2 are parallel and that no training points fall between them. The separating hyperplane, parallel to these two, is then found by choosing the value of b that places it midway between them.
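Combining the last two slides, the primal optimisation problem can be restated in its usual form (the factor 1/2 and the squared norm are the standard convention and do not change the minimiser):
\[
\min_{w,\, b} \; \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\bigl((w \cdot x_i) + b\bigr) - 1 \ge 0, \qquad i = 1, \dots, N.
\]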

16 The Separating Hyperplane (cont.)
Those training points which lie on one of the hyperplanes (H1,H2) and whose removal would change the solution found are called support vectors.

17 Example

18 Summary

19 Lagrange Formulation There are two reasons for switching to a Lagrange formulation of the problem. First, the constraints will be replaced by constraints on the Lagrange multipliers ai, which are much easier to handle. Secondly, in this formulation the training data will only appear in the form of dot products between vectors. This is a crucial property, which will allow us to generalise the procedure to the nonlinear case.

20 Lagrange Equation (dual)
In this formulation we can maximise LP subject to the constraint that the derivatives of LP with respect to w and b are both zero. This gives the dual form LD. Optimisation: LD should be maximised subject to the constraints ai ≥ 0 for all i ∈ {1, …, N}.
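The formulas on this slide were shown as images; restating the standard forms here (usual SVM derivation, written with the same multipliers ai):
\[
L_P = \tfrac{1}{2}\lVert w \rVert^{2}
      - \sum_{i=1}^{N} a_i \bigl[\, y_i\bigl((w \cdot x_i) + b\bigr) - 1 \,\bigr],
\]
\[
\frac{\partial L_P}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} a_i y_i x_i,
\qquad
\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} a_i y_i = 0,
\]
\[
L_D = \sum_{i=1}^{N} a_i
      - \tfrac{1}{2} \sum_{i=1}^{N}\sum_{j=1}^{N} a_i a_j \, y_i y_j \,(x_i \cdot x_j),
\]
to be maximised over ai ≥ 0 subject to Σi ai yi = 0.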

21 Lagrange Formulation The solution is expressed in terms of the fitted Lagrange multipliers ai: w = Σi ai yi xi. Some fraction of the ai are exactly zero; the xi for which ai > 0 are called the support points S, and w is a linear combination of these support points alone.

22 Support Vectors In the solution, those points for which ai > 0 are called "support vectors" and lie on one of the boundary hyperplanes H1, H2. All other points have ai = 0 and lie either on H1 or H2 (so that the equality holds) or strictly on the correct side of H1 or H2 (so that the strict inequality holds). Therefore, the support vectors are the critical points of the training set.

23 Nonlinear SVM and Kernel Functions

24 Nonlinear SVM The nonlinear SVM implements the following idea.
It maps the training points xi into a high-dimensional feature space through some nonlinear mapping chosen a priori. In this space, an optimal separating hyperplane is constructed. This is called “The Kernel Trick”

25 Geometric Idea (figure: nonlinear mapping from the original space to the feature space)

26 Geometric Idea

27 VC (Vapnik and Chervonenkis) Dimension
The VC dimension of a set of functions {fi} is the maximum number of points that can be separated into two classes in all possible ways. In other words, the VC dimension is the maximum number of points that can be shattered by the functions {fi}.

28 VC Dimension (an example)
The VC dimension of oriented lines in the plane is equal to 3, since they can shatter three points but not four.

29 VC Dimension (cont.) Theorem. The VC dimension of the set of oriented hyperplanes in Rn is n+1. Therefore a vector space Rm can always be found such that a set of n training points is shattered in Rm.

30 VC Dimension (cont.) To increase the dimension of the points we can define a mapping from the n-dimensional space to the m-dimensional space, where m > n: f : Rn → Rm. Then we can use the same training algorithm as in the linearly separable case by replacing xi . xj with f(xi) . f(xj) in all the equations. As a result, the separating surface's normal vector w will be in Rm.

31 Kernels However, mapping a point into a higher dimensional space might be too calculation- and storage-intensive to be practical. “Kernel functions are functions calculated in the original space that are equivalent to the dot product of the mapped vectors in the higher dimensional space”.

32 Kernels (cont.)
Therefore, if there were a kernel function K such that K(xi, xj) = f(xi) . f(xj), we would only need to use K in the training algorithm and would never need to perform the mapping explicitly. Thus, if we replace xi . xj with K(xi, xj) everywhere in the training algorithm, the algorithm will produce an SVM which lives in the high-dimensional space. All the linear considerations still hold, since we are still doing a linear separation, but in a different space.
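A minimal MATLAB sketch of this substitution (the data X and the choice of kernel are made up for illustration): wherever the training algorithm would use the matrix of dot products xi . xj, it now uses a kernel (Gram) matrix instead.
% Hypothetical sketch: replace dot products xi . xj with kernel evaluations K(xi, xj).
X = rand(10, 2);                        % toy training set: 10 points in R^2, one per row
linearGram = X * X';                    % matrix of plain dot products xi . xj
kernel = @(xi, xj) (xi * xj' + 1)^2;    % e.g. a degree-2 polynomial kernel
N = size(X, 1);
K = zeros(N);
for i = 1:N
    for j = 1:N
        K(i, j) = kernel(X(i, :), X(j, :));   % kernel Gram matrix used in place of linearGram
    end
end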

33 Kernels (cont.)
Suppose that our vectors are in R2 and we choose, for instance, the quadratic kernel K(xi, xj) = (xi . xj)². It is easy to find a mapping f from R2 to R3 such that K(xi, xj) = f(xi) . f(xj): we can choose f(x) = (x1², √2 x1x2, x2²).

34 Kernels (cont.) Note that neither the mapping f nor the higher-dimensional space is unique. For the same kernel we could equally well have chosen the space to again be R3, with for example f(x) = (1/√2)(x1² – x2², 2 x1x2, x1² + x2²), or to be R4, with f(x) = (x1², x1x2, x1x2, x2²).

35 Kernels (cont.) For which kernels does there exist a pair {Rm,f} with the properties described previously? The answer is given by Mercer’s condition.

36 Mercer's condition
If K is a function of the vectors u and v, then there exists a mapping f into some vector space with the expansion K(u, v) = Σi fi(u) fi(v) if and only if, for every g(u) such that ∫ g(u)² du is finite, ∫∫ K(u, v) g(u) g(v) du dv ≥ 0.

37 Some Examples of Nonlinear Kernels
- Polynomial (degree p): K(x, y) = (x . y + 1)^p
- Gaussian radial basis function: K(x, y) = exp(–||x – y||² / (2σ²))
- Sigmoidal (hyperbolic separating surface): K(x, y) = tanh(k (x . y) + g)
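As a sketch, the same three kernels written as MATLAB anonymous functions (the parameter names p, sigma, k and g are mine, chosen to match the figures on the following slides):
polyKernel = @(x, y, p) (x * y' + 1)^p;                         % polynomial kernel of degree p
rbfKernel  = @(x, y, sigma) exp(-norm(x - y)^2 / (2*sigma^2));  % Gaussian radial basis function
sigKernel  = @(x, y, k, g) tanh(k * (x * y') + g);              % sigmoidal kernel
% Example call on hypothetical row vectors: rbfKernel([1 2], [2 0], 5)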

38 Popular kernels: polynomial and radial basis function

39 Polynomial Kernels (figure: decision boundaries for polynomial kernels of increasing degree; panels labelled p = …, p = 3, p = 5)

40 Gaussian Kernels (figure: decision boundaries for s = 5, 20, 100)
Images from Tom Weedon’s report (Imperial College London)

41 Sigmoidal Kernels (figure: decision boundaries for [k, g] = [0.00003, –5], [0.00001, –5], [0.000001, –5])
Images from Tom Weedon’s report (Imperial College London)

42 SVM for the XOR Problem The exclusive OR is the simplest problem that cannot be solved using a linear discriminant operating on the features.

43 SVM for the XOR Problem (cont.)
The four points are mapped to a 6-dimensional space by the degree-2 polynomial feature map f(x1, x2) = (1, √2 x1, √2 x2, √2 x1x2, x1², x2²), the map induced by the kernel K(x, y) = (1 + x . y)²; in this space the four XOR points become linearly separable (the x1x2 coordinate alone separates the two classes).

44 Example: degree-4 polynomial kernel

45 Example: radial basis function kernel

46 Curse of Dimensionality
Support vector machines can still suffer from the curse of dimensionality: adding noise features degrades the performance of the SVM.

47 Conclusion An important benefit of the SVM approach is that the complexity of the resulting classifier is characterised by the number of support vectors rather than the dimensionality of the transformed space. As a result, SVMs tend to be less prone to problems of overfitting than some other methods.

48 And in MATLAB?
SVMStruct: structure containing information about the trained SVM classifier, in the following fields:
- SupportVectors — matrix of data points, with each row corresponding to a support vector in the normalized data space. This matrix is a subset of the Training input data matrix, after normalization has been applied according to the 'AutoScale' argument.
- Alpha — vector of weights for the support vectors. The sign of the weight is positive for support vectors belonging to the first group, and negative for the second group.
- KernelFunction — handle to the function that maps the training data into kernel space.

49 And in MATLAB?
SVMStruct: structure containing information about the trained SVM classifier, in the following fields:
- GroupNames — categorical, numeric, or logical vector, a cell vector of strings, or a character matrix with each row representing a class label. Specifies the group identifiers for the support vectors.
- SupportVectorIndices — vector of indices that specify the rows in Training, the training data, that were selected as support vectors after the data was normalized, according to the AutoScale argument.
- And others…

50 And in MATLAB?
SVMStruct = svmtrain(Training, Group):
Returns a structure, SVMStruct, containing information about the trained support vector machine (SVM) classifier.
Input arguments:
- Training: matrix of training data, where each row corresponds to an observation or replicate, and each column corresponds to a feature or variable.
- Group: grouping variable, with each row representing a class label.

51 And in MATLAB?
Runnable example: the Fisher iris data.
load fisheriris
xdata = meas(51:end,3:4);
group = species(51:end);
svmStruct = svmtrain(xdata, group, 'showplot', true);
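To then classify new observations with this trained model, one would typically call svmclassify from the same (older Bioinformatics Toolbox) interface; the sample points below are made up for illustration:
Xnew = [5.0 1.8; 4.5 1.4];                                   % hypothetical petal length/width pairs
predicted = svmclassify(svmStruct, Xnew, 'showplot', true);  % returns the predicted group labels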

52 And in MATLAB? (figure: plot produced by the 'showplot' option of svmtrain)

53 LIBSVM: A Library for Support Vector Machines
LIBSVM is integrated software for support vector classification, regression and distribution estimation. LIBSVM provides a simple interface through which users can easily link it with their own programs. Main features of LIBSVM include:
- Different SVM formulations
- Efficient multi-class classification
- Cross validation for model selection
- Probability estimates
- Various kernels (including precomputed kernel matrix)
- Both C++ and Java sources
- Demonstrations of SVM classification and regression
- …

54 libsvm
cv = svmtrain(heart_scale_label, heart_scale_inst, cmd);
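For context, a typical session with LIBSVM's MATLAB interface around that call looks roughly like the sketch below (heart_scale is the example data file shipped with LIBSVM; the option string is only illustrative, and when cmd contains the '-v' option svmtrain returns a cross-validation accuracy rather than a model, which is what the cv = ... line on the slide suggests):
[heart_scale_label, heart_scale_inst] = libsvmread('heart_scale');  % read data in LIBSVM's sparse format
cmd = '-c 1 -g 0.07';                                               % C and RBF gamma (illustrative values)
model = svmtrain(heart_scale_label, heart_scale_inst, cmd);         % train an SVM
[pred, acc, dec] = svmpredict(heart_scale_label, heart_scale_inst, model);  % predict and report accuracy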

55

56 The End

