
1 ECEC 481/681 – Statistical Pattern Recognition. Chapter 3 – Linear Classifiers: SVM. Chapter 4 – Non-Linear Classifiers.

2 Recap: density estimation; linear classifiers; curves (boundary/interior/exterior). Office hours: Cohen, Tues 2-3; Wait, Thurs 2-3.

3 The perceptron: a learning machine whose weights are learned from the training vectors via the perceptron algorithm. The resulting network is called a perceptron or neuron.

4 Example: at some stage t the perceptron algorithm results in a weight vector w(t); the corresponding hyperplane is w(t)^T x = 0. The learning parameter used in the example is ρ = 0.7.
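
A minimal sketch (not from the slides) of the perceptron algorithm described above, assuming two classes with ±1 labels, a fixed learning parameter ρ, and the bias absorbed as an extra weight; the data layout and parameter names are illustrative.

```python
import numpy as np

def perceptron_train(X, y, rho=0.7, max_epochs=100):
    """Perceptron rule: w <- w + rho * y_i * x_i for each misclassified x_i.
    X: (N, l) training vectors, y: (N,) labels in {-1, +1}."""
    X_ext = np.hstack([X, np.ones((X.shape[0], 1))])   # append 1 for the bias term
    w = np.zeros(X_ext.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X_ext, y):
            if y_i * (w @ x_i) <= 0:                   # misclassified (or on the boundary)
                w += rho * y_i * x_i                   # correction step
                errors += 1
        if errors == 0:                                # all training vectors classified correctly
            break
    return w                                           # extended weight vector [w, w_0]
```

If the classes are linearly separable, the loop terminates with a separating hyperplane; otherwise it simply stops after max_epochs.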

5 Least Squares Methods. If the classes are linearly separable, the perceptron algorithm converges to weights whose output correctly yields ±1 for all training vectors. If the classes are NOT linearly separable, we compute the weights w so that the difference between the actual output of the classifier, w^T x, and the desired outputs, e.g., ±1, is SMALL.

6 SMALL, in the mean square error sense, means choosing w so that the cost function J(w) = E[(y - w^T x)^2] is minimum.

7 Minimizing J(w) gives ∂J/∂w = 0 ⇒ R_x ŵ = E[x y], i.e., ŵ = R_x^{-1} E[x y], where R_x ≡ E[x x^T] is the autocorrelation matrix and E[x y] is the crosscorrelation vector.
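
A short sketch (not from the slides) of the sample-based version of this solution, with the autocorrelation matrix and crosscorrelation vector estimated from training data; the function and variable names are illustrative.

```python
import numpy as np

def mse_weights(X, y):
    """MSE-optimal weights w = R_x^{-1} r, with R_x and r estimated as sample averages.
    X: (N, l) training vectors (bias assumed absorbed as a constant column), y: (N,) targets."""
    N = X.shape[0]
    R = X.T @ X / N            # sample estimate of the autocorrelation matrix E[x x^T]
    r = X.T @ y / N            # sample estimate of the crosscorrelation vector E[x y]
    return np.linalg.solve(R, r)
```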

8 Multi-class generalization. The goal is to compute M linear discriminant functions g_i(x) = w_i^T x, i = 1, ..., M, according to the MSE criterion. Adopt as desired responses y_i = 1 if x ∈ ω_i and y_i = 0 otherwise. Let y = [y_1, ..., y_M]^T and collect the weight vectors into the matrix W = [w_1, ..., w_M].

9 The goal is to compute W so that E[‖y - W^T x‖^2] is minimized. This is equivalent to M separate MSE minimization problems; that is, design each w_i so that its desired output is 1 for x ∈ ω_i and 0 for any other class. Remark: the MSE criterion belongs to a more general class of cost functions with the following important property: the value of g_i(x) = w_i^T x is an estimate, in the MSE sense, of the a-posteriori probability P(ω_i | x), provided that the desired responses used during training are 1 for x ∈ ω_i and 0 otherwise.

10 Mean square error regression: let y, x be jointly distributed random vectors with joint pdf p(y, x). The goal: given the value of x, estimate the value of y. In the pattern recognition framework, given x, one wants to estimate the respective class label. The MSE estimate ŷ of y, given x, is defined as the value minimizing E[‖y - ŷ‖^2 | x]. It turns out that ŷ = E[y | x]. This is known as the regression of y given x and it is, in general, a non-linear function of x. If p(y, x) is Gaussian, the MSE regressor is linear.

11 SMALL in the sum of error squares sense means choosing w to minimize J(w) = Σ_{i=1}^{N} (y_i - w^T x_i)^2, where (y_i, x_i), i = 1, ..., N, are the training pairs: the input x_i and its corresponding class label y_i (±1).

12 Pseudoinverse Matrix. Define the N×l matrix X whose rows are the training vectors x_1^T, x_2^T, ..., x_N^T, and the vector of desired responses y = [y_1, ..., y_N]^T.

13 Thus, minimizing J(w) = ‖y - Xw‖^2 leads to the normal equations (X^T X) ŵ = X^T y, i.e., ŵ = (X^T X)^{-1} X^T y ≡ X^+ y, where X^+ = (X^T X)^{-1} X^T is the pseudoinverse of X. Assume N = l and X square and invertible; then X^+ = X^{-1} and ŵ = X^{-1} y.

14 Assume N > l. Then, in general, there is no w satisfying all N equations Xw = y simultaneously. The "solution" ŵ = X^+ y corresponds to the minimum sum of error squares.
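
A brief sketch (not from the slides) of the pseudoinverse solution in NumPy; the toy data are made up for illustration.

```python
import numpy as np

# Least-squares weights via the pseudoinverse (slides 12-14).
X = np.array([[0.4, 0.5], [0.6, 0.5], [0.1, 0.4], [0.2, 0.7]])   # N=4 > l=2, made-up data
y = np.array([1.0, 1.0, -1.0, -1.0])                              # desired responses

w_pinv = np.linalg.pinv(X) @ y                      # X^+ y = (X^T X)^{-1} X^T y here
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)     # numerically preferred, same minimizer
print(w_pinv, w_lstsq)
```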

15 Example (a numerical least-squares example and accompanying figure appear on slides 15-16; they are not reproduced in this transcript).

17 The Bias–Variance Dilemma. A classifier g(x; D) is a learning machine that tries to predict the class label y of x. In practice, a finite data set D is used for its training; we write g(x; D) to make this dependence explicit. Observe that for some training sets D the training may result in good estimates, while for others the result may be worse. The average performance of the classifier can be tested against the MSE-optimal value E[y | x], in the mean squares sense, via E_D[(g(x; D) - E[y | x])^2], where E_D is the mean over all possible data sets D.

18 The above is written as E_D[(g(x; D) - E[y | x])^2] = (E_D[g(x; D)] - E[y | x])^2 + E_D[(g(x; D) - E_D[g(x; D)])^2]. The first term is the contribution of the bias and the second the contribution of the variance. For a finite D, there is a trade-off between the two terms: reducing the bias tends to increase the variance, and vice versa. This is known as the bias-variance dilemma. Using a complex model results in low bias but high variance, as one changes from one training set to another; using a simple model results in high bias but low variance.
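
A small simulation sketch of this trade-off (not from the slides; the sin-plus-noise model, the polynomial fits, and all parameter values are assumptions chosen for illustration): a constant fit is a simple, high-bias model, while a degree-7 polynomial is a complex, high-variance one.

```python
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 50)
f_test = np.sin(2 * np.pi * x_test)                    # plays the role of E[y|x] in this toy model

def experiment(degree, n_sets=200, n_points=15, noise=0.3):
    """Fit a polynomial of the given degree to many random training sets and
    return (squared bias, variance) averaged over the test grid."""
    preds = np.empty((n_sets, x_test.size))
    for i in range(n_sets):
        x = rng.uniform(0, 1, n_points)
        y = np.sin(2 * np.pi * x) + noise * rng.standard_normal(n_points)
        coeffs = np.polyfit(x, y, degree)              # least-squares polynomial fit
        preds[i] = np.polyval(coeffs, x_test)
    bias2 = np.mean((preds.mean(axis=0) - f_test) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

print("constant fit (bias^2, var):", experiment(degree=0))   # high bias, low variance
print("degree-7 fit (bias^2, var):", experiment(degree=7))   # low bias, high variance
```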

19 LOGISTIC DISCRIMINATION. Consider an M-class task with classes ω_1, ..., ω_M. In logistic discrimination, the logarithms of the likelihood ratios are modeled via linear functions, i.e., ln(P(ω_i | x) / P(ω_M | x)) = w_{i,0} + w_i^T x, i = 1, ..., M-1. Taking into account that Σ_i P(ω_i | x) = 1, it is easily shown that this is equivalent to modeling the posterior probabilities as P(ω_M | x) = 1 / (1 + Σ_{i=1}^{M-1} exp(w_{i,0} + w_i^T x)) and P(ω_i | x) = exp(w_{i,0} + w_i^T x) P(ω_M | x), i = 1, ..., M-1.

20 For the two-class case it turns out that P(ω_1 | x) = 1 / (1 + exp(-w_0 - w^T x)) and P(ω_2 | x) = 1 - P(ω_1 | x).

21 The unknown parameters w_i, w_{i,0} are usually estimated by maximum likelihood arguments. Logistic discrimination is a useful tool, since it allows linear modeling and at the same time ensures that the posterior probabilities add to one.
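
A quick sketch (not from the slides) using scikit-learn's logistic regression, which fits the linear log-likelihood-ratio model by maximum likelihood; the two-class toy data are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up two-class data: two Gaussian clouds in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.hstack([np.zeros(50), np.ones(50)])

clf = LogisticRegression().fit(X, y)            # maximum-likelihood fit of w, w_0
posteriors = clf.predict_proba([[1.0, 1.0]])    # [[P(w1|x), P(w2|x)]]
print(posteriors, posteriors.sum())             # the posteriors add to one
```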

22 Support Vector Machines. The goal: given two linearly separable classes, design the classifier g(x) = w^T x + w_0 that leaves the maximum margin from both classes.

23 Margin: each hyperplane w^T x + w_0 = 0 is characterized by its direction in space, i.e., w, and its position in space, i.e., w_0. For EACH direction w, choose the hyperplane that leaves the SAME distance from the nearest points of each class. The margin is twice this distance.

24 The distance of a point x from the hyperplane is given by z = |g(x)| / ‖w‖. Scale w, w_0 so that at the nearest points from each class the discriminant function takes the values ±1 (g(x) = 1 for ω_1, g(x) = -1 for ω_2). Thus the margin is given by 1/‖w‖ + 1/‖w‖ = 2/‖w‖. Also, the following is valid: w^T x + w_0 ≥ 1 for every x ∈ ω_1 and w^T x + w_0 ≤ -1 for every x ∈ ω_2.

25 SVM (linear) classifier: minimize J(w) = (1/2)‖w‖^2 subject to y_i(w^T x_i + w_0) ≥ 1, i = 1, ..., N. This is justified since, by minimizing ‖w‖, the margin 2/‖w‖ is maximized.

26 The above is a quadratic optimization task subject to a set of linear inequality constraints. The Karush-Kuhn-Tucker conditions state that the minimizer satisfies: (1) ∂L/∂w = 0, (2) ∂L/∂w_0 = 0, (3) λ_i ≥ 0, i = 1, ..., N, (4) λ_i [y_i(w^T x_i + w_0) - 1] = 0, i = 1, ..., N, where L(w, w_0, λ) = (1/2)‖w‖^2 - Σ_i λ_i [y_i(w^T x_i + w_0) - 1] is the Lagrangian.

27 The solution: from the above it turns out that w = Σ_{i=1}^{N} λ_i y_i x_i and Σ_{i=1}^{N} λ_i y_i = 0.

28 Remarks: the Lagrange multipliers can be either zero or positive. Thus w = Σ_{i=1}^{N_s} λ_i y_i x_i, where N_s ≤ N is the number of vectors corresponding to positive Lagrange multipliers. From constraint (4) above, the vectors contributing to w satisfy y_i(w^T x_i + w_0) = 1.

29 These vectors are known as SUPPORT VECTORS and are the closest vectors, from each class, to the classifier. Once w is computed, w_0 is determined from conditions (4). The optimal hyperplane classifier of a support vector machine is UNIQUE. Although the solution is unique, the resulting Lagrange multipliers are not necessarily unique.
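
A minimal sketch (not from the slides) of a linear SVM on made-up separable data, using scikit-learn's SVC; after fitting, the support vectors, the weight vector w, the offset w_0, and the margin 2/‖w‖ can be inspected directly.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated Gaussian clouds (made up), labels in {-1, +1}.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (30, 2)), rng.normal(2, 0.5, (30, 2))])
y = np.hstack([-np.ones(30), np.ones(30)])

svm = SVC(kernel="linear", C=1e6).fit(X, y)     # very large C approximates the hard-margin SVM
print("support vectors:\n", svm.support_vectors_)
print("w =", svm.coef_, " w_0 =", svm.intercept_)
print("margin = 2/||w|| =", 2 / np.linalg.norm(svm.coef_))
```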

30 Dual problem formulation. The SVM formulation is a convex programming problem, with a convex cost function and a convex region of feasible solutions. Thus its solution can be obtained through the dual problem, i.e., maximize L(w, w_0, λ) subject to w = Σ_i λ_i y_i x_i, Σ_i λ_i y_i = 0, and λ ≥ 0.

31 Combining the above, the dual becomes: maximize Σ_i λ_i - (1/2) Σ_i Σ_j λ_i λ_j y_i y_j x_i^T x_j, subject to Σ_i λ_i y_i = 0 and λ ≥ 0.

32 Remarks: the support vectors enter the solution only via inner products x_i^T x_j. Non-Separable Classes.

33 In this case there is no hyperplane such that y_i(w^T x_i + w_0) ≥ 1 for all i. Recall that the margin is defined as twice the distance between the two hyperplanes w^T x + w_0 = 1 and w^T x + w_0 = -1.

34 The training vectors belong to one of three possible categories: 1) vectors outside the band which are correctly classified, i.e., y_i(w^T x_i + w_0) > 1; 2) vectors inside the band which are correctly classified, i.e., 0 ≤ y_i(w^T x_i + w_0) ≤ 1; 3) vectors that are misclassified, i.e., y_i(w^T x_i + w_0) < 0.

35 All three cases above can be represented through the constraint y_i(w^T x_i + w_0) ≥ 1 - ξ_i, with 1) ξ_i = 0, 2) 0 < ξ_i ≤ 1, 3) ξ_i > 1. The ξ_i are known as slack variables.

36 The goal of the optimization is now two-fold: maximize the margin and minimize the number of patterns with ξ_i > 0. One way to achieve this is via the cost J(w, ξ) = (1/2)‖w‖^2 + C Σ_i I(ξ_i), where C is a constant and I(ξ_i) = 1 for ξ_i > 0 and 0 otherwise; I(·) is not differentiable, so in practice we use an approximation. A popular choice is J(w, ξ) = (1/2)‖w‖^2 + C Σ_i ξ_i, minimized subject to y_i(w^T x_i + w_0) ≥ 1 - ξ_i and ξ_i ≥ 0. Following a similar procedure as before, we obtain the KKT conditions below.

37 KKT conditions: w = Σ_i λ_i y_i x_i; Σ_i λ_i y_i = 0; C - μ_i - λ_i = 0; λ_i [y_i(w^T x_i + w_0) - 1 + ξ_i] = 0; μ_i ξ_i = 0; λ_i ≥ 0, μ_i ≥ 0, i = 1, ..., N, where the μ_i are the multipliers associated with the constraints ξ_i ≥ 0.

38 The associated dual problem: maximize Σ_i λ_i - (1/2) Σ_i Σ_j λ_i λ_j y_i y_j x_i^T x_j, subject to 0 ≤ λ_i ≤ C and Σ_i λ_i y_i = 0. Remark: the only difference from the separable-class case is the upper bound C on the multipliers λ_i in the constraints.

39 Training the SVM: a major problem is the high computational cost. To this end, decomposition techniques are used. The rationale behind them is the following: start with an arbitrary data subset (working set) that can fit in memory; perform the optimization via a general-purpose optimizer; keep the resulting support vectors in the working set and replace the others with new vectors (from outside the set) that severely violate the KKT conditions; repeat the procedure. This procedure guarantees that the cost function decreases. Platt's SMO algorithm chooses a working set of only two samples, so that an analytic optimization solution can be obtained at each step.

40 Multi-class generalization: although theoretical generalizations exist, the most popular approach in practice is to treat the problem as M two-class problems (one against all). Example: observe the effect of different values of C in the case of non-separable classes (a code sketch follows).
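
A short sketch (not from the slides, with made-up overlapping data) illustrating the role of C: a smaller C tolerates more slack and typically yields a wider margin with more support vectors, while a larger C penalizes violations more heavily.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian clouds (made up), so the classes are not separable.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1.2, (100, 2)), rng.normal(1, 1.2, (100, 2))])
y = np.hstack([-np.ones(100), np.ones(100)])

for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(svm.coef_)              # 2/||w||
    print(f"C={C:>6}: margin={margin:.2f}, support vectors={svm.n_support_.sum()}")
```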


50 Practical SVM. We will revisit SVM classifiers at the end of Chapter 4. Twist: apply a special function that projects the data into a higher-dimensional space prior to applying the SVM classifier. A good open-source SVM implementation is libSVM; libSVM is now integrated with MATLAB.
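
A minimal sketch (not from the slides) in Python rather than MATLAB: scikit-learn's SVC is built on top of libSVM, and its RBF kernel plays the role of the projection into a higher-dimensional space. The XOR-style data are made up.

```python
import numpy as np
from sklearn.svm import SVC   # scikit-learn's SVC wraps libSVM

# XOR-like data (made up): not linearly separable in the original 2-D space.
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, (200, 2))
y = np.sign(X[:, 0] * X[:, 1])              # labels in {-1, +1} by quadrant

rbf_svm = SVC(kernel="rbf", C=10, gamma=1.0).fit(X, y)   # implicit nonlinear projection
print("training accuracy:", rbf_svm.score(X, y))
```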

51 Chapter 4 – Non-Linear Classifiers.

52 Non-Linear Classifiers – the XOR problem:
x1  x2  XOR  class
0   0   0    B
0   1   1    A
1   0   1    A
1   1   0    B

53 There is no single line (hyperplane) that separates class A from class B. In contrast, the AND and OR operations are linearly separable problems.

54 The Two-Layer Perceptron. For the XOR problem, draw two lines instead of one, g_1(x) = 0 and g_2(x) = 0.

55 Then class B is located outside the shaded area and class A inside. This is a two-phase design. Phase 1: draw two lines (hyperplanes), each realized by a perceptron; the outputs y_1, y_2 of the perceptrons will be 0 or 1, depending on the position of x. Phase 2: find the position of x w.r.t. both lines, based on the values of y_1, y_2.

56 Equivalently: the computations of the first phase perform the mapping x → y = [y_1, y_2]^T (1st phase), on which the class decision is then made (2nd phase):
x1  x2 | y1    y2   | class
0   0  | 0(-)  0(-) | B(0)
0   1  | 1(+)  0(-) | A(1)
1   0  | 1(+)  0(-) | A(1)
1   1  | 1(+)  1(+) | B(0)

57 The decision is now performed on the transformed data y. This can be done via a second line in the transformed space, which can also be realized by a perceptron.

58 The computations of the first phase perform a mapping that transforms the nonlinearly separable problem into a linearly separable one. The architecture: two inputs, two hidden neurons (one per line), and one output neuron.
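
A hand-built sketch of this architecture (not from the slides: the particular lines g_1(x) = x_1 + x_2 - 0.5, g_2(x) = x_1 + x_2 - 1.5 and the output line are a standard choice assumed here for illustration), using hard-limiter activations.

```python
import numpy as np

step = lambda v: (v > 0).astype(int)          # hard-limiter activation

def two_layer_xor(x):
    W1 = np.array([[1.0, 1.0], [1.0, 1.0]])   # hidden-layer weights (rows = neurons)
    b1 = np.array([-0.5, -1.5])               # realize g1(x) and g2(x)
    y = step(W1 @ x + b1)                     # phase 1: map x onto the unit-square vertices
    w2 = np.array([1.0, -2.0]); b2 = -0.5     # phase 2: line separating vertex (1,0) from (0,0),(1,1)
    return step(w2 @ y + b2)                  # 1 -> class A, 0 -> class B

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", two_layer_xor(np.array(x, dtype=float)))   # reproduces the XOR table
```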

59 Remember the perceptrons!

60 This is known as the two-layer perceptron, with one hidden and one output layer. The activation functions are hard limiters (step functions) with outputs in {0, 1}. The neurons (nodes) of the figure realize the lines (hyperplanes) g_1(x) = 0 and g_2(x) = 0 in the input space and the decision line in the transformed y space.

61 Classification capabilities of the two-layer perceptron: the mapping performed by the first-layer neurons is onto the vertices of the unit-side square, i.e., (0, 0), (0, 1), (1, 0), (1, 1). The more general case:

62 the first layer performs a mapping of a vector x ∈ R^l onto the vertices of the unit-side hypercube H_p. The mapping is achieved with p neurons, each realizing a hyperplane. The output of each of these neurons is 0 or 1, depending on the relative position of x w.r.t. its hyperplane.

63 Intersections of these hyperplanes form regions in the l-dimensional space. Each region corresponds to a vertex of the unit hypercube H_p.

64 For example, the 001 vertex corresponds to the region located on the (-) side of g_1(x) = 0, on the (-) side of g_2(x) = 0, and on the (+) side of g_3(x) = 0.

65 The output neuron realizes a hyperplane in the transformed y space that separates some of the vertices from the others. Thus, the two-layer perceptron has the capability to classify vectors into classes that consist of unions of polyhedral regions, but NOT ANY union: it depends on the relative position of the corresponding vertices.

66 Three-layer perceptrons. The architecture adds a second hidden layer. This network is capable of classifying vectors into classes consisting of ANY union of polyhedral regions. The idea is similar to the XOR problem: it realizes more than one hyperplane in the transformed space.

67 The reasoning: for each vertex corresponding to a class, say A, construct a hyperplane which leaves THIS vertex on one side (+) and ALL the others on the other side (-). The output neuron then realizes an OR gate. Overall: the first layer of the network forms the hyperplanes, the second layer forms the regions, and the output neuron forms the classes. Designing multilayer perceptrons: one direction is to adopt the above rationale and develop a structure that classifies all the training patterns correctly. The other direction is to choose a structure and compute the synaptic weights so as to optimize a cost function.

68 The Backpropagation Algorithm. This is an algorithmic procedure that computes the synaptic weights iteratively, so that an adopted cost function is minimized (optimized). In a large number of optimization procedures, the computation of derivatives is involved; hence, discontinuous activation functions such as the hard limiter pose a problem. There is always an escape path: use a smooth, differentiable approximation. The logistic function f(a) = 1/(1 + exp(-a)) is an example; other functions are also possible and in some cases more desirable.


70 The steps: adopt an optimizing cost function, e.g., the least squares error or the relative entropy between the desired responses and the actual responses of the network for the available training patterns. (That is, from now on we have to live with errors; we only try to minimize them, using certain criteria.) Then adopt an algorithmic procedure for the optimization of the cost function with respect to the synaptic weights, e.g., gradient descent, Newton's algorithm, or conjugate gradients.

71 The task is a nonlinear optimization problem. For the gradient descent method, the weights of layer r are updated as w^r(new) = w^r(old) + Δw^r, with Δw^r = -μ ∂J/∂w^r.

72 The procedure: initialize the unknown weights randomly with small values; compute the gradient terms backwards, starting with the weights of the last (3rd) layer and moving towards the first; update the weights; repeat the procedure until a termination criterion is met. Two major philosophies: batch mode, where the gradients are computed once ALL training data have been presented to the algorithm, i.e., by summing up all error terms; and pattern mode, where the gradients are computed every time a new training pair appears, so the gradients are based on successive individual errors. A sketch of the batch-mode procedure follows.
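
A minimal batch-mode backpropagation sketch (not from the slides: one hidden layer, logistic activations, least squares cost; the XOR data, layer sizes, learning rate μ, and epoch count are all illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)        # desired responses (XOR)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

W1 = rng.normal(0, 0.5, (2, 2)); b1 = np.zeros(2)      # hidden layer (2 neurons)
W2 = rng.normal(0, 0.5, (2, 1)); b2 = np.zeros(1)      # output layer (1 neuron)
mu = 0.5                                               # learning rate

for epoch in range(20000):
    # forward pass
    y1 = sigmoid(X @ W1 + b1)                          # hidden outputs
    y2 = sigmoid(y1 @ W2 + b2)                         # network outputs
    # backward pass: deltas for the least squares cost, summed over the whole batch
    d2 = (y2 - t) * y2 * (1 - y2)                      # output-layer delta
    d1 = (d2 @ W2.T) * y1 * (1 - y1)                   # hidden-layer delta
    W2 -= mu * y1.T @ d2;  b2 -= mu * d2.sum(axis=0)
    W1 -= mu * X.T @ d1;   b1 -= mu * d1.sum(axis=0)

print(np.round(y2.ravel(), 2))   # ideally approaches [0, 1, 1, 0]; may land in a local minimum (slide 74)
```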


74 A major problem: the algorithm may converge to a local minimum.

75 The cost function choice. Examples: the least squares cost J = Σ_{i=1}^{N} Σ_{m=1}^{k} (y_m(i) - ŷ_m(i))^2, where y_m(i) is the desired response of the m-th output neuron (1 or 0) for the input x(i), and ŷ_m(i) is the actual response of the m-th output neuron, in the interval [0, 1], for the input x(i).

76 The cross-entropy cost J = -Σ_{i=1}^{N} Σ_{m=1}^{k} [y_m(i) ln ŷ_m(i) + (1 - y_m(i)) ln(1 - ŷ_m(i))]. This presupposes an interpretation of y and ŷ as probabilities. A third choice is the classification error rate, also known as discriminative learning; most of these techniques use a smoothed version of the classification error.

77 Remark 1: a common feature of all the above is the danger of convergence to a local minimum. "Well-formed" cost functions guarantee convergence to a "good" solution, that is, one that classifies ALL training patterns correctly, provided such a solution exists. The cross-entropy cost function is well formed; the least squares cost is not.

78 Remark 2: both the least squares and the cross-entropy costs lead to output values ŷ_m(x) that optimally approximate the class a-posteriori probabilities P(ω_m | x), that is, the probability of class ω_m given x. This is a very interesting result: it does not depend on the underlying distributions, being a characteristic of certain cost functions. How good or bad the approximation is depends on the underlying model. Furthermore, it is only valid at the global minimum.

79 Choice of the network size: how big should a network be, with how many layers and how many neurons per layer? There are two major directions. Pruning techniques: these start from a large network and then remove weights and/or neurons iteratively, according to a criterion.

80 Methods based on parameter sensitivity: the change of the cost due to a small perturbation of the weights is δJ ≈ Σ_i g_i δw_i + (1/2) Σ_i h_ii δw_i^2 + (1/2) Σ_{i≠j} h_ij δw_i δw_j + higher-order terms, where g_i = ∂J/∂w_i and h_ij = ∂^2 J / ∂w_i ∂w_j. Near a minimum, g_i ≈ 0, and assuming the off-diagonal terms h_ij (i ≠ j) are negligible, δJ ≈ (1/2) Σ_i h_ii δw_i^2. Setting a weight to zero (δw_i = -w_i) changes the cost by the saliency s_i = h_ii w_i^2 / 2.

81 Pruning is now achieved with the following procedure: train the network using backpropagation for a number of steps; compute the saliencies s_i; remove the weights with small s_i; repeat the process. The other family consists of methods based on function regularization. A sketch of saliency-based pruning follows.
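
A small sketch (not from the slides) of the saliency-based pruning step, assuming a diagonal Hessian approximation h_ii is available for a trained network; the weights, Hessian values, and pruning fraction below are made up for illustration.

```python
import numpy as np

def prune_by_saliency(weights, hessian_diag, fraction=0.2):
    """Zero out the `fraction` of weights with the smallest saliency s_i = h_ii * w_i^2 / 2."""
    w = weights.ravel().copy()
    saliency = 0.5 * hessian_diag.ravel() * w ** 2
    n_prune = int(fraction * w.size)
    idx = np.argsort(saliency)[:n_prune]      # weights whose removal changes J the least
    w[idx] = 0.0                              # prune (remove) those connections
    return w.reshape(weights.shape)

# Usage on made-up values; in practice one would retrain after each pruning pass.
w = np.array([0.01, -1.3, 0.4, 2.1, -0.02, 0.7])
h = np.array([1.0, 0.8, 1.2, 0.5, 2.0, 1.1])
print(prune_by_saliency(w, h, fraction=0.3))
```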

82 Homework (due by beginning of class 4/19/2016)
- Read Chapter 4 of the text.
- Do text problem 4.1, p. 240.
- Read the following 2 papers and answer the questions:
  1. LeCun, Y., Bengio, Y., and Hinton, G. (2015). "Deep learning." Nature 521(7553): 436-444.
     a. What is the difference between a neural network and a convolutional network?
     b. What is the difference between a feed-forward network and backpropagation?
     c. (Graduate students registered for 681) What is the key characteristic of recurrent neural networks, and what is the main challenge with using them?
  2. Cilibrasi, R. and Vitanyi, P. M. B. (2005). "Clustering by compression." IEEE Transactions on Information Theory 51(4): 1523-1545.
     a. What are the characteristics of a 'normal' compressor and how do these relate to the notion of a metric distance?
     b. (Graduate students registered for 681) Why is the specific choice of compression algorithm important in the NCD?
     c. (Graduate students registered for 681) Why is the NCD robust to variations in different compression algorithms?
     d. Download the MNIST handwritten digit data from http://yann.lecun.com/exdb/mnist/. Extract 3 different digits and compress them using any compression algorithm of your choice. Report the compressed and uncompressed size of each digit in bytes.

83 Homework part 2 – project proposal (due April 26th)
- Submit a "letter of intent" for your course project to me. This is a separate e-mail submission declaring your intentions for this semester's research project.
- Include the title of your project, the names of your co-authors, and describe why the project is important, exactly how it relates to the course content, and how it impacts/differs from your current research.
- Some ideas for projects follow on the next slide.
- Include (if appropriate) a small sample image, a brief description of the data and of the classification tasks you will undertake, and how this differs from previous work.
- The letter should be a 1-page PDF, no less than 11-point Arial font, 0.5" margins.
- Be sure to acknowledge that the written research paper will be submitted prior to the scheduled final exam, and that there will be a short oral presentation during the last week.
- Be sure to include what "ground truth" data you will base your results on; in other words, how you will evaluate the performance of your classifier.
- This proposal will be worth 20% of your final project grade. Grading rubric: format / appearance / readability – 30%; items from the bullet points above – 70%.

84 Homework part 2 – project ideas
- Ideal: use data from your thesis or your research advisor and combine it with a new method from the text or the related literature.
- Pick 2-3 techniques (e.g., deep learning, NCD, etc.) to explore in detail.
- Use the CloneView data (http://bioimage.coe.drexel.edu/CloneView/). Detect: death, mitosis, etc. (!) This is my research – opportunities for advanced study.
- Try a comparison of word vectors and the Normalized Google Distance (NGD):
  - Jeffrey Pennington, Richard Socher, and Christopher D. Manning (2014). GloVe: Global Vectors for Word Representation.
  - Cilibrasi, R. and Vitanyi, P. M. B. (2007). The Google Similarity Distance. IEEE Transactions on Knowledge and Data Engineering 19, 370-383.
- Data sources: bioimage data from the Cohen lab; the MNIST dataset; look on Kaggle for data; talk to me for other ideas / feedback.

85 Instructor Contact Information
Andrew R. Cohen, Associate Prof.
Department of Electrical and Computer Engineering, Drexel University
3120-40 Market St., Suite 110, Philadelphia, PA 19104
Office phone: (215) 571-4358
http://bioimage.coe.drexel.edu/courses
http://bioimage.coe.drexel.edu
acohen@coe.drexel.edu

