





1 ECES 690 – Statistical Pattern Recognition. Lecture 2: Density Estimation / Linear Classifiers. Andrew R. Cohen

2 Recap
• Pattern recognition is about classifying data.
• The classes may be known a priori (supervised) or unknown (unsupervised).
• As you plan your term project, it should probably contain some aspect related to classification.

3 Minimum Distance Classifiers
• For equiprobable classes with Gaussian class-conditional densities and a common covariance matrix, the Bayesian classifier reduces to a minimum distance classifier.
• Euclidean distance (Σ = σ²I): assign x to the class ω_i for which d_E = ||x − μ_i|| is smaller.
• Mahalanobis distance (general common Σ): assign x to the class ω_i for which d_M = sqrt((x − μ_i)^T Σ^{-1} (x − μ_i)) is smaller.
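A minimal MATLAB sketch of these two rules (not from the slides); the test sample x, the means and the covariance below are assumed values.

x   = [0.5; 1.2];                          % test sample (hypothetical)
mu1 = [0; 0];   mu2 = [2; 2];              % known class means
Sigma = [1 0.3; 0.3 1];                    % known common covariance matrix
dE = [norm(x - mu1), norm(x - mu2)];       % Euclidean distances to each mean
dM = [sqrt((x-mu1)'*(Sigma\(x-mu1))), ...  % Mahalanobis distances
      sqrt((x-mu2)'*(Sigma\(x-mu2)))];
[~, classEuclid] = min(dE);                % assign x to the class with the smaller distance
[~, classMahal]  = min(dM);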

4 (figure-only slide)

5 ESTIMATION OF UNKNOWN PROBABILITY DENSITY FUNCTIONS
• Maximum Likelihood: given X = {x_1, …, x_N} drawn independently from p(x; θ), choose the θ that maximizes the likelihood p(X; θ) = Π_{k=1}^N p(x_k; θ).
• Equivalently, maximize the log-likelihood L(θ) = Σ_{k=1}^N ln p(x_k; θ), i.e., solve ∂L(θ)/∂θ = 0.

6 (equation-only slide; the ML derivation continues on the slide image)

7 (figure-only slide)

8 The maximum likelihood estimator is asymptotically unbiased and consistent.

9 Estimation Bias and Consistency
• The bias of an estimator is the difference between the estimator's expected value and the true value of the parameter being estimated: bias(θ̂) = E[θ̂] − θ.
• An estimator is said to be consistent if it converges in probability to the true value of the parameter as N → ∞.
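A minimal MATLAB sketch (not from the slides) illustrating these definitions: the ML estimate of a Gaussian variance divides by N, so its bias shrinks toward zero as N grows; all numbers below are assumed.

trueVar = 1;  nTrials = 2000;
for N = [10 100 1000]
    vML = zeros(nTrials, 1);
    for t = 1:nTrials
        x = sqrt(trueVar) * randn(N, 1);   % N samples from N(0,1)
        vML(t) = mean((x - mean(x)).^2);   % ML variance estimate (divides by N)
    end
    fprintf('N = %4d: mean ML estimate = %.3f (true value %.1f)\n', N, mean(vML), trueVar);
end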

10 Example:

11 Maximum A Posteriori (MAP) Probability Estimation
• In the ML method, θ was considered an unknown but fixed parameter.
• Here we look at θ as a random vector described by a pdf p(θ), assumed to be known.
• Given X = {x_1, …, x_N}, compute the maximum of p(θ|X).
• From Bayes' theorem: p(θ|X) = p(θ) p(X|θ) / p(X), so maximizing p(θ|X) is equivalent to maximizing p(θ) p(X|θ).

12 The method: the MAP estimate θ̂_MAP is the solution of ∂/∂θ [ln p(X|θ) + ln p(θ)] = 0. If p(θ) is nearly flat around the ML solution, the MAP and ML estimates approximately coincide.

13 (figure-only slide)

14 Example:

15 Nonparametric Estimation
• Histogram-type estimate: p̂(x) ≈ (1/h)(k_N/N), where k_N is the number of the N points falling inside the segment of length h centered at x.
• In words: place a segment of length h at x and count the points inside it.
• If p(x) is continuous, p̂(x) → p(x) as N → ∞, provided h → 0, k_N → ∞ and k_N/N → 0.

16 Example – histograms!
MATLAB:
x = normrnd(0, 1, N, 1);      % N samples from N(0,1); note normrnd(0,1,N) would return an N-by-N matrix
hist(x, k)                    % histogram with k bins: a count of frequencies
PDF – histogram relationship:
nHist = hist(x, 100);         % hist returns the bin counts (it takes no third argument here)
nPDF  = nHist ./ sum(nHist);  % relative frequencies; divide also by the bin width to approximate the pdf

17 Parzen Windows
• Place at x a hypercube of side length h and count the points falling inside it.

18 Parzen windows – kernels – potential functions
• Define φ(x) = 1 if |x_j| ≤ 1/2 for every component j, and 0 otherwise; that is, φ is 1 inside a unit-side hypercube centered at 0.
• The estimate is p̂(x) = (1/h^ℓ)(1/N) Σ_{i=1}^N φ((x_i − x)/h).
• The problem: the discontinuity of φ. The remedy is to replace it with a smooth kernel (e.g., a Gaussian) with φ(x) ≥ 0 and ∫ φ(x) dx = 1.
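A minimal one-dimensional MATLAB sketch (not from the slides) of a Parzen estimate with a Gaussian kernel; h, N and the underlying N(0,1) density are assumed choices.

N = 1000;  h = 0.2;
x  = randn(N, 1);                          % training points drawn from N(0,1)
xg = linspace(-4, 4, 200);                 % grid on which the estimate is evaluated
pHat = zeros(size(xg));
for i = 1:numel(xg)
    u = (x - xg(i)) / h;                   % scaled differences (x_i - x)/h
    pHat(i) = mean(exp(-0.5*u.^2) / sqrt(2*pi)) / h;   % (1/Nh) * sum of Gaussian kernels
end
plot(xg, pHat, 'b', xg, exp(-0.5*xg.^2)/sqrt(2*pi), 'r--');
legend('Parzen estimate', 'true pdf');

Rerunning the sketch with a smaller h or a larger N reproduces the behaviour discussed on the next slides.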

19 Mean value: E[p̂(x)] = (1/h^ℓ) ∫ φ((x' − x)/h) p(x') dx'. As h → 0, (1/h^ℓ) φ((x' − x)/h) tends to a delta function centered at x, so E[p̂(x)] → p(x): hence the estimate is unbiased in the limit. (Recall that the bias of an estimator is the difference between the estimator's expected value and the true value of the parameter being estimated.)

20 Variance: the smaller the h, the higher the variance of the estimate. (Figures on slide: Parzen estimates for h = 0.1, N = 1000 and h = 0.8, N = 1000.)

21 (Figure on slide: h = 0.1, N = 10000.) The larger N is, the better the accuracy.

22 If h → 0 and N h^ℓ → ∞ as N → ∞, the estimate is asymptotically unbiased and consistent.
• The method: estimate p(x|ω_i) for each class by Parzen windows and plug the estimates into the Bayesian classifier.
• Remember: assign x to ω_1 if p̂(x|ω_1) / p̂(x|ω_2) > P(ω_2) / P(ω_1).

23 CURSE OF DIMENSIONALITY
• In all the methods so far, the higher the number of points N, the better the resulting estimate.
• If in the one-dimensional space an interval filled with N points is adequate (for good estimation), the corresponding square in the two-dimensional space will require on the order of N² points, and the ℓ-dimensional cube will require on the order of N^ℓ points.
• This exponential increase in the number of necessary points is known as the curse of dimensionality, a major problem one is confronted with in high-dimensional spaces.

24 An example:

25 NAÏVE–BAYES CLASSIFIER
• Let x ∈ R^ℓ; the goal is to estimate p(x|ω_i), i = 1, 2, …, M. For a "good" estimate of each pdf one would need on the order of N^ℓ points.
• Assume the features x_1, x_2, …, x_ℓ to be mutually independent. Then p(x|ω_i) = Π_{j=1}^ℓ p(x_j|ω_i).
• In this case, one would require roughly N points for each one-dimensional pdf; a number of points of the order N·ℓ suffices.
• It turns out that the Naïve–Bayes classifier works reasonably well even in cases that violate the independence assumption.
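A minimal MATLAB sketch (not from the slides) of a Gaussian naïve–Bayes classifier that estimates each one-dimensional p(x_j|ω_i) separately; the data, dimensions and equal priors are assumed.

N = 200;  l = 5;
X1 = randn(N, l) + 1;                      % class w1 training data
X2 = randn(N, l) - 1;                      % class w2 training data
mu = [mean(X1); mean(X2)];                 % per-class, per-feature means (2 x l)
s2 = [var(X1);  var(X2)];                  % per-class, per-feature variances (2 x l)
x  = randn(1, l) + 1;                      % test vector (here drawn from class w1)
logp = zeros(2, 1);
for i = 1:2                                % sum of log p(x_j|w_i); equal priors assumed
    logp(i) = sum(-0.5*log(2*pi*s2(i,:)) - (x - mu(i,:)).^2 ./ (2*s2(i,:)));
end
[~, predictedClass] = max(logp);           % 1 -> w1, 2 -> w2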

26 k Nearest Neighbor Density Estimation
• In Parzen windows: the volume is constant and the number of points falling inside it varies.
• Now: keep the number of points k constant and let the volume vary, i.e., p̂(x) = k / (N · V(x)), where V(x) is the volume of the smallest region around x that contains k points.

27 (figure-only slide)

28 The Nearest Neighbor Rule
• Choose k out of the N training vectors: identify the k nearest ones to x.
• Out of these k, identify the number k_i that belong to class ω_i and assign x to the class with the maximum k_i.
• The simplest version: k = 1!
• For large N this is not bad. It can be shown that, if P_B is the optimal Bayesian error probability, then P_B ≤ P_NN ≤ P_B (2 − (M/(M−1)) P_B) ≤ 2 P_B.
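A minimal MATLAB sketch (not from the slides) of the k-NN rule for one test point; the training sets, k and the test point are assumed.

k = 3;
X = [randn(50,2) + 1; randn(50,2) - 1];    % 100 training vectors from two classes
y = [ones(50,1); 2*ones(50,1)];            % class labels (1 = w1, 2 = w2)
x = [0.8, 0.9];                            % test point
d = sum(bsxfun(@minus, X, x).^2, 2);       % squared Euclidean distances to x
[~, idx] = sort(d);                        % nearest training vectors first
nearestLabels = y(idx(1:k));               % labels of the k nearest neighbors
predictedClass = mode(nearestLabels);      % majority vote: the largest k_i wins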

29 As k increases (with k/N → 0), the k-NN error approaches the Bayesian error P_B. For small P_B: P_NN ≈ 2 P_B and P_3NN ≈ P_B + 3 P_B². An example is shown on the slide.

30 Voronoi tessellation: the 1-NN rule partitions the space into the Voronoi cells of the training points, R_i = {x : d(x, x_i) < d(x, x_j), j ≠ i}.

31 BAYESIAN NETWORKS
• Bayes probability chain rule: p(x_1, x_2, …, x_ℓ) = p(x_ℓ | x_{ℓ−1}, …, x_1) · p(x_{ℓ−1} | x_{ℓ−2}, …, x_1) ⋯ p(x_2 | x_1) · p(x_1).
• Assume now that the conditional dependence for each x_i is limited to a subset A_i of the features appearing in each of the product terms. That is: p(x_1, x_2, …, x_ℓ) = p(x_1) Π_{i=2}^ℓ p(x_i | A_i), where A_i ⊆ {x_{i−1}, x_{i−2}, …, x_1}.

32 For example, if ℓ = 6, then we could assume p(x_6 | x_5, …, x_1) = p(x_6 | x_5, x_4), i.e., A_6 = {x_5, x_4} ⊂ {x_5, …, x_1}, and similarly for the other terms. Then the joint pdf becomes the product of such lower-dimensional conditionals.
• The above is a generalization of Naïve–Bayes. For Naïve–Bayes the assumption is A_i = Ø, for i = 1, 2, …, ℓ.

33 A graphical way to portray conditional dependencies is given in the figure (a directed acyclic graph over x_1, …, x_6). According to the figure:
• x_6 is conditionally dependent on x_4, x_5;
• x_5 on x_4;
• x_4 on x_1, x_2;
• x_3 on x_2;
• x_1, x_2 are conditionally independent of the other variables.
For this case: p(x_1, …, x_6) = p(x_6 | x_4, x_5) · p(x_5 | x_4) · p(x_4 | x_1, x_2) · p(x_3 | x_2) · p(x_2) · p(x_1).

34 Bayesian Networks
• Definition: A Bayesian network is a directed acyclic graph (DAG) whose nodes correspond to random variables. Each node is associated with a set of conditional probabilities (densities) p(x_i | A_i), where x_i is the variable associated with the node and A_i is the set of its parents in the graph.
• A Bayesian network is specified by: the marginal probabilities of its root nodes, and the conditional probabilities of the non-root nodes, given their parents, for ALL possible values of the involved variables.

35 The figure is an example of a Bayesian network corresponding to a paradigm from the medical applications field. It models conditional dependencies for an example concerning smokers (S), tendencies to develop cancer (C) and heart disease (H), together with variables corresponding to heart (H1, H2) and cancer (C1, C2) medical tests.

36
• Once a DAG has been constructed, the joint probability can be obtained by multiplying the marginal (root node) and conditional (non-root node) probabilities.
• Training: once a topology is given, the probabilities are estimated from the training data set. There are also methods that learn the topology itself.
• Probability inference: this is the most common task that Bayesian networks help us solve efficiently. Given the values of some of the variables in the graph, known as evidence, the goal is to compute the conditional probabilities of some of the other variables, given the evidence.
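A minimal MATLAB sketch (not from the slides) of inference by enumeration on a tiny two-node network x → w. The conditional probability values are hypothetical placeholders, not the numbers behind the example on the next slides.

Px  = [0.6 0.4];                           % P(x=0), P(x=1)   (hypothetical values)
Pwx = [0.9 0.1;                            % row i: P(w=0 | x=i-1), P(w=1 | x=i-1)
       0.3 0.7];
Pxw = bsxfun(@times, Px', Pwx);            % joint table P(x,w) = P(x) P(w|x)
Pw0_given_x1 = Pwx(2,1);                   % evidence x=1: read off the CPT
Px0_given_w1 = Pxw(1,2) / sum(Pxw(:,2));   % evidence w=1: reverse direction via Bayes' theorem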

37 Example: consider the Bayesian network of the figure.
a) If x is measured to be x = 1 (x1), compute P(w = 0 | x = 1) [P(w0|x1)].
b) If w is measured to be w = 1 (w1), compute P(x = 0 | w = 1) [P(x0|w1)].

38
• For a), a set of calculations is required that propagates from node x to node w. It turns out that P(w0|x1) = 0.63.
• For b), the propagation is reversed in direction. It turns out that P(x0|w1) = 0.4.
• In general, the required inference information is computed via a combined process of "message passing" among the nodes of the DAG.
• Complexity: for singly connected graphs, message-passing algorithms have complexity linear in the number of nodes.

39 Bayesian networks and functional graphical models. Reference: Pearl, J., Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.

40 Intermission
• Chapter 2: Bayes classification; estimating the distribution; isolating dependencies.
• Chapter 3: Perceptrons; linear classifiers; neural networks (1); Support Vector Machines (linear?).

41 LINEAR CLASSIFIERS
• The problem: consider a two-class task with classes ω_1, ω_2 and a linear discriminant function g(x) = w^T x + w_0 = 0; decide ω_1 if g(x) > 0 and ω_2 if g(x) < 0.

42 Hence: for any two points x_1, x_2 on the decision hyperplane, w^T(x_1 − x_2) = 0, so w is orthogonal to the hyperplane; the distance of a point x from the hyperplane is |g(x)| / ||w|| (geometry shown in the figure).

43 The Perceptron Algorithm
• Assume linearly separable classes, i.e., there exists w* such that w*^T x > 0 for every x ∈ ω_1 and w*^T x < 0 for every x ∈ ω_2.
• The case where a threshold w_0 is also present falls under the above formulation, since it is absorbed by the augmented vectors x' ≡ [x^T, 1]^T and w' ≡ [w^T, w_0]^T, for which w'^T x' = w^T x + w_0.

44 Our goal: compute a solution, i.e., a hyperplane w, so that w^T x > 0 for x ∈ ω_1 and w^T x < 0 for x ∈ ω_2. The steps:
– Define a cost function to be minimized.
– Choose an algorithm to minimize the cost function.
– The minimum corresponds to a solution.

45 The Cost Function: J(w) = Σ_{x ∈ Y} δ_x w^T x, where Y is the subset of the training vectors wrongly classified by w and δ_x = −1 if x ∈ ω_1, δ_x = +1 if x ∈ ω_2 (so every term is non-negative). When Y = Ø (empty set), a solution has been achieved and J(w) = 0; otherwise J(w) > 0.

46 J(w) is piecewise linear (WHY?). The Algorithm: the philosophy of gradient descent is adopted, w(t+1) = w(t) − ρ_t (∂J(w)/∂w)|_{w = w(t)}.

47 Wherever the gradient is valid, ∂J(w)/∂w = Σ_{x ∈ Y} δ_x x, so the update becomes w(t+1) = w(t) − ρ_t Σ_{x ∈ Y} δ_x x. This is the celebrated Perceptron Algorithm.

48 An example is shown on the slide. The perceptron algorithm converges in a finite number of iteration steps to a solution, provided the classes are linearly separable and the learning-rate sequence ρ_t is properly chosen (e.g., a suitable constant or ρ_t = c/t).

49 A useful variant of the perceptron algorithm considers the training vectors one at a time (cyclically):
• w(t+1) = w(t) + ρ x_(t) if x_(t) ∈ ω_1 and w(t)^T x_(t) ≤ 0;
• w(t+1) = w(t) − ρ x_(t) if x_(t) ∈ ω_2 and w(t)^T x_(t) ≥ 0;
• w(t+1) = w(t) otherwise.
It is a reward-and-punishment type of algorithm.
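A minimal MATLAB sketch (not from the slides) of the reward-and-punishment perceptron on augmented vectors [x; 1]; the data and the learning rate ρ are assumed.

N = 100;
X = [randn(N,2) + 2, ones(N,1);            % class w1, augmented with a trailing 1
     randn(N,2) - 2, ones(N,1)];           % class w2, augmented with a trailing 1
y = [ones(N,1); -ones(N,1)];               % +1 for w1, -1 for w2
w = zeros(3,1);  rho = 0.7;
for epoch = 1:100
    nErrors = 0;
    for t = 1:size(X,1)
        if y(t) * (X(t,:) * w) <= 0        % wrongly classified (or on the boundary)
            w = w + rho * y(t) * X(t,:)';  % reward / punishment update
            nErrors = nErrors + 1;
        end
    end
    if nErrors == 0, break; end            % all training vectors correct: converged
end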

50 The perceptron: a learning machine that learns from the training vectors via the perceptron algorithm. The network is called a perceptron or neuron.

51 Example: at some stage t the perceptron algorithm results in a weight vector w(t); the corresponding hyperplane is shown in the figure (learning rate ρ = 0.7).

52 Least Squares Methods
• If the classes are linearly separable, the perceptron output results in ±1, matching the desired class labels.
• If the classes are NOT linearly separable, we shall compute the weights w so that the difference between the actual output of the classifier, w^T x, and the desired output, e.g., y = ±1, is SMALL.

53 SMALL, in the mean square error sense, means to choose w so that the cost function J(w) = E[|y − w^T x|²] is minimum.

54 Minimizing J(w) gives ∂J(w)/∂w = 0 ⇒ R_x ŵ = E[x y], hence ŵ = R_x^{-1} E[x y], where R_x = E[x x^T] is the autocorrelation matrix and E[x y] the cross-correlation vector.

55 Multi-class generalization
• The goal is to compute M linear discriminant functions g_i(x) = w_i^T x + w_{i0} according to the MSE criterion.
• Adopt as desired responses y_i: y_i = 1 if x ∈ ω_i and y_i = 0 otherwise.
• Let y = [y_1, …, y_M]^T and the matrix W = [w_1, …, w_M].

56 The goal is to compute Ŵ = arg min_W E[||y − W^T x||²]. The above is equivalent to a number M of separate MSE minimization problems; that is, design each w_i so that its desired output is 1 for x ∈ ω_i and 0 for any other class.
• Remark: the MSE criterion belongs to a more general class of cost functions with the following important property: the value of g_i(x) is an estimate, in the MSE sense, of the a-posteriori probability P(ω_i|x), provided that the desired responses used during training are 1 for x ∈ ω_i and 0 otherwise.

57 Mean square error regression: let y, x be jointly distributed random vectors with a joint pdf p(y, x).
• The goal: given the value of x, estimate the value of y. In the pattern recognition framework, given x one wants to estimate the respective class label.
• The MSE estimate ŷ of y, given x, is defined as ŷ = arg min_{ỹ} E[||y − ỹ||² | x]. It turns out that ŷ = E[y | x].
• This is known as the regression of y given x, and it is, in general, a non-linear function of x. If the joint distribution of [y, x] is Gaussian, the MSE regressor is linear.

58 SMALL in the sum-of-error-squares sense means J(w) = Σ_{i=1}^N (y_i − w^T x_i)², where (y_i, x_i) are the training pairs, that is, the input x_i and its corresponding class label y_i (±1).

59 Pseudoinverse Matrix
• Define X = [x_1, x_2, …, x_N]^T (an N × ℓ matrix whose rows are the training vectors) and y = [y_1, …, y_N]^T (the vector of desired responses).
• Then Σ_{i=1}^N (y_i − w^T x_i)² = ||y − X w||², and setting the gradient to zero gives the normal equations X^T X ŵ = X^T y.

60 Thus ŵ = (X^T X)^{-1} X^T y ≡ X^+ y, where X^+ = (X^T X)^{-1} X^T is the pseudoinverse of X.
• If N = ℓ and X is square and invertible, then X^+ = X^{-1} and ŵ = X^{-1} y.

61 Assume N > ℓ. Then, in general, there is no w that satisfies all N equations w^T x_i = y_i, i = 1, …, N, simultaneously (the system X w = y is overdetermined). The "solution" ŵ = X^+ y corresponds to the minimum sum-of-error-squares solution.
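A minimal MATLAB sketch (not from the slides) of the sum-of-error-squares classifier via the pseudoinverse; the data are assumed and the labels are ±1.

N = 100;
X = [randn(N,2) + 1, ones(N,1);            % class w1 (augmented with a 1)
     randn(N,2) - 1, ones(N,1)];           % class w2 (augmented with a 1)
y = [ones(N,1); -ones(N,1)];
w = pinv(X) * y;                           % w_hat = (X'X)^(-1) X' y
% equivalently: w = (X'*X) \ (X'*y);  or simply w = X \ y;
yhat = sign(X * w);                        % classify the training set
trainingError = mean(yhat ~= y);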

62 Example:

63 (figure-only slide)

64 The Bias–Variance Dilemma
• A classifier is a learning machine that tries to predict the class label y of x. In practice, a finite data set D is used for its training; let us write g(x; D) to make this dependence explicit. Observe that:
• For some training sets D, the training may result in a good estimate; for some others the result may be worse.
• The average performance of the classifier can be tested against the MSE-optimal value E[y|x], in the mean squares sense, that is, E_D[(g(x; D) − E[y|x])²], where E_D is the mean over all possible data sets D.

65 The above is written as:
E_D[(g(x; D) − E[y|x])²] = (E_D[g(x; D)] − E[y|x])² + E_D[(g(x; D) − E_D[g(x; D)])²].
The first term is the contribution of the bias and the second term is the contribution of the variance. For a finite D, there is a trade-off between the two terms: decreasing the bias tends to increase the variance and vice versa. This is known as the bias–variance dilemma. Using a complex model results in low bias but high variance, as one changes from one training set to another; using a simple model results in high bias but low variance.

66 LOGISTIC DISCRIMINATION
• Let an M-class task with classes ω_1, …, ω_M. In logistic discrimination, the logarithms of the likelihood ratios are modeled via linear functions, i.e., ln(P(ω_i|x) / P(ω_M|x)) = w_{i0} + w_i^T x, i = 1, 2, …, M−1.
• Taking into account that Σ_{i=1}^M P(ω_i|x) = 1, it can easily be shown that the above is equivalent to modeling the posterior probabilities as
P(ω_M|x) = 1 / (1 + Σ_{i=1}^{M−1} exp(w_{i0} + w_i^T x)),
P(ω_i|x) = exp(w_{i0} + w_i^T x) / (1 + Σ_{j=1}^{M−1} exp(w_{j0} + w_j^T x)), i = 1, 2, …, M−1.

67 For the two-class case it turns out that P(ω_2|x) = 1 / (1 + exp(w_0 + w^T x)) and P(ω_1|x) = exp(w_0 + w^T x) / (1 + exp(w_0 + w^T x)), i.e., the familiar logistic (sigmoid) model.

68
• The unknown parameters w_i, w_{i0} are usually estimated by maximum likelihood arguments.
• Logistic discrimination is a useful tool, since it allows linear modeling while ensuring that the posterior probability estimates sum to one.
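A minimal two-class MATLAB sketch (not from the slides) that fits the logistic model by gradient ascent on the log-likelihood; the data, step size and iteration count are assumed.

N = 100;
X = [randn(N,2) + 1, ones(N,1);            % class w1 (augmented with a 1)
     randn(N,2) - 1, ones(N,1)];           % class w2 (augmented with a 1)
t = [ones(N,1); zeros(N,1)];               % targets: 1 for w1, 0 for w2
w = zeros(3,1);  eta = 0.01;
for it = 1:500
    p = 1 ./ (1 + exp(-X*w));              % modeled P(w1|x) for every training vector
    w = w + eta * X' * (t - p);            % gradient of the log-likelihood
end
Pw1 = 1 ./ (1 + exp(-X*w));                % posterior estimates; P(w2|x) = 1 - P(w1|x), so they sum to one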

69 Support Vector Machines
• The goal: given two linearly separable classes, design the classifier that leaves the maximum margin from both classes.

70 Margin: each hyperplane is characterized by
• its direction in space, i.e., w;
• its position in space, i.e., w_0.
For EACH direction w, choose the hyperplane that leaves the SAME distance from the nearest points of each class. The margin is twice this distance.

71
• The distance of a point x from the hyperplane g(x) = w^T x + w_0 = 0 is given by z = |g(x)| / ||w||.
• Scale w, w_0 so that at the nearest points from each class the discriminant function is ±1, i.e., g(x) = 1 for the nearest points of ω_1 and g(x) = −1 for the nearest points of ω_2.
• Thus the margin is 1/||w|| + 1/||w|| = 2/||w||.
• Also, the following is valid: w^T x + w_0 ≥ 1, ∀x ∈ ω_1 and w^T x + w_0 ≤ −1, ∀x ∈ ω_2.

72 SVM (linear) classifier
• Minimize J(w) = (1/2) ||w||²
• subject to y_i (w^T x_i + w_0) ≥ 1, i = 1, 2, …, N, where y_i = +1 for ω_1 and y_i = −1 for ω_2.
• The above is justified since, by minimizing ||w||, the margin 2/||w|| is maximized.

73 The above is a quadratic optimization task subject to a set of linear inequality constraints. The Karush–Kuhn–Tucker (KKT) conditions state that the minimizer satisfies:
(1) ∂L/∂w = 0 ⇒ w = Σ_{i=1}^N λ_i y_i x_i
(2) ∂L/∂w_0 = 0 ⇒ Σ_{i=1}^N λ_i y_i = 0
(3) λ_i ≥ 0, i = 1, …, N
(4) λ_i [y_i (w^T x_i + w_0) − 1] = 0, i = 1, …, N
where L(w, w_0, λ) = (1/2)||w||² − Σ_{i=1}^N λ_i [y_i (w^T x_i + w_0) − 1] is the Lagrangian.

74 The solution: from the above it turns out that w = Σ_{i=1}^N λ_i y_i x_i, with only some of the λ_i different from zero.

75 Remarks: the Lagrange multipliers can be either zero or positive. Thus w = Σ_{i: λ_i > 0} λ_i y_i x_i, i.e., only the vectors corresponding to positive Lagrange multipliers contribute. From constraint (4) above, the vectors contributing to w satisfy y_i (w^T x_i + w_0) = 1.

76
–These vectors are known as SUPPORT VECTORS and are the closest vectors, from each class, to the classifier.
–Once w is computed, w_0 is determined from conditions (4).
–The optimal hyperplane classifier of a support vector machine is UNIQUE.
–Although the solution is unique, the resulting Lagrange multipliers are not necessarily unique.

77 Dual Problem Formulation
• The SVM formulation is a convex programming problem, with a convex cost function and a convex region of feasible solutions.
• Thus, its solution can be obtained from its dual problem, i.e.,
maximize L(w, w_0, λ)
subject to w = Σ_i λ_i y_i x_i, Σ_i λ_i y_i = 0, λ ≥ 0.

78 Combine the above to obtain:
maximize Σ_{i=1}^N λ_i − (1/2) Σ_{i,j} λ_i λ_j y_i y_j x_i^T x_j
subject to Σ_i λ_i y_i = 0, λ ≥ 0.
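A minimal MATLAB sketch (not from the slides) that solves this dual for the separable case with quadprog from the Optimization Toolbox; the training data are assumed and the labels are ±1.

N = 40;
X = [randn(N,2) + 3; randn(N,2) - 3];      % two well-separated classes
y = [ones(N,1); -ones(N,1)];
H = (y*y') .* (X*X');                      % H(i,j) = y_i y_j x_i' x_j
f = -ones(2*N, 1);                         % quadprog minimizes, so negate the dual objective
Aeq = y';  beq = 0;                        % equality constraint: sum_i lambda_i y_i = 0
lb  = zeros(2*N, 1);                       % lambda_i >= 0 (no upper bound in the separable case)
lambda = quadprog(H, f, [], [], Aeq, beq, lb, []);
w  = X' * (lambda .* y);                   % w = sum_i lambda_i y_i x_i
sv = find(lambda > 1e-5);                  % support vectors: positive multipliers
w0 = mean(y(sv) - X(sv,:) * w);            % from condition (4): y_i (w' x_i + w0) = 1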

79 Remarks: the training (support) vectors enter the problem only via inner products x_i^T x_j.
• Non-separable classes

80 In this case, there is no hyperplane such that y_i (w^T x_i + w_0) > 1 for all i. Recall that the margin is defined as twice the distance between the two hyperplanes w^T x + w_0 = 1 and w^T x + w_0 = −1.

81 The training vectors belong to one of three possible categories:
1) Vectors outside the band which are correctly classified, i.e., y_i (w^T x_i + w_0) > 1.
2) Vectors inside the band which are correctly classified, i.e., 0 ≤ y_i (w^T x_i + w_0) ≤ 1.
3) Vectors that are misclassified, i.e., y_i (w^T x_i + w_0) < 0.

82 All three cases above can be represented as y_i (w^T x_i + w_0) ≥ 1 − ξ_i, with
1) ξ_i = 0
2) 0 < ξ_i ≤ 1
3) ξ_i > 1
The ξ_i are known as slack variables.

83 The goal of the optimization is now two-fold: maximize the margin and minimize the number of patterns with ξ_i > 0. One way to achieve this is via the cost J(w, w_0, ξ) = (1/2)||w||² + C Σ_{i=1}^N I(ξ_i), where C is a constant and I(ξ_i) = 1 for ξ_i > 0 and 0 otherwise, which is not differentiable. In practice we use the approximation J(w, w_0, ξ) = (1/2)||w||² + C Σ_{i=1}^N ξ_i. Following a procedure similar to the separable case, we obtain the KKT conditions below.

84 KKT conditions:
• w = Σ_{i=1}^N λ_i y_i x_i and Σ_{i=1}^N λ_i y_i = 0
• C − μ_i − λ_i = 0, i = 1, …, N
• λ_i [y_i (w^T x_i + w_0) − 1 + ξ_i] = 0, i = 1, …, N
• μ_i ξ_i = 0, i = 1, …, N
• μ_i ≥ 0, λ_i ≥ 0, ξ_i ≥ 0.

85 The associated dual problem:
maximize Σ_{i=1}^N λ_i − (1/2) Σ_{i,j} λ_i λ_j y_i y_j x_i^T x_j
subject to 0 ≤ λ_i ≤ C, i = 1, …, N, and Σ_i λ_i y_i = 0.
• Remarks: the only difference from the separable-class case is the upper bound C on the λ_i in the constraints.

86 Training the SVM: a major problem is the high computational cost. To this end, decomposition techniques are used. The rationale behind them is the following:
• Start with an arbitrary data subset (working set) that can fit in memory. Perform optimization via a general-purpose optimizer.
• The resulting support vectors remain in the working set, while the others are replaced by new ones (outside the set) that severely violate the KKT conditions.
• Repeat the procedure. The above procedure guarantees that the cost function decreases.
• Platt's SMO algorithm chooses a working set of only two samples, so that an analytic optimization solution can be obtained.

87 Multi-class generalization: although theoretical generalizations exist, the most popular approach in practice is to treat the problem as M two-class problems (one against all).
• Example (figure on slide): observe the effect of different values of C in the case of non-separable classes.

88 Homework
• Due by the beginning of class 4/12/2016.
• Finish reading chapter 2.
• Read chapter 3 of the text.
• Do text computer experiments 3.1, 3.2, 3.3, 3.4, p. 146.
• For each problem, plot the data and the decision surface; do not forget the legend, title, axis labels, etc.
• Hint: for data generation, use the function mvnrnd from the Statistics Toolbox (see the chapter 2 sample programs); a sketch is given below.
• Next: chapter 4.
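A minimal MATLAB sketch (not from the slides) of the data-generation hint; the means, covariance and sample size are assumed values, not those of the experiments.

N = 200;
Sigma = [1 0.2; 0.2 1];
X1 = mvnrnd([0 0], Sigma, N);              % class w1 samples (Statistics Toolbox)
X2 = mvnrnd([3 3], Sigma, N);              % class w2 samples
figure; hold on;
plot(X1(:,1), X1(:,2), 'b.');
plot(X2(:,1), X2(:,2), 'r.');
legend('\omega_1', '\omega_2'); title('Generated data');
xlabel('x_1'); ylabel('x_2');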

89 Instructor Contact Information
• Andrew R. Cohen
• Associate Prof.
• Department of Electrical and Computer Engineering
• Drexel University
• 3120 – 40 Market St., Suite 110
• Philadelphia, PA 19104
• office phone: (215) 571 – 4358
• http://bioimage.coe.drexel.edu/courses
• http://bioimage.coe.drexel.edu
• acohen@coe.drexel.edu




