Slide 1: Longin Jan Latecki, Temple University, firstname.lastname@example.org
Ch. 2: Linear Discriminants
Stephen Marsland, Machine Learning: An Algorithmic Perspective, CRC 2009.
Based on slides from Stephen Marsland, Romain Thibaux (regression slides), and Moshe Sipper.
Slide 2: McCulloch and Pitts Neurons
[Figure: a neuron with inputs x_1, ..., x_m, weights w_1, ..., w_m, summed activation h, and output o.]
- Greatly simplified biological neurons.
- Sum the weighted inputs.
- If the total exceeds some threshold, the neuron fires; otherwise it does not.
Slide 3: McCulloch and Pitts Neurons
o = 1 if h = w_1 x_1 + ... + w_m x_m > θ, and o = 0 otherwise, for some threshold θ.
- The weight w_j can be positive or negative: excitatory or inhibitory.
- Uses only a linear sum of the inputs.
- Uses a simple output instead of a pulse (spike train).
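To make the threshold rule concrete, here is a minimal MATLAB sketch of a single McCulloch and Pitts neuron; the particular inputs, weights, and threshold value are illustrative choices, not taken from the slides.

x = [1 0 1];          % inputs x_1, ..., x_m
w = [0.5 -0.3 0.8];   % weights: positive = excitatory, negative = inhibitory
theta = 1.0;          % threshold
h = sum(w .* x);      % linear sum of the weighted inputs
o = h > theta;        % fire (o = 1) only if the sum exceeds the threshold
disp(o)               % here h = 1.3 > 1.0, so the neuron fires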
Slide 4: Neural Networks
- Can put lots of McCulloch & Pitts neurons together and connect them up in any way we like.
- In fact, assemblies of these neurons are capable of universal computation: they can perform any computation that a normal computer can.
- We just have to solve for all the weights w_ij.
Slide 5: Training Neurons
- Adapting the weights is learning.
- How does the network know it is right? How do we adapt the weights to make the network right more often?
- We need a training set with target outputs, and a learning rule.
Slide 6: 2.2 The Perceptron
The perceptron is considered the simplest kind of feed-forward neural network.
Definition from Wikipedia: the perceptron is a binary classifier which maps its input x (a real-valued vector) to an output value f(x) (a single binary value):
f(x) = 1 if w · x + b > 0, and 0 otherwise,
where w is a vector of real-valued weights and b is the bias.
In order to not write b explicitly, we extend the input vector x by one more dimension that is always set to -1, e.g., x = (-1, x_1, ..., x_7) with x_0 = -1, and extend the weight vector to w = (w_0, w_1, ..., w_7). Then adjusting w_0 corresponds to adjusting b.
Slide 9: Perceptron Decision = Recall
Outputs are y_k = g( Σ_j w_jk x_j ), where g(h) = 1 if h > 0 and 0 otherwise.
For example, y = (y_1, ..., y_5) = (1, 0, 0, 1, 1) is a possible output.
We may have a different function g in the place of this threshold, as in (2.4) in the book.
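As a minimal MATLAB sketch of recall, with random weights standing in for trained ones, each output node simply thresholds its own weighted sum:

X = [-1 0 1; -1 1 1];   % two input vectors, each extended with x_0 = -1
W = randn(3, 5);        % illustrative weights w_jk: 3 inputs, 5 output nodes
Y = (X * W) > 0;        % y_k = g(sum_j w_jk x_j), g = threshold at zero
disp(Y)                 % each row is one output vector such as (1, 0, 0, 1, 1)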
Slide 10: Perceptron Learning = Updating the Weights
- We want to change the values of the weights.
- Aim: minimise the error at the output. If E = t - y, we want E to be 0.
- Use: w_ij ← w_ij + η (t_j - y_j) x_i, where x_i is the input, η is the learning rate, and (t_j - y_j) is the error.
Slide 11: Example 1: The Logical OR
[Figure: a perceptron with inputs -1, x_1, x_2 and weights w_0, w_1, w_2.]
Training table (inputs extended with x_0 = -1):
x_0  x_1  x_2   t
-1    0    0    0
-1    0    1    1
-1    1    0    1
-1    1    1    1
Initial values: w_0(0) = -0.05, w_1(0) = -0.02, w_2(0) = 0.02, and η = 0.25.
Take the first row of our training table:
y_1 = sign( -0.05×(-1) + (-0.02)×0 + 0.02×0 ) = sign(0.05) = 1
w_0(1) = -0.05 + 0.25×(0-1)×(-1) = 0.2
w_1(1) = -0.02 + 0.25×(0-1)×0 = -0.02
w_2(1) = 0.02 + 0.25×(0-1)×0 = 0.02
We continue with the new weights and the second row, and so on. We make several passes over the training data.
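The hand computation above can be reproduced with a short MATLAB sketch; the number of passes is an illustrative choice, and the threshold maps positive sums to 1 and everything else to 0:

X = [-1 0 0; -1 0 1; -1 1 0; -1 1 1];   % training inputs with x_0 = -1
t = [0; 1; 1; 1];                        % OR targets
w = [-0.05; -0.02; 0.02];                % initial weights w_0, w_1, w_2
eta = 0.25;                              % learning rate
for pass = 1:10                          % several passes over the training data
    for i = 1:4
        y = X(i,:) * w > 0;                    % perceptron output (0 or 1)
        w = w + eta * (t(i) - y) * X(i,:)';    % update rule
    end
end
disp(w')   % the learned weights classify all four OR rows correctly

After the first row of the first pass, w is (0.2, -0.02, 0.02), matching the values computed above.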
Slide 12: Decision boundary for the OR perceptron
[Figure: the learned decision line in the (x_1, x_2) plane, separating the input (0,0) from the other three OR inputs.]
Slide 25: Obstacle Avoidance with the Perceptron
[Figure: a robot's left and right sensors (LS, RS) connected to its left and right motors (LM, RM), with connection weights -0.6 and -0.01.]
Slide 26: 2.3 Linear Separability
Outputs are y = g(w · x), where w · x = ||w|| ||x|| cos θ and θ is the angle between the vectors x and w.
Slide 27: Geometry of Linear Separability
- The equation of a line is w_0 + w_1*x + w_2*y = 0; a point (x, y) satisfying it lies on the line.
- This equation is equivalent to w · x = (w_0, w_1, w_2) · (1, x, y) = 0.
- If w · x > 0, then the angle between w and x is less than 90 degrees, which means that x lies on the side of the line that w points to.
- Each output node of the perceptron tries to separate the training data into two classes (fire or no-fire) with a linear decision boundary, i.e., a straight line in 2D, a plane in 3D, and a hyperplane in higher dimensions.
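A small MATLAB check of this geometry, with an illustrative line and point:

w = [1 2 -1];    % the line 1 + 2x - y = 0, written as (w_0, w_1, w_2)
p = [1 0 3];     % the point (x, y) = (0, 3), written as (1, x, y)
if w * p' > 0
    disp('angle below 90 degrees: the point is on the side w points to')
else
    disp('the point is on the other side of the line (or on it)')
end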
Slide 28: Linear Separability: The Binary AND Function
[Figure: the four AND inputs plotted in the plane, with a straight line separating (1,1) from the other three points.]
Slide 29: Gradient Descent Learning Rule
Consider a linear unit without a threshold, with continuous output y (not just -1, 1):
y = w_0 + w_1 x_1 + ... + w_n x_n
Train the w_i's such that they minimise the squared error
E[w_1, ..., w_n] = ½ Σ_{d∈D} (t_d - y_d)²
where D is the set of training examples.
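The slide states the error but not its gradient; deriving it from the definitions above (a standard computation, not shown on the slide):

∂E/∂w_i = ∂/∂w_i [ ½ Σ_{d∈D} (t_d - y_d)² ] = Σ_{d∈D} (t_d - y_d) · (-x_{i,d}) = -Σ_{d∈D} (t_d - y_d) x_{i,d}

so the descent step w_i ← w_i - η ∂E/∂w_i becomes w_i ← w_i + η Σ_{d∈D} (t_d - y_d) x_{i,d}, with x_{0,d} = 1 for the bias weight.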
Slide 30: Supervised Learning
- Training and test data sets.
- Training set: input & target.
Slide 33: Incremental Stochastic Gradient Descent
Batch mode: gradient descent over the entire data D:
w = w - η ∇E_D[w], where E_D[w] = ½ Σ_{d∈D} (t_d - y_d)²
Incremental mode: gradient descent over individual training examples d:
w = w - η ∇E_d[w], where E_d[w] = ½ (t_d - y_d)²
Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is small enough.
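A minimal MATLAB sketch contrasting one batch step with one incremental pass; the toy data and learning rate are illustrative assumptions:

X = [ones(4,1) randn(4,2)];   % toy design matrix (bias column plus 2 inputs)
t = randn(4,1);               % toy targets
w = zeros(3,1); eta = 0.1;
y = X * w;                             % batch mode: one step using the
w_batch = w + eta * X' * (t - y);      % gradient over all of D at once
w_inc = w;                             % incremental mode: one step per example
for d = 1:size(X,1)
    y_d = X(d,:) * w_inc;
    w_inc = w_inc + eta * (t(d) - y_d) * X(d,:)';
end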
Slide 34: Gradient Descent Perceptron Learning
Gradient-Descent(training_examples, η)
Each training example is a pair of the form <(x_1, ..., x_n), t>, where (x_1, ..., x_n) is the vector of input values, t is the target output value, and η is the learning rate (e.g., 0.1).
- Initialize each w_i to some small random value.
- Until the termination condition is met, do:
  - For each <(x_1, ..., x_n), t> in training_examples, do:
    - Input the instance (x_1, ..., x_n) to the linear unit and compute the output y.
    - For each linear unit weight w_i, do:
      Δw_i = η (t - y) x_i
      w_i = w_i + Δw_i
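A runnable MATLAB version of this procedure; the OR data, epoch count, and learning rate are illustrative choices:

X = [0 0; 0 1; 1 0; 1 1];        % input vectors (x_1, ..., x_n)
t = [0; 1; 1; 1];                % target outputs (logical OR)
X = [ones(4,1) X];               % constant input 1 so w_0 plays the bias role
eta = 0.1;                       % learning rate
w = 0.01 * randn(3,1);           % initialize weights to small random values
for epoch = 1:200                % termination condition: fixed epoch count
    for d = 1:4
        y = X(d,:) * w;                        % linear unit output
        w = w + eta * (t(d) - y) * X(d,:)';    % delta w_i = eta*(t - y)*x_i
    end
end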
Slide 35: Limitations of the Perceptron
The Exclusive Or (XOR) function and linear separability:
A  B  Out
0  0  0
0  1  1
1  0  1
1  1  0
Slide 36: Limitations of the Perceptron
For XOR the weights would have to satisfy W_1 > 0, W_2 > 0, and W_1 + W_2 < 0 simultaneously. But the first two constraints force W_1 + W_2 > 0, a contradiction, so no single line can separate the XOR classes.
Slide 37: Limitations of the Perceptron?
[Figure: labels In1, In2, In3, A, B, C, Out.]
Slide 38: 2.4 Linear Regression
Given examples (x_i, t_i), predict the target value t for a new point x.
[Figure 1: scatter plot of temperature values, generated in MATLAB by: scatter(1:20,10+(1:20)+2*randn(1,20),'k','filled'); a=axis; a(3)=0; axis(a);]
Slide 47: Summary
- Perceptron and regression optimise the same target function.
- In both cases we compute the gradient (the vector of partial derivatives).
- In the case of regression, we set the gradient to zero and solve for the vector w. The solution is a closed formula for w at which the target function attains its global minimum.
- In the case of the perceptron, we iteratively move toward the minimum by going in the direction of minus the gradient. We do this incrementally, making small steps for each data point.
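As a concrete illustration of the regression case, a minimal MATLAB sketch reusing the synthetic temperature data from Figure 1:

x = (1:20)';                  % inputs, as in Figure 1
t = 10 + x + 2*randn(20,1);   % noisy targets, matching the scatter command
X = [ones(20,1) x];           % design matrix with a bias column
w = (X' * X) \ (X' * t);      % gradient set to zero: w = (X'X)^(-1) X' t
disp(w')                      % roughly (10, 1), up to the noise
% the perceptron case instead iterates w = w + eta*(t_d - y_d)*x_d per point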
Slide 48: Homework 1
(Ch ) Implement the perceptron in MATLAB and test it on the Pima Indians Diabetes dataset from the UCI Machine Learning Repository.
(Ch ) Implement linear regression in MATLAB and apply it to the auto-mpg dataset.
Slide 49: From Ch. 3: Testing
- How do we evaluate our trained network? We can't just compute the error on the training data: that is unfair, and we can't see overfitting.
- Keep a separate testing set; after training, evaluate on this test set.
- How do we check for overfitting? We can't use the training or testing sets.
Slide 50: Validation
- Keep a third set of data for this purpose.
- Train the network on the training data; periodically, stop and evaluate on the validation set.
- After training has finished, test on the test set.
- This is getting expensive in data!
Slide 51: Hold Out Cross Validation
[Figure: the inputs and targets partitioned into training and validation blocks.]
Slide 52: Hold Out Cross Validation
- Partition the training data into K subsets.
- Train on K-1 of the subsets, validate on the K-th.
- Repeat for a new network, leaving out a different subset each time.
- Choose the network that has the best validation error.
- This trades off data for computation. Extreme version: leave-one-out.
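A minimal MATLAB sketch of the bookkeeping, assuming N is divisible by K; the training and evaluation steps are placeholders for your own network code:

N = 100; K = 5;
idx = randperm(N);                       % shuffle the data indices
folds = reshape(idx, K, []);             % K subsets with N/K indices each
for k = 1:K
    val = folds(k, :);                   % validate on the k-th subset
    rest = folds(setdiff(1:K, k), :);    % train on the other K-1 subsets
    train = rest(:)';
    % ... train a new network on train, record its error on val ...
end
% keep the network with the best validation error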
Slide 53: Early Stopping
- When should we stop training?
- Could set a minimum training error: danger of overfitting.
- Could set a number of epochs: danger of underfitting or overfitting.
- Can use the validation set: measure the error on the validation set during training.
Slide 54: Early Stopping
[Figure: error versus number of epochs; the training error keeps decreasing while the validation error starts to rise, marking the time to stop training.]
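A minimal MATLAB sketch of the stopping rule in the figure; trainstep and validerror are hypothetical placeholders for your own routines, with a random stand-in so the sketch executes:

best = Inf; rising = 0;
for epoch = 1:1000
    % w = trainstep(w);       % hypothetical: one pass over the training set
    % e = validerror(w);      % hypothetical: error on the validation set
    e = rand;                 % stand-in so the sketch runs
    if e < best
        best = e; rising = 0; % validation error still falling
    else
        rising = rising + 1;  % validation error going up
    end
    if rising >= 10
        break                 % time to stop training
    end
end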