Concept Learning Algorithms
Concept learning algorithms come from many different theoretical backgrounds and motivations; some behaviors relate to human learning, and some algorithms are biologically inspired while others are not.
[Figure: a spectrum between "utilitarian (just get a good result)" and "biologically inspired" approaches, with tree learners, nearest neighbor, and neural networks placed along it.]
© Jude Shavlik & David Page 2010, CS 760 – Machine Learning (UW-Madison)
Today's Topics
- Perceptrons
- Artificial Neural Networks (ANNs)
- Backpropagation
- Weight Space
Connectionism
- Perceptrons (Rosenblatt, 1957): among the earliest work in machine learning
- Interest died out in the 1960s (Minsky & Papert's book)
[Figure: unit i receiving connections from units j, k, and l with weights w_ij, w_ik, w_il.]
  Output_i = F( w_ij × output_j + w_ik × output_k + w_il × output_l )
Perceptron as Classifier
Output for example X is sign(W·X), where sign is −1 or +1 (or use a threshold and 0/1).
Candidate hypotheses: real-valued weight vectors.
Training: update W for each misclassified example X (target class t, predicted o) by
  W_i ← W_i + η (t − o) X_i
where η is the learning-rate parameter.
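A minimal sketch of this training rule in Python; the dataset (logical AND over ±1 inputs with a bias feature), learning rate, and epoch cap are illustrative assumptions, not from the slides:

import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=100):
    """Perceptron training rule: W_i <- W_i + eta * (t - o) * X_i."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x, target in zip(X, t):
            o = 1 if np.dot(w, x) > 0 else -1      # sign(W . X)
            if o != target:
                w += eta * (target - o) * x        # update only on misclassified examples
                mistakes += 1
        if mistakes == 0:                          # converged (data is linearly separable)
            break
    return w

# Assumed toy data: logical AND, with a constant bias feature in the first column
X = np.array([[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]], dtype=float)
t = np.array([-1, -1, -1, 1])
print(train_perceptron(X, t))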
Gradient Descent for the Perceptron
(Assume no threshold for now, and start with a common error measure.)
  Error ≡ ½ (t − o)²
where o is the network's output and t is the teacher's answer (a constant with respect to the weights).
  ΔW_k = −η ∂E/∂W_k
  ∂E/∂W_k = (t − o) ∂(t − o)/∂W_k = −(t − o) ∂o/∂W_k
Remember: o = W·X
Continuation of Derivation
  ∂E/∂W_k = −(t − o) ∂( Σ_k w_k x_k )/∂W_k    (stick in the formula for the output)
          = −(t − o) x_k
So ΔW_k = η (t − o) x_k: the Perceptron Rule, also known as the delta rule and other names (with small variations in the calculation).
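A sketch of this unthresholded (delta-rule) update as gradient descent on the squared error; the data and learning rate below are assumed for illustration:

import numpy as np

def delta_rule(X, t, eta=0.01, epochs=500):
    """Gradient descent on E = 1/2 (t - o)^2 with linear output o = W . X."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = np.dot(w, x)                  # unthresholded output
            w += eta * (target - o) * x       # Delta W_k = eta * (t - o) * x_k
    return w

# Assumed toy data with slightly noisy targets
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
t = np.array([2.0, -1.0, 1.1, 2.9])
print(delta_rule(X, t))                       # weights end up roughly [2, -1]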
As it looks in your text (processing all data at once)…
Linear Separability
Consider a perceptron; its output is
  1 if W_1 X_1 + W_2 X_2 + … + W_n X_n > Θ
  0 otherwise
In terms of feature space (two inputs):
  W_1 X_1 + W_2 X_2 = Θ
  X_2 = (Θ − W_1 X_1) / W_2 = (−W_1 / W_2) X_1 + Θ / W_2    (compare y = mx + b)
Hence, a perceptron can only classify examples if a "line" (hyperplane) can separate them.
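A tiny sketch of this boundary in code; the weights and threshold are assumed values chosen only to illustrate the slope/intercept reading of the decision line:

# Decision boundary of a 2-input perceptron, read as the line X2 = m*X1 + b
w1, w2, theta = 1.0, 2.0, 1.0           # assumed weights and threshold

m = -w1 / w2                             # slope     = -W1/W2
b = theta / w2                           # intercept = Theta/W2

def classify(x1, x2):
    return 1 if w1 * x1 + w2 * x2 > theta else 0

print(m, b, classify(2.0, 0.0), classify(-1.0, 0.0))   # two points on opposite sides of the line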
Perceptron Convergence Theorem (Rosenblatt, 1957)
A perceptron has no hidden units. If a set of examples is learnable, the perceptron training rule will eventually find the necessary weights. However, a perceptron can only learn/represent linearly separable datasets.
The (Infamous) XOR Problem
Exclusive OR (XOR) is not linearly separable:

  X1  X2 | Output
   0   0 |   0
   0   1 |   1
   1   0 |   1
   1   1 |   0

[Figure: the four examples a–d plotted in (X1, X2) space; no single line separates the 1s from the 0s.]

A neural-network solution: a small network of threshold units, with weights of 1 and −1 and Θ = 0 for all nodes, computes XOR.
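A hedged sketch of one hand-wired network of step units that computes XOR; these particular weights and thresholds are an illustrative construction (OR and AND hidden units), not necessarily the ones drawn on the slide:

def step(x):
    return 1 if x > 0 else 0

def xor_net(x1, x2):
    h_or  = step(x1 + x2 - 0.5)        # hidden unit behaving as OR
    h_and = step(x1 + x2 - 1.5)        # hidden unit behaving as AND
    return step(h_or - h_and - 0.5)    # output = OR and not AND

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))  # last column: 0 1 1 0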
The Need for Hidden Units
If there is one layer of enough hidden units (possibly 2^N for Boolean functions, where N is the number of input units), the input can be recoded.
This recoding allows any mapping to be represented (Minsky & Papert).
Question: how do we provide an error signal to the interior units?
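A sketch of this "one hidden unit per input pattern" recoding for Boolean functions; the ±1 weight scheme and the threshold N − 0.5 are an assumed construction used only to illustrate why 2^N hidden units suffice:

from itertools import product

def recoding_network(positive_patterns, n):
    """One threshold hidden unit per input pattern; the output unit ORs the
    hidden units that correspond to positive patterns."""
    patterns = list(product((0, 1), repeat=n))

    def hidden_fires(pattern, x):
        # Weight +1 where the pattern has a 1, -1 where it has a 0, inputs as +/-1:
        # the weighted sum equals n only when x matches this pattern exactly.
        net = sum((1 if p else -1) * (1 if xi else -1) for p, xi in zip(pattern, x))
        return net > n - 0.5

    def output(x):
        return 1 if any(hidden_fires(p, x) for p in patterns if p in positive_patterns) else 0

    return output

xor = recoding_network({(0, 1), (1, 0)}, n=2)
print([xor(x) for x in product((0, 1), repeat=2)])   # [0, 1, 1, 0]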
Hidden Units: One View
Hidden units allow a system to create its own internal representation, for which problem solving is easy.
[Figure: a multilayer network; part of it is labeled "a perceptron".]
Advantages of Neural Networks
- Provide the best predictive accuracy for some problems (though being supplanted by SVMs?)
- Can represent a rich class of concepts
[Figure: example outputs, including positive/negative classifications and probabilistic predictions such as "Saturday: 40% chance of rain" and "Sunday: 25% chance of rain".]
Overview of ANNs
[Figure: a layered network of input units, hidden units, and output units, with weights on the links, an error signal at the output, and a recurrent link.]
Backpropagation
Backpropagation
- Backpropagation involves a generalization of the perceptron rule
- Rumelhart, Parker, and Le Cun independently developed (circa 1985) a technique for determining how to adjust the weights of interior ("hidden") units (related earlier work: Bryson & Ho, 1969; Werbos, 1974)
- The derivation involves partial derivatives (hence, the threshold function must be differentiable)
[Figure: the error signal ∂E/∂w_{i,j} fed back to an interior weight.]
Weight Space
- Given a neural-network layout, the weights are free parameters that define a space
- Each point in this weight space specifies a network
- Associated with each point is an error rate, E, over the training data
- Backprop performs gradient descent in weight space
Gradient Descent in Weight Space
The Gradient-Descent Rule
  ∇E(w) = [ ∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_N ]    (the "gradient")
This is an (N+1)-dimensional vector (i.e., the 'slope' in weight space).
Since we want to reduce errors, we want to go "downhill."
We'll take a finite step in weight space:
  Δw = −η ∇E(w)    or    Δw_i = −η ∂E/∂w_i    ("delta" = change to w)
[Figure: the error surface E over weights w_1, w_2, with a downhill step Δw.]
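A minimal sketch of such finite steps on a toy error surface; the surface, learning rate, and step count are assumptions chosen only to illustrate the update Δw = −η∇E(w):

import numpy as np

def error(w):
    return 0.5 * np.sum(w ** 2)           # toy bowl-shaped error surface

def gradient(w):
    return w                               # dE/dw for the surface above

eta = 0.1
w = np.array([2.0, -1.0])
for _ in range(50):
    w = w - eta * gradient(w)              # Delta w = -eta * grad E(w)
print(w, error(w))                         # w has moved close to the minimum at the origin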
“On Line” vs. “Batch” Backprop
- Technically, we should look at the error gradient for the entire training set before taking a step in weight space ("batch" Backprop)
- However, as presented, we take a step after each example ("on-line" Backprop)
  - Much faster convergence
  - Can reduce overfitting (since on-line Backprop is "noisy" gradient descent)
“On Line” vs. “Batch” BP (continued)
- BATCH: add up the Δw vectors for every training example, then 'move' in weight space
- ON-LINE: 'move' after each example (aka stochastic gradient descent)
- Note: Δw_i,BATCH ≠ Δw_i,ON-LINE for i > 1
- Final locations in weight space need not be the same for BATCH and ON-LINE
[Figure: two trajectories through weight space on the error surface, one batch and one on-line, ending at different points w*.]
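A hedged sketch contrasting the two update schedules, shown for the simple linear (delta-rule) case so the difference is easy to see; the data shapes and learning rate are assumed:

import numpy as np

def batch_epoch(w, X, t, eta):
    """Accumulate Delta w over all examples, then take a single step."""
    delta_w = np.zeros_like(w)
    for x, target in zip(X, t):
        o = np.dot(w, x)
        delta_w += eta * (target - o) * x
    return w + delta_w

def online_epoch(w, X, t, eta):
    """Take a step after every example (stochastic gradient descent)."""
    for x, target in zip(X, t):
        o = np.dot(w, x)
        w = w + eta * (target - o) * x    # each step already sees the updated w
    return w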
Need Derivatives: Replace Step (Threshold) by Sigmoid
Individual units: output_i = F( Σ_j weight_{i,j} × output_j )
where F(input_i) = 1 / (1 + e^(−(input_i − bias_i)))
[Figure: the sigmoid curve of output vs. input, shifted by the bias.]
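A small sketch of one such sigmoid unit; treating the bias as a subtracted shift follows the formula above, and the example weights and inputs are assumed:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unit_output(weights, inputs, bias):
    """output_i = F(sum_j w_ij * out_j), with F shifted by the unit's bias."""
    net = np.dot(weights, inputs)
    return sigmoid(net - bias)

print(unit_output(np.array([0.5, -0.3]), np.array([1.0, 2.0]), bias=0.1))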
Differentiating the Logistic Function
out_i = F(weighted in) = 1 / (1 + e^(−Σ_j (w_{j,i} × out_j)))
F′(weighted in) = out_i (1 − out_i)
[Figure: the logistic curve, crossing 1/2 at a weighted input of 0, and its derivative.]
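A quick numerical check of this derivative identity (a sketch; the test point is arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.7                                                    # arbitrary weighted input
analytic = sigmoid(z) * (1.0 - sigmoid(z))                 # out * (1 - out)
numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6   # central-difference estimate
print(analytic, numeric)                                   # the two agree closely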
BP Calculations
Assume one layer of hidden units (the standard topology, with layers k → j → i).
  Error ≡ ½ (Teacher_i − Output_i)²
        = ½ (Teacher_i − F( Σ_j [w_{i,j} × Output_j] ))²
        = ½ (Teacher_i − F( Σ_j [w_{i,j} × F( Σ_k w_{j,k} × Output_k )] ))²
Determine ∂Error/∂w_{i,j} (use equation 2) and ∂Error/∂w_{j,k} (use equation 3);
recall Δw_{x,y} = −η (∂E/∂w_{x,y}).
* See Table 4.2 in Mitchell for the results.
Derivation in Mitchell
Some Notation
By the chain rule (since w_{ji} influences the rest of the network only by its influence on net_j)…
Also remember this for later – we'll call it −δ_j
Remember: net_k = w_{k1} x_{k1} + … + w_{kN} x_{kN}
Remember that o_j is x_{kj}: the output from j is the input to k
Using BP to Train ANNs
1. Initialize the weights & biases to small random values (e.g., in [−0.3, 0.3])
2. Randomize the order of the training examples; for each one, do:
   a. Propagate activity forward to the output units (layers k → j → i):
        out_i = F( Σ_j w_{i,j} × out_j )
Using BP to Train ANNs (continued)
   b. Compute the "deviation" for the output units:
        δ_i = F′(net_i) × (Teacher_i − out_i)
   c. Compute the "deviation" for the hidden units:
        δ_j = F′(net_j) × Σ_i ( w_{i,j} × δ_i )
   d. Update the weights:
        Δw_{i,j} = η × δ_i × out_j
        Δw_{j,k} = η × δ_j × out_k
      (Here F′(net_i) = ∂F(net_i)/∂net_i.)
Using BP to Train ANNs (continued)
3. Repeat until the training-set error rate is small enough (or until the tuning-set error rate begins to rise – see a later slide)
   - Should use "early stopping" (i.e., minimize error on the tuning set; more details later)
4. Measure accuracy on the test set to estimate generalization (future accuracy)
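A compact sketch putting the preceding procedure together for one hidden layer of sigmoid units with on-line updates; the layer sizes, learning rate, fixed epoch cap (in place of early stopping), bias-as-extra-input convention, and the XOR training data are illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
eta = 0.5

# 2 inputs (+ bias), 2 hidden units (+ bias), 1 output unit
W_jk = rng.uniform(-0.3, 0.3, size=(2, 3))    # hidden-layer weights w_{j,k}
W_ij = rng.uniform(-0.3, 0.3, size=(1, 3))    # output-layer weights w_{i,j}

# Assumed training data: XOR with 0/1 targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0.0], [1.0], [1.0], [0.0]])

for epoch in range(5000):                      # fixed epoch cap instead of early stopping
    for x, t in zip(X, T):
        # Step 2a: forward pass (bias handled as an extra input fixed at 1)
        out_k = np.append(x, 1.0)
        out_j = sigmoid(W_jk @ out_k)
        out_j_b = np.append(out_j, 1.0)
        out_i = sigmoid(W_ij @ out_j_b)

        # Steps 2b/2c: deviations, using F'(net) = out * (1 - out) for the sigmoid
        delta_i = out_i * (1 - out_i) * (t - out_i)
        delta_j = out_j * (1 - out_j) * (W_ij[:, :-1].T @ delta_i)

        # Step 2d: weight updates, Delta w = eta * delta * (input feeding that weight)
        W_ij += eta * np.outer(delta_i, out_j_b)
        W_jk += eta * np.outer(delta_j, out_k)

for x in X:                                    # inspect the trained network's outputs
    h = sigmoid(W_jk @ np.append(x, 1.0))
    print(x, sigmoid(W_ij @ np.append(h, 1.0)))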
Advantages of Neural Networks
- Universal representation (provided enough hidden units)
- Less greedy than tree learners
- In practice, good for problems with numeric inputs, and can also handle numeric outputs
- PHD: for many years, the best protein secondary-structure predictor
Disadvantages
- Models are not very comprehensible
- Long training times
- Very sensitive to the number of hidden units… as a result, largely being supplanted by SVMs (which take a very different approach to getting non-linearity)
Looking Ahead
- The perceptron rule can also be thought of as modifying weights on data points rather than on features
- Instead of processing all the data at once (batch) vs. one example at a time, we could imagine processing 2 data points at a time, adjusting their relative weights based on their relative errors
- This is what Platt's SMO does (the SVM implementation in Weka)
Backup Slide to help with Derivative of Sigmoid