
1 Chapter 11 – Neural Networks COMP 540 4/17/2007 Derek Singer

2 Motivation Nonlinear functions of linear combinations of the inputs can accurately approximate a wide variety of functions.

3 Projection Pursuit Regression
– An additive model built on weighted sums of the inputs rather than on X itself: f(X) = sum_{m=1..M} g_m(w_m^T X)
– g and w are estimated using a flexible smoothing method
– Each g_m is a ridge function in R^p: it varies only in the direction w_m
– V_m = w_m^T X is the projection of X onto the unit vector w_m
– We pursue the w_m that fit the model well
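
As a minimal illustration (not from the slides), the PPR prediction f(X) = sum_m g_m(w_m^T X) could look like the NumPy sketch below, assuming the directions w_m and ridge functions g_m have already been fitted; the name ppr_predict is made up for the example.

```python
import numpy as np

def ppr_predict(X, directions, ridge_fns):
    """Projection pursuit regression prediction.

    X          : (n, p) array of inputs
    directions : list of M unit vectors w_m, each of shape (p,)
    ridge_fns  : list of M callables g_m mapping a 1-D array to a 1-D array
    Returns f(X) = sum_m g_m(w_m^T X).
    """
    f = np.zeros(X.shape[0])
    for w, g in zip(directions, ridge_fns):
        v = X @ w          # derived variable V_m = w_m^T X (projection onto w_m)
        f += g(v)          # ridge function applied to the projection
    return f
```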

4 Projection Pursuit Regression
– If M is arbitrarily large, PPR can approximate any continuous function in R^p arbitrarily well (universal approximator)
– As M increases, interpretability decreases; PPR is most useful for prediction
– M = 1: the single index model, easy to interpret and slightly more general than linear regression

5 Fitting PPR Model
– To estimate g given w, consider the M = 1 model
– With the derived variables v_i = w^T x_i, this becomes a 1-D smoothing problem
– Any scatterplot smoother (e.g. a smoothing spline) can be used
– Complexity constraints on g are needed to prevent overfitting
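
A hedged sketch of this step, using scipy's UnivariateSpline as the scatterplot smoother (any smoother would do); the name fit_ridge_function and the smoothing argument are assumptions made for illustration.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def fit_ridge_function(X, y, w, smoothing=None):
    """Given a fixed direction w, estimate g by 1-D smoothing of (w^T x_i, y_i)."""
    v = X @ w                      # derived variables v_i = w^T x_i
    order = np.argsort(v)          # the spline routine expects increasing abscissae
    # Note: exactly tied v values may need jittering before fitting.
    spline = UnivariateSpline(v[order], y[order], s=smoothing)
    return spline                  # callable g with an easily computed derivative
```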

6 Fitting PPR Model
– To estimate w given g, minimize the squared error with a Gauss-Newton search (the second derivative of g is discarded)
– This reduces to a weighted least squares regression on x to find w_new
– Target: w_old^T x_i + (y_i - g(w_old^T x_i)) / g'(w_old^T x_i)
– Weights: g'(w_old^T x_i)^2
– The added (w, g) pair compensates for the error of the current set of pairs
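
A sketch of the resulting weighted least squares update under the target and weights above (reconstructed from the standard Gauss-Newton derivation, so treat the exact expressions as an assumption); update_direction is a made-up name.

```python
import numpy as np

def update_direction(X, y, w_old, g, g_prime):
    """One Gauss-Newton step for w with g held fixed (weighted LS regression on X).

    Assumes g'(w_old^T x_i) is bounded away from zero.
    """
    v = X @ w_old
    gp = g_prime(v)
    target = v + (y - g(v)) / gp           # working response
    weights = gp ** 2                      # Gauss-Newton weights
    WX = X * weights[:, None]              # diag(weights) @ X
    # Weighted least squares: solve (X^T W X) w = X^T W target
    w_new = np.linalg.solve(X.T @ WX, WX.T @ target)
    return w_new / np.linalg.norm(w_new)   # keep w a unit vector
```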

7 Fitting PPR Model
– g and w are estimated iteratively until convergence (see the sketch after this list)
– For M > 1, the model is built in a forward stage-wise manner, adding a (g, w) pair at each stage
– Differentiable smoothing methods are preferable; local regression and smoothing splines are convenient
– The g_m from previous steps can be readjusted with backfitting, though it is unclear how this affects performance
– The w_m are usually not readjusted, but could be
– M is usually chosen by the forward stage-wise builder; cross-validation can also be used
– Its computational demands made PPR unpopular
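
Putting the two steps together, one forward stage-wise term could be fitted roughly as below; this reuses the hypothetical fit_ridge_function and update_direction sketches above and is not the author's code.

```python
import numpy as np

def fit_ppr_term(X, residual, n_iter=20, tol=1e-6):
    """Fit one (g, w) pair to the current residual by alternating the two steps."""
    n, p = X.shape
    w = np.random.randn(p)
    w /= np.linalg.norm(w)                  # small random unit start direction
    prev_sse = np.inf
    for _ in range(n_iter):
        g = fit_ridge_function(X, residual, w)                    # estimate g given w
        w = update_direction(X, residual, w, g, g.derivative())   # estimate w given g
        sse = np.sum((residual - g(X @ w)) ** 2)
        if prev_sse - sse < tol:            # stop when the squared error stops improving
            break
        prev_sse = sse
    return g, w
```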

8 From PPR to Neural Networks
– PPR: each ridge function is different; NN: every node has the same activation/transfer function
– PPR: each ridge function is optimized separately (as in additive models); NN: all of the nodes are optimized at each training step

9 Neural Networks
– Specifically feed-forward, back-propagation networks: inputs are fed forward, errors are propagated backward
– Made of layers of Processing Elements (PEs, aka perceptrons)
– Each PE computes g(w^T x), where g is a transfer function; g is fixed, unlike in PPR
– The output layer has D PEs, y_i, i = 1, ..., D
– Hidden layers, whose outputs are not directly observed, are optional
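
A tiny forward-pass sketch for a single hidden layer, to make the "layers of PEs computing g(w^T x)" concrete; bias terms are omitted and the identity output layer is an assumption made for brevity.

```python
import numpy as np

def forward(x, W_hidden, W_out, g=np.tanh):
    """Feed an input vector forward through one hidden layer of PEs.

    W_hidden : (M, p) weights of the M hidden PEs
    W_out    : (D, M) weights of the D output PEs
    Each hidden PE computes g(w^T x) with the same fixed transfer function g.
    """
    z = g(W_hidden @ x)        # hidden-layer activations
    y = W_out @ z              # output layer (linear outputs assumed here)
    return z, y
```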

10 Transfer functions
– A NN uses fixed parametric transfer functions, unlike PPR. Common ones include:
  – Threshold: f(v) = 1 if v > c, else -1
  – Sigmoid: f(v) = 1/(1 + e^-v), range (0, 1)
  – Tanh: f(v) = (e^v - e^-v)/(e^v + e^-v), range (-1, 1)
– Desirable properties: monotonic, nonlinear, bounded; easily calculated derivative; largest change at intermediate values
– Inputs must be scaled so the weighted sums fall in the transition region, not the saturation region (near the upper/lower bounds)
– Outputs must be scaled so their range falls within the range of the transfer function
[Figure: plots of the sigmoid, hyperbolic tangent, and threshold transfer functions]
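
The three transfer functions and the scaling advice, written out as a small NumPy sketch; standardize is an illustrative helper, not something from the slides.

```python
import numpy as np

def threshold(v, c=0.0):
    """Hard threshold: +1 above c, -1 otherwise (not differentiable at c)."""
    return np.where(v > c, 1.0, -1.0)

def sigmoid(v):
    """Logistic sigmoid, range (0, 1); derivative is f(v) * (1 - f(v))."""
    return 1.0 / (1.0 + np.exp(-v))

def tanh(v):
    """Hyperbolic tangent, range (-1, 1); derivative is 1 - tanh(v)^2."""
    return np.tanh(v)

def standardize(X):
    """Scale inputs to zero mean and unit variance so that weighted sums
    tend to fall in the transition region rather than the saturated tails."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```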

11 Hidden layers
– How many hidden layers, and how many PEs in each layer?
– Adding nodes and layers adds complexity to the model: beware overfitting and the extra computational demands
– A 3-layer network with a nonlinear transfer function is capable of any function mapping

12 Back Propagation
– Minimize the squared error function R = sum_i sum_k (y_ik - f_k(x_i))^2

13 Back Propagation

14 [Back-propagation equations; slide image not recovered]
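
Slides 13–14 appear to have shown the back-propagation equations as images. As a hedged reconstruction of standard back-propagation (not necessarily the slides' exact notation), a single gradient-descent step for a one-hidden-layer network with sigmoid hidden units, linear outputs, and squared error might look like this; names and shapes are assumptions.

```python
import numpy as np

def backprop_step(X, Y, A, B, lr=0.01):
    """One gradient-descent step for a single-hidden-layer network.

    X : (n, p) inputs, Y : (n, D) targets
    A : (M, p) hidden-layer weights, B : (D, M) output-layer weights
    """
    sigma = lambda v: 1.0 / (1.0 + np.exp(-v))
    Z = sigma(X @ A.T)                   # (n, M) hidden activations, fed forward
    F = Z @ B.T                          # (n, D) network outputs
    delta = -2.0 * (Y - F)               # (n, D) output-layer errors
    s = (delta @ B) * Z * (1.0 - Z)      # (n, M) errors propagated back to hidden PEs
    grad_B = delta.T @ Z                 # (D, M) gradient of squared error w.r.t. B
    grad_A = s.T @ X                     # (M, p) gradient w.r.t. A
    return A - lr * grad_A, B - lr * grad_B
```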

15 Learning parameters
– The error function contains many local minima
– If the learning rate is too large, the search may jump over local minima (producing a spiky error curve)
– Learning rates: use separate rates for each layer?
– Momentum
– Epoch size, number of epochs
– Initial weights: the final solution depends on the starting weights; we want the weighted sums of the inputs to fall in the transition region; small, random weights centered around zero work well
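
A small sketch of two of these choices, momentum and small random initial weights; the hyperparameter values are placeholders, not recommendations from the slides.

```python
import numpy as np

def init_weights(shape, scale=0.1, seed=0):
    """Small random weights centered at zero keep w^T x in the transition region."""
    rng = np.random.default_rng(seed)
    return scale * rng.standard_normal(shape)

def momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """Gradient descent with momentum: the running velocity smooths out the
    spiky error curve that a too-large learning rate tends to produce."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity
```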

16 Overfitting and weight decay
– If the network is too complex, overfitting is likely (very large weights)
– Could stop training before the training error is minimized, using a validation set to decide when to stop
– Weight decay is more explicit and analogous to ridge regression: it penalizes large weights
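
Weight decay amounts to adding a ridge-style penalty lambda * ||w||^2 to the error and hence its gradient to the error gradient; a minimal sketch, with the penalty strength as an assumed placeholder value.

```python
import numpy as np

def decayed_gradient(w, grad, lam=1e-3):
    """Weight decay: add the gradient of lam * ||w||^2 to the error gradient,
    shrinking large weights much as ridge regression shrinks coefficients."""
    return grad + 2.0 * lam * w
```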

17 Other issues
– Neural network training is O(NpML): N observations, p predictors, M hidden units, L training epochs; epoch size also has a linear effect on computation time; training can take minutes or hours
– Long list of parameters to adjust; training is a random walk in a vast space
– Unclear when to stop training, which can have a large impact on performance on the test set
– The guesswork in choosing the number of hidden nodes can be avoided with cascade correlation (analogous to PPR)

18 Cascade Correlation
– Automatically finds the network structure: start with a network with no hidden PEs and grow it one hidden PE at a time
– Just as PPR adds ridge functions that model the current error in the system, cascade correlation adds PEs that model the current error in the system

19 Cascade Correlation
– Train the initial network using any algorithm until the error converges or falls under a specified bound
– Take one hidden PE and connect all inputs and all currently existing PEs to it

20 Train the new hidden PE to maximally correlate with the current network error, using gradient ascent rather than descent
– Score: S = sum_o | sum_p (z_p - z_bar)(E_{p,o} - E_bar_o) |
– O = number of output units, P = number of training patterns
– z_p is the new hidden PE's output for pattern p
– J = number of inputs and hidden PEs connected to the new hidden PE
– E_{p,o} is the residual error on pattern p observed at output unit o; z_bar and E_bar_o are the means of z and E over all training patterns
– The +/- sign appearing in the gradient is the sign of the term inside the absolute-value brackets
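
A sketch of the candidate score S under the reconstruction above; candidate_correlation is a made-up name, and the returned signs are the per-output signs that feed the gradient-ascent step.

```python
import numpy as np

def candidate_correlation(z, E):
    """Cascade-correlation score S for a candidate hidden PE.

    z : (P,) candidate outputs over the P training patterns
    E : (P, O) residual errors at the O output units
    S = sum_o | sum_p (z_p - z_bar)(E_{p,o} - E_bar_o) |
    """
    zc = z - z.mean()
    Ec = E - E.mean(axis=0)
    per_output = zc @ Ec                   # (O,) covariance-like term per output unit
    return np.abs(per_output).sum(), np.sign(per_output)
```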

21 Cascade Correlation
– Freeze the weights from the inputs and pre-existing hidden PEs to the new hidden PE
– Weights between inputs/hidden PEs and the output PEs remain live
– Repeat the cycle of training with the modified network until the error converges or falls below the specified bound
[Figure: network diagram; box = frozen weight, X = live weight]

22 Cascade Correlation
– No need to guess the architecture
– Each hidden PE sees a distinct problem and learns its solution quickly; hidden PEs in backprop networks "engage in complex dance" (Fahlman)
– Fewer PEs are trained at each epoch, and the outputs of hidden PEs can be cached once their weights are frozen
– Reference: "The Cascade-Correlation Learning Architecture", Scott E. Fahlman and Christian Lebiere

