Neural networks (1) Traditional multi-layer perceptrons


1 Neural networks (1) Traditional multi-layer perceptrons

2 Neural network
K-class classification: K nodes in the top layer.
Continuous outcome: a single node in the top layer.

3 $Z_m$ are created from linear combinations of the inputs; $Y_k$ is modeled as a function of linear combinations of the $Z_m$.
For regression, typically K = 1:
$Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1, \dots, M$
$Y = \beta_0 + \beta^T Z$
K-class classification:
$Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1, \dots, M$
$T_k = \beta_{0k} + \beta_k^T Z, \quad k = 1, \dots, K$
$Y_k = e^{T_k} / \sum_{l=1}^{K} e^{T_l}, \quad k = 1, \dots, K$
Or in more general terms: $f_k(X) = g_k(T)$
regression: $g_k(T) = T_k$, $K = 1$
classification: $g_k(T) = e^{T_k} / \sum_{l=1}^{K} e^{T_l}$
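A minimal Python/NumPy sketch of the forward pass defined above, assuming p inputs, M hidden units, and K classes with a softmax output; all sizes and variable names are illustrative, not from the slides.

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, alpha0, alpha, beta0, beta):
    """x: (p,); alpha0: (M,); alpha: (M, p); beta0: (K,); beta: (K, M)."""
    z = sigmoid(alpha0 + alpha @ x)      # Z_m = sigma(alpha_0m + alpha_m^T X)
    t = beta0 + beta @ z                 # T_k = beta_0k + beta_k^T Z
    e = np.exp(t - t.max())              # softmax, shifted for numerical stability
    return z, t, e / e.sum()             # Y_k = e^{T_k} / sum_l e^{T_l}

# Illustrative sizes: p = 4 inputs, M = 3 hidden units, K = 2 classes.
rng = np.random.default_rng(0)
p, M, K = 4, 3, 2
alpha0, alpha = rng.normal(size=M), rng.normal(size=(M, p))
beta0, beta = rng.normal(size=K), rng.normal(size=(K, M))
z, t, y = forward(rng.normal(size=p), alpha0, alpha, beta0, beta)
print(y.sum())   # class probabilities sum to 1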

4 Neural network
An old activation function.

5 Neural network
Other activation functions are used; we will continue with the sigmoid function for this discussion.

6 Neural network
A simple network with linear functions ("bias" = intercept):
y1 = +1 if x1 + x2 ≥ 0
y2 = +1 if x1 + x2 − 1.5 ≥ 0
z1 = +1 if and only if y1 = +1 and y2 = −1
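A small sketch of this network, assuming inputs and unit outputs take values in {-1, +1} and that a unit outputs +1 when its linear expression is nonnegative; under those assumptions z1 computes the XOR of the two inputs.

def threshold(v):
    return 1 if v >= 0 else -1

def simple_net(x1, x2):
    y1 = threshold(x1 + x2)                     # -1 only when both inputs are -1
    y2 = threshold(x1 + x2 - 1.5)               # +1 only when both inputs are +1
    z1 = 1 if (y1 == 1 and y2 == -1) else -1    # +1 iff exactly one input is +1 (XOR)
    return z1

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, "->", simple_net(x1, x2))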

7 Neural network

8 Neural network

9 Fitting Neural Networks
Set of parameters (weights): $\theta = \{\alpha_{0m}, \alpha_m : m = 1, \dots, M\}$ and $\{\beta_{0k}, \beta_k : k = 1, \dots, K\}$
Objective function:
Regression (typically K = 1): sum of squared errors, $R(\theta) = \sum_{k=1}^{K} \sum_{i=1}^{N} (y_{ik} - f_k(x_i))^2$
Classification: cross-entropy (deviance), $R(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log f_k(x_i)$
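The two objective functions as short Python functions; a minimal sketch, assuming Y and F are N x K arrays of targets (one-hot for classification) and fitted values/probabilities.

import numpy as np

def squared_error(Y, F):
    """Regression: R(theta) = sum_i sum_k (y_ik - f_k(x_i))^2."""
    return np.sum((Y - F) ** 2)

def cross_entropy(Y, F, eps=1e-12):
    """Classification: R(theta) = -sum_i sum_k y_ik * log f_k(x_i); Y one-hot, F probabilities."""
    return -np.sum(Y * np.log(F + eps))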

10 Fitting Neural Networks
R(θ) is minimized by gradient descent, known as "back-propagation".
Middle-layer values for each data point: $z_{mi} = \sigma(\alpha_{0m} + \alpha_m^T x_i)$, with $z_i = (z_{1i}, \dots, z_{Mi})$.
We use the squared-error loss for demonstration: $R(\theta) = \sum_{i=1}^{N} R_i = \sum_{i=1}^{N} \sum_{k=1}^{K} (y_{ik} - f_k(x_i))^2$.

11 Fitting Neural Networks
Rules of derivatives used here:
Sum rule & constant multiple rule: (a f(x) + b g(x))' = (a f(x))' + (b g(x))' = a f'(x) + b g'(x)
Chain rule: (f(g(x)))' = f'(g(x)) g'(x)
Note: we are going to take derivatives with respect to the coefficients α and β.

12 Fitting Neural Networks
Derivatives (i is the observation index):
$\partial R_i / \partial \beta_{km} = -2 (y_{ik} - f_k(x_i))\, g_k'(\beta_k^T z_i)\, z_{mi}$
$\partial R_i / \partial \alpha_{ml} = -\sum_{k=1}^{K} 2 (y_{ik} - f_k(x_i))\, g_k'(\beta_k^T z_i)\, \beta_{km}\, \sigma'(\alpha_m^T x_i)\, x_{il}$
Descent along the gradient, with learning rate $\gamma_r$:
$\beta_{km}^{(r+1)} = \beta_{km}^{(r)} - \gamma_r \sum_{i=1}^{N} \partial R_i / \partial \beta_{km}^{(r)}$
$\alpha_{ml}^{(r+1)} = \alpha_{ml}^{(r)} - \gamma_r \sum_{i=1}^{N} \partial R_i / \partial \alpha_{ml}^{(r)}$

13 Fitting Neural Networks
By definition, write $\partial R_i / \partial \beta_{km} = \delta_{ki}\, z_{mi}$ and $\partial R_i / \partial \alpha_{ml} = s_{mi}\, x_{il}$, where $\delta_{ki} = -2 (y_{ik} - f_k(x_i))\, g_k'(\beta_k^T z_i)$ and $s_{mi} = \sigma'(\alpha_m^T x_i) \sum_{k=1}^{K} \beta_{km}\, \delta_{ki}$; these are the "errors" at the output and hidden layers, and the second relation is the back-propagation equation.

14 Fitting Neural Networks
General workflow of back-propagation:
Forward: fix the weights and compute $f_k(x_i)$.
Backward: compute $\delta_{ki}$; back-propagate to compute $s_{mi}$; use both $\delta_{ki}$ and $s_{mi}$ to compute the gradients; update the weights.
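A sketch of one back-propagation sweep for the regression case (K = 1, identity output, squared-error loss), following the forward/backward workflow above; the simulated data, layer sizes, and learning rate are illustrative assumptions.

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(X, y, alpha0, alpha, beta0, beta, lr):
    # Forward pass: fix the weights and compute f(x_i) for every observation.
    Z = sigmoid(alpha0 + X @ alpha.T)          # (N, M) hidden values z_mi
    f = beta0 + Z @ beta                       # (N,)  outputs f(x_i)
    # Backward pass: output-layer "errors" delta_i, then hidden-layer s_mi.
    delta = -2.0 * (y - f)                     # (N,)
    s = Z * (1 - Z) * np.outer(delta, beta)    # (N, M); sigma'(v) = sigma(v)(1 - sigma(v))
    # Gradients, summed over observations.
    grad_beta0, grad_beta = delta.sum(), Z.T @ delta
    grad_alpha0, grad_alpha = s.sum(axis=0), s.T @ X
    # Gradient-descent update.
    beta0 -= lr * grad_beta0
    beta -= lr * grad_beta
    alpha0 -= lr * grad_alpha0
    alpha -= lr * grad_alpha
    return alpha0, alpha, beta0, beta, np.sum((y - f) ** 2)

# Simulated data: N = 100 observations, p = 3 inputs, M = 5 hidden units.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
alpha0, alpha = rng.normal(scale=0.1, size=5), rng.normal(scale=0.1, size=(5, 3))
beta0, beta = 0.0, rng.normal(scale=0.1, size=5)
for _ in range(200):
    alpha0, alpha, beta0, beta, R = backprop_step(X, y, alpha0, alpha, beta0, beta, lr=0.001)
print(R)   # squared-error loss after 200 sweeps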

15 Fitting Neural Networks

16 Fitting Neural Networks
Parallel computing can be used: each hidden unit passes and receives information only to and from units that share a connection.
Online training: the fitting scheme allows the network to handle very large training sets, and also to update the weights as new observations come in (see the sketch after this list).
Training a neural network is an "art": the model is generally over-parametrized, and the optimization problem is non-convex and unstable.
A neural network model is a black box and hard to interpret directly.
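As referenced above, a minimal sketch of online training, reusing backprop_step and the simulated data from the back-propagation sketch on slide 14: the weights are updated after each observation rather than after a full pass over the training set.

# Online pass: one observation (or one newly arriving observation) at a time.
for i in range(X.shape[0]):
    xi, yi = X[i:i + 1], y[i:i + 1]
    alpha0, alpha, beta0, beta, _ = backprop_step(xi, yi, alpha0, alpha, beta0, beta, lr=0.01)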

17 Fitting Neural Networks
Initialization
When the weight vectors are close to zero in length, all Z values are close to zero, so the sigmoid operates in its nearly linear range and the overall model is close to linear, i.e. a relatively simple model. (This can be seen as a regularized solution.)
Start with very small weights and let the neural network learn the necessary nonlinear relations from the data.
Starting with large weights often leads to poor solutions.
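A sketch of the small-random-weight initialization described above; the uniform range and layer sizes are illustrative choices, not prescribed by the slide.

import numpy as np

rng = np.random.default_rng(0)
p, M, K = 4, 5, 3                                  # illustrative layer sizes
alpha = rng.uniform(-0.1, 0.1, size=(M, p + 1))    # hidden-layer weights (including bias column)
beta = rng.uniform(-0.1, 0.1, size=(K, M + 1))     # output-layer weights (including bias column)
# With weights this small, the hidden units' linear inputs stay near 0, where the
# sigmoid is nearly linear, so the initial model is close to a linear model.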

18 Fitting Neural Networks
Overfitting
The model is too flexible, involving too many parameters, and may easily overfit the data.
Early stopping: do not let the algorithm converge. Because the model starts out close to linear, this is a regularized solution (shrunk toward linearity).
Explicit regularization ("weight decay"): minimize R(θ) plus a penalty on the weights, which tends to shrink smaller weights more. Cross-validation is used to estimate λ.
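A minimal sketch of how a weight penalty enters the gradient-descent update, assuming the standard squared-weight (weight-decay) penalty λ Σ w², which contributes 2λw to each weight's gradient; the slide's exact penalty form was not preserved, so this is an illustrative choice.

def decayed_update(w, grad, lr, lam):
    """One gradient-descent step on R(theta) + lam * sum(w ** 2):
    the penalty adds 2 * lam * w to the gradient, shrinking w toward zero."""
    return w - lr * (grad + 2.0 * lam * w)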

19 Fitting Neural Networks

20 Fitting Neural Networks

21 Fitting Neural Networks
Number of Hidden Units and Layers
Too few: might not have enough flexibility to capture the nonlinearities in the data.
Too many: overly flexible, but the extra weights can be shrunk toward zero if appropriate regularization is used.

22 Examples β€œA radial function is in a sense the most difficult for the neural net, as it is spherically symmetric and with no preferred directions.”

23 Examples

24 Examples

25 Going beyond single hidden layer
An old benchmark problem: classification of handwritten numerals.

26 Going beyond single hidden layer
5x5 → 1: no weight sharing
3x3 → 1: each of the units in a single 8 × 8 feature map shares the same set of nine weights (but has its own bias parameter) → decision boundaries of parallel lines
5x5 → 1: weights shared
3x3 → 1: the same operation applied to different parts of the image
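A sketch of the weight-sharing idea above: every unit in a single 8 × 8 feature map applies the same 3 × 3 weight patch to its own part of the input. The 10 × 10 input size and the single shared bias are simplifying assumptions (the slide notes each unit has its own bias parameter).

import numpy as np

def feature_map(image, kernel, bias=0.0):
    """image: 10 x 10 array; kernel: the shared 3 x 3 weights -> 8 x 8 feature map."""
    H = image.shape[0] - 2
    out = np.empty((H, H))
    for r in range(H):
        for c in range(H):
            patch = image[r:r + 3, c:c + 3]
            out[r, c] = np.sum(patch * kernel) + bias   # same nine weights at every position
    return out

rng = np.random.default_rng(0)
print(feature_map(rng.normal(size=(10, 10)), rng.normal(size=(3, 3))).shape)  # (8, 8)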

27 Going beyond single hidden layer
A training epoch: one sweep through the entire training set.
Too few epochs: underfitting. Too many: potential overfitting.

