Artificial neural networks Ricardo Ñanculef Alegría Universidad Técnica Federico Santa María Campus Santiago.

1 Artificial neural networks Ricardo Ñanculef Alegría Universidad Técnica Federico Santa María Campus Santiago

2 Learning from Natural Systems
Bio-inspired systems:
- Ant colonies
- Genetic algorithms
- Artificial neural networks
The power of the brain:
- Examples: vision, text processing
- Other animals: dolphins, bats

3 Modeling the Human Brain: key functional characteristics
- Learning and generalization ability
- Continuous adaptation
- Robustness and fault tolerance

4 Modeling the Human Brain: key structural characteristics
- Massive parallelism
- Distributed knowledge representation: memory
- Basic organization: networks of neurons (receptors → neural nets → effectors)

5 Modeling the Human Brain: neurons

6 Human Brain in numbers
- Cerebral cortex: ~10^11 neurons (more than the number of stars in the Milky Way)
- Massive connectivity: 10^3 to 10^4 connections per neuron (in total, ~10^15 connections)
- Time response: ~10^-3 seconds; silicon chips: ~10^-9 seconds (one million times faster)
- Yet humans are more efficient than computers at computationally complex tasks. Why?

7 Artificial Neural Networks
“A neural network is a massively parallel distributed processor made up of simple processing units, which has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects: knowledge is acquired by a learning process, and connection strengths between processing units are used to store the acquired knowledge.”
Simon Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed., reprint 2005, Prentice Hall

8 Artificial Neural Networks
“From the perspective of pattern recognition, neural networks can be regarded as an extension of the many conventional techniques which have been developed over several decades (…) for example, discriminant functions, logit…”
Christopher Bishop, Neural Networks for Pattern Recognition, reprint 2005, Oxford University Press

9 Artificial Neural Networks: diverse applications
- Pattern classification
- Clustering
- Function approximation: regression
- Time series forecasting
- Optimization
- Content-addressable memory

10 The beginnings
McCulloch and Pitts, 1943: “A logical calculus of the ideas immanent in nervous activity.” The first neuron model, based on simplifications of the brain's behavior:
- Binary incoming signals
- Connection strengths: a weight for each incoming signal
- Binary response: active or inactive
- Activation threshold or bias
Just a few years earlier: Boolean algebra

11 The beginnings: the model
y = 1 if Σ_i w_i x_i ≥ θ, else y = 0 (connection weights w_i, activation threshold/bias θ)

12 The beginnings
These neurons can compute logical operations.
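
As an illustration of how such threshold units compute logical operations, here is a minimal sketch of a McCulloch–Pitts neuron; the weights and thresholds below are hand-picked illustrative choices, not values from the slides:

```python
# A McCulloch-Pitts neuron: binary inputs, fixed connection weights,
# and a binary response driven by an activation threshold (bias).
def mp_neuron(inputs, weights, threshold):
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0  # active / inactive

# Hand-picked weights implement logical operations:
AND = lambda x1, x2: mp_neuron([x1, x2], [1, 1], 2)  # both inputs must be active
OR = lambda x1, x2: mp_neuron([x1, x2], [1, 1], 1)   # one active input suffices
NOT = lambda x: mp_neuron([x], [-1], 0)              # inhibitory weight
```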

13 The beginnings
Perceptron (1958): Rosenblatt proposes the use of “layers of neurons” as a computational tool, proposes a training algorithm, and puts the emphasis on the learning capabilities of neural networks. McCulloch and Pitts' line of work led instead toward automata theory.

14 The perceptron: architecture …

15 The perceptron: notation …

16 Perceptron Rule
We have a set of patterns x_i with desired responses d_i. If used for classification, there is one neuron per class: the number of neurons in the perceptron equals the number of classes.

17 Perceptron Rule
1. Initialize the weights and the thresholds.
2. Present a pattern vector x_i.
3. Update the weights according to the learning rate ρ; if used for classification with classes coded y_i = ±1, a misclassified pattern gives w ← w + ρ y_i x_i.
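
The three steps above can be sketched as follows; the learning rate, epoch count, and the toy linearly separable data set are illustrative assumptions, not from the slides:

```python
import numpy as np

def perceptron_train(X, y, rho=1.0, epochs=100):
    w = np.zeros(X.shape[1])               # 1. initialize the weights...
    b = 0.0                                # ...and the threshold (bias)
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):           # 2. present a pattern vector
            if yi * (w @ xi + b) <= 0:     # misclassified?
                w += rho * yi * xi         # 3. update the weights
                b += rho * yi
                errors += 1
        if errors == 0:                    # every pattern correct: stop
            break
    return w, b

# Toy two-class data coded +1/-1:
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
```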

18 Separating hyperplanes

19 Separating hyperplanes
Hyperplane: the set L of points x satisfying w^T x + b = 0. For any pair of points x_1, x_2 lying in L, w^T (x_1 − x_2) = 0; hence the normal vector to L is w/||w||.

20 Separating hyperplanes
Signed distance of any point x to L: d(x, L) = (w^T x + b)/||w||.
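
A quick numeric check of this formula; the hyperplane and points below are toy numbers chosen so the arithmetic is easy to verify by hand:

```python
import numpy as np

# Signed distance of a point x to the hyperplane L = {x : w @ x + b = 0}:
# d(x, L) = (w @ x + b) / ||w||
def signed_distance(x, w, b):
    return (w @ x + b) / np.linalg.norm(w)

w = np.array([3.0, 4.0])          # normal direction, ||w|| = 5
b = -5.0
on_plane = np.array([3.0, -1.0])  # 9 - 4 - 5 = 0, so this point lies in L
off_plane = np.array([3.0, 4.0])  # 9 + 16 - 5 = 20, signed distance 20/5 = 4
```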

21 Separating hyperplanes
Consider a two-class classification problem, with one class coded as +1 and the other as -1. An input is classified according to the sign of its distance to the hyperplane. How do we train the classifier?

22 Separating hyperplanes
An idea: train to minimize the distance of the misclassified inputs to the hyperplane, D(w, b) = −Σ_{i∈M} y_i (w^T x_i + b), where M is the set of misclassified patterns. Note this is very different from training with the quadratic loss of all the points.

23 Gradient Descent
Suppose we have to minimize f(θ), where θ is a vector of parameters (for example, θ = (w, b)). Iterate: θ ← θ − ρ ∇f(θ).

24 Stochastic Gradient Descent
Suppose we have to minimize E_z[f(θ, z)], where z is a random variable and we have samples z_1, z_2, … of z. Iterate over the samples: θ ← θ − ρ ∇_θ f(θ, z_i).
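
A minimal sketch of the iteration; the objective, samples, and step size are illustrative assumptions: here f(θ, z) = (θ − z)², whose expected value is minimized at the mean of z:

```python
# Stochastic gradient descent: update theta after each sample z_i,
# theta <- theta - rho * grad_theta f(theta, z_i).
def sgd(samples, grad, theta0=0.0, rho=0.05, passes=200):
    theta = theta0
    for _ in range(passes):
        for z in samples:
            theta -= rho * grad(theta, z)
    return theta

# f(theta, z) = (theta - z)^2  =>  grad = 2 (theta - z); the minimizer of
# the average loss is the sample mean (3.0 for the samples below).
samples = [2.8, 3.1, 3.0, 3.2, 2.9]
theta = sgd(samples, lambda t, z: 2.0 * (t - z))
```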

25 Separating hyperplanes
If M is held fixed, the gradients are ∂D/∂w = −Σ_{i∈M} y_i x_i and ∂D/∂b = −Σ_{i∈M} y_i. Stochastic gradient descent visits the misclassified patterns one at a time: w ← w + ρ y_i x_i, b ← b + ρ y_i.

26 Separating hyperplanes
For correctly classified inputs no correction on the parameters is applied. Now, note that the resulting update is precisely the perceptron rule.

27 Separating hyperplanes
Perceptron rule: for each misclassified pattern (x_i, y_i), update w ← w + ρ y_i x_i and b ← b + ρ y_i.

28 Perceptron convergence theorem
Theorem: If there exists a set of connection weights and an activation threshold able to separate the two classes, the perceptron algorithm converges to some solution in a finite number of steps, independently of the initialization of the weights and bias.

29 Perceptron
Conclusion: the perceptron rule, with two classes, is a stochastic gradient descent algorithm that aims to minimize the distances of the misclassified examples to the hyperplane. With more than two classes, the perceptron uses one neuron to model each class against the others. This is a modern perspective on the algorithm.

30 Delta Rule
Widrow and Hoff. It considers general (differentiable) activation functions.

31 Delta Rule
Update the weights according to the gradient of the quadratic loss: Δw = ρ Σ_i (d_i − y_i) f′(net_i) x_i, where f is the activation function and net_i = w^T x_i + b.

33 Delta Rule
Can the perceptron rule be obtained as a special case of this rule? Not directly: the step function is not differentiable. Note also that with this algorithm all the patterns are observed before a correction is made (batch mode), while with Rosenblatt's algorithm each pattern induces a correction (online mode).
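
A sketch of the delta rule in batch mode with a differentiable sigmoid activation; the learning rate, epoch count, and the OR task are illustrative assumptions, not from the slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def delta_rule(X, d, rho=0.5, epochs=5000):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        y = sigmoid(X @ w + b)
        err = d - y                                 # error on every pattern
        # Batch correction: all patterns observed before updating,
        # with the f'(net) = y (1 - y) factor of the sigmoid.
        w += rho * (X.T @ (err * y * (1 - y)))
        b += rho * np.sum(err * y * (1 - y))
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0.0, 1.0, 1.0, 1.0])                 # OR targets
w, b = delta_rule(X, d)
```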

34 Perceptrons
Perceptrons and logistic regression: with more than one neuron and a logistic (sigmoid) activation, each neuron has the form of a logistic model of one class against the others.

35 Neural Networks Death
Minsky and Papert, 1969: “Perceptrons.”

36 Neural Networks Death
A perceptron cannot learn the XOR function.

37 Neural Networks renaissance
Idea: map the data to a feature space where the solution is linear.
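
For instance, XOR becomes linearly separable after a hand-crafted map; the feature map (appending the product x1·x2) and the separating weights below are illustrative choices, not from the slides:

```python
import numpy as np

# XOR is not linearly separable in the original inputs, but after the map
# (x1, x2) -> (x1, x2, x1*x2) a single hyperplane separates the classes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])                     # XOR coded as +1/-1

Phi = np.column_stack([X, X[:, 0] * X[:, 1]])    # hand-crafted feature map
w = np.array([1.0, 1.0, -2.0])                   # a separating direction
b = -0.5
scores = Phi @ w + b                             # sign gives the class
```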

38 Neural Networks renaissance
Problem: this transformation is problem dependent.

39 Neural Networks renaissance
Solution: multilayer perceptrons (feed-forward artificial neural networks, FANN). More biologically plausible; the internal layers learn the map.

40 Architecture

41 Architecture: regression
Each output corresponds to a response variable.

42 Architecture: classification
Each output corresponds to a class, so the training data have to be coded by 0-1 response variables (1 for the correct class, 0 for the others).

43 Universal approximation Theorem
Theorem: Let σ be an admissible activation function and let K be a compact subset of R^d. Then, for any continuous function f: K → R and for any ε > 0, there exists a one-hidden-layer network g with activation σ such that |f(x) − g(x)| < ε for all x in K.

44 Universal approximation Theorem
Admissible activation functions: e.g., non-constant, bounded, continuous functions such as the sigmoid.

45 Universal approximation Theorem
Extensions of the sup-norm result:
- Other output activation functions
- Other norms

46 Fitting Neural Networks
The back-propagation algorithm:
- A generalization of the delta rule for multilayer perceptrons
- A gradient descent algorithm for the quadratic loss function

47 Back-propagation
Gradient descent generates a sequence of approximations related as w^(t+1) = w^(t) − ρ ∇E(w^(t)).

48 Back-propagation Equations
For a hidden unit j, the error term is δ_j = f′(net_j) Σ_k w_kj δ_k, where the sum runs over the units k of the next layer. Why “back propagation”? Because the error terms δ are computed from the output layer backward through the network.

49 Back-propagation Algorithm
1. Initialize the weights and the thresholds.
2. For each example i, compute the outputs and the error terms δ.
3. Update the weights according to the accumulated gradient.
4. Iterate 2 and 3 until convergence.

50 Stochastic Back-propagation
1. Initialize the weights and the thresholds.
2. For each example i, compute the error terms and update the weights immediately.
3. Iterate 2 until convergence.
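
A minimal sketch of stochastic back-propagation for a network with one sigmoid hidden layer and quadratic loss; the layer sizes, learning rate, and the XOR task are illustrative assumptions. Several random restarts are used and the best run is kept, since single runs can stall in local minima:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train(X, d, rng, hidden=4, rho=1.0, epochs=4000):
    # 1. initialize the weights and the thresholds (small random values)
    W1 = rng.normal(0, 0.5, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, hidden);               b2 = 0.0
    for _ in range(epochs):
        for x, t in zip(X, d):                   # 2. for each example:
            h = sigmoid(x @ W1 + b1)             #    forward pass
            y = sigmoid(h @ W2 + b2)
            dy = (y - t) * y * (1 - y)           #    output error term
            dh = dy * W2 * h * (1 - h)           #    back-propagated deltas
            W2 -= rho * dy * h;  b2 -= rho * dy  #    immediate update
            W1 -= rho * np.outer(x, dh); b1 -= rho * dh
    return W1, b1, W2, b2

def predict(X, params):
    W1, b1, W2, b2 = params
    return sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0.0, 1.0, 1.0, 0.0])               # XOR targets
rng = np.random.default_rng(0)
# Restart from several random initializations and keep the best network.
best_mse = np.inf
for _ in range(5):
    params = train(X, d, rng)
    mse = np.mean((predict(X, params) - d) ** 2)
    best_mse = min(best_mse, mse)
    if best_mse < 0.01:
        break
```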

51 Some Issues in Training NN
- Local minima
- Architecture selection
- Generalization and overfitting
- Other training functions

52 Local Minima
Back-propagation is a gradient descent procedure and hence converges to any configuration of weights such that ∇E(w) = 0. This can be a local minimum.

53 Local Minima
Starting values:
- Usually random values near zero
- Note that the sigmoid function is roughly linear if the weights are near zero
- Training can thus be seen as gradually increasing the non-linearity

54 Local Minima
Starting values:
- Stochastic back-propagation: randomize the order of presentation of the examples
- Multiple neural networks: select the best, average the networks, average the weights, or build ensemble models

55 Local Minima
Other optimization algorithms: back-propagation with momentum, Δw^(t) = −ρ ∇E(w^(t)) + α Δw^(t−1), with momentum parameter α typically 0.1–0.8.
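
The momentum update can be sketched on a one-dimensional objective; f(w) = (w − 3)² and the parameter values are illustrative assumptions:

```python
# Gradient descent with a momentum term: the new step keeps a fraction
# alpha of the previous step, Δw <- -rho * grad + alpha * Δw_prev.
def descend(rho=0.1, alpha=0.5, steps=100):
    w, dw = 0.0, 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)          # gradient of f(w) = (w - 3)^2
        dw = -rho * grad + alpha * dw   # momentum parameter alpha in 0.1-0.8
        w += dw
    return w

w = descend()                           # should approach the minimizer w = 3
```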

56 Overfitting
Early stopping and validation set:
1. Divide the available data into training and validation sets.
2. Compute the validation error rate periodically during training.
3. Stop training when the validation error rate “starts to go up.”
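
The stopping criterion in step 3 can be sketched as follows; the validation-error sequence and the patience parameter are illustrative assumptions:

```python
# Early stopping: track the best validation error seen so far and stop
# once the error has gone up for `patience` consecutive checks; the
# weights from the best epoch are the ones kept.
def early_stop_index(val_errors, patience=2):
    best, best_i, worse = float("inf"), 0, 0
    for i, e in enumerate(val_errors):
        if e < best:
            best, best_i, worse = e, i, 0
        else:
            worse += 1
            if worse >= patience:     # validation error "starts to go up"
                break
    return best_i                     # epoch whose weights we keep

val = [0.9, 0.7, 0.5, 0.45, 0.44, 0.47, 0.55, 0.6]
stop = early_stop_index(val)          # best epoch is index 4 (error 0.44)
```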

57 Overfitting
Early stopping and validation set.

58 Overfitting
Regularization by weight decay:
- Weight decay shrinks the network towards a linear (very simple!) model
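
For a linear model, weight decay is ridge regression, whose closed form makes the shrinkage explicit; the toy data and decay strengths below are illustrative assumptions:

```python
import numpy as np

# Weight decay adds a penalty lam * ||w||^2 to the quadratic loss; for a
# linear model the minimizer has the closed form
#   w = (X^T X + lam I)^{-1} X^T y,
# and larger lam shrinks w towards zero.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_ols = ridge(X, y, 0.0)      # no decay
w_dec = ridge(X, y, 10.0)     # strong decay: shrunken weights
```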

59 A Closer Look at Regularization
Convergence in A (the values of the risk) does not guarantee convergence in B (the space of functions).

60 Overfitting
Regularization by weight decay:
- Tikhonov regularization: let us consider the problem of estimating a function f from observations Af = g
- Suppose we minimize the discrepancy ||Af − g|| on some space H

61 Overfitting
Regularization by weight decay:
- It is well known that, even for continuous A, convergence of Af_n to g does not imply convergence of f_n to the solution
- Key regularization theorem: if H is compact, the last property holds!

62 Overfitting
Regularization by weight decay:
- Compactness: a subset of R^n is compact if it is bounded and closed
- Suppose we minimize on H, where the sets H_c are compact

63 Overfitting
Regularization by weight decay:
- Let Ω be a function such that the sets H_c = {f : Ω(f) ≤ c} are compact
- Hence, under some selections of the regularization parameter, minimizing the penalized loss amounts to minimizing over a compact set
- Example: Ω(w) = ||w||² (weight decay)

64 A Closer Look at Weight Decay
A less complicated hypothesis has a lower test error rate.

65 NN for classification
Loss function: is the quadratic loss appropriate?

66 NN for classification
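
One common answer, not spelled out on the slides, is that the cross-entropy loss suits 0-1 coded targets better: with a sigmoid output its gradient lacks the y(1 − y) factor of the quadratic loss, so a saturated, confidently wrong unit still receives a large correction. A hedged sketch of the two gradients with respect to the pre-activation a:

```python
# Gradients with respect to the pre-activation a, with output y = sigmoid(a)
# and target t.
def quad_grad(y, t):   # d/da of 0.5 * (y - t)^2: includes the y(1-y) factor
    return (y - t) * y * (1 - y)

def xent_grad(y, t):   # d/da of the cross-entropy loss: simply y - t
    return y - t

y, t = 0.999, 0.0      # a saturated unit that is confidently wrong
```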

67 Projection Pursuit
- Generalization of a 2-layer regression NN
- Universal approximator: good for prediction, not good for deriving interpretable models of data
- Basis functions (activation functions) are now “learned” from data
- Weights are viewed as projection directions we have to “pursue”

68 Projection Pursuit
Output: f(x) = Σ_m g_m(ω_m^T x), with inputs x, ridge functions g_m, and unit vectors ω_m.

69 PPR: Derived Features
The dot product v_m = ω_m^T x is the projection of the signal x onto ω_m; the ridge function g_m(v_m) varies only in the direction of ω_m.

70 PPR: Training
Minimize the squared error. Consider a single term:
- Given ω, we derive the features v_i = ω^T x_i and smooth g
- Given g, we minimize over ω with Newton's (like) method
- Iterate those two steps to convergence

71 PPR: Newton’s Method
Use derivatives to iteratively improve the estimate.

72 PPR: Newton’s Method
Use derivatives to iteratively improve the estimate: each step reduces to a weighted least squares regression to hit the target.
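
Newton's method in its simplest one-dimensional form, as a sketch of the "use derivatives to improve the estimate" idea; the objective f(x) = x⁴/4 − x (minimizer x = 1) and the starting point are illustrative assumptions:

```python
# Newton's method: iterate x <- x - f'(x) / f''(x) to improve the
# estimate of a minimizer of f.
def newton(df, d2f, x0, steps=20):
    x = x0
    for _ in range(steps):
        x -= df(x) / d2f(x)
    return x

# f(x) = x^4/4 - x, so f'(x) = x^3 - 1 and f''(x) = 3 x^2; minimizer x = 1.
x = newton(lambda x: x**3 - 1.0, lambda x: 3.0 * x**2, x0=2.0)
```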

73 PPR: Implementation Details
- Suggested smoothing methods: local regression, smoothing splines
- (ω, g) pairs are added in a forward stage-wise manner
- Very close to ensemble methods

74 Conclusions
Neural networks are a very general approach to both regression and classification. They are an effective learning tool when:
- Prediction is desired
- Formulating an explicit description of the problem’s solution is not required
