Presentation is loading. Please wait.

Presentation is loading. Please wait.

Introduction to Neural Networks

Similar presentations


Presentation on theme: "Introduction to Neural Networks"— Presentation transcript:

1 Introduction to Neural Networks
x0 xn w0 wn o Threshold units Lecture 1

2 History spiking neural networks
Vapnik (1990) ---support vector machine Broomhead & Lowe (1988) ----Radial basis functions (RBF) Linsker (1988) Informax principle Rumelhart, Hinton Back-propagation & Williams (1986) Kohonen(1982) Self-organizing maps Hopfield(1982) Hopfield Networks Minsky & Papert(1969) Perceptrons Rosenblatt(1960) Perceptron Minsky(1954) Neural Networks (PhD Thesis) Hebb(1949) The organization of behaviour McCulloch & Pitts (1943) -----neural networks and artificial intelligence were born

3 History of Neural Networks
1943: McCullough and Pitts - Modeling the Neuron for Parallel Distributed Processing 1958: Rosenblatt - Perceptron 1969: Minsky and Papert publish limits on the ability of a perceptron to generalize 1970’s and 1980’s: ANN renaissance 1986: Rumelhart, Hinton + Williams present backpropagation 1989: Tsividis: Neural Network on a chip

4 William McCulloch

5 Neural Networks McCulloch & Pitts (1943) are generally recognised as the designers of the first neural network Many of their ideas still used today (e.g. many simple units combine to give increased computational power and the idea of a threshold)

6 Neural Networks Hebb (1949) developed the first learning rule (on the premise that if two neurons were active at the same time the strength between them should be increased)

7

8 Neural Networks During the 50’s and 60’s many researchers worked on the perceptron amidst great excitement. 1969 saw the death of neural network research for about 15 years – Minsky & Papert Only in the mid 80’s (Parker and LeCun) was interest revived (in fact Werbos discovered algorithm in 1974)

9 How Does the Brain Work ? (1)
NEURON The cell that perform information processing in the brain Fundamental functional unit of all nervous system tissue

10 How Does the Brain Work ? (2)
Each consists of : SOMA, DENDRITES, AXON, and SYNAPSE

11 Biological neurons axon dendrites synapse cell

12 Neural Networks We are born with about 100 billion neurons
A neuron may connect to as many as 100,000 other neurons

13 Biological inspiration
Dendrites Soma (cell body) Axon

14 Biological inspiration
dendrites axon synapses The information transmission happens at the synapses.

15 Biological inspiration
The spikes travelling along the axon of the pre-synaptic neuron trigger the release of neurotransmitter substances at the synapse. The neurotransmitters cause excitation or inhibition in the dendrite of the post-synaptic neuron. The integration of the excitatory and inhibitory signals may produce spikes in the post-synaptic neuron. The contribution of the signals depends on the strength of the synaptic connection.

16 Biological Neurons human information processing system consists of brain neuron: basic building block cell that communicates information to and from various parts of body Simplest model of a neuron: considered as a threshold unit –a processing element (PE) Collects inputs & produces output if the sum of the input exceeds an internal threshold value

17 Artificial Neural Nets (ANNs)
Many neuron-like PEs units Input & output units receive and broadcast signals to the environment, respectively Internal units called hidden units since they are not in contact with external environment units connected by weighted links (synapses) A parallel computation system because Signals travel independently on weighted channels & units can update their state in parallel However, most NNs can be simulated in serial computers A directed graph, with labeled edges by weights is typically used to describe the connections among units

18 Each processing unit has a simple program that:
activation level A NODE ini g ai input function activation function output input links links aj Wj,i ai = g(ini) Each processing unit has a simple program that: a) computes a weighted sum of the input data it receives from those units which feed into it b) outputs of a single value, which in general is a non-linear function of the weighted sum of the its inputs ---this output then becomes an input to those units into which the original units feeds

19 g = Activation functions for units
Step function (Linear Threshold Unit) Sign function Sigmoid function sign(x) = +1, if x >= 0 -1, if x < 0 sigmoid(x) = 1/(1+e-x) step(x) = 1, if x >= threshold 0, if x < threshold

20 Real vs artificial neurons
axon dendrites synapse cell x0 xn w0 wn o Threshold units

21 Artificial neurons Neurons work by processing information. They receive and provide information in form of spikes. x1 x2 x3 xn-1 xn w1 Output w2 Inputs y w3 . . . wn-1 wn The McCullogh-Pitts model

22 Mathematical representation
The neuron calculates a weighted sum of inputs and compares it to a threshold. If the sum is higher than the threshold, the output is set to 1, otherwise to -1. Non-linearity

23 Artificial neurons x1 x2 w1 w2 threshold  f wn xn

24

25 Basic Concepts Definition of a node:
A node is an element which performs the function y = fH(∑(wixi) + Wb) Connection Node

26 f : activation function
Anatomy of an Artificial Neuron bias 1 f : activation function inputs output h : combine wi & xi

27 Simple Perceptron Binary logic application
fH(x) = u(x) [linear threshold] Wi = random(-1,1) Y = u(W0X0 + W1X1 + Wb) Now how do we train it?

28 Artificial Neuron A physical neuron An artificial neuron
From experience: examples / training data Strength of connection between the neurons is stored as a weight-value for the specific connection. Learning the solution to a problem = changing the connection weights An artificial neuron

29 Mathematical Representation
Inputs Output w2 w1 wn . y x2 xn b x1 b w1 w2 wn x1 x2 xn + x0 f(n) . n y Inputs Weights Summation Activation Output

30

31

32 Perceptron Learning Rule
A simple perceptron It’s a single-unit network Change the weight by an amount proportional to the difference between the desired output and the actual output. Δ Wi = η * (D-Y).Ii Perceptron Learning Rule Input Learning rate Desired output Actual output

33 Linear Neurons Obviously, the fact that threshold units can only output the values 0 and 1 restricts their applicability to certain problems. We can overcome this limitation by eliminating the threshold and simply turning fi into the identity function so that we get: With this kind of neuron, we can build networks with m input neurons and n output neurons that compute a function f: Rm  Rn.

34 Linear Neurons Linear neurons are quite popular and useful for applications such as interpolation. However, they have a serious limitation: Each neuron computes a linear function, and therefore the overall network function f: Rm  Rn is also linear. This means that if an input vector x results in an output vector y, then for any factor  the input x will result in the output y. Obviously, many interesting functions cannot be realized by networks of linear neurons.

35 Mathematical Representation

36 Gaussian Neurons Another type of neurons overcomes this problem by using a Gaussian activation function: 1 fi(neti(t)) neti(t) -1

37 Gaussian Neurons Gaussian neurons are able to realize non-linear functions. Therefore, networks of Gaussian units are in principle unrestricted with regard to the functions that they can realize. The drawback of Gaussian neurons is that we have to make sure that their net input does not exceed 1. This adds some difficulty to the learning in Gaussian networks.

38 Sigmoidal Neurons Sigmoidal neurons accept any vectors of real numbers as input, and they output a real number between 0 and 1. Sigmoidal neurons are the most common type of artificial neuron, especially in learning networks. A network of sigmoidal units with m input neurons and n output neurons realizes a network function f: Rm  (0,1)n

39 Sigmoidal Neurons fi(neti(t)) neti(t)  = 0.1  = 1
fi(neti(t)) neti(t) -1  = 0.1  = 1 The parameter  controls the slope of the sigmoid function, while the parameter  controls the horizontal offset of the function in a way similar to the threshold neurons.

40 Example: A simple single unit adaptive network
The network has 2 inputs, and one output. All are binary. The output is 1 if W0I0 + W1I1 + Wb > 0  0 if W0I0 + W1I1 + Wb ≤ 0  We want it to learn simple OR: output a 1 if either I0 or I1 is 1.

41 Artificial neurons The McCullogh-Pitts model:
spikes are interpreted as spike rates; synaptic strength are translated as synaptic weights; excitation means positive product between the incoming spike rate and the corresponding synaptic weight; inhibition means negative product between the incoming spike rate and the corresponding synaptic weight;

42 Artificial neurons Nonlinear generalization of the McCullogh-Pitts neuron: y is the neuron’s output, x is the vector of inputs, and w is the vector of synaptic weights. Examples: sigmoidal neuron Gaussian neuron

43 NNs: Dimensions of a Neural Network
Knowledge about the learning task is given in the form of examples called training examples. ANN is specified by: an architecture: a set of neurons and links connecting neurons. Each link has a weight, a neuron model: the information processing unit of the NN, a learning algorithm: used for training the NN by modifying the weights in order to solve the particular learning task correctly on the training examples. The aim is to obtain a NN that generalizes well, that is, that behaves correctly on new instances of the learning task.

44 Neural Network Architectures
Many kinds of structures, main distinction made between two classes: a) feed- forward (a directed acyclic graph (DAG): links are unidirectional, no cycles b) recurrent: links form arbitrary topologies e.g., Hopfield Networks and Boltzmann machines Recurrent networks: can be unstable, or oscillate, or exhibit chaotic behavior e.g., given some input values, can take a long time to compute stable output and learning is made more difficult…. However, can implement more complex agent designs and can model systems with state We will focus more on feed- forward networks

45

46

47

48

49

50

51

52

53

54

55 Single Layer Feed-forward
Input layer of source nodes Output layer neurons

56 Multi layer feed-forward
Input layer Output Hidden Layer 3-4-2 Network

57 Feed-forward networks:
Advantage: lack of cycles = > computation proceeds uniformly from input units to output units. -activation from the previous time step plays no part in computation, as it is not fed back to an earlier unit - simply computes a function of the input values that depends on the weight settings –it has no internal state other than the weights themselves. - fixed structure and fixed activation function g: thus the functions representable by a feed-forward network are restricted to have a certain parameterized structure

58 Learning in biological systems
Learning = learning by adaptation The young animal learns that the green fruits are sour, while the yellowish/reddish ones are sweet. The learning happens by adapting the fruit picking behaviour. At the neural level the learning happens by changing of the synaptic strengths, eliminating some synapses, and building new ones.

59 Learning as optimisation
The objective of adapting the responses on the basis of the information received from the environment is to achieve a better state. E.g., the animal likes to eat many energy rich, juicy fruits that make its stomach full, and makes it feel happy. In other words, the objective of learning in biological organisms is to optimise the amount of available resources, happiness, or in general to achieve a closer to optimal state.

60 Synapse concept Hebb’s Rule:
The synapse resistance to the incoming signal can be changed during a "learning" process [1949] Hebb’s Rule: If an input of a neuron is repeatedly and persistently causing the neuron to fire, a metabolic change happens in the synapse of that particular input to reduce its resistance

61 Neural Network Learning
Objective of neural network learning: given a set of examples, find parameter settings that minimize the error. Programmer specifies - numbers of units in each layer - connectivity between units, Unknowns - connection weights

62 Supervised Learning in ANNs
In supervised learning, we train an ANN with a set of vector pairs, so-called exemplars. Each pair (x, y) consists of an input vector x and a corresponding output vector y. Whenever the network receives input x, we would like it to provide output y. The exemplars thus describe the function that we want to “teach” our network. Besides learning the exemplars, we would like our network to generalize, that is, give plausible output for inputs that the network had not been trained with.

63 Supervised Learning in ANNs
There is a tradeoff between a network’s ability to precisely learn the given exemplars and its ability to generalize (i.e., inter- and extrapolate). This problem is similar to fitting a function to a given set of data points. Let us assume that you want to find a fitting function f:RR for a set of three data points. You try to do this with polynomials of degree one (a straight line), two, and nine.

64 Supervised Learning in ANNs
deg. 2 f(x) x deg. 9 deg. 1 Obviously, the polynomial of degree 2 provides the most plausible fit.

65 Overfitting Real Distribution Overfitted Model

66

67

68

69

70 Supervised Learning in ANNs
The same principle applies to ANNs: If an ANN has too few neurons, it may not have enough degrees of freedom to precisely approximate the desired function. If an ANN has too many neurons, it will learn the exemplars perfectly, but its additional degrees of freedom may cause it to show implausible behavior for untrained inputs; it then presents poor ability of generalization. Unfortunately, there are no known equations that could tell you the optimal size of your network for a given application; you always have to experiment.

71 Learning in Neural Nets
Learning Tasks Supervised Unsupervised Data: Labeled examples (input , desired output) Tasks: classification pattern recognition regression NN models: perceptron adaline feed-forward NN radial basis function support vector machines Data: Unlabeled examples (different realizations of the input) Tasks: clustering content addressable memory NN models: self-organizing maps (SOM) Hopfield networks

72 Learning Algorithms Depend on the network architecture:
Error correcting learning (perceptron) Delta rule (AdaLine, Backprop) Competitive Learning (Self Organizing Maps)

73 Perceptrons Perceptrons are single-layer feedforward networks
Each output unit is independent of the others Can assume a single output unit Activation of the output unit is calculated by: O = Step( ) where xj is the activation of input unit j, and we assume an additional weight and input to represent the threshold

74 Perceptron x1 x2 xn w1 w2 wn . X0 = 1 w0 > 0 1 if otherwise O =

75

76 Perceptron Rosenblatt (1958) defined a perceptron to be a machine that learns, using examples, to assign input vectors (samples) to different classes, using linear functions of the inputs Minsky and Papert (1969) instead describe perceptron as a stochastic gradient-descent algorithm that attempts to linearly separate a set of n-dimensional training data.

77 Linear Separable x2 x2 + + - - + + x1 x1 - + - - (b) (a)
some functions not representable - e.g., (b) not linearly separable

78 So what can be represented using perceptrons?
and or Representation theorem: 1 layer feedforward networks can only represent linearly separable functions. That is, the decision surface separating positive from negative examples has to be a plane.

79 Learning Boolean AND

80 XOR No w0, w1, w2 satisfy: (Minsky and Papert, 1969)

81 Expressive limits of perceptrons
Can the XOR function be represented by a perceptron (a network without a hidden layer)? XOR cannot be represented.

82 How can perceptrons be designed?
The Perceptron Learning Theorem (Rosenblatt, 1960): Given enough training examples, there is an algorithm that will learn any linearly separable function. Theorem 1 (Minsky and Papert, 1969) The perceptron rule converges to weights that correctly classify all training examples provided the given data set represents a function that is linearly separable

83 The perceptron learning algorithm
Inputs: training set {(x1,x2,…,xn,t)} Method Randomly initialize weights w(i), -0.5<=i<=0.5 Repeat for several epochs until convergence: for each example Calculate network output o. Adjust weights: learning rate error Perceptron training rule

84 Why does the method work?
The perceptron learning rule performs gradient descent in weight space. Error surface: The surface that describes the error on each example as a function of all the weights in the network. A set of weights defines a point on this surface. We look at the partial derivative of the surface with respect to each weight (i.e., the gradient -- how much the error would change if we made a small change in each weight). Then the weights are being altered in an amount proportional to the slope in each direction (corresponding to a weight). Thus the network as a whole is moving in the direction of steepest descent on the error surface. The error surface in weight space has a single global minimum and no local minima. Gradient descent is guaranteed to find the global minimum, provided the learning rate is not so big that that you overshoot it.

85 Multi-layer, feed-forward networks
Perceptrons are rather weak as computing models since they can only learn linearly-separable functions. Thus, we now focus on multi-layer, feed forward networks of non- linear sigmoid units: i.e., g(x) =

86 Multi-layer feed-forward networks
Multi-layer, feed forward networks extend perceptrons i.e., 1-layer networks into n-layer by: Partition units into layers 0 to L such that; lowermost layer number, layer 0 indicates the input units topmost layer numbered L contains the output units. layers numbered 1 to L are the hidden layers Connectivity means bottom-up connections only, with no cycles, hence the name"feed-forward" nets Input layers transmit input values to hidden layer nodes hence do not perform any computation. Note: layer number indicates the distance of a node from the input nodes

87 Multilayer feed forward network
Layer of output units v2 v1 v3 Layer of hidden units Layer of input units x x x x x4

88 Multi-layer feed-forward networks
Multi-layer feed-forward networks can be trained by back-propagation provided the activation function g is a differentiable function. Threshold units don’t qualify, but the sigmoid function does. Back-propagation learning is a gradient descent search through the parameter space to minimize the sum-of-squares error. Most common algorithm for learning algorithms in multilayer networks

89 Sigmoid units x0 w0 Sigmoid unit for g o xn wn
This is g’ (the basis for gradient descent)

90 Determining optimal network structure
Weak point of fixed structure networks: poor choice can lead to poor performance Too small network: model incapable of representing the desired Function Too big a network: will be able to memorize all examples but forming a large lookup table, but will not generalize well to inputs that have not been seen before. Thus finding a good network structure is another example of a search problems. Some approaches to search for a solution for this problem include Genetic algorithms But using GAs is very cpu-intensive.

91 Learning rate Ideally, each weight should have its own learning rate
As a substitute, each neuron or each layer could have its own rate Learning rates should be proportional to the sqrt of the number of inputs to the neuron

92 Setting the parameter values
How are the weights initialized? Do weights change after the presentation of each pattern or only after all patterns of the training set have been presented? How is the value of the learning rate chosen? When should training stop? How many hidden layers and how many nodes in each hidden layer should be chosen to build a feedforward network for a given problem? How many patterns should there be in a training set? How does one know that the network has learnt something useful?

93 When should neural nets be used for learning a problem
If instances are given as attribute-value pairs. Pre-processing required: Continuous input values to be scaled in [0-1] range, and discrete values need to converted to Boolean features. Noise in training examples. If long training time is acceptable.

94 Neural Networks: Advantages
Distributed representations Simple computations Robust with respect to noisy data Robust with respect to node failure Empirically shown to work well for many problem domains Parallel processing

95 Neural Networks: Disadvantages
Training is slow Interpretability is hard Network topology layouts ad hoc Can be hard to debug May converge to a local, not global, minimum of error May be hard to describe a problem in terms of features with numerical values

96 Back-propagation Algorithm
yj(n) y0=+1 y1(n) yi(n) wji(n) wj1(n) wj0(n)=bj (n) netj(n) f(.) Total error All output neurons Average squared error where N=No. of items in the training set

97 Back-propagation Algorithm
as as as as if as Gradient decent Error term where

98 Back-propagation Algorithm
Neuron k is an output node Neuron j is a hidden node Output of neuron k Weight adjustment Learning rate Local gradient Input signal Hidden layer(j) Output layer (k)

99

100

101 Brain vs. Digital Computers (1)
Computers require hundreds of cycles to simulate a firing of a neuron The brain can fire all the neurons in a single step. Parallelism Serial computers require billions of cycles to perform some tasks but the brain takes less than a second e.g. Face Recognition

102 What are Neural Networks
An interconnected assembly of simple processing elements, units, neurons or nodes, whose functionality is loosely based on the animal neuron The processing ability of the network is stored in the interunit connection strengths, or weights, obtained by a process of adaptation to, or learning from, a set of training patterns.

103 Definition of Neural Network
A Neural Network is a system composed of many simple processing elements operating in parallel which can acquire, store, and utilize experiential Knowledge.

104 Architecture Connectivity: Feedback
fully connected partially connected Feedback feedforward network: no feedback simpler, more stable, proven most useful recurrent network: feedback from output to input units complex dynamics, may be unstable Number of layers i.e. presence of hidden layers

105 Inputs Feedforward, Fully-Connected with One Hidden Layer Outputs
Node Connection Outputs

106 Hidden Units Layer of nodes between input and output nodes
Allow a network to learn non-linear functions Allow the net to represent combinations of the input features

107 Learning Algorithms How the network learns the relationship between the inputs and outputs Type of algorithm used depends on type of network- architecture, type of learning,etc. Back Propagation: most popular modifications exist: quick prop, Delta-bar-Delta Others: Conjugate gradient descent, Levenberg-Marquardt, K-Means, Kohonen, standard pseudo-inverse (SVD) linear optimization

108 Types of Networks Multilayer Perceptron Radial Basis Function Kohonen
Linear Hopfield Adaline/Madaline Probabilistic Neural Network (PNN) General Regression Neural Network (GRNN) and at least thirty others

109 A Neural Network is a system composed of many simple processing elements operating in parallel which can acquire, store, and utilize experiential knowledge Basic Artificial Model Consists of simple processing elements called neurons, units or nodes Each neuron is connected to other nodes with an associated weight(strength) which typically multiplies the signal transmitted. Each neuron has a single threshold value Characterization Architecture: the pattern of nodes and connections between them Learning algorithm, or training method: method for determining weights of the connections Activation function: function that produces an output based on the input values received by node

110 Perceptrons First studied in the late 1950s
Also known as Layered Feed-Forward Networks The only efficient learning element at that time was for single-layered networks Today, used as a synonym for a single-layer, feed-forward network

111 Perceptrons First studied in the late 1950s
Also known as Layered Feed-Forward Networks The only efficient learning element at that time was for single-layered networks Today, used as a synonym for a single-layer, feed-forward network

112 Single Layer Perceptron
Squashing function need not be sigmoidal

113 Perceptron Architecture

114 Single-Neuron Perceptron

115 Decision Boundary

116 Example OR

117 OR solution

118 Multiple-Neuron Perceptron

119 Learning Rule Test Problem

120 Starting Point

121 Tentative Learning Rule

122 Second Input Vector

123 Third Input Vector

124 Unified Learning Rule

125 Multiple-Neuron Perceptron

126 Apple / Banana Example

127 Apple / Banana Example, Second iteration

128 Apple / Banana Example, Check

129 Historical Note There was great interest in Perceptrons in the '50s and '60s - centred on the work of Rosenblatt. This was crushed by the publication of "Perceptrons" by Minsky and Papert. • EOR problem regions must be linearly separable. • Training Problems esp. with higher order nets.

130 However, we will concentrate on nets with units arranged in layers
x1 xn MLP used to describe any general feedforward (no recurrent connections) network However, we will concentrate on nets with units arranged in layers

131 x1 xn NB different books refer to the above as either 4 layer (no. of layers of neurons) or 3 layer (no. of layers of adaptive weights). We will follow the latter convention 1st question: what do the extra layers gain you? Start with looking at what a single layer can’t do

132 Perceptron does not work here
XOR problem Single layer generates a linear decision boundary XOR (exclusive OR) problem 0+0=0 1+1=2=0 mod 2 1+0=1 0+1=1 Perceptron does not work here

133 Minsky & Papert (1969) offered solution to XOR problem by
combining perceptron unit responses using a second layer of units +1 1 3 2 +1

134 This is a linearly separable problem!
(1,-1) (1,1) (-1,-1) (-1,1) This is a linearly separable problem! Since for 4 points { (-1,1), (-1,-1), (1,1),(1,-1) } it is always linearly separable if we want to have three points in a class

135 xn x1 x2 Input Output Three-layer networks Hidden layers

136 What do each of the layers do?
3rd layer can generate arbitrarily complex boundaries 1st layer draws linear boundaries 2nd layer combines the boundaries

137 Can also view 2nd layer as using local knowledge while 3rd layer does global
With sigmoidal activation functions can show that a 3 layer net can approximate any function to arbitrary accuracy: property of Universal Approximation Proof by thinking of superposition of sigmoids Not practically useful as need arbitrarily large number of units but more of an existence proof For a 2 layer net, same is true for a 2 layer net providing function is continuous and from one finite dimensional space to another

138

139 Example: Learning addition
First find the outputs OI , OII . In order to do this, propagate the inputs forward. First find the outputs for the neurons of hidden layer

140 Example: Learning addition
Then find the outputs of the neurons of output layer

141 Example: Learning addition
Now propagate back the errors. In order to do that first find the errors for the output layer, also update the weights between hidden layer and output layer

142 Example: Learning addition
And backpropagate the errors to hidden layer.

143 Example: Learning addition
And backpropagate the errors to hidden layer.

144 Example: Learning addition
Finally update weights!!!!


Download ppt "Introduction to Neural Networks"

Similar presentations


Ads by Google