CSM6120 Introduction to Intelligent Systems

Name: CSM6120 Introduction to Intelligent Systems
Uploaded: 2017-08-22T20:44:56+00:00
Duration: PTM33S29
Channel: Sylvia Harvey
Description: CSM6120 Introduction to Intelligent Systems

CSM6120 Introduction to Intelligent Systems
Neural Networks

Sub-symbolic learning
When we use some sort of rule-based system, generally we understand the rules We understand the conclusions it draws, because it can tell us When a system learns from such rules, processes it in an understood way, and comes up with an understandable result, it is known as symbolic learning

Sub-symbolic learning
Some systems of learning – which we are coming to – operate in quite a different way Sub-symbolic learning is learning where we don't really understand or have control over the process by which it comes to its conclusion Though we could, with considerable effort and vast amounts of time, find out This includes neural networks, genetic algorithms, genetic programming and to some extent, statistical methods

Artificial neural networks
ANNs for short We use the word “Artificial” to distinguish them from biological neural networks They come in various shapes and sizes Inputs (variables), outputs (results) May be one or more outputs

Neural networks (applications)
Face recognition Event recognition, tracking Time series prediction Process control Optical character recognition Handwriting recognition Robotics, movement Games: car control, etc Etc…

ANNs statistical? We all think of ANNs as pure AI, right?
There is much published literature which claims that ANNs are really no more than non-linear regression models, and claims that they can be implemented using standard statistical software But there is a difference: randomness Standard statistical methods will always produce the same result given the same starting data AI methods all have an element of randomness about them, and will produce different results every time they’re run It is this random element which enables them to find solutions in a search space which might not otherwise be found

Neurons Original ANN research aimed at modelling networks of real neurons in the brain 1010 neurons in the brain – unrealistic target! Brain is massively parallel One job shared between many neurons If one goes wrong, no big deal Known as distributed processing Fault-tolerant

Patterns within data ANNs learn to recognise patterns within sets of data Fault tolerance helps – can still cope with unexpected input examples, placing emphasis only on what it considers important

The neuron metaphor Neurons Multiply inputs by weights along edges
Accept information from multiple inputs Transmit information to other neurons Multiply inputs by weights along edges Apply some function to the set of inputs at each node

ANN topology

Nodes (neurons, or perceptrons)
Each input node brings into the network the value of one independent variable The hidden layer nodes do most of the work The output node(s) give us our result

Types of neuron Linear neuron Logistic neuron Perceptron
Potentially more. Require a convex loss function for gradient descent training Logistic neuron Perceptron

Activation functions Transforms neuron’s input into output
Features of activation functions: A squashing effect is required Prevents accelerating growth of activation levels through the network Simple and easy to calculate, differentiable...?

Perceptrons Formal neuron (McCulloch & Pitts, 1943)
Summation of inputs and threshold functions Showed how any logical function could be computed with simple model neurons Perceptron

Perceptrons Learning by weight adjustment
Output is 1 if the weighted sum of inputs is greater than a given threshold, otherwise output is 0 An extra weight (bias) is used as a way of adjusting the threshold Step function used Perceptron

Alternative notation The threshold (for these examples) is always = 0
If the sum of the weighted inputs >= 0 then P = 1, else P = 0 X = 1 w0 A w1 P w2 B

Example If the inputs are binary (0 or 1) then the neuron units act like a conventional logic module In this example, P = A or B Equates to: if 2A + B – 0.5 >= 0 then output 1, else output 0 X = 1 -0.5 A 2 Equates to: if 2A + B – 0.5 >= 0 then output 1, else output 0 P 1 B

Example 2 What is being computed here? X = 1 -2.5 A 2 P 1 B
By only changing the bias, we can model another logical function = AND By only changing the bias, we can model another logical function = AND

Example 3 What is being computed here? X = 1 -1.5 A 2 2 P -2 B X = 1
-0.5 -3 1 P = (B AND NOT(C) ) OR (A AND NOT(B) AND NOT(C)) OR (A AND B AND C) or P = (A AND B) OR (NOT(C) AND (A XOR B)) Q = (NOT(A) AND (B OR C)) OR (A AND B AND C) C Q 2.5

Standard backpropagation
The commonest learning rule for ANNs Made popular by Rumelhart and McClelland in 1986 Connections between nodes given random initial weights We therefore get a value at the output node(s) which is what happens when these random weights are applied to the data at the input Use the difference between the output and correct values to adjust weights

Training perceptrons What are the weight values?
X = -1 For AND A B P ? A ? P ? B What are the weight values? Initialize with random weights

Training perceptrons -1 For AND A B P 0 0 0 0 1 0 0.3 1 0 0 1 1 1 A
0.3 A 0.5 P -0.4 B Bias A B Summation Output -1 (-1*0.3) + (0*0.5) + (0*-0.4) = -0.3 1 (-1*0.3) + (0*0.5) + (1*-0.4) = -0.7 (-1*0.3) + (1*0.5) + (0*-0.4) = 0.2 (-1*0.3) + (1*0.5) + (1*-0.4) = -0.2

Learning algorithm Epoch: Presentation of the entire training set to the neural network In the case of the AND function an epoch consists of four sets of inputs being presented to the network (i.e. [0,0], [0,1], [1,0], [1,1]) Error: The error value is the amount by which the value outputted by the network differs from the target value For example, if we required the network to output 0 and it outputted a 1, then Error = -1 Error Some method of working out the error (an error function) is defined so that we can find out how accurate the prediction was

Learning algorithm Error Epochs
The weightings on the connections are then adjusted to try to reduce this error Further iterations of weight adjustment proceed until we decide we’re finished And that is a study in itself Epochs There may be many thousands of epochs in one training run, before a satisfactory model is achieved But how many do we need? The BIG question (or one of them!)

Perceptron learning rule
Weights are updated: wi = wi + wi wi =  (t - o) xi t= the target value o is the perceptron output  is a small constant (e.g. 0.12) called the learning rate For each training example: If the output is correct (t=o) the weights wi are not changed If the output is incorrect (to) the weights wi are changed such that the output of the perceptron for the new weights is closer to t The algorithm converges to the correct classification If the training data is linearly separable And  is sufficiently small Target Value, T : When we are training a network we not only present it with the input but also with a value that we require the network to produce. For example, if we present the network with [1,1] for the AND function the training value will be 1 Output , O : The output value from the neuron Ij : Inputs being presented to the neuron Wj : Weight from input neuron (Ij) to the output neuron LR : The learning rate. This dictates how quickly the network converges. It is set by a matter of experimentation. It is typically 0.1

Search space It’s often useful to think of ANN training as a marble rolling across a surface, which has valleys in it Some valleys may be deeper than others – we want to find a deep one When we find it, we stop training Local minimum Global minimum

Gradient descent learning rule
Consider linear unit without threshold and continuous output o (not just 0,1 as with perceptrons) o = w0 + w1 x1 + … + wn xn We train the wi such that they minimise the squared error E[w1,…,wn] = ½ dD (td-od)2 where D is the set of training examples

Gradient descent (w1,w2) Gradient: E[w]=[E/w0,… E/wn] w=- E[w]
wi=- E/wi (w1+w1,w2 +w2) To apply this gradient descent approach we need to: 1 define the function that computes the error of the network. 2 be able to compute the slope of the error surface at a particular point as a function in the change of the weights at a node. 3 define a weight update rule that uses the slope information to produce the steepest descent along the error surface. wi=- E/wi =/wi 1/2d(td-od)2 = /wi 1/2d(td-i wi xi)2 = d(td- od)(-xi)

Gradient descent (linear units)
Gradient-descent(training_examples, ) Each training example is a pair of the form <(x1,…xn),t> where (x1,…,xn) is the vector of input values, and t is the target output value,  is the learning rate (e.g. 0.1) Initialize each wi to some small random value Until the termination condition is met: Initialize each wi to zero For each <(x1,…xn),t> in training_examples: Input the instance (x1,…,xn) to the linear unit and compute the output o For each linear unit weight wi wi = wi + (t-o)xi (update deltas) wi=wi+wi (update weights) The rule for updating weights is more complex for non-linear activation functions (which must be differentiable to be used with gradient descent).

Incremental stochastic gradient descent
Two approaches to this: Batch mode = gradient descent over the entire data D Incremental mode = gradient descent over individual training examples d Incremental gradient descent can approximate batch gradient descent arbitrarily closely if  is small enough

Perceptron vs gradient descent rule
Perceptron learning rule guaranteed to succeed if Training examples are linearly separable Sufficiently small learning rate  Gradient descent learning for linear units Guaranteed to converge to hypothesis with minimum squared error Given sufficiently small learning rate  Even when training data contains noise

Decision boundaries In simple cases, divide feature space by drawing a hyperplane across it Known as a decision boundary Discriminant function: returns different values on opposite sides (straight line) Problems which can be thus classified are linearly separable

Linear separability Decision Boundary X1 A A A A A A A X2 B B B B B B

Decision surface of a perceptron
x2 + - x1 x2 + - x1 + - Linearly separable Non-linearly separable Perceptron is able to represent some useful functions AND(x1,x2) choose weights w0=-1.5, w1=1, w2=1 But functions that are not linearly separable (e.g. XOR) are not representable – we need multilayer perceptrons here

Multilayer networks The input layer. The hidden layer(s).
Cascade neurons together The output from one layer is the input to the next Each layer has its own sets of weights Hidden layer(s)… The input layer. Introduces input values into the network. No activation function or other processing. The hidden layer(s). Perform classification of features Two hidden layers are sufficient to solve any problem Features imply more layers may be better The output layer. Functionally just like the hidden layers Outputs are passed on to the world outside the neural network.

Linear regression neural networks
What happens when we arrange linear neurons in a multilayer network?

Linear regression neural networks
…nothing special happens The product of two linear transformations is itself a linear transformation

Neural networks We want to introduce non-linearities to the network
Non-linearities allow a network to identify complex regions in space

Linear separability An extra layer models a convex hull
1-layer cannot handle XOR More layers can handle more complicated spaces – but require more parameters Each node splits the feature space with a hyperplane If the second layer is AND, a 2-layer network can represent any convex hull An extra layer models a convex hull “An area with no dents in it” Perceptron models, but can’t learn Sigmoid function learning of convex hulls Two layers add convex hulls together Sufficient to classify anything “sane”. In theory, further layers add nothing In practice, extra layers may be better

Separability Exclusive-OR problem Classes with meshed regions
Most general region shapes Structure Single-Layer A B B A Two-Layer A B B A Three-Layer A B B A

Feed-forward networks
Predictions are fed forward through the network to classify

Error backpropagation
Error backpropagation solves the gradient for each partial component separately The target values for each layer come from the next layer This feeds the errors back along the network Training will proceed from the last layer to the first

Backpropagation (BP) algorithm
BP employs gradient descent to attempt to minimize the squared error between the network output values and the target values for these outputs Two stage learning Forward stage: calculate outputs given input pattern Backward stage: update weights by calculating deltas

Termination conditions for BP
The weight update loop may be iterated thousands of times in a typical application The choice of termination condition is important because Too few iterations can fail to reduce error sufficiently Too many iterations can lead to overfitting the training data (see later!) Termination criteria After a fixed number of iterations (epochs) Once the training error falls below some threshold Once the validation error meets some criterion Error surface can have multiple local minima Guarantee toward some local minimum No guarantee to the global minimum

Why BP works in practice A possible scenario
Weights are initialized to values near zero Early gradient descent steps will represent a very smooth function (approximately linear). Why? The sigmoid function is almost linear when the total input (weighted sum of inputs to a sigmoid unit) is near 0 The weights gradually move close to the global minimum As weights grow in the later stages of learning, they represent highly non-linear network functions Gradient steps in this later stage move toward local minima in this region, which is acceptable

Representational power of MLPs
Every boolean function can be represented exactly by some network with two layers of units Note: The number of hidden units required may grow exponentially with the number of network inputs Continuous functions Every bounded continuous function can be approximated with arbitrarily small error, by network with one hidden layer (Cybenko 1989, Hornik 1989) Any function can be approximated to arbitrary accuracy by a network with two hidden layers (Cybenko 1988)

The hidden layer Without this, network probably won’t perform well
No hidden layer means a linear model But it’s another big question! How many nodes do we need in the hidden layer? And might it perform better with more than one? (answers to both: in general, we don't know – experiment and see how it works for your data) See also

Interpretation of hidden layers
What are the hidden layers actually doing?! Essentially feature extraction The non-linearities in the feature extraction can make interpretation of the hidden layers very difficult This leads to neural networks being treated as black boxes

Overfitting We’re trying to form a model using some data, which we can then use to predict the result of other data If we train too much, we end up with a model which can near perfectly describe the data it’s seen, but can’t generalise This model is said to have overtrained/overfitted Very important factor in machine learning! Compare PCA & taking too many PCs

Overfitting example We wanted the model to be able to predict the x points (the model would have been simpler) But it overtrained on the o points (this is a more complicated model) Clearly this is a very exaggerated example… y Key O Training set X Test set Desired train Overfitted x o o x x o x o x

Stopping the training So we have to stop the ANN training after it’s found a good model, but before it goes too far Since all datasets have different characteristics, we have to experiment to find out what is optimal This overfitting problem occurs in other AI methods too and you must recognise it

Avoiding overfitting In order to avoid overfitting, we need to make sure we validate our model: an overfitted model cannot generalise for unseen data Use your query (test) set to check: does it still work? If so, you can assume your model is OK and will work on further unseen data This assumes there is some more data of course!

Advantages of ANNs No programming (training instead)
Very fast in operation (parallel) Can generalize over examples Can tolerate noise Degrades gracefully

Disadvantages of ANNs Needs large training sets
Can take long time Both positive and negative examples needed Supervised learning (desired results known) Can be unpredictable on untrained data Can’t explain results

When to consider ANNs Input is high-dimensional discrete or real-valued E.g. raw sensor data Output is discrete or real-valued Possibly noisy data Form of target function is unknown Human readability of result is unimportant

Example: Handwriting recognition
Demo: 06sp/NeuralNetsHandwriting/JRec.html

More to try Pythia Sharky Evaluation copy, limited to 13 neurons
But good examples supplied Sharky Nice graphical displays, but more difficult to see what's happening with the data

Other types of neural network
Multiple Outputs Skip Layer Network Recurrent Neural Networks …

Multiple outputs Used for N-way classification
Each node in the output layer corresponds to a different class No guarantee that the sum of the output vector will equal 1

Skip layer network Input nodes are also sent directly to the output layer

Recurrent neural networks
Output or hidden layer information is stored in a context or memory layer Output Layer Hidden Layer Context Layer Can have arbitrary topologies Can model systems with internal states (dynamic ones) Delays are associated to a specific weight Training is more difficult Performance may be problematic Stable Outputs may be more difficult to evaluate Unexpected behavior (oscillation, chaos, …) Input Layer

Time Delayed Recurrent Neural Networks
Output layer from time t are used as inputs to the hidden layer at time t+1 Output Layer With an optional decay Hidden Layer Input Layer

Radial Basis Functions (RBFs)
Features One hidden layer Several outputs Inputs Radial units Outputs

Radial Basis Functions (RBFs)
RBF hidden layer units have a receptive field which has a centre Generally, the hidden unit function is Gaussian The output layer is linear Realized function

Learning The training is performed by deciding on 2 steps
How many hidden nodes there should be The centers and the sharpness of the Gaussians 2 steps In the 1st stage, the input dataset is used to determine the parameters of the basis functions In the 2nd stage, functions are kept fixed while the second layer weights are estimated (simple backpropagation algorithm like for MLPs)

MLPs versus RBFs MLP RBF X2 X1 X2 X1 Classification Learning Structure
MLPs separate classes via hyperplanes RBFs separate classes via hyperspheres Learning MLPs use distributed learning RBFs use localized learning RBFs train faster Structure MLPs have one or more hidden layers RBFs have only one layer RBFs require more hidden neurons X2 MLP X1 X2 RBF X1

Kohonen ANNs = Self-organising maps (SOMs) What is it?
It’s a neural network without any training outputs! (unsupervised learning) It doesn’t need to be told how to form its model; it organises itself A bit like PCA in that respect .shtml

Self organizing maps The purpose of SOMs is to map a multidimensional input space onto a topology-preserving map of neurons Preserve topology so that neighboring neurons respond to “similar” input patterns The topological structure is often a 2 or 3 dimensional space Each neuron is assigned a weight vector with the same dimensionality as the input space Input patterns are compared to each weight vector and the closest wins (Euclidean distance) Generates a random set of output nodes None of these necessarily match the inputs we’re going to present to it Each one has a weighting linked to each input unit (sample) we’re going to want to form a model on Each output node is linked to each input layer node (our training inputs)

The activation of the neuron is spread in its direct neighbourhood =>neighbours become sensitive to the same input patterns Block distance The size of the neighbourhood is initially large but reduce over time => Specialization of the network 2nd neighbourhood First neighbourhood

Adaptation During training, the “winner” neuron and its neighborhood adapts to make their weight vector more similar to the input pattern that caused the activation The neurons are moved closer to the input pattern The magnitude of the adaptation is controlled via a learning parameter which decays over time

The algorithm Randomise each output node’s weights
Choose a training input at random See which output matches closest (this is the BMU: Best Matching Unit) Adjust weightings of all neighbouring nodes (nodes within the radius of the BMU) to make them more like the training input The closer to BMU, the greater the change Repeat from step 2

The radius This defines which nodes are to be thought of as near to the initial one Initially (usually) it will encompass all the nodes in the map And decrease to 0 as the network proceeds

Testing We can then present to the network a new, unseen input
The node which it falls closest to will be deemed to be its classification

In action… regensburg.de/~saj39122/begrolu/kohonen.html a good applet demonstration of a Kohonen net training (scroll down): applet running Kohonen net on colours. Note how it’s clear that radius decreases over time

Further reading Follow up all the links in the slides
Read Russell & Norvig, chapter 20 (especially the Neural Network section), but don't worry too much about the maths at this stage Useful resources Good tutorial Enough information to write your own Kohonen net!

CSM6120 Introduction to Intelligent Systems

Similar presentations

Presentation on theme: "CSM6120 Introduction to Intelligent Systems"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

CSM6120 Introduction to Intelligent Systems

Similar presentations

Presentation on theme: "CSM6120 Introduction to Intelligent Systems"— Presentation transcript:

Similar presentations

About project

Feedback