Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5

1 Artificial Neural Networks #1 Machine Learning CH4 : 4.1 – 4.5
Promethea Pythaitha

2 Artificial Neural Networks
A robust approach to approximating target functions over attributes with continuous as well as discrete domains. Can approximate unknown target functions. The target function can be discrete-valued, real-valued, or vector-valued.

3 Neural Net Applications
Robust to noise in training data; among the most successful methods at interpreting noisy real-world sensor data. Microphone input / speech recognition. Camera input. Handwriting recognition. Face recognition and image processing. Robotic control. Fuzzy neural nets.

4 Inspiration for Artificial Neural nets.
Natural learning systems (i.e., brains) are composed of very complex webs of interconnected neurons. Each neuron receives signals (current spikes). When the neuron's threshold is reached, it sends its signal downstream to: other neurons, physical actuators, perception in the neocortex, etc.

5 Artificial Neural nets are built out of densely interconnected sets of units.
Each unit (artificial neuron) takes many real-valued inputs and produces a real-valued output. The output is then sent downstream to other units within the net or to the output layer of the net.

6 Brain estimated to contain ~ 100 billion neurons.
Each neuron connects to an average of 10,000 others. Neuron switching times are on the order of a thousandth of a second (versus roughly a ten-billionth of a second for logic gates), yet brains can make decisions, recognize images, etc., VERY fast.

7 Hypothesis: Thought/information processing in the brain is the result of massively parallel processing of distributed inputs.

8 Neural Nets are built on this idea:
In parallel, process Distributed data.

9 Artificial vs. natural: Many complexities of natural neural nets are not present in artificial ones, e.g. feedback (uncommon in ANNs). Many features of artificial neural nets are not consistent with natural ones; for example, units in artificial neural nets produce a single constant output rather than a time-varying sequence of current pulses.

10 Neural Network representations.
ALVINN learned to drive an autonomous vehicle on the highway. Input: a 30 x 32 pixel matrix from a camera, i.e. 960 values (grayscale pixel intensities). Output: steering direction for the vehicle (30 real values). Two-layer neural net: input (not counted as a layer), hidden layer, output layer.

11 ALVINN

12 ALVINN explained: Typical Neural Net Structure.
All 960 inputs in the 30x32 matrix are sent to the four hidden neurons/units, where weighted linear sums are computed. Hidden unit: a unit whose output is only accessible within the net, not at the output layer. Outputs from the 4 hidden neurons are sent downstream to 30 output neurons, each of which outputs a confidence value corresponding to steering in a specific direction. Fuzzy truth?? Probability measure? The program chooses the direction with the highest confidence.

13 Typical Neural Net structure.
Usually, layers are connected in a directed acyclic graph. In general, a net can have any structure: cyclic/acyclic, directed/undirected, feedforward/feedback. The most common and practical nets are trained using backpropagation. Learning = selecting a weight value for each connection. This assumes the net is directed; cycles are restricted, and usually there are none in practice.

14 Appropriate problems for A.N.N.s
Great for problems with noisy data and complex sensor inputs (camera/microphone, etc., vs. shaft encoder/light sensor, etc.). Symbolic problems: as good as decision tree learning!! BACKPROPAGATION is the most widely used technique.

15 Appropriate problems for A.N.N.s
Neural net learning is suitable for problems with the following characteristics: Instances are represented by many attribute-value pairs. The target function depends on a vector of predefined attributes (real-valued inputs); the attributes can be correlated or independent. The target function can be discrete-valued, real-valued, or a vector! ALVINN's target function was a vector of 30 real values.

16 Appropriate problems for A.N.N.s
Training data can contain errors/noise: neural nets are very robust to noisy data (thankfully, so are natural ones ;-) ). Long training times are acceptable: neural nets usually take longer to train than other machine-learning algorithms, from a few minutes to several hours. Fast evaluation of the learned function may be required: neural nets do compute the learned function very fast; ALVINN re-computes its steering direction several times per second.

17 Appropriate problems for A.N.N.s
It is not important that humans be able to understand the learned function!! ALVINN's 960 inputs, 4 hidden nodes, and 30 outputs get somewhat messy-looking to humans, thanks to the massive parallelism and distributed data.

18 Perceptrons. Basic building block of Neural Nets
Takes several real-valued inputs, computes a weighted sum, and checks the sum against a threshold (-w0): if the sum exceeds the threshold, output +1; else output -1. Equivalently, o(x1, …, xn) = +1 if w0 + w1*x1 + … + wn*xn > 0, and -1 otherwise.

19 For simplicity, let x0 = 1, then o(x1, …, xn) = sgn(w.x)
Vectors are denoted in bold!! The "." is the vector dot product!! Hypothesis space = all possible combinations of real-valued weights, i.e. all w in R^(n+1).
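
A minimal sketch (assuming NumPy and the x0 = 1 convention above) of how a single perceptron computes its output:

import numpy as np

def perceptron_output(w, x):
    """Threshold the weighted sum w.x; w and x include the bias term (x[0] = 1)."""
    return 1 if np.dot(w, x) > 0 else -1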

20 Perceptron can represent any linearly separable concept.
The learned hypothesis is a hyperplane decision surface in R^n, with equation w.x = 0. Examples: AND, OR, NAND, NOR are linearly separable; XOR is not. Any boolean function can be represented by a 2-layer perceptron network!!
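
As a quick check, here is a hypothetical choice of weights (inputs in {0, 1}, using the perceptron_output sketch above) that realizes AND and OR; no single weight vector can do the same for XOR:

# Hypothetical perceptron weights, with x0 = 1 prepended to each input.
w_and = np.array([-1.5, 1.0, 1.0])   # fires (+1) only when x1 = x2 = 1
w_or  = np.array([-0.5, 1.0, 1.0])   # fires (+1) when at least one input is 1
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array([1.0, x1, x2])
    print(x1, x2, perceptron_output(w_and, x), perceptron_output(w_or, x))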

21 Training a single Perceptron
Two rules: the perceptron training rule, and the delta training rule / gradient descent. They converge to different hypotheses, under different conditions!!

22 Perceptron training rule.
Start with random weights and go through the training examples. When an error is made, update the weights: for each wi, wi ← wi + Δwi, where Δwi = η(t - o)xi. Terminology: η is the learning rate (typically small, sometimes decreased as training proceeds), t is the target function value, and o is the value output by the perceptron.
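
A minimal sketch of this rule (assuming NumPy, training examples as (x, t) pairs where x is an array with x[0] = 1 already prepended, and the perceptron_output function above):

def perceptron_train(examples, n_features, eta=0.1, max_epochs=100):
    """Perceptron training rule: adjust weights only on misclassified examples."""
    w = np.random.uniform(-0.05, 0.05, n_features + 1)  # small random weights, incl. bias
    for _ in range(max_epochs):
        errors = 0
        for x, t in examples:                 # x already has x[0] = 1
            o = perceptron_output(w, x)
            if o != t:
                w += eta * (t - o) * x        # delta_wi = eta * (t - o) * x_i
                errors += 1
        if errors == 0:                       # every training example classified correctly
            break
    return w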

23 When the entire set of training examples is classified correctly, STOP.

24 Pros and Cons: Can be proven to converge in finite time to a w that correctly classifies all training examples, if η is small enough. Convergence is guaranteed only if the concept is linearly separable. If not, there is no convergence and hence no stopping!! It can become an infinite loop of indecision!!

25 Delta rule & Gradient descent.
Addresses the problem of non-convergence for concepts that are not linearly separable. It gives a linear approximation to the concept that minimizes the error. …but how do we measure error??

26 Consider a perceptron with the “thresholding” function removed.
Then o = w.x (a linear unit). We can define the error as half the sum of squared deviations: E(w) = ½ Σd (td – od)², summed over all training examples d, where td is the target function value for example d and od is the computed value of the weighted sum for d.

27 With this choice of error, it can be proven that minimizing E leads to the most probable hypothesis that fits the training data, under the assumption that the noise is normally distributed with mean 0. Note that the "most probable" hypothesis and the "correct" hypothesis can still be different.

28 Graph E versus weights:

29 E is always a paraboloid (it is quadratic in the weights)
So it has only one, global, minimum. Goal: descend to the global minimum ASAP! How? The definition of the gradient, the meaning of the gradient, and gradient descent.

30 The Gradient of E tells us the direction of steepest ascent.
So -Gradient(E) tells us the direction of steepest descent. Go in that direction with step size η. The learning rule becomes: Δwi = -η ∂E/∂wi, i.e. w ← w - η Gradient(E).

31 Derivation of simple update rule.
Finally, the delta rule weight update is Δwi = η Σd (td – od) xid, summed over all training examples d.
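
The derivation itself appeared as an equation image on the original slide; a short sketch, using the definitions of E and o above (od = w · xd), is:

\[
\frac{\partial E}{\partial w_i}
  = \frac{\partial}{\partial w_i}\,\frac{1}{2}\sum_d (t_d - o_d)^2
  = \sum_d (t_d - o_d)\,\frac{\partial}{\partial w_i}\bigl(t_d - \mathbf{w}\cdot\mathbf{x}_d\bigr)
  = -\sum_d (t_d - o_d)\, x_{id},
\]
\[
\Delta w_i = -\eta\,\frac{\partial E}{\partial w_i} = \eta \sum_d (t_d - o_d)\, x_{id}.
\]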

32 Gradient descent pseudocode.
Pick an initial random weight vector w.
Until the termination condition is met, do:
  Set Δwi = 0 for all i.
  For each <x, t> in D, do:
    Run the net on x: compute o(x).
    For each wi, do: Δwi ← Δwi + η (t – o) xi.
  For each wi, do: wi ← wi + Δwi.
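
A runnable sketch of this procedure (assuming NumPy, the unthresholded output o = w.x, and D as a list of (x, t) pairs with x[0] = 1 prepended):

def gradient_descent_train(D, n_features, eta=0.05, epochs=1000):
    """Batch gradient descent / delta rule for a linear unit."""
    w = np.random.uniform(-0.05, 0.05, n_features + 1)
    for _ in range(epochs):
        delta_w = np.zeros_like(w)
        for x, t in D:
            o = np.dot(w, x)                  # unthresholded output
            delta_w += eta * (t - o) * x      # accumulate gradient contribution
        w += delta_w                          # one batch update per pass through D
    return w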

33 Results of Gradient Descent.
Because the error surface has a single minimum, the algorithm will converge to a w with minimum squared deviation/error as long as η is small enough. This holds regardless of linear separability. If η is too large, the algorithm may overshoot the minimum instead of settling into it; a common remedy is to decrease η over time.

34 Gradient descent can be used to search large or infinite hypothesis spaces when:
The hypothesis space is continuously parameterized, and the error can be differentiated with respect to the hypothesis parameters. However: convergence can be slooooooow, and if there are many local minima, there is no guarantee it will find the global one.

35 Stochastic gradient descent.
a.k.a. incremental gradient descent. Instead of updating the weights in one batch after going through all the training examples, we update them after each example. This really descends the gradient of a single-example error function (one example per step): Ed(w) = ½ (td – od)². If η is made small enough, this approximates true gradient descent arbitrarily closely.

36 Stochastic Gradient Descent.
Pick an initial random weight vector w.
Until the termination condition is met, do:
  For each <x, t> in D, do:
    Run the net on x: compute o(x).
    For each wi, do: wi ← wi + η (t – o) xi.
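
The same sketch adapted to the stochastic/incremental version (same assumptions as the batch version above); the only change is that the weights move after every example:

def stochastic_gradient_descent_train(D, n_features, eta=0.05, epochs=1000):
    """Incremental (stochastic) gradient descent for a linear unit."""
    w = np.random.uniform(-0.05, 0.05, n_features + 1)
    for _ in range(epochs):
        for x, t in D:
            o = np.dot(w, x)            # unthresholded output
            w += eta * (t - o) * x      # update immediately, per example
    return w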

37 Results. Compared to stochastic gradient descent, standard gradient descent takes more computation per weight update step, but can generally be used with a larger step size. When E has multiple local minima, stochastic gradient descent can sometimes avoid falling into them.

38 Perceptrons with discrete output.
For delta-learning/gradient descent, we discussed the unthresholded perceptron. It can easily be adapted to thresholded perceptrons: just use the thresholded target values (±1) as the t values in the delta-learning algorithm, together with the unthresholded o values. Unfortunately, this does not necessarily reduce the number of misclassifications made by the thresholded output; it only minimizes the squared error of the unthresholded output.

39 Multilayer Networks and the Backpropagation Algorithm
In order to learn non-linear decision surfaces, a system more complex than perceptrons is needed.

40 Choice of base unit. Multiple layers of linear units → still linear.
The unthresholded perceptron is linear, and the thresholded perceptron has a non-differentiable thresholding function, so we cannot compute the gradient of E. We need something different: it must be non-linear and continuously differentiable.

41 Sigmoid unit. In place of the perceptron step function, use the sigmoid function as the thresholding function. Sigmoid: σ(y) = 1 / (1 + e^(-y))

42 Steepness of incline increases with coefficient of –y.
The sigmoid unit computes the weighted linear sum w.x and then applies the sigmoid "squashing function". The steepness of the incline increases with the coefficient of –y. Continuously differentiable, with derivative dσ(y)/dy = σ(y)(1 - σ(y)).
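
A small sketch of a sigmoid unit and the derivative identity above (NumPy assumed):

def sigmoid(y):
    """Squashing function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-y))

def sigmoid_unit_output(w, x):
    """Weighted linear sum followed by the sigmoid squashing function."""
    return sigmoid(np.dot(w, x))

def sigmoid_derivative(y):
    """d(sigma)/dy, expressed in terms of sigma(y) itself."""
    s = sigmoid(y)
    return s * (1.0 - s)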

43 Backpropagation algorithm.
Learns the weights that minimize the squared error, given a fixed number of units/neurons and interconnections. Employs gradient descent, similar to the delta rule. Error is measured by E(w) = ½ Σd Σk∈outputs (tkd - okd)², summed over all training examples d and all output units k.

44 Error Surface can have multiple local minima.
No guarantee the algorithm will find the global minimum. However, in practice, backpropagation performs very well. (Recall stochastic gradient descent vs. gradient descent.)

45 Stochastic Backpropagation Algorithm
Consider a feedforward network with two layers of sigmoid units, fully connected in one direction. Each unit is assigned an index (i = 0, 1, 2, …). xji denotes the input from unit i into unit j, wji denotes the weight on the connection from i to j, and δn is the error term associated with unit n.

46 Backpropagation algorithm.

47 Backpropagation explained
Build the network and randomly initialize the weights. For each training example d, apply the network, calculate the squared error Ed, compute the gradient, and take a step of size η in the direction of steepest descent. Weight update rule: recall the delta rule, Δwi = η (t – o) xi; here we have Δwji = η δj xji, where the error term δj is more complex.

48 Error term for unit j. Intuitively, if j is an output node k, then its error term is the standard (tk – ok) multiplied by ok(1 - ok), the derivative of the sigmoid function (the derivative appears because we are using Gradient(E)): δk = ok(1 - ok)(tk - ok). If j is a hidden node h, there is no th to compare it with; we must sum the error terms δk of the output nodes k influenced by h, each weighted by how much it is influenced by h (wkh): δh = oh(1 - oh) Σk wkh δk.
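
Putting the weight-update rule and the two error terms together, here is a minimal sketch of one stochastic-backpropagation pass for a fully connected two-layer network of sigmoid units (a hypothetical matrix-based layout using NumPy and the sigmoid helper sketched earlier; W_hidden and W_out are assumed weight matrices whose first column is a bias weight, and x has x[0] = 1 prepended):

def backprop_step(W_hidden, W_out, x, t, eta=0.05):
    """One stochastic-backpropagation update for a 2-layer sigmoid network.

    W_hidden: (n_hidden, n_inputs + 1) weights into the hidden layer
    W_out:    (n_outputs, n_hidden + 1) weights into the output layer
    x:        input vector with x[0] = 1 (bias)
    t:        target output vector
    """
    # Forward pass.
    h = sigmoid(W_hidden @ x)               # hidden-unit outputs
    h1 = np.concatenate(([1.0], h))         # prepend bias input for the output layer
    o = sigmoid(W_out @ h1)                 # output-unit outputs

    # Error terms.
    delta_out = o * (1 - o) * (t - o)                          # delta_k for output units
    delta_hidden = h * (1 - h) * (W_out[:, 1:].T @ delta_out)  # delta_h (skip bias column)

    # Weight updates: delta_w_ji = eta * delta_j * x_ji.
    W_out    += eta * np.outer(delta_out, h1)
    W_hidden += eta * np.outer(delta_hidden, x)
    return W_hidden, W_out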

49 Derivation of error term for unit j.
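
The derivation here was given as equation images on the original slides; a sketch of the standard argument, using the notation above and the per-example error Ed = ½ Σk (tk - ok)², with net_j = Σi wji xji and oj = σ(net_j), is:

\[
\frac{\partial E_d}{\partial w_{ji}}
  = \frac{\partial E_d}{\partial \mathit{net}_j}\,\frac{\partial \mathit{net}_j}{\partial w_{ji}}
  = \frac{\partial E_d}{\partial \mathit{net}_j}\, x_{ji}.
\]

For an output unit j, defining \(\delta_j = -\partial E_d / \partial \mathit{net}_j\):
\[
\frac{\partial E_d}{\partial \mathit{net}_j}
  = \frac{\partial E_d}{\partial o_j}\,\frac{\partial o_j}{\partial \mathit{net}_j}
  = -(t_j - o_j)\, o_j (1 - o_j)
  \;\Rightarrow\;
  \delta_j = o_j (1 - o_j)(t_j - o_j), \quad \Delta w_{ji} = \eta\,\delta_j\, x_{ji}.
\]

For a hidden unit h, Ed depends on net_h only through the units downstream of h:
\[
\delta_h = -\frac{\partial E_d}{\partial \mathit{net}_h}
  = -\sum_{k \in \mathrm{Downstream}(h)} \frac{\partial E_d}{\partial \mathit{net}_k}\,
      \frac{\partial \mathit{net}_k}{\partial \mathit{net}_h}
  = o_h (1 - o_h) \sum_{k} w_{kh}\,\delta_k .
\]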

53 Termination condition.
Stop after a fixed number of iterations. Stop when E on the training data drops below a given level. Stop when E on test data drops below a certain level. Too few iterations → too much error remains. Too many → overfitting the data.

54 Variant of Backpropagation with momentum.
Adding momentum = making the weight update at step n depend partially on the update at step n - 1: Δwji(n) = η δj xji + α Δwji(n - 1), where α is a constant between 0 and 1. Analogous to a ball rolling down a bumpy hill: the momentum term (α) tries to keep the ball moving in the same direction as before, which can keep the 'ball' moving through local minima and plateaus. If Gradient(E) does not change, it increases the effective step size.
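
As a small sketch (hypothetical helper names, continuing the NumPy layout used above), the only change to a layer's weight update is remembering the previous Δ:

def momentum_update(W, delta, inputs, prev_dW, eta=0.05, alpha=0.9):
    """Backpropagation weight update with a momentum term.

    delta:   error terms for the layer's units
    inputs:  the inputs x_ji feeding that layer (bias included)
    prev_dW: the weight change applied at the previous step (n - 1)
    """
    dW = eta * np.outer(delta, inputs) + alpha * prev_dW  # delta_w(n) = eta*delta*x + alpha*delta_w(n-1)
    W += dW
    return W, dW   # keep dW around to use as prev_dW at step n + 1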

55 Pros and Cons of Momentum
Can provide quicker convergence. However, in theory it can also roll right through the global minimum and keep going.

56 Generalization to n-layer network
We simply evaluate the δk for the output nodes as before, then backpropagate the errors through the network layer by layer: for a node r at layer m, δr = or(1 - or) Σs wsr δs, over all nodes s in layer m + 1. Layer m + 1 is downstream from m; r feeds input into the s nodes.

57 Generalization to arbitrary acyclic network
We simply evaluate the δk for the output nodes as before, then backpropagate the errors through the network: for a node r, δr = or(1 - or) Σs wsr δs, over all nodes s in Downstream(r), where Downstream(r) = {s | s receives input from r}.

58 Summary. Artificial Neural networks are
A practical way to learn discrete-, real-, and vector-valued functions. Robust to noisy data. Usually trained via backpropagation. Used for many real-world tasks: robot control, computer creativity, etc.

59 Feedforward networks with 3 layers can approximate any function to any desired accuracy, given sufficient units/artificial neurons and connections. Good accuracy is achieved even with small nets. Backpropagation is able to discover intermediate features within the net that are not explicitly defined as attributes of the input or output.

