Artificial Neural Networks


1 Artificial Neural Networks
Chapter 4 Artificial Neural Networks

2 Questions: What are ANNs? How do we learn an ANN? (algorithm)
The representational power of ANNs (advantages and disadvantages)

3 What are ANNs? ------Background
Consider the human brain: neuron switching time ~0.001 second; number of neurons ~10^10; connections per neuron ~10^4-5; scene recognition time ~0.1 second. This implies a great deal of parallel computation. Property of a neuron: a thresholded unit. One motivation for ANN systems is to capture this kind of highly parallel computation based on distributed representations.

4 What are ANNs? -----Problems related to ANNs
Classification, voice recognition, and others

5 Another example: Properties of artificial neural nets (ANNs): many neuron-like threshold switching units; many weighted interconnections among units; highly parallel, distributed processing; emphasis on tuning weights automatically

6 4.1 Perceptrons

7 The perceptron takes a vector of real-valued inputs and outputs o(x1, ..., xn) = 1 if ω0 + ω1·x1 + ... + ωn·xn > 0, and −1 otherwise. To simplify notation, set x0 = 1, so the condition becomes ω · x > 0.
Learning a perceptron involves choosing values for the weights ω0, ..., ωn. Therefore, the space H of candidate hypotheses considered in perceptron learning is the set of all possible real-valued weight vectors.
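As a minimal sketch of the definition above (the function and variable names are illustrative, not from the slides):

```python
# A minimal sketch of a perceptron's output; with the convention x[0] = 1,
# the threshold w0 becomes just another weight.
def perceptron_output(weights, x):
    """Return 1 if w . x > 0, else -1, where x[0] is assumed to be 1."""
    s = sum(w * xi for w, xi in zip(weights, x))
    return 1 if s > 0 else -1

# Example: the AND perceptron discussed on a later slide,
# o(x1, x2) = sgn(-0.8 + 0.5*x1 + 0.5*x2)
print(perceptron_output([-0.8, 0.5, 0.5], [1, 1, 1]))   # -> 1  (both inputs "true")
print(perceptron_output([-0.8, 0.5, 0.5], [1, -1, 1]))  # -> -1
```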

8 We can view the perceptron as representing a hyperplane decision surface in the n-dimensional space of instances. There are two ways to train a perceptron: the Perceptron Training Rule and Gradient Descent.

9 (1). Perceptron Training Rule
Initialize each ωi with a random value in the given interval. Update the value of ωi according to each training example: ωi ← ωi + Δωi, where Δωi = η(t − o)xi; t is the target value, o is the perceptron output, and η is a small constant called the learning rate.
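A short sketch of this rule in code, assuming the perceptron_output helper above; the names (eta, examples, epochs) are illustrative choices:

```python
import random

def train_perceptron(examples, n_inputs, eta=0.1, epochs=50):
    """examples: list of (x, t) pairs where x[0] == 1 and t is +1 or -1."""
    weights = [random.uniform(-0.05, 0.05) for _ in range(n_inputs + 1)]
    for _ in range(epochs):
        for x, t in examples:
            o = perceptron_output(weights, x)
            # w_i <- w_i + eta * (t - o) * x_i
            weights = [w + eta * (t - o) * xi for w, xi in zip(weights, x)]
    return weights
```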

10 Representation Power of Perceptrons
A single perceptron can represent many boolean functions, such as AND, OR, NAND, and NOR, but it fails to represent XOR. E.g., g(x1, x2) = AND(x1, x2) is represented by o(x1, x2) = sgn(−0.8 + 0.5·x1 + 0.5·x2):
x1   x2   ω·x    o
−1   −1   −1.8   −1
−1    1   −0.8   −1
 1   −1   −0.8   −1
 1    1    0.2    1

11 Representation Power of Perceptrons
(a) The perceptron training rule can be proven to converge if the training data are linearly separable and η is sufficiently small. (b) But some functions are not representable, e.g., those that are not linearly separable. (c) Every boolean function can be represented by some network of perceptrons only two levels deep.

12 (2). Gradient Descent
Key idea: search the hypothesis space to find the weights that best fit the training examples. Best fit: minimize the squared error E(ω) = (1/2) Σ_{d∈D} (t_d − o_d)^2, where D is the set of training examples, t_d is the target output for example d, and o_d is the unit's output for d.

13 Gradient Descent
Gradient: ∇E(ω) = [∂E/∂ω0, ∂E/∂ω1, ..., ∂E/∂ωn]. Training rule: Δω = −η∇E(ω), or componentwise Δωi = −η ∂E/∂ωi, where ∂E/∂ωi = −Σ_{d∈D} (t_d − o_d)·x_id.

14 Gradient Descent

15 Gradient Descent Algorithm
Initialize each ωi to some small random value. Until the termination condition is met, Do: initialize each Δωi to zero; for each <x, t> in the training examples, Do: input the instance x to the unit and compute the output o, and for each linear unit weight ωi, Do: Δωi ← Δωi + η(t − o)xi. Then, for each linear unit weight ωi, Do: ωi ← ωi + Δωi.
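A sketch of this batch gradient-descent (delta rule) algorithm for a linear unit, following the slide's steps; the names are illustrative:

```python
import random

def gradient_descent(examples, n_inputs, eta=0.05, epochs=100):
    """examples: list of (x, t) with x[0] == 1; the unit's output is o = w . x."""
    w = [random.uniform(-0.05, 0.05) for _ in range(n_inputs + 1)]
    for _ in range(epochs):
        delta_w = [0.0] * len(w)
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))   # linear (unthresholded) output
            delta_w = [dw + eta * (t - o) * xi for dw, xi in zip(delta_w, x)]
        w = [wi + dw for wi, dw in zip(w, delta_w)]    # one update per pass over D
    return w
```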

16 When to use gradient descent
The hypothesis space is continuously parameterized, and the error can be differentiated with respect to the hypothesis parameters.

17 Advantages vs. Disadvantages
Advantages: guaranteed to converge to the hypothesis with minimum squared error, given a sufficiently small learning rate η, even when the training data contain noise and even when the training data are not linearly separable (for a single linear unit the error surface has a single global minimum). Disadvantages: convergence can sometimes be very slow; there is no guarantee of converging to the global minimum in cases where there are multiple local minima.

18 Incremental (Stochastic) Gradient Descent vs. Standard Gradient Descent
Standard Gradient Descent: Do until satisfied: compute the gradient ∇E_D(ω) over the entire training set D, then ω ← ω − η∇E_D(ω). Stochastic Gradient Descent: Do until satisfied: for each training example d in D, compute ∇E_d(ω) and update ω ← ω − η∇E_d(ω), where E_d(ω) = (1/2)(t_d − o_d)^2.
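A sketch of the stochastic (incremental) variant; compared with the batch version above, the only change is that the weights are updated after every training example rather than once per pass:

```python
import random

def stochastic_gradient_descent(examples, n_inputs, eta=0.05, epochs=100):
    """examples: list of (x, t) with x[0] == 1; the unit's output is o = w . x."""
    w = [random.uniform(-0.05, 0.05) for _ in range(n_inputs + 1)]
    for _ in range(epochs):
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]  # immediate update
    return w
```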

19 Standard Gradient Descent vs. Stochastic Gradient Descent
Stochastic gradient descent can approximate standard gradient descent arbitrarily closely if η is made small enough; the stochastic mode can converge faster; stochastic gradient descent can sometimes avoid falling into local minima.

20 (3). Perceptron training rule vs. gradient descent
Perceptron training rule (thresholded perceptron output): provided the examples are linearly separable, it converges to a hypothesis that perfectly classifies the training data. Gradient descent (unthresholded linear output): regardless of whether the training data are linearly separable, it converges asymptotically toward the minimum-error hypothesis.

21 4.2 Multilayer Networks
Perceptrons can only express linear decision surfaces; we need networks that can express a rich variety of nonlinear decision surfaces.

22 Sigmoid unit – a differentiable threshold unit
Sigmoid function: σ(y) = 1 / (1 + e^(−y)). Property: dσ(y)/dy = σ(y)(1 − σ(y)). Output: o = σ(ω · x). Why do we use the sigmoid instead of a linear unit or sgn(x)? Unlike sgn, it is differentiable (so gradient descent applies), and unlike a linear unit it is nonlinear, so multilayer networks of sigmoid units can represent nonlinear functions.
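A small sketch of the sigmoid and the property used later by backpropagation; the helper names are illustrative:

```python
import math

def sigmoid(y):
    """sigma(y) = 1 / (1 + e^(-y))"""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_prime(y):
    """d sigma / dy = sigma(y) * (1 - sigma(y))"""
    s = sigmoid(y)
    return s * (1.0 - s)

def sigmoid_unit_output(weights, x):
    """Output o = sigma(w . x), with x[0] assumed to be 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))
```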

23 The Backpropagation Algorithm
The main idea of the backpropagation algorithm: compute the input and output of each unit in a forward pass; then modify the weights of unit pairs in a backward pass with respect to the errors.

24 Error definition: Batch mode: E(ω) = (1/2) Σ_{d∈D} Σ_{k∈outputs} (t_kd − o_kd)^2. Individual (per-example) mode: E_d(ω) = (1/2) Σ_{k∈outputs} (t_k − o_k)^2.

25 Notation: ω_ji denotes the weight on the connection from unit i to unit j, and x_ji = o_i denotes the corresponding input to unit j, i.e., the output of unit i.

26 Training rule for output unit weights
For an output unit k: Δω_kj = η δ_k x_kj, where the error term is δ_k = o_k(1 − o_k)(t_k − o_k).

27 Training Rule for Hidden Unit Weights
For a hidden unit h: Δω_hi = η δ_h x_hi, where the error term is δ_h = o_h(1 − o_h) Σ_{k∈outputs} ω_kh δ_k; the output-unit error terms δ_k are propagated back through the weights ω_kh.

28 Backpropagation Algorithm
Initialize all weights to small random numbers. Until the termination condition is met, Do: for each training example, Do:
//Propagate the input forward
1. Input the training example to the network and compute the network outputs.
//Propagate the errors backward
2. For each output unit k: δ_k ← o_k(1 − o_k)(t_k − o_k).
3. For each hidden unit h: δ_h ← o_h(1 − o_h) Σ_{k∈outputs} ω_kh δ_k.
4. Update each network weight: ω_ji ← ω_ji + Δω_ji, where Δω_ji = η δ_j x_ji.
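A compact sketch of this algorithm for a network with one hidden layer of sigmoid units, following the slide's four steps; the sizes, eta, and the nested-list weight layout are illustrative choices, not from the slides:

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def backprop(examples, n_in, n_hidden, n_out, eta=0.1, epochs=1000):
    # w_hidden[j][i]: weight from input i to hidden unit j (index 0 is the bias, x0 = 1)
    w_hidden = [[random.uniform(-0.05, 0.05) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    # w_out[k][j]: weight from hidden unit j to output unit k (index 0 is the bias)
    w_out = [[random.uniform(-0.05, 0.05) for _ in range(n_hidden + 1)]
             for _ in range(n_out)]

    for _ in range(epochs):
        for x, t in examples:          # x: n_in inputs, t: n_out target values
            xi = [1.0] + list(x)
            # 1. Propagate the input forward
            h = [1.0] + [sigmoid(sum(w * v for w, v in zip(wh, xi))) for wh in w_hidden]
            o = [sigmoid(sum(w * v for w, v in zip(wo, h))) for wo in w_out]
            # 2. delta_k = o_k (1 - o_k)(t_k - o_k) for each output unit
            delta_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            # 3. delta_h = o_h (1 - o_h) * sum_k w_kh delta_k for each hidden unit
            delta_h = [h[j + 1] * (1 - h[j + 1]) *
                       sum(w_out[k][j + 1] * delta_o[k] for k in range(n_out))
                       for j in range(n_hidden)]
            # 4. w_ji <- w_ji + eta * delta_j * x_ji
            for k in range(n_out):
                w_out[k] = [w + eta * delta_o[k] * v for w, v in zip(w_out[k], h)]
            for j in range(n_hidden):
                w_hidden[j] = [w + eta * delta_h[j] * v for w, v in zip(w_hidden[j], xi)]
    return w_hidden, w_out
```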

29 Hidden Layer Representations
(Slides 29–33 consist of figures only, illustrating the internal encodings learned at the hidden layer.)

34 Convergence and local minima
Backpropagation converges to some local minimum and not necessarily to the global minimum error. Remedies: use stochastic gradient descent rather than standard gradient descent; since initialization influences convergence, train multiple networks with different initial random weights over the same data, then select the best one. Training can take thousands of iterations, so it is slow. Initializing the weights near zero makes the initial network nearly linear; increasingly nonlinear functions become possible as training progresses. Adding a momentum term speeds convergence (a sketch of the momentum update follows below).
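A minimal sketch of a single weight update with a momentum term, Δω_ji(n) = η δ_j x_ji + α Δω_ji(n−1); the function name and the values of eta and alpha are illustrative:

```python
def momentum_update(w, grad_term, prev_delta, eta=0.1, alpha=0.9):
    """grad_term is delta_j * x_ji; prev_delta is the weight change from the previous step."""
    delta = eta * grad_term + alpha * prev_delta
    return w + delta, delta   # new weight, plus the delta to remember for the next step
```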

35 Expressive Capabilities of ANNs
Every boolean function can be represented by a network with a single hidden layer. Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer. Any function can be approximated to arbitrary accuracy by a network with two hidden layers. A network with more hidden layers may achieve higher precision; however, the possibility of converging to a local minimum increases as well.

36 When to Consider Neural Networks
Input is high-dimensional, discrete or real-valued; output is discrete or real-valued; output is a vector of values; the data are possibly noisy; the form of the target function is unknown; human readability of the result is unimportant.

37 Overfitting in ANNs

38 Strategies to avoid overfitting
A poor strategy: continue training until the error on the training set falls below some threshold. A good indicator: the number of iterations that produces the lowest error over a separate validation set. Keep the weights with the lowest validation error; once the currently trained weights reach a significantly higher error over the validation set than the stored weights, terminate.
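A minimal sketch of this validation-set strategy: keep the weights from the iteration with the lowest validation error and stop once the current validation error stays worse than the stored one. The helpers train_one_epoch and validation_error, and the patience parameter, are hypothetical placeholders:

```python
import copy

def train_with_early_stopping(weights, train_one_epoch, validation_error,
                              patience=10, max_epochs=10000):
    best_weights, best_err, worse_epochs = copy.deepcopy(weights), float("inf"), 0
    for _ in range(max_epochs):
        weights = train_one_epoch(weights)
        err = validation_error(weights)
        if err < best_err:
            best_weights, best_err, worse_epochs = copy.deepcopy(weights), err, 0
        else:
            worse_epochs += 1
            if worse_epochs >= patience:   # persistently worse than the stored weights -> stop
                break
    return best_weights
```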

39 Alternative Error Functions

40 Recurrent Networks
(Slides 40–41 consist of figures only, showing recurrent network architectures.)
42 Thank you !

