1/11 Design and Training of Neural Networks. Slide from Dr. M. Pomplun.


1 1/11 Design and Training of Neural Networks. Slide from Dr. M. Pomplun

2 2 In supervised learning: We train an ANN with a set of vector pairs, so-called exemplars. Each pair (x, y) consists of an input vector x and a corresponding output vector y. Whenever the network receives input x, we would like it to provide output y. The exemplars thus describe the function that we want to “teach” our network. Besides learning the exemplars, we would like our network to generalize, that is, give plausible output for inputs that the network had not been trained with.

3 3 Classification: Neural networks have been used successfully in a large number of practical classification tasks, such as the following: recognizing printed or handwritten characters; classifying loan applications into credit-worthy and non-credit-worthy groups; analyzing sonar and radar data to determine the nature of the source of a signal.

4 4 Function approximation: There is a tradeoff between a network’s ability to precisely learn the given exemplars and its ability to generalize (i.e., interpolate and extrapolate). This problem is similar to fitting a function to a given set of data points. Let us assume that you want to find a fitting function f: R → R for a set of three data points. You try to do this with polynomials of degree one (a straight line), two, and nine. [Figure: f(x) plotted against x, showing the fitted polynomials of degree 1, degree 2, and degree 9.] Obviously, the polynomial of degree 2 provides the most plausible fit.

5 5 The same principle applies to ANNs: If an ANN has too few neurons, it may not have enough degrees of freedom to precisely approximate the desired function. If an ANN has too many neurons, it will learn the exemplars perfectly, but its additional degrees of freedom may cause it to show implausible behavior for untrained inputs; it then presents poor ability of generalization. Unfortunately, there are no known equations that could tell you the optimal size of your network for a given application; there are only heuristics.

6 6 Evaluation of networks: Basic idea: define an error function and measure the error for untrained data (the testing set). Typical error measures compare d, the desired output, with o, the actual output; one example is the root-mean-square error.
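The formulas on this slide did not survive in the transcript. As a minimal sketch, assuming the standard root-mean-square definition (the square root of the mean squared difference between desired and actual outputs), the test-set error could be computed as follows. The function name `rmse` and the sample values are illustrative, not taken from the slides.

```python
import math

def rmse(desired, actual):
    """Root-mean-square error between desired outputs d and actual outputs o."""
    if len(desired) != len(actual) or not desired:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(d - o) ** 2 for d, o in zip(desired, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical desired and actual outputs for three test samples.
test_error = rmse([1.0, 0.0, 1.0], [0.9, 0.2, 0.7])
print(test_error)
```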

7 7 Data Representation: All networks process one of two types of signal components: analog (continuously variable) signals or discrete (quantized) signals. In both cases, signals have a finite amplitude; their amplitude has a minimum and a maximum value. [Figure: an analog signal and a discrete signal, each bounded by min and max.] Slide from Dr. M. Pomplun

8 8 Data Representation: The main question is: How can we appropriately capture these signals and represent them as pattern vectors that we can feed into the network? We should aim for a data representation scheme that maximizes the ability of the network to detect (and respond to) relevant features in the input pattern. Relevant features are those that enable the network to generate the desired output pattern. Similarly, we also need to define a set of desired outputs that the network can actually produce. We are going to consider internal representation and external interpretation issues as well as specific methods for creating appropriate representations. Slide from Dr. M. Pomplun

9 9 Internal Representation Issues: As we said before, in all network types, the amplitude of input signals and internal signals is limited: analog networks: values usually between 0 and 1; binary networks: only values 0 and 1 allowed; bipolar networks: only values -1 and 1 allowed. Without this limitation, patterns with large amplitudes would dominate the network’s behavior. A disproportionately large input signal can activate a neuron even if the relevant connection weight is very small. Slide from Dr. M. Pomplun

10 10 External Interpretation Issues: Without any interpretation, we can only use standard methods to define the difference (or similarity) between signals. For example, for binary patterns x and y, we could treat them as binary numbers and compute their difference as | x - y |; treat them as vectors and use the cosine of the angle between them as a measure of similarity; or count the number of digits that we would have to flip in order to transform x into y (Hamming distance). Slide from Dr. M. Pomplun
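The three measures named on this slide can be sketched in a few lines. This is an illustrative example only; the function names and the sample patterns are assumptions, and the patterns are represented as lists of 0/1 integers.

```python
import math

def numeric_difference(x, y):
    """Treat both bit lists as binary numbers and return |x - y|."""
    def to_int(bits):
        return int("".join(str(b) for b in bits), 2)
    return abs(to_int(x) - to_int(y))

def cosine_similarity(x, y):
    """Cosine of the angle between x and y (undefined for an all-zero pattern)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

def hamming_distance(x, y):
    """Number of positions in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

x = [1, 0, 0, 0]  # binary 1000 = 8
y = [0, 1, 1, 1]  # binary 0111 = 7
print(numeric_difference(x, y), cosine_similarity(x, y), hamming_distance(x, y))
```

For this pair the measures disagree sharply: the patterns are adjacent as numbers, yet orthogonal as vectors and different in every digit, which is why the choice of interpretation matters.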

11 11 Creating Data Representation: The patterns that can be represented by an ANN most easily are binary patterns. Even analog networks “like” to receive and produce binary patterns: we can simply round values < 0.5 to 0 and values ≥ 0.5 to 1. To create a binary input vector, we can simply list all features that are relevant to the current task. Each component of our binary vector indicates whether one particular feature is present (1) or absent (0). Slide from Dr. M. Pomplun

12 12 Creating Data Representation With regard to output patterns, most binary-data applications perform classification of their inputs. The output of such a network indicates to which class of patterns the current input belongs. Usually, each output neuron is associated with one class of patterns. As you already know, for any input, only one output neuron should be active (1) and the others inactive (0), indicating the class of the current input. Slide from Dr. M. Pomplun
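A minimal sketch of the one-output-neuron-per-class scheme, combined with rounding at 0.5 as described for analog networks. The helper names `one_hot` and `decode` are hypothetical, and returning `None` when the output is ambiguous is a design choice made for this example.

```python
def one_hot(class_index, num_classes):
    """Desired output pattern: 1 for the target class, 0 for all others."""
    pattern = [0] * num_classes
    pattern[class_index] = 1
    return pattern

def decode(outputs):
    """Round analog outputs at 0.5; return the class index if exactly one
    output neuron is active, otherwise None (no clear classification)."""
    active = [i for i, o in enumerate(outputs) if o >= 0.5]
    return active[0] if len(active) == 1 else None

print(one_hot(2, 4))            # desired pattern for class 2 of 4
print(decode([0.1, 0.8, 0.3]))  # only the second output neuron is active
```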

13 13 Creating Data Representation: Similarly, instead of nine output units, four would suffice, using the following output patterns to indicate a square: 0000 0001 0010 / 0100 0101 0110 / 1000 1001 1010. Slide from Dr. M. Pomplun

14 14 Creating Data Representation: The problem with such representations is that the meaning of the output of one neuron depends on the output of other neurons. This means that each neuron does not represent (detect) a certain feature, but groups of neurons do. In general, such functions are much more difficult to learn. Such networks usually need more hidden neurons and longer training, and their ability to generalize is weaker than for the one-neuron-per-feature-value networks. Slide from Dr. M. Pomplun

15 15 Creating Data Representation: Another way of representing n-ary data in a neural network is using one neuron per feature, but scaling the (analog) value to indicate the degree to which a feature is present. Good examples: the brightness of a pixel in an input image; the distance between a robot and an obstacle. Poor examples: the letter (1 to 26) of a word; the type (1 to 6) of a chess piece. Slide from Dr. M. Pomplun

16 16 Creating Data Representation: Therefore, it is appropriate to represent a non-binary feature by a single analog input value only if this value is scaled, i.e., it represents the degree to which a feature is present. This is the case for the brightness of a pixel or the output of a distance sensor (feature = obstacle proximity). It is not the case for letters or chess pieces. For example, assigning values to individual letters (a = 0, b = 0.04, c = 0.08, …, z = 1) implies that a and b are in some way more similar to each other than are a and z. Obviously, in most contexts, this is not a reasonable assumption. Slide from Dr. M. Pomplun

17 17 Exemplar Analysis When building a neural network application, we must make sure that we choose an appropriate set of exemplars (training data): The entire problem space must be covered. There must be no inconsistencies (contradictions) in the data. We must be able to correct such problems without compromising the effectiveness of the network. Slide from Dr. M. Pomplun

18 18 Training and Performance Evaluation: How many samples should be used for training? Heuristic: At least 5-10 times as many samples as there are weights in the network. Formula (Baum & Haussler, 1989): P = |W| / (1 - a), where P is the number of samples, |W| is the number of weights to be trained, and ‘a’ is the desired accuracy (e.g., proportion of correctly classified samples). Slide from Dr. M. Pomplun
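The formula itself is missing from the transcript. The sketch below assumes the form in which the Baum & Haussler rule is commonly quoted, P = |W| / (1 - a); treat that form, and the function name `samples_needed`, as assumptions rather than as a quotation from the slide.

```python
def samples_needed(num_weights, accuracy):
    """Estimated number of training samples P for |W| weights and a desired
    accuracy a, assuming the commonly quoted form P = |W| / (1 - a)."""
    if not 0 <= accuracy < 1:
        raise ValueError("accuracy must be in [0, 1)")
    return num_weights / (1 - accuracy)

# With a desired accuracy of 0.9 the estimate is about ten samples per weight,
# which matches the upper end of the 5-10x heuristic.
print(samples_needed(100, 0.9))
```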

19 19 Training and Performance Evaluation: What learning rate η should we choose? The problems that arise when η is too small or too big are similar to those for the Adaline. Unfortunately, the optimal value of η depends entirely on the application. Values between 0.1 and 0.9 are typical for most applications. Often, η is initially set to a large value and is decreased during the learning process. This leads to better convergence of learning and also decreases the likelihood of “getting stuck” in a local error minimum at an early learning stage. Slide from Dr. M. Pomplun
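One way to start with a large learning rate and decrease it during training is an exponential decay schedule. The slide does not say which schedule is meant, so the schedule, the function name `learning_rate`, and the `decay` factor below are all assumptions; only the 0.1 to 0.9 range comes from the slide.

```python
def learning_rate(epoch, eta_start=0.9, eta_end=0.1, decay=0.95):
    """Exponentially decay the learning rate from eta_start toward eta_end.

    The schedule and the decay factor are illustrative assumptions."""
    return eta_end + (eta_start - eta_end) * decay ** epoch

for epoch in (0, 10, 50, 200):
    print(epoch, round(learning_rate(epoch), 4))
```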

20 20 Training and Performance Evaluation: When training a BPN, what is the acceptable error, i.e., when do we stop the training? The minimum error that can be achieved depends not only on the network parameters, but also on the specific training set. Thus, for some applications the minimum error will be higher than for others. Slide from Dr. M. Pomplun

21 21 Training and Performance Evaluation: An insightful way of performance evaluation is partial-set training. The idea is to split the available data into two sets: the training set and the test set. The network’s performance on the second set indicates how well the network has actually learned the desired mapping. We should expect the network to interpolate, but not extrapolate. Therefore, this test also evaluates our choice of training samples. Slide from Dr. M. Pomplun
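Partial-set training can be sketched as a random split of the available exemplars. The 20% test fraction, the fixed seed, and the function name `split_exemplars` are assumptions made for this example; the slide does not prescribe a split ratio.

```python
import random

def split_exemplars(exemplars, test_fraction=0.2, seed=0):
    """Shuffle the exemplars and return (training_set, test_set)."""
    shuffled = list(exemplars)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical (input, output) pairs standing in for real exemplars.
exemplars = [(i, 2 * i) for i in range(10)]
training_set, test_set = split_exemplars(exemplars)
print(len(training_set), len(test_set))
```

The network is trained on `training_set` only; its error on `test_set` then estimates how well it handles inputs it was not trained with.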

22 22 Some examples: Predicting the Weather. Let us study an interesting neural network application. Its purpose is to predict the local weather based on a set of current weather data: temperature (degrees Celsius); atmospheric pressure (inches of mercury); relative humidity (percentage of saturation); wind speed (kilometers per hour); wind direction (N, NE, E, SE, S, SW, W, or NW); cloud cover (0 = clear … 9 = total overcast); weather condition (rain, hail, thunderstorm, …). Slide from Dr. M. Pomplun

23 23 We assume that we have access to the same data from several surrounding weather stations. There are 8 such stations that surround our position. How should we format the input patterns? We need to represent the current weather conditions by an input vector whose elements range in magnitude between zero and one. When we inspect the raw data, we find that there are two types of data that we have to account for: scaled, continuously variable values; and n-ary representations of category values. Slide from Dr. M. Pomplun

24 24 The following data can be scaled: temperature (-10 … 40 degrees Celsius); atmospheric pressure (26 … 34 inches of mercury); relative humidity (0 … 100 percent); wind speed (0 … 250 km/h); cloud cover (0 … 9). We can just scale each of these values so that its lower limit is mapped to some ε and its upper limit is mapped to (1 - ε). These numbers will be the components of the input vector. Slide from Dr. M. Pomplun
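The scaling step can be sketched as a linear map from the raw range onto the interval between a small offset and one minus that offset. The offset value 0.05, the parameter name `eps`, and the clipping of out-of-range readings are assumptions; the slide leaves the offset unspecified.

```python
def scale(value, lo, hi, eps=0.05):
    """Map value from [lo, hi] linearly onto [eps, 1 - eps].

    Readings outside [lo, hi] are clipped first (an assumption)."""
    clipped = min(max(value, lo), hi)
    return eps + (1 - 2 * eps) * (clipped - lo) / (hi - lo)

# Ranges taken from the slide; the readings themselves are made up.
inputs = [
    scale(15, -10, 40),   # temperature in degrees Celsius
    scale(30, 26, 34),    # atmospheric pressure in inches of mercury
    scale(80, 0, 100),    # relative humidity in percent
    scale(3, 0, 9),       # cloud cover
]
print(inputs)
```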

25 25 Usually, wind speeds vary between 0 and 40 km/h. By scaling wind speed between 0 and 250 km/h, we can account for all possible wind speeds, but usually only make use of a small fraction of the scale. Therefore, only the most extreme wind speeds will exert a substantial effect on the weather prediction. Consequently, we will use two scaled input values: wind speed ranging from 0 to 40 km/h; and wind speed ranging from 40 to 250 km/h. Slide from Dr. M. Pomplun
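A possible reading of the two-input scheme is sketched below: the first input covers 0 to 40 km/h and saturates above that, while the second stays at its lower limit until the speed exceeds 40 km/h. The saturating behavior, the offset `eps`, and the function names are assumptions; the slide gives only the two ranges.

```python
def scale(value, lo, hi, eps=0.05):
    """Clip value to [lo, hi] and map it linearly onto [eps, 1 - eps]."""
    clipped = min(max(value, lo), hi)
    return eps + (1 - 2 * eps) * (clipped - lo) / (hi - lo)

def wind_speed_inputs(kmh):
    """Return the two wind-speed inputs: the usual range and the extreme range."""
    return [scale(kmh, 0, 40), scale(kmh, 40, 250)]

print(wind_speed_inputs(20))   # ordinary wind: only the first input varies
print(wind_speed_inputs(145))  # storm: first input saturated, second in mid-range
```

With this split, everyday wind speeds use the full resolution of one input instead of a small fraction of a single 0 to 250 km/h scale.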

26 26 How about the non-scalable weather data? Wind direction is represented by an eight-component vector, where only one element (or possibly two adjacent ones) is active, indicating one out of eight wind directions. The subjective weather condition is represented by a nine-component vector with at least one, and possibly more, active elements. With this scheme, we can encode the current conditions at a given weather station with 23 vector components: one for each of the four scaled parameters; two for wind speed; eight for wind direction; nine for the subjective weather condition. Slide from Dr. M. Pomplun
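The wind-direction encoding and the component count can be sketched as follows. The ordering of the directions in the vector and the function name are assumptions; the nine weather-condition categories are not encoded here because the slides do not list them all.

```python
DIRECTIONS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

def encode_wind_direction(*active):
    """Eight-component binary vector with a 1 for each active direction."""
    unknown = set(active) - set(DIRECTIONS)
    if unknown:
        raise ValueError(f"unknown direction(s): {sorted(unknown)}")
    return [1 if d in active else 0 for d in DIRECTIONS]

# Component count per station, as listed on the slide.
components_per_station = 4 + 2 + 8 + 9

print(encode_wind_direction("NE"))       # a single direction
print(encode_wind_direction("N", "NE"))  # two adjacent directions
print(components_per_station)
```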

27 27 Since the input does not only include our station, but also the eight surrounding ones, the input layer of the network looks like this: [Figure: input layer made up of one block of inputs per station: our station, north, …, northwest.] The network has 207 input neurons, which accept 207-component input vectors. Slide from Dr. M. Pomplun

28 28 What should the output patterns look like? We want the network to produce a set of indicators that we can interpret as a prediction of the weather 24 hours from now. In analogy to the weather forecast on the evening news, we decide to demand the following four indicators: a temperature prediction; a prediction of the chance of precipitation occurring; an indication of the expected cloud cover; a storm indicator (extreme conditions warning). Slide from Dr. M. Pomplun

29 29 Each of these four indicators can be represented by one scaled output value: temperature (-10 … 40 degrees Celsius); chance of precipitation (0% … 100%); cloud cover (0 … 9); storm warning, with two possibilities: either a binary value (0: no storm warning; 1: storm warning) or the probability of a serious storm (0% … 100%). Of course, the actual network outputs range from ε to (1 - ε), and after their computation, if necessary, they are scaled to match the ranges specified above. Slide from Dr. M. Pomplun
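Mapping the network outputs back to physical ranges is the inverse of the input scaling. The sketch below assumes a linear mapping with the same small offset at both ends; the offset value 0.05, the name `unscale`, and the sample outputs are illustrative.

```python
def unscale(output, lo, hi, eps=0.05):
    """Map a network output from [eps, 1 - eps] back onto [lo, hi]."""
    return lo + (hi - lo) * (output - eps) / (1 - 2 * eps)

# Hypothetical raw outputs of the four output neurons.
raw = [0.5, 0.23, 0.95, 0.05]
forecast = {
    "temperature_celsius": unscale(raw[0], -10, 40),
    "precipitation_percent": unscale(raw[1], 0, 100),
    "cloud_cover": unscale(raw[2], 0, 9),
    "storm_probability_percent": unscale(raw[3], 0, 100),
}
print(forecast)
```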

30 30 We decide (or experimentally determine) to use a hidden layer with 42 sigmoidal neurons. In summary, our network has 207 input neurons, 42 hidden neurons, and 4 output neurons. Because of the small output vectors, 42 hidden units may suffice for this application. Slide from Dr. M. Pomplun

31 31 The next thing we need to do is to collect the training exemplars. First we have to specify what our network is supposed to do: In production mode, the network is fed with the current weather conditions, and its output will be interpreted as the weather forecast for tomorrow. Therefore, in training mode, we have to present the network with exemplars that associate known past weather conditions at a time t - 24 hrs with the conditions at time t. So we have to collect a set of historical exemplars with known correct output for every input. Slide from Dr. M. Pomplun

32 32 Obviously, if such data is unavailable, we have to start collecting it. The selection of exemplars that we need depends, among other factors, on how much the weather changes at our location. And how about the granularity of our exemplar data, i.e., the frequency of measurement? Using one sample per day would be a natural choice, but it would neglect rapid changes in weather. If we use hourly instantaneous samples, however, we increase the likelihood of conflicts. Slide from Dr. M. Pomplun

33 33 Therefore, we decide to do the following: We will collect input data every hour, but the corresponding output pattern will be the average of the instantaneous patterns over a 12-hour period. This way we reduce the possibility of errors while increasing the amount of training data. Now we have to train our network. If we use samples in one-hour intervals for one year, we have 8,760 exemplars. Our network has 207 × 42 + 42 × 4 = 8,862 weights, which means that data from ten years, i.e., 87,600 exemplars, would be desirable (rule of thumb). Slide from Dr. M. Pomplun
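The arithmetic on this slide can be checked directly. The variable names are illustrative; note that, as on the slide, bias weights are not counted and a year is taken as 365 days.

```python
n_inputs, n_hidden, n_outputs = 207, 42, 4

# Weights between input and hidden layer plus hidden and output layer.
weights = n_inputs * n_hidden + n_hidden * n_outputs

exemplars_per_year = 24 * 365            # one exemplar per hour
exemplars_ten_years = 10 * exemplars_per_year

# Samples per weight, to compare against the 5-10x rule of thumb.
ratio_one_year = exemplars_per_year / weights
ratio_ten_years = exemplars_ten_years / weights

print(weights, exemplars_per_year, exemplars_ten_years)
print(round(ratio_one_year, 2), round(ratio_ten_years, 2))
```

One year of hourly data gives roughly one exemplar per weight, so about ten years are needed to reach the upper end of the heuristic.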

34 34 Since with a large number of samples the hold-one-out training method is very time consuming, we decide to use partial-set training instead. The best way to do this would be to acquire a test set (control set), that is, another set of input-output pairs measured on random days and at random times. After training the network with the 87,600 exemplars, we could then use the test set to evaluate the performance of our network. Slide from Dr. M. Pomplun

35 35 Neural network troubleshooting: Plot the global error as a function of the training epoch. The error should decrease after every epoch. If it oscillates, do the following tests. Try reducing the size of the training set. If the network then converges, a conflict may exist in the exemplars. If the network still does not converge, continue pruning the training set until it does converge. Then add exemplars back gradually, thereby detecting the ones that cause conflicts. If this still does not work, look for saturated neurons (extreme weights) in the hidden layer. If you find those, add more hidden-layer neurons, possibly an extra 20%. If there are no saturated units and the problems still exist, try lowering the learning parameter η and training longer. Slide from Dr. M. Pomplun
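The first check in this procedure, whether the global error decreases after every epoch, can be automated on the recorded error curve. This is a minimal sketch; the function name `error_oscillates` and the sample error values are hypothetical, and a practical check might tolerate small increases caused by noise.

```python
def error_oscillates(errors):
    """Return True if the global error rises between any two consecutive epochs."""
    return any(later > earlier for earlier, later in zip(errors, errors[1:]))

# Hypothetical global error recorded after each training epoch.
converging = [0.90, 0.61, 0.40, 0.31, 0.27]
oscillating = [0.90, 0.61, 0.74, 0.52, 0.66]

print(error_oscillates(converging), error_oscillates(oscillating))
```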

36 36 If the network converges but does not accurately learn the desired function, evaluate the coverage of the training set. If the coverage is adequate and the network still does not learn the function precisely, you could refine the pattern representation. For example, you could add a season indicator to the input, helping the network to discriminate between similar inputs that produce very different outputs. Slide from Dr. M. Pomplun

