Design and Training of Neural Networks (slides from Dr. M. Pomplun)


1 Design and Training of Neural Networks. Slide from Dr. M. Pomplun

2 In supervised learning: We train an ANN with a set of vector pairs, so-called exemplars. Each pair (x, y) consists of an input vector x and a corresponding output vector y. Whenever the network receives input x, we would like it to provide output y. The exemplars thus describe the function that we want to “teach” our network. Besides learning the exemplars, we would like our network to generalize, that is, give plausible output for inputs that the network had not been trained with.

3 Classification Neural networks have been used successfully in a large number of practical classification tasks, such as recognizing printed or handwritten characters, classifying loan applications into credit-worthy and non-credit-worthy groups, and analyzing sonar and radar data to determine the nature of the source of a signal.

4 Function Approximation There is a tradeoff between a network’s ability to precisely learn the given exemplars and its ability to generalize (i.e., inter- and extrapolate). This problem is similar to fitting a function to a given set of data points. Let us assume that you want to find a fitting function f: R → R for a set of three data points. You try to do this with polynomials of degree one (a straight line), two, and nine. (Figure: the degree-1, degree-2, and degree-9 fits plotted through the data points.) Obviously, the polynomial of degree 2 provides the most plausible fit.
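This under- vs. overfitting behavior is easy to reproduce. The following Python sketch (an illustration of the idea, not code from the slides) fits polynomials of degree 1, 2, and 9 to a handful of noisy samples and measures how well each fit predicts unseen points; exact numbers depend on the noise, but the high-degree fit typically matches the training data almost perfectly while generalizing worst.

```python
import numpy as np

# Illustrative sketch: under- vs. overfitting with polynomials of
# increasing degree (assumed setup: noisy samples of a sine curve).
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, x_train.size)

x_test = np.linspace(0.0, 1.0, 100)          # "untrained" inputs
y_test = np.sin(2 * np.pi * x_test)          # true values at those inputs

for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares fit
    y_pred = np.polyval(coeffs, x_test)
    rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
    print(f"degree {degree}: test RMSE = {rmse:.3f}")
```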

5 The same principle applies to ANNs: If an ANN has too few neurons, it may not have enough degrees of freedom to precisely approximate the desired function. If an ANN has too many neurons, it will learn the exemplars perfectly, but its additional degrees of freedom may cause it to show implausible behavior for untrained inputs; it then generalizes poorly. Unfortunately, there are no known equations that could tell you the optimal size of your network for a given application; there are only heuristics.

6 Evaluation of networks Basic idea: define an error function and measure the error on untrained data (the testing set). A typical choice is the summed squared error E = Σ_i (d_i - o_i)^2, where d is the desired output and o is the actual output. The corresponding root-mean-square error is E_RMS = sqrt( (1/n) Σ_i (d_i - o_i)^2 ), with n the number of output values in the test set.
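As a minimal sketch of this evaluation step in Python (the `predict` callable and the dummy example below are placeholders for whatever trained network is being tested, not something defined in the slides):

```python
import numpy as np

def evaluate(predict, test_inputs, test_targets):
    """Summed squared error and RMSE of a network on a held-out test set."""
    outputs = np.array([predict(x) for x in test_inputs])
    targets = np.asarray(test_targets, dtype=float)
    errors = targets - outputs              # d - o for every output unit
    sse = float(np.sum(errors ** 2))        # summed squared error
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    return sse, rmse

# Example with a dummy "network" that always outputs 0.5:
dummy_predict = lambda x: np.full(2, 0.5)
inputs = [np.zeros(3)] * 4
targets = [[0.0, 1.0]] * 4
print(evaluate(dummy_predict, inputs, targets))   # (2.0, 0.5)
```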

7 Data Representation All networks process one of two types of signal components: analog (continuously variable) signals or discrete (quantized) signals. In both cases, signals have a finite amplitude; their amplitude has a minimum and a maximum value. (Figure: an analog and a discrete signal, each bounded between a minimum and a maximum amplitude.) Slide from Dr. M. Pomplun

8 Data Representation The main question is: How can we appropriately capture these signals and represent them as pattern vectors that we can feed into the network? We should aim for a data representation scheme that maximizes the ability of the network to detect (and respond to) relevant features in the input pattern. Relevant features are those that enable the network to generate the desired output pattern. Similarly, we also need to define a set of desired outputs that the network can actually produce. We are going to consider internal representation and external interpretation issues as well as specific methods for creating appropriate representations. Slide from Dr. M. Pomplun

9 Internal Representation Issues As we said before, in all network types the amplitude of input signals and internal signals is limited: analog networks allow values usually between 0 and 1; binary networks allow only the values 0 and 1; bipolar networks allow only the values –1 and 1. Without this limitation, patterns with large amplitudes would dominate the network’s behavior. A disproportionately large input signal can activate a neuron even if the relevant connection weight is very small. Slide from Dr. M. Pomplun

10 External Interpretation Issues Without any interpretation, we can only use standard methods to define the difference (or similarity) between signals. For example, for binary patterns x and y, we could treat them as binary numbers and compute their difference as |x - y|; treat them as vectors and use the cosine of the angle between them as a measure of similarity; or count the number of digits that we would have to flip in order to transform x into y (the Hamming distance). Slide from Dr. M. Pomplun
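For concreteness, here is a short Python sketch of the three measures applied to two made-up binary patterns (the patterns themselves are arbitrary examples):

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y = np.array([1, 1, 1, 0, 0, 0, 1, 1])

# 1. Treat the patterns as binary numbers and take the numeric difference.
as_int = lambda bits: int("".join(str(b) for b in bits), 2)
numeric_diff = abs(as_int(x) - as_int(y))

# 2. Treat them as vectors; the cosine of the angle measures similarity.
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# 3. Count the digits that must be flipped to turn x into y (Hamming distance).
hamming = int(np.sum(x != y))

print(numeric_diff, round(float(cosine), 3), hamming)
```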

11 Creating Data Representation The patterns that can be represented by an ANN most easily are binary patterns. Even analog networks “like” to receive and produce binary patterns: we can simply round values < 0.5 to 0 and values ≥ 0.5 to 1. To create a binary input vector, we can simply list all features that are relevant to the current task. Each component of our binary vector indicates whether one particular feature is present (1) or absent (0). Slide from Dr. M. Pomplun

12 Creating Data Representation With regard to output patterns, most binary-data applications perform classification of their inputs. The output of such a network indicates to which class of patterns the current input belongs. Usually, each output neuron is associated with one class of patterns. As you already know, for any input, only one output neuron should be active (1) and the others inactive (0), indicating the class of the current input. Slide from Dr. M. Pomplun

13 Creating Data Representation In other cases, classes are not mutually exclusive, and more than one output neuron can be active at the same time. Another variant would be the use of binary input patterns and analog output patterns for “classification”. In that case, again, each output neuron corresponds to one particular class, and its activation indicates the probability (between 0 and 1) that the current input belongs to that class. Slide from Dr. M. Pomplun
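A short Python sketch of the two readout conventions described above (the class names, activations, and the 0.5 threshold are illustrative assumptions, not values from the slides):

```python
import numpy as np

classes = ["class A", "class B", "class C"]
outputs = np.array([0.08, 0.91, 0.55])   # hypothetical output activations

# Mutually exclusive classes: the single most active output neuron wins.
winner = classes[int(np.argmax(outputs))]

# Non-exclusive classes: every sufficiently active neuron signals its class,
# reading each activation as a rough class probability between 0 and 1.
active = [c for c, o in zip(classes, outputs) if o >= 0.5]

print(winner, active)
```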

14 Creating Data Representation Ternary (and n-ary) patterns can cause more problems than binary patterns when we want to format them for an ANN. For example, imagine the tic-tac-toe game. Each square of the board is in one of three different states: occupied by an X, occupied by an O, or empty. Let us now assume that we want to develop a network that plays tic-tac-toe. This network is supposed to receive the current game configuration as its input. Its output is the position where the network wants to place its next symbol (X or O). Obviously, it is impossible to represent the state of each square by a single binary value. Slide from Dr. M. Pomplun

15 Creating Data Representation Possible solution: Use multiple binary inputs to represent non-binary states. Treat each feature in the pattern as an individual subpattern. Represent each subpattern with as many positions (units) in the pattern vector as there are possible states for the feature. Then concatenate all subpatterns into one long pattern vector. Slide from Dr. M. Pomplun

16 Creating Data Representation Example: X is represented by the subpattern 100, O is represented by the subpattern 010, and the empty square is represented by the subpattern 001. The squares of the game board are enumerated row by row, from 1 (top left) to 9 (bottom right). Slide from Dr. M. Pomplun

17 Creating Data Representation Then consider the following board configuration (top row X X empty, middle row O O X, bottom row empty O empty). Listing the nine squares in order, it would be represented by the binary string 100 100 001 010 010 100 001 010 001. Consequently, our network would need a layer of 27 input units. Slide from Dr. M. Pomplun
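The one-unit-per-state encoding is straightforward to write down in code. Here is a small Python sketch (my own illustration) that turns a board, given row by row, into the 27-component input vector using the subpatterns defined above:

```python
# One-hot (one unit per state) encoding of a tic-tac-toe board.
SUBPATTERN = {"X": [1, 0, 0], "O": [0, 1, 0], " ": [0, 0, 1]}

def encode_board(rows):
    """rows: three 3-character strings, e.g. ["XX ", "OOX", " O "]."""
    vector = []
    for row in rows:
        for square in row:
            vector.extend(SUBPATTERN[square])   # 3 units per square
    return vector                               # 9 squares * 3 = 27 units

board = ["XX ", "OOX", " O "]                   # the configuration above
print(encode_board(board))
print(len(encode_board(board)))                 # 27
```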

18 Creating Data Representation And what would the output layer look like? Well, applying the same principle as for the input, we would use nine units to represent the nine possible output positions. Using the same enumeration scheme as before, our output layer would have nine neurons, one for each position. To place a symbol in a particular square, the corresponding neuron, and no other neuron, would fire (1). Slide from Dr. M. Pomplun

19 Creating Data Representation But… Would it not lead to a smaller, simpler network if we used a shorter encoding of the non-binary states? We do not need 3-digit strings such as 100, 010, and 001 to represent X, O, and the empty square, respectively. We can achieve a unique representation with 2-digit strings such as 10, 01, and 00. Similarly, instead of nine output units, four would suffice, since four binary digits are enough to encode the nine square numbers. Slide from Dr. M. Pomplun

20 Creating Data Representation The problem with such representations is that the meaning of the output of one neuron depends on the output of other neurons. This means that each individual neuron does not represent (detect) a certain feature, but groups of neurons do. In general, such functions are much more difficult to learn. Such networks usually need more hidden neurons and longer training, and their ability to generalize is weaker than for the one-neuron-per-feature-value networks. Slide from Dr. M. Pomplun

21 Creating Data Representation On the other hand, sets of orthogonal vectors (such as 100, 010, 001) can be processed by the network more easily. This becomes clear when we consider that a neuron’s net input signal is computed as the inner product of the input and weight vectors. The geometric interpretation of these vectors shows that orthogonal vectors are especially easy to discriminate for a single neuron. Slide from Dr. M. Pomplun

22 Creating Data Representation Another way of representing n-ary data in a neural network is to use one neuron per feature, but to scale the (analog) value to indicate the degree to which the feature is present. Good examples: the brightness of a pixel in an input image, or the distance between a robot and an obstacle. Poor examples: the letter (1 – 26) of a word, or the type (1 – 6) of a chess piece. Slide from Dr. M. Pomplun

23 Creating Data Representation This can be explained as follows: The way NNs work (both biological and artificial ones) is that each neuron represents the presence/absence of a particular feature. Activations 0 and 1 indicate absence or presence of that feature, respectively, and in analog networks, intermediate values indicate the extent to which a feature is present. Consequently, a small change in one input value leads to only a small change in the network’s activation pattern. Slide from Dr. M. Pomplun

24 Creating Data Representation Therefore, it is appropriate to represent a non-binary feature by a single analog input value only if this value is scaled, i.e., it represents the degree to which a feature is present. This is the case for the brightness of a pixel or the output of a distance sensor (feature = obstacle proximity). It is not the case for letters or chess pieces. For example, assigning values to individual letters (a = 0, b = 0.04, c = 0.08, …, z = 1) implies that a and b are in some way more similar to each other than are a and z. Obviously, in most contexts, this is not a reasonable assumption. Slide from Dr. M. Pomplun

25 Creating Data Representation If you wanted to represent the state of each square on the tic-tac-toe board by one analog value, which would be the better way to do this? Option 1: empty = 0, X = 0.5, O = 1. Not a good scale! It goes from “neutral” to “friendly” and then to “hostile”. Option 2: X = 0, empty = 0.5, O = 1. A more natural scale! It goes from “friendly” to “neutral” and then to “hostile”. Slide from Dr. M. Pomplun

26 Exemplar Analysis When building a neural network application, we must make sure that we choose an appropriate set of exemplars (training data): The entire problem space must be covered. There must be no inconsistencies (contradictions) in the data. We must be able to correct such problems without compromising the effectiveness of the network. Slide from Dr. M. Pomplun

27 Ensuring Coverage For many applications, we do not just want our network to classify any kind of possible input. Instead, we want our network to recognize whether an input belongs to any of the given classes or whether it is “garbage” that cannot be classified. To achieve this, we train our network with both “classifiable” and “garbage” data (null patterns). For the null patterns, the network is supposed to produce a zero output, or a designated “null neuron” is activated. Slide from Dr. M. Pomplun

28 We have to make sure that all of these exemplars taken together cover the entire input space. If it is certain that the network will never be presented with “garbage” data, then we do not need to use null patterns for training. Sometimes there may be conflicting exemplars in our training set. A conflict occurs when two or more identical input patterns are associated with different outputs. Why is this problematic? Slide from Dr. M. Pomplun

29 Ensuring Consistency Assume a BPN with a training set including the exemplars (a, b) and (a, c). Whenever the exemplar (a, b) is chosen, the network adjusts its weights so that its output for ‘a’ moves closer to b. Whenever (a, c) is chosen, the network changes its weights toward an output closer to c, thereby “unlearning” the adaptation for (a, b). In the end, the network will associate input ‘a’ with an output that is “between” ‘b’ and ‘c’, but is neither exactly ‘b’ nor exactly ‘c’, so the network error caused by these exemplars will not decrease. For many applications, this is undesirable. Slide from Dr. M. Pomplun

30 Uncertainty Uncertainty is often treated as a single uniform concept that simply represents the absence of precise information. Sometimes the uncertainty results from a random process; sometimes it results only from a lack of information that induces some ‘belief’ (instead of some ‘knowledge’). While data error is considered a well-defined and measurable part of uncertainty in reasoning systems, important distinctions have been made between different varieties of uncertainty and the different conditions that produce them. One of these distinctions is between instances of uncertainty that are ‘vague’ and those that are ‘ambiguous’.

31 Uncertainty Vague uncertainty exists when there is a general lack of information regarding a judgment or a particular target. In terms of classification, a vague target would be one where there is only weak evidence for membership in any specific class. In contrast, ambiguous uncertainty exists when there is an abundance of conflicting information regarding a possible judgment or a particular target. In terms of classification, an ambiguous target would be one where there is strong evidence for membership in two or more mutually exclusive categories.

32 Ensuring Consistency To identify such conflicts, we can apply a search algorithm to our set of exemplars. How can we resolve an identified conflict? Of course, the easiest way is to eliminate the conflicting exemplars from the training set. However, this reduces the amount of training data that is given to the network. Eliminating exemplars is the best way to go if it is found that these exemplars represent invalid data, for example, inaccurate measurements. In general, however, other methods of conflict resolution are preferable. Slide from Dr. M. Pomplun

33 Ensuring Consistency Another method combines the conflicting patterns. For example, if we have the exemplars (0011, 0101) and (0011, 0010), we can replace them with the following single exemplar: (0011, 0111). The way we compute the output vector of the new exemplar based on the two original output vectors depends on the current task. It should be the value that is most “similar” (in terms of the external interpretation) to the original two values. Slide from Dr. M. Pomplun
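One simple way to do this in code is to group all exemplars that share an input and take the element-wise maximum (bitwise OR for binary outputs) of their target vectors, which reproduces the (0011, 0111) example above. The sketch below implements that particular merging rule; as the slide notes, the right combination rule depends on the task.

```python
from collections import defaultdict

def merge_conflicts(exemplars):
    """exemplars: list of (input_tuple, output_tuple) pairs with 0/1 entries."""
    grouped = defaultdict(list)
    for x, y in exemplars:
        grouped[tuple(x)].append(tuple(y))
    merged = []
    for x, outputs in grouped.items():
        # Element-wise maximum of all conflicting outputs (bitwise OR).
        combined = tuple(max(bits) for bits in zip(*outputs))
        merged.append((x, combined))
    return merged

exemplars = [((0, 0, 1, 1), (0, 1, 0, 1)),
             ((0, 0, 1, 1), (0, 0, 1, 0))]
print(merge_conflicts(exemplars))   # [((0, 0, 1, 1), (0, 1, 1, 1))]
```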

34 Ensuring Consistency Alternatively, we can alter the representation scheme. Let us assume that the conflicting measurements were taken at different times or places. In that case, we can just expand all the input vectors, and the additional values specify the time or place of measurement. For example, the exemplars (0011, 0101), (0011, 0010) could be replaced by the following ones: (100011, 0101), (010011, 0010). Slide from Dr. M. Pomplun

35 Training and Performance Evaluation How many samples should be used for training? Heuristic: at least 5-10 times as many samples as there are weights in the network. Formula (Baum & Haussler, 1989): P ≥ |W| / (1 - a), where P is the number of samples, |W| is the number of weights to be trained, and ‘a’ is the desired accuracy (e.g., proportion of correctly classified samples). Slide from Dr. M. Pomplun
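As a quick sanity check of this rule in Python (the numbers plugged in below are just an illustration):

```python
def required_samples(num_weights, desired_accuracy):
    """Baum & Haussler style estimate: P >= |W| / (1 - a)."""
    return num_weights / (1.0 - desired_accuracy)

# E.g., a network with 8862 weights trained to 90% accuracy:
print(required_samples(8862, 0.90))   # -> 88620.0 samples suggested
```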

36 Training and Performance Evaluation What learning rate η should we choose? The problems that arise when η is too small or too big are similar to those for the Adaline. Unfortunately, the optimal value of η depends entirely on the application. Values between 0.1 and 0.9 are typical for most applications. Often, η is initially set to a large value and is decreased during the learning process. This leads to better convergence of learning and also decreases the likelihood of “getting stuck” in a local error minimum at an early learning stage. Slide from Dr. M. Pomplun
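A common way to realize such a decreasing learning rate is a simple decay schedule; the sketch below shows one possibility (the start value, end value, and decay law are assumptions for illustration, not prescribed by the slides):

```python
def learning_rate(epoch, eta_start=0.9, eta_end=0.1, decay=0.05):
    """Exponentially decay eta from eta_start toward eta_end over the epochs."""
    return eta_end + (eta_start - eta_end) * (1.0 - decay) ** epoch

for epoch in (0, 10, 50, 100):
    print(epoch, round(learning_rate(epoch), 3))
```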

37 Training and Performance Evaluation When training a BPN, what is the acceptable error, i.e., when do we stop the training? The minimum error that can be achieved does not only depend on the network parameters, but also on the specific training set. Thus, for some applications the minimum error will be higher than for others. Slide from Dr. M. Pomplun

38 Training and Performance Evaluation An insightful way of performance evaluation is partial-set training. The idea is to split the available data into two sets: the training set and the test set. The network’s performance on the second set indicates how well the network has actually learned the desired mapping. We should expect the network to interpolate, but not extrapolate. Therefore, this test also evaluates our choice of training samples. Slide from Dr. M. Pomplun

39 Training and Performance Evaluation If the test set contains only one exemplar, this type of training is called “hold-one-out” training. It is to be performed sequentially for every individual exemplar. This, of course, is a very time-consuming process. For example, if we have 1,000 exemplars and want to perform 100 epochs of training, this procedure involves 1,000 × 999 × 100 = 99,900,000 training steps. Partial-set training with a 70%/30% split would only require 700 × 100 = 70,000 training steps. On the positive side, the advantage of hold-one-out training is that all available exemplars (except one) are used for training, which might lead to better network performance. Slide from Dr. M. Pomplun
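The step counts are easy to verify. The small sketch below reproduces them for any number of exemplars, epochs, and split fraction (the functions and their names are my own, written only to illustrate the arithmetic):

```python
def hold_one_out_steps(n_exemplars, epochs):
    # One training run per held-out exemplar, each on n - 1 exemplars.
    return n_exemplars * (n_exemplars - 1) * epochs

def partial_set_steps(n_exemplars, epochs, train_fraction=0.7):
    # A single run on the training portion of the data.
    return int(n_exemplars * train_fraction) * epochs

print(hold_one_out_steps(1000, 100))        # 99,900,000
print(partial_set_steps(1000, 100, 0.7))    # 70,000
```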

40 Example: Predicting the Weather Let us study an interesting neural network application. Its purpose is to predict the local weather based on a set of current weather data: temperature (degrees Celsius); atmospheric pressure (inches of mercury); relative humidity (percentage of saturation); wind speed (kilometers per hour); wind direction (N, NE, E, SE, S, SW, W, or NW); cloud cover (0 = clear … 9 = total overcast); and weather condition (rain, hail, thunderstorm, …). Slide from Dr. M. Pomplun

41 We assume that we have access to the same data from several surrounding weather stations. There are 8 such stations that surround our position. How should we format the input patterns? We need to represent the current weather conditions by an input vector whose elements range in magnitude between zero and one. When we inspect the raw data, we find that there are two types of data that we have to account for: scaled, continuously variable values, and n-ary representations of category values. Slide from Dr. M. Pomplun

42 The following data can be scaled: temperature (-10… 40 degrees Celsius); atmospheric pressure (26… 34 inches of mercury); relative humidity (0… 100 percent); wind speed (0… 250 km/h); and cloud cover (0… 9). We can just scale each of these values so that its lower limit is mapped to some small value ε and its upper limit is mapped to (1 - ε). These numbers will be the components of the input vector. Slide from Dr. M. Pomplun
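A minimal scaling helper in Python, assuming ε is something small like 0.1 (the slides do not fix its value):

```python
def scale(value, low, high, eps=0.1):
    """Map [low, high] linearly onto [eps, 1 - eps]."""
    fraction = (value - low) / (high - low)
    return eps + fraction * (1.0 - 2.0 * eps)

print(scale(20.0, -10.0, 40.0))   # temperature of 20 C  -> 0.58
print(scale(30.0, 26.0, 34.0))    # pressure of 30 inHg  -> 0.5
```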

43 Usually, wind speeds vary between 0 and 40 km/h. By scaling wind speed between 0 and 250 km/h, we can account for all possible wind speeds, but we usually make use of only a small fraction of the scale. Then only the most extreme wind speeds would exert a substantial effect on the weather prediction. Consequently, we will use two scaled input values: one for wind speed ranging from 0 to 40 km/h, and one for wind speed ranging from 40 to 250 km/h. Slide from Dr. M. Pomplun

44 How about the non-scalable weather data? Wind direction is represented by an eight-component vector, where only one element (or possibly two adjacent ones) is active, indicating one out of eight wind directions. The subjective weather condition is represented by a nine-component vector with at least one, and possibly more, active elements. With this scheme, we can encode the current conditions at a given weather station with 23 vector components: one for each of the four scaled parameters, two for wind speed, eight for wind direction, and nine for the subjective weather condition. Slide from Dr. M. Pomplun
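Here is a sketch that assembles the 23-component vector for one station. The nine weather-condition labels are assumptions of mine (the slides list only a few examples), and the scale() helper from the earlier sketch is repeated so the block runs on its own.

```python
import numpy as np

def scale(value, low, high, eps=0.1):
    return eps + (value - low) / (high - low) * (1.0 - 2.0 * eps)

DIRECTIONS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
CONDITIONS = ["clear", "rain", "hail", "snow", "fog",
              "thunderstorm", "drizzle", "sleet", "windstorm"]  # 9 assumed labels

def encode_station(temp_c, pressure, humidity, wind_kmh, direction,
                   cloud_cover, conditions, eps=0.1):
    scaled = [scale(temp_c, -10, 40, eps),          # 4 scaled parameters
              scale(pressure, 26, 34, eps),
              scale(humidity, 0, 100, eps),
              scale(cloud_cover, 0, 9, eps)]
    wind = [scale(min(wind_kmh, 40), 0, 40, eps),   # 2 wind-speed components
            scale(max(wind_kmh, 40), 40, 250, eps)]
    dir_vec = [1.0 if d == direction else 0.0 for d in DIRECTIONS]    # 8
    cond_vec = [1.0 if c in conditions else 0.0 for c in CONDITIONS]  # 9
    return np.array(scaled + wind + dir_vec + cond_vec)               # 23 total

vec = encode_station(18.0, 30.1, 65.0, 25.0, "SW", 3, {"rain"})
print(vec.shape)   # (23,)
```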

45 Since the input does not include only our station, but also the eight surrounding ones, the input layer consists of nine groups of 23 neurons each: one group for our station and one for each neighboring station (north, northeast, …, northwest). The network therefore has 9 × 23 = 207 input neurons, which accept 207-component input vectors. Slide from Dr. M. Pomplun

46 What should the output patterns look like? We want the network to produce a set of indicators that we can interpret as a prediction of the weather 24 hours from now. In analogy to the weather forecast on the evening news, we decide to demand the following four indicators: a temperature prediction, a prediction of the chance of precipitation occurring, an indication of the expected cloud cover, and a storm indicator (extreme conditions warning). Slide from Dr. M. Pomplun

47 Each of these four indicators can be represented by one scaled output value: temperature (-10… 40 degrees Celsius); chance of precipitation (0%… 100%); cloud cover (0… 9); and a storm warning, for which there are two possibilities: either a binary value (0: no storm warning; 1: storm warning) or the probability of a serious storm (0%… 100%). Of course, the actual network outputs range from ε to (1 - ε), and after their computation, if necessary, they are scaled to match the ranges specified above. Slide from Dr. M. Pomplun

48 We decide (or experimentally determine) to use a hidden layer with 42 sigmoidal neurons. In summary, our network has 207 input neurons, 42 hidden neurons, and 4 output neurons. Because of the small output vectors, 42 hidden units may suffice for this application. Slide from Dr. M. Pomplun
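A compact sketch of this 207-42-4 architecture as a feedforward pass in numpy (the random initialization and the sigmoid output layer are assumptions for illustration; the slides specify only the layer sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Weight matrices (plus biases) for a 207-42-4 backpropagation network.
W_hidden = rng.normal(0.0, 0.1, (42, 207))
b_hidden = np.zeros(42)
W_output = rng.normal(0.0, 0.1, (4, 42))
b_output = np.zeros(4)

def forward(x):
    h = sigmoid(W_hidden @ x + b_hidden)     # 42 hidden activations
    return sigmoid(W_output @ h + b_output)  # 4 output indicators

x = rng.uniform(0.1, 0.9, 207)               # one 207-component input vector
print(forward(x).shape)                      # (4,)
print(W_hidden.size + W_output.size)         # 8694 + 168 = 8862 weights
```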

49 The next thing we need to do is to collect the training exemplars. First we have to specify what our network is supposed to do: In production mode, the network is fed with the current weather conditions, and its output will be interpreted as the weather forecast for tomorrow. Therefore, in training mode, we have to present the network with exemplars that associate the known weather conditions at a time t – 24 hrs (as input) with the conditions at time t (as desired output). So we have to collect a set of historical exemplars with a known correct output for every input. Slide from Dr. M. Pomplun

50 Obviously, if such data are unavailable, we have to start collecting them. The selection of exemplars that we need depends, among other factors, on how much the weather changes at our location. And how about the granularity of our exemplar data, i.e., the frequency of measurement? Using one sample per day would be a natural choice, but it would neglect rapid changes in weather. If we use hourly instantaneous samples, however, we increase the likelihood of conflicts. Slide from Dr. M. Pomplun

51 Therefore, we decide to do the following: We will collect input data every hour, but the corresponding output pattern will be the average of the instantaneous patterns over a 12-hour period. This way we reduce the possibility of errors while increasing the amount of training data. Now we have to train our network. If we use samples in one-hour intervals for one year, we have 8,760 exemplars. Our network has 207 × 42 + 42 × 4 = 8862 weights, which means that data from ten years, i.e., 87,600 exemplars, would be desirable (rule of thumb of about ten samples per weight). Slide from Dr. M. Pomplun

52 Since the hold-one-out training method is very time-consuming with a large number of samples, we decide to use partial-set training instead. The best way to do this would be to acquire a test set (control set), that is, another set of input-output pairs measured on random days and at random times. After training the network with the 87,600 exemplars, we could then use the test set to evaluate the performance of our network. Slide from Dr. M. Pomplun

53 Neural network troubleshooting: Plot the global error as a function of the training epoch. The error should decrease after every epoch. If it oscillates, do the following tests. Try reducing the size of the training set. If the network then converges, a conflict may exist in the exemplars. If the network still does not converge, continue pruning the training set until it does converge. Then add exemplars back gradually, thereby detecting the ones that cause conflicts. If this still does not work, look for saturated neurons (extreme weights) in the hidden layer. If you find those, add more hidden-layer neurons, possibly an extra 20%. If there are no saturated units and the problems still exist, try lowering the learning parameter η and training longer. Slide from Dr. M. Pomplun
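The first diagnostic step, checking whether the global error still oscillates, can be automated with a few lines of Python (the oscillation criterion used here, counting epochs where the error goes up, is my own simple heuristic):

```python
def error_is_oscillating(errors, tolerance=0.05):
    """errors: global error per epoch. Flags runs where the error often rises."""
    increases = sum(1 for prev, cur in zip(errors, errors[1:]) if cur > prev)
    return increases / max(len(errors) - 1, 1) > tolerance

print(error_is_oscillating([0.9, 0.7, 0.55, 0.48, 0.44]))        # False
print(error_is_oscillating([0.9, 0.7, 0.8, 0.6, 0.75, 0.55]))    # True
```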

54 If the network converges but does not accurately learn the desired function, evaluate the coverage of the training set. If the coverage is adequate and the network still does not learn the function precisely, you could refine the pattern representation. For example, you could add a season indicator to the input, helping the network to discriminate between similar inputs that produce very different outputs. Slide from Dr. M. Pomplun