
1 COMP 578 Artificial Neural Networks for Data Mining Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

2 Human vs. Computer Computers – Not good at performing tasks such as visual or audio processing/recognition. – Execute instructions one after another extremely rapidly. – Good at serial activities (e.g. counting, adding). Human brain – Units respond at roughly 10 times per second (vs. a 2.5 GHz processor). – Works on many different things at once. – Handles vision or speech recognition through the interaction of many different pieces of information.

3 The Brain The human brain is complicated and poorly understood. It contains approximately 10^10 basic units called neurons. Each neuron is connected to about 10,000 others. (Diagram: soma (cell body), dendrites, axon, synapse.)

4 The Neuron A neuron accepts many inputs (through its dendrites). The inputs are all added up in some fashion. If enough active inputs are received at once, the neuron is activated and “fires” (along its axon). (Diagram: dendrites, soma, axon, synapse.)

5 The Synapse The axon produces a voltage pulse called an action potential (AP). The arrival of more than one AP is needed to trigger the synapse. The synapse releases neurotransmitters when the activation is raised sufficiently. The neurotransmitters diffuse across the gap, chemically activating dendrites on the other side. Some synapses pass a large signal across, whilst others allow very little through.

6 Modeling the Single Neuron n inputs. The efficiency of the synapses is modeled by a multiplicative factor on each of the inputs to the neuron. Multiplicative factor = the weight associated with each input line. The neuron’s tasks: – Calculate the weighted sum of its inputs. – Compare the sum to some internal threshold. – Turn on if the threshold is exceeded. (Diagram: inputs x1, x2, …, xn with weights w1, w2, …, wn feeding a summation unit Σ that produces output y.)

7 A Mathematical Model of Neurons The neuron computes the weighted sum SUM = w1 x1 + w2 x2 + … + wn xn and fires if SUM exceeds a threshold θ: – y = 1 if SUM > θ – y = 0 if SUM ≤ θ
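A minimal Python sketch of this threshold unit; the function name and the sample weights/threshold are illustrative assumptions, not part of the lecture notes.

```python
def threshold_neuron(x, w, theta):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(xi * wi for xi, wi in zip(x, w))   # SUM = w1*x1 + ... + wn*xn
    return 1 if weighted_sum > theta else 0

# With equal weights and a threshold of 1.5, two binary inputs behave like AND.
print(threshold_neuron([1, 1], [1, 1], 1.5))   # 1
print(threshold_neuron([1, 0], [1, 1], 1.5))   # 0
```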

8 Learning in Simple Neurons We need to be able to determine the connection weights. Inspiration comes from looking at real neural systems: – Reinforce good behavior and reprimand bad. – E.g., train a NN to recognize the two characters H and F. – Output 1 when an H is presented and 0 when it sees an F. – If it produces an incorrect output, we want to reduce the chances of that happening again. – This is done by modifying the weights.

9 Learning in Simple Neurons (2) The neuron is given random initial weights. – At the starting state, the neuron knows nothing. Present an H. – The neuron computes the weighted sum of its inputs. – Compare the weighted sum with the threshold. – If it exceeds the threshold, output a 1, otherwise a 0. If the output is 1, the neuron is correct. – Do nothing. Otherwise, if the neuron produces a 0: – Increase the weights so that next time the sum will exceed the threshold and produce a 1.

10 A Simple Learning Rule By how much should the weights be changed? We can follow a simple rule: – Add the input values to the weights when we want the output to be on. – Subtract the input values from the weights when we want the output to be off. This learning rule is called the Hebb rule: – It is a variant on one proposed by Donald Hebb, and this style of learning is called Hebbian learning. – It is the earliest and simplest learning rule for a neuron.

11 The Hebb Net Step 0. Initialize all weights: – w_i = 0 (i = 1 to n). Step 1. For each input training record s and its target output t, do Steps 2-4. – Step 2. Set the activations of the input units: x_i = s_i (i = 1 to n). – Step 3. Set the activation of the output unit: y = t. – Step 4. Adjust the weights and the bias: w_i(new) = w_i(old) + x_i y (i = 1 to n) (note: Δw_i = x_i y), θ(new) = θ(old) + y. The bias θ is adjusted like a weight from a unit whose output signal is always 1.
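A minimal Python sketch of the training pass in Steps 0-4 above; the function name and the (inputs, target) record layout are assumed for illustration.

```python
def train_hebb(records, n_inputs):
    """One pass of Hebbian learning: w_i += x_i * y and theta += y for each record."""
    w = [0.0] * n_inputs        # Step 0: initialize all weights to zero
    theta = 0.0                 # bias, treated as a weight from a constant +1 input
    for x, t in records:        # Step 1: for each training record and its target
        y = t                   # Steps 2-3: activations are set to the record values
        for i in range(n_inputs):
            w[i] += x[i] * y    # Step 4: delta w_i = x_i * y
        theta += y              #         delta theta = y
    return w, theta
```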

12 A Hebb Net Example

13 The Data Set Attributes – HS_Index: {Drop, Rise} – Trading_Vol: {Small, Medium, Large} – DJIA: {Drop, Rise} Class Label – Buy_Sell: {Buy, Sell}

14 The Data Set
     HS_Index   Trading_Vol   DJIA   Buy_Sell
  1  Drop       Large         Drop   Buy
  2  Rise       Large         Rise   Sell
  3  Rise       Medium        Drop   Buy
  4  Drop       Small         Drop   Sell
  5  Rise       Small         Drop   Sell
  6  Rise       Large         Drop   Buy
  7  Rise       Small         Rise   Sell
  8  Drop       Large         Rise   Sell

15 Transformation Input Features – HS_Index_Drop: {-1, 1} – HS_Index_Rise: {-1, 1} – Trading_Vol_Small: {-1, 1} – Trading_Vol_Medium: {-1, 1} – Trading_Vol_Large: {-1, 1} – DJIA_Drop: {-1, 1} – DJIA_Rise: {-1, 1} – Bias: {1} Output Feature – Buy_Sell: {-1, 1} (Diagram: the input features HSI=Drop, HSI=Rise, …, DJIA=Drop, DJIA=Rise and the bias feed a single output unit B/S.)
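A sketch of this bipolar encoding in Python; the helper name and the feature order (bias last) are assumptions made for illustration.

```python
def encode_bipolar(hs_index, trading_vol, djia):
    """Map one record to the eight bipolar input features (the bias is always +1)."""
    f = lambda condition: 1 if condition else -1
    return [
        f(hs_index == "Drop"), f(hs_index == "Rise"),
        f(trading_vol == "Small"), f(trading_vol == "Medium"), f(trading_vol == "Large"),
        f(djia == "Drop"), f(djia == "Rise"),
        1,                                    # bias input
    ]

# Record 1: HS_Index=Drop, Trading_Vol=Large, DJIA=Drop
print(encode_bipolar("Drop", "Large", "Drop"))   # [1, -1, -1, -1, 1, 1, -1, 1]
```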

16 Transformed Data (Table: each of the eight records from the data set, encoded with the eight input features above, together with its output feature.)

17 Record 1 Input Feature: Output Feature: Original Weight: Weight Change: New Weight:

18 Record 2 Input Feature: Output Feature: Original Weight: Weight Change: New Weight:

19 Record 3 Input Feature: Output Feature: Original Weight: Weight Change: New Weight:

20 Record 4 Input Feature: Output Feature: Original Weight: Weight Change: New Weight:

21 Record 5 Input Feature: Output Feature: Original Weight: Weight Change: New Weight:

22 Record 6 Input Feature: Output Feature: Original Weight: Weight Change: New Weight:

23 Record 7 Input Feature: Output Feature: Original Weight: Weight Change: New Weight:

24 Record 8 Input Feature: Output Feature: Original Weight: Weight Change: New Weight:

25 A Hebb Net Example 2
  Input (x1, x2, 1)   Target
  ( 1,  1, 1)          1
  ( 1, -1, 1)         -1
  (-1,  1, 1)         -1
  (-1, -1, 1)         -1

26
  Input (x1, x2, 1)   Target   Weight Changes   Weights (w1, w2, b)
                                                (0, 0, 0)
  (1, 1, 1)            1       (1, 1, 1)        (1, 1, 1)
  The separating line becomes x2 = -x1 - 1.

27
  Input (x1, x2, 1)   Target   Weight Changes   Weights (w1, w2, b)
                                                (1, 1, 1)
  (1, -1, 1)          -1       (-1, 1, -1)      (0, 2, 0)
  The separating line becomes x2 = 0.

28
  Input (x1, x2, 1)   Target   Weight Changes   Weights (w1, w2, b)
                                                (0, 2, 0)
  (-1, 1, 1)          -1       (1, -1, -1)      (1, 1, -1)
  The separating line becomes x2 = -x1 + 1.

29
  Input (x1, x2, 1)   Target   Weight Changes   Weights (w1, w2, b)
                                                (1, 1, -1)
  (-1, -1, 1)         -1       (1, 1, -1)       (2, 2, -2)
  Even though the weights have changed, the separating line is still x2 = -x1 + 1. The graph of the decision regions (the positive response and the negative response) remains as shown.
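As a check, a few lines of Python reproduce the weight trace of Example 2 (a sketch; the bias is folded in as a third input fixed at +1, matching the (x1, x2, 1) notation above).

```python
data = [((1, 1, 1), 1), ((1, -1, 1), -1), ((-1, 1, 1), -1), ((-1, -1, 1), -1)]
w = [0, 0, 0]                                     # (w1, w2, b) all start at zero
for x, t in data:
    w = [wi + xi * t for wi, xi in zip(w, x)]     # Hebb update: w_i += x_i * y
    print(w)    # [1, 1, 1] -> [0, 2, 0] -> [1, 1, -1] -> [2, 2, -2], as in the trace
```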

30 A Hebb Net Example 3
  Input (x1, x2, 1)   Target
  (1, 1, 1)            1
  (1, 0, 1)            0
  (0, 1, 1)            0
  (0, 0, 1)            0

31
  Input (x1, x2, 1)   Target   Weight Changes   Weights (w1, w2, b)
                                                (0, 0, 0)
  (1, 1, 1)            1       (1, 1, 1)        (1, 1, 1)
  The separating line becomes x2 = -x1 - 1.

32
  Input (x1, x2, 1)   Target   Weight Changes   Weights (w1, w2, b)
                                                (1, 1, 1)
  (1, 0, 1)            0       (0, 0, 0)        (1, 1, 1)
  (0, 1, 1)            0       (0, 0, 0)        (1, 1, 1)
  (0, 0, 1)            0       (0, 0, 0)        (1, 1, 1)
  Since the target value is 0, no learning occurs. Using binary target values prevents the net from learning any pattern for which the target is “off”.

33 Characteristics of the Hebb Net The choice of training records determines which problems can be solved. Training records corresponding to the AND function can be solved if the inputs and targets are in bipolar form. The bipolar representation allows modification of a weight both when the input and target are “on” at the same time and when they are both “off” at the same time.

34 The Perceptron Learning Rule More powerful than the Hebb rule. The Perceptron learning rule convergence theorem states that: – If weights exist that allow the neuron to respond correctly to all training patterns, then the rule will find such weights. – The neuron will find these weights in a finite number of training steps. Let SUM be the weighted sum; the output of the Perceptron, y = f(SUM), can be 1, 0, or -1. The activation function is: – y = 1 if SUM > θ – y = 0 if -θ ≤ SUM ≤ θ – y = -1 if SUM < -θ

35 Perceptron Learning For each training record, the net calculates the response of the output unit. The net then determines whether an error occurred for this pattern (by comparing the calculated value with the target value). If an error occurred, the weights are changed according to: w_i(new) = w_i(old) + α t x_i, where t is +1 or -1 and α is the learning rate. If an error did not occur, the weights are not changed. Training continues until no error occurs.

36 Perceptron for classification Step 0. Initialize all weights and the bias (for simplicity, set the weights and bias to zero). Set the learning rate α (0 < α ≤ 1); for simplicity, α can be set to 1. Step 1. While the stopping condition is false, do Steps 2-6. Step 2. For each training pair, do Steps 3-5. Step 3. Set the activation of each input unit, x_i. Step 4. Compute the response of the output unit from SUM = θ + Σ_i x_i w_i. Step 5. Update the weights and bias if an error occurred for this vector: if y' ≠ y, w_i(new) = w_i(old) + α t x_i and θ(new) = θ(old) + α t; else w_i(new) = w_i(old) and θ(new) = θ(old). Step 6. If no weights changed in Step 2, stop; else continue.
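A minimal Python sketch of Steps 0-6 above. The slides use θ both for the fixed threshold of the activation function and for the trainable bias; the sketch separates them as `theta` (threshold) and `b` (bias), and all names are illustrative.

```python
def perceptron_train(records, n_inputs, alpha=1.0, theta=0.2, max_epochs=100):
    w, b = [0.0] * n_inputs, 0.0                        # Step 0
    for _ in range(max_epochs):                         # Step 1
        changed = False
        for x, t in records:                            # Steps 2-3
            s = b + sum(xi * wi for xi, wi in zip(x, w))       # Step 4: weighted sum
            y = 1 if s > theta else (-1 if s < -theta else 0)  # activation with undecided band
            if y != t:                                  # Step 5: update only on error
                for i in range(n_inputs):
                    w[i] += alpha * t * x[i]
                b += alpha * t
                changed = True
        if not changed:                                 # Step 6: stop when nothing changed
            break
    return w, b
```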

37 Perceptron for classification (2) Only weights connecting active input units (x_i ≠ 0) are updated. Weights are updated only for patterns that do not produce the correct value of y. There is less learning as more training patterns produce the correct response. The threshold on the activation function for the response unit is a fixed, non-negative value θ. The form of the activation function for the output unit creates an undecided band of fixed width, determined by θ, separating the region of positive response from that of negative response.

38 Perceptron for classification (3) Instead of one separating line, we have a line separating the region of positive response from the region of zero response (the line bounding the inequality w1 x1 + w2 x2 + b > θ), and a line separating the region of zero response from the region of negative response (the line bounding the inequality w1 x1 + w2 x2 + b < -θ).

39 Perceptron

40 The Data Set (1) Attributes – HS_Index: {Drop, Rise} – Trading_Vol: {Small, Medium, Large} – DJIA: {Drop, Rise} Class Label – Buy_Sell: {Buy, Sell}

41 The Data Set (2)
     HS_Index   Trading_Vol   DJIA   Buy_Sell
  1  Drop       Large         Drop   Buy
  2  Rise       Large         Rise   Sell
  3  Rise       Medium        Drop   Buy
  4  Drop       Small         Drop   Sell
  5  Rise       Small         Drop   Sell
  6  Rise       Large         Drop   Buy
  7  Rise       Small         Rise   Sell
  8  Drop       Large         Rise   Sell

42 Transformation Input Features – HS_Index_Drop: {0, 1} – HS_Index_Rise: {0, 1} – Trading_Vol_Small: {0, 1} – Trading_Vol_Medium: {0, 1} – Trading_Vol_Large: {0, 1} – DJIA_Drop: {0, 1} – DJIA_Rise: {0, 1} – Bias: {0} Output Feature – Buy → 1 – Sell → -1

43 Transformed Data (Table: each of the eight records from the data set, encoded with the binary input features above, together with its output feature.)

44 Record 1 Input Feature: Output Feature: Original Weight: Output: f(0) = 0 Weight Change: New Weight:

45 Record 2 Input Feature: Output Feature: Original Weight: Output: f(2) = 1 Weight Change: New Weight:

46 Record 3 Input Feature: Output Feature: Original Weight: Output: f(1) = 0 Weight Change: New Weight:

47 Record 4 Input Feature: Output Feature: Original Weight: Output: f(4) = 1 Weight Change: New Weight:

48 Record 5 Input Feature: Output Feature: Original Weight: Output: f(0) = 0 Weight Change: New Weight:

49 Record 6 Input Feature: Output Feature: Original Weight: Output: f(-2) = -1 Weight Change: New Weight:

50 Record 7 Input Feature: Output Feature: Original Weight: Output: f(-3) = -1 Weight Change: New Weight:

51 Record 8 Input Feature: Output Feature: Original Weight: Output: f(0) = 0 Weight Change: New Weight:

52 A Perceptron Example
  Input (x1, x2, 1)   Target
  (1, 1, 1)            1
  (1, 0, 1)           -1
  (0, 1, 1)           -1
  (0, 0, 1)           -1

53
  Input (x1, x2, 1)   Net   Out   Target   Weight Changes   Weights (w1, w2, b)
                                                            (0, 0, 0)
  (1, 1, 1)            0     0     1        (1, 1, 1)       (1, 1, 1)
  The separating lines become x1 + x2 + 1 = 0.2 and x1 + x2 + 1 = -0.2.

54
  Input (x1, x2, 1)   Net   Out   Target   Weight Changes   Weights (w1, w2, b)
                                                            (1, 1, 1)
  (1, 0, 1)            2     1    -1        (-1, 0, -1)     (0, 1, 0)
  The separating lines become x2 = 0.2 and x2 = -0.2.

55
  Input (x1, x2, 1)   Net   Out   Target   Weight Changes   Weights (w1, w2, b)
                                                            (0, 1, 0)
  (0, 1, 1)            1     1    -1        (0, -1, -1)     (0, 0, -1)
  (0, 0, 1)           -1    -1    -1        (0, 0, 0)       (0, 0, -1)

56
  Input (x1, x2, 1)   Net   Out   Target   Weight Changes   Weights (w1, w2, b)
                                                            (0, 0, -1)
  (1, 1, 1)           -1    -1     1        (1, 1, 1)       (1, 1, 0)
  The separating lines become x1 + x2 = 0.2 and x1 + x2 = -0.2.

57
  Input (x1, x2, 1)   Net   Out   Target   Weight Changes   Weights (w1, w2, b)
                                                            (1, 1, 0)
  (1, 0, 1)            1     1    -1        (-1, 0, -1)     (0, 1, -1)
  The separating lines become x2 = 1.2 and x2 = 0.8.

58
  Input (x1, x2, 1)   Net   Out   Target   Weight Changes   Weights (w1, w2, b)
                                                            (0, 1, -1)
  (0, 1, 1)            0     0    -1        (0, -1, -1)     (0, 0, -2)
  (0, 0, 1)           -2    -1    -1        (0, 0, 0)       (0, 0, -2)
  The results for the third epoch are:
  Input (x1, x2, 1)   Net   Out   Target   Weight Changes   Weights (w1, w2, b)
                                                            (0, 0, -2)
  (1, 1, 1)           -2    -1     1        (1, 1, 1)       (1, 1, -1)
  (1, 0, 1)            0     0    -1        (-1, 0, -1)     (0, 1, -2)
  (0, 1, 1)           -1    -1    -1        (0, 0, 0)       (0, 1, -2)
  (0, 0, 1)           -2    -1    -1        (0, 0, 0)       (0, 1, -2)

59 The results for the fourth epoch are:
  (1, 1, 1)           -1    -1     1        (1, 1, 1)       (1, 2, -1)
  (1, 0, 1)            0     0    -1        (-1, 0, -1)     (0, 2, -2)
  (0, 1, 1)            0     0    -1        (0, -1, -1)     (0, 1, -3)
  (0, 0, 1)           -3    -1    -1        (0, 0, 0)       (0, 1, -3)
  For the fifth epoch, we have:
  (1, 1, 1)           -2    -1     1        (1, 1, 1)       (1, 2, -2)
  (1, 0, 1)           -1    -1    -1        (0, 0, 0)       (1, 2, -2)
  (0, 1, 1)            0     0    -1        (0, -1, -1)     (1, 1, -3)
  (0, 0, 1)           -3    -1    -1        (0, 0, 0)       (1, 1, -3)
  And for the sixth epoch:
  (1, 1, 1)           -1    -1     1        (1, 1, 1)       (2, 2, -2)
  (1, 0, 1)            0     0    -1        (-1, 0, -1)     (1, 2, -3)
  (0, 1, 1)           -1    -1    -1        (0, 0, 0)       (1, 2, -3)
  (0, 0, 1)           -3    -1    -1        (0, 0, 0)       (1, 2, -3)

60 The results for the seventh epoch are:
  (1, 1, 1)            0     0     1        (1, 1, 1)       (2, 3, -2)
  (1, 0, 1)            0     0    -1        (-1, 0, -1)     (1, 3, -3)
  (0, 1, 1)            0     0    -1        (0, -1, -1)     (1, 2, -4)
  (0, 0, 1)           -4    -1    -1        (0, 0, 0)       (1, 2, -4)
  The eighth epoch yields:
  (1, 1, 1)           -1    -1     1        (1, 1, 1)       (2, 3, -3)
  (1, 0, 1)           -1    -1    -1        (0, 0, 0)       (2, 3, -3)
  (0, 1, 1)            0     0    -1        (0, -1, -1)     (2, 2, -4)
  (0, 0, 1)           -4    -1    -1        (0, 0, 0)       (2, 2, -4)
  And the ninth:
  (1, 1, 1)            0     0     1        (1, 1, 1)       (3, 3, -3)
  (1, 0, 1)            0     0    -1        (-1, 0, -1)     (2, 3, -4)
  (0, 1, 1)           -1    -1    -1        (0, 0, 0)       (2, 3, -4)
  (0, 0, 1)           -4    -1    -1        (0, 0, 0)       (2, 3, -4)

61 Finally, the results for the tenth epoch are:
  (1, 1, 1)            1     1     1        (0, 0, 0)       (2, 3, -4)
  (1, 0, 1)           -2    -1    -1        (0, 0, 0)       (2, 3, -4)
  (0, 1, 1)           -1    -1    -1        (0, 0, 0)       (2, 3, -4)
  (0, 0, 1)           -4    -1    -1        (0, 0, 0)       (2, 3, -4)
  The positive response is given by 2x1 + 3x2 - 4 > 0.2, with boundary line x2 = -2/3 x1 + 7/5.
  The negative response is given by 2x1 + 3x2 - 4 < -0.2, with boundary line x2 = -2/3 x1 + 19/15.
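A short sketch that re-runs this example (learning rate 1, threshold 0.2, binary inputs, bipolar targets); after ten epochs it reaches the same weights (2, 3) and bias -4 as the trace above.

```python
data = [((1, 1), 1), ((1, 0), -1), ((0, 1), -1), ((0, 0), -1)]
w, b = [0.0, 0.0], 0.0
for epoch in range(10):
    for x, t in data:
        s = b + sum(xi * wi for xi, wi in zip(x, w))
        y = 1 if s > 0.2 else (-1 if s < -0.2 else 0)
        if y != t:                                   # update only when the response is wrong
            w = [wi + t * xi for wi, xi in zip(w, x)]
            b += t
print(w, b)   # [2.0, 3.0] -4.0, i.e. the boundary band around 2*x1 + 3*x2 - 4 = 0
```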

62 The 2nd Perceptron Algorithm
  Input (x1, x2, 1)   Net   Out   Target   Weight Changes   Weights (w1, w2, b)
                                                            (0, 0, 0)
  ( 1,  1, 1)          0     0     1        (1, 1, 1)       (1, 1, 1)
  ( 1, -1, 1)          1     1    -1        (-1, 1, -1)     (0, 2, 0)
  (-1,  1, 1)          2     1    -1        (1, -1, -1)     (1, 1, -1)
  (-1, -1, 1)         -3    -1    -1        (0, 0, 0)       (1, 1, -1)

63 In the second epoch of training, we have:
  ( 1,  1, 1)          1     1     1        (0, 0, 0)       (1, 1, -1)
  ( 1, -1, 1)         -1    -1    -1        (0, 0, 0)       (1, 1, -1)
  (-1,  1, 1)         -1    -1    -1        (0, 0, 0)       (1, 1, -1)
  (-1, -1, 1)         -3    -1    -1        (0, 0, 0)       (1, 1, -1)
  Since all the Δw's are 0 in epoch 2, the system was fully trained after the first epoch.

64 Limitations of Perceptrons The Perceptron finds a straight line that separates the classes. It cannot learn the exclusive-or (XOR) function: such patterns are not linearly separable. There was not much work on neural networks after Minsky and Papert published their book in 1969. Rumelhart and McClelland produced an improvement in 1986: – They proposed some modern adaptations to the Perceptron, called the multilayer Perceptron.

65 The Multilayer Perceptron To overcome linear inseparability: – Use more perceptrons. – Each is set up to identify small, linearly separable sections of the inputs. – Combine their outputs into another perceptron. Each neuron still takes the weighted sum of its inputs, thresholds it, and outputs 1 or 0. But how can we learn?

66 The Multilayer Perceptron (2) Perceptrons in the 2nd layer do not know which of the real inputs were on or not. The 2-state output (on or off) gives no indication of how much to adjust the weights: – Some weighted inputs definitely turn on a neuron. – Some weighted inputs only just turn a neuron on and should not be altered to the same extent. – What changes would produce a better solution next time? – Which of the input weights should be increased and which should not? – We have no way of finding out (the credit assignment problem).

67 The Solution We need a non-binary thresholding function. Use a slightly different non-linearity so that the unit more or less turns on or off. A possible new thresholding function is the sigmoid function. The sigmoid thresholding function does not mask the inputs from the outputs.
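A sketch of the sigmoid thresholding function and its derivative (the derivative is what the backpropagation rule below relies on); the steepness parameter is assumed to be 1, which the slides do not state explicitly.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)        # f'(x) = f(x) * (1 - f(x))

print(sigmoid(0.0))             # 0.5: graded output instead of a hard 0/1 step
print(sigmoid(0.6))             # about 0.646, the value used in the later worked example
```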

68 The Multi-layer Perceptron An input layer, an output layer, and a hidden layer. Each unit in the hidden and output layers is like a perceptron unit, but the thresholding function is the sigmoid. Units in the input layer serve only to distribute the values they receive to the next layer; input units do not perform a weighted sum or threshold.

69 The Backpropagation Rule The single-layer perceptron model is changed: – The thresholding function goes from a step to a sigmoid function. – A hidden layer is added. – The learning rule needs to be altered. The new learning rule for the multilayer perceptron is called the “generalized delta rule”, or the “backpropagation rule”: – Show the NN a pattern and calculate its response. – Compare with the desired response. – Alter the weights so that the NN produces a more accurate output next time. – The learning rule provides the method for adjusting the weights so as to decrease the error next time.

70 Backpropagation Details Define an error function to represent the difference between the NN's current output and the correct output. The backpropagation rule aims to reduce the error by: – Calculating the value of the error for a particular input. – Then back-propagating the error from one layer to the previous one. – Each unit in the net has its weights adjusted so as to reduce the value of the error function. – For units in the output layer, their output and desired output are known, so adjusting the weights is relatively simple. – For units in the middle: those connected to outputs with a large error should have their weights adjusted a lot, while those that feed almost correct outputs should not be altered much.

71 The Detailed Algorithm Step 0. Initialize the weights (set to small random values). Step 1. While the stopping condition is false, do Steps 2-9. – Step 2. For each training pair, do Steps 3-8. Feedforward. Step 3. Each input unit (x_i, i = 1, …, n) receives input signal x_i and broadcasts this signal to all units in the layer above (the hidden units). Step 4. Each hidden unit (Z_j, j = 1, …, p) sums its weighted input signals, z_in_j = v_0j + Σ_i x_i v_ij, – applies its activation function to compute its output signal, z_j = f(z_in_j), – and sends this signal to all units in the layer above (the output units). Step 5. Each output unit (Y_k, k = 1, …, m) sums its weighted input signals, y_in_k = w_0k + Σ_j z_j w_jk, – and applies its activation function to compute its output signal, y_k = f(y_in_k).

72 The Detailed Algorithm (2) Backpropagation of error. Step 6. Each output unit (Y_k, k = 1, …, m) receives a target pattern corresponding to the input training pattern and computes its error information term, δ_k = (t_k - y_k) f'(y_in_k). – It calculates its weight correction term (used to update w_jk later), Δw_jk = α δ_k z_j, – calculates its bias correction term (used to update w_0k later), Δw_0k = α δ_k, – and sends δ_k to units in the layer below. Step 7. Each hidden unit (Z_j, j = 1, …, p) sums its delta inputs (from units in the layer above), δ_in_j = Σ_k δ_k w_jk, – multiplies by the derivative of its activation function to calculate its error information term, δ_j = δ_in_j f'(z_in_j), – calculates its weight correction term (used to update v_ij later), Δv_ij = α δ_j x_i, – and calculates its bias correction term (used to update v_0j later), Δv_0j = α δ_j.

73 The Detailed Algorithm (3) Update weights and biases. Step 8. Each output unit (Y_k, k = 1, …, m) updates its bias and weights (j = 0, …, p): w_jk(new) = w_jk(old) + Δw_jk. – Each hidden unit (Z_j, j = 1, …, p) updates its bias and weights (i = 0, …, n): v_ij(new) = v_ij(old) + Δv_ij. – Step 9. Test the stopping condition.
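A compact Python sketch of Steps 0-9 for one training pair and one hidden layer, using the v (input-to-hidden) / w (hidden-to-output) naming of the algorithm above; the list-of-lists weight layout and the default learning rate are assumptions made for illustration.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def backprop_step(x, t, v, v0, w, w0, alpha=0.9):
    # Feedforward (Steps 3-5)
    z_in = [v0[j] + sum(x[i] * v[i][j] for i in range(len(x))) for j in range(len(v0))]
    z = [sigmoid(s) for s in z_in]
    y_in = [w0[k] + sum(z[j] * w[j][k] for j in range(len(z))) for k in range(len(w0))]
    y = [sigmoid(s) for s in y_in]
    # Backpropagation of error (Steps 6-7); for the sigmoid, f'(s) = f(s)(1 - f(s))
    d_out = [(t[k] - y[k]) * y[k] * (1.0 - y[k]) for k in range(len(y))]
    d_hid = [sum(d_out[k] * w[j][k] for k in range(len(y))) * z[j] * (1.0 - z[j])
             for j in range(len(z))]
    # Update weights and biases (Step 8)
    for j in range(len(z)):
        for k in range(len(y)):
            w[j][k] += alpha * d_out[k] * z[j]
    for k in range(len(y)):
        w0[k] += alpha * d_out[k]
    for i in range(len(x)):
        for j in range(len(z)):
            v[i][j] += alpha * d_hid[j] * x[i]
    for j in range(len(z)):
        v0[j] += alpha * d_hid[j]
    return y
```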

74 An Example: Multilayer Perceptron Network with Backpropagation Training (Diagram: three input nodes, labelled Vol=High, HSI=Rise and DJIA=Drop, feed hidden nodes 4-6, which feed output nodes 7-8.)

75 Initial Weights and Bias Values w_ij = weight between nodes i and j. θ_i = bias value of node i. For node 4: – w_14 = 0.2, w_24 = 0.4, w_34 = -0.5, θ_4 = -0.4 For node 5: – w_15 = -0.3, w_25 = 0.1, w_35 = 0.2, θ_5 = 0.2 For node 6: – w_16 = 0.6, w_26 = 0.7, w_36 = -0.1, θ_6 = 0.1 For node 7: – w_47 = -0.3, w_57 = -0.2, w_67 = 0.1, θ_7 = 0.6 For node 8: – w_48 = -0.5, w_58 = 0.1, w_68 = -0.3, θ_8 = 0.3

76 Training (1) Learning rate = 0.9. Input: (x1, x2, x3) = (1, 0, 1). Target output: (t7, t8) = (1, 0). For node 4: – Input: 0.2 + 0 - 0.5 - 0.4 = -0.7 – Output: 1 / (1 + e^0.7) = 0.332 For node 5: – Input: -0.3 + 0 + 0.2 + 0.2 = 0.1 – Output: 1 / (1 + e^-0.1) = 0.525 For node 6: – Input: 0.6 + 0 - 0.1 + 0.1 = 0.6 – Output: 1 / (1 + e^-0.6) = 0.646 For node 7: – Input: 0.332 × (-0.3) + 0.525 × (-0.2) + 0.646 × 0.1 + 0.6 = 0.460 – Output: 1 / (1 + e^-0.460) = 0.613 For node 8: – Input: 0.332 × (-0.5) + 0.525 × 0.1 + 0.646 × (-0.3) + 0.3 = -0.007 – Output: 1 / (1 + e^0.007) = 0.498

77 Training (2) For node 7: – Error: 0.613 (1 - 0.613) (1 - 0.613) = 0.092 For node 8: – Error: 0.498 (1 - 0.498) (0 - 0.498) = -0.125 For node 4: – Error: 0.332 (1 - 0.332) (0.092 × (-0.3) + (-0.125) × (-0.5)) = 0.008 For node 5: – Error: 0.525 (1 - 0.525) (0.092 × (-0.2) + (-0.125) × 0.1) = -0.008 For node 6: – Error: 0.646 (1 - 0.646) (0.092 × 0.1 + (-0.125) × (-0.3)) = 0.011

78 Training (3) For each weight, Δw_ij = (learning rate) × (error of node j) × (output of node i); for the input-to-hidden weights the output of input node 1 is x1 = 1: – w14 = 0.2 + 0.9 (0.008) (1) = 0.207 – w15 = -0.3 + 0.9 (-0.008) (1) = -0.307 – … For each bias: – θ4 = -0.4 + 0.9 (0.008) = -0.393 – θ5 = 0.2 + 0.9 (-0.008) = 0.193 – …
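The forward pass and the first few error terms of this worked example can be checked with a short sketch (assuming the plain logistic sigmoid as the activation function, which is what the 1/(1 + e^-x) outputs above imply; variable names are illustrative).

```python
import math
sig = lambda s: 1.0 / (1.0 + math.exp(-s))

# Forward pass: input x = (1, 0, 1), initial weights and biases as listed earlier.
x1, x2, x3 = 1, 0, 1
o4 = sig(0.2*x1 + 0.4*x2 - 0.5*x3 - 0.4)       # 0.332
o5 = sig(-0.3*x1 + 0.1*x2 + 0.2*x3 + 0.2)      # 0.525
o6 = sig(0.6*x1 + 0.7*x2 - 0.1*x3 + 0.1)       # 0.646
o7 = sig(-0.3*o4 - 0.2*o5 + 0.1*o6 + 0.6)      # 0.613
o8 = sig(-0.5*o4 + 0.1*o5 - 0.3*o6 + 0.3)      # 0.498
# Error terms for targets (1, 0), then two of the updates with learning rate 0.9
d7 = o7 * (1 - o7) * (1 - o7)                  # about  0.092
d8 = o8 * (1 - o8) * (0 - o8)                  # about -0.125
d4 = o4 * (1 - o4) * (d7 * (-0.3) + d8 * (-0.5))   # about 0.008
w14_new = 0.2 + 0.9 * d4 * x1                  # about  0.207
theta4_new = -0.4 + 0.9 * d4                   # about -0.393
print(round(o7, 3), round(o8, 3), round(d4, 3), round(w14_new, 3), round(theta4_new, 3))
```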

79 Using ANN for Data Mining Constructing a network: – input data representation – selection of the number of layers and the number of nodes in each layer Training the network using training data Pruning the network Interpreting the results

80 Step 1: Constructing the Network Multi-layer perceptron (MLP): feed-forward, trained with back-propagation. (Diagram: inputs x1 = # of Terms, x2 = GPA, x3 = Demographics, x4 = Courses, x5 = Fin Aid, …, x_j…n, with weights w1 … w_n, feeding outputs o1 = Persist and o2 = Not-persist.)

81 Constructing the Network (2) The number of input nodes corresponds to the dimensionality of the input tuples. – Thermometer coding: age 20-80 split into 6 intervals: [20, 30) → 000001, [30, 40) → 000011, …, [70, 80) → 111111 The number of hidden nodes is adjusted during training. The number of output nodes equals the number of classes.
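A small sketch of the thermometer coding described above for age in [20, 80); the helper name is an assumption made for illustration.

```python
def thermometer_age(age):
    """[20,30) -> 000001, [30,40) -> 000011, ..., [70,80) -> 111111."""
    k = min(max((age - 20) // 10, 0), 5) + 1    # number of trailing 1-bits (1..6)
    return "0" * (6 - k) + "1" * k

print(thermometer_age(25))   # 000001
print(thermometer_age(47))   # 000111
print(thermometer_age(79))   # 111111
```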

82 Step 2: Network Training The ultimate objective of training: – obtain a set of weights that makes almost all the tuples in the training data classified correctly. Steps: – Initial weights are set randomly. – Input tuples are fed into the network one by one. – Activation values for the hidden nodes are computed. – The output vector can be computed once the activation values of all hidden nodes are available. – Weights are adjusted using the error (desired output - actual output).

83 Step 3: Network Pruning A fully connected network is hard to articulate: n input nodes, h hidden nodes and m output nodes lead to h(m+n) links (weights). Pruning: remove some of the links without affecting the classification accuracy of the network.
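The slides do not fix a particular pruning criterion; one simple possibility is magnitude-based pruning with an accuracy check, sketched below. The evaluate callback and the weight dictionary are hypothetical names introduced only for this sketch.

```python
def prune_links(weights, evaluate, tolerance=0.01):
    """weights: dict mapping (from_node, to_node) -> weight value.
    evaluate: function returning classification accuracy for a weight set."""
    baseline = evaluate(weights)
    for link in sorted(weights, key=lambda k: abs(weights[k])):  # smallest weights first
        removed = weights.pop(link)                              # tentatively drop the link
        if evaluate(weights) < baseline - tolerance:             # accuracy suffered
            weights[link] = removed                              # restore the link
    return weights
```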

84 Step 4: Extracting Rules from ANN Discretize the activation values: replace individual activation values by the cluster average while maintaining the network accuracy. Enumerate the output from the discretized activation values to find rules between activation values and output. Find the relationship between the input and the activation values. Combine the above two to obtain rules relating the output to the input.

85 An Example (I) IBM synthetic data – nine attributes (age, salary, …) – classification function: if ((age ≥ 60) ∧ (25K ≤ salary ≤ 75K)) then class = A else class = B Initial network: – 87 input nodes, 2 output nodes, 4 hidden nodes Trained the network using 1000 tuples. Pruned network: – 7 input nodes, 3 hidden nodes, 2 output nodes – 17 links – accuracy 96.30%

86 An Example (II) Hidden node value discretization: – α1: (-1, 0, 1) – α2: (0, 1) – α3: (-1, 0.24, 1) Enumerate the output from the discretized activation values: – if (α2 = 0, α3 = -1) or (α1 = -1, α2 = 1, α3 = -1) or (α1 = -1, α2 = 0, α3 = -0.24) then C1 = 1, C2 = 0 – otherwise C1 = 0, C2 = 1 (Diagram: the pruned network with input nodes I-1, I-2, I-4, I-5, I-13, I-15, I-17.)

87 An Example (III) From input to hidden node: rules mapping conditions on the input nodes to the discretized activation values (e.g. to α2 = 0, to α3 = -1, …). Obtain rules relating input and output by combining these with the rules of the previous slide. Transform back to the original input attributes, where the conditions become tests such as age < 40 and salary < 100K.

88 ANN vs. Others for Data Mining Advantages – prediction accuracy is generally high – robust: works when training examples contain errors – output may be discrete, real-valued, or a vector of several discrete or real-valued attributes – fast evaluation of the learned target function Criticism – long training time – difficult to understand the learned function (weights) – not easy to incorporate domain knowledge

