Microsoft Enterprise Consortium
Data Mining Concepts
Introduction to Directed Data Mining: Neural Networks
Prepared by David Douglas, University of Arkansas
Hosted by the University of Arkansas

Neural Networks
- Complex learning systems recognized in animal brains
- A single neuron has a simple structure
- Interconnected sets of neurons perform complex learning tasks
- The human brain has an enormous number of synaptic connections
- Artificial Neural Networks attempt to replicate the non-linear learning found in nature (the word "artificial" is usually dropped)
Adapted from Larose

Neural Networks (Cont)
Terms used:
- Layers: input, hidden, output
- Feed forward
- Fully connected
- Back propagation
- Learning rate
- Momentum
- Optimization / sub-optimization

Neural Networks (Cont)
[Figure: structure of a neural network]
Adapted from Berry & Linoff

Neural Networks (Cont)
- The inputs are combined, using weights and a combination function, to obtain a value for each neuron in the hidden layer.
- A non-linear response (the activation function, usually a sigmoid transform) is then generated from each neuron in the hidden layer to the output.
- After the initial pass, accuracy is evaluated and back propagation through the network occurs, changing the weights for the next pass.
- This is repeated until the changes (deltas) are small. Beware: the result could be a sub-optimal solution.
Layers: input layer, hidden layer (combination function, then transform, usually a sigmoid), output layer.
Adapted from Larose

Neural Networks (Cont)
- Neural network algorithms require inputs to be within a small numeric range. For numeric variables this is easy to do with the min-max approach, which rescales values to lie between 0 and 1:
  X* = (X - min(X)) / (max(X) - min(X))
- Other normalization methods can also be applied.
- Neural Networks, like Logistic Regression, do not handle missing values, whereas Decision Trees do. Many data mining software packages automatically patch up missing values, but the modeler should know how the software is handling them.
Adapted from Larose
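A minimal sketch of the min-max rescaling described above, in plain Python (the variable names are illustrative, not from the slides):

```python
def min_max_normalize(values):
    """Rescale a list of numeric values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                          # guard against a constant column
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [23, 35, 47, 61, 29]
print(min_max_normalize(ages))            # approximately [0.0, 0.316, 0.632, 1.0, 0.158]
```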

Neural Networks (Cont)
Categorical variables
- Indicator variables (sometimes referred to as 1-of-n) are used when the number of category values is small.
- A categorical variable with k classes is translated into k - 1 indicator variables.
- For example, suppose the Gender attribute has the values "Male", "Female", and "Unknown":
  - Classes: k = 3
  - Create k - 1 = 2 indicator variables named Male_I and Female_I
  - Male records have Male_I = 1, Female_I = 0
  - Female records have Male_I = 0, Female_I = 1
  - Unknown records have Male_I = 0, Female_I = 0
Adapted from Larose
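A small sketch of this k - 1 indicator encoding for the gender example, in plain Python (the function name is illustrative):

```python
def encode_gender(gender):
    """Map 'Male'/'Female'/'Unknown' to the two indicators (Male_I, Female_I)."""
    return {
        "Male":    (1, 0),
        "Female":  (0, 1),
        "Unknown": (0, 0),
    }[gender]

for g in ("Male", "Female", "Unknown"):
    print(g, encode_gender(g))
```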

Neural Networks (Cont)
Categorical variables
- Be very careful when mapping categorical variables to numbers for a neural network. The mapping introduces an ordering of the values, which the network takes into account. The 1-of-n approach avoids this problem but is cumbersome for a large number of categories.
- For example, codes for marital status ("single", "divorced", "married", "separated", "widowed", and "unknown") could be assigned as:
  Single    0.0
  Divorced  0.2
  Married   0.4
  Separated 0.6
  Widowed   0.8
  Unknown   1.0
  Note the implied ordering.
Adapted from Berry & Linoff

Neural Networks (Cont)
Data mining software
- Most modern data mining software takes care of these issues for you, but you need to be aware that it is happening and what default settings are being used. For example, the following is taken from the PASW Modeler 13 Help topics describing binary set encoding (an advanced topic):
  Use binary set encoding: If this option is selected, a compressed binary encoding scheme for set fields is used. This option allows you to easily build neural net models using set fields with large numbers of values as inputs. However, if you use this option, you may need to increase the complexity of the network architecture (by adding more hidden units or more hidden layers) to allow the network to properly use the compressed information in binary encoded set fields. Note: The simplemax and softmax scoring methods, SQL generation, and export to PMML are not supported for models that use binary set encoding.

A Numeric Example
- Feed forward restricts network flow to a single direction
- Fully connected
- Flow does not loop or cycle
- Network composed of two or more layers
[Figure: a fully connected feed-forward network with input nodes x0, x1, x2, x3 in the input layer, nodes A and B in the hidden layer, node Z in the output layer, and weights W_0A, W_1A, W_2A, W_3A, W_0B, W_1B, W_2B, W_3B, W_0Z, W_AZ, W_BZ on the connections]
Adapted from Larose

Numeric Example (Cont)
- Most networks have input, hidden, and output layers; a network may contain more than one hidden layer.
- The network is completely connected: each node in a given layer is connected to every node in the next layer.
- Every connection has a weight (W_ij) associated with it; weight values are randomly assigned between 0 and 1 by the algorithm.
- The number of input nodes depends on the number of predictors; the number of hidden and output nodes is configurable.
- How many nodes in the hidden layer?
  - A large number of nodes increases the complexity of the model: more detailed patterns are uncovered in the data, but this leads to overfitting at the expense of generalizability.
  - Reduce the number of hidden nodes when overfitting occurs; increase it when training accuracy is unacceptably low.

Numeric Example (Cont)
The combination function produces a linear combination of the node inputs and connection weights as a single scalar value. Consider the following inputs and weights:
  x_0 = 1.0    W_0A = 0.5    W_0B = 0.7    W_0Z = 0.5
  x_1 = 0.4    W_1A = 0.6    W_1B = 0.9    W_AZ = 0.9
  x_2 = 0.2    W_2A = 0.8    W_2B = 0.8    W_BZ = 0.9
  x_3 = 0.7    W_3A = 0.6    W_3B = 0.4
The combination function gives the hidden layer node values:
  Net_A = 0.5(1) + 0.6(0.4) + 0.8(0.2) + 0.6(0.7) = 1.32
  Net_B = 0.7(1) + 0.9(0.4) + 0.8(0.2) + 0.4(0.7) = 1.50
Adapted from Larose

Numeric Example (Cont)
The transformation (activation) function is typically the sigmoid function:
  f(x) = 1 / (1 + e^(-x))
The transformed values for nodes A and B are then:
  f(Net_A) = 1 / (1 + e^(-1.32)) = 0.7892
  f(Net_B) = 1 / (1 + e^(-1.50)) = 0.8176
Adapted from Larose

Numeric Example (Cont)
Node Z combines the outputs of the two hidden nodes A and B as follows:
  Net_Z = 0.5(1) + 0.9(0.7892) + 0.9(0.8176) = 1.9461
The Net_Z value is then put through the sigmoid function:
  f(Net_Z) = 1 / (1 + e^(-1.9461)) = 0.8750
Adapted from Larose
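The whole forward pass above can be reproduced in a few lines of Python; this is a sketch using the slide's inputs and weights, not library code:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Inputs (x0 is the constant 1.0) and the weights from the table above.
x   = [1.0, 0.4, 0.2, 0.7]
w_A = [0.5, 0.6, 0.8, 0.6]          # W_0A, W_1A, W_2A, W_3A
w_B = [0.7, 0.9, 0.8, 0.4]          # W_0B, W_1B, W_2B, W_3B
w_Z = [0.5, 0.9, 0.9]               # W_0Z, W_AZ, W_BZ

net_A = sum(xi * wi for xi, wi in zip(x, w_A))        # 1.32
net_B = sum(xi * wi for xi, wi in zip(x, w_B))        # 1.50
out_A, out_B = sigmoid(net_A), sigmoid(net_B)         # about 0.7892 and 0.8176

net_Z = w_Z[0] * 1.0 + w_Z[1] * out_A + w_Z[2] * out_B    # about 1.946
out_Z = sigmoid(net_Z)                                    # about 0.8750
print(round(out_A, 4), round(out_B, 4), round(out_Z, 4))
```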

Learning via Back Propagation
- The output for each record that goes through the network can be compared to an actual value.
- Sum the squared differences over all the records to get the sum of squared errors (SSE).
- The idea is then to find the weights that minimize the SSE.
- The gradient descent method optimizes the weights to minimize the SSE:
  - It results in one equation for output layer nodes and a different equation for hidden layer nodes.
  - It utilizes a learning rate and momentum.
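The SSE the slide refers to is just the sum of squared prediction errors over all records; a minimal sketch (the example numbers are illustrative):

```python
def sse(actuals, outputs):
    """Sum of squared errors over all records."""
    return sum((a - o) ** 2 for a, o in zip(actuals, outputs))

# Two records: actual values 0.8 and 0.3, network outputs 0.8750 and 0.3100
print(sse([0.8, 0.3], [0.8750, 0.3100]))   # 0.005625 + 0.0001 = 0.005725
```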

Gradient Descent Method Equations
Output layer nodes:
  R_j = output_j (1 - output_j)(actual_j - output_j)
  where R_j is the responsibility for error at node j.
Hidden layer nodes:
  R_j = output_j (1 - output_j) * SUM_downstream(w_jk * R_k)
  where SUM_downstream(w_jk * R_k) is the weighted sum of the error responsibilities of the nodes downstream of node j.
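The two responsibility equations translate directly into code; a sketch in Python (the function and variable names are mine, not from the slides):

```python
def output_responsibility(output, actual):
    """R_j for an output node: output_j * (1 - output_j) * (actual_j - output_j)."""
    return output * (1 - output) * (actual - output)

def hidden_responsibility(output, downstream_weights, downstream_resps):
    """R_j for a hidden node: output_j * (1 - output_j) * sum of w_jk * R_k downstream."""
    weighted = sum(w * r for w, r in zip(downstream_weights, downstream_resps))
    return output * (1 - output) * weighted
```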

Numeric Example (output node)
- Assume the output of 0.8750 calculated above is compared to the actual value for the record, 0.8.
- Back propagation then changes the weights, starting with the constant weight (initially 0.5) for node Z, the only output node.
- The responsibility for error at output node Z:
  R_Z = output_Z (1 - output_Z)(actual - output_Z)
  R_Z = 0.8750(1 - 0.8750)(0.8 - 0.8750) = -0.0082
- Calculate the change for the weight whose input transmits 1 unit, with a learning rate of 0.1:
  Δw_0Z = 0.1(-0.0082)(1) = -0.00082
- Calculate the new weight:
  w_0Z,new = 0.5 + (-0.00082) = 0.49918, which will now be used instead of 0.5.
Adapted from Larose

Numeric Example (hidden layer node)
Now consider hidden layer node A. The equation is:
  R_j = output_j (1 - output_j) * SUM_downstream(w_jk * R_k)
- The only downstream node is Z; the original w_AZ = 0.9, its error responsibility is -0.0082, and the output of node A was 0.7892.
- Thus:
  R_A = 0.7892(1 - 0.7892)(0.9)(-0.0082) = -0.00123
  Δw_AZ = 0.1(-0.0082)(0.7892) = -0.00065
  w_AZ,new = 0.9 + (-0.00065) = 0.89935
- This back propagation continues through the nodes, and the process is repeated until the weights change very little.
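Putting the two worked slides together, one pass of the weight updates can be checked numerically; a sketch using the slide's values and a learning rate of 0.1 (no momentum term):

```python
eta = 0.1                                   # learning rate

# Output node Z: computed output 0.8750, actual value 0.8
out_Z, actual = 0.8750, 0.8
R_Z = out_Z * (1 - out_Z) * (actual - out_Z)        # about -0.0082

w_0Z_new = 0.5 + eta * R_Z * 1.0                    # constant input is 1 -> about 0.49918
w_AZ_new = 0.9 + eta * R_Z * 0.7892                 # input is output of node A -> about 0.89935

# Hidden node A: its only downstream node is Z (original w_AZ = 0.9)
out_A = 0.7892
R_A = out_A * (1 - out_A) * (0.9 * R_Z)             # about -0.00123

print(round(R_Z, 4), round(w_0Z_new, 5), round(w_AZ_new, 5), round(R_A, 5))
```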

Learning Rate and Momentum
- The learning rate, eta (η), determines the magnitude of the changes to the weights.
- Momentum, alpha (α), is analogous to the mass of a rolling object, as illustrated below: a smaller object may not have enough momentum to roll over the top to find the true optimum.
[Figure: SSE plotted against a weight w, with points I, A, B, C marked; left panel labeled "Small Momentum", right panel labeled "Large Momentum"]
Adapted from Larose
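The slides do not give the momentum formula itself; in the usual formulation (an assumption here, not stated on the slide), the current weight change is the gradient step plus a fraction alpha of the previous change:

```python
def weight_change(eta, responsibility, node_input, alpha, previous_change):
    """Gradient step plus alpha times the previous change (assumed standard momentum form)."""
    return eta * responsibility * node_input + alpha * previous_change

# Example: learning rate 0.1, momentum 0.9, previous change -0.0005
print(weight_change(0.1, -0.0082, 1.0, 0.9, -0.0005))   # -0.00082 + (-0.00045) = -0.00127
```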

Lessons Learnt
- A versatile, proven data mining tool.
- Based on biological models of how the brain works.
- Feed-forward is the most common type of network.
- Back propagation for training has largely been replaced with other methods, notably conjugate gradient.
- Drawbacks:
  - Works best with only a few input variables, and it does not help in selecting the input variables.
  - No guarantee that the weights are optimal; build several networks and take the best one.
  - The biggest problem is that it does not explain what it is doing (no rules).