
Slide 1: ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis): Classification. Olivia R. Liu Sheng, Ph.D., Emma Eccles Jones Presidential Chair of Business

Slide 2: Outline
Introduction
Classic Methods
– Decision Tree
– Neural Network

Slide 3: Introduction
Classification
– Classifies objects into a set of pre-specified object classes based on the values of relevant object attributes and the objects' class labels
[Figure: a classifier assigns objects O1..O6, each carrying relevant attribute values and a class label, to the pre-determined classes X, Y, and Z]

Slide 4: Introduction
When to use it?
– Discovery (descriptive, explanatory)
– Prediction (prescriptive, decision support)
– When the relevant object data can be decided and is available
Real World Applications
– Profiling/predicting customer purchases
– Loan/credit approval
– Fraud/intrusion detection
– Diagnosis decision support

Slide 5: Example. [Figure: customers plotted by Income and Age, with Churn and Not Churn points marked]

Slide 6: Notations. [Figure: the same Income/Age plot annotated with the terms used later: Classification Attributes (Income, Age), Class Label Attribute, Class Labels (Churn, Not Churn), Problem Space, Classification Samples, and Prediction Object]

Slide 7: Object Data Required
Class Label Attribute:
– Dependent variable, output attribute, prediction variable, ...
– Variable whose values label objects' classes
Classification Attributes:
– Independent variables, input attributes, or predictor variables
– Object variables whose values affect objects' class labels
Three Types:
– numerical (age, income)
– categorical (hair color, sex)
– ordinal (severity of an injury)

Slide 8: Classification vs. Prediction
– View 1
  Classification: discovery
  Prediction: predictive, utilizing classification results (rules)
– View 2
  Either discovery or predictive
  Classification: categorical or ordinal class labels
  Prediction: numerical (continuous) class labels
– Class lectures, assignment and exam: View 1
– Text: View 2

Slide 9: Classification & Prediction
Main Function
– Mappings from input attribute values to output attribute values
– Methods affect how the mappings are derived and represented
Process
– Training (supervised): derives the mappings
– Testing: evaluates the accuracy of the mappings

Slide 10: Classification & Prediction
Classification samples: divided into training and testing sets
– Often processed in batch mode
– Include class labels
Prediction objects
– Often processed in online mode
– No class labels

Slide 11: Classification Methods
Comparative Criteria
– Accuracy
– Speed
– Robustness
– Scalability
– Interpretability
– Data types
Classic methods
– Decision Tree
– Neural Network
– Bayesian Network

Slide 12: Decision Tree
Mapping Principle: Recursively partition the data set so that the subsets contain "pure" data
[Figure: the Income/Age plot being partitioned into pure regions]

Slide 13: Decision Tree
Algorithm:
Start from the whole data set;
Do {
  Split the data set into two or more subsets using every candidate split on the classification attributes;
  Choose the split that produces the "purest" subsets;
} While (a subset is not pure)

Slide 14: Decision Tree
Key Question: How is purity (diversity) measured?
– Gini Index of diversity: the ecologists' contribution
– Example: 8 cats, 2 tigers
  The probability of choosing a cat (p1) = 8/10 = 0.8
  The probability of choosing a cat AGAIN = 0.8 * 0.8 = 0.64
  The probability of choosing a tiger (p2) = 2/10 = 0.2
  The probability of choosing a tiger AGAIN = 0.2 * 0.2 = 0.04
  What is the probability of choosing two different animals?

Slide 15: Decision Tree
P = 1 - p1*p1 - p2*p2 = 0.32
When is P biggest? --> when the number of cats equals the number of tigers (p1 = p2 = 0.5): P = 0.5
When is P smallest? --> when there is only one kind of animal (p1 or p2 = 1): P = 0
The Gini index can therefore represent the diversity of the data set (see the sketch below).
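A quick numerical check of the cats-and-tigers example (an illustrative Python sketch; the gini helper is made up for this note, not from the slides):

```python
# Gini index of diversity: the probability of drawing two DIFFERENT classes
# in two independent draws with replacement.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([8, 2]))   # 8 cats, 2 tigers: 1 - 0.8^2 - 0.2^2 = 0.32
print(gini([5, 5]))   # equal counts: maximum diversity for two classes = 0.5
print(gini([10, 0]))  # only one kind of animal: 0.0
```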

Slide 16: Decision Tree
Gini Index: Suppose there are n output classes and class i appears with probability pi. The Gini Index is:
  Gini = 1 - (p1*p1 + p2*p2 + ... + pn*pn)
When there are only two classes:
  1 - p1*p1 - p2*p2 = 1 - p1*p1 - (1-p1)*(1-p1) = 2p1(1 - p1)

Slide 17: Decision Tree
Another Index: Entropy
  E = -(p1*log2 p1 + p2*log2 p2 + ... + pn*log2 pn)
When there are only two categories:
  E = -(p1*log2 p1 + p2*log2 p2)
The bigger E is, the more diverse the data is
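A minimal entropy helper matching this definition (illustrative Python; the function name and the convention of skipping empty classes are assumptions of this sketch):

```python
import math

def entropy(counts):
    """Entropy of a class distribution, given the per-class record counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([5, 5]))   # two equally likely classes -> 1.0 (most diverse)
print(entropy([10, 0]))  # a pure set -> 0.0 (least diverse)
```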

Slide 18: Decision Tree
Practice
– Question 1: if the chance of churn is 0.5 and the chance of not churn is 0.5, what is the Entropy?
– Answer: -2 * (0.5 * log2 0.5) = 1
– Question 2: if the chance of churn is 0.25 and the chance of not churn is 0.75, what is the Entropy?
– Answer: -(0.25 * log2 0.25 + 0.75 * log2 0.75) ≈ 0.811

Slide 19: Decision Tree
What is the entropy of set 1?
– -[5/6 * log2(5/6) + 1/6 * log2(1/6)]
What about set 2? The whole set? The reduction?
[Figure: the Income/Age plot split into Set 1 and Set 2]

Slide 20: Decision Tree
Calculation of the reduction in Entropy
– Original E: easy to get
– E of the subsets: easy to get
– How to combine the E's of the subsets? Simply adding them together is not good, since their sizes differ
– Use a weighted sum:
  w1 = # of records in subset 1 / total # of records
  w2 = # of records in subset 2 / total # of records
  E' = w1 * E1 + w2 * E2

Slide 21: Decision Tree
The algorithm (divide and conquer):
– Select an attribute and partition the data set D into D1, D2, ..., Dn
– Calculate the Entropy Ei for each subset Di
– Get E' = w1*E1 + w2*E2 + ... + wn*En, where wi = |Di| / |D|
– Get E of the data set before the partition
– Get the reduction in Entropy = E - E'
– Divide the data set using the attribute with the largest Entropy reduction; go to the next round (see the sketch below)
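One round of this loop can be sketched as follows (illustrative Python; the toy Age/Income records, the candidate thresholds, and the restriction to binary splits are assumptions of the sketch, not the slides' data):

```python
import math

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def weighted_entropy(rows, labels, attribute, threshold):
    """E' after splitting a numeric attribute at a threshold."""
    left  = [lab for row, lab in zip(rows, labels) if row[attribute] <= threshold]
    right = [lab for row, lab in zip(rows, labels) if row[attribute] > threshold]
    n = len(labels)
    return sum(len(s) / n * entropy(s) for s in (left, right) if s)

# Toy data set: four customers with Age, Income, and a churn label.
rows = [{"Age": 30, "Income": 20}, {"Age": 40, "Income": 50},
        {"Age": 60, "Income": 22}, {"Age": 70, "Income": 80}]
labels = ["Churn", "NotChurn", "Churn", "NotChurn"]

e_before = entropy(labels)
candidates = [("Age", 55), ("Income", 23)]
best = max(candidates, key=lambda c: e_before - weighted_entropy(rows, labels, *c))
print(best)  # the split with the largest entropy reduction (here: Income at 23)
```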

Slide 22: Decision Tree. [Figure: the Income/Age plot with two candidate splits marked: Income = 23K and Age = 55]

Slide 23: Decision Tree
Partition by Age = 55:
– E  = -[9/16 * log2(9/16) + 7/16 * log2(7/16)]
– E1 = -[5/6 * log2(5/6) + 1/6 * log2(1/6)]
– E2 = -[0.4 * log2(0.4) + 0.6 * log2(0.6)]
– E' = 6/16 * E1 + 10/16 * E2
– Reduction = E - E'
Partition by Income = 23K: similar
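Plugging the numbers in (illustrative Python; base-2 logs throughout):

```python
import math
log2 = math.log2

# Whole set: 16 records, 9 of one class and 7 of the other.
E  = -(9/16 * log2(9/16) + 7/16 * log2(7/16))
# Subsets produced by Age = 55: 6 records (5 vs. 1) and 10 records (4 vs. 6).
E1 = -(5/6 * log2(5/6) + 1/6 * log2(1/6))
E2 = -(0.4 * log2(0.4) + 0.6 * log2(0.6))
E_prime = 6/16 * E1 + 10/16 * E2          # weighted by subset size
print(E, E1, E2, E_prime, E - E_prime)    # the last value is the reduction
```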

Slide 24: Decision Tree
[Figure: the resulting decision tree. The root splits on Age at 55; one branch ends in a Churn leaf, and the other splits on Income at 23K, leading to Churn and Not Churn leaves]

Slide 25: Extract rules from the model
Each path from the root to a leaf node forms an IF-THEN rule. The conditions at the root and internal nodes are ANDed together to form the IF part; the leaf node gives the THEN part of the rule. For example, a path through Age > 55 and Income <= 23K that ends in a Churn leaf yields: IF Age > 55 AND Income <= 23K THEN Churn.

Slide 26: Pruning
Noise: inconsistent class labels for the same attribute values
Outliers: the # of samples with a given combination of class labels and input attribute values is small
Overfitting: tree branches are created to classify noise and outliers
Problem: unreliable tree branches

Slide 27: Pruning
Function: remove unreliable branches
Pre-pruning
– Halts the creation of unreliable branches by statistically determining the goodness of further tree splits
– Less time-consuming but less effective
Post-pruning
– Removes unreliable branches from a full tree
– Minimizes error rates or the required encoding bits
– More time-consuming but more effective

Slide 28: Decision Tree
Pros of Decision Trees:
– Clear rules
– Fast algorithm
– Scalable
Cons:
– The accuracy may suffer with complex problems, e.g., a large number of class labels

Slide 29: Decision Tree
Many trees out there!
– ID3
– C4.5 --- continuous predictor values
– CART
– Forest
– MDTI
– ...

Slide 30: Neural Networks
What is it?
– Another classification technique that maps from input attribute(s) to output attribute(s)
– Most widely known but least understood
Human brains: the root of neural networks

Slide 31: Neural Networks. [Figure: a network with input nodes i1 and i2, hidden nodes H1 and H2, and output nodes O1, O2, O3]

Slide 32: Neural Network
Let's start with a simple example: z = 3x + 2y + 1
Input attributes: x, y
Output attribute: z
How to represent the mapping?
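One way to picture this mapping as a single output node (illustrative Python; the function name and default weights simply mirror the example above):

```python
def linear_neuron(x, y, w_x=3.0, w_y=2.0, constant=1.0):
    """Output node: a weighted SUM of the inputs plus a constant."""
    return w_x * x + w_y * y + constant

print(linear_neuron(1, 1))   # 3*1 + 2*1 + 1 = 6
print(linear_neuron(2, -1))  # 3*2 + 2*(-1) + 1 = 5
```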

Slide 33: Neural Network. [Figure: a two-layer network for z = 3x + 2y + 1. Input nodes x and y in the input layer connect to the output node in the output layer with weights 3 and 2; the combination function is a SUM (plus the constant 1), followed by the transfer function]

Slide 34: Two-layer Neural Network
Major components:
– Input layer
– Output layer
– A weight matrix in between
Functions:
– Combination function: usually a sum
– Transfer (activation) function: "squashes" (normalizes) the sum to a certain range
Can represent ANY linear function between the input space and the output space

Slide 35: Neural Network. [Figure: the same two-layer network, with SUM as the combination function and a sigmoid transfer function]
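A sketch of the sigmoid transfer function squashing the weighted sum into the range (0, 1) (illustrative Python; the neuron weights are just the z = 3x + 2y + 1 example reused):

```python
import math

def sigmoid(s):
    """Transfer (activation) function: squashes any real-valued sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

def neuron(x, y):
    s = 3.0 * x + 2.0 * y + 1.0   # combination function: SUM
    return sigmoid(s)             # transfer function

print(neuron(0, 0))    # sigmoid(1)  ~= 0.73
print(neuron(-2, -1))  # sigmoid(-7) ~= 0.001
```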

Slide 36: Neural Networks
How about non-linear relationships? Throw in another layer: the hidden layer.
Theoretically, a neural net with this structure can represent ANY function between the input space and the output space.
[Figure: a network with input nodes i1 and i2 connected to hidden nodes H1 and H2 by weights w1_11, w1_12, w1_21, w1_22, and output nodes O1, O2, O3]

Slide 37: Neural Networks. Data flow: [Figure: Age and Income enter at input nodes i1 and i2; each hidden node H1, H2 computes a weighted sum followed by the transfer function S(.), and the output nodes O1 and O2 do the same, producing the Churn and Not Churn scores]

Slide 38: Neural Network
Feed-forward:
– The above process, in which the input values are transformed through the network to produce the output values, is called FEED-FORWARD (sketched below).
– When we get new records, we do feed-forward to get the prediction values.
But how do we produce a network that can predict?
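A minimal feed-forward pass for a 2-input, 2-hidden, 2-output network (illustrative Python; the weight values are made up, and bias terms are omitted for brevity):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def feed_forward(inputs, w_input_hidden, w_hidden_output):
    # Hidden layer: each node takes a weighted sum of the inputs, then the transfer function.
    hidden = [sigmoid(sum(w * x for w, x in zip(weights, inputs)))
              for weights in w_input_hidden]
    # Output layer: each node takes a weighted sum of the hidden values, then the transfer function.
    return [sigmoid(sum(w * h for w, h in zip(weights, hidden)))
            for weights in w_hidden_output]

w_input_hidden  = [[0.2, -0.4],    # H1 <- (i1, i2)
                   [0.7,  0.1]]    # H2 <- (i1, i2)
w_hidden_output = [[0.5,  0.3],    # O1, e.g. the Churn score
                   [-0.2, 0.8]]    # O2, e.g. the Not Churn score

print(feed_forward([0.35, 0.9], w_input_hidden, w_hidden_output))
```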

Slide 39: Neural Nets. [Figure: process flow. The data set is split into a training set and a testing set; an initial neural net is trained on the training set, producing a trained neural net; the trained net is then tested, yielding a trained net with a performance measurement]

Slide 40: Neural Net
Split the data set (sketched below):
– Simple split: 2/3 for training, 1/3 for testing (build the classifier on the training set, then estimate its error on the testing set)
– Ten-fold cross-validation: 9/10 for training, 1/10 for testing; repeat ten times. Use this when the sample size is small.
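Both splitting schemes as a sketch (illustrative Python; the samples list is a stand-in for the real classification samples):

```python
import random

samples = list(range(30))          # stand-in for the classification samples
random.shuffle(samples)

# Simple split: 2/3 for training, 1/3 for testing.
cut = len(samples) * 2 // 3
train, test = samples[:cut], samples[cut:]

# Ten-fold cross-validation: each fold serves as the test set exactly once.
k = 10
folds = [samples[i::k] for i in range(k)]
for i in range(k):
    test_fold = folds[i]
    train_fold = [s for j, fold in enumerate(folds) if j != i for s in fold]
    # train on train_fold, test on test_fold, then average the ten accuracy figures
```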

Slide 41: Training a Neural Net
Step 1. Set up an initial neural net:
– Choose the numbers of input, output, and hidden layer nodes; node values start at 0
– The weight matrices are often set to small random values in (-0.5, 0.5)
Step 2. Feed-forward:
– Use historical data: run the predictor values through the network and get the output
(initialize; feed-forward: guess; back-propagation: learn)

Slide 42: Neural Net
Step 3: Back-propagation
– The critical step: learning happens here
– Compare the network's result with the historical result: error_i = o_i(real) - o_i(machine)
– Based on this error, go BACK to the hidden-to-output weight matrix and change the weights so that the error becomes smaller
– Requires calculus (derivatives of the error); just think of it as looking for the minimum error on an error surface (see the sketch below)
– Repeat the process until the error falls within an acceptable range
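A single back-propagation step for one output node can be sketched as a gradient-descent update on the squared error (illustrative Python; the hidden values, weights, and learning rate are made up, and only the hidden-to-output weights are updated here):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

hidden  = [0.6, 0.2]      # values coming out of the hidden layer
weights = [0.4, -0.3]     # hidden-to-output weights for this output node
target  = 1.0             # the historical (real) output value
learning_rate = 0.1

output = sigmoid(sum(w * h for w, h in zip(weights, hidden)))
error  = target - output                  # error = o_real - o_machine

# Derivative of the squared error w.r.t. each weight (for a sigmoid output node):
delta   = error * output * (1.0 - output)
weights = [w + learning_rate * delta * h for w, h in zip(weights, hidden)]
print(output, error, weights)             # the weights move to shrink the error
```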

Slide 43: Neural Net
Tuning for the training phase
– Topology: number of input, output, and hidden nodes
  hidden = 1/2 * (output + input)
  number of hidden layers: 1 is usually enough
– Learning rate (0-1): the rate at which weights can be modified from the previous weights
  Very important for learning convergence and performance
– Momentum: an adjustment included when calculating weight modifications
  Typically very small or zero; less important
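The rules of thumb above, expressed as a tiny configuration sketch (illustrative values only):

```python
n_input, n_output = 2, 2
n_hidden = (n_input + n_output) // 2   # rule of thumb: half of (input + output) nodes
n_hidden_layers = 1                    # one hidden layer is usually enough

learning_rate = 0.1   # in (0, 1): how far each weight moves per update
momentum = 0.0        # fraction of the previous weight change added to the current one
# weight_change = learning_rate * delta * input_value + momentum * previous_weight_change
```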

Slide 44: Neural Net
Pros:
– Very powerful (can represent ANY function!)
Cons:
– Time-consuming
– Black box
When and where to use it:
– Complicated prediction problems
– Visualization or understanding of the rules is not needed
– Accuracy is very important

Slide 45: Summary
Basics
– Classification versus prediction
  Mappings from input attributes to class labels
  Data types of input attributes and class labels: numerical, categorical and ordinal
  Data-type-based view and discovery-vs-predictive view
Decision-tree induction method
– Recursive partitioning of the data set to increase the purity (or information gain) of class labels in individual partitions
– Entropy function: a measure of diversity
– Tree nodes correspond to partitions and links correspond to partitioning conditions
– Pre-pruning or post-pruning removes unreliable tree branches caused by noise or outliers

Slide 46: Summary
Neural Net
– A neural net has the following components: input layer, output layer, hidden layer, weight matrices
– The input layer represents the input attributes
– The output layer represents the output classes
– The hidden layer and the weight matrices help capture the mapping function

Slide 47: Summary
Neural Net
– To use a neural net, go through three steps:
  Training: feed-forward, back-propagation
  Testing: feed-forward only; used to measure the accuracy of the model built
  Prediction: feed-forward without testing the performance
– Most of the tuning occurs in the training phase:
  hidden layer node number
  learning rate
  momentum
Readings: T2, Ch. 7.1 – 7.3.3 and Ch. 7.2

