1  Classification and Prediction. Bamshad Mobasher, DePaul University

2  What Is Classification?
- The goal of data classification is to organize and categorize data into distinct classes.
- A model is first created based on the data distribution; the model is then used to classify new data.
- Given the model, a class can be predicted for new data.
- Classification = prediction for discrete and nominal values.
- Intuition: with classification, I can predict which bucket to put the ball in, but I cannot predict the weight of the ball.

3  Prediction, Clustering, Classification
- What is prediction? The goal of prediction is to forecast or deduce the value of an attribute based on the values of other attributes.
  - A model is first created based on the data distribution; the model is then used to predict future or unknown values.
- Supervised vs. unsupervised classification:
  - Supervised classification = classification: we know the class labels and the number of classes.
  - Unsupervised classification = clustering: we do not know the class labels and may not know the number of classes.

4  Classification: A 3-Step Process
1. Model construction (learning):
   - Each record (instance) is assumed to belong to a predefined class, as determined by one of the attributes, called the class label.
   - The set of all records used to construct the model is called the training set.
   - The model is usually represented in the form of classification rules (IF-THEN statements) or decision trees.
2. Model evaluation (accuracy):
   - Estimate the accuracy rate of the model based on a test set: the known label of each test sample is compared with the model's prediction.
   - Accuracy rate: percentage of test-set samples correctly classified by the model.
   - The test set must be independent of the training set, otherwise over-fitting will occur.
3. Model use (classification):
   - The model is used to classify unseen instances (assigning class labels) and to predict the value of the class attribute for new data.
(A minimal code sketch of these three steps follows.)
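A minimal sketch of the three steps using scikit-learn's decision tree classifier. The file names and the "class" column are illustrative assumptions, not part of the slides, and categorical columns would need to be encoded before this would run on real data:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labeled data: numeric feature columns plus a 'class' label column.
data = pd.read_csv("training_data.csv")
X, y = data.drop(columns=["class"]), data["class"]

# Keep the test set independent of the training set so over-fitting can be detected.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 1. Model construction (learning)
model = DecisionTreeClassifier().fit(X_train, y_train)

# 2. Model evaluation (accuracy on the held-out test set)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 3. Model use (classify unseen instances with no label column)
new_records = pd.read_csv("new_data.csv")
print(model.predict(new_records))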

5  Model Construction

6  Model Evaluation

7  Model Use: Classification

8  Classification Methods
- Decision Tree Induction
- Neural Networks
- Bayesian Classification
- Association-Based Classification
- K-Nearest Neighbor
- Case-Based Reasoning
- Genetic Algorithms
- Fuzzy Sets
- Many more

9  Decision Trees
- A decision tree is a flow-chart-like tree structure.
- An internal node denotes a test on an attribute (feature).
- A branch represents an outcome of the test; all records in a branch have the same value for the tested attribute.
- A leaf node represents a class label or a class label distribution.
- Example tree: outlook at the root; outlook = sunny -> test humidity (high: N, normal: P); outlook = overcast -> P; outlook = rain -> test windy (true: N, false: P).

10  Decision Trees
- Example: "Is it a good day to play golf?" A set of attributes and their possible values:
  - outlook: sunny, overcast, rain
  - temperature: cool, mild, hot
  - humidity: high, normal
  - windy: true, false
- A particular instance in the training set is an assignment of values to these four attributes together with a target class label (e.g., play).
- In this case, the target class is a binary attribute, so each instance represents a positive or a negative example.

11  Using Decision Trees for Classification
- Examples can be classified as follows:
  1. Look at the example's value for the feature tested at the current node.
  2. Move along the edge labeled with this value.
  3. If you reach a leaf, return the label of the leaf.
  4. Otherwise, repeat from step 1 at the new node.
- Example (a decision tree to decide whether to go on a picnic): outlook at the root; outlook = sunny -> test humidity (high: N, normal: P); outlook = overcast -> P; outlook = rain -> test windy (true: N, false: P).
- A new instance is classified by following the path that matches its attribute values; an instance that ends at an "N" leaf (for example, a sunny day with high humidity) will be classified as "no play".
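The classification procedure above amounts to walking the tree from the root to a leaf. A sketch, using a nested-dictionary tree representation assumed here purely for illustration:

# Classify one example by walking a decision tree.
# Internal node: {"feature": name, "branches": {value: subtree}}; a leaf is a class label string.
def classify(tree, example):
    while isinstance(tree, dict):            # stop when we reach a leaf label
        value = example[tree["feature"]]     # 1. look at the example's value for the tested feature
        tree = tree["branches"][value]       # 2. follow the edge labeled with that value
    return tree                              # 3. return the label of the leaf

picnic_tree = {
    "feature": "outlook",
    "branches": {
        "sunny": {"feature": "humidity", "branches": {"high": "N", "normal": "P"}},
        "overcast": "P",
        "rain": {"feature": "windy", "branches": {"true": "N", "false": "P"}},
    },
}
print(classify(picnic_tree, {"outlook": "sunny", "humidity": "high", "windy": "false"}))  # -> "N"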

12  Decision Trees and Decision Rules
- If attributes are continuous, internal nodes may test the attribute value against a threshold.
- Example tree: outlook = sunny -> test humidity (<= 75%: yes, > 75%: no); outlook = overcast -> yes; outlook = rain -> test windy (> 20: no, <= 20: yes).
- Each path in the tree represents a decision rule:
  - Rule 1: If (outlook = "sunny") AND (humidity <= 75%) Then (play = "yes")
  - Rule 2: If (outlook = "rainy") AND (wind > 20) Then (play = "no")
  - Rule 3: If (outlook = "overcast") Then (play = "yes")
  - ...

13  Top-Down Decision Tree Generation
- The basic approach usually consists of two phases:
  - Tree construction:
    - At the start, all the training examples are at the root.
    - Examples are then partitioned recursively based on selected attributes.
  - Tree pruning:
    - Remove tree branches that may reflect noise in the training data and lead to errors when classifying test data.
    - This improves classification accuracy.
- Basic steps in decision tree construction:
  - The tree starts as a single node representing all the data.
  - If the samples are all of the same class, the node becomes a leaf labeled with that class.
  - Otherwise, select the feature that best separates the samples into individual classes and recurse.
  - Recursion stops when:
    - the samples in a node (mostly) belong to the same class, or
    - there are no remaining attributes on which to split.

14  Tree Construction Algorithm (ID3)
- Decision tree learning method (ID3). Input: a set of training examples S, a set of features F.
  1. If every element of S has class value "yes", return "yes"; if every element of S has class value "no", return "no".
  2. Otherwise, choose the best feature f from F (if there are no features remaining, return failure).
  3. Extend the tree from f by adding a new branch for each attribute value of f.
     3.1. Set F' = F - {f}.
  4. Distribute the training examples to the leaf nodes, so that each leaf node n represents the subset S_n of S with the corresponding attribute value.
  5. Repeat steps 1-5 for each leaf node n, with S_n as the new set of training examples and F' as the set of features.
- Main question: how do we choose the best feature at each step? (A code sketch of this procedure follows.)
- Note: the ID3 algorithm only deals with categorical attributes, but can be extended (as in C4.5) to handle continuous attributes.
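A compact sketch of this loop (categorical features only). The feature-scoring function is passed in as a parameter, since the information-gain criterion is only introduced on the following slides, and returning the majority class when features run out is used here as a common substitute for the slide's "return failure":

from collections import Counter

def id3(examples, features, choose_best):
    """examples: list of (attribute-dict, label) pairs; features: list of attribute names.
    choose_best(examples, features) returns the most discriminating feature."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                       # step 1: all examples share one class
        return labels[0]
    if not features:                                # no features left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = choose_best(examples, features)          # step 2: pick the best feature
    remaining = [f for f in features if f != best]  # step 3.1: F' = F - {f}
    tree = {"feature": best, "branches": {}}
    values = {attrs[best] for attrs, _ in examples}
    for v in values:                                # steps 3-5: branch, distribute, recurse
        subset = [(attrs, label) for attrs, label in examples if attrs[best] == v]
        tree["branches"][v] = id3(subset, remaining, choose_best)
    return tree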

15  Choosing the "Best" Feature
- Use information gain to find the "best" (most discriminating) feature.
- Assume there are two classes, P and N (e.g., P = "yes" and N = "no").
- Let the set of instances S (the training data) contain p elements of class P and n elements of class N.
- The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined in terms of entropy, I(p,n):

  I(p,n) = -Pr(P)·log2(Pr(P)) - Pr(N)·log2(Pr(N))

- Note that Pr(P) = p / (p+n) and Pr(N) = n / (p+n).

16  Choosing the "Best" Feature
- More generally, if we have m classes, and s1, s2, ..., sm are the numbers of instances of S in each class, then the entropy is:

  I(s1, s2, ..., sm) = -sum over i of p_i·log2(p_i)

- where p_i = s_i / |S| is the probability that an arbitrary instance belongs to class i.

17  Choosing the "Best" Feature
- Now assume that, using attribute A, the set S of instances is partitioned into sets S1, S2, ..., Sv, one for each distinct value of A.
- If S_i contains p_i cases of P and n_i cases of N, the entropy E(A), i.e. the expected information needed to classify objects in all subtrees S_i, is:

  E(A) = sum over i of [(p_i + n_i) / (p + n)] · I(p_i, n_i)

  where (p_i + n_i) / (p + n) is the probability that an arbitrary instance in S belongs to the partition S_i.
- The encoding information that would be gained by branching on A is:

  Gain(A) = I(p, n) - E(A)

- At any point we want to branch on the attribute that provides the highest information gain.
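These formulas translate directly into code; a minimal sketch using base-2 logarithms and treating 0·log2(0) as 0:

from math import log2

def entropy(counts):
    """I(s1, ..., sm) for a list of class counts, e.g. entropy([9, 5])."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, partition_counts):
    """Gain(A) = I(parent) - E(A), where partition_counts lists the class
    counts [p_i, n_i, ...] of each subset S_i induced by attribute A."""
    total = sum(parent_counts)
    expected = sum(sum(part) / total * entropy(part) for part in partition_counts)
    return entropy(parent_counts) - expected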

18  Attribute Selection - Example
- The "golf" example: which attribute should we choose as the root? Outlook?
- Splitting on outlook partitions S: [9+,5-] into overcast [4+,0-], sunny [2+,3-], and rainy [3+,2-]:

  I(9,5) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) = 0.94
  I(4,0) = -(4/4)·log2(4/4) - (0/4)·log2(0/4) = 0   (taking 0·log2(0) = 0)
  I(2,3) = -(2/5)·log2(2/5) - (3/5)·log2(3/5) = 0.97
  I(3,2) = -(3/5)·log2(3/5) - (2/5)·log2(2/5) = 0.97

  Gain(outlook) = .94 - (4/14)·0 - (5/14)·.97 - (5/14)·.97 ≈ .25

19  Attribute Selection - Example (cont.)
- Splitting on humidity: S: [9+,5-] (I = 0.940) splits into high [3+,4-] (I = 0.985) and normal [6+,1-] (I = 0.592).

  Gain(humidity) = .940 - (7/14)·.985 - (7/14)·.592 = .151

- Splitting on wind: S: [9+,5-] (I = 0.940) splits into weak [6+,2-] (I = 0.811) and strong [3+,3-] (I = 1.00).

  Gain(wind) = .940 - (8/14)·.811 - (6/14)·1.0 = .048

- So classifying examples by humidity provides more information gain than by wind. Similarly, we find the information gain for "temp". In this case, however, you can verify that outlook has the largest information gain, so it will be selected as the root. (These numbers can be reproduced with the helper functions sketched above; see the snippet below.)
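Using the entropy and information_gain helpers sketched after slide 17, the figures in this example can be reproduced; the printed values agree with the slide up to rounding:

print(round(entropy([9, 5]), 2))                                      # root entropy I(9,5)
print(round(information_gain([9, 5], [[4, 0], [2, 3], [3, 2]]), 3))   # outlook
print(round(information_gain([9, 5], [[3, 4], [6, 1]]), 3))           # humidity
print(round(information_gain([9, 5], [[6, 2], [3, 3]]), 3))           # wind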

20  Attribute Selection - Example (cont.)
- Partially learned decision tree: Outlook has been placed at the root over S: [9+,5-] = {D1, D2, ..., D14}.
  - overcast -> {D3, D7, D12, D13}, [4+,0-]: leaf "yes"
  - sunny -> {D1, D2, D8, D9, D11}, [2+,3-]: subtree still to be determined (?)
  - rainy -> {D4, D5, D6, D10, D14}, [3+,2-]: subtree still to be determined (?)
- Which attribute should be tested at the sunny branch? With S_sunny = {D1, D2, D8, D9, D11}:

  Gain(S_sunny, humidity) = .970 - (3/5)·0.0 - (2/5)·0.0 = .970
  Gain(S_sunny, temp) = .970 - (2/5)·0.0 - (2/5)·1.0 - (1/5)·0.0 = .570
  Gain(S_sunny, wind) = .970 - (2/5)·1.0 - (3/5)·.918 = .019

21  Dealing With Continuous Variables
- Partition a continuous attribute into a discrete set of intervals:
  - sort the examples according to the continuous attribute A
  - identify adjacent examples that differ in their target classification
  - generate a set of candidate thresholds midway between them
  - problem: this may generate too many intervals
- Another solution:
  - take a minimum threshold M of examples of the majority class in each adjacent partition; then merge adjacent partitions with the same majority class
  - Example (M = 3): the candidate thresholds are 70.5 and 77.5; the partitions on either side of 70.5 have the same majority class, so they are merged
  - Final mapping: temperature <= 77.5 ==> "yes"; temperature > 77.5 ==> "no"
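A sketch of the first strategy: sort by the continuous attribute and propose a midpoint threshold wherever the class label changes between adjacent examples. The temperature values below are made up for illustration so that one midpoint lands at 70.5:

def candidate_thresholds(examples, attribute, label="class"):
    """Return midpoints between adjacent examples whose class labels differ."""
    ordered = sorted(examples, key=lambda e: e[attribute])
    thresholds = []
    for prev, cur in zip(ordered, ordered[1:]):
        if prev[label] != cur[label] and prev[attribute] != cur[attribute]:
            thresholds.append((prev[attribute] + cur[attribute]) / 2)
    return thresholds

temps = [{"temperature": 64, "class": "yes"}, {"temperature": 69, "class": "yes"},
         {"temperature": 72, "class": "no"},  {"temperature": 81, "class": "no"}]
print(candidate_thresholds(temps, "temperature"))   # -> [70.5]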

22  Over-fitting in Classification
- A generated tree may over-fit the training examples due to noise or too small a set of training data.
- Two approaches to avoid over-fitting:
  - Stop earlier: stop growing the tree before it perfectly fits the training data.
  - Post-prune: allow over-fitting, then post-prune the tree.
- Approaches to determine the correct final tree size:
  - Separate training and test sets, or use cross-validation.
  - Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node is likely to improve performance over the entire distribution.
  - Use the Minimum Description Length (MDL) principle: halt growth of the tree when the encoding is minimized.
  - Rule post-pruning (C4.5): convert the tree to rules before pruning.

23  Pruning the Decision Tree
- A decision tree constructed from the training data may need to be pruned: over-fitting can produce branches or leaves based on too few examples.
- Pruning is the process of removing branches and subtrees that were generated due to noise; this improves classification accuracy.
- Subtree replacement: merge a subtree into a leaf node, using a data set different from the training data.
  - At a tree node, if the accuracy without splitting is higher than the accuracy with splitting, replace the subtree with a leaf node and label it with the majority class.
- Example (a subtree that splits on color: red -> "yes", blue -> "no"): suppose on the test set we find 3 red "no" examples and 2 blue "yes" examples. We can replace the subtree with a single "no" leaf; after replacement there will be only 2 errors instead of 5.

24  Bayesian Methods
- Bayes's theorem plays a critical role in probabilistic learning and classification.
- It uses the prior probability of each category given no information about an item; categorization produces a posterior probability distribution over the possible categories given a description of the item.
- The models are incremental in the sense that each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.
- Given a data sample X with an unknown class label, let H be the hypothesis that X belongs to a specific class C. The conditional probability of hypothesis H given observation X follows Bayes's theorem:

  Pr(H | X) = Pr(X | H) · Pr(H) / Pr(X)

- Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost.

25  Axioms of Probability Theory
- All probabilities are between 0 and 1: 0 <= P(A) <= 1.
- A true proposition has probability 1, a false proposition has probability 0: P(true) = 1, P(false) = 0.
- The probability of a disjunction is:

  P(A or B) = P(A) + P(B) - P(A and B)

26  Conditional Probability
- P(A | B) is the probability of A given B.
- It assumes that B is all and only the information known.
- Defined by:

  P(A | B) = P(A and B) / P(B)

27  Independence
- A and B are independent iff: P(A | B) = P(A) and P(B | A) = P(B) (these two constraints are logically equivalent).
- Therefore, if A and B are independent:

  P(A and B) = P(A) · P(B)

- Bayes's Rule:

  P(H | E) = P(E | H) · P(H) / P(E)

28  Bayesian Categorization
- Let the set of categories be {c1, c2, ..., cn} and let E be a description of an instance.
- Determine the category of E by computing, for each c_i:

  P(c_i | E) = P(c_i) · P(E | c_i) / P(E)

- P(E) can be determined since the categories are complete and disjoint:

  P(E) = sum over i of P(c_i) · P(E | c_i)

29  Bayesian Categorization (cont.)
- Need to know:
  - Priors: P(c_i), and conditionals: P(E | c_i).
- P(c_i) is easily estimated from data: if n_i of the examples in D are in category c_i, then P(c_i) = n_i / |D|.
- Assume an instance is a conjunction of binary features/attributes:

  E = e_1 and e_2 and ... and e_m

30  Naïve Bayesian Categorization
- Problem: there are too many possible instances (exponential in m) to estimate all P(E | c_i).
- If we assume the features/attributes of an instance are independent given the category c_i (conditionally independent):

  P(E | c_i) = P(e_1 and e_2 and ... and e_m | c_i) = product over j of P(e_j | c_i)

- Therefore, we only need to know P(e_j | c_i) for each feature and category.
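A sketch of the resulting decision rule: score each category by its prior times the product of the per-feature conditionals and return the highest-scoring one. The probability tables are assumed to have been estimated from training data beforehand:

from math import prod

def naive_bayes_classify(features, priors, conditionals):
    """Pick the category c_i maximizing P(c_i) * product of P(e_j | c_i).
    priors: {category: P(c_i)}
    conditionals: {category: {feature: P(e_j | c_i)}}"""
    def score(c):
        return priors[c] * prod(conditionals[c][f] for f in features)
    return max(priors, key=score)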

31  Estimating Probabilities
- Normally, probabilities are estimated based on observed frequencies in the training data.
- If D contains n_i examples in category c_i, and n_ij of these n_i examples contain feature/attribute e_j, then:

  P(e_j | c_i) = n_ij / n_i

- However, estimating such probabilities from small training sets is error-prone.
- If, due only to chance, a rare feature e_k is always false in the training data, then for every category c_i: P(e_k | c_i) = 0.
- If e_k then occurs in a test example E, the result is that for every c_i: P(E | c_i) = 0, and hence for every c_i: P(c_i | E) = 0.

32  Smoothing
- To account for estimation from small samples, probability estimates are adjusted, or smoothed.
- Laplace smoothing using an m-estimate assumes that each feature is given a prior probability p that is assumed to have been previously observed in a "virtual" sample of size m:

  P(e_j | c_i) = (n_ij + m·p) / (n_i + m)

- For binary features, p is simply assumed to be 0.5.
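A sketch of the m-estimate just described; the default values p = 0.5 and m = 1 here are assumptions for illustration, not values given on the slide:

def m_estimate(n_ij, n_i, p=0.5, m=1.0):
    """Smoothed estimate of P(e_j | c_i) = (n_ij + m*p) / (n_i + m).
    n_ij: examples of class c_i containing feature e_j; n_i: examples of class c_i.
    p: prior probability of the feature; m: virtual sample size (both assumed defaults)."""
    return (n_ij + m * p) / (n_i + m)

print(m_estimate(0, 9))   # a feature never observed with the class no longer gets probability 0
print(m_estimate(2, 9))   # compare with the unsmoothed estimate 2/9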

33  Naïve Bayesian Classifier - Example
- Here we have two classes, C1 = "yes" (positive) and C2 = "no" (negative).
- Pr("yes") = number of instances labeled "yes" / all instances = 9/14.
- If a new instance X has outlook = "sunny", then Pr(outlook="sunny" | "yes") = 2/9 (since there are 9 instances labeled "yes" (or P), of which 2 have outlook = "sunny").
- Similarly, for humidity = "high", Pr(humidity="high" | "no") = 4/5.
- And so on.

34  Naïve Bayes (Example Continued)
- Given the training set, we can compute all of these probabilities.
- Suppose we have a new instance X, i.e. an assignment of values to outlook, temperature, humidity, and windy. How should it be classified?
- Look up the conditional probability of each of X's attribute values under each class and multiply:

  Pr(X | "no") = 3/5 · 2/5 · 4/5 · 3/5
  Pr(X | "yes") = 2/9 · 4/9 · 3/9 · 3/9

35  Naïve Bayes (Example Continued)
- To find the class of X we need to maximize Pr(X | C_i) · Pr(C_i) for each class C_i (here "yes" and "no"):

  Pr(X | "no") · Pr("no") = (3/5 · 2/5 · 4/5 · 3/5) · 5/14 ≈ 0.04
  Pr(X | "yes") · Pr("yes") = (2/9 · 4/9 · 3/9 · 3/9) · 9/14 ≈ 0.007

- To convert these to probabilities, we normalize by dividing each by their sum:
  - Pr("no" | X) = 0.04 / (0.04 + 0.007) = 0.85
  - Pr("yes" | X) = 0.007 / (0.04 + 0.007) = 0.15
- Therefore the new instance X will be classified as "no".
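The arithmetic on this slide can be checked in a few lines, using the conditional probabilities exactly as given above (the attribute values of X themselves are not needed for the calculation):

from math import prod

# P(X | "no") * P("no") and P(X | "yes") * P("yes"), using the slide's fractions.
score_no  = prod([3/5, 2/5, 4/5, 3/5]) * (5/14)
score_yes = prod([2/9, 4/9, 3/9, 3/9]) * (9/14)

print(round(score_no, 3), round(score_yes, 3))          # ~0.041 and ~0.007
print(round(score_no / (score_no + score_yes), 2))      # P("no" | X)  ~0.85
print(round(score_yes / (score_no + score_yes), 2))     # P("yes" | X) ~0.15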

36  Association-Based Classification
- Recall "quantitative" association rules.
- If the right-hand sides of the rules are restricted to the class attribute to be predicted, the rules can be used directly for classification.
- The approach mines high-support, high-confidence rules of the form "cond_set => Y", where Y is a class label.
- It has been shown to work better than decision tree models in some cases.

37  Measuring Effectiveness of Classification Models
- When the output field is ordinal or nominal (e.g., in two-class prediction), we use the classification table, the so-called confusion matrix, to evaluate the resulting model.
- Example (rows are the actual class, columns the predicted class):

              Predicted T   Predicted F
  Actual T        18             2
  Actual F         3            15

- Overall correct classification rate = (18 + 15) / 38 = 87%
- Given T, correct classification rate = 18 / 20 = 90%
- Given F, correct classification rate = 15 / 18 = 83%
(A small helper for computing these rates follows.)
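A sketch of deriving the three rates quoted above from a two-class confusion matrix, using the counts implied by the slide's percentages:

def rates(matrix):
    """matrix[i][j] = count of actual class i predicted as class j."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    overall = correct / total
    per_class = [row[i] / sum(row) for i, row in enumerate(matrix)]
    return overall, per_class

# Actual T: 18 predicted T, 2 predicted F; actual F: 3 predicted T, 15 predicted F.
print(rates([[18, 2], [3, 15]]))   # overall ~0.87, per-class ~0.90 and ~0.83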

38  Measuring Effectiveness: Lift
- Lift is usually used for classification models, but can be adapted to other methods.
- It measures the change in the conditional probability of a target class when going from the general population (the full test set) to a biased sample.
- Example:
  - Suppose the expected response rate to a direct-mailing campaign is 5% in the training set.
  - Use the classifier to assign a "yes" or "no" value to the target class "predicted to respond".
  - The "yes" group will contain a higher proportion of actual responders than the test set as a whole.
  - Suppose the "yes" group (our biased sample) contains 50% actual responders; this gives a lift of 0.5 / 0.05 = 10.
- What if the lift sample is too small? We need to increase the sample size; there is a trade-off between lift and sample size.
- (The figure on this slide plots number of respondents against sample size for mass mailing vs. targeted mailing, illustrating the lift.)

39  What Is Prediction?
- Prediction is similar to classification: first construct a model, then use the model to predict unknown values.
- Prediction differs from classification:
  - Classification refers to predicting a categorical class label (e.g., "yes", "no").
  - Prediction models are used to predict values of a numeric target attribute; they can be thought of as continuous-valued functions.
- The major methods for prediction are regression-based:
  - Linear and multiple regression
  - Non-linear regression
  - K-nearest-neighbor
- Most common application domains: recommender systems, credit scoring, customer lifetime value.

40  Prediction: Regression Analysis
- The most common approaches to prediction are linear or multiple regression.
- Linear regression: Y = α + βX
  - The model is the line that best reflects the data distribution; the line allows prediction of the Y attribute value based on the single attribute X.
  - Two parameters, α and β, specify the line and are estimated from the data at hand.
  - Common approach: apply the least-squares criterion to the known values Y1, Y2, ... and X1, X2, ....
  - Regression applet: http://www.math.csusb.edu/faculty/stanton/probstat/regression.html
- Multiple regression: Y = b0 + b1·X1 + b2·X2
  - Necessary when the prediction must be made from multiple attributes, e.g., predict customer LTV based on age, income, spending, items purchased, etc.
  - Many nonlinear functions can be transformed into this form.

41  Measuring Effectiveness of Prediction
- Predictive models are evaluated based on the accuracy of their predictions on unseen data.
  - Accuracy is measured in terms of error rate (for classification, usually the percentage of records classified incorrectly).
  - The error rate on a pre-classified evaluation set estimates the real error rate.
- Prediction effectiveness: the difference between predicted scores and the actual results (from the evaluation set).
  - Typically the accuracy of the model is measured in terms of the average of the squared differences between predicted and actual values.
  - E.g., Root Mean Squared Error (RMSE): the square root of the mean squared difference between predicted and actual values (e.g., ratings).

42  Example: Recommender Systems
- Basic formulation as a prediction problem: given a profile P_u for a user u and a target item i_t, predict the interest score of user u on item i_t.
- Typically, the profile P_u contains interest scores by u on some other items {i_1, ..., i_k} different from i_t.
- Interest scores on i_1, ..., i_k may have been obtained explicitly (e.g., movie ratings) or implicitly (e.g., time spent on a product page or news article).

43  Example: Recommender Systems
- Content-based recommenders: predictions for unseen (target) items are computed based on their similarity (in terms of content) to items in the user profile.
- E.g., given the items in the user profile P_u, the system will recommend highly the unseen items most similar in content to them, and recommend "mildly" the less similar ones.

44  Content-Based Recommender Systems

45  Example: Recommender Systems
- Collaborative filtering recommenders: predictions for unseen (target) items are computed based on other users with similar interest scores on the items in user u's profile, i.e., users with similar tastes (aka "nearest neighbors").
- This requires computing correlations between user u and other users according to interest scores or ratings.
- Can we predict Karen's rating on the unseen item Independence Day?

46  Example: Recommender Systems
- Collaborative filtering recommenders: predictions for unseen (target) items are computed based on other users with similar interest scores on the items in user u's profile, i.e., users with similar tastes (aka "nearest neighbors").
- This requires computing correlations between user u and other users according to interest scores or ratings.
- (The table on this slide shows each user's correlation to Karen and the resulting prediction for Karen on Independence Day, based on the K nearest neighbors; a sketch of this computation follows.)
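A sketch of this neighborhood-based prediction: compute Pearson correlations between the target user and users who have rated the target item, take the K most correlated neighbors, and form a correlation-weighted average of their ratings. Karen appears because the slide's example asks about her; the other user names, the ratings, and the item indices are made-up assumptions:

import numpy as np

def pearson(u, v):
    """Correlation over items both users have rated (NaN marks a missing rating)."""
    mask = ~np.isnan(u) & ~np.isnan(v)
    if mask.sum() < 2:
        return 0.0
    return float(np.corrcoef(u[mask], v[mask])[0, 1])

def predict(ratings, target_user, target_item, k=2):
    """ratings: dict user -> np.array of item ratings; predict target_user's rating on target_item."""
    sims = [(pearson(ratings[target_user], r), r[target_item])
            for u, r in ratings.items()
            if u != target_user and not np.isnan(r[target_item])]
    top = sorted(sims, reverse=True)[:k]            # k nearest neighbors by correlation
    num = sum(s * rating for s, rating in top)
    den = sum(abs(s) for s, _ in top)
    return num / den if den else np.nan

nan = np.nan
ratings = {                     # illustrative 4-item rating matrix; item 3 is the unseen target
    "Karen": np.array([5.0, 4.0, 1.0, nan]),
    "Lee":   np.array([5.0, 5.0, 1.0, 4.0]),
    "Maya":  np.array([1.0, 2.0, 5.0, 1.0]),
    "Noor":  np.array([4.0, 4.0, 2.0, 5.0]),
}
print(predict(ratings, "Karen", 3))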

47  Possible Interesting Project Ideas
- Build a content-based recommender for:
  - Movies (as in the previous example)
  - News stories (requires basic text processing and indexing of documents)
  - Music (based on features such as genre, artist, etc.)
- Build a collaborative recommender for:
  - Movies (using movie ratings), e.g., movielens.org
  - Music, e.g., pandora.com
    - Recommend songs or albums based on collaborative ratings
    - Or recommend whole playlists based on playlists from other users (this might be a good candidate application for association rule mining; why?)

48  Other Forms of Collaborative and Social Filtering
- Social tagging (folksonomy):
  - People add free-text tags to their content.
  - Where people happen to use the same terms, their content is linked.
  - Frequently used terms float to the top, creating a kind of positive feedback loop for popular tags.
- Examples: Del.icio.us, Flickr

49  Social Tagging
- Deviates from standard mental models: instead of browsing topical, categorized navigation or searching for an explicit term or phrase, I use the language I use to define my world (tagging).
- Sharing my language and contexts creates community: tagging creates community through the overlap of perspectives.
- This leads to the creation of social networks, which may further develop and evolve.
- But does this lead to dynamic evolution of complex concepts or knowledge? Collective intelligence?

50  Clustering and Collaborative Filtering: clustering based on ratings (MovieLens)

51  Clustering and Collaborative Filtering: tag clustering example

52  Classification Example - Bank Data
- We want to determine likely responders to a direct-mail campaign for a new product, a "Personal Equity Plan" (PEP).
- The training data include records kept about how previous customers responded to and bought the product.
- In this case the target class is "pep", with a binary value.
- We want to build a model and apply it to new data (a customer list) in which the value of the class attribute is not available.

53  Data Preparation
- Several steps to prepare the data for Weka and for See5:
  - Open the training data in Excel, remove the "id" column, and save the result as a comma-delimited file (e.g., "bank.csv").
  - Do the same with the new customer data, but also add a new column called "pep" as the last column; the value of this column for each record should be "?".
- Weka:
  - Must convert the data to ARFF format.
  - The attribute specification and the data are in the same file; the data portion is just the comma-delimited data file without the label row.
- See5/C5:
  - Create a "name" file and a "data" file.
  - The "name" file contains the attribute specification; the "data" file is the same as above.
  - The first line of the "name" file must be the name(s) of the target class(es), in this case "pep".

54  Data File Format for Weka

Training Data:

@relation 'train-bank-data'
@attribute 'age' real
@attribute 'sex' {'MALE','FEMALE'}
@attribute 'region' {'INNER_CITY','RURAL','TOWN','SUBURBAN'}
@attribute 'income' real
@attribute 'married' {'YES','NO'}
@attribute 'children' real
@attribute 'car' {'YES','NO'}
@attribute 'save_act' {'YES','NO'}
@attribute 'current_act' {'YES','NO'}
@attribute 'mortgage' {'YES','NO'}
@attribute 'pep' {'YES','NO'}
@data
48,FEMALE,INNER_CITY,17546,NO,1,NO,NO,NO,NO,YES
40,MALE,TOWN,30085.1,YES,3,YES,NO,YES,YES,NO
...

New Cases:

@relation 'new-bank-data'
@attribute 'age' real
@attribute 'region' {'INNER_CITY','RURAL','TOWN','SUBURBAN'}
...
@attribute 'pep' {'YES','NO'}
@data
23,MALE,INNER_CITY,18766.9,YES,0,YES,YES,NO,YES,?
30,MALE,RURAL,9915.67,NO,1,NO,YES,NO,YES,?

55  C4.5 Implementation in Weka
To build a model (decision tree), use the weka.classifiers.trees.J48 class.

Decision Tree Output (pruned):

children <= 2
| children <= 0
| | married = YES
| | | mortgage = YES
| | | | save_act = YES: NO (16.0/2.0)
| | | | save_act = NO: YES (9.0/1.0)
| | | mortgage = NO: NO (59.0/6.0)
| | married = NO
| | | mortgage = YES
| | | | save_act = YES: NO (12.0)
| | | | save_act = NO: YES (3.0)
| | | mortgage = NO: YES (29.0/2.0)
| children > 0
| | income <= 29622
| | | children <= 1
| | | | income <= 12640.3: NO (5.0)
| | | | income > 12640.3
| | | | | current_act = YES: YES (28.0/1.0)
| | | | | current_act = NO
| | | | | | income <= 17390.1: NO (3.0)
| | | | | | income > 17390.1: YES (6.0)
| | | children > 1: NO (47.0/3.0)
| | income > 29622: YES (48.0/2.0)
children > 2
| income <= 43228.2: NO (30.0/2.0)
| income > 43228.2: YES (5.0)

56  C4.5 Implementation in Weka
The model can be saved and later applied to the test data (or to new, unclassified instances). The rest of the output contains statistical information about the model, including the confusion matrix, error rates, etc.

=== Error on training data ===
Correctly Classified Instances     281    93.6667 %
Incorrectly Classified Instances    19     6.3333 %
Mean absolute error               0.1163
Root mean squared error           0.2412
Relative absolute error          23.496  %
Root relative squared error      48.4742 %
Total Number of Instances          300

=== Confusion Matrix ===
   a   b   <-- classified as
 122  13 |  a = YES
   6 159 |  b = NO

=== Stratified cross-validation ===
Correctly Classified Instances     274    91.3333 %
Incorrectly Classified Instances    26     8.6667 %
Mean absolute error               0.1434
Root mean squared error           0.291
Relative absolute error          28.9615 %
Root relative squared error      58.4922 %
Total Number of Instances          300

=== Confusion Matrix ===
   a   b   <-- classified as
 118  17 |  a = YES
   9 156 |  b = NO

