1 Decision Tree Approach in Data Mining (7/4/2008)
What is data mining? The process of extracting previously unknown and potentially useful information from large databases. Several data mining approaches are in use nowadays:
- Association Rules
- Decision Tree
- Neural Network algorithms

2 Decision Tree Induction
A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class or class distribution.

3 Data Mining Approach - Decision Tree
- a model that is both predictive and descriptive
- can help identify which factors to consider and how each factor is associated with a business decision
- most commonly used for classification (predicting which group a case belongs to)
- several decision tree induction algorithms exist, e.g. C4.5, CART, CAL5, ID3

4 Algorithm for building Decision Trees
Decision trees are a popular structure for supervised learning. They are constructed using the attributes best able to differentiate the concepts to be learned. A decision tree is built by initially selecting a subset of instances from the training set. This subset is then used by the algorithm to construct a decision tree. The remaining training set instances test the accuracy of the constructed tree.

5
If the decision tree classifies the instances correctly, the procedure terminates. If an instance is incorrectly classified, it is added to the selected subset of training instances and a new tree is constructed. This process continues until a tree that correctly classifies all non-selected instances is created, or until the decision tree has been built from the entire training set.
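The windowing loop described on the last two slides can be sketched as below. `build_tree` and `classify` are hypothetical stand-ins for a real induction algorithm such as C4.5; instances are assumed to be dicts carrying a `"class"` key.

```python
def windowing(training_set, build_tree, classify):
    """Build a tree from a growing window of training instances.

    build_tree(window) -> tree; classify(tree, instance) -> predicted class.
    Both are stand-ins for a real tree-induction routine.
    """
    window = training_set[: max(1, len(training_set) // 3)]  # initial subset
    while True:
        tree = build_tree(window)
        # Instances outside the window that the current tree misclassifies.
        misclassified = [x for x in training_set
                         if x not in window and classify(tree, x) != x["class"]]
        if not misclassified:
            return tree  # all non-selected instances classified correctly
        window = window + misclassified  # add the errors and rebuild
        if len(window) >= len(training_set):
            return build_tree(window)  # fall back to the entire training set
```

The loop mirrors the slide: terminate on a fully consistent tree, otherwise grow the window with the misclassified instances and rebuild.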

6 Entropy
(a) For an event with probability p (0 <= p <= 1), the information content is log(1/p).
(b) The expected contribution of the event occurring is p log(1/p).
(c) The expected value over both outcomes (occurs + does not occur) is p log(1/p) + (1-p) log(1/(1-p)).
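The three quantities above can be computed directly; a minimal sketch using base-2 logarithms:

```python
import math

def surprisal(p):
    """(a) Information content of an event with probability p: log2(1/p)."""
    return math.log2(1.0 / p)

def weighted_surprisal(p):
    """(b) Contribution of the event occurring: p * log2(1/p)."""
    return p * math.log2(1.0 / p)

def binary_entropy(p):
    """(c) Expected information over both outcomes:
    p*log2(1/p) + (1-p)*log2(1/(1-p)). Maximised (1 bit) at p = 0.5."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no information
    return p * math.log2(1.0 / p) + (1 - p) * math.log2(1.0 / (1 - p))
```

Note that the entropy curve (c) peaks at p = 0.5: a 50/50 event is the most informative to resolve.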

7 Training Process
|-- Data Preparation Stage --|-- Tree Building Stage --|-- Prediction Stage --|

8 Basic algorithm for inducing a decision tree
Algorithm: Generate_decision_tree. Generate a decision tree from the given training data.
Input: the training samples, represented by discrete-valued attributes, and the set of candidate attributes, attribute-list.
Output: a decision tree.

9
Begin
  Partition(S):
    if all records in S are of the same class, or only 1 record is found in S, then return;
    for each attribute Ai do
      evaluate splits on attribute Ai;
    use the best split found to partition S into S1 and S2;
    grow the tree with Partition(S1) and Partition(S2);
    repeat the partitioning for S1 and S2 until the tree's stop-growing criteria are met;
End
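A minimal Python rendering of the Partition procedure above; `evaluate_split` is a hypothetical stand-in for the attribute-evaluation step (e.g. information gain), and records are assumed to be dicts with a `"class"` key.

```python
def partition(S, attributes, evaluate_split, min_size=1):
    """Recursively split record set S into a binary decision tree.

    evaluate_split(S, attr) -> (score, S1, S2); a higher score is a better
    split. Returns a nested dict, or a class label at a leaf.
    """
    classes = {rec["class"] for rec in S}
    if len(classes) == 1 or len(S) <= min_size:
        return S[0]["class"]  # stop: pure node or a single record
    # Evaluate a split on every attribute Ai and keep the best one found.
    best = max((evaluate_split(S, a) + (a,) for a in attributes),
               key=lambda t: t[0])
    score, S1, S2, attr = best
    if not S1 or not S2:  # no useful split: return the majority class
        return max(classes, key=lambda c: sum(r["class"] == c for r in S))
    return {"attr": attr,
            "left": partition(S1, attributes, evaluate_split, min_size),
            "right": partition(S2, attributes, evaluate_split, min_size)}
```

The recursion terminates exactly on the slide's condition: a pure partition or a single remaining record.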

10 Information Gain
The difference between the information needed for correct classification before and after the split. For example, before the split there are 4 possible outcomes, represented in 2 bits of information. After a split on attribute A, each of the two resulting tree branches represents 2 outcomes, which need only 1 bit. Thus, choosing attribute A results in an information gain of one bit.

11 Classification Rule Generation
Generate Rules
– rewrite the tree as a collection of rules, one for each tree leaf
– e.g. Rule 1: IF 'outlook = rain' AND 'windy = false' THEN 'play'
Simplifying Rules
– delete any irrelevant rule condition without affecting accuracy
– e.g. Rule R: IF r1 AND r2 AND r3 THEN class1
– condition: if the error rate of R without r1 is no worse, delete condition r1
– resultant rule: IF r2 AND r3 THEN class1
Ranking Rules
– order the rules according to their error rate

12 Decision Tree Rules
Because rules are often more appealing than trees, variations of the basic tree-to-rule mapping must be presented. Most variations focus on simplifying and/or eliminating existing rules.

13 Example of simplifying rules (credit-card data)

14
A rule created by following one path of the tree is:
Case 1: IF Age <= 43 AND Sex = Male AND Credit Card Insurance = No THEN Life Insurance Promotion = No
  The conditions of this rule cover 4 of the 15 instances; 3 of the 4 are classified correctly, giving 75% accuracy.
Case 2: IF Sex = Male AND Credit Card Insurance = No THEN Life Insurance Promotion = No
  The conditions of this rule cover 6 instances, of which 5 are classified correctly, giving 83.3% accuracy.
Therefore, the simplified rule is more general and more accurate than the original rule.

15 C4.5 Tree Induction Algorithm
Involves two phases of decision tree construction:
– growing tree phase
– pruning tree phase
Growing Tree Phase
– a top-down approach that repeatedly builds up the tree; a specialization process
Pruning Tree Phase
– a bottom-up approach that removes subtrees by replacing them with leaves; a generalization process

16 Expected information before splitting
Let S be a set consisting of s data samples. Suppose the class label attribute has m distinct values defining m distinct classes C_i (i = 1, ..., m), and let s_i be the number of samples of S in class C_i. The expected information needed to classify a given sample is:

  Info(S) = - Σ_{i=1..m} (s_i / s) log2(s_i / s)

Note that a log function to base 2 is used, since the information is encoded in bits.

17 Expected information after splitting
Let attribute A have v distinct values {a_1, a_2, ..., a_v}, used to split S into v subsets {S_1, ..., S_v}, where S_j contains those samples of S that have value a_j of A. After splitting, these subsets correspond to the branches grown from the node for S. Writing s_ij for the number of samples of class C_i in subset S_j:

  Info_A(S) = Σ_{j=1..v} ((s_1j + ... + s_mj) / s) · Info(S_j)

  Gain(A) = Info(S) - Info_A(S)
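Under the definitions on the last two slides, Info(S), Info_A(S), and Gain(A) can be sketched directly; samples are assumed to be dicts whose class label sits under the key `"class"`.

```python
import math
from collections import Counter

def info(labels):
    """Info(S) = -sum_i (s_i/s) * log2(s_i/s) over the class labels."""
    s = len(labels)
    return -sum((n / s) * math.log2(n / s) for n in Counter(labels).values())

def info_after_split(samples, attr, target="class"):
    """Info_A(S): entropy of the subsets induced by attribute attr,
    weighted by subset size."""
    s = len(samples)
    subsets = {}
    for rec in samples:
        subsets.setdefault(rec[attr], []).append(rec[target])
    return sum((len(sub) / s) * info(sub) for sub in subsets.values())

def gain(samples, attr, target="class"):
    """Gain(A) = Info(S) - Info_A(S)."""
    return info([rec[target] for rec in samples]) - info_after_split(samples, attr, target)
```

For an attribute that separates the classes perfectly, Gain(A) equals the full entropy Info(S).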

18 C4.5 Algorithm - Growing Tree Phase
Let S = any set of training cases
Let |S| = the number of cases in set S
Let Freq(C_i, S) = the number of cases in S that belong to class C_i
Info(S) = the average amount of information needed to identify the class of a case in S
Info_X(S) = the expected information needed to identify the class of a case in S after partitioning S with a test on attribute X
Gain(X) = the information gained by partitioning S according to the test on attribute X

19 C4.5 Algorithm - Growing Tree Phase
Select the decisive attribute for tree splitting (information gain):

  Info(S) = - Σ_{i=1..m} (s_i / s) log2(s_i / s)

  Info_X(S) = Σ_{j=1..v} ((s_1j + ... + s_mj) / s) · Info(S_j)

  Gain(X) = Info(S) - Info_X(S)

20 C4.5 Algorithm - Growing Tree Phase
Let S be the training set (14 cases: 9 play, 5 don't play).

  Info(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

(using log2(x) = log(x) / log(2))

  Info_Outlook(S) = (5/14) (-(2/5) log2(2/5) - (3/5) log2(3/5))
                  + (4/14) (-(4/4) log2(4/4) - (0/4) log2(0/4))
                  + (5/14) (-(3/5) log2(3/5) - (2/5) log2(2/5))
                  = 0.694

  Gain(Outlook) = 0.940 - 0.694 = 0.246

Similarly, the computed information Gain(Windy) = Info(S) - Info_Windy(S) = 0.940 - 0.892 = 0.048. Thus, the decision tree splits on attribute Outlook, which has the higher information gain.

  Root
   |
  Outlook
  /  |  \
 Sunny Overcast Rain
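The figures above can be checked numerically. The class counts per branch below are those of the standard 14-case weather data (Sunny: 2 play / 3 not; Overcast: 4 / 0; Rain: 3 / 2; Windy = true: 3 / 3; Windy = false: 6 / 2), assumed here to match the slide's dataset:

```python
import math

def entropy(counts):
    """Entropy in bits of a class-count distribution, skipping empty classes."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

# 14 training cases: 9 play, 5 don't play
info_S = entropy([9, 5])                               # about 0.940 bits

# Outlook partitions: Sunny (2/3), Overcast (4/0), Rain (3/2)
info_outlook = (5 / 14) * entropy([2, 3]) + (4 / 14) * entropy([4, 0]) \
             + (5 / 14) * entropy([3, 2])              # about 0.694 bits
gain_outlook = info_S - info_outlook                   # about 0.246 bits

# Windy partitions: true (3/3), false (6/2)
info_windy = (6 / 14) * entropy([3, 3]) + (8 / 14) * entropy([6, 2])
gain_windy = info_S - info_windy                       # about 0.048 bits
```

Since gain_outlook > gain_windy, the first split is on Outlook, as the slide concludes.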

21 After first splitting

22 Decision Tree after grow tree phase

                 Root
                  |
               Outlook
            /     |     \
       Sunny  Overcast   Rain
         |       |         |
       Windy    Play     Windy
       /   \   (100%)    /   \
   windy  not windy  windy  not windy
     |        |        |        |
   Play   not play   Play   not play
   (40%)   (60%)


24 Continuous-valued data
The input sample data may contain an attribute that is continuous-valued rather than discrete-valued; for example, people's ages. For such a scenario, we must determine the "best" split-point for the attribute. A simple example is to take an average of the continuous values.
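A common refinement of the averaging idea above is to try the midpoints between consecutive sorted values and keep the one with the highest information gain; a sketch, assuming each value carries a class label:

```python
import math

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def best_split_point(values, labels):
    """Return (threshold, gain): the midpoint between consecutive sorted
    values that maximises information gain for the test value <= threshold."""
    pairs = sorted(zip(values, labels))
    base = entropy([l for _, l in pairs])
    best = (None, -1.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        g = base - (len(left) / len(pairs)) * entropy(left) \
                 - (len(right) / len(pairs)) * entropy(right)
        if g > best[1]:
            best = (t, g)
    return best
```

For ages [18, 20, 25, 30] with classes no/no/yes/yes, the best threshold is 22.5, which separates the classes perfectly.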

25 C4.5 Algorithm - Pruning Tree Phase (Error-Based Pruning Algorithm)
U_25%(E, N) = the predicted error rate
  = (the number of misclassified test cases / the total number of test cases) × 100%
where E is the number of error cases in the class and N is the number of cases in the class.

26 Case study of predicting student enrolment by decision tree
Enrolment relational schema:

  Attribute        Data type
  ID               Number
  Class            Varchar
  Sex              Varchar
  Fin_Support      Varchar
  Emp_Code         Varchar
  Job_Code         Varchar
  Income           Varchar
  Qualification    Varchar
  Marital_Status   Varchar

27 Student Enrolment Analysis
– deduce the influencing factors associated with student course enrolment
– three selected courses' enrolment data are sampled: Computer Science, English Studies and Real Estate Management
– 100 training records and 274 testing records
– prediction result
– generate classification rules (decision tree → classification rules)
– student enrolment: 41 Computer Science, 46 English Studies and 13 Real Estate Management

28 Growing Tree Phase
The C4.5 tree induction algorithm computes the gain ratio of all candidate data attributes.
Note: Emp_Code shows the highest information gain, and thus gets top priority in the decision tree.

29 Growing Tree Phase - Decision Tree

30 Growing Tree Phase - classification rules
Root
- Emp_Code = Manufacturing (English Studies = 67%)
  - Quali = Form 4 / Form 5 (English Studies = 100%)
  - Quali = Form 6 or equiv. (English Studies = 100%)
  - Quali = First degree (Computer Science = 100%)
  - Quali = Master degree (Computer Science = 100%)
- Emp_Code = Social Work (Computer Science = 100%)
- Emp_Code = Tourism, Hotel (English Studies = 67%)
- Emp_Code = Trading (English Studies = 75%)
- Emp_Code = Property (Real Estate = 100%)
- Emp_Code = Construction (Real Estate = 56%)
- Emp_Code = Education (Computer Science = 73%)
- Emp_Code = Engineering (Real Estate = 60%)
- Emp_Code = Fin/Accounting (Computer Science = 54%)
- Emp_Code = Government (Computer Science = 50%)
- Emp_Code = Info. Tech. (Computer Science = 50%)
- Emp_Code = Others (English Studies = 82%)

31 Pruned Decision Tree
Given: error rate of the pruned subtree Emp_Code = "Manufacturing" = 3.34
Non-pruned subtree:

  Condition                          Error rate
  Emp_Code = "Manufacturing"         0.75
  Quali = Form 4 and Quali = Form 5
  Quali = First Degree               0.75
  Total                              3.36

Note: prune the tree, since the pruning error rate 3.34 < the no-pruning error rate 3.36.

32 Prune Tree Phase - Decision Tree

33 Prune Tree Phase - classification rules
1. IF Emp_Code = "Government" AND Income = "$250,000 - $299,999" → Real Estate Mgt
2. IF Emp_Code = "Tourism, Hotel" → English Studies
3. IF Emp_Code = "Education" → Computer Science
4. IF Emp_Code = "Others" → English Studies
5. IF Emp_Code = "Government" AND Income = "$150,000 - $199,999" → English Studies
6. IF Emp_Code = "Construction" AND Job_Code = "Professional, Technical" → Real Estate Mgt
7. IF Emp_Code = "Manufacturing" → English Studies
8. IF Emp_Code = "Trading" AND Sex = "Female" → English Studies
9. IF Emp_Code = "Construction" AND Job_Code = "Executive" → Real Estate Mgt
10. IF Emp_Code = "Engineering" AND Job_Code = "Sales" → Computer Science
11. IF Emp_Code = "Engineering" AND Job_Code = "Professional, Technical" → Real Estate Mgt
12. IF Emp_Code = "Government" AND Income = "$800,000 - $999,999" → Real Estate Mgt
13. IF Emp_Code = "Info. Technology" AND Sex = "Female" → English Studies
14. IF Emp_Code = "Info. Technology" AND Sex = "Male" → Computer Science
15. IF Emp_Code = "Social Work" → Computer Science
16. IF Emp_Code = "Fin/Accounting" → Computer Science
17. IF Emp_Code = "Trading" AND Sex = "Male" → Computer Science
18. IF Emp_Code = "Construction" AND Job_Code = "Clerical" → English Studies

34 Simplify classification rules by deleting unnecessary conditions
A condition is unnecessary when the change in the pessimistic error rate due to its disappearance is minimal.

35 Simplified Classification Rules
1. IF Emp_Code = "Government" AND Income = "$250,000 - $299,999" → Real Estate Mgt
2. IF Emp_Code = "Tourism, Hotel" → English Studies
3. IF Emp_Code = "Education" → Computer Science
4. IF Emp_Code = "Others" → English Studies
5. IF Emp_Code = "Manufacturing" → English Studies
6. IF Emp_Code = "Trading" AND Sex = "Female" → English Studies
7. IF Emp_Code = "Construction" AND Job_Code = "Executive" → Real Estate Mgt
8. IF Job_Code = "Sales" → Computer Science
9. IF Emp_Code = "Engineering" AND Job_Code = "Professional, Technical" → Real Estate Mgt
10. IF Emp_Code = "Info. Technology" AND Sex = "Female" → English Studies
11. IF Emp_Code = "Info. Technology" AND Sex = "Male" → Computer Science
12. IF Emp_Code = "Social Work" → Computer Science
13. IF Emp_Code = "Fin/Accounting" → Computer Science
14. IF Emp_Code = "Trading" AND Sex = "Male" → Computer Science
15. IF Job_Code = "Clerical" → English Studies
16. IF Emp_Code = "Property" → Real Estate
17. IF Emp_Code = "Government" AND Income = "$200,000 - $249,999" → English Studies

36 Ranking Rules
After simplifying the classification rule set, the remaining step is to rank the rules according to their prediction reliability percentage, defined as

  (1 - misclassified cases / total cases of the rule) × 100%

For the rule IF Employment = "Trading" AND Sex = "Female" THEN class = "English Studies", there are 6 cases with 0 misclassified cases. This gives a reliability percentage of 100%, so the rule is ranked first in the rule set.
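The reliability percentage and the resulting ordering can be sketched as below; rules are represented here as hypothetical (name, misclassified_cases, covered_cases) tuples.

```python
def reliability(misclassified, covered):
    """Prediction reliability: (1 - misclassified / covered) * 100%."""
    return (1 - misclassified / covered) * 100.0

def rank_rules(rules):
    """Order rules by descending reliability percentage.

    Each rule is a (name, misclassified_cases, covered_cases) tuple.
    """
    return sorted(rules, key=lambda r: reliability(r[1], r[2]), reverse=True)
```

With the slide's example, a rule covering 6 cases with 0 misclassifications scores 100% and sorts to the top.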

37 Success rate ranked classification rules
1. IF Emp_Code = "Trading" AND Sex = "Female" → English Studies
2. IF Emp_Code = "Construction" AND Job_Code = "Executive" → Real Estate Mgt
3. IF Emp_Code = "Info. Technology" AND Sex = "Male" → Computer Science
4. IF Emp_Code = "Social Work" → Computer Science
5. IF Emp_Code = "Government" AND Income = "$250,000 - $299,999" → Real Estate Mgt
6. IF Emp_Code = "Government" AND Income = "$200,000 - $249,999" → English Studies
7. IF Emp_Code = "Trading" AND Sex = "Male" → Computer Science
8. IF Emp_Code = "Property" → Real Estate
9. IF Job_Code = "Sales" → Computer Science
10. IF Emp_Code = "Others" → English Studies
11. IF Emp_Code = "Info. Technology" AND Sex = "Female" → English Studies
12. IF Emp_Code = "Engineering" AND Job_Code = "Professional, Technical" → Real Estate Mgt
13. IF Emp_Code = "Education" → Computer Science
14. IF Emp_Code = "Manufacturing" → English Studies
15. IF Emp_Code = "Tourism, Hotel" → English Studies
16. IF Job_Code = "Clerical" → English Studies
17. IF Emp_Code = "Fin/Accounting" → Computer Science

38 Data Prediction Stage
Classifiers compared by number of misclassified cases and error rate (%): the Pruned Decision Tree and the Classification Rule set.
Both prediction results are reasonably good. The prediction error rate obtained is about 30%, which means nearly 70% of unseen test cases receive an accurate prediction.

39 Summary
– "Employment Industry" is the most significant factor affecting a student's enrolment
– the Decision Tree classifier gives the better prediction result
– the windowing mechanism improves prediction accuracy

40 Reading Assignment
"Data Mining: Concepts and Techniques", 2nd edition, by Han and Kamber, Morgan Kaufmann Publishers, 2007, Chapter 6, pp.

41 Lecture Review Question 11
(i) Explain the term "Information Gain" in decision trees.
(ii) What is the termination condition of the growing tree phase?
(iii) Given a decision tree, which option do you prefer for pruning the resulting rules, and why?
  (a) Convert the decision tree to rules and then prune the resulting rules.
  (b) Prune the decision tree and then convert the pruned tree to rules.

42 CS5483 tutorial question 11
Apply the C4.5 algorithm to construct a decision tree after the first split for the purchase records in the following data, after dividing the tuples into two groups according to "age": one with age less than 25, and the other with age greater than or equal to 25. Show all the steps and calculations for the construction.

  Location  Customer Sex  Age  Purchase records
  Asia      Male          15   Yes
  Asia      Female        23   No
  America   Female        20   No
  Europe    Male          18   No
  Europe    Female        10   No
  Asia      Female        40   Yes
  Europe    Male          33   Yes
  Asia      Male          24   Yes
  America   Male          25   Yes
  Asia      Female        27   Yes
  America   Female        15   Yes
  Europe    Male          19   No
  Europe    Female        33   No
  Asia      Female        35   No
  Europe    Male          14   Yes
  Asia      Male          29   Yes
  America   Male          30   No

