Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 Tree Based Data mining Techniques Haiqin Yang. 2 Why Data Mining? - Necessity is the Mother of Invention  The amount of data increases  Need to convert.

Similar presentations


Presentation on theme: "1 Tree Based Data mining Techniques Haiqin Yang. 2 Why Data Mining? - Necessity is the Mother of Invention  The amount of data increases  Need to convert."— Presentation transcript:

1 1 Tree Based Data mining Techniques Haiqin Yang

2 2 Why Data Mining? - Necessity is the Mother of Invention  The amount of data increases  Need to convert such data -> useful knowledge  The gap between data and analysts is increasing Hidden information is not always evident  Applications Business management, Production control, Bioinformatics, Market analysis, Engineering design, Science exploration...

3 3 What is Data Mining?  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data  Alternative names Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

4 4 KDD Process Database Selection Transformation Preparation Mining Training Evaluation/ Verification Model, Patterns

5 5 Main Tasks in KDD  Classification  Clustering  Regression  Association rule learning

6 6 Top 10 Algorithms in KDD  Identified in ICDM ’ 06  A series of criteria  Select from 18 nominations and vote from various researchers  In 10 topics association analysis, classification, clustering, statistical learning, bagging and boosting, sequential patterns, integrated mining, rough sets, link mining, and graph mining

7 7 Top 10 Algorithms in KDD  C4.5  k -means  Support vector machines  The Apriori algorithms (Association rule analysis)  The EM algorithm  PageRank  AdaBoost  k NN: k -nearest neighbor classification  Naive Bayes  CART: Classification and Regression Tree

8 8 Good Resources  Weka-Data Mining Software in Java  Spider-Machine Learning in Matlab  PyML-Machine Learning in Python  Scikits.learn-Machine Learning in Python  KDnuggets

9 9 Tree-Based Data Mining Techniques  Decision Tree (C4.5)  CART  Adaboost (Ensemble Method)  Random Forest (Ensemble Method)  …

10 10 An Example  Data: Loan application data  Task: Predict whether a loan should be approved or not  Performance measure: accuracy

11 11 An Example  Data: Loan application data  Task: Predict whether a loan should be approved or not  Performance measure: accuracy No learning: classify all future applications (test data) to the majority class (i.e., Yes): Accuracy = 9/15 = 60%  Can we do better than 60% with learning?

12 12 Decision tree  Decision tree learning is one of the most widely used techniques for classification Its classification accuracy is competitive with other methods, and it is very efficient  The classification model is a tree, called decision tree.  C4.5 by Ross Quinlan is perhaps the best known system. Open source

13 13 A decision tree from the loan data Decision nodes and leaf nodes (classes)

14 14 Prediction No

15 15 Is the decision tree unique? No. Here is a simpler tree. Objective: smaller tree and accurate tree Easy to understand and perform better Finding the best tree is NP-hard All current tree building algorithms are heuristic

16 16 Algorithm for decision tree learning  Basic algorithm (a greedy divide-and-conquer algorithm) Assume attributes are categorical (continuous attributes can be handled too) Tree is constructed in a top-down recursive manner At start, all the training examples are at the root Examples are partitioned recursively based on selected attributes Attributes are selected on the basis of an impurity function (e.g., information gain)  Conditions for stopping partitioning All examples for a given node belong to the same class There are no remaining attributes for further partitioning – majority class is the leaf There are no examples left

17 17 Choose an attribute to partition data  The key to building a decision tree - which attribute to choose in order to branch.  The objective is to reduce impurity or uncertainty in data as much as possible. A subset of data is pure if all instances belong to the same class.  The heuristic in C4.5 is to choose the attribute with the maximum Information Gain or Gain Ratio based on information theory.

18 18 CART-Classification and Regression Tree  By L. Breiman, J. Friedman, R. Olshen, and C. Stone (1984)  A non-parametric technique using tree building  Binary splits  Splits based only on one variable

19 19 CART-Load Example (again)

20 20 Three Steps in CART 1. Tree building 2. Pruning 3. Optimal tree selection If the attribute is categorical, then a classification tree is used If it is continuous, regression trees are used

21 21 Step of Tree Building 1. For each non-terminal node 1. For a variable 1. At all its split points, splits samples into two binary nodes 2. Select the best split in the variable in terms of the reduction in impurity (gini index) 2. Rank all of the best splits and select the variable that achieves the highest purity at root 3. Assign classes to the nodes according to a rule that minimizes misclassification costs T 4. Grow a very large tree T max until all terminal nodes are either small or pure or contain identical measurement vectors 2. Prune and choose final tree using the cross validation

22 22 Advantages of CART  Can cope with any data structure or type  Classification has a simple form  Uses conditional information effectively  Invariant under transformations of the variables  Is robust with respect to outliers  Gives an estimate of the misclassification rate

23 23 Disadvantages of CART  CART does not use combinations of variables  Tree can be deceptive – if variable not included it could be as it was “ masked ” by another  Tree structures may be unstable – a change in the sample may give different trees  Tree is optimal at each split – it may not be globally optimal

24 24 AdaBoost  Definition of Boosting: Boosting refers to the general problem of producing a very accurate prediction rule by combining rough and moderately inaccurate rules-of-thumb.  History 1990-Boost by majority (Freund) 1995-AdaBoost (Freund & Schapire) 1997-Generalized AdaBoost (Schapire & Singer) 2001-AdaBoost in face detection (Viola & Jones)  Properties A linear classifier with all its desirable properties Output converges to the logarithm of likelihood ratio Good generalization properties A feature selector by minimizing upper bound of empirical error Close to sequential decision making (simple to complex classifiers)

25 25 Boosting Algorithm  The intuitive idea Altering the distribution over the domain in a way that increases the probability of the “ harder ” parts of the space, thus forcing the weak learner to generate new classifier that make less mistakes on these parts.  AdaBoost algorithm Input variables / Formal setting  D: The distribution over all the training samples  Weak Learner: A weak learning algorithm to be boosted  T: The specified number of iterations  h: Weak classifier h : X → R ∈ {+1,-1}, H = {h(x)}

26 26 Two-class AdaBoost Initialize the weights, uniform distribution As  t increases,  t decreases Increase (decrease) distribution D of wrongly (correctly) classified samples The new distribution is used to train the next classifier. The process is iterated.

27 27 AdaBoost Ensemble Decision  Final classification is given by In two class case, the final output is sign[f(x)]  Advantages of adaboost Adaboost adjusts the errors of the weak classifiers adaptively by Weak Learner The update rule reduces the weight assigned to those examples on which the classifier makes a good predictions, and increases the weight of the examples on which the prediction is poor

28 28 An illustration of AdaBoost  Weak Learner: Threshold applied to one or other axis  Ensemble decision boundary is Green  Current base learner is dashed black line  Size of circle indicates weight assigned Three red circles in the left side and two blue circles in the right side are misclassified. Weights for misclassified points are increased and weights of others are decreased. Find another weak classifier based on current distribution. Add a weak learner: dashed black line. Weights of misclassified samples are Increased and weights of correctly classified samples are decreased. Focus on difficult training samples.

29 29 Conclusion  Basic concepts on data mining  Three tree based data mining techniques Decision tree CART AdaBoost

30 30 Information Gain  Entropy High: uniform, less predictable Low: peaks and valley, more predictable  Conditional Entropy

31 31 Entropy  Suppose X takes n values, V 1, V 2, … V n, P(X=V 1 )=p 1, P(X=V 2 )=p 2, … P(X=V n )=p n  What is the smallest number of bits, on average, per symbol, needed to transmit the symbols drawn from distribution of X?  H(X) = the entropy of X

32 32 Entropy  Suppose X takes n values, V 1, V 2, … V n, P(X=V 1 )=p 1, P(X=V 2 )=p 2, … P(X=V n )=p n  What is the smallest number of bits, on average, per symbol, needed to transmit the symbols drawn from distribution of X?  H(X) = the entropy of X

33 33 Specific Conditional Entropy, H(Y|X=v) XY MathYes HistoryNo CSYes MathNo MathNo CSYes HistoryNo MathYes  Given input X, to predict Y  From data we can estimate probabilities P(LikeG = Yes) = 0.5 P(Major=Math & LikeG=No) = 0.25 P(Major=Math) = 0.5 P(Major=History & LikeG=Yes) = 0  Note H(X) = 1.5 H(Y) = 1 X = College Major Y = Likes “Gladiator”

34 34 Specific Conditional Entropy, H(Y|X=v) XY MathYes HistoryNo CSYes MathNo MathNo CSYes HistoryNo MathYes  Definition  H(Y|X=v) = entropy of Y among only those records in which X has value v  Example: H(Y|X=Math) = 1 H(Y|X=History) = 0 H(Y|X=CS) = 0 X = College Major Y = Likes “Gladiator”

35 35 Conditional Entropy, H(Y|X=v) XY MathYes HistoryNo CSYes MathNo MathNo CSYes HistoryNo MathYes  Definition  H(Y|X) = the average conditional entropy of Y = Σ i P(X=v i ) H(Y|X=v i )  Example: H(Y|X) = 0.5*1+0.25*0+0.25*0 = 0.5 X = College Major Y = Likes “Gladiator” vivi P(X=v i )H(Y|X=v i ) Math0.51 History0.250 CS0.250

36 36 Information Gain XY MathYes HistoryNo CSYes MathNo MathNo CSYes HistoryNo MathYes  Definition  IG(Y|X) = I must transmit Y. How many bits on average would it save me if both ends of the line knew X? IG(Y|X) = H(Y) – H(Y|X)  Example: H(Y) = 1 H(Y|X) = 0.5 Thus: IG(Y|X) = 1 – 0.5 = 0.5 X = College Major Y = Likes “ Gladiator ”

37 37 Decision tree learning algorithm

38 38 Acknowledgement  Part of the slides on Decision Tree borrowed from Bing Liu  Part of the slides on Information Gain borrowed from Andrew Moore  Example of Adaboost is extracted from Hongbo ’ s slide


Download ppt "1 Tree Based Data mining Techniques Haiqin Yang. 2 Why Data Mining? - Necessity is the Mother of Invention  The amount of data increases  Need to convert."

Similar presentations


Ads by Google