1 Tree-Based Data Mining Techniques
Haiqin Yang

2 Why Data Mining? Necessity is the Mother of Invention
- The amount of data keeps increasing
- Need to convert such data into useful knowledge
- The gap between data and analysts is increasing; hidden information is not always evident
- Applications: business management, production control, bioinformatics, market analysis, engineering design, science exploration, ...

3 What is Data Mining?
- Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
- Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

4 KDD Process
Database → Selection → Transformation/Preparation → Mining/Training → Evaluation/Verification → Model, Patterns

5 Main Tasks in KDD
- Classification
- Clustering
- Regression
- Association rule learning

6 Top 10 Algorithms in KDD
- Identified at ICDM '06
- Chosen by a series of criteria: selected from 18 nominations and voted on by various researchers
- Cover 10 topics: association analysis, classification, clustering, statistical learning, bagging and boosting, sequential patterns, integrated mining, rough sets, link mining, and graph mining

7 Top 10 Algorithms in KDD
- C4.5
- k-means
- Support vector machines
- The Apriori algorithm (association rule analysis)
- The EM algorithm
- PageRank
- AdaBoost
- kNN: k-nearest neighbor classification
- Naive Bayes
- CART: Classification and Regression Trees

8 Good Resources
- Weka: data mining software in Java
- Spider: machine learning in Matlab
- PyML: machine learning in Python
- Scikits.learn: machine learning in Python
- KDnuggets

9 Tree-Based Data Mining Techniques
- Decision tree (C4.5)
- CART
- AdaBoost (ensemble method)
- Random forest (ensemble method)
- ...
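As a minimal illustration, the sketch below fits the tree-based methods above using scikit-learn (one of the Python toolkits from the resources slide); the synthetic dataset and parameter values are arbitrary choices for illustration, and scikit-learn's DecisionTreeClassifier is CART-style rather than an exact C4.5 implementation.

```python
# Sketch: fitting the tree-based models listed above on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    # criterion="entropy" splits by information gain, in the spirit of C4.5
    "decision tree": DecisionTreeClassifier(criterion="entropy", max_depth=4),
    "AdaBoost": AdaBoostClassifier(n_estimators=50),
    "random forest": RandomForestClassifier(n_estimators=100),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))
```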

10 An Example
- Data: loan application data
- Task: predict whether a loan should be approved or not
- Performance measure: accuracy

11 An Example
- Data: loan application data
- Task: predict whether a loan should be approved or not
- Performance measure: accuracy
- No learning: classify all future applications (test data) to the majority class (i.e., Yes); accuracy = 9/15 = 60%
- Can we do better than 60% with learning?
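A quick way to check the 60% figure is to compute the majority-class baseline directly; a tiny sketch, with hypothetical labels standing in for the 15 loan applications (9 approved, 6 not):

```python
from collections import Counter

# Hypothetical stand-in for the 15 training labels (9 "Yes", 6 "No")
labels = ["Yes"] * 9 + ["No"] * 6

majority_class, count = Counter(labels).most_common(1)[0]
print(majority_class, count / len(labels))   # Yes 0.6
```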

12 Decision tree
- Decision tree learning is one of the most widely used techniques for classification; its classification accuracy is competitive with other methods, and it is very efficient
- The classification model is a tree, called a decision tree
- C4.5 by Ross Quinlan is perhaps the best-known system (open source)

13 A decision tree from the loan data
[Figure: the learned tree, with decision nodes and leaf nodes (classes)]

14 Prediction
[Figure: a test application is passed down the tree; the predicted class is No]

15 Is the decision tree unique?
- No. Here is a simpler tree
- Objective: a smaller and more accurate tree is easier to understand and tends to perform better
- Finding the best tree is NP-hard
- All current tree-building algorithms are heuristic

16 Algorithm for decision tree learning
- Basic algorithm (a greedy divide-and-conquer algorithm):
  - Assume attributes are categorical (continuous attributes can be handled too)
  - The tree is constructed in a top-down recursive manner
  - At the start, all the training examples are at the root
  - Examples are partitioned recursively based on selected attributes
  - Attributes are selected on the basis of an impurity function (e.g., information gain)
- Conditions for stopping partitioning:
  - All examples for a given node belong to the same class
  - There are no remaining attributes for further partitioning (the majority class becomes the leaf)
  - There are no examples left
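A minimal sketch of this greedy top-down procedure for categorical attributes, using information gain as the impurity function; this is illustrative Python rather than the exact C4.5 algorithm (no pruning, no continuous attributes, and all helper, attribute, and data names are invented here).

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_i p_i log2 p_i over the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Reduction in entropy from partitioning on a categorical attribute."""
    n = len(labels)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    """Greedy top-down induction with the stopping conditions above."""
    if len(set(labels)) == 1:                        # all examples in one class
        return labels[0]
    if not attrs:                                    # no attributes left:
        return Counter(labels).most_common(1)[0][0]  # majority class as the leaf
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attrs if a != best])
    return tree

# Hypothetical toy data in the spirit of the loan example
rows = [{"has_job": "yes", "owns_house": "no"},
        {"has_job": "no",  "owns_house": "no"},
        {"has_job": "yes", "owns_house": "yes"},
        {"has_job": "no",  "owns_house": "yes"}]
labels = ["Yes", "No", "Yes", "Yes"]
print(build_tree(rows, labels, ["has_job", "owns_house"]))
```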

17 Choose an attribute to partition data
- The key to building a decision tree is which attribute to choose in order to branch
- The objective is to reduce impurity or uncertainty in the data as much as possible; a subset of data is pure if all instances belong to the same class
- The heuristic in C4.5 is to choose the attribute with the maximum information gain or gain ratio, based on information theory

18 CART: Classification and Regression Trees
- By L. Breiman, J. Friedman, R. Olshen, and C. Stone (1984)
- A non-parametric technique using tree building
- Binary splits
- Each split is based on only one variable

19 CART: Loan Example (again)

20 Three Steps in CART
1. Tree building
2. Pruning
3. Optimal tree selection
If the target variable is categorical, a classification tree is built; if it is continuous, a regression tree is built.

21 Steps of Tree Building
1. For each non-terminal node:
   1. For each variable:
      1. At all of its split points, split the samples into two binary nodes
      2. Select the best split for the variable in terms of the reduction in impurity (Gini index)
   2. Rank all of the best splits and select the variable that achieves the highest purity at the root
   3. Assign classes to the nodes according to a rule that minimizes misclassification costs
   4. Grow a very large tree T_max until all terminal nodes are either small, pure, or contain identical measurement vectors
2. Prune and choose the final tree using cross-validation
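As a small illustration of the split-selection step, the sketch below scans all split points of a single numeric variable and picks the one with the largest reduction in Gini impurity; the variable name and data are hypothetical, and pruning and optimal-tree selection are not shown.

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 - sum_k p_k^2 over the class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Try every split point of one variable and return the threshold with
    the largest reduction in weighted Gini impurity (step 1 of tree building)."""
    parent = gini(labels)
    best_gain, best_threshold = 0.0, None
    for threshold in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        if not left or not right:
            continue
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if parent - weighted > best_gain:
            best_gain, best_threshold = parent - weighted, threshold
    return best_threshold, best_gain

# Hypothetical data: split "income" to separate approved and rejected loans
income   = [20, 25, 30, 45, 50, 60, 70, 80]
approved = ["No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"]
print(best_binary_split(income, approved))   # (45, 0.5)
```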

22 Advantages of CART
- Can cope with any data structure or type
- Classification has a simple form
- Uses conditional information effectively
- Invariant under transformations of the variables
- Robust with respect to outliers
- Gives an estimate of the misclassification rate

23 Disadvantages of CART
- CART does not use combinations of variables
- The tree can be deceptive: if a variable is not included, it may be because it was "masked" by another
- Tree structures may be unstable: a change in the sample may give a different tree
- The tree is optimal at each split, but it may not be globally optimal

24 AdaBoost
- Definition of boosting: boosting refers to the general problem of producing a very accurate prediction rule by combining rough and moderately inaccurate rules of thumb
- History:
  - 1990: boost-by-majority (Freund)
  - 1995: AdaBoost (Freund & Schapire)
  - 1997: generalized AdaBoost (Schapire & Singer)
  - 2001: AdaBoost in face detection (Viola & Jones)
- Properties:
  - A linear classifier with all its desirable properties
  - Output converges to the logarithm of the likelihood ratio
  - Good generalization properties
  - A feature selector that minimizes an upper bound on the empirical error
  - Close to sequential decision making (from simple to complex classifiers)

25 Boosting Algorithm
- The intuitive idea: alter the distribution over the domain in a way that increases the probability of the "harder" parts of the space, thus forcing the weak learner to generate new classifiers that make fewer mistakes on these parts
- AdaBoost algorithm: input variables / formal setting
  - D: the distribution over all the training samples
  - Weak Learner: a weak learning algorithm to be boosted
  - T: the specified number of iterations
  - h: a weak classifier h: X → {+1, -1}, with H = {h(x)}

26 Two-class AdaBoost
- Initialize the weights to a uniform distribution
- Train a weak classifier on the current distribution and compute its weighted error ε_t; its vote weight is α_t = ½ ln((1 − ε_t) / ε_t), so as ε_t increases, α_t decreases
- Increase the distribution weight D of wrongly classified samples and decrease that of correctly classified samples
- The new distribution is used to train the next classifier; the process is iterated
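A minimal sketch of these updates for the two-class case, assuming labels in {-1, +1} and one-dimensional decision stumps as the weak learners (the stump search here is deliberately crude and only for illustration):

```python
import math

def stump_predict(x, threshold, polarity):
    return polarity if x >= threshold else -polarity

def train_stump(xs, ys, weights):
    """Return the (weighted error, threshold, polarity) with the smallest error."""
    best = (float("inf"), None, None)
    for threshold in set(xs):
        for polarity in (+1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, T=10):
    n = len(xs)
    weights = [1.0 / n] * n                      # D_1: uniform distribution
    ensemble = []                                # list of (alpha_t, threshold, polarity)
    for _ in range(T):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = max(err, 1e-10)                    # guard against division by zero
        if err >= 0.5:
            break                                # weak learner no better than chance
        alpha = 0.5 * math.log((1 - err) / err)  # as eps_t grows, alpha_t shrinks
        ensemble.append((alpha, threshold, polarity))
        # Increase weights on mistakes, decrease them on correct predictions
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]   # renormalize to a distribution
    return ensemble

def predict(ensemble, x):
    """Final decision: sign of the weighted vote f(x) = sum_t alpha_t * h_t(x)."""
    f = sum(alpha * stump_predict(x, threshold, polarity)
            for alpha, threshold, polarity in ensemble)
    return 1 if f >= 0 else -1
```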

27 AdaBoost Ensemble Decision
- The final classification is given by the weighted combination f(x) = Σ_t α_t h_t(x); in the two-class case, the final output is sign[f(x)]
- Advantages of AdaBoost:
  - AdaBoost adjusts adaptively to the errors of the weak classifiers returned by the Weak Learner
  - The update rule reduces the weight assigned to those examples on which the classifier makes good predictions, and increases the weight of the examples on which the prediction is poor

28 An illustration of AdaBoost
- Weak learner: a threshold applied to one or the other axis
- The ensemble decision boundary is green; the current base learner is the dashed black line; the size of a circle indicates the weight assigned
- Three red circles on the left side and two blue circles on the right side are misclassified; the weights of the misclassified points are increased and the weights of the others are decreased
- A new weak classifier (dashed black line) is then found based on the current distribution and added to the ensemble; again, weights of misclassified samples are increased and weights of correctly classified samples are decreased, so the algorithm focuses on the difficult training samples

29 Conclusion
- Basic concepts of data mining
- Three tree-based data mining techniques:
  - Decision tree
  - CART
  - AdaBoost

30 Information Gain
- Entropy
  - High: uniform distribution, less predictable
  - Low: peaks and valleys, more predictable
- Conditional entropy

31 Entropy
- Suppose X takes n values V_1, V_2, ..., V_n, with P(X=V_1)=p_1, P(X=V_2)=p_2, ..., P(X=V_n)=p_n
- What is the smallest number of bits, on average, per symbol, needed to transmit symbols drawn from the distribution of X?
- H(X) = the entropy of X = -(p_1 log2 p_1 + p_2 log2 p_2 + ... + p_n log2 p_n)


33 Specific Conditional Entropy, H(Y|X=v)
X = College Major, Y = Likes “Gladiator”

  X        Y
  -------  ---
  Math     Yes
  History  No
  CS       Yes
  Math     No
  Math     No
  CS       Yes
  History  No
  Math     Yes

- Given input X, we want to predict Y
- From the data we can estimate probabilities:
  P(LikeG=Yes) = 0.5
  P(Major=Math & LikeG=No) = 0.25
  P(Major=Math) = 0.5
  P(Major=History & LikeG=Yes) = 0
- Note: H(X) = 1.5, H(Y) = 1

34 Specific Conditional Entropy, H(Y|X=v)
X = College Major, Y = Likes “Gladiator” (same data as the previous slide)
- Definition: H(Y|X=v) = the entropy of Y among only those records in which X has value v
- Example:
  H(Y|X=Math) = 1
  H(Y|X=History) = 0
  H(Y|X=CS) = 0

35 Conditional Entropy, H(Y|X)
X = College Major, Y = Likes “Gladiator” (same data as above)
- Definition: H(Y|X) = the average conditional entropy of Y = Σ_i P(X=v_i) H(Y|X=v_i)
- Example:

  v_i      P(X=v_i)  H(Y|X=v_i)
  Math     0.5       1
  History  0.25      0
  CS       0.25      0

  H(Y|X) = 0.5*1 + 0.25*0 + 0.25*0 = 0.5

36 Information Gain
X = College Major, Y = Likes “Gladiator” (same data as above)
- Definition: I must transmit Y. How many bits on average would it save me if both ends of the line knew X?
  IG(Y|X) = H(Y) - H(Y|X)
- Example:
  H(Y) = 1
  H(Y|X) = 0.5
  Thus IG(Y|X) = 1 - 0.5 = 0.5
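The numbers in this example can be checked with a few lines of code (the tuples below mirror the Major / Likes “Gladiator” table above):

```python
import math
from collections import Counter

data = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
        ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

ys = [y for _, y in data]
H_Y = entropy(ys)                                # H(Y) = 1.0
H_Y_given_X = sum(
    (sum(1 for x, _ in data if x == v) / len(data))
    * entropy([y for x, y in data if x == v])
    for v in set(x for x, _ in data)
)                                                # H(Y|X) = 0.5
print("IG(Y|X) =", H_Y - H_Y_given_X)            # IG(Y|X) = 0.5
```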

37 Decision tree learning algorithm

38 Acknowledgement
- Part of the slides on decision trees are borrowed from Bing Liu
- Part of the slides on information gain are borrowed from Andrew Moore
- The AdaBoost example is extracted from Hongbo's slides