Decision tree and random forest

Presentation transcript:

Decision tree and random forest, Lecture 1

Playing golf according to the weather; example from http://www.saedsayad.com

Decide whether one plays golf or not according to the weather (target: yes or no). The decision attributes are outlook, temperature, humidity, and windy; a frequency table of each attribute against the target is built from the data.
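The frequency tables can be reproduced with a short script. The following is a minimal sketch assuming pandas is available; the 14 rows are the classic play-golf weather data used in the cited example.

```python
# Rebuild the classic 14-example play-golf data set and print the frequency
# table of each decision attribute against the target (Play).
import pandas as pd

data = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Rainy", "Overcast",
                 "Sunny", "Sunny", "Rainy", "Sunny", "Overcast", "Overcast", "Rainy"],
    "Temp":     ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                 "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity": ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                 "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Windy":    [False, True, False, False, False, True, True,
                 False, False, False, True, True, False, True],
    "Play":     ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                 "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

for attribute in ["Outlook", "Temp", "Humidity", "Windy"]:
    print(pd.crosstab(data[attribute], data["Play"]), "\n")
```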

Entropy: E(S) = sum_i -p_i log2(p_i), where p_i is the fraction of examples in class i. Entropy measures how homogeneous a sample is: 0 when all examples belong to one class, and 1 (for a binary target) when the classes are split evenly.

Entropy curve for a binary class: E = -p log2(p) - q log2(q) with q = 1 - p; the curve peaks at 1 when p = 0.5 and falls to 0 at p = 0 and p = 1.

Information gain: Gain(T, X) = Entropy(T) - Entropy(T, X), where Entropy(T, X) is the weighted entropy of the branches after splitting T on attribute X. The information gain is the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches). Step 1: Calculate the entropy of the target.

Step 2: The dataset is then split on the different attributes. The entropy of each branch is calculated and summed with weights proportional to the branch sizes to get the total entropy of the split. The resulting entropy is subtracted from the entropy before the split; the result is the information gain, or decrease in entropy. G(PlayGolf, Outlook) = E(PlayGolf) - E(PlayGolf, Outlook) = 0.94 - 0.693 = 0.247
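Steps 1 and 2 can be checked with a few lines of Python. This is a minimal sketch; the class counts (9 Yes / 5 No overall, and 2/3, 4/0, 3/2 for Sunny, Overcast, Rainy) come from the frequency table of the outlook attribute.

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Step 1: entropy of the target (9 Yes, 5 No).
e_target = entropy([9, 5])                                          # about 0.940

# Step 2: entropy after splitting on Outlook, weighted by branch size.
branches = {"Sunny": [2, 3], "Overcast": [4, 0], "Rainy": [3, 2]}
n = sum(sum(c) for c in branches.values())
e_split = sum(sum(c) / n * entropy(c) for c in branches.values())   # about 0.693

gain = e_target - e_split                                           # about 0.247
print(f"E = {e_target:.3f}, E_split = {e_split:.3f}, Gain = {gain:.3f}")
```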

Step 3: Choose the attribute with the largest information gain as the decision node.

Step 4a: A branch with entropy of 0 is a leaf node.

Step 4b: A branch with entropy more than 0 needs further splitting.

Step 5: The algorithm is run recursively on the non-leaf branches, until all data is classified.
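Steps 3-5 describe a recursive, greedy procedure (essentially the ID3 algorithm). The sketch below is one way to express it, assuming the data are given as a list of dicts such as {"Outlook": "Sunny", ..., "Play": "No"}; the function names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())

def info_gain(rows, attr, target):
    subsets = {}
    for r in rows:
        subsets.setdefault(r[attr], []).append(r)
    weighted = sum(len(s) / len(rows) * entropy(s, target) for s in subsets.values())
    return entropy(rows, target) - weighted

def build_tree(rows, attributes, target="Play"):
    classes = Counter(r[target] for r in rows)
    # Step 4a / stopping rule: a pure branch (entropy 0) becomes a leaf; if no
    # attributes remain, predict the majority class.
    if len(classes) == 1 or not attributes:
        return classes.most_common(1)[0][0]
    # Step 3: choose the attribute with the largest information gain.
    best = max(attributes, key=lambda a: info_gain(rows, a, target))
    tree = {best: {}}
    # Steps 4b and 5: split the impure branches and recurse on each one.
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, target)
    return tree
```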

Decision Tree to Decision Rules. A decision tree can easily be transformed into a set of rules by mapping each path from the root node to a leaf node into one rule.
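With a fitted tree, rule extraction amounts to walking each root-to-leaf path. A minimal sketch using scikit-learn's export_text helper, shown here on the Iris data purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(iris.data, iris.target)

# Each printed root-to-leaf path corresponds to one IF ... THEN ... rule.
print(export_text(tree, feature_names=list(iris.feature_names)))
```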

Random forest (Wikipedia). Decision trees are a popular method for various machine learning tasks. Tree learning "comes closest to meeting the requirements for serving as an off-the-shelf procedure for data mining", say Hastie et al., because it is invariant under scaling and various other transformations of feature values, is robust to the inclusion of irrelevant features, and produces inspectable models. However, single trees are seldom accurate. In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, i.e. they have low bias but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance of the final model.

Tree bagging. The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x1, ..., xn with responses Y = y1, ..., yn, bagging repeatedly (B times) selects a random sample with replacement from the training set and fits trees to these samples. For b = 1, ..., B: (1) sample, with replacement, n training examples from X, Y; call these Xb, Yb; (2) train a decision or regression tree fb on Xb, Yb.

Averaging. After training, predictions for an unseen sample x' can be made by averaging the predictions from all the individual regression trees on x', f(x') = (1/B) sum_b fb(x'), or by taking the majority vote in the case of classification trees.
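A minimal sketch of the bagging-plus-aggregation procedure just described, assuming numpy arrays X, y and scikit-learn tree learners; the function names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, B=100, random_state=0):
    """Fit B trees, each on a bootstrap sample of size n drawn from (X, y)."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)        # sample n examples with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_majority(trees, X_new):
    """Aggregate by majority vote; a regression forest would average instead."""
    votes = np.stack([t.predict(X_new) for t in trees])   # shape (B, n_new)
    predictions = []
    for column in votes.T:
        values, counts = np.unique(column, return_counts=True)
        predictions.append(values[np.argmax(counts)])
    return np.array(predictions)
```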

This bootstrapping procedure leads to better model performance because it decreases the variance of the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets.

The number of samples/trees, B, is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. An optimal number of trees B can be found using cross-validation, or by observing the out-of-bag error: the mean prediction error on each training sample xᵢ, using only the trees that did not have xᵢ in their bootstrap sample. The training and test error tend to level off after some number of trees have been fit.
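One way to watch the out-of-bag error level off as B grows, sketched with scikit-learn on a synthetic data set (the data and the grid of tree counts are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for n_trees in (25, 50, 100, 300, 1000):
    forest = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                    random_state=0).fit(X, y)
    # oob_score_ is the accuracy on the out-of-bag samples, so 1 - oob_score_
    # is the out-of-bag error for this number of trees.
    print(f"B = {n_trees:4d}  OOB error = {1 - forest.oob_score_:.3f}")
```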

From bagging to random forests The above procedure describes the original bagging algorithm for trees. Random forests differ in only one way from this general scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the B trees, causing them to become correlated. An analysis of how bagging and random subspace projection contribute to accuracy gains under different conditions is given by Ho. Typically, for a classification problem with p features, √p (rounded down) features are used in each split. For regression problems the inventors recommend p/3 (rounded down) with a minimum node size of 5 as the default.
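In scikit-learn this feature-bagging rule is the max_features parameter. The sketch below shows the sqrt(p) and p/3 recommendations on synthetic data; min_samples_leaf is used as a rough stand-in for the minimum node size.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

Xc, yc = make_classification(n_samples=500, n_features=25, random_state=0)
Xr, yr = make_regression(n_samples=500, n_features=25, random_state=0)

# Classification: consider about sqrt(p) features at each split (5 of 25 here).
clf = RandomForestClassifier(max_features="sqrt", random_state=0).fit(Xc, yc)

# Regression: p/3 features per split and a minimum node size of 5.
reg = RandomForestRegressor(max_features=1/3, min_samples_leaf=5,
                            random_state=0).fit(Xr, yr)
```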

ExtraTrees. Adding one further step of randomization yields extremely randomized trees, or ExtraTrees. These are trained using bagging and the random subspace method, as in an ordinary random forest, but additionally the top-down splitting in the tree learner is randomized. Instead of computing the locally optimal feature/split combination (based on, e.g., information gain or the Gini impurity), for each feature under consideration a random value is selected for the split. This value is selected from the feature's empirical range (in the tree's training set, i.e., the bootstrap sample).
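For comparison, scikit-learn ships ExtraTreesClassifier alongside RandomForestClassifier; a small, purely illustrative sketch. Note that scikit-learn's ExtraTrees default to bootstrap=False, so bagging is switched on explicitly here to match the description above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for model in (RandomForestClassifier(random_state=0),
              ExtraTreesClassifier(bootstrap=True, random_state=0)):
    # Both draw a random subset of features per split; ExtraTrees additionally
    # draw the split thresholds at random instead of optimizing them.
    score = cross_val_score(model, X, y, cv=5).mean()
    print(type(model).__name__, f"accuracy = {score:.3f}")
```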

Variable importance Random forests can be used to rank the importance of variables in a regression or classification problem in a natural way. The following technique was described in Breiman's original paper and is implemented in the R package randomForest. The first step in measuring the variable importance in a data set is to fit a random forest to the data. During the fitting process the out-of-bag error for each data point is recorded and averaged over the forest (errors on an independent test set can be substituted if bagging is not used during training).

To measure the importance of the j-th feature after training, the values of the j-th feature are permuted among the training data and the out-of-bag error is again computed on this perturbed data set. The importance score for the j-th feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees. The score is normalized by the standard deviation of these differences. Features which produce large values for this score are ranked as more important than features which produce small values.
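A minimal sketch of permutation importance with scikit-learn. Note that permutation_importance evaluates the permuted error on whatever data you pass it (a held-out test set here) rather than on the out-of-bag samples described above; the data set is synthetic and illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and record the drop in score, averaged over
# n_repeats shuffles.
result = permutation_importance(forest, X_test, y_test, n_repeats=20,
                                random_state=0)
for j in result.importances_mean.argsort()[::-1]:
    print(f"feature {j}: {result.importances_mean[j]:.3f} "
          f"+/- {result.importances_std[j]:.3f}")
```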

This method of determining variable importance has some drawbacks. For data including categorical variables with different numbers of levels, random forests are biased in favor of the attributes with more levels. Methods such as partial permutations and growing unbiased trees can be used to solve the problem. If the data contain groups of correlated features of similar relevance for the output, then smaller groups are favored over larger groups.

Relationship to nearest neighbors. A relationship between random forests and the k-nearest neighbor algorithm (k-NN) was pointed out by Lin and Jeon in 2002. It turns out that both can be viewed as so-called weighted neighborhood schemes. These are models built from a training set {(xi, yi)}, i = 1, ..., n, that make predictions ŷ for new points x' by looking at the "neighborhood" of the point, formalized by a weight function W: ŷ = sum_i W(xi, x') yi, where W(xi, x') is the nonnegative weight of the i-th training point relative to the new point x' and the weights sum to one.

In k-NN, the weights are W(xi, x') = 1/k if xi is one of the k points closest to x', and zero otherwise. In a tree, W(xi, x') = 1/k' if xi is one of the k' points in the same leaf as x', and zero otherwise. Since a forest averages the predictions of a set of m trees with individual weight functions Wj, its predictions are ŷ = (1/m) sum_j sum_i Wj(xi, x') yi = sum_i ((1/m) sum_j Wj(xi, x')) yi.

This shows that the whole forest is again a weighted neighborhood scheme, with weights that average those of the individual trees. The neighbors of x' in this interpretation are the points xi which fall in the same leaf as x' in at least one tree of the forest. In this way, the neighborhood of x' depends in a complex way on the structure of the trees, and thus on the structure of the training set. Lin and Jeon show that the shape of the neighborhood used by a random forest adapts to the local importance of each feature.

Unsupervised learning with random forests As part of their construction, RF predictors naturally lead to a dissimilarity measure between the observations. One can also define an RF dissimilarity measure between unlabeled data: the idea is to construct an RF predictor that distinguishes the “observed” data from suitably generated synthetic data. The observed data are the original unlabeled data and the synthetic data are drawn from a reference distribution. An RF dissimilarity can be attractive because it handles mixed variable types well, is invariant to monotonic transformations of the input variables, and is robust to outlying observations. The RF dissimilarity easily deals with a large number of semi-continuous variables due to its intrinsic variable selection; for example, the "Addcl 1" RF dissimilarity weighs the contribution of each variable according to how dependent it is on other variables. The RF dissimilarity has been used in a variety of applications, e.g. to find clusters of patients based on tissue marker data.
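A rough sketch of this idea, assuming scikit-learn: label the observed data 0, create synthetic data by permuting each column independently (which keeps the marginal distributions but breaks the dependence between variables, in the spirit of the "Addcl 1" construction), train a forest to separate the two, and turn leaf co-occurrence into a dissimilarity. The function name and details are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_dissimilarity(X, n_trees=500, random_state=0):
    rng = np.random.default_rng(random_state)
    n, p = X.shape

    # Synthetic reference data: permute each column independently.
    X_synth = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])
    X_all = np.vstack([X, X_synth])
    y_all = np.concatenate([np.zeros(n), np.ones(n)])      # observed vs synthetic

    forest = RandomForestClassifier(n_estimators=n_trees,
                                    random_state=random_state).fit(X_all, y_all)

    # Proximity of two observed points = fraction of trees in which they fall
    # into the same leaf; dissimilarity = 1 - proximity.
    leaves = forest.apply(X)                                # shape (n, n_trees)
    proximity = np.zeros((n, n))
    for t in range(leaves.shape[1]):
        same_leaf = leaves[:, t][:, None] == leaves[:, t][None, :]
        proximity += same_leaf
    proximity /= leaves.shape[1]
    return 1.0 - proximity
```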

Thanks for your attention!