CS548 Fall 2017 Decision Trees / Random Forest Showcase by Yimin Lin, Youqiao Ma, Ran Lin, Shaoju Wu, Bhon Bunnag
Showcasing work by Cano, G., Garcia-Rodriguez, J., Garcia-Garcia, A., Perez-Sanchez, H., Benediktsson, J. A., Thapa, A., & Barr, A. on "Automatic selection of molecular descriptors using random forest: Application to drug discovery"
Description: Optimal feature selection is an essential pre-processing step for the efficient application of data mining techniques in drug discovery. The paper examines a Random Forest-based approach that automatically selects features for classification. Reducing the feature set cuts computing time relative to existing approaches and permits the exploration of much larger datasets. Results from models fit on the automatically reduced features are compared with manual feature selection across RF, SVM, and NN approaches.

References
[1] Cano, G., Garcia-Rodriguez, J., Garcia-Garcia, A., Perez-Sanchez, H., Benediktsson, J. A., Thapa, A., & Barr, A. (2017). Automatic selection of molecular descriptors using random forest: Application to drug discovery. Expert Systems with Applications, 72, 151-159.
[2] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2015). An Introduction to Statistical Learning. Springer.
[3] Tan, P.-N., Steinbach, M., & Kumar, V. (2005). Introduction to Data Mining. Addison-Wesley. ISBN-13: 9780321321367.
[4] ROC Curve: https://en.wikipedia.org/wiki/Receiver_operating_characteristic

Introduction
The importance of drug discovery methods: finding good molecular descriptors to predict molecule bioactivity
Virtual screening method: a challenging task
Novelty: using Random Forest (RF) as both a feature selection and a classification tool
Benefits: reduction of data and features, improved performance, reduced noise and fewer irrelevant features
Bhon

Datasets
Three datasets, each with its own target class
Bhon

Methodology - OOB (Out of Bag)
Suppose we sample observations with replacement from {1, 2, 3, 4, 5} to get 10 bootstrapped samples. On average, a fraction of about 1/e ≈ 0.368 (roughly one third) of the observations is left out of each bootstrap sample; see the sketch below.
Youqiao 1min 3min30s
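A minimal sketch verifying that 1/e figure empirically (our illustration, assuming NumPy; not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000        # number of observations
draws = 200     # number of bootstrap samples

oob_fractions = []
for _ in range(draws):
    # Sample n indices with replacement, as bagging does
    sample = rng.integers(0, n, size=n)
    # Observations never drawn are "out of bag" for this sample
    n_oob = n - len(np.unique(sample))
    oob_fractions.append(n_oob / n)

# Converges to (1 - 1/n)^n -> 1/e ~= 0.368 as n grows
print(f"mean OOB fraction: {np.mean(oob_fractions):.3f}")
```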

Methodology - OOB
Decision Tree => Tree Bagging
OOB - Out of Bag
Youqiao 45s
Before we get into random forest, there is some background to cover: the actual progression from decision tree to random forest. We all know the decision tree; however, a single decision tree can overfit, meaning it fits the training data too closely. A remedy is bagging (bootstrap aggregating): we sample n observations with replacement to train each tree. Because of the replacement, some observations never appear in a given sample; according to the textbook and our experiment, about 1/3 of the observations go unused in each bootstrap. Bagging reduces variance by averaging the trees' predictions.
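A minimal bagging sketch using scikit-learn (our illustration, not the paper's code; the default base learner is a decision tree, and the synthetic data stands in for the paper's datasets):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Each base tree is fit on a bootstrap sample of the rows;
# predictions are aggregated by majority vote
bag = BaggingClassifier(
    n_estimators=100,
    bootstrap=True,
    oob_score=True,   # score each observation with trees that never saw it
    random_state=0,
).fit(X, y)

print(f"OOB accuracy: {bag.oob_score_:.3f}")
```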

Methodology - Random Forest
Sample Observations (Bootstrap)
Sample Features
Youqiao 1min
Now that we sample observations, why not sample features too? That gives us the random forest. At each split, RF samples a certain number of features and chooses the best split among them by Gini index; by default that number is the square root of the total number of features. At prediction time, each tree gives a result, and RF takes the majority vote for classification. For example, in this picture we have a result from each tree in the forest; if most of them predict category C, the RF says it is C.
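In scikit-learn terms this looks as follows (a sketch, not the paper's code; max_features="sqrt" is the feature-subsampling rule just described):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=25, random_state=0)

# Each tree: a bootstrap sample of rows, plus a random sqrt(n_features)-
# sized subset of features considered at every split (Gini criterion)
rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",
    criterion="gini",
    random_state=0,
).fit(X, y)

# Classification output is the majority vote over the individual trees
print(rf.predict(X[:5]))
```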

Methodology - Error Estimation
Algorithm:
    error = 0
    for observation i in dataset (1, 2, ..., n):
        for tree j in random forest (1, 2, ..., m):
            if observation i is in the OOB set of tree j:
                collect tree_j.predict(observation_i)
        take the majority vote for observation i among those trees
        if majority vote != y_i:
            error = error + 1
    error_estimate = error / n
Youqiao 1min
The paper also mentions this error-estimation method. The algorithm is similar in spirit to cross-validation: every observation is scored only by trees that did not train on it, giving an error estimate for the RF without a separate hold-out set.
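A runnable sketch of that procedure (our paraphrase, assuming NumPy and scikit-learn; sklearn's oob_score_ performs the same bookkeeping internally):

```python
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
n, m = len(y), 100
rng = np.random.default_rng(0)

trees, oob_masks = [], []
for _ in range(m):
    idx = rng.integers(0, n, size=n)   # bootstrap sample for this tree
    oob = np.ones(n, dtype=bool)
    oob[idx] = False                   # rows the tree trained on
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    oob_masks.append(oob)

errors = 0
for i in range(n):
    # Only trees that never saw observation i may vote on it
    votes = [t.predict(X[i:i + 1])[0]
             for t, oob in zip(trees, oob_masks) if oob[i]]
    if votes and Counter(votes).most_common(1)[0][0] != y[i]:
        errors += 1

print(f"OOB error estimate: {errors / n:.3f}")
```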

Methodology - AUC
AUC - Area Under the (ROC) Curve
Youqiao
TPR (true positive rate): the fraction of actual positives that the model correctly predicts as positive, TP / (TP + FN). FPR (false positive rate): the fraction of actual negatives that the model incorrectly predicts as positive, FP / (FP + TN). What I can tell is that AUC is more informative than accuracy when we have an imbalanced dataset: with too many Yes and too few No examples, a model could predict everything to be Yes and still have high accuracy, but that makes no sense because the model cannot identify the No class. AUC exposes this.
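A short sketch of computing AUC on an imbalanced dataset with scikit-learn (illustrative; the class weights and model here are our assumptions, not the paper's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced data: ~90% of one class, where raw accuracy is misleading
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# AUC is computed from scores/probabilities, not hard labels
scores = rf.predict_proba(X_te)[:, 1]
print(f"AUC: {roc_auc_score(y_te, scores):.3f}")
```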

Research Structure Youqiao

Procedure - Feature Selection
Importance of Variables
MDA - Mean Decrease Accuracy: the drop in prediction accuracy when a feature's values are randomly permuted
Yimin

Procedure - Feature Selection (Cont.)
MDG (Mean Decrease Gini): the total reduction in Gini impurity contributed by splits on a feature, averaged over the trees
Relative importance of predictors for the MR dataset
Yimin
Selection strategies: 1. Ad hoc (manual selection) 2. Automatic (see the sketch below)
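Both measures have scikit-learn analogues (a sketch under our naming, not the paper's code; the importance threshold used for the automatic selection is our placeholder):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=15, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# MDG analogue: mean decrease in Gini impurity across all splits
mdg = rf.feature_importances_

# MDA analogue: drop in accuracy when each feature is permuted
mda = permutation_importance(rf, X_te, y_te, n_repeats=10,
                             random_state=0).importances_mean

# Automatic selection: keep features above a (hypothetical) threshold
keep = [i for i, v in enumerate(mda) if v > 0.01]
print("selected features:", keep)
```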

Procedure - Classification
The model's behavior is influenced by two parameters: the number of trees and the number of features considered at each split.
mtry - number of features sampled at each split
Ran

Procedure - Classification
ntree - number of trees in the forest
Ran
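In scikit-learn terms, ntree maps to n_estimators and mtry to max_features; a minimal grid search over the two (our sketch, not the paper's experimental setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, n_features=30, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={
        "n_estimators": [100, 300, 500],  # ntree: more trees, stabler error
        "max_features": [3, 5, "sqrt"],   # mtry: features tried per split
    },
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```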

Results - Feature Selection Shaoju Wu

Results - Comparison
The unstable behavior seen in the Support Vector Machine (SVM) and Neural Network (NNET) results may come from their difficulty handling high-dimensional datasets with a low number of observations. RF outperforms the other two methods while using a minimal subset of relevant features.
Shaoju Wu

Results - Comparison (cont.) Shaoju Wu

Conclusion
Random Forests: a data mining algorithm that operates by constructing multiple decision trees on random subsets of the data at training time and outputting the class that is the mode of the individual trees' predictions.
The RF-based method outperforms the classification results provided by the Support Vector Machine (SVM) and Neural Network (NN) approaches.
It reduces features and runtime, allowing larger sets of data to be processed.
Bhon