
Skewing: An Efficient Alternative to Lookahead for Decision Tree Induction
David Page, Soumya Ray
Department of Biostatistics and Medical Informatics / Department of Computer Sciences
University of Wisconsin, Madison, USA

Main Contribution
Greedy tree learning algorithms suffer from myopia. The standard remedy is lookahead, which is computationally very expensive. We present an approach that addresses the myopia of tree learners efficiently.

Task Setting
Given: m examples over n Boolean attributes, each labeled according to a function f of some subset of those attributes.
Do: Learn the Boolean function f.

TDIDT Algorithm
Top-Down Induction of Decision Trees: a greedy algorithm that, at each node, chooses the feature that locally optimizes some measure of "purity" of the class labels, such as information gain or the Gini index.
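As an illustration (not the authors' code), the greedy step with information gain as the purity measure can be sketched as:

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(examples, labels, attr):
    """Drop in entropy from splitting on Boolean attribute `attr`."""
    n = len(examples)
    gain = entropy(labels)
    for value in (0, 1):
        branch = [labels[i] for i, x in enumerate(examples) if x[attr] == value]
        if branch:
            gain -= len(branch) / n * entropy(branch)
    return gain

def choose_split(examples, labels, attrs):
    """Greedy TDIDT step: the attribute with maximal information gain."""
    return max(attrs, key=lambda a: information_gain(examples, labels, a))
```

The Gini index could be substituted for `entropy` without changing the structure of the greedy step.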

TDIDT Example
Training data (columns as on the slide):
Value  x3  x2  x1
+      1   1   0
−      0   1   0
(remaining rows garbled in the source)

TDIDT Example
Split on x1. The x1=1 branch holds (1−) and becomes a − leaf. The x1=0 branch holds (2+, 1−), so it is split again on x2: x2=0 gives (1+) and x2=1 gives (1+, 1−).

Outline
Introduction to the TDIDT algorithm
Myopia and "hard" functions
Skewing
Experiments with the skewing algorithm
Sequential skewing
Experiments with sequential skewing
Conclusions and future work

Myopia and Correlation Immunity
For certain Boolean functions, no single variable has "gain" according to standard purity measures (e.g., entropy, Gini): no variable is correlated with the class. In cryptography, such functions are called correlation immune. Given such a target function, every variable looks equally good (or equally bad), so in an application the learner cannot differentiate between relevant and irrelevant variables.

A Correlation Immune Function
f = x1 ⊕ x2 (exclusive-or):
x1  x2  f
0   0   0
0   1   1
1   0   1
1   1   0
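The correlation immunity of f = x1 ⊕ x2 is easy to verify directly: fixing any one variable to either value leaves the class labels split exactly 50/50, so no variable has any gain under the uniform distribution. A quick check (illustrative only):

```python
from itertools import product

# All 2^3 settings of (x1, x2, x3), equally weighted (the uniform
# distribution); the target is f = x1 XOR x2, and x3 is irrelevant.
rows = [(x1, x2, x3, x1 ^ x2) for x1, x2, x3 in product((0, 1), repeat=3)]

# Fixing any single variable to either value leaves the classes split
# exactly 50/50, so no variable has any gain under entropy or Gini.
for attr in range(3):
    for value in (0, 1):
        branch = [f for *xs, f in rows if xs[attr] == value]
        assert branch.count(1) == len(branch) // 2 == 2
```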

Examples
In Drosophila, survival is an exclusive-or function of gender and the expression of the SxL gene. In drug binding (ligand-domain interactions), binding may have an exclusive-or subfunction of ligand charge and domain charge.

Learning Hard Functions
The standard method of learning hard functions with TDIDT is depth-k lookahead, which costs O(m·n^(2^(k+1)−1)) for m examples over n variables. Can we devise a technique that allows TDIDT algorithms to learn hard functions efficiently?
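To make the growth concrete: the exponent of n in the bound above is 2^(k+1) − 1, so the cost explodes with lookahead depth. A tiny illustration (the helper name is ours, not from the slides):

```python
def lookahead_cost_exponent(k):
    """Exponent of n in the O(m * n**(2**(k+1) - 1)) bound for depth-k lookahead."""
    return 2 ** (k + 1) - 1

# k = 0 recovers the O(m*n) greedy case; depth 1 already gives n^3,
# depth 2 gives n^7, and depth 3 gives n^15.
```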

Key Idea Correlation immune functions aren’t hard – if the data distribution is significantly different from uniform

Example
The uniform distribution can be sampled by setting each variable (feature) independently of all others, with probability 0.5 of being set to 1. Now consider a distribution where each variable instead has probability 0.75 of being set to 1.

Example (continued)
[Several slides step through the eight rows of the truth table over x1, x2, x3, showing for each row the value of f, its weight under the skewed distribution, and the resulting weighted sums; the table contents were not preserved in the transcript.]
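The reweighting the slides step through can be reproduced in a few lines. The sketch below is an illustration only: it assumes the target is f = x1 ⊕ x2 and that each variable's favored setting gives it probability 0.75 of being 1, as in the running example. Under these weights the relevant variables gain while x3 still has none.

```python
import math
from itertools import product

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

P = 0.75  # skewed distribution: each variable is 1 with probability 0.75
rows = [(x, x[0] ^ x[1],
         (P if x[0] else 1 - P) * (P if x[1] else 1 - P) * (P if x[2] else 1 - P))
        for x in product((0, 1), repeat=3)]

def weighted_gain(attr):
    """Information gain of splitting on `attr` under the skewed weights."""
    total = sum(w for _, _, w in rows)
    gain = h(sum(w for _, f, w in rows if f) / total)
    for v in (0, 1):
        bw = sum(w for x, _, w in rows if x[attr] == v)
        bp = sum(w for x, f, w in rows if x[attr] == v and f) / bw
        gain -= bw / total * h(bp)
    return gain

# x1 and x2 (the relevant variables) now have positive gain; x3 has none.
```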

Key Idea
Given a large enough sample and a second distribution sufficiently different from the first, we can learn functions that are hard for TDIDT algorithms under the original distribution.

Issues to Address
How can we get a "sufficiently different" distribution? Our approach: "skew" the given sample by choosing "favored settings" for the variables.
What about not-large-enough-sample effects? Our approach: average the "goodness" of each variable over multiple skews.

Skewing Algorithm
For T trials do:
1. Choose a favored setting for each variable.
2. Reweight the sample.
3. Calculate the entropy of each variable's split under this weighting.
4. For each variable that has sufficient gain, increment a counter.
Finally, split on the variable with the highest count.
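A minimal sketch of this loop (illustrative only; the number of trials, the skew strength, the gain threshold, and the tie-breaking rule are our assumptions, not the authors' exact settings):

```python
import math
import random

def _h(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def weighted_gain(examples, labels, weights, attr):
    """Information gain of a split on Boolean `attr` under example weights."""
    total = sum(weights)
    gain = _h(sum(w for y, w in zip(labels, weights) if y) / total)
    for v in (0, 1):
        bw = sum(w for x, w in zip(examples, weights) if x[attr] == v)
        if bw == 0:
            continue
        bp = sum(w for x, y, w in zip(examples, labels, weights)
                 if x[attr] == v and y) / bw
        gain -= bw / total * _h(bp)
    return gain

def skewed_split(examples, labels, attrs, trials=30, skew=0.75,
                 min_gain=0.01, seed=0):
    """Pick a split attribute by averaging gain over several random skews."""
    rng = random.Random(seed)
    counts = {a: 0 for a in attrs}
    for _ in range(trials):
        # 1. Choose a favored setting for each variable.
        favored = {a: rng.randint(0, 1) for a in attrs}
        # 2. Reweight the sample toward the favored settings.
        weights = [math.prod(skew if x[a] == favored[a] else 1.0 - skew
                             for a in attrs) for x in examples]
        # 3-4. Count every attribute whose weighted gain clears the threshold.
        for a in attrs:
            if weighted_gain(examples, labels, weights, a) > min_gain:
                counts[a] += 1
    # 5. Split on the attribute with the highest count.
    return max(attrs, key=lambda a: counts[a])
```

On the XOR example from earlier slides, the relevant variables clear the threshold on every trial while the irrelevant one never does.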

Experiments
ID3 vs. ID3 with skewing (ID3 is used to avoid issues with parameters, pruning, etc.).
Synthetic propositional data: examples of 30 Boolean variables; target Boolean functions of 2-6 of these variables; both randomly chosen targets and randomly chosen hard targets.
UCI datasets (Perlich et al., JMLR 2003).
10-fold cross validation. Evaluation metric: weighted accuracy = average of the accuracy over positives and the accuracy over negatives.
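The weighted accuracy metric is straightforward to compute; a small illustration:

```python
def weighted_accuracy(y_true, y_pred):
    """Average of accuracy on positive examples and accuracy on negatives."""
    pos = [p == t for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p == t for t, p in zip(y_true, y_pred) if t == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
```

Unlike plain accuracy, this metric is insensitive to class imbalance: a classifier that always predicts the majority class scores only 0.5.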

Results
(Plots, not preserved in the transcript, compare weighted accuracy on random functions and on hard functions for Boolean targets of 3, 4, 5, and 6 variables.)

Current Shortcomings
Sensitive to noise and high-dimensional data. Very small signal on the hardest correlation immune functions (parity) given more than 3 relevant variables. Only very small gains on the real-world datasets attempted so far; perhaps few correlation immune functions arise in practice, or noise, dimensionality, and too few examples mask them.