BOAT - Optimistic Decision Tree Construction. Gehrke J., Ganti V., Ramakrishnan R., Loh W.

Problem
– Efficient construction of decision trees
– Make as few passes over the database as possible
– Use a sample of the dataset to gain insight into the full database

Motivation
Standard decision tree construction is not sufficient:
– For a tree of height h, it requires h passes over the entire database
– To include new data, the tree must be rebuilt from scratch
– For large databases, this is not feasible
A fast, scalable method is needed.

Intuition
Begin with a sample of the data (see the sampling sketch below):
– Build a decision tree on the sample
– For numeric attributes, use a confidence interval for the split point
Make a limited number of passes over the full data, both to verify the sampled tree and to construct the full tree:
– Only data that falls inside a confidence interval needs to be rescanned to determine how it propagates
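
The initial sample can be drawn in a single sequential scan of the database. A minimal sketch, assuming the data arrives as an iterable stream of records; the function and parameter names here are illustrative, not from the paper:

```python
import random

def reservoir_sample(record_stream, n, seed=0):
    """Keep a uniform random sample of n records while making exactly one
    sequential pass over the full dataset (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(record_stream):
        if i < n:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)   # record i survives with probability n / (i + 1)
            if j < n:
                reservoir[j] = record
    return reservoir
```

Reservoir sampling is just one way to obtain the sample; the key point is that it costs a single pass, after which the optimistic tree construction happens entirely in memory.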

Criteria selection
Use an impurity function to choose the attribute to split on (sketched below):
– Entropy, gini index, index of correlation
– It is calculated on the sample, so it could be wrong for the full dataset
Minimize the impurity function when selecting the splitting attribute and the confidence interval.
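
For reference, a minimal sketch of the two most common impurity measures named on this slide, computed from lists of class labels (illustrative Python, not the paper's code):

```python
import math
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

def split_impurity(left_labels, right_labels, impurity=gini):
    """Weighted impurity of a binary split; attribute selection picks the
    split that minimizes this quantity."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * impurity(left_labels) \
         + (len(right_labels) / n) * impurity(right_labels)
```

For example, `split_impurity(['a', 'a'], ['b', 'b'])` is 0.0 (a pure split), while `split_impurity(['a', 'b'], ['a', 'b'])` is 0.5 under gini; the value computed on the sample is only an estimate of the value on the full dataset.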

Confidence Interval
Construct T trees from bootstrap samples (agreement check sketched below):
– If, at a node n, the splitting attribute is not the same in all T trees, discard n and its subtree in all trees
– For a categorical attribute, if the splitting subset is not identical in all T trees, also remove node n and its subtree in all trees
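
A sketch of the agreement check across the T bootstrap trees; the `Node` class and its fields are assumptions made for illustration, not the paper's data structures:

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass
class Node:
    split_attr: str                            # attribute this node splits on
    split_subset: Optional[FrozenSet] = None   # value subset, for a categorical split
    split_point: Optional[float] = None        # threshold, for a numeric split

def agreed_split(nodes):
    """Given the node at the same position in each of the T bootstrap trees,
    keep it only if every tree splits on the same attribute and, for a
    categorical attribute, on the identical value subset.  Returning None
    means the node and its subtree are discarded in all trees."""
    first = nodes[0]
    for node in nodes[1:]:
        if node.split_attr != first.split_attr:
            return None
        if first.split_subset is not None and node.split_subset != first.split_subset:
            return None
    return first
```

Nodes that survive this check form the coarse, optimistically constructed tree; a discarded node leaves a gap that is filled later, when its portion of the data is rescanned.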

Confidence Interval
– The confidence interval for a numeric attribute is determined by the range of split points across the T trees (see the sketch below)
– The exact split point is likely to lie between the minimum and maximum of those T split points
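
A minimal sketch of deriving the interval and of the rescan filter it implies; records are assumed here to be dicts keyed by attribute name, and the names are illustrative:

```python
def numeric_confidence_interval(split_points):
    """Range spanned by this node's split points across the T bootstrap trees;
    the exact split point on the full data is expected to fall inside it."""
    return min(split_points), max(split_points)

def tuples_to_rescan(records, attr, interval):
    """Only records whose value of attr falls inside the confidence interval
    need to be kept for the exact split computation during the full scan."""
    lo, hi = interval
    return [record for record in records if lo <= record[attr] <= hi]
```

For instance, if the T trees split a hypothetical attribute `age` at 30.0, 31.5, and 32.2, the interval is (30.0, 32.2), and only tuples with `age` in that range need to be buffered during the scan of the full database.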

Verification
Verifying the predictions (a rough sketch of the cleanup scan follows):
– Use a lower bound on the impurity function to determine whether the confidence interval and splitting attribute are correct
– If they are not, discard the node and its subtree completely
Rerun the algorithm on the data associated with any discarded node.
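
A rough sketch of the cleanup scan for a single numeric node, assuming gini as the impurity measure; the function and field names (`exact_best_split_in_interval`, `attr`, `"label"`) are assumptions, and the paper's lower-bound test, which rules out better splits outside the interval, is not shown:

```python
from collections import Counter

def gini_from_counts(counts):
    """Gini index computed from a Counter of class frequencies."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

def exact_best_split_in_interval(scan, attr, interval, label_key="label"):
    """One pass over this node's tuples: tuples outside the confidence interval
    only update aggregate class counts, while tuples inside are buffered so
    that every candidate threshold inside the interval can be evaluated
    exactly."""
    lo, hi = interval
    left_counts, right_counts = Counter(), Counter()
    inside = []
    for record in scan:
        value, label = record[attr], record[label_key]
        if value < lo:
            left_counts[label] += 1        # left of any split in the interval
        elif value > hi:
            right_counts[label] += 1       # right of any split in the interval
        else:
            inside.append((value, label))  # buffered for exact evaluation
    inside.sort()
    right_counts.update(label for _, label in inside)
    total = sum(left_counts.values()) + sum(right_counts.values())
    best_score, best_threshold = float("inf"), None
    # Sweep candidate thresholds through the interval, moving buffered tuples
    # from the right-hand side to the left-hand side one at a time.
    for value, label in inside:
        left_counts[label] += 1
        right_counts[label] -= 1
        n_left = sum(left_counts.values())
        score = (n_left / total) * gini_from_counts(left_counts) \
              + ((total - n_left) / total) * gini_from_counts(right_counts)
        if score < best_score:
            best_score, best_threshold = score, value
    return best_threshold, best_score
```

If the exact split found this way, combined with the lower-bound check, confirms the sampled prediction, the node is finalized; otherwise it is discarded and rebuilt from the data that was set aside for it.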

Invalidated Predictions
Discarding a node near the top of the tree would force resampling of the entire database:
– No savings on full scans in that case
– This does not usually happen
– The sample is likely to capture the basic probability distribution
– Errors tend to occur at the detailed (low) levels of the tree

Dynamic Environments
– No need to rebuild the decision tree frequently
– Store the confidence intervals
– The tree only needs to be rebuilt if the underlying probability distribution changes

Experimental Results
– Trees were built from a tuple sample; 20 trees were grown on tuples drawn from the sample pool
– Datasets of 1.5 million tuples
– Outperforms the brute-force method by a factor of 2 to 3

Experimental Results
Robust to noise:
– Noise affects the detail-level probability distribution
– It affected the lower levels of the tree, requiring rescans of only small amounts of data
Dynamically updating data:
– BOAT is much faster than the brute-force approach

Weak Points
May not be as useful on complex probability distributions:
– A failure at a high level of the tree means that most of the tree is discarded
The hypotheses it generates are no more expressive than those of regular decision trees:
– It is simply a way to speed up their construction

Suggested Improvements
Use clustering to provide a better sample to draw from:
– Groups of data points with a measure of their frequency of occurrence
– This would give better samples of the data and of its underlying probability distribution

Suggested Improvements
Extremely large datasets:
– For TB+ datasets, even two or three passes over the database may be too many
– Use MCMC to draw many different samples
– Estimate the probability density function by resampling
This would not guarantee the accuracy of the tree.

Conclusion
– An effective way to build scalable decision trees
– Much faster than the standard method
– Useful for large datasets