A General Framework for Mining Massive Data Streams Geoff Hulten Advised by Pedro Domingos

Mining Massive Data Streams
High-speed data streams abundant
–Large retailers
–Long distance & cellular phone call records
–Scientific projects
–Large Web sites
Build a model of the process creating the data
Use the model to interact more efficiently

Growing Mismatch Between Algorithms and Data
State-of-the-art data mining algorithms
–One-shot learning
–Work with static databases
–Maximum of 1 million – 10 million records
Properties of data streams
–Data stream exists over months or years
–10s – 100s of millions of new records per day
–Process generating the data changes over time

The Cost of This Mismatch
Fraction of data we can effectively mine is shrinking towards zero
Models learned from heuristically selected samples of data
Models out of date before being deployed

Need New Algorithms
Monitor a data stream and have a model available at all times
Improve the model as data arrives
Adapt the model as the process generating the data changes
Have quality guarantees
Work within strict resource constraints

Solution: General Framework
Applicable to algorithms based on discrete search
Semi-automatically converts an algorithm to meet our design needs
Uses sampling to select the data size for each search step
Extensions to continuous searches and relational data

Outline
Introduction
Scaling up Decision Trees
Our Framework for Scaling
Other Applications and Results
Conclusion

Decision Trees
Nodes contain tests
Leaves contain predictions
(Figure: a table of example records and the tree that encodes them: Gender? – Male → False; Female → Age? – < 25 → False, >= 25 → True)

Decision Tree Induction
DecisionTree(Data D, Tree T, Attributes A)
  If D is pure
    Let T be a leaf predicting the class in D
    Return
  Let X be the best of A according to D and G()
  Let T be a node that splits on X
  For each value V of X
    Let D^ be the portion of D with value V for X
    Let T^ be the child of T for V
    DecisionTree(D^, T^, A – X)
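As a runnable illustration of this recursion (a minimal sketch, not the authors' code: entropy-based information gain stands in for G(), attributes are column indices, and leaves are plain class labels):

    import math
    from collections import Counter

    def entropy(labels):
        counts = Counter(labels)
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def gain(data, labels, attr):
        # Information gain G() of splitting on attribute index attr.
        by_value = {}
        for x, y in zip(data, labels):
            by_value.setdefault(x[attr], []).append(y)
        remainder = sum(len(ys) / len(labels) * entropy(ys) for ys in by_value.values())
        return entropy(labels) - remainder

    def decision_tree(data, labels, attributes):
        if len(set(labels)) == 1:                      # D is pure
            return labels[0]                           # leaf predicting the class in D
        if not attributes:                             # no attributes left to test
            return Counter(labels).most_common(1)[0][0]
        best = max(attributes, key=lambda a: gain(data, labels, a))
        node = {"test": best, "children": {}}          # T splits on X
        for v in {x[best] for x in data}:              # for each value V of X
            idx = [i for i, x in enumerate(data) if x[best] == v]
            node["children"][v] = decision_tree([data[i] for i in idx],
                                                [labels[i] for i in idx],
                                                [a for a in attributes if a != best])
        return node

Internal nodes come out as dicts mapping each attribute value to a subtree, mirroring the "nodes contain tests, leaves contain predictions" picture above.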

VFDT (Very Fast Decision Tree)
To pick the split attribute for a node, looking at a few examples may be sufficient
Given a stream of examples:
–Use the first examples to pick the split at the root
–Sort succeeding ones to the leaves
–Pick the best attribute there
–Continue…
Leaves predict the most common class
A very fast, incremental, anytime decision tree induction algorithm

How Much Data?
Make sure the best attribute is better than the second best
–That is: ΔG = G(Xa) – G(Xb) > 0, where Xa and Xb are the best and second-best attributes
Using a sample, so need the Hoeffding bound
–After n observations of a variable with range R, the true mean is within ε = sqrt( R² ln(1/δ) / (2n) ) of the sample mean, with probability 1 – δ
Collect data till the observed ΔG > ε
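For concreteness, a small sketch of that test (the ε formula is the standard Hoeffding bound; the gain numbers below are made-up placeholders):

    import math

    def hoeffding_epsilon(R, delta, n):
        # epsilon = sqrt(R^2 * ln(1/delta) / (2n))
        return math.sqrt(R ** 2 * math.log(1.0 / delta) / (2.0 * n))

    R = math.log2(2)                 # range of information gain with 2 classes
    delta = 1e-7                     # allowed error probability per decision
    n = 5000                         # examples seen at this leaf so far
    eps = hoeffding_epsilon(R, delta, n)

    g_best, g_second = 0.32, 0.25    # placeholder gain estimates
    if g_best - g_second > eps:
        print("split now: the best attribute wins with probability 1 - delta")
    else:
        print("not confident yet: keep collecting examples")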

Core VFDT Algorithm
Procedure VFDT(Stream, δ)
  Let T = tree with a single leaf (the root)
  Initialize sufficient statistics at the root
  For each example (X, y) in Stream
    Sort (X, y) to a leaf using T
    Update sufficient statistics at the leaf
    Compute G for each attribute
    If G(best) – G(2nd best) > ε, then
      Split the leaf on the best attribute
      For each branch
        Start a new leaf, init its sufficient statistics
  Return T
(Figure: example tree: x1? – male → y=0; female → x2? – > 65 → y=0, <= 65 → y=1)
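A compact runnable sketch of this procedure for nominal attributes (illustrative only: the nested-dict tree, the per-leaf statistics layout, and checking for a split after every example are my simplifications; the real system checks only periodically and manages memory as described later):

    import math
    from collections import defaultdict

    def entropy(counts):
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

    class Leaf:
        def __init__(self, n_attrs):
            # stats[attr][value][label] -> count of examples seen at this leaf
            self.stats = [defaultdict(lambda: defaultdict(int)) for _ in range(n_attrs)]
            self.class_counts = defaultdict(int)
            self.n = 0

        def gain(self, attr):
            base = entropy(self.class_counts)
            return base - sum(sum(lbl.values()) / self.n * entropy(lbl)
                              for lbl in self.stats[attr].values())

    def vfdt(stream, n_attrs, delta=1e-7, R=1.0):
        tree = {"leaf": Leaf(n_attrs)}          # internal nodes become {"attr": i, "children": {...}}
        for x, y in stream:
            node = tree
            while "leaf" not in node:           # sort the example to a leaf using T
                node = node["children"].setdefault(x[node["attr"]], {"leaf": Leaf(n_attrs)})
            leaf = node["leaf"]
            leaf.n += 1                         # update sufficient statistics at the leaf
            leaf.class_counts[y] += 1
            for a in range(n_attrs):
                leaf.stats[a][x[a]][y] += 1
            gains = sorted(leaf.gain(a) for a in range(n_attrs))
            eps = math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * leaf.n))
            if len(gains) >= 2 and gains[-1] - gains[-2] > eps:
                best = max(range(n_attrs), key=leaf.gain)
                node.pop("leaf")                # split the leaf on the best attribute
                node["attr"] = best
                node["children"] = {v: {"leaf": Leaf(n_attrs)} for v in leaf.stats[best]}
        return tree

Calling vfdt with a generator of (attribute-tuple, label) pairs returns the current tree whenever the stream is exhausted or paused, matching the anytime behaviour described above.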

Quality of Trees from VFDT
The model may contain incorrect splits – is it still useful?
Bound the difference with the infinite-data tree
–Chance that an arbitrary example takes a different path
Intuition: an example at level i of the tree has i chances to go through a mistaken node

Complete VFDT System
Memory management
–Memory dominated by sufficient statistics
–Deactivate less promising leaves when needed
Ties
–Wasteful to decide between identical attributes
Check for splits periodically
Pre-pruning
–Only make splits that improve the value of G(.)
–Early stop on bad attributes

VFDT (Continued)
Bootstrap with a traditional learner
Rescan the dataset when time is available
Time-changing data streams
Post-pruning
Continuous attributes
Batch mode

Experiments
Compared VFDT and C4.5 (Quinlan, 1993)
Same memory limit for both (40 MB)
–100k examples for C4.5
VFDT settings: δ = 10^-7, τ = 5%
Domains: 2 classes, 100 binary attributes
Fifteen synthetic trees with 2.2k – 500k leaves
Noise from 0% to 30%

Running Times
Pentium III at 500 MHz running Linux
C4.5 takes 35 seconds to read and process 100k examples; VFDT takes 47 seconds
VFDT takes 6377 seconds for 20 million examples: 5752 s to read, 625 s to process
VFDT processes 32k examples per second (excluding I/O)

Real World Data Sets: Trace of UW Web Requests
Stream of Web page requests from UW
One week: 23k clients, 170 orgs., 244k hosts, 82.8M requests (peak: 17k/min), 20 GB
Goal: improve caching by predicting requests
1.6M examples, 61% default class
C4.5 on 75k examples: 2975 secs, 73.3% accuracy
VFDT: ~3000 secs, 74.3% accuracy

Outline
Introduction
Scaling up Decision Trees
Our Framework for Scaling
Overview of Applications and Results
Conclusion

Data Mining as Discrete Search
Initial state
–Empty, prior, or random
Search operators
–Refine structure
Evaluation function
–Likelihood, or many others
Goal state
–Local optimum, etc.

Data Mining as Search
(Figure: search states expanded and evaluated against the training data)

Example: Decision Tree
(Figure: candidate trees expanding tests X1? … Xd?, evaluated against the training data)
Initial state
–Root node
Search operators
–Turn any leaf into a test on an attribute
Evaluation
–Entropy reduction
Goal state
–No further gain
–Post-prune

Overview of Framework
Cast the learning algorithm as a search
Begin monitoring the data stream
–Use each example to update sufficient statistics where appropriate (then discard it)
–Periodically pause and use statistical tests
Take steps that can be made with high confidence
Monitor old search decisions
–Change them when the data stream changes
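A rough Python sketch of that loop, independent of any particular model (everything here – candidates, update_stats, score, apply – is a placeholder interface I am assuming, not the authors' API; scores must be averages over examples so the Hoeffding bound applies):

    import math

    def hoeffding_bound(R, delta, n):
        # With probability 1 - delta, a sample mean of n observations of a
        # variable with range R is within this epsilon of the true mean.
        return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

    def mine_stream(stream, state, candidates, update_stats, score,
                    delta=1e-7, check_every=1000, R=1.0):
        # Generic framework loop: use each example to update sufficient
        # statistics, then discard it; periodically pause and take only the
        # search steps that the statistical test supports with high confidence.
        n = 0
        for example in stream:
            update_stats(state, example)
            n += 1
            if n % check_every == 0:
                ranked = sorted(candidates(state), key=lambda c: score(state, c))
                if len(ranked) >= 2:
                    best, runner_up = ranked[-1], ranked[-2]
                    eps = hoeffding_bound(R, delta, n)
                    if score(state, best) - score(state, runner_up) > eps:
                        state = best.apply(state)   # step taken with high confidence
        return state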

How Much Data is Enough?
(Figure: competing search steps X1? … Xd? scored on the full training data)

How Much Data is Enough?
(Figure: the same steps X1? … Xd? scored on a sample of the data, e.g. 1.6 ± ε vs. 1.4 ± ε)
Use statistical bounds
–Normal distribution
–Hoeffding bound
Applies to scores that are averages over examples
Can select a winner if
–Score1 > Score2 + ε

Global Quality Guarantee
δ – probability of error in a single decision
b – branching factor of the search
d – depth of the search
c – number of checks for a winner
δ* = δ · b · d · c
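For a rough feel of the scaling (numbers made up for illustration, not from the talk): with δ = 10^-7 per decision, b = 100 candidate operators, search depth d = 20, and c = 1000 checks, δ* = 10^-7 · 100 · 20 · 1000 = 0.2, so every decision in the search is simultaneously correct with probability at least 0.8.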

Identical States and Ties
The test fails if states are identical (or nearly so)
τ – user-supplied tie parameter
Select a winner early if the alternatives differ by less than τ
–Score1 > Score2 + ε, or
–ε <= τ
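A tiny sketch of the combined test (names and the default τ are illustrative, carried over from the Hoeffding snippet above):

    def select_winner(score1, score2, eps, tau=0.05):
        # Take the step if the lead exceeds eps, or if eps has shrunk below the
        # tie threshold tau (the alternatives are effectively interchangeable).
        return score1 - score2 > eps or eps <= tau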

Dealing with Time-Changing Concepts
Maintain a window of the most recent examples
Keep the model up to date with this window
Effective when the window size is similar to the concept drift rate
Traditional approach
–Periodically reapply the learner
–Very inefficient!
Our approach
–Monitor the quality of old decisions as the window shifts
–Correct decisions in a fine-grained manner
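A minimal sketch of the sliding-window bookkeeping (my own simplification of the idea, not the authors' code): every example updates the sufficient statistics when it enters the window and un-updates them when it falls out.

    from collections import deque, defaultdict

    class WindowedStats:
        # Sufficient statistics maintained over a sliding window of the stream.
        def __init__(self, window_size):
            self.window_size = window_size
            self.window = deque()                    # examples currently in the window
            self.counts = defaultdict(int)           # (attr, value, label) -> count

        def add(self, x, y):
            for a, v in enumerate(x):
                self.counts[(a, v, y)] += 1
            self.window.append((x, y))
            if len(self.window) > self.window_size:  # forget the oldest example
                old_x, old_y = self.window.popleft()
                for a, v in enumerate(old_x):
                    self.counts[(a, v, old_y)] -= 1

CVFDT keeps such counts along each example's path through the tree, periodically re-checks old split decisions against the windowed counts, and grows an alternate sub-tree when a different attribute now looks better (next slide).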

Alternate Searches
When a new test looks better, grow an alternate sub-tree
Replace the old sub-tree when the new one is more accurate
This smoothly adjusts to changing concepts
(Figure: example tree with tests Gender?, Pets?, College?, Hair? and true/false leaves, with an alternate sub-tree being grown alongside the original)

RAM Limitations
Each search requires a sufficient statistics structure
Decision tree
–O(avc) RAM (a attributes, v values per attribute, c classes)
Bayesian network
–O(c^p) RAM

RAM Limitations
(Figure: statistics structures marked Active vs. Temporarily inactive; less promising ones are deactivated to save RAM)

Outline
Introduction
Data Mining as Discrete Search
Our Framework for Scaling
Application to Decision Trees
Other Applications and Results
Conclusion

Applications
VFDT (KDD ’00) – decision trees
CVFDT (KDD ’01) – VFDT + concept drift
VFBN & VFBN2 (KDD ’02) – Bayesian networks
Continuous searches
–VFKM (ICML ’01) – k-means clustering
–VFEM (NIPS ’01) – EM for mixtures of Gaussians
Relational data sets
–VFREL (submitted) – feature selection in relational data

CVFDT Experiments

Activity Profile for VFBN

Other Real World Data Sets
Trace of all web requests from the UW campus
–Use clustering to find good locations for proxy caches
KDD Cup 2000 data set
–700k page requests from an e-commerce site
–Categorize pages into 65 categories, predict which a session will visit
UW CSE data set
–8 million sessions over two years
–Predict which of 80 level-2 directories each session visits
Web crawl of .edu sites
–Two data sets, each with two million web pages
–Use relational structure to predict which pages will increase in popularity over time

Related Work
DB Mine: A Performance Perspective (Agrawal, Imielinski & Swami ’93)
–Framework for scaling rule learning
RainForest (Gehrke, Ramakrishnan & Ganti ’98)
–Framework for scaling decision trees
ADtrees (Moore & Lee ’97)
–Accelerate computing sufficient statistics
PALO (Greiner ’92)
–Accelerate hill-climbing search via sampling
DEMON (Ganti, Gehrke & Ramakrishnan ’00)
–Framework for converting incremental algorithms for time-changing data streams

Future Work
Combine the framework for discrete search with frameworks for continuous search and relational learning
Further study time-changing processes
Develop a language for specifying data stream learning algorithms
Use the framework to develop novel algorithms for massive data streams
Apply the algorithms to more real-world problems

Conclusion
The framework helps scale up learning algorithms based on discrete search
Resulting algorithms:
–Work on databases and data streams
–Work with limited resources
–Adapt to time-changing concepts
–Learn in time proportional to concept complexity, independent of the amount of training data!
Benefits have been demonstrated in a series of applications