1 Mining Decision Trees from Data Streams. Tong Suk Man Ivy, CSIS DB Seminar, HKU, February 12, 2003.

1 Mining Decision Trees from Data Streams Thanks: Tong Suk Man Ivy HKU

2 Contents Introduction: problems in mining data streams Classification of stream data VFDT algorithm Window approach CVFDT algorithm Experimental results Conclusions Future work

3 Data Streams Characteristics Large volume of ordered data points, possibly infinite Arrive continuously Fast changing Appropriate model for many applications: Phone call records Network and security monitoring Financial applications (stock exchange) Sensor networks

4 Problems in Mining Data Streams Traditional data mining techniques usually require Entire data set to be present Random access (or multiple passes) to the data Much time per data item Challenges of stream mining Impractical to store the whole data Random access is expensive Simple calculation per data due to time and space constraints

5 Processing Data Streams: Motivation A growing number of applications generate streams of data Performance measurements in network monitoring and traffic management Call detail records in telecommunications Transactions in retail chains, ATM operations in banks Log records generated by Web Servers Sensor network data Application characteristics Massive volumes of data (several terabytes) Records arrive at a rapid rate Goal: Mine patterns, process queries and compute statistics on data streams in real-time (from VLDB’02 Tutorial)

6 Data Streams: Computation Model A data stream is a (massive) sequence of elements: Stream processing requirements Single pass: Each record is examined at most once Bounded storage: Limited Memory (M) for storing synopsis Real-time: Per record processing time (to maintain synopsis) must be low Stream Processing Engine (Approximate) Answer Synopsis in Memory Data Streams
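The three requirements on this slide (single pass, bounded storage, low per-record time) can be illustrated with a minimal synopsis. This is an assumed example, not from the slides: a one-pass, O(1)-memory summary of a numeric stream.

```python
class StreamSynopsis:
    """Maintains count, mean, min and max of a numeric stream in O(1) space."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.lo = float("inf")
        self.hi = float("-inf")

    def update(self, x):
        # Each record is examined exactly once, in O(1) time.
        self.n += 1
        self.mean += (x - self.mean) / self.n   # running (Welford-style) mean
        self.lo = min(self.lo, x)
        self.hi = max(self.hi, x)

synopsis = StreamSynopsis()
for x in [4, 8, 15, 16, 23, 42]:
    synopsis.update(x)
```

Whatever the query engine returns is computed from this small synopsis alone; the raw stream is never stored.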

7 Network Management Application Network Management involves monitoring and configuring network hardware and software to ensure smooth operation Monitor link bandwidth usage, estimate traffic demands Quickly detect faults, congestion and isolate root cause Load balancing, improve utilization of network resources Network Operations Center Network Measurements Alarms (from VLDB’02 Tutorial)

8 IP Network Measurement Data IP session data (collected using Cisco NetFlow) AT&T collects 100 GB of NetFlow data per day! [Table: sample NetFlow records with Source, Destination, Duration, Bytes (KB) and Protocol columns; the numeric values did not survive extraction, but the protocols shown include http and ftp] (from VLDB'02 Tutorial)

9 Network Data Processing Traffic estimation How many bytes were sent between a pair of IP addresses? What fraction network IP addresses are active? List the top 100 IP addresses in terms of traffic Traffic analysis What is the average duration of an IP session? What is the median of the number of bytes in each IP session? Fraud detection List all sessions that transmitted more than 1000 bytes Identify all sessions whose duration was more than twice the normal Security/Denial of Service List all IP addresses that have witnessed a sudden spike in traffic Identify IP addresses involved in more than 1000 sessions (from VLDB’02 Tutorial)

10 Data Stream Processing Algorithms Generally, algorithms compute approximate answers Difficult to compute answers accurately with limited memory Approximate answers - Deterministic bounds Algorithms compute only an approximate answer, but with deterministic bounds on the error Approximate answers - Probabilistic bounds Algorithms compute an approximate answer with high probability With probability at least 1 - δ, the computed answer is within a factor (1 ± ε) of the actual answer Single-pass algorithms for processing streams are also applicable to (massive) terabyte databases! (from VLDB'02 Tutorial)

11 Classification of Stream Data VFDT algorithm: "Mining High-Speed Data Streams", KDD 2000. Pedro Domingos, Geoff Hulten CVFDT algorithm (window approach): "Mining Time-Changing Data Streams", KDD 2001. Geoff Hulten, Laurie Spencer, Pedro Domingos

12 Hoeffding Trees

13 Definitions A classification problem is defined as: N is a set of training examples of the form (x, y) x is a vector of d attributes y is a discrete class label Goal: produce from the examples a model y = f(x) that predicts the class y for future examples x with high accuracy

14 Decision Tree Learning One of the most effective and widely-used classification methods Induces models in the form of decision trees Each internal node contains a test on an attribute Each branch from a node corresponds to a possible outcome of the test Each leaf contains a class prediction A decision tree is learned by recursively replacing leaves with test nodes, starting at the root [Figure: a small decision tree splitting first on Age < 30?, then on Car Type = Sports Car?]

15 Challenges Classic decision tree learners assume all training data can be simultaneously stored in main memory Disk-based decision tree learners repeatedly read training data from disk sequentially Prohibitively expensive when learning complex trees Goal: design decision tree learners that read each example at most once, and use a small constant time to process it

16 Key Observation In order to find the best attribute at a node, it may be sufficient to consider only a small subset of the training examples that pass through that node. Given a stream of examples, use the first ones to choose the root attribute. Once the root attribute is chosen, subsequent examples are passed down to the corresponding leaves and used to choose the split attributes there, recursively. Use the Hoeffding bound to decide how many examples are enough at each node

17 Hoeffding Bound Consider a random variable a whose range is R Suppose we have n independent observations of a, with observed mean ā The Hoeffding bound states: with probability 1 - δ, the true mean of a is at least ā - ε, where ε = sqrt( R² ln(1/δ) / (2n) )
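The bound on this slide translates directly into code. A small sketch (function name is illustrative): for a variable with range R, after n observations, the observed mean is within ε of the true mean with probability 1 - δ.

```python
import math

def hoeffding_eps(R, delta, n):
    """Hoeffding bound: with probability 1 - delta, the true mean of a
    range-R variable is within eps of the observed mean of n samples."""
    return math.sqrt((R * R * math.log(1.0 / delta)) / (2.0 * n))

# e.g. information gain with 2 classes has range R = log2(2) = 1
eps = hoeffding_eps(R=1.0, delta=1e-7, n=1000)
```

Note that ε shrinks as 1/sqrt(n): quadrupling the number of examples halves the uncertainty.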

18 How many examples are enough? Let G(Xi) be the heuristic measure used to choose test attributes (e.g. information gain, Gini index) Xa: the attribute with the highest evaluation value after seeing n examples Xb: the attribute with the second highest evaluation value after seeing n examples Let ΔG = G(Xa) - G(Xb) ≥ 0 Given a desired δ, if ΔG > ε after seeing n examples at a node, the Hoeffding bound guarantees that the true ΔG > 0, with probability 1 - δ This node can then be split using Xa, and the succeeding examples are passed to the new leaves

19 Algorithm Calculate the information gain for the attributes and determine the best two attributes Pre-pruning: also consider a "null" attribute that corresponds to not splitting the node At each node, check the split condition given by the Hoeffding bound If the condition is satisfied, create child nodes based on the test at the node If not, stream in more examples and repeat the calculation until the condition is satisfied
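The split test described above can be sketched as a single predicate (a hedged sketch; the surrounding leaf bookkeeping is omitted): given the two best attribute scores at a leaf, split only when their difference exceeds the Hoeffding ε.

```python
import math

def should_split(g_best, g_second, n, R=1.0, delta=1e-7):
    """Hoeffding-tree split test: split when the observed gain difference
    between the best and second-best attributes exceeds eps for n examples."""
    eps = math.sqrt((R * R * math.log(1.0 / delta)) / (2.0 * n))
    return (g_best - g_second) > eps
```

When the test fails, the leaf simply keeps accumulating examples until ε shrinks enough to separate the two candidates.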

20 [Figure: as the data stream flows in, the tree grows: first a single split on Age < 30?, then a further split on Car Type = Sports Car? under one of its branches]

21 Performance Analysis p: probability that an example passed through the tree to level i falls into a leaf at that point The expected disagreement between the tree produced by the Hoeffding tree algorithm and the tree produced using infinite examples at each node is no greater than δ/p Required memory: O(leaves * attributes * values * classes)
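A back-of-envelope check of the memory bound above, with assumed (hypothetical) sizes: the sufficient statistics are one counter per leaf, attribute, attribute value and class.

```python
# Assumed example sizes; the 4-byte counter size is an assumption too.
leaves, attributes, values, classes = 1000, 100, 2, 2
counters = leaves * attributes * values * classes   # sufficient statistics
bytes_needed = counters * 4                          # ~1.6 MB at 4 bytes each
```

The key point the slide makes survives the arithmetic: memory grows with the tree, not with the number of examples seen.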

22 VFDT

23 VFDT (Very Fast Decision Tree) A decision-tree learning system based on the Hoeffding tree algorithm Ties: split on the current best attribute once ε falls below a user-specified threshold, since it is wasteful to keep deciding between nearly identical attributes Compute G and check for splits only periodically Memory management Memory is dominated by the sufficient statistics Deactivate or drop less promising leaves when needed Bootstrap with a traditional learner Rescan old data when time is available
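The refinements listed above extend the basic split test. A sketch (parameter names follow the paper's δ, τ and n_min, but the function itself is illustrative): ties between near-identical attributes are broken once ε < τ, and the test is only run after at least n_min examples.

```python
import math

def vfdt_should_split(g_best, g_second, n, R=1.0, delta=1e-7,
                      tau=0.05, n_min=200):
    """VFDT split test: the Hoeffding condition, plus tie-breaking via tau,
    with G re-evaluated only after n_min examples (modelled as n >= n_min)."""
    if n < n_min:
        return False
    eps = math.sqrt((R * R * math.log(1.0 / delta)) / (2.0 * n))
    return (g_best - g_second) > eps or eps < tau
```

With two nearly identical attributes, the plain Hoeffding test would wait forever; the τ clause splits once ε itself becomes negligible.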

24 VFDT (2) Scales better than pure memory-based or pure disk-based learners Accesses data sequentially Uses subsampling to potentially require much less than one scan VFDT is incremental and anytime New examples can be quickly incorporated as they arrive A usable model is available after the first few examples and is then progressively refined

25 Experiment Results (VFDT vs. C4.5) Compared VFDT and C4.5 (Quinlan, 1993) Same memory limit for both (40 MB) 100k examples for C4.5 VFDT settings: δ = 10^-7, τ = 5%, n_min = 200 Domains: 2 classes, 100 binary attributes Fifteen synthetic trees with 2.2k - 500k leaves Noise from 0% to 30%

26 Experiment Results Accuracy as a function of the number of training examples

27 Experiment Results Tree size as a function of number of training examples

28 Mining Time-Changing Data Streams Most KDD systems, including VFDT, assume the training data is a sample drawn from a stationary distribution Most large databases and data streams violate this assumption Concept drift: data is generated by a time-changing concept function, e.g. Seasonal effects Economic cycles Goals: mine continuously changing data streams, and scale well

29 Window Approach Common approach: when a new example arrives, reapply a traditional learner to a sliding window of the w most recent examples Sensitive to window size If w is small relative to the concept shift rate, this assures the availability of a model reflecting the current concept, but too small a w may leave insufficient examples to learn the concept If examples arrive at a rapid rate or the concept changes quickly, the computational cost of reapplying the learner may be prohibitively high
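The naive approach this slide critiques can be sketched to make the cost visible (an assumed toy example: the "learner" here is just a majority-class vote, and the names are illustrative). Relearning from scratch on every arrival costs at least O(w) per example, which is exactly what CVFDT avoids.

```python
from collections import deque

def sliding_window_stream(stream, w, learn):
    """Keep the w most recent examples; relearn from scratch on each arrival."""
    window = deque(maxlen=w)          # oldest example drops out automatically
    models = []
    for example in stream:
        window.append(example)
        models.append(learn(list(window)))   # full relearn: the O(w) part
    return models

def majority(examples):
    """Toy 'learner': predict the majority class over the window."""
    labels = [y for _, y in examples]
    return max(set(labels), key=labels.count)

models = sliding_window_stream(
    [(0, "a"), (0, "a"), (0, "b"), (0, "b"), (0, "b")], w=3, learn=majority)
```

As the window slides past the concept change, the model flips from "a" to "b"; with a real tree learner and a large w, each of those relearn calls would be prohibitively expensive.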

30 CVFDT

31 CVFDT CVFDT (Concept-adapting Very Fast Decision Tree learner) Extends VFDT Maintains VFDT's speed and accuracy Detects and responds to changes in the example-generating process

32 Observations With a time-changing concept, the current splitting attribute of some nodes may no longer be the best An outdated subtree may still be better than the best single leaf, particularly if it is near the root Grow an alternate subtree with the new best attribute at its root when the old attribute seems out-of-date Periodically use a batch of examples to evaluate the quality of the trees Replace the old subtree when the alternate becomes more accurate

33 CVFDT algorithm Alternate trees for each node in HT start empty Process examples from the stream indefinitely; for each example (x, y): Pass (x, y) down to a set of leaves using HT and all alternate trees of the nodes (x, y) passes through Add (x, y) to the sliding window of examples If the sliding window overflows, remove the oldest example and forget its effect CVFDTGrow Run CheckSplitValidity if f examples have been seen since the last check of alternate trees Return HT
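The loop above can be sketched as a skeleton (class and attribute names are hypothetical; the actual tree-growing and validity-checking steps are stubbed out, so only the window and scheduling bookkeeping runs).

```python
from collections import deque

class CVFDTSkeleton:
    def __init__(self, window_size, f):
        self.window = deque()
        self.window_size = window_size
        self.f = f                 # check alternate trees every f examples
        self.seen = 0
        self.checks = 0            # times CheckSplitValidity was invoked

    def process(self, example):
        self.seen += 1
        # 1. Pass (x, y) down HT and all alternate trees (stubbed).
        # 2. Add to the sliding window; forget the oldest on overflow.
        self.window.append(example)
        if len(self.window) > self.window_size:
            self.window.popleft()   # forgetExample would decrement stats here
        # 3. CVFDTGrow (stubbed), then the periodic validity check.
        if self.seen % self.f == 0:
            self.checks += 1

learner = CVFDTSkeleton(window_size=100, f=50)
for i in range(250):
    learner.process((i, i % 2))
```

The skeleton shows the shape of the control flow: constant work per example, a bounded window, and a check that fires every f examples rather than on each arrival.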

34 CVFDT algorithm: process each example [Flowchart: Read new example → Pass example down to leaves → Add example to sliding window → (Window overflow? Yes: forget oldest example) → CVFDTGrow → (f examples since last check? Yes: CheckSplitValidity) → loop]


36 CVFDTGrow For each node reached by the example in HT: Increment the corresponding statistics at the node For each alternate tree T_alt of the node, call CVFDTGrow on T_alt If enough examples have been seen at the leaf in HT that the example reaches: Choose the attribute with the highest average value of the attribute evaluation measure (information gain or Gini index) If the best attribute is not the "null" attribute, create a child node for each possible value of this attribute


38 Forget old examples Maintain the sufficient statistics at every node in HT, to monitor the validity of its previous decisions (VFDT only maintains such statistics at leaves) HT might have grown or changed since the example was initially incorporated, so each node is assigned a unique, monotonically increasing ID as it is created forgetExample(HT, example, maxID): For each node reached by the old example whose node ID is no larger than the maximum leaf ID the example reached: Decrement the corresponding statistics at the node For each alternate tree T_alt of the node, call forgetExample(T_alt, example, maxID)
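The ID-based forgetting rule can be sketched on a toy node structure (the Node class and single-child layout are hypothetical simplifications): statistics are decremented only at nodes whose ID is no larger than the maximum ID the example saw when it was incorporated, so nodes created later, which never counted this example, are left untouched.

```python
from collections import Counter

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.stats = Counter()     # per-class counts (sufficient statistics)
        self.alternates = []       # alternate subtrees rooted at this node
        self.child = None          # a single child is enough for a sketch

def forget_example(node, label, max_id):
    """Undo an old example's increments at every node it could have reached."""
    if node is None or node.node_id > max_id:
        return                     # node created after the example: skip
    node.stats[label] -= 1
    for alt in node.alternates:
        forget_example(alt, label, max_id)
    forget_example(node.child, label, max_id)

root = Node(0)
root.child = Node(1)
root.child.child = Node(5)         # ID 5: created after the old example
for n in (root, root.child, root.child.child):
    n.stats["pos"] = 3
forget_example(root, "pos", max_id=1)
```

After the call, the two old nodes each lose one count while the newer node (ID 5) keeps all three, matching the slide's rule.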


40 CheckSplitValidity Periodically scans the internal nodes of HT Starts a new alternate tree when a new winning attribute is found Uses tighter criteria than the original split test, to avoid excessive alternate tree creation Limits the total number of alternate trees

41 Smoothly adjust to concept drift Alternate trees are grown the same way HT is Periodically, each node with non-empty alternate trees enters a testing mode: M training examples are used to compare accuracies Prune alternate trees whose accuracy does not increase over time Replace the subtree if an alternate tree is more accurate [Figure: an old subtree testing Car Type = Sports Car? being replaced by an alternate subtree testing Married? and Experience < 1 year?]
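The replacement decision in the testing mode reduces to a small comparison (a sketch with an assumed interface; pruning of non-improving alternates is omitted): score the current subtree and each alternate on the same M examples, and replace only when an alternate wins outright.

```python
def maybe_replace(subtree_acc, alt_accs):
    """Return the index of the best alternate tree if it beats the current
    subtree's accuracy over the same M test examples, else None."""
    if not alt_accs:
        return None
    best = max(range(len(alt_accs)), key=lambda i: alt_accs[i])
    return best if alt_accs[best] > subtree_acc else None
```

Requiring a strict improvement keeps the tree stable: an alternate that merely ties the incumbent never triggers a replacement.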

42 Adjust to concept drift (2) Dynamically change the window size Shrink the window when many nodes become questionable, or when the data rate changes rapidly Increase the window size when few nodes are questionable

43 Performance Required memory: O(nodes * attributes * attribute values * classes), independent of the total number of examples Running time per example: O(l_c * attributes * attribute values * classes), where l_c is the length of the longest path an example takes through HT, times the number of alternate trees Model learned by CVFDT vs. the one learned by VFDT-Window: Similar in accuracy O(1) vs. O(window size) work per new example

44 Experiment Results Compared CVFDT, VFDT and VFDT-Window 5 million training examples Concept changed every 50k examples Drift level: the average percentage of test points that change label at each concept change; about 8% of test points change label at each drift 100,000 examples in the window 5% noise Tested the model every 10k examples throughout the run and averaged the results

45 Experiment Results (CVFDT vs. VFDT) Error rate as a function of the number of attributes, at a fixed drift level

46 Experiment Results (CVFDT vs. VFDT) Tree size as a function of number of attributes

47 Experiment Results (CVFDT vs. VFDT) Error rates of the learners as a function of the number of examples seen, and the portion of the data set that is labelled negative

48 Experiment Results (CVFDT vs. VFDT) Error rates as a function of the amount of concept drift

49 Experiment Results CVFDT’s drift characteristics

50 Experiment Results (CVFDT vs. VFDT vs. VFDT-Window) Error rates over time of CVFDT, VFDT, and VFDT-Window VFDT-Window simulated by rerunning VFDT on the window W every 100k examples instead of on every example Error rate: VFDT 19.4%, CVFDT 16.3%, VFDT-Window 15.3% Running time: VFDT 10 minutes, CVFDT 46 minutes, VFDT-Window an estimated 548 days

51 Experiment Results CVFDT does not use too much RAM With D = 50, CVFDT never uses more than 70 MB, and as little as half the RAM of VFDT VFDT often had twice as many leaves as the number of nodes in CVFDT's HT and alternate subtrees combined Reason: VFDT considers many more outdated examples and is forced to grow larger trees to make up for its earlier wrong decisions under concept drift

52 Conclusions CVFDT: a decision-tree induction system capable of learning accurate models from high-speed, concept-drifting data streams Grows an alternate subtree whenever an old one becomes questionable Replaces the old subtree when the new one is more accurate Similar in accuracy to applying VFDT to a moving window of examples

53 Future Work Concepts may change periodically, so removed subtrees could become useful again Comparisons with related systems Continuous attributes Weighting examples

54 Reference List P. Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000. G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001. V. Ganti, J. Gehrke, and R. Ramakrishnan. DEMON: Mining and monitoring evolving data. In Proceedings of the Sixteenth International Conference on Data Engineering, 2000. J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT: Optimistic decision tree construction. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, 1999.

55 The end Q & A

56 Thank You!