Mining High-Speed Data Streams. Pedro Domingos, Geoff Hulten. Sixth ACM SIGKDD International Conference (KDD), 2000. Presented by: Afsoon Yousefi.


Slide 1: Mining High-Speed Data Streams. Pedro Domingos, Geoff Hulten. Sixth ACM SIGKDD International Conference. Presented by: Afsoon Yousefi.

Slide 2: Outline
 Introduction
 Hoeffding Trees
 The VFDT System
 Performance Study
 Conclusion
 Q&A

Slide 3: Outline (Introduction)

Slide 4: Introduction
 In today's information society, knowledge extraction has become an important task for many people; we live in an age of knowledge revolution.
 Many organizations have very large databases that grow at a rate of several million records per day.
 This brings both opportunities and challenges.
 The main limited resources in knowledge discovery systems are:
 Time
 Memory
 Sample size

Slide 5: Introduction (cont.)
 Traditional systems:
 Only a small amount of data is available.
 Use a fraction of the available computational power.
 Current systems:
 The bottleneck is time and memory.
 Use a fraction of the available data samples.
 Try to mine databases that do not fit in main memory.
 Available algorithms are either:
 Efficient, but without a guarantee of learning a model similar to the batch one; they never recover from an unfavorable set of early examples and are sensitive to example ordering.
 Able to produce the same model as the batch version, but inefficiently: slower than the batch algorithm.

Slide 6: Introduction (cont.)
 Requirements for algorithms that overcome these problems:
 Operate continuously and indefinitely.
 Incorporate examples as they arrive, never losing potentially valuable information.
 Build a model using at most one scan of the data.
 Use only a fixed amount of main memory.
 Require small constant time per record.
 Make a usable model available at any point in time.
 Produce a model equivalent to the one obtained by an ordinary database mining algorithm.
 When the data-generating process changes over time, keep the model up-to-date.

Slide 7: Introduction (cont.)
 These requirements are fulfilled by incremental learning methods, also known as online, successive, or sequential methods.

Slide 8: Outline (Hoeffding Trees)

Slide 9: Hoeffding Trees
 Classic decision tree learners (CART, ID3, C4.5):
 Require all examples simultaneously in main memory.
 Disk-based decision tree learners (SLIQ, SPRINT):
 Store examples on disk.
 Expensive for learning complex trees or from very large datasets.
 Instead, use a subset of the training examples to find the best attribute at each node:
 Works for extremely large datasets.
 Reads each example at most once.
 Can directly mine online data sources.
 Builds complex trees at acceptable computational cost.
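The attribute-evaluation heuristic shared by these learners is typically information gain. As an illustrative sketch (not the paper's implementation; function names are mine), the gain of a candidate split can be computed from per-class counts:

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class-count distribution."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Entropy reduction from splitting a node with class counts
    `parent_counts` into children with per-class counts `children_counts`."""
    total = sum(parent_counts)
    remainder = sum(sum(child) / total * entropy(child) for child in children_counts)
    return entropy(parent_counts) - remainder

# A perfect split of a 50/50 class mix gains the full 1 bit of entropy.
gain = information_gain([10, 10], [[10, 0], [0, 10]])
```

Because the counts are simple sums, they can be accumulated one example at a time, which is what makes the subset-based approach compatible with a stream.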

Slide 10: Hoeffding Trees (cont.)

Slide 11: Hoeffding Trees (cont.)
 Given a stream of examples:
 Use the first ones to choose the root test.
 Pass succeeding examples down to the corresponding leaves.
 Pick the best attributes there, and so on recursively.
 How many examples are necessary at each node?
 The Hoeffding bound (also known as the additive Chernoff bound), a statistical result.
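The Hoeffding bound states that, for a real-valued random variable with range R, after n independent observations the true mean is within ε of the sample mean with probability 1 − δ, where

```latex
\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}
```

In tree induction, the random variable is the attribute-evaluation heuristic G: a leaf is split on the current best attribute once the observed advantage of the best over the second-best attribute exceeds ε.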

Slide 12: Hoeffding Trees (cont.)

Slide 13: Hoeffding Trees (cont.)

Slide 14: Hoeffding Tree algorithm

Slide 15: Hoeffding Tree algorithm (cont.)

Slide 16: Hoeffding Tree algorithm (cont.)
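The Hoeffding tree pseudocode on the preceding slides hinges on one decision: when has a leaf seen enough examples to split? A minimal sketch of that test, assuming information gain as the heuristic (variable names and the δ default are illustrative, not the authors' code):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the true mean of a
    variable with range `value_range`, estimated from n observations,
    lies within epsilon of the sample mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_gain, n, delta=1e-7, value_range=1.0):
    """Split the leaf once the observed gain advantage of the best
    attribute over the runner-up exceeds the Hoeffding bound.
    value_range = 1.0 suits information gain with two classes."""
    return best_gain - second_gain > hoeffding_bound(value_range, delta, n)
```

With few examples the bound is loose and the leaf keeps accumulating statistics; as n grows, ε shrinks and the split fires with the guaranteed confidence. (The tie-breaking refinement of splitting when ε drops below a user threshold τ is omitted here.)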

Slide 17: Hoeffding Trees (cont.)

Slide 18: Hoeffding Trees (cont.)

Slide 19: Outline (The VFDT System)

Slide 20: The VFDT System

Slide 21: The VFDT System (cont.)

Slide 22: The VFDT System (cont.)

Slide 23: The VFDT System (cont.)

Slide 24: The VFDT System (cont.)

Slide 25: The VFDT System (cont.)
 Initialization:
 VFDT can be initialized with the tree produced by a conventional RAM-based learner on a small subset of the data.
 The tree can either be input as is, or over-pruned.
 This gives VFDT a "head start".

Slide 26: The VFDT System (cont.)
 Rescans:
 VFDT can rescan previously seen examples.
 Rescanning can be activated if:
 The data arrives slowly enough that there is time for it, or
 The dataset is finite and small enough that rescanning is feasible.

Slide 27: Outline (Performance Study)

Slide 28: Synthetic Data Study

Slide 29: Synthetic Data Study (cont.)

Slide 30: Synthetic Data Study (cont.)

Slide 31: Synthetic Data Study (cont.)
 Accuracy as a function of the noise level.
 Four runs on the same concept (C4.5: 100k examples; VFDT: 20 million examples).

Slide 32: Lesion Study

Slide 33: Web Data

Slide 34: Web Data (cont.)
 Performance on Web data.

Slide 35: Outline (Conclusion)

Slide 36: Conclusion
 Hoeffding trees:
 A method for learning online from high-volume data streams.
 Allow learning in very small constant time per example.
 Guarantee high similarity to the corresponding batch trees.
 The VFDT system:
 A high-performance data mining system based on Hoeffding trees.
 Effective at taking advantage of massive numbers of examples.

Slide 37: Outline (Q&A)

Slide 38: Q&A
 Name four requirements that stream algorithms must meet to overcome the limitations of current disk-based algorithms.
 Operate continuously and indefinitely.
 Incorporate examples as they arrive, never losing potentially valuable information.
 Build a model using at most one scan of the data.
 Use only a fixed amount of main memory.
 Require small constant time per record.
 Make a usable model available at any point in time.
 Produce a model equivalent to the one obtained by an ordinary database mining algorithm.
 When the data-generating process changes over time, keep the model up-to-date.

Slide 39: Q&A
 What are the benefits of using a subset of the training examples to find the best attribute?
 Works for extremely large datasets.
 Reads each example at most once.
 Can directly mine online data sources.
 Builds complex trees at acceptable computational cost.

Slide 40: Q&A