Slide 1: Mining High-Speed Data Streams
Pedro Domingos, Geoff Hulten
Sixth ACM SIGKDD International Conference, 2000
Presented by: Afsoon Yousefi

Slide 2: Outline
- Introduction
- Hoeffding Trees
- The VFDT System
- Performance Study
- Conclusion
- Qs & As

Slide 4: Introduction
- In today's information society, extracting knowledge is becoming a very important task for many people. We live in an age of knowledge revolution.
- Many organizations have very large databases that grow at a rate of several million records per day.
  - Opportunities
  - Challenges
- Main limited resources in knowledge discovery systems:
  - Time
  - Memory
  - Sample size

Slide 5: Introduction (cont.)
- Traditional systems:
  - Only a small amount of data is available.
  - Use a fraction of the available computational power.
- Current systems:
  - The bottleneck is time and memory.
  - Use a fraction of the available data samples.
  - Try to mine databases that don't fit in main memory.
- Available algorithms:
  - Efficient, but not guaranteed to learn a model similar to the one learned in batch mode. They never recover from an unfavorable set of early examples and are sensitive to example ordering.
  - Produce the same model as the batch version, but not efficiently: they are slower than the batch algorithm.

Slide 6: Introduction (cont.)
- Requirements for algorithms to overcome these problems:
  - Operate continuously and indefinitely.
  - Incorporate examples as they arrive.
  - Never lose potentially valuable information.
  - Build a model using at most one scan of the data.
  - Use only a fixed amount of main memory.
  - Require small constant time per record.
  - Make a usable model available at any point in time.
  - Produce a model equivalent to the one obtained by an ordinary database mining algorithm.
  - When the data-generating process changes over time, the model should be up to date at any time.
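To make several of these requirements concrete (one scan, fixed memory, constant time per record, a usable model at any point), here is a minimal sketch of a toy stream learner. It is not the method from the paper: the class `MajorityClassLearner` and the function `run` are hypothetical names, and the "model" is just the most frequent class label seen so far.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional


class MajorityClassLearner:
    """Toy anytime learner (hypothetical): memory grows only with the number of classes."""

    def __init__(self) -> None:
        self.class_counts: Counter = Counter()

    def learn_one(self, label: Hashable) -> None:
        # Incorporate the example as it arrives; it is never stored or revisited.
        self.class_counts[label] += 1

    def predict(self) -> Optional[Hashable]:
        # A usable model is available at any point, even mid-stream.
        if not self.class_counts:
            return None
        return self.class_counts.most_common(1)[0][0]


def run(stream: Iterable[Hashable]) -> MajorityClassLearner:
    learner = MajorityClassLearner()
    for label in stream:  # single scan of the data, constant work per record
        learner.learn_one(label)
    return learner
```

Calling `run(iter(labels)).predict()` after any prefix of the stream returns the current majority label. The sketch does not address the last two requirements (equivalence to a batch model, and tracking a changing data-generating process).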

Slide 7: Introduction (cont.)
- Such requirements are fulfilled by:
  - Incremental learning methods
  - Online methods
  - Successive methods
  - Sequential methods

Slide 9: Hoeffding Trees
- Classic decision tree learners:
  - CART, ID3, C4.5
  - All examples are held simultaneously in main memory.
- Disk-based decision tree learners:
  - SLIQ, SPRINT
  - Examples are stored on disk.
  - Expensive when learning complex trees or very large datasets.
- Consider a subset of training examples to find the best attribute:
  - For extremely large datasets.
  - Read each example at most once.
  - Directly mine online data sources.
  - Build complex trees with acceptable computational cost.

Slide 10: Hoeffding Trees (cont.)

Slide 11: Hoeffding Trees (cont.)
- Given a stream of examples:
  - Use the first ones to choose the root test.
  - Pass succeeding ones to the corresponding leaves.
  - Pick the best attributes there.
  - ... and so on recursively.
- How many examples are necessary at each node?
  - Hoeffding bound
  - Additive Chernoff bound
  - A statistical result
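The Hoeffding bound answers the "how many examples" question. For a real-valued random variable with range R, after n independent observations the true mean is, with probability 1 - delta, at least the observed mean minus epsilon, where epsilon = sqrt(R^2 ln(1/delta) / (2n)). The sketch below computes epsilon and inverts the formula to get n; the function names are illustrative assumptions, not taken from the slides.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that the true mean is within epsilon below the
    observed mean with probability 1 - delta, after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest n for which the Hoeffding bound is at most epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    # Assumed illustrative values: range 1, delta = 1e-7.
    print(hoeffding_bound(1.0, 1e-7, 1000))
    print(examples_needed(1.0, 1e-7, 0.1))
```

Note that epsilon shrinks only as 1/sqrt(n), so halving epsilon takes four times as many examples, while the dependence on delta is logarithmic.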

Slide 14: Hoeffding Tree algorithm

Slide 15: Hoeffding Tree algorithm (cont.)

Slide 16: Hoeffding Tree algorithm (cont.)
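As a rough illustration of the split decision at a leaf, the sketch below follows the test described in the Domingos and Hulten paper: split on the best attribute only when its heuristic score exceeds the runner-up's by more than the Hoeffding epsilon. It is a simplified sketch under assumptions, not the full algorithm: `choose_split` and its parameters are hypothetical names, the heuristic scores (for example information gain) are assumed to be computed elsewhere from the leaf's counts, and pre-pruning and tie handling are omitted.

```python
import math
from typing import Dict, Optional


def choose_split(
    gains: Dict[str, float],  # heuristic score per candidate attribute at this leaf
    n: int,                   # number of examples seen at this leaf
    delta: float,             # allowed probability of choosing the wrong attribute
    value_range: float,       # range R of the heuristic (e.g. log2(#classes) for info gain)
) -> Optional[str]:
    """Return the attribute to split on, or None if more examples are needed."""
    if len(gains) < 2 or n <= 0:
        return None
    ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    (best_attr, best_gain), (_, second_gain) = ranked[0], ranked[1]
    # Hoeffding bound for n observations of a variable with range R.
    epsilon = math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))
    if best_gain - second_gain > epsilon:
        return best_attr  # confident the best attribute is truly the best
    return None           # keep accumulating examples at this leaf
```

With scores `{"a": 0.30, "b": 0.10}`, range 1 and delta 1e-7, the function declines to split after 100 examples (epsilon is about 0.28) and splits on `"a"` after 1000 (epsilon is about 0.09).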

Slide 20: The VFDT System

Slide 21: The VFDT System (cont.)

Slide 25: The VFDT System (cont.)
- Initialization
  - VFDT can be initialized with the tree produced by a conventional RAM-based learner on a small subset of the data.
  - The tree can either be input as it is, or over-pruned.
  - Gives VFDT a "head start".

Slide 26: The VFDT System (cont.)
- Rescans
  - VFDT can rescan previously seen examples.
  - Can be activated if:
    - The data arrives slowly enough that there is time for it.
    - The dataset is finite and small enough that it is feasible.

Slide 28: Synthetic Data Study

Slide 29: Synthetic Data Study (cont.)

Slide 30: Synthetic Data Study (cont.)

Slide 31: Synthetic Data Study (cont.)
- Accuracy as a function of the noise level.
- 4 runs on the same concept (C4.5: 100k examples; VFDT: 20 million examples).

Slide 32: Lesion Study

Slide 33: Web Data

Slide 34: Web Data (cont.)
- Performance on Web data

Slide 36: Conclusion
- Hoeffding trees:
  - A method for learning online.
  - Learn from high-volume data streams.
  - Allow learning in very small constant time per example.
  - Guarantee high similarity to the corresponding batch trees.
- VFDT system:
  - A high-performance data mining system.
  - Based on Hoeffding trees.
  - Effective in taking advantage of massive numbers of examples.

Slide 38: Qs & As
- Name 4 requirements for algorithms to overcome the limitations of currently available disk-based algorithms.
  - Operate continuously and indefinitely.
  - Incorporate examples as they arrive.
  - Never lose potentially valuable information.
  - Build a model using at most one scan of the data.
  - Use only a fixed amount of main memory.
  - Require small constant time per record.
  - Make a usable model available at any point in time.
  - Produce a model equivalent to the one obtained by an ordinary database mining algorithm.
  - When the data-generating process changes over time, the model should be up to date at any time.

Slide 39: Qs & As
- What are the benefits of considering a subset of training examples to find the best attribute?
  - Works for extremely large datasets.
  - Reads each example at most once.
  - Directly mines online data sources.
  - Builds complex trees with acceptable computational cost.

Slide 40: Qs & As
