
Mining High-Speed Data Streams. Pedro Domingos, Geoff Hulten. Sixth ACM SIGKDD International Conference, 2000. Presented by: Afsoon Yousefi.


1 Mining High-Speed Data Streams
Pedro Domingos, Geoff Hulten
Sixth ACM SIGKDD International Conference, 2000
Presented by: Afsoon Yousefi

2 Outline
• Introduction
• Hoeffding Trees
• The VFDT System
• Performance Study
• Conclusion
• Qs & As

3 Introduction | Hoeffding Trees | The VFDT System | Performance Study | Conclusion | Qs & As

4 Introduction
• In today's information society, extraction of knowledge is becoming a very important task for many people. We live in an age of knowledge revolution.
• Many organizations have more than very large databases: their databases grow at a rate of several million records per day.
  • Opportunities
  • Challenges
• Main limited resources in knowledge discovery systems:
  • Time
  • Memory
  • Sample size

5 Introduction (cont.)
• Traditional systems:
  • Only a small amount of data is available
  • Use only a fraction of the available computational power
• Current systems:
  • The bottleneck is time and memory
  • Use only a fraction of the available data samples
  • Try to mine databases that don't fit in main memory
• Available algorithms:
  • Efficient, but with no guarantee that the learned model is similar to the one learned in batch mode. They never recover from an unfavorable set of early examples and are sensitive to example ordering.
  • Produce the same model as the batch version, but not efficiently: they are slower than the batch algorithm.

6 Introduction (cont.)
• Requirements for algorithms to overcome these problems:
  • Operate continuously and indefinitely.
  • Incorporate examples as they arrive.
  • Never lose potentially valuable information.
  • Build a model using at most one scan of the data.
  • Use only a fixed amount of main memory.
  • Require a small constant time per record.
  • Make a usable model available at any point in time.
  • Produce a model equivalent to the one obtained by an ordinary database mining algorithm.
  • When the data-generating process changes over time, keep the model up to date at all times.
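The requirements above can be illustrated with a deliberately trivial learner. The sketch below is an assumption made for illustration only and is not the method presented in these slides; the class name `MajorityClassStreamLearner` and its methods are hypothetical. It reads each label once, does constant work per record, keeps memory proportional to the number of distinct classes, and can return a prediction at any point in the stream.

```python
from collections import Counter
from typing import Hashable, Optional


class MajorityClassStreamLearner:
    """Toy one-pass learner that predicts the most frequent class seen so far.

    Hypothetical illustration of the streaming requirements, not the
    algorithm from the paper.
    """

    def __init__(self) -> None:
        # Memory grows with the number of distinct classes, not with stream length.
        self.class_counts: Counter = Counter()

    def learn_one(self, label: Hashable) -> None:
        # Constant work per record; the example itself is not stored.
        self.class_counts[label] += 1

    def predict(self) -> Optional[Hashable]:
        # A usable model is available at any time, even mid-stream.
        if not self.class_counts:
            return None
        return self.class_counts.most_common(1)[0][0]


learner = MajorityClassStreamLearner()
for label in ["spam", "ham", "spam"]:  # stands in for an unbounded stream
    learner.learn_one(label)
print(learner.predict())
```

A real stream learner would keep richer sufficient statistics than class counts, but the interface (learn one record, predict at any time) stays the same.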

7 Introduction (cont.)
• Such requirements are fulfilled by:
  • Incremental learning methods
  • Online methods
  • Successive methods
  • Sequential methods

8 Introduction | Hoeffding Trees | The VFDT System | Performance Study | Conclusion | Qs & As

9 Hoeffding Trees
• Classic decision tree learners:
  • CART, ID3, C4.5
  • All examples are held simultaneously in main memory.
• Disk-based decision tree learners:
  • SLIQ, SPRINT
  • Examples are stored on disk.
  • Expensive for learning complex trees or for very large datasets.
• Consider a subset of training examples to find the best attribute:
  • Suitable for extremely large datasets.
  • Read each example at most once.
  • Directly mine online data sources.
  • Build complex trees with acceptable computational cost.

10 Hoeffding Trees (cont.)

11 Hoeffding Trees (cont.)
• Given a stream of examples:
  • Use the first ones to choose the root test.
  • Pass succeeding ones to the corresponding leaves.
  • Pick the best attributes there.
  • … And so on recursively.
• How many examples are necessary at each node?
  • Hoeffding bound
  • Additive Chernoff bound
  • A statistical result
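The Hoeffding bound answers the "how many examples" question. For a real-valued random variable with range R, after n independent observations the true mean is, with probability 1 - delta, at least the observed mean minus epsilon, where epsilon = sqrt(R^2 ln(1/delta) / (2n)). The sketch below computes epsilon and inverts the formula to get n; the function names are hypothetical and chosen only for this illustration.

```python
import math


def hoeffding_epsilon(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest n whose Hoeffding epsilon does not exceed the requested one."""
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


# Example values (assumptions): range 1, confidence 1 - 1e-7, tolerance 0.05.
n = examples_needed(value_range=1.0, delta=1e-7, epsilon=0.05)
print(n, hoeffding_epsilon(1.0, 1e-7, n))
```

Note that epsilon shrinks only with the square root of n, so halving the tolerance requires roughly four times as many examples.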

12 Hoeffding Trees (cont.)

13 Hoeffding Trees (cont.)

14 Hoeffding Tree algorithm

15 Hoeffding Tree algorithm (cont.)

16 Hoeffding Tree algorithm (cont.)
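The algorithm slides survive here only as titles. As a minimal sketch of the split test that a Hoeffding tree applies at a leaf, under stated assumptions: `gains` holds the observed heuristic value (for example, information gain) of each candidate attribute after n examples at that leaf, and the leaf splits only when the best attribute beats the runner-up by more than the Hoeffding epsilon. The function name `choose_split` and the default `value_range=1.0` (the range of information gain for a two-class problem) are assumptions for this illustration, not taken from the slides.

```python
import math
from typing import Dict, Optional


def choose_split(gains: Dict[str, float], n: int, delta: float,
                 value_range: float = 1.0) -> Optional[str]:
    """Return the attribute to split on, or None if more examples are needed."""
    if n <= 0 or len(gains) < 2:
        # This simplified sketch needs at least two candidates to compare.
        return None
    # Hoeffding bound for n observations at this leaf.
    epsilon = math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))
    ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    (best_attr, best_gain), (_, second_gain) = ranked[0], ranked[1]
    # Split only when the observed lead exceeds the statistical uncertainty.
    return best_attr if best_gain - second_gain > epsilon else None


# Hypothetical gains for two attributes observed at one leaf.
observed = {"age": 0.30, "income": 0.10}
print(choose_split(observed, n=100, delta=1e-7))   # too few examples so far
print(choose_split(observed, n=1000, delta=1e-7))  # lead now exceeds epsilon
```

With the same observed gains, the decision changes only because epsilon shrinks as more examples reach the leaf.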

17 Hoeffding Trees (cont.)

18 Hoeffding Trees (cont.)

19 Introduction | Hoeffding Trees | The VFDT System | Performance Study | Conclusion | Qs & As

20 The VFDT System

21 The VFDT System (cont.)

22 The VFDT System (cont.)

23 The VFDT System (cont.)

24 The VFDT System (cont.)

25 The VFDT System (cont.)
• Initialization
  • VFDT can be initialized with the tree produced by a conventional RAM-based learner on a small subset of the data.
  • The tree can either be input as it is, or over-pruned.
  • Gives VFDT a "head start".

26 The VFDT System (cont.)
• Rescans
  • VFDT can rescan previously seen examples.
  • Can be activated if:
    • The data arrives slowly enough that there is time for it.
    • The dataset is finite and small enough that it is feasible.

27 Introduction | Hoeffding Trees | The VFDT System | Performance Study | Conclusion | Qs & As

28 Synthetic Data Study

29 Synthetic Data Study (cont.)

30 Synthetic Data Study (cont.)

31 Synthetic Data Study (cont.)
• Accuracy as a function of the noise level.
• 4 runs on the same concept (C4.5: 100k examples; VFDT: 20 million examples)

32 Lesion Study

33 Web Data

34 Web Data (cont.)
• Performance on Web data

35 Introduction | Hoeffding Trees | The VFDT System | Performance Study | Conclusion | Qs & As

36 Conclusion
• Hoeffding trees:
  • A method for learning online.
  • Learn from high-volume data streams.
  • Allow learning in very small constant time per example.
  • Guarantee high similarity to the corresponding batch trees.
• VFDT system:
  • A high-performance data mining system.
  • Based on Hoeffding trees.
  • Effective in taking advantage of massive numbers of examples.

37 Introduction | Hoeffding Trees | The VFDT System | Performance Study | Conclusion | Qs & As

38 Qs & As
• Name four requirements for algorithms to overcome the limitations of the currently available disk-based algorithms.
  • Operate continuously and indefinitely.
  • Incorporate examples as they arrive.
  • Never lose potentially valuable information.
  • Build a model using at most one scan of the data.
  • Use only a fixed amount of main memory.
  • Require a small constant time per record.
  • Make a usable model available at any point in time.
  • Produce a model equivalent to the one obtained by an ordinary database mining algorithm.
  • When the data-generating process changes over time, keep the model up to date at all times.

39 Qs & As
• What are the benefits of considering a subset of training examples to find the best attribute?
  • Suitable for extremely large datasets.
  • Read each example at most once.
  • Directly mine online data sources.
  • Build complex trees with acceptable computational cost.

40 Qs & As

