1 Active Mining of Data Streams Wei Fan, Yi-an Huang, Haixun Wang and Philip S. Yu Proc. SIAM International Conference on Data Mining 2004 Speaker: Pei-Min.


1 Active Mining of Data Streams
Wei Fan, Yi-an Huang, Haixun Wang and Philip S. Yu
Proc. SIAM International Conference on Data Mining 2004
Speaker: Pei-Min Chou
Date: 2005/01/14

2 Introduction
- In most real-world problems, labelled data streams are rarely available immediately
- Models are refreshed periodically
- We propose a new concept: demand-driven active data mining

3 Method
- Step 1: Detect potential changes in the data stream (a "guess")
- Step 2: If the guessed loss or error rate is higher than the tolerable maximum, choose a small number of data records and investigate their true class labels
- Step 3: If the statistically estimated loss is higher than the tolerable maximum, reconstruct the old model
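The three steps above can be sketched as a small decision routine. This is a minimal illustration, not the authors' implementation: the function name `decide_action`, the callback `sample_true_losses`, and the use of the plain sample mean in step 3 are all assumptions made for this example.

```python
from statistics import mean
from typing import Callable, Sequence


def decide_action(
    guessed_loss: float,
    tolerable_max: float,
    sample_true_losses: Callable[[], Sequence[float]],
) -> str:
    """Return "keep" or "reconstruct" for the current model (hypothetical helper)."""
    # Step 1 has already produced guessed_loss without using any true labels.
    if guessed_loss <= tolerable_max:
        return "keep"  # no sign of harmful change, so no labels are requested

    # Step 2: the guess is too high, so pay for true labels on a small sample.
    losses = sample_true_losses()
    estimated_loss = mean(losses)  # assumption: compare the sample mean

    # Step 3: rebuild the model only if the estimated loss is also too high.
    if estimated_loss > tolerable_max:
        return "reconstruct"
    return "keep"


def labels_must_not_be_requested() -> Sequence[float]:
    # Used to show that no labelling cost is incurred when the guess is low.
    raise AssertionError("true labels were requested unnecessarily")


low_guess = decide_action(0.05, 0.10, labels_must_not_be_requested)
false_alarm = decide_action(0.30, 0.10, lambda: [0.0, 0.0, 0.0, 0.0])
real_change = decide_action(0.30, 0.10, lambda: [1.0, 0.0, 1.0, 0.0])
print(low_guess, false_alarm, real_change)
```

The design point is that the labelling callback runs only when the label-free guess exceeds the tolerable maximum, which is what makes the approach demand-driven.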

4 Definition(1)
- D_c: complete data set
- D: training set
- S: data stream
- dt: decision tree constructed from D
- Tolerable maximum: exact values are defined entirely by each application

5 Definition(2)
- n_l: number of instances classified by leaf l
- N: size of the data stream
- Statistic at leaf l: P(l) = n_l / N
- Σ P(l) = 1, summed over all leaves l
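As a small worked sketch of the leaf statistic P(l) = n_l / N, the snippet below counts how many records reach each leaf and divides by the total. The leaf assignments are those of the seven training records in the example slides; the function name `leaf_statistics` is an assumption made for illustration.

```python
from collections import Counter
from fractions import Fraction

LEAVES = ["C1", "C2", "C3", "C4", "C5", "C6"]


def leaf_statistics(leaf_of_record, leaves):
    """Return P(l) = n_l / N for every leaf, including leaves with no records."""
    counts = Counter(leaf_of_record)  # n_l for each leaf that was reached
    total = len(leaf_of_record)       # N
    return {leaf: Fraction(counts[leaf], total) for leaf in leaves}


# Leaves reached by Mary, John, Billy, Ella, Paul, Tom, Amy (training set D).
training_leaves = ["C2", "C4", "C1", "C3", "C6", "C1", "C6"]
p_D = leaf_statistics(training_leaves, LEAVES)

for leaf in LEAVES:
    print(leaf, p_D[leaf])
```

Exact fractions are used so that the statistics sum to exactly 1, matching the constraint on the slide.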

6 Example
D_c: complete set
Name   Bank  Local  Price
Mary   ICE   A      500
John   IAE   B      700
Billy  ICE   A      100
Ella   ICE   B      300
Bob    IDE   C      500
Paul   IBE   B      700
Tom    ICE   A      100
Amy    IBE   B      700
D: training set
Name   Bank  Local  Price  Class
Mary   ICE   A      500    C2
John   IAE   B      700    C4
Billy  ICE   A      100    C1
Ella   ICE   B      300    C3
Paul   IBE   B      700    C6
Tom    ICE   A      100    C1
Amy    IBE   B      700    C6

7 Example---decision tree
Split tests (each with yes/no branches): Bank is ICE; Local is A; Price is 100; Bank is IBE; Local is B
Leaf  Records     P_D(l)
C1    Billy, Tom  2/7
C2    Mary        1/7
C3    Ella        1/7
C4    John        1/7
C5    (none)      0
C6    Paul, Amy   2/7

8 Observable Statistics(1)
- P_S(l): statistic at leaf l in S
- P_D(l): statistic at leaf l in D
- PS: change of leaf statistics on the data stream, measured from the differences between P_S(l) and P_D(l)
- A large PS indicates that a significant change has occurred
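The transcript does not preserve the formula for PS, so the sketch below assumes one natural choice: half the sum of absolute differences between P_S(l) and P_D(l) over all leaves, which lies between 0 and 1. The leaf statistics are taken from the example slides; the function name `leaf_statistic_change` is hypothetical.

```python
from fractions import Fraction as F


def leaf_statistic_change(p_S, p_D):
    """Assumed definition: PS = (1/2) * sum over leaves of |P_S(l) - P_D(l)|."""
    leaves = set(p_S) | set(p_D)
    return sum(abs(p_S.get(l, F(0)) - p_D.get(l, F(0))) for l in leaves) / 2


# Leaf statistics on the training set D (seven records).
p_D = {"C1": F(2, 7), "C2": F(1, 7), "C3": F(1, 7),
       "C4": F(1, 7), "C5": F(0), "C6": F(2, 7)}
# Leaf statistics on the new stream S (five records).
p_S = {"C1": F(0), "C2": F(2, 5), "C3": F(0),
       "C4": F(2, 5), "C5": F(1, 5), "C6": F(0)}

PS = leaf_statistic_change(p_S, p_D)
print(PS, float(PS))
```

Under this assumed definition, identical distributions give 0 and distributions with no leaf in common give 1, so the value can be compared directly against a tolerable maximum expressed as a fraction.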

9 Example(2)
S: new data stream
Name  Bank  Local  Price
Erin  ICE   A      500
JoJo  IAE   B      700
Boss  IBE   C      500
Hebe  ICE   A      500
Sam   IBE   C      500
Same decision tree (split tests: Bank is ICE; Local is A; Price is 100; Bank is IBE; Local is B)
Leaf  Records     P_S(l)
C1    (none)      0
C2    Erin, Hebe  2/5
C3    (none)      0
C4    Boss, Sam   2/5
C5    JoJo        1/5
C6    (none)      0

10 Observable Statistics(2)
- L_a: validation loss
- L_e: sum of the expected loss at every leaf
- LS: potential change in loss due to changes in the data stream
- Difference from PS: LS takes the loss function into account

11 Example(3)
S: new data stream
Name  Bank  Local  Price
Hebe  ICE   A      -
Sam   IBE   C      700
Leaves of the same decision tree: C1; C2: Erin, Hebe (30%); C3; C4: Boss, Sam; C5: JoJo; C6
Leaf C2: majority class probability 0.7
L_e(C2) = (1 - 0.7) * 30% = 9%
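The arithmetic for L_e(C2) can be reproduced with a short sketch. It reads the 30% as the share of stream records reaching leaf C2 and assumes 0-1 loss, so the expected loss at a leaf is the probability of not being in the majority class times that share. Only the C2 numbers come from the slide; the other leaves, the validation loss, and the choice LS = |L_e - L_a| are hypothetical values and assumptions used to complete the example.

```python
def expected_leaf_loss(majority_prob: float, stream_share: float) -> float:
    """Expected 0-1 loss contributed by one leaf (assumed interpretation)."""
    return (1.0 - majority_prob) * stream_share


def total_expected_loss(leaves: dict) -> float:
    """L_e: sum of the expected loss at every leaf."""
    return sum(expected_leaf_loss(m, s) for m, s in leaves.values())


# Leaf C2 from the slide: majority class probability 0.7, 30% of the stream.
loss_c2 = expected_leaf_loss(0.7, 0.30)  # about 0.09, i.e. 9%

# Hypothetical tree: only the C2 entry is taken from the slide.
leaves = {
    "C2": (0.7, 0.30),
    "C4": (0.9, 0.40),  # hypothetical
    "C5": (0.8, 0.30),  # hypothetical
}
L_e = total_expected_loss(leaves)

L_a = 0.10           # hypothetical validation loss
LS = abs(L_e - L_a)  # assumption: LS is the absolute difference
print(round(loss_c2, 4), round(L_e, 4), round(LS, 4))
```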

12 Loss Estimation
- Applied when the two statistics (PS and LS) exceed the tolerable maximum
- Investigate the true class labels of a selected number of examples
- Assume the loss of each example is {l_1, l_2, l_3, ..., l_n}
- Average loss: (Σ l_i) / n
- Standard error: s / sqrt(n), where s is the sample standard deviation of the losses
- Investigation cost: obtaining true labels is not free
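The average loss and its standard error can be computed as in the sketch below. The sampled losses are made-up 0-1 losses, and the 1.96 multiplier (a normal-approximation 95% interval) is an assumption added for illustration; the slide itself only names the average and the standard error.

```python
import math
from statistics import mean, stdev


def estimate_loss(losses, z=1.96):
    """Return (average loss, standard error, (lower, upper) range)."""
    n = len(losses)
    if n < 2:
        raise ValueError("need at least two labelled examples")
    avg = mean(losses)                    # (sum of l_i) / n
    se = stdev(losses) / math.sqrt(n)     # s / sqrt(n), s uses n - 1
    return avg, se, (avg - z * se, avg + z * se)


# Hypothetical 0-1 losses for ten examples whose true labels were investigated.
sampled = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
avg, se, (low, high) = estimate_loss(sampled)
print(f"average={avg:.3f} standard_error={se:.3f} range=({low:.3f}, {high:.3f})")
```

Because the standard error shrinks with sqrt(n), labelling more examples narrows the estimated range, which is the trade-off against the investigation cost noted on the slide.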

13 Experiment(1)
- The change statistics are a good indicator of change

14 Experiment(2)

15 Experiment(3)

16 Experiment(4)

17 Experiment---Result
- The two statistics are very well correlated with the amount of change
- The statistically estimated loss range is very close to the true value

18 Conclusion
- The error is estimated without knowing the true class labels
- A statistical sampling method estimates the range of the true loss
- The model is reconstructed whenever the estimated loss is higher than the tolerable maximum
