Presentation is loading. Please wait.

Presentation is loading. Please wait.

WindMine: Fast and Effective Mining of Web-click Sequences SDM 2011Y. Sakurai et al.1 Yasushi Sakurai (NTT) Lei Li (Carnegie Mellon Univ.) Yasuko Matsubara.

Similar presentations


Presentation on theme: "WindMine: Fast and Effective Mining of Web-click Sequences SDM 2011Y. Sakurai et al.1 Yasushi Sakurai (NTT) Lei Li (Carnegie Mellon Univ.) Yasuko Matsubara."— Presentation transcript:

1 WindMine: Fast and Effective Mining of Web-click Sequences SDM 2011Y. Sakurai et al.1 Yasushi Sakurai (NTT) Lei Li (Carnegie Mellon Univ.) Yasuko Matsubara (Kyoto Univ.) Christos Faloutsos (Carnegie Mellon Univ.)

2 Introduction Web-click sequence applications Web masters and web-site owners -Capacity planning -Intrusion detection -Advertisement design Goal -Find meaningful patterns for web-click data (e.g., the lunch-break trend, huge spike, anomalies) -Find periodicity (daily and/or weekly, etc) -Determine suitable window sizes automatically SDM 2011Y. Sakurai et al.2

3 Introduction Examples access count from a business news site SDM 2011Y. Sakurai et al.3 Original web-click sequence

4 Problem definition Web-click sequences of m URLs: (X 1, …, X m ) Web-click sequence X of duration n : X = (x 1,…, x t,…, x n ) Local Component Analysis: Given m sequences of duration n, (X 1, …, X m ) -Find patterns, main components of the sequences -Find the ‘best window’ size w for the analysis Final challenge: scalable algorithm for the local component analysis SDM 2011Y. Sakurai et al.4

5 Background Independent component analysis (ICA) - PCA vs. ICA SDM 2011Y. Sakurai et al.5

6 Why not ‘PCA’? Example of component analysis SDM 2011Y. Sakurai et al.6 Source Mix

7 Why not ‘PCA’? Example of component analysis SDM 2011Y. Sakurai et al.7 ICA recognizes the components successfully and separately PCA ICA

8 Main idea (1) Multi-scale local component analysis SDM 2011Y. Sakurai et al.8 Divide a sequence into subsequences of length w Compute the local components from the window matrix a a b b c c d d e e f f g g h h a a b b c c d d e e f f g g h h a a b b c c d d e e f f g g h h

9 SDM 20119 Main idea (2) Best window size selection Proposed criterion: CEM (Component Entropy Maximization) -Estimate the optimal number of w for the sequence set -Compute the entropy of the weight values of the mixing matrix A -‘popular’ (widely-used) components show high CEM scores Y. Sakurai et al. Q : How to estimate a ‘good window size’ automatically when we have multiple sequences?

10 Main idea (2) CEM criterion: -CEM score of the j -th component for the window size w -Probability for the j -th component (size of the j -th component’s contribution to each subsequence) -Normalized weight values for each subsequences -Mixing matrix SDM 2011Y. Sakurai et al.10 k : # of components M : # of subsequences

11 WindMine-part Efficient solution Hierarchical partitioning approach: WindMine-part -Partition the original window matrix into sub-matrices -Extract local components each from the sub-matrices -Reuse the local components for the component analysis on the higher level SDM 2011Y. Sakurai et al.11 Q : How do we efficiently extract the best local component from large sequence sets?

12 WindMine-part SDM 2011Y. Sakurai et al.12 local componentswindow matrixsub-matrices Level 1 Level 2

13 Experimental Results Experiments with real and datasets Ondemand TV, WebClick, Automobile, Temperature, Sunspots Evaluation Accuracy for pattern discovery Accuracy for the best window size Computation time SDM 2011Y. Sakurai et al.13

14 Pattern discovery Ondemand TV access count of users SDM 2011Y. Sakurai et al.14 Original sequence PCA: failed Anomaly spikes Weekly pattern Daily pattern

15 Pattern discovery WebClick Q & A site SDM 2011Y. Sakurai et al.15 Weekly pattern Low activity during sleeping time Dip at dinner time Increase from morning to night and reach a peak Increase from morning to night and reach a peak

16 Pattern discovery WebClick job-seeking site SDM 2011Y. Sakurai et al.16 High activity on week days (daily access decreases as the weekend approaches) High activity on week days (daily access decreases as the weekend approaches) Workers arrive at their office Job seeking during a short break Large spike during the lunch break

17 Pattern discovery WebClick other websites SDM 2011Y. Sakurai et al.17 Educational site for kids (they visit here after school, 3pm) Educational site for kids (they visit here after school, 3pm) Website for baby nursery (the main users will be their parents, rather than babies!) Website for baby nursery (the main users will be their parents, rather than babies!) High activity 8am-11pm, weekday (business purposes) High activity 8am-11pm, weekday (business purposes)

18 Pattern discovery WebClick other websites SDM 2011Y. Sakurai et al.18 The users visit three times a day (early morning, noon, early evening) The users visit three times a day (early morning, noon, early evening) The users rarely visit here late in the evening (which is indeed good for their health!) The users rarely visit here late in the evening (which is indeed good for their health!) Access count is still high in the night, 0am-1am (healthy diet should include an earlier bed time!) Access count is still high in the night, 0am-1am (healthy diet should include an earlier bed time!) Access count increases after meal times

19 Pattern discovery Generalization of WindMine SDM 2011Y. Sakurai et al.19

20 Choice of best window size CEM score for various window sizes SDM 2011Y. Sakurai et al.20

21 Computation time Wall clock time vs. # of subsequences -Up to 70 times faster SDM 2011Y. Sakurai et al.21

22 Computation time Wall clock time vs. duration SDM 2011Y. Sakurai et al.22

23 Conclusions Scalable pattern extraction and anomaly detection in large web-click sequences 1.Scalable, parallelizable method for breaking sequences into a few, fundamental ingredients 2.Linearly over the sequence duration, and near-linearly on the number of sequence SDM 2011Y. Sakurai et al.23


Download ppt "WindMine: Fast and Effective Mining of Web-click Sequences SDM 2011Y. Sakurai et al.1 Yasushi Sakurai (NTT) Lei Li (Carnegie Mellon Univ.) Yasuko Matsubara."

Similar presentations


Ads by Google