
Verifying and Mining Frequent Patterns from Large Windows over Data Streams. Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo. ICDE 2008.


1 Verifying and Mining Frequent Patterns from Large Windows over Data Streams. Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo. ICDE 2008

2 Outline: Introduction and motivation; SWIM algorithm; DTV and DFV algorithms; Experiments; Conclusion

3 Introduction and motivation. Conditional counting verifiers (DTV, DFV): verify the frequency of previously frequent itemsets over newly arriving windows. A fast verifier for incremental frequent itemset mining: Sliding Window Incremental Miner (SWIM).

4 SWIM algorithm. The difficulty: when a new pattern is added to the pattern tree for the first time, its true frequency in the whole window is not known, since this pattern wasn't frequent in the previous n-1 slides. W: window. PT (pattern tree): a superset of the frequent patterns over W. aux_array: stores the frequency of a pattern for each window for which the frequency is unknown. p.fi: the frequency of p in the i-th slide. p.freq: p's cumulative frequency in the current window.

5 SWIM algorithm (cont.) Example (slides S2, S3, S4, S5, S6, S7, ...; windows W4, W5, W6, W7). W4: aux_array= (not shown), p.freq = p.f4. W5: aux_array= (not shown), p.freq = p.f4 + p.f5. W6: aux_array= (not shown), p.freq = p.f4 + p.f5 + p.f6. W7: p.freq = p.f5 + p.f6 + p.f7.
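The bookkeeping in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the window is taken to hold n = 3 slides (as the W4..W7 example suggests), the class name TrackedPattern and the per-slide counts are made up, and aux_array is not modeled. Only p.freq is modeled, as the sum of the per-slide counts known so far.

```python
from collections import deque


class TrackedPattern:
    """Hypothetical helper: per-slide counts of one pattern, for at most n slides."""

    def __init__(self, n_slides):
        # A bounded deque drops the oldest slide's count once n counts are stored.
        self.counts = deque(maxlen=n_slides)

    def add_slide(self, count):
        """Record p.f_i for the newest slide."""
        self.counts.append(count)

    @property
    def freq(self):
        """p.freq: cumulative frequency over the slides counted so far."""
        return sum(self.counts)

    @property
    def complete(self):
        """True once counts for all n slides of the current window are known."""
        return len(self.counts) == self.counts.maxlen


# Made-up per-slide frequencies p.f4 .. p.f7 for a pattern first seen in S4.
f = {4: 5, 5: 2, 6: 4, 7: 1}
p = TrackedPattern(n_slides=3)
history = {}
for slide in (4, 5, 6, 7):
    p.add_slide(f[slide])
    history[slide] = (p.freq, p.complete)
```

With these made-up counts, p.freq covers only part of the window at W4 and W5, becomes a full-window frequency at W6, and at W7 the count for S4 has dropped out, matching p.freq = p.f5 + p.f6 + p.f7.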

6 Analysis of SWIM algorithm. Delay: a pattern is reported late when its frequency over the whole window turns out to be larger than the minimum support. Maximum delay: n-1 slides (n: number of slides per window). Bottleneck: counting frequencies of itemsets over a given dataset (delay = L, n-L+1 slides).

7 Conditional counting. Goal: verify counts for a given set of patterns. For each pattern p, either (1) return p's true frequency in D if it has occurred at least min_freq times, or (2) report that it has occurred fewer than min_freq times (the exact frequency is not required in this case, so the verifier can skip any pattern whose frequency is less than min_freq).

8 Conditional counting (cont.) Verification: given a set of transactions T, a set of patterns P, and a threshold s, the goal is to find the exact frequency of each p ∈ P w.r.t. T if and only if its frequency is ≥ s. If s = 0, verification = counting, but if s > 0, extra computation can be avoided. Proposed fast verifiers: DTV, DFV, hybrid.
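The verification contract above can be illustrated with a naive Python sketch. It is not DTV or DFV: it scans a flat list of transactions, and the function name verify and the sample data are assumptions made for illustration. It only shows how a threshold s > 0 lets a verifier stop early, while s = 0 degenerates to plain counting.

```python
def verify(transactions, patterns, s):
    """Naive verifier sketch (hypothetical, not the paper's algorithm).

    Returns {pattern: exact count} when the count is >= s,
    and {pattern: None} when the pattern occurs fewer than s times.
    """
    transactions = [frozenset(t) for t in transactions]
    result = {}
    for p in patterns:
        items = frozenset(p)
        count, remaining = 0, len(transactions)
        for t in transactions:
            # Even if every remaining transaction matched, s is out of reach.
            if count + remaining < s:
                break
            remaining -= 1
            if items <= t:
                count += 1
        result[items] = count if count >= s else None
    return result


# Made-up data for illustration.
T = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c", "d"}, {"a", "b", "d"}]
P = [{"a", "b"}, {"c", "d"}, {"b"}]
verified = verify(T, P, s=2)
counted = verify(T, P, s=0)
```

Here verified reports exact counts only for the patterns that reach s = 2, while counted (s = 0) returns the exact count of every pattern.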

9 Double-Tree Verifier (DTV) [Figure: worked example on an FP-tree and a pattern tree. Panels: Original fp-tree; Conditionalized fp-tree on g; Conditionalized fp-tree |g on d; Initial pattern tree; pattern tree | "g"; pattern tree | "g" after verification against FP-tree; Filling original pattern tree using reverse pointers.]

10 Double-Tree Verifier (DTV). For very small min_freq values, it becomes impossible to run FP-growth due to the exponential number of paths. Advantage: DTV is useful when the minimum support decreases.

11

12 Depth-First Verifier (DFV). Ancestor Failure: if a path in the fp-tree has already been proved not to contain a prefix of the pattern p, then it does not contain p itself either (Apriori property). Smaller Sibling Equivalence: if a path in the fp-tree has already been marked to (or not to) contain a smaller sibling of the pattern p, then it does (or does not) contain p itself too. Parent Success: if a path in the fp-tree has already been marked to contain the parent pattern of p, then it also contains p.

13

14 Hybrid Version. Many transactions in the fp-tree and many patterns in the pattern tree: DTV is faster than DFV. Trees are small: DFV is faster than DTV. Hybrid: start with DTV until the conditionalized trees are "small enough", and after that point switch to DFV.
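The control flow of such a switch might look like the Python sketch below. It is a loose illustration, not the authors' implementation: it works on flat lists instead of fp-trees and pattern trees, the conditionalization step only imitates DTV, a direct scan stands in for DFV, and the function name hybrid_verify and the small_enough parameter (with its default value) are hypothetical.

```python
def hybrid_verify(transactions, patterns, s, small_enough=8):
    """Hypothetical hybrid sketch: conditionalize while the problem is large,
    then fall back to a direct scan (a stand-in for DFV) once it is small.

    Returns {pattern: exact count} if count >= s, else {pattern: None}.
    """
    transactions = [frozenset(t) for t in transactions]
    patterns = {frozenset(p) for p in patterns}
    if len(transactions) < s:
        # Too few transactions for any pattern to reach the threshold.
        return {p: None for p in patterns}
    if len(transactions) * len(patterns) <= small_enough:
        # "Small enough": count directly.
        counts = {p: sum(1 for t in transactions if p <= t) for p in patterns}
        return {p: (c if c >= s else None) for p, c in counts.items()}
    result, groups = {}, {}
    for p in patterns:
        if not p:
            # The empty pattern matches every transaction (and len >= s here).
            result[p] = len(transactions)
        else:
            groups.setdefault(max(p), set()).add(p)
    for item, group in groups.items():
        # Conditionalize on `item`: keep only transactions that contain it,
        # and drop it from the patterns that end with it.
        projected = [t for t in transactions if item in t]
        sub = hybrid_verify(projected, {p - {item} for p in group}, s, small_enough)
        for p in group:
            result[p] = sub[p - {item}]
    return result


# Made-up data for illustration.
T = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c", "d"}, {"a", "b", "d"}]
P = [{"a", "b"}, {"c", "d"}, {"b"}, {"a", "b", "d"}]
always_conditionalize = hybrid_verify(T, P, s=2, small_enough=0)
always_direct = hybrid_verify(T, P, s=2, small_enough=1000)
```

Both extremes of small_enough return the same answers here; the parameter only changes how the work is split between the two strategies, which is why its value is a tuning choice rather than a fixed constant.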

15 Experiments

16 Experiments (cont.) transaction=100k

17 Conclusion. The verifiers can speed up many other applications: incremental mining (SWIM); enhancing static algorithms (counting phase); privacy-preserving techniques (long transactions); monitoring / concept shift detection. Hybrid: there is no exact point at which to switch from DTV to DFV.

