Jian Pei Jiawei Han Behzad Mortazavi-Asl Helen Pinto ICDE’01

1 PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth
Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto. ICDE’01

2 Outline
What is sequential pattern mining?
Our previous study
FreeSpan approach
PrefixSpan approach
Conclusion
2005/4/14 MAKING

3 Sequential Pattern Mining
Finding the complete set of frequent (time-related) subsequences
Broad applications: most data and applications are time-related
  Customer shopping sequences
  Natural disasters
  Disease and treatment
  Stock market fluctuation

4 Sequential Pattern Mining
A sequence: <(ef)(ab)(df)cb>
Elements: items within an element are listed alphabetically
A sequence database:
  SID  sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
  30   <(ef)(ab)(df)cb>
  40   <eg(af)cbc>
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a frequent sequential pattern (it is contained in sequences 10 and 30)
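The subsequence and support definitions on this slide can be checked mechanically. The following is a minimal sketch, assuming every item is a single lowercase letter; `parse`, `is_subsequence`, and `support` are illustrative names, not part of the original slides.

```python
def parse(text):
    """Turn a string such as 'a(abc)(ac)d(cf)' into a list of item sets."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def is_subsequence(pattern, sequence):
    """Greedy left-to-right match: each pattern element must be a subset
    of some later element of the sequence, preserving order."""
    i = 0
    for element in sequence:
        if i < len(pattern) and pattern[i] <= element:
            i += 1
    return i == len(pattern)


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())


# The sequence database from the slide, keyed by SID.
DB = {
    10: parse("a(abc)(ac)d(cf)"),
    20: parse("(ad)c(bc)(ae)"),
    30: parse("(ef)(ab)(df)cb"),
    40: parse("eg(af)cbc"),
}

print(is_subsequence(parse("a(bc)dc"), DB[10]))
print(support(parse("(ab)c"), DB))
```

Greedy matching is sufficient here because taking the earliest possible match for each pattern element never rules out a later match.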

5 Previous studies on sequential pattern mining
Concept introduction and an initial Apriori-like algorithm
  R. Agrawal & R. Srikant. Mining sequential patterns, ICDE’95
GSP: an Apriori-based, influential mining method
  R. Srikant & R. Agrawal. Mining sequential patterns: Generalizations and performance improvements, EDBT’96
A projection-based sequential pattern mining method
  J. Han, J. Pei, B. Mortazavi-Asl, et al. FreeSpan: Frequent pattern-projected sequential pattern mining, KDD’00

6 FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining
A divide-and-conquer approach
  Recursively project a sequence database into a set of smaller databases based on the current set of frequent patterns
  Mine each projected database to find its patterns

7 FreeSpan approach
Def: projected database
  Ex: b(ce)-projected database = <b(ce)b>
Step 1: finding f_list
  min_sup = 2
  SID  sequence
  10   <(bd)cb(ac)>
  20   <(bf)(ce)b(fg)>
  30   <(ah)(bf)abf>
  40   <(be)(ce)d>
  50   <a(bd)bcb(ade)>
  f_list = <b:5, c:4, a:3, d:3, e:3, f:2>
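Step 1 amounts to one scan that counts, for each item, how many sequences contain it, and then keeps the items that reach min_sup in descending order of support. Below is a rough sketch under the assumption that items are single lowercase letters; `build_f_list` is a hypothetical helper name, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter


def build_f_list(database, min_sup):
    """Count each item once per sequence, keep the frequent ones,
    and sort by descending support (ties broken alphabetically)."""
    counts = Counter()
    for text in database.values():
        counts.update({ch for ch in text if ch.isalpha()})
    frequent = [(item, n) for item, n in counts.items() if n >= min_sup]
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))


# The FreeSpan example database from the slide.
FREESPAN_DB = {
    10: "(bd)cb(ac)",
    20: "(bf)(ce)b(fg)",
    30: "(ah)(bf)abf",
    40: "(be)(ce)d",
    50: "a(bd)bcb(ade)",
}

print(build_f_list(FREESPAN_DB, min_sup=2))
```

On this table, f occurs only in sequences 20 and 30, so its support is 2; g and h each occur in a single sequence and are dropped.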

8 FreeSpan approach
Step 2: using the f_list to construct the following matrix
Table 1: Frequent item matrix after scan of S

9 FreeSpan approach
Step 3: using the matrix to generate the annotations on repeating items and projected DBs
Table 2: Pattern generation from the frequent item matrix

10 FreeSpan approach
Based on the annotations for item-repeating patterns
  We get {<bbf>:2, <fbf>:2, <(bf)b>:2, <(bf)f>:2, …}
Based on the annotations for projected DBs
  Table 3: Projected databases and their sequential patterns

11 PrefixSpan
Def: Prefix and Postfix (Projection)
<a>, <aa>, <a(ab)>, and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
Given sequence <a(abc)(ac)d(cf)>:
  Prefix  Postfix / Projection
  <a>     <(abc)(ac)d(cf)>
  <aa>    <(_bc)(ac)d(cf)>
  <ab>    <(_c)(ac)d(cf)>
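The three rows of the prefix/postfix table can be reproduced by projecting one item at a time. This is only a sketch, assuming sequences are lists of alphabetically sorted tuples and that the prefix grows by a new single-item element at each step; `project` is an illustrative name, and the "_" marker is stored as the first entry of a partial element.

```python
def project(seq, item):
    """Postfix of seq after the first full element that contains item.

    Items of that element that sort after `item` are kept in a partial
    element starting with "_", mirroring the (_bc) notation on the slide.
    Returns None when item does not occur.
    """
    for i, element in enumerate(seq):
        if element[0] == "_":
            # A partial element belongs to the previous prefix element,
            # so it cannot start a new element of the prefix.
            continue
        if item in element:
            rest = tuple(x for x in element if x > item)
            head = [("_",) + rest] if rest else []
            return head + list(seq[i + 1:])
    return None


# Sequence 10 from the slide: <a(abc)(ac)d(cf)>
s10 = [("a",), ("a", "b", "c"), ("a", "c"), ("d",), ("c", "f")]

after_a = project(s10, "a")       # postfix for prefix <a>
after_aa = project(after_a, "a")  # postfix for prefix <aa>
after_ab = project(after_a, "b")  # postfix for prefix <ab>
print(after_a)
print(after_aa)
print(after_ab)
```

Prefixes whose last element holds several items, such as <a(ab)>, need an additional same-element check that this sketch leaves out.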

12 PrefixSpan approach
  SID  sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
  30   <(ef)(ab)(df)cb>
  40   <eg(af)cbc>
Step 1: find length-1 sequential patterns
  <a>, <b>, <c>, <d>, <e>, <f>
Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets:
  The ones having prefix <a>;
  The ones having prefix <b>;
  …
  The ones having prefix <f>

13 PrefixSpan approach
Step 3: finding subsets of sequential patterns
Only need to consider projections w.r.t. <a>
  <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
Find all the length-2 sequential patterns having prefix <a>:
  <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
Further partition into 6 subsets:
  Having prefix <aa>;
  Having prefix <ab>;
  …
  Having prefix <af>
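One level of Step 3 can be sketched as: build the <a>-projected database, then count which items extend <a> either as a new element (<ax>) or inside the same element (<(ax)>). The code below is an illustrative sketch limited to a single-item prefix; `project` and `length2_patterns` are assumed names, and sequences are written as lists of sorted tuples.

```python
from collections import Counter

DB = [
    [("a",), ("a", "b", "c"), ("a", "c"), ("d",), ("c", "f")],   # SID 10
    [("a", "d"), ("c",), ("b", "c"), ("a", "e")],                # SID 20
    [("e", "f"), ("a", "b"), ("d", "f"), ("c",), ("b",)],        # SID 30
    [("e",), ("g",), ("a", "f"), ("c",), ("b",), ("c",)],        # SID 40
]


def project(seq, item):
    """Postfix after the first occurrence of item; leftover items of the
    matched element are kept in a partial element marked with "_"."""
    for i, element in enumerate(seq):
        if item in element:
            rest = tuple(x for x in element if x > item)
            head = [("_",) + rest] if rest else []
            return head + list(seq[i + 1:])
    return None


def length2_patterns(db, item, min_sup):
    """Frequent length-2 patterns whose prefix is the single item `item`."""
    seq_ext, set_ext = Counter(), Counter()
    for seq in db:
        postfix = project(seq, item)
        if postfix is None:
            continue
        s_items, i_items = set(), set()
        for element in postfix:
            if element[0] == "_":
                i_items.update(element[1:])      # same element as the prefix item
            else:
                s_items.update(element)          # starts a new element
                if item in element:              # later element also holding item
                    i_items.update(x for x in element if x > item)
        seq_ext.update(s_items)                  # count once per sequence
        set_ext.update(i_items)
    found = [f"<{item}{x}>" for x in sorted(seq_ext) if seq_ext[x] >= min_sup]
    found += [f"<({item}{x})>" for x in sorted(set_ext) if set_ext[x] >= min_sup]
    return found


a_projected = [project(seq, "a") for seq in DB]
print(a_projected)
print(length2_patterns(DB, "a", min_sup=2))
```

With min_sup = 2 this yields the same six patterns as the slide; <ae> is absent because e follows a only in sequence 20.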

14 Completeness of PrefixSpan
SDB
  SID  sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
  30   <(ef)(ab)(df)cb>
  40   <eg(af)cbc>
Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
  Having prefix <a>: <a>-projected database
      <(abc)(ac)d(cf)>
      <(_d)c(bc)(ae)>
      <(_b)(df)cb>
      <(_f)cbc>
    Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
      Having prefix <aa>: <aa>-proj. db …
      …
      Having prefix <af>: <af>-proj. db …
  Having prefix <b>: <b>-projected database …
  Having prefix <c>, …, <f>: …
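The recursion drawn on this slide can be sketched as a short pattern-growth routine. The sketch below is not the authors' implementation: instead of copying postfixes it keeps, for each sequence, the index of the element that matched the last prefix element (an index-pointer form of projection chosen here for brevity). The name `prefixspan` and the tuple-of-tuples pattern representation are assumptions of this example.

```python
from collections import defaultdict

DB = [
    [("a",), ("a", "b", "c"), ("a", "c"), ("d",), ("c", "f")],   # SID 10
    [("a", "d"), ("c",), ("b", "c"), ("a", "e")],                # SID 20
    [("e", "f"), ("a", "b"), ("d", "f"), ("c",), ("b",)],        # SID 30
    [("e",), ("g",), ("a", "f"), ("c",), ("b",), ("c",)],        # SID 40
]


def prefixspan(db, min_sup):
    """Return {pattern: support}; a pattern is a tuple of item tuples."""
    results = {}

    def grow(pattern, proj):
        # proj: (sequence index, index of the element matching the last
        # pattern element under leftmost matching; -1 for the empty pattern)
        s_ext = defaultdict(dict)   # item -> {sid: position}, new element
        i_ext = defaultdict(dict)   # item -> {sid: position}, same element
        last = set(pattern[-1]) if pattern else set()
        for sid, j in proj:
            seq = db[sid]
            for k in range(j + 1, len(seq)):
                for x in seq[k]:
                    s_ext[x].setdefault(sid, k)       # keep earliest position
            if pattern:
                for k in range(j, len(seq)):
                    if last <= set(seq[k]):
                        for x in seq[k]:
                            if x > max(last):         # keep items sorted
                                i_ext[x].setdefault(sid, k)
        for x, occ in sorted(s_ext.items()):
            if len(occ) >= min_sup:
                new = pattern + ((x,),)
                results[new] = len(occ)
                grow(new, list(occ.items()))
        for x, occ in sorted(i_ext.items()):
            if len(occ) >= min_sup:
                new = pattern[:-1] + (pattern[-1] + (x,),)
                results[new] = len(occ)
                grow(new, list(occ.items()))

    grow((), [(sid, -1) for sid in range(len(db))])
    return results


patterns = prefixspan(DB, min_sup=2)
length1 = sorted(p for p in patterns if sum(len(e) for e in p) == 1)
prefix_a_len2 = sorted(
    p for p in patterns if sum(len(e) for e in p) == 2 and p[0][0] == "a"
)
print(length1)
print(prefix_a_len2)
```

Each recursive call handles exactly one of the "having prefix …" branches on the slide, so every frequent pattern is reached through exactly one chain of prefixes.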

15 Conclusion: FreeSpan vs. PrefixSpan
FreeSpan:
  Projection-based: no candidate sequence needs to be generated
  Projection can be performed at any point in the sequence, and the projected sequences will not shrink much
PrefixSpan:
  Projection-based
  Only prefix-based projection: fewer projections and quickly shrinking sequences

