
Data Mining: Concepts and Techniques
Mining Sequence Patterns in Transactional Databases
CS240B, UCLA. Notes by Carlo Zaniolo, based on those by J. Han.



2 Sequence Databases & Sequential Patterns
- Transaction databases, time-series databases vs. sequence databases
- Frequent patterns vs. (frequent) sequential patterns
- Applications of sequential pattern mining
  - Customer shopping sequences:
    - First buy computer, then CD-ROM, and then digital camera, within 3 months.
  - Medical treatments, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets, etc.
  - Telephone calling patterns, Weblog click streams
  - DNA sequences and gene structures

3 What Is Sequential Pattern Mining?
- Given a set of sequences, find the complete set of frequent subsequences
- A sequence: [example not preserved in the transcript]
  - An element may contain a set of items.
  - Items within an element are unordered and we list them alphabetically.
- [Example not preserved] is a subsequence of [example not preserved]
- A sequence database with SIDs 10, 20, 30, 40 [sequences not preserved in the transcript]

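4 Subsequence
- [Example not preserved] is a subsequence of [example not preserved]
- Def: S1 is a subsequence of S2 if S1 can be obtained from S2 by eliminating some of its elements, or some of the items within its elements.
- The subsequence relation is a partial order, not a lattice: there are no proper union and intersection operations.
- A sequence database with SIDs 10, 20, 30, 40 [sequences not preserved in the transcript]
- The pattern [not preserved] has support 2 in our database.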
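The subsequence test and the support count can be sketched in a few lines of Python. This is only an illustration: the helper names (`seq`, `is_subsequence`, `support`) and the toy database are assumptions made up for this sketch, not the database from the slides, and items are assumed to be single characters.

```python
from typing import FrozenSet, Iterable, List

Element = FrozenSet[str]   # one element = an unordered set of items
Seq = List[Element]        # a sequence = an ordered list of elements


def seq(*elements: str) -> Seq:
    """Build a sequence; each string is one element, each character one item."""
    return [frozenset(e) for e in elements]


def is_subsequence(s1: Seq, s2: Seq) -> bool:
    """True if every element of s1 is contained in a distinct element of s2,
    with the order of elements preserved."""
    pos = 0
    for elem in s1:
        # Greedily take the earliest remaining element of s2 that contains elem.
        while pos < len(s2) and not elem <= s2[pos]:
            pos += 1
        if pos == len(s2):
            return False
        pos += 1
    return True


def support(pattern: Seq, database: Iterable[Seq]) -> int:
    """Number of database sequences that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in database)


# Hypothetical toy database (NOT the one shown on the slides).
toy_db = [
    seq("a", "bc", "d"),
    seq("ab", "c"),
    seq("c", "ab"),
    seq("ab", "d", "c"),
]

print(support(seq("ab", "c"), toy_db))
```

Matching each pattern element against the earliest possible element of the data sequence is sufficient here, because an earlier match never rules out a later one.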

5 The Apriori Property of Sequential Patterns
- A basic property: Apriori (Agrawal & Srikant '94)
  - If a sequence S is not frequent,
  - then none of the super-sequences of S is frequent: antimonotonicity
  - E.g., if [sequence not preserved] is infrequent, so are its super-sequences [examples not preserved]
- Example database: Seq. IDs 10, 20, 30, 40, 50 [sequences not preserved in the transcript]
- Given support threshold min_sup = 2

6 GSP: Generalized Sequential Pattern Mining
- GSP (Generalized Sequential Pattern) mining algorithm
  - proposed by Agrawal and Srikant, EDBT'96
- Outline of the method
  - Initially, every item in the DB is a candidate of length 1
  - For each level (i.e., sequences of length k) do
    - scan the database to collect the support count for each candidate sequence
    - generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
  - Repeat until no frequent sequence or no candidate can be found
- Major strength: candidate pruning by Apriori
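The level-wise loop outlined above can be sketched as follows. This is a deliberately simplified illustration, not the full GSP algorithm: it assumes every element holds a single item (so a sequence is just a tuple of items), and it ignores the time constraints and taxonomies of the original paper. The function names and the toy database are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Pattern = Tuple[str, ...]


def contains(sequence: Tuple[str, ...], pattern: Pattern) -> bool:
    """True if pattern occurs in sequence as a (possibly gapped) subsequence."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def gsp_single_items(database: List[Tuple[str, ...]], min_sup: int) -> Dict[Pattern, int]:
    """Simplified level-wise mining; returns each frequent pattern with its support."""
    frequent: Dict[Pattern, int] = {}
    # Level 1: every item in the database is a candidate.
    candidates: Set[Pattern] = {(item,) for s in database for item in s}
    while candidates:
        # One database scan per level to collect support counts.
        counts = {c: sum(contains(s, c) for s in database) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        if not level:
            break
        frequent.update(level)
        # Join: p and q combine when p without its first item equals
        # q without its last item.
        joined = {p + (q[-1],) for p in level for q in level if p[1:] == q[:-1]}
        # Apriori prune: every subsequence obtained by dropping one item
        # must itself be frequent.
        candidates = {
            c for c in joined
            if all(c[:i] + c[i + 1:] in level for i in range(len(c)))
        }
    return frequent


# Hypothetical toy database (NOT the one shown on the slides).
toy_db = [("a", "b", "c"), ("a", "b", "c"), ("a", "c"), ("b", "c")]
patterns = gsp_single_items(toy_db, min_sup=2)
for pattern, sup in sorted(patterns.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print("".join(pattern), sup)
```

Each pass of the `while` loop corresponds to one database scan, which is why the number of scans grows with the length of the longest pattern.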

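7 Finding Length-1 Sequential Patterns
- Examine GSP using an example
- Initial candidates: all singleton sequences
  - [the 8 singleton candidates are not preserved in the transcript]
- Scan database once, count support for candidates
- Example database: Seq. IDs 10, 20, 30, 40, 50 [sequences not preserved], min_sup = 2
- Candidate supports (Cand | Sup), candidate names not preserved: 3, 5, 4, 3, 3, 2, 1, 1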

8 GSP: Generating Length-2 Candidates
- 51 length-2 candidates: with the 6 frequent items, 6*6 + 6*5/2 = 51
- Without the Apriori property: 8*8 + 8*7/2 = 92 candidates
- Apriori prunes 44.57% of the candidates
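The counts on this slide can be checked with a short calculation. The sketch assumes the usual reading of the formula: n*n ordered two-element candidates (the same item may repeat) plus n*(n-1)/2 candidates with both items in one element. The function name is made up for this illustration.

```python
def length2_candidates(n_items: int) -> int:
    """Ordered pairs of single-item elements plus unordered pairs in one element."""
    return n_items * n_items + n_items * (n_items - 1) // 2


with_apriori = length2_candidates(6)     # only the 6 frequent items are combined
without_apriori = length2_candidates(8)  # all 8 items are combined
pruned = 1 - with_apriori / without_apriori

print(with_apriori, without_apriori, f"{pruned:.2%}")  # 51 92 44.57%
```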

9 The GSP Mining Process
- 1st scan: 8 candidates, 6 length-1 sequential patterns
- 2nd scan: 51 candidates, 19 length-2 sequential patterns; 10 candidates not in the DB at all
- 3rd scan: 46 candidates, 19 length-3 sequential patterns; 20 candidates not in the DB at all
- 4th scan: 8 candidates, 6 length-4 sequential patterns
- 5th scan: 1 candidate, 1 length-5 sequential pattern
- Figure legend: candidates that cannot pass the support threshold; candidates not in the DB at all
- Example database: Seq. IDs 10, 20, 30, 40, 50 [sequences not preserved], min_sup = 2

10 Candidate Generate-and-test: Drawbacks
- A huge set of candidate sequences is generated.
  - Especially 2-item candidate sequences.
- Multiple scans of the database are needed.
  - The length of each candidate grows by one at each database scan.
- Inefficient for mining long sequential patterns.
  - A long pattern grows from short patterns.
  - The number of short patterns is exponential in the length of the mined patterns.
  - Windows can be used to limit the search.
  - Maximum intervals can be imposed between items.
- No efficient algorithm at hand for data streams.

11 From Sequential Patterns to Structured Patterns
- Sets, sequences, trees, graphs, and other structures
  - Transaction DB: sets of items
    - {{i_1, i_2, …, i_m}, …}
  - Seq. DB: sequences of sets
    - {[sequence notation not preserved], …}
  - Sets of sequences
    - {{[sequence notation not preserved], …}, …}
  - Sets of trees: {t_1, t_2, …, t_n}
  - Sets of graphs (mining for frequent subgraphs):
    - {g_1, g_2, …, g_n}
- Mining structured patterns in XML documents, biochemical structures, etc.

12 Episodes and Episode Pattern Mining
- Other methods for specifying the kinds of patterns
  - Serial episodes: A → B
  - Parallel episodes: A & B
  - Regular expressions: (A | B)C*(D → E)
- Methods for episode pattern mining
  - Variations of Apriori-like algorithms, e.g., GSP
  - Database projection-based pattern growth
    - Similar to the frequent pattern growth without candidate generation
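The difference between serial and parallel episodes can be made concrete with a small window-based check. This sketch is loosely modeled on the window frequency used in the frequent-episodes literature, but the exact window boundaries, the function names, and the event data are assumptions chosen for illustration; the parallel check also assumes the episode lists distinct event types.

```python
from typing import List, Sequence, Tuple

Event = Tuple[int, str]  # (integer timestamp, event type)


def occurs_serial(window: Sequence[str], episode: Sequence[str]) -> bool:
    """A -> B: the episode's events appear in the window in the given order."""
    it = iter(window)
    return all(e in it for e in episode)


def occurs_parallel(window: Sequence[str], episode: Sequence[str]) -> bool:
    """A & B: all of the episode's events appear in the window, in any order."""
    return set(episode) <= set(window)


def episode_frequency(events: List[Event], episode: Sequence[str],
                      width: int, serial: bool) -> float:
    """Fraction of sliding windows [t, t + width) that contain the episode.
    Windows start at every integer time between the first and last event."""
    events = sorted(events)
    start, end = events[0][0], events[-1][0]
    hits = 0
    total = 0
    for t in range(start, end + 1):
        window = [e for ts, e in events if t <= ts < t + width]
        check = occurs_serial if serial else occurs_parallel
        hits += check(window, episode)
        total += 1
    return hits / total


# Hypothetical event sequence.
toy_events = [(1, "A"), (2, "B"), (4, "B"), (5, "A"), (6, "C")]
print(episode_frequency(toy_events, ["A", "B"], width=3, serial=True))
print(episode_frequency(toy_events, ["A", "B"], width=3, serial=False))
```

A serial episode is never more frequent than the parallel episode over the same events, since every window containing A followed by B also contains both A and B.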

13 Periodicity Analysis
- Periodicity is everywhere: tides, seasons, daily power consumption, etc.
- Full periodicity
  - Every point in time contributes (precisely or approximately) to the periodicity
- Partial periodicity: a more general notion
  - Only some segments contribute to the periodicity
    - Jim reads the NY Times 7:00-7:30 am every weekday
- Cyclic association rules
  - Associations which form cycles
- Methods
  - Full periodicity: FFT, other statistical analysis methods
  - Partial and cyclic periodicity: variations of Apriori-like mining methods
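Partial periodicity can be illustrated by measuring how often a pattern with don't-care positions holds across the period segments of a series. The sketch below assumes a known period, one symbol per time slot, and `*` as the don't-care symbol; the function name and the sample series are hypothetical. It only scores a given pattern and does not perform the Apriori-like search for frequent patterns.

```python
def pattern_confidence(series: str, pattern: str) -> float:
    """Fraction of full period segments of `series` that match `pattern`.
    The period is len(pattern); '*' in the pattern matches any symbol."""
    period = len(pattern)
    n_segments = len(series) // period
    if n_segments == 0:
        raise ValueError("series is shorter than one period")
    matches = 0
    for k in range(n_segments):
        segment = series[k * period:(k + 1) * period]
        if all(p == "*" or p == s for p, s in zip(pattern, segment)):
            matches += 1
    return matches / n_segments


# Hypothetical series with period 3: segments abc, abd, acc, abe.
toy_series = "abcabdaccabe"
print(pattern_confidence(toy_series, "a**"))  # first slot is always 'a'
print(pattern_confidence(toy_series, "ab*"))  # holds in 3 of the 4 segments
```

Only the fixed positions of the pattern contribute to the match, which mirrors the idea that only some segments of each period need to be regular.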

14 Sequential Pattern Mining Algorithms
- Concept introduction and an initial Apriori-like algorithm
  - Agrawal & Srikant. Mining sequential patterns, ICDE'95
- Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT'96)
- Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD'00; Pei et al. @ ICDE'01)
- Vertical format-based mining: SPADE (Zaki @ Machine Learning'01)
- Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB'99; Pei, Han, Wang @ CIKM'02)
- Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM'03)

15 Ref: Mining Sequential Patterns
- R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96.
- H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI'97.
- M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning, 2001.
- J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01 (TKDE'04).
- J. Pei, J. Han, and W. Wang. Constraint-Based Sequential Pattern Mining in Large Databases. CIKM'02.
- X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM'03.
- J. Wang and J. Han. BIDE: Efficient Mining of Frequent Closed Sequences. ICDE'04.
- H. Cheng, X. Yan, and J. Han. IncSpan: Incremental Mining of Sequential Patterns in Large Database. KDD'04.
- J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. ICDE'99.
- J. Yang, W. Wang, and P. S. Yu. Mining asynchronous periodic patterns in time series data. KDD'00.

