Published by Oswald Bates. Modified over 8 years ago.
1
Mining Sequential Patterns
Rakesh Agrawal, Ramakrishnan Srikant
Proc. of the Int'l Conference on Data Engineering (ICDE), March 1995
Presenter: İlkcan Keleş
2
Introduction
Problem Definition
Finding Sequential Patterns (The Main Algorithm)
Sequence Phase (AprioriAll, AprioriSome, DynamicSome)
Experiments and Results
Conclusion
3
Progress in barcode technology has made it possible to collect and store massive amounts of sales data (basket data).
A transaction record typically consists of:
◦ the transaction date
◦ the items bought
We would like to extract sequential buying patterns.
◦ E.g., video rental: Star Wars; Empire Strikes Back; Return of the Jedi
Elements of a sequence need not be simple items; an element can be a set of items.
◦ E.g., sheets and pillow cases; comforter; drapes and ruffles
4
Association rules refer to which items are bought together.
◦ Intra-transaction patterns.
◦ The Apriori algorithm deals with association rules.
Sequential patterns refer to which items are bought at different times.
◦ Inter-transaction patterns.
◦ This paper proposes three different algorithms for sequential pattern mining.
5
We are given a database D of customer transactions. Each transaction consists of:
◦ Customer ID
◦ Transaction time
◦ Set of items purchased
No two transactions have the same customer ID and transaction time.
Quantities of items bought in a transaction are not considered: an item was either purchased or not purchased.
6
Itemset: a nonempty set of items.
◦ It is assumed that the set of items is mapped to a set of contiguous integers.
Sequence: an ordered list of itemsets.
◦ Contained in: a sequence <a_1 a_2 … a_n> is contained in another sequence <b_1 b_2 … b_m> if there exist integers i_1 < i_2 < … < i_n such that a_1 ⊆ b_{i_1}, a_2 ⊆ b_{i_2}, …, a_n ⊆ b_{i_n}.
Customer sequence: the list of a customer's transactions ordered by increasing transaction time.
◦ A customer supports a sequence if the sequence is contained in that customer's sequence.
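The "contained in" test can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `is_contained` and the representation of a sequence as a list of Python sets are assumptions made here. Greedily matching each itemset to the earliest usable position is enough to decide containment.

```python
def is_contained(a, b):
    """Return True if sequence a is contained in sequence b.

    Both arguments are lists of sets (itemsets), in sequence order.
    The name and representation are illustrative assumptions.
    """
    pos = 0  # next position in b that may still be used
    for itemset in a:
        # Take the earliest remaining element of b that is a superset of itemset.
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1  # the indices i_1 < i_2 < ... must be strictly increasing
    return True


print(is_contained([{3}, {4, 5}, {8}], [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]))  # True
print(is_contained([{3}, {5}], [{3, 5}]))  # False: 3 and 5 share one itemset
```

The second call shows why order matters: buying 3 and 5 together is not the same as buying 3 and later 5.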
7
Support for a sequence: the fraction of total customers who support this sequence.
Maximal sequence: in a set of sequences, a sequence s is maximal if s is not contained in any other sequence.
Large sequence: a sequence that satisfies the minimum support constraint.
Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support.
8
Database sorted by customer ID and transaction time:
Customer Id | Transaction Time | Items Bought
1 | June 25 '93 | 30
1 | June 30 '93 | 90
2 | June 10 '93 | 10, 20
2 | June 15 '93 | 30
2 | June 20 '93 | 40, 60, 70
3 | June 25 '93 | 30, 50, 70
4 | June 25 '93 | 30
4 | June 30 '93 | 40, 70
4 | July 25 '93 | 90
5 | June 12 '93 | 90
9
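Customer Id | Customer Sequence
1 | <(30) (90)>
2 | <(10 20) (30) (40 60 70)>
3 | <(30 50 70)>
4 | <(30) (40 70) (90)>
5 | <(90)>
Minimum support of 40%, i.e., a minimum support of 2 customers.
<(10 20) (30)> is only supported by customer 2.
Sequences such as <(30)>, <(40 70)>, and <(30) (40)> have minimum support but are not maximal.
Sequential patterns with support ≥ 40%:
<(30) (90)>
<(30) (40 70)>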
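The support figures for this example can be checked with a short Python sketch. The five customer sequences are read off the example transaction database; the helper names `is_contained` and `support` are assumptions introduced here for illustration.

```python
from fractions import Fraction


def is_contained(a, b):
    """True if sequence a (list of sets) is contained in sequence b."""
    pos = 0
    for itemset in a:
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1
    return True


def support(seq, customer_sequences):
    """Fraction of customers whose sequence contains seq."""
    hits = sum(1 for c in customer_sequences if is_contained(seq, c))
    return Fraction(hits, len(customer_sequences))


# Customer sequences of the example database (customers 1 to 5).
CUSTOMERS = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]

print(support([{30}, {90}], CUSTOMERS))       # 2/5, customers 1 and 4
print(support([{30}, {40, 70}], CUSTOMERS))   # 2/5, customers 2 and 4
print(support([{10, 20}, {30}], CUSTOMERS))   # 1/5, customer 2 only
```

With a 40% threshold, the first two sequences qualify and the third does not.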
10
Terminology:
◦ The length of a sequence is the number of itemsets in the sequence. A sequence of length k is called a k-sequence.
◦ The itemset i and the 1-sequence <i> have the same support.
  Recall: the support for an itemset is the fraction of customers who bought the items of the itemset in a single transaction.
◦ An itemset with minimum support is called a large itemset or litemset.
◦ Any large sequence must be a list of litemsets: each itemset in a large sequence must have minimum support.
11
Phases:
◦ Sort Phase
◦ Litemset Phase
◦ Transformation Phase
◦ Sequence Phase
◦ Maximal Phase
12
Sort the database:
◦ Customer ID as the major key.
◦ Transaction time as the minor key.
This implicitly converts the original transaction database into a database of customer sequences.
Example: the database shown on slide 8 is already sorted this way.
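A minimal Python sketch of the sort phase, using the example transactions in shuffled order. The tuple layout `(customer_id, time, items)`, the function name `sort_phase`, and reading '93 as the year 1993 are assumptions made for this illustration.

```python
from datetime import date
from itertools import groupby

# Example transactions, deliberately out of order.
TRANSACTIONS = [
    (4, date(1993, 7, 25), {90}),
    (2, date(1993, 6, 20), {40, 60, 70}),
    (1, date(1993, 6, 30), {90}),
    (2, date(1993, 6, 10), {10, 20}),
    (5, date(1993, 6, 12), {90}),
    (4, date(1993, 6, 25), {30}),
    (1, date(1993, 6, 25), {30}),
    (3, date(1993, 6, 25), {30, 50, 70}),
    (2, date(1993, 6, 15), {30}),
    (4, date(1993, 6, 30), {40, 70}),
]


def sort_phase(transactions):
    """Sort by (customer id, time) and group into customer sequences."""
    ordered = sorted(transactions, key=lambda t: (t[0], t[1]))
    return {
        cid: [items for _, _, items in group]
        for cid, group in groupby(ordered, key=lambda t: t[0])
    }


sequences = sort_phase(TRANSACTIONS)
print(sequences[2])  # customer 2: (10 20), then (30), then (40 60 70)
```

Each dictionary value is one customer sequence, ready for the later phases.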
13
The set of all litemsets L is found.
◦ Simultaneously, the set of all large 1-sequences is found, since it is just {<l> | l ∈ L}.
Any algorithm for finding the large itemsets in transactions can be adapted.
◦ The main difference is the definition of support.
  In those algorithms, support is the fraction of transactions that include the itemset.
  In the sequential pattern problem, support is instead the fraction of customers.
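The change in the support definition can be illustrated with a deliberately naive Python sketch: each itemset is counted at most once per customer, however many of that customer's transactions contain it. This brute-force subset enumeration is an assumption made for brevity and is not the Apriori-style method the slide refers to; the name `large_itemsets` is likewise hypothetical.

```python
from itertools import combinations


def large_itemsets(customer_sequences, min_count):
    """Count each itemset once per customer; keep those reaching min_count."""
    counts = {}
    for seq in customer_sequences:
        seen = set()  # itemsets this customer bought in some single transaction
        for transaction in seq:
            items = sorted(transaction)
            for size in range(1, len(items) + 1):
                for combo in combinations(items, size):
                    seen.add(frozenset(combo))
        for itemset in seen:
            counts[itemset] = counts.get(itemset, 0) + 1
    return {s: n for s, n in counts.items() if n >= min_count}


CUSTOMERS = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]

for itemset, n in sorted(large_itemsets(CUSTOMERS, 2).items(),
                         key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), n)
```

On the example database with a minimum of 2 customers this yields the five litemsets (30), (40), (40 70), (70) and (90).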
14
The set of litemsets is mapped to a set of contiguous integers.
◦ The mapping lets us treat litemsets as single entities.
  Two litemsets can be compared for equality in constant time.
  The time required to check whether a sequence is contained in a customer sequence is reduced.
Example (for the database on slide 8, with a minimum support of 2 customers):
Large Itemset | Mapped To
(30) | 1
(40) | 2
(70) | 3
(40 70) | 4
(90) | 5
15
Each customer sequence is transformed into an alternative representation.
◦ The goal is to speed up the test of whether a sequence is contained in a customer sequence.
Each transaction is replaced by the set of all litemsets contained in that transaction.
◦ If a transaction does not contain any litemset, it is not retained in the transformed sequence.
◦ If a customer sequence does not contain any litemset, the sequence is dropped from the transformed database.
  It still contributes to the count of the total number of customers.
16
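Customer Id | Original Customer Sequence | Transformed Customer Sequence | After Mapping
1 | <(30) (90)> | <{(30)} {(90)}> | <{1} {5}>
2 | <(10 20) (30) (40 60 70)> | <{(30)} {(40), (70), (40 70)}> | <{1} {2, 3, 4}>
3 | <(30 50 70)> | <{(30), (70)}> | <{1, 3}>
4 | <(30) (40 70) (90)> | <{(30)} {(40), (70), (40 70)} {(90)}> | <{1} {2, 3, 4} {5}>
5 | <(90)> | <{(90)}> | <{5}>
(10 20) is dropped since it does not contain any litemset.
(40 60 70) is replaced by {(40), (70), (40 70)} since 60 does not have minimum support, i.e., it is not a litemset of this database.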
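The transformation can be sketched as follows, using the litemset-to-integer mapping of the example. The function name `transform` and the dictionary `LITEMSET_IDS` are illustrative assumptions; the caller is expected to remember the original number of customers, since dropped sequences still count toward the total.

```python
# Mapping from the example: (30)->1, (40)->2, (70)->3, (40 70)->4, (90)->5.
LITEMSET_IDS = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}

CUSTOMERS = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]


def transform(customer_sequences, litemset_ids):
    """Replace each transaction by the ids of the litemsets it contains."""
    transformed = []
    for seq in customer_sequences:
        new_seq = []
        for transaction in seq:
            ids = {i for s, i in litemset_ids.items() if s <= transaction}
            if ids:  # transactions without any litemset are not retained
                new_seq.append(ids)
        if new_seq:  # sequences without any litemset are dropped
            transformed.append(new_seq)
    return transformed


for row in transform(CUSTOMERS, LITEMSET_IDS):
    print(row)
```

Customer 2 loses the transaction (10 20), and (40 60 70) becomes the id set {2, 3, 4}.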
17
Finds the large sequences, using one of:
◦ AprioriAll
◦ AprioriSome
◦ DynamicSome
We will come back to this phase later.
18
Find the maximal sequences among the large sequences.
◦ In some algorithms, this phase is combined with the sequence phase.
◦ k-sequence: a sequence of length k.
◦ S: the set of all large sequences found in the previous phase; n: the length of the longest sequence in S.
  for (k = n; k > 1; k--) do
    foreach k-sequence s_k do
      Delete from S all subsequences of s_k
The authors state that data structures (hash tree) and algorithms exist to do this efficiently.
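As a rough illustration of what the maximal phase computes, the sketch below applies the definition of a maximal sequence directly with a quadratic pairwise check. It is not the hash-tree method mentioned on the slide, and the names `is_contained` and `maximal_sequences` are assumptions. The input is the set of large sequences of the example database at a minimum support of 2 customers.

```python
def is_contained(a, b):
    """True if sequence a (list of sets) is contained in sequence b."""
    pos = 0
    for itemset in a:
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1
    return True


def maximal_sequences(sequences):
    """Keep the sequences that are not contained in any other sequence."""
    return [
        s for s in sequences
        if not any(s != t and is_contained(s, t) for t in sequences)
    ]


# Large sequences of the example database (support of at least 2 customers).
LARGE = [
    [{30}], [{40}], [{70}], [{40, 70}], [{90}],
    [{30}, {40}], [{30}, {70}], [{30}, {40, 70}], [{30}, {90}],
]

print(maximal_sequences(LARGE))
```

Only <(30) (40 70)> and <(30) (90)> remain, which is the answer set of the example.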
19
The sequence phase finds all of the large sequences.
Each pass: seed set of sequences → candidate sequences → scan the data to find the support for the candidates → large sequences.
In the first pass, all 1-sequences with minimum support, obtained in the litemset phase, form the seed set.
20
Two families of algorithms:
◦ Count-all
  Counts all the large sequences, including non-maximal sequences.
  Non-maximal sequences must be pruned out (Maximal Phase).
  The authors present one count-all algorithm, named AprioriAll.
◦ Count-some
  Intuition: since we are only interested in maximal sequences, we can avoid counting sequences that are contained in a longer sequence if we count the longer sequences first.
  Note: we have to be careful not to count many longer sequences that do not have minimum support. Otherwise, count-some algorithms may be less effective than count-all algorithms.
21
AprioriAll:
◦ Based on the normal Apriori algorithm.
◦ Counts all the large sequences.
◦ Prunes non-maximal sequences in the Maximal Phase.
22
L_1 = {large 1-sequences}
for (k = 2; L_{k-1} ≠ {}; k++) do begin
  C_k = New candidates generated from L_{k-1}
  foreach customer-sequence c in the database do
    Increment the count of all candidates in C_k that are contained in c.
  L_k = Candidates in C_k with minimum support.
end
Answer = Maximal sequences in ∪_k L_k
Notation:
L_k: set of all large k-sequences
C_k: set of candidate k-sequences
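The loop above can be sketched in Python on a transformed database, where each customer sequence is a list of sets of litemset ids. This is a simplified illustration under stated assumptions: sequences are tuples of ids, candidates are counted by a plain scan rather than a hash tree, the join pairs only distinct sequences, and the names `contains`, `generate` and `apriori_all` are hypothetical. A compact candidate generator is included so that the sketch runs on its own.

```python
def contains(seq, cust):
    """True if the id sequence seq is contained in the transformed sequence cust."""
    pos = 0
    for lid in seq:
        while pos < len(cust) and lid not in cust[pos]:
            pos += 1
        if pos == len(cust):
            return False
        pos += 1
    return True


def generate(prev):
    """Join L_{k-1} with itself, then prune by (k-1)-subsequences."""
    joined = {p + (q[-1],) for p in prev for q in prev
              if p != q and p[:-1] == q[:-1]}
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}


def apriori_all(db, min_count):
    """Return {k: set of large k-sequences} for the transformed database db."""
    def count(seq):
        return sum(1 for cust in db if contains(seq, cust))

    ids = {lid for cust in db for t in cust for lid in t}
    large = {(i,) for i in ids if count((i,)) >= min_count}
    result, k = {}, 1
    while large:
        result[k] = large
        large = {c for c in generate(large) if count(c) >= min_count}
        k += 1
    return result


# Transformed example database (litemset ids 1..5), minimum of 2 customers.
DB = [[{1}, {5}], [{1}, {2, 3, 4}], [{1, 3}], [{1}, {2, 3, 4}, {5}], [{5}]]
for k, seqs in apriori_all(DB, 2).items():
    print(k, sorted(seqs))
```

The large 2-sequences found are (1, 2), (1, 3), (1, 4) and (1, 5); no 3-sequence survives pruning. The maximal phase would still have to be applied to this output.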
23
Candidate generation:
◦ Join L_{k-1} with itself to form C_k:
  insert into C_k
  select p.litemset_1, ..., p.litemset_{k-1}, q.litemset_{k-1}
  from L_{k-1} p, L_{k-1} q
  where p.litemset_1 = q.litemset_1, ..., p.litemset_{k-2} = q.litemset_{k-2}
◦ Delete every c in C_k such that some (k-1)-subsequence of c is not in L_{k-1}.
[Example table: L3, C4 (after join), C4 (after pruning)]
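The join and prune steps can be sketched separately in Python. The set `L3` below is an illustrative input chosen here, since the slide's own example table did not survive; the function names `join` and `prune`, the tuple representation, and the restriction of the join to distinct pairs p ≠ q are likewise assumptions of this sketch.

```python
def join(prev):
    """Pair sequences that agree on their first k-2 elements."""
    return {p + (q[-1],) for p in prev for q in prev
            if p != q and p[:-1] == q[:-1]}


def prune(candidates, prev):
    """Drop candidates having a (k-1)-subsequence that is not in prev."""
    return {c for c in candidates
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}


# Illustrative large 3-sequences over litemset ids (an assumption).
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}

after_join = join(L3)
after_prune = prune(after_join, L3)
print(sorted(after_join))   # [(1, 2, 3, 4), (1, 2, 4, 3), (1, 3, 4, 5), (1, 3, 5, 4)]
print(sorted(after_prune))  # [(1, 2, 3, 4)]
```

For instance, (1, 2, 4, 3) is pruned because its subsequence (2, 4, 3) is not in `L3`. Unlike itemset candidates, both orders of the last two elements are generated, since order matters in a sequence.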
24
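[AprioriAll worked example: tables of customer sequences, L1 to L4 with support counts, and the answer set. The sequences themselves are not recoverable from this transcript. Surviving support counts: L1: 4, 2, 4, 4, 4; L2: 2, 4, 3, 2, 2, 2, 3, 2, 2, 0; L3: 2, 2, 3, 2, 2, 1; L4: 2. The answer consists of three sequences.]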
25
AprioriSome:
Avoid counting sequences that are contained in longer sequences by counting the longer ones first.
Also avoid counting many longer sequences that do not have minimum support.
Two phases:
◦ Forward phase: find all large sequences of certain lengths.
◦ Backward phase: find all remaining large sequences.
The same generation function is used to generate candidates.
◦ If L_{k-1} is not available, we generate C_k from C_{k-1}.
26
The function next determines the next sequence length to be counted.
The next function used in the experiments:
◦ Let hit_k denote the ratio of the number of large k-sequences to the number of candidate k-sequences (i.e., |L_k| / |C_k|).
◦ The intuition behind this heuristic: as the percentage of candidates counted in the current pass that had minimum support increases, the time wasted by counting extensions of small candidates when we skip a length goes down.
function next(k: integer)
begin
  if (hit_k < 0.666) return k + 1;  // degenerates to AprioriAll
  elseif (hit_k < 0.75) return k + 2;
  elseif (hit_k < 0.80) return k + 3;
  elseif (hit_k < 0.85) return k + 4;
  else return k + 5;
end
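The heuristic translates almost line for line into Python. In this sketch the hit ratio is passed in explicitly, and the names `hit_ratio` and `next_length` are assumptions (the latter avoids shadowing Python's built-in `next`); the thresholds are the ones on the slide.

```python
def hit_ratio(num_large, num_candidates):
    """hit_k = |L_k| / |C_k|."""
    return num_large / num_candidates


def next_length(k, hit_k):
    """Next sequence length to count, given the hit ratio of pass k."""
    if hit_k < 0.666:
        return k + 1  # counting every length degenerates to AprioriAll
    elif hit_k < 0.75:
        return k + 2
    elif hit_k < 0.80:
        return k + 3
    elif hit_k < 0.85:
        return k + 4
    else:
        return k + 5


print(next_length(2, hit_ratio(9, 9)))   # 7: every candidate was large
print(next_length(2, hit_ratio(1, 2)))   # 3: half the candidates failed
```

The higher the hit ratio, the more lengths are skipped before the next counting pass.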
27
AprioriSome, forward phase:
L_1 = {large 1-sequences}
C_1 = L_1
last = 1
for (k = 2; C_{k-1} ≠ {} and L_last ≠ {}; k++) do begin
  if (L_{k-1} known) then
    C_k = New candidates generated from L_{k-1}
  else
    C_k = New candidates generated from C_{k-1}
  if (k == next(last)) then begin  // (next k to count?)
    foreach customer-sequence c in the database do
      Increment the count of all candidates in C_k that are contained in c.
    L_k = Candidates in C_k with minimum support.
    last = k;
  end
end
28
AprioriSome, backward phase:
for (k--; k >= 1; k--) do
  if (L_k not found in forward phase) then begin
    Delete all sequences in C_k contained in some L_i, i > k;
    foreach customer-sequence c in D_T do
      Increment the count of all candidates in C_k that are contained in c
    L_k = Candidates in C_k with minimum support
  end
  else  // L_k already known
    Delete all sequences in L_k contained in some L_i, i > k;
Answer = ∪_k L_k  // (Maximal Phase not needed)
Notation: D_T is the transformed database.
29
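[AprioriSome worked example: tables L2 (with support), C3, L3 (with support), C4, L4 (with support), and C3 (after deletion). The sequences themselves are not recoverable from this transcript. Surviving support counts: L2: 2, 4, 3, 3, 2, 2, 3, 2, 2; L3: 2; L4: 2.]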
30
DynamicSome:
Similar to AprioriSome.
AprioriSome generates C_k from C_{k-1} when L_{k-1} has not been counted.
DynamicSome generates C_k "on the fly".
◦ Based on the large sequences found in the previous passes and the customer sequences read from the database.
31
In the initialization phase, count all sequences of length up to and including the step variable.
◦ If step is 3, count sequences of length 1, 2 and 3.
In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on the fly, based on previous passes and the customer sequences in the database.
◦ While generating sequences of length 9 with a step size of 3:
  While scanning the data, if sequences s_6 ∈ L_6 and s_3 ∈ L_3 are both contained in the customer sequence c at hand, and they do not overlap in c, then s_3.s_6 is a candidate (3+6)-sequence.
32
In the intermediate phase, generate the candidate sequences for the skipped lengths.
◦ If we have counted L_6 and L_3, and L_9 turns out to be empty: we generate C_7 and C_8, count C_8 followed by C_7 after deleting non-maximal sequences, and repeat the process for C_4 and C_5.
The backward phase is identical to that of AprioriSome.
33
The authors generated synthetic customer transactions.
Parameters:
|D|: Number of customers (= size of database)
|C|: Average number of transactions per customer
|T|: Average number of items per transaction
|S|: Average length of maximal potentially large sequences
|I|: Average size of itemsets in maximal potentially large sequences
N_S: Number of potentially large sequences
N_I: Number of potentially large itemsets
N: Number of items
Datasets:
Name                 |C|   |T|   |S|   |I|    Size (MB)
C10-T5-S4-I1.25      10    5     4     1.25   5.8
C10-T5-S4-I2.5       10    5     4     2.5    6.0
C20-T2.5-S4-I1.25    20    2.5   4     1.25   6.9
C20-T2.5-S8-I1.25    20    2.5   8     1.25   7.8
35
The authors proposed an algorithm for finding sequential patterns in a database.
They proposed three different algorithms for the sequence phase:
◦ AprioriAll
◦ AprioriSome
◦ DynamicSome
37
THANKS FOR LISTENING!!!