Mining Sequential Patterns: Generalizations and Performance Improvements
  R. Srikant, R. Agrawal
  IBM Almaden Research Center
  Advisor: Dr. Hsu
  Presented by: M.H. Lin

Outline
  Motivation
  Objective
  Introduction
  Problem Statement
  The New Algorithm: GSP
  Performance Evaluation
  Conclusion
  Personal Opinion

Motivation
  The problem of mining sequential patterns was recently introduced.
  Limitations of AprioriAll [Agrawal, 1995]:
    Absence of time constraints
    Rigid definition of a transaction
    Absence of taxonomies

Objective
  Present GSP, a new algorithm that discovers these generalized sequential patterns.
  Empirically compare the performance of GSP with that of the AprioriAll algorithm.

Introduction
  Instance:
    A database of sequences, called data-sequences
    Each data-sequence is a list of transactions ordered by transaction-time
    Each transaction is a set of items
  Definitions:
    A sequential pattern consists of a list of itemsets
    Support: the number of data-sequences that contain the pattern
  Problem:
    To discover all sequential patterns with a user-specified minimum support

Example of a Sequential Pattern
  Database of a book club: each data-sequence corresponds to all the book selections of one customer, and each transaction contains the books that customer selected in one order.
  A sequential pattern: 5% of customers bought ‘Foundation’, then ‘Foundation and Empire’ and ‘Ringworld’, then ‘Second Foundation’.

Features of a Sequential Pattern
  E.g.: 5% of customers bought ‘Foundation’, then ‘Foundation and Empire’ and ‘Ringworld’, then ‘Second Foundation’.
  Maximum and/or minimum time gaps between adjacent elements
    E.g.: the time between buying ‘Foundation’ and then buying ‘Foundation and Empire’ and ‘Ringworld’ should be within 3 months.
  A sliding time window over the sequence-pattern elements
    E.g.: with a window of one week, a customer buys BK-a on Monday, BK-b on Saturday, and BK-c on the next Sunday; this data-sequence supports the pattern “BK-a and BK-b, then BK-c”.
  User-defined taxonomies
    Example on the next slide.

A User-defined Taxonomy
  A customer who bought Foundation, then Perfect Spy, would support the following patterns:
    Foundation, then Perfect Spy
    Asimov, then Perfect Spy
    Science Fiction, then Le Carre
    …

The Old Algorithm: AprioriAll
  A 3-phase algorithm
    Phase 1: finds all frequent itemsets with minimum support
    Phase 2: transforms the database so that each transaction is replaced by the set of frequent itemsets it contains
    Phase 3: finds sequential patterns
  Pros
    Can discover all frequent sequential patterns
  Cons
    Computationally expensive in both space and time
    Not feasible to incorporate sliding windows

Problem Statement
  Definitions:
    Let I = {i_1, i_2, …, i_m} be a set of literals, called items.
    Let T be a directed acyclic graph on the literals.
    An itemset is a non-empty set of items.
    A sequence is an ordered list of itemsets.
    We denote a sequence s by ⟨s_1 s_2 … s_n⟩, where s_j is an itemset.
    We denote an element of a sequence by (x_1, x_2, …, x_m), where x_j is an item.
    A sequence ⟨a_1 a_2 … a_n⟩ is a subsequence of another sequence ⟨b_1 b_2 … b_m⟩ if there exist integers i_1 < i_2 < … < i_n such that a_1 ⊆ b_{i_1}, a_2 ⊆ b_{i_2}, …, a_n ⊆ b_{i_n}.
    Examples (sequences not preserved in the transcript): one sequence that is a subsequence of another, and one that is not.
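The subsequence definition and the support count from the Introduction can be sketched in a few lines. This is a minimal illustration, not code from the paper: the names `is_subsequence` and `support` are hypothetical, itemsets are modelled as Python sets of integers, and the example data is chosen here for illustration rather than recovered from the slide.

```python
from typing import Sequence, Set


def is_subsequence(pattern: Sequence[Set[int]], data: Sequence[Set[int]]) -> bool:
    """True if the pattern elements map, in order, onto distinct itemsets of
    data such that each pattern element is a subset of its itemset."""
    pos = 0
    for element in pattern:
        # Greedily take the earliest remaining itemset that contains this element.
        while pos < len(data) and not element <= data[pos]:
            pos += 1
        if pos == len(data):
            return False
        pos += 1  # the next element must match a strictly later itemset
    return True


def support(pattern: Sequence[Set[int]], database: Sequence[Sequence[Set[int]]]) -> int:
    """Number of data-sequences that contain the pattern."""
    return sum(1 for d in database if is_subsequence(pattern, d))


# Illustrative data (an assumption, not taken from the slides).
data = [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]
print(is_subsequence([{3}, {4, 5}, {8}], data))  # True
print(is_subsequence([{3}, {5}], [{3, 5}]))      # False: two elements need two itemsets
```

Greedy earliest matching is sufficient here because, without time constraints, taking the earliest possible itemset never rules out a later match.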

Problem Statement (contd.)
  A data-sequence contains a sequence s if s is a subsequence of the data-sequence.
  Plus taxonomies: a transaction T contains an item x ∈ I if x is in T or x is an ancestor of some item in T.
  Plus sliding windows: a data-sequence d = ⟨d_1 … d_m⟩ contains a sequence s = ⟨s_1 … s_n⟩ if there exist integers l_1 ≤ u_1 < l_2 ≤ u_2 < … < l_n ≤ u_n such that
    1. s_i is contained in ∪_{k=l_i}^{u_i} d_k, 1 ≤ i ≤ n, and
    2. transaction-time(d_{u_i}) - transaction-time(d_{l_i}) ≤ window-size, 1 ≤ i ≤ n.
  Plus time constraints:
    3. transaction-time(d_{l_i}) - transaction-time(d_{u_{i-1}}) > min-gap, 2 ≤ i ≤ n, and
    4. transaction-time(d_{u_i}) - transaction-time(d_{l_{i-1}}) ≤ max-gap, 2 ≤ i ≤ n.
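Conditions 1 to 4 can be checked directly by searching over the index pairs (l_i, u_i). The sketch below is a brute-force reading of the definition, not the counting procedure GSP itself uses. The function name `contains`, the `(time, items)` tuple layout, and the day numbers in the example (Monday = 1, Saturday = 6, next Sunday = 14) are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Set, Tuple

Transaction = Tuple[int, Set[str]]  # (transaction-time, items), sorted by time


def contains(data: List[Transaction], pattern: Sequence[Set[str]],
             window: int = 0, min_gap: int = 0,
             max_gap: float = float("inf")) -> bool:
    """Brute-force test of conditions 1-4 for one data-sequence."""
    n = len(data)

    def search(i: int, start: int, prev_l: Optional[int], prev_u: Optional[int]) -> bool:
        if i == len(pattern):
            return True
        for l in range(start, n):
            t_l = data[l][0]
            # Condition 3: gap from the previous element's last transaction.
            if prev_u is not None and t_l - data[prev_u][0] <= min_gap:
                continue
            covered: Set[str] = set()
            for u in range(l, n):
                t_u = data[u][0]
                if t_u - t_l > window:  # condition 2 violated from here on
                    break
                # Condition 4: span back to the previous element's first transaction.
                if prev_l is not None and t_u - data[prev_l][0] > max_gap:
                    break
                covered |= data[u][1]
                # Condition 1: element i lies in the union of d_l .. d_u.
                if pattern[i] <= covered and search(i + 1, u + 1, l, u):
                    return True
        return False

    return search(0, 0, None, None)


# The sliding-window example from the slides, with assumed day numbers.
purchases = [(1, {"BK-a"}), (6, {"BK-b"}), (14, {"BK-c"})]
pattern = [{"BK-a", "BK-b"}, {"BK-c"}]
print(contains(purchases, pattern, window=7))  # True
print(contains(purchases, pattern, window=0))  # False
```

The search is exponential in the worst case, which is acceptable for a definition check; it only aims to make the four conditions concrete.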

Problem Definition
  Input:
    Database D of data-sequences
    Taxonomy T: a DAG, not necessarily a tree
    User-specified min-gap and max-gap time constraints
    A user-specified sliding-window size
    A user-specified minimum support
  Goal:
    To find all sequences whose support is greater than the user-specified minimum support

Example
  Minimum support: 2 data-sequences
  With AprioriAll: (resulting patterns not preserved in the transcript)
  A sliding window of 7 days adds one pattern (not preserved)
  A max-gap of 30 days drops both patterns
  Adding the taxonomy, with no sliding window or time constraints, adds one pattern

GSP: Basic Structure
  Phase 1: makes the first pass over the database
    Yields all frequent 1-element sequences
  Phase 2: the k-th pass
    Starts with the seed set found in the (k-1)-th pass and uses it to generate candidate sequences, each of which has one more item than a seed sequence
    Makes a new pass over D to find the support of these candidate sequences
    The frequent candidates become the seed for the next pass
  Phase 3: terminates when
    no more frequent sequences are found, or
    no candidate sequences are generated
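The pass structure can be sketched as a level-wise loop. To stay short, this illustration makes a strong simplifying assumption that is not part of GSP: every pattern element holds exactly one item, so a pattern is just a tuple of items. It also ignores time constraints, sliding windows and taxonomies, and treats a pattern as frequent when its count is at least `min_support`. The name `mine_single_item_patterns` is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Pattern = Tuple[str, ...]


def contains(data: List[Set[str]], pattern: Pattern) -> bool:
    """True if the items of pattern occur in order in distinct transactions."""
    pos = 0
    for item in pattern:
        while pos < len(data) and item not in data[pos]:
            pos += 1
        if pos == len(data):
            return False
        pos += 1
    return True


def mine_single_item_patterns(database: List[List[Set[str]]],
                              min_support: int) -> Dict[Pattern, int]:
    # Pass 1: count each item once per data-sequence.
    counts = Counter(item for d in database for item in set().union(*d))
    seeds = {(item,) for item, c in counts.items() if c >= min_support}
    frequent = {s: counts[s[0]] for s in seeds}

    while seeds:  # stop when the last pass found no frequent sequences
        # Candidates have one more item than the seeds they come from.
        candidates = {s1 + s2[-1:] for s1 in seeds for s2 in seeds
                      if s1[1:] == s2[:-1]}
        if not candidates:  # stop when no candidates are generated
            break
        # One pass over the database to count candidate support.
        support = Counter(c for d in database for c in candidates if contains(d, c))
        seeds = {c for c in candidates if support[c] >= min_support}
        frequent.update({c: support[c] for c in seeds})
    return frequent


# Illustrative database (an assumption, not from the slides).
db = [[{"a"}, {"b"}, {"c"}], [{"a"}, {"c"}], [{"a", "b"}, {"c"}], [{"b"}]]
print(mine_single_item_patterns(db, min_support=2))
```

Both termination conditions from the slide appear explicitly: the `while seeds` test and the `if not candidates` check.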

GSP: Implementation
  Generating candidates: generate as few candidates as possible while maintaining completeness
  Counting candidates: determine each candidate sequence’s support
  Implementing taxonomies

Candidate Generation
  Definitions:
    k-sequence: a sequence with k items
    L_k: the set of frequent k-sequences
    C_k: the set of candidate k-sequences
  Goal: given the set of all frequent (k-1)-sequences, generate a candidate set that is a superset of all frequent k-sequences
  Algorithm:
    Join phase: join L_{k-1} with L_{k-1}; s_1 joins with s_2 if the subsequence obtained by dropping the first item of s_1 is the same as the one obtained by dropping the last item of s_2
    Prune phase: delete candidate sequences that have a contiguous (k-1)-subsequence whose support count is less than the minimum support

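Candidate Generation: Example
  Join phase: one frequent 3-sequence joins with another to yield a candidate (sequences not preserved in the transcript)
  Prune phase: a candidate is dropped because one of its contiguous subsequences is not in L_3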
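The join and prune phases can be sketched as follows. This is an illustration under stated assumptions: a sequence is a tuple of elements, each element a sorted tuple of integer items; "contiguous (k-1)-subsequence" is read as dropping one item from the first element, the last element, or any element with at least two items; and the special handling needed when joining 1-sequences is omitted. All function names and the example set `L3` are hypothetical.

```python
from typing import Set, Tuple

Seq = Tuple[Tuple[int, ...], ...]


def drop_first(s: Seq) -> Seq:
    """s without the first item of its first element."""
    head = s[0][1:]
    return ((head,) if head else ()) + s[1:]


def drop_last(s: Seq) -> Seq:
    """s without the last item of its last element."""
    tail = s[-1][:-1]
    return s[:-1] + ((tail,) if tail else ())


def join(s1: Seq, s2: Seq) -> Seq:
    """Extend s1 with the last item of s2 (assumes drop_first(s1) == drop_last(s2))."""
    item = s2[-1][-1]
    if len(s2[-1]) == 1:  # the item was a separate element in s2
        return s1 + ((item,),)
    return s1[:-1] + (s1[-1] + (item,),)  # the item joins s1's last element


def contiguous_subsequences(s: Seq) -> Set[Seq]:
    """(k-1)-subsequences allowed under the assumed contiguity rule."""
    result = set()
    for i, element in enumerate(s):
        if len(element) == 1 and 0 < i < len(s) - 1:
            continue  # a single-item middle element may not be dropped
        for j in range(len(element)):
            rest = element[:j] + element[j + 1:]
            result.add(s[:i] + ((rest,) if rest else ()) + s[i + 1:])
    return result


def generate_candidates(frequent: Set[Seq]) -> Set[Seq]:
    joined = {join(s1, s2) for s1 in frequent for s2 in frequent
              if drop_first(s1) == drop_last(s2)}
    # Prune: every contiguous (k-1)-subsequence must itself be frequent.
    return {c for c in joined if contiguous_subsequences(c) <= frequent}


# Illustrative frequent 3-sequences (an assumption, not recovered from the slide).
L3 = {((1, 2), (3,)), ((1, 2), (4,)), ((1,), (3, 4)),
      ((1, 3), (5,)), ((2,), (3, 4)), ((2,), (3,), (5,))}
print(generate_candidates(L3))  # {((1, 2), (3, 4))}
```

With this data the join yields two candidates, and the second, ((1, 2), (3,), (5,)), is pruned because ((1,), (3,), (5,)) is not in `L3`.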

Counting Candidates
  Problem: given a set of candidate sequences C and a data-sequence d, find all sequences in C that are contained in d.
  Two techniques are used:
    Hash-tree data structure: reduces the number of candidates in C that need to be checked
    Transformed representation of the data-sequence d: makes it efficient to test whether a specific candidate is a subsequence of d

Hash-Tree Structure
  Purpose: reducing the number of candidates to check
  Leaf node: a list of sequences
  Interior node: a hash table
  Operations:
    Adding candidate sequences to the hash-tree
    Finding the candidates contained in a data-sequence, subject to:
      Min-gap
      Max-gap
      Sliding window size

Representation Transformation
  Purpose: to efficiently find the first occurrence of an element
  Transform each data-sequence into transaction-links, where each link is identified by one item
  E.g.: max-gap = 30, min-gap = 5, window-size = 0 (example sequence not preserved in the transcript)
  E.g.: window-size = 7, find (2, 6) after time = 20
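One way to picture the transformed representation is a map from each item to the sorted list of times at which it occurs, which lets the first occurrence of an element within a window be found by repeated lookups. The sketch below is an interpretation, not the slide's exact structure: `build_item_times` and `find_element` are hypothetical names, the search treats the start time as inclusive, and the data-sequence is made up so that the "find (2, 6) from time 20 with window-size 7" query has something to run on.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple


def build_item_times(data: Iterable[Tuple[int, Set[int]]]) -> Dict[int, List[int]]:
    """Map each item to the sorted transaction-times at which it occurs."""
    times: Dict[int, List[int]] = defaultdict(list)
    for t, items in sorted(data, key=lambda tx: tx[0]):
        for item in items:
            times[item].append(t)
    return times


def find_element(item_times: Dict[int, List[int]], element: Iterable[int],
                 not_before: int, window: int) -> Optional[Tuple[int, int]]:
    """First (start, end) at or after not_before where all items of element
    occur within window time units, or None if there is no such occurrence."""
    t = not_before
    while True:
        found = []
        for item in element:
            ts = item_times.get(item, [])
            k = bisect_left(ts, t)  # first occurrence at or after t
            if k == len(ts):
                return None  # this item never occurs again
            found.append(ts[k])
        start, end = min(found), max(found)
        if end - start <= window:
            return start, end
        # Any valid occurrence must end at or after `end`, so it cannot
        # start before end - window; skip ahead to that time.
        t = end - window


# Illustrative data-sequence (an assumption, not recovered from the slide).
sequence = [(10, {1, 2}), (25, {4, 6}), (45, {3}), (50, {1, 2}),
            (65, {3}), (90, {2, 4}), (95, {6})]
item_times = build_item_times(sequence)
print(find_element(item_times, (2, 6), 20, 7))  # (90, 95)
```

The skip to `end - window` always moves `t` forward, since the loop only continues when `end - start > window`, so the search terminates.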

Implementing Taxonomies
  Basic idea: replace each data-sequence d with an “extended sequence” d’, where each transaction d_i’ contains all the items in the corresponding transaction d_i, as well as all their ancestors.
    E.g.: (example not preserved in the transcript)
  Optimizations:
    Pre-compute the ancestors of each item, and drop infrequent ancestors before a new pass
    Do not count patterns with an element that contains both an item x and its ancestor y
  Problem: redundancy
    E.g.: (example not preserved in the transcript)
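The extended-sequence idea can be sketched with a taxonomy stored as a map from each item to its direct parents. This is a minimal illustration: `ancestors` and `extend_sequence` are hypothetical names, and the parent links below are assumptions inferred from the patterns listed on the taxonomy slide (Foundation under Asimov under Science Fiction, Perfect Spy under Le Carre).

```python
from typing import Dict, List, Set


def ancestors(item: str, parents: Dict[str, List[str]],
              cache: Dict[str, Set[str]]) -> Set[str]:
    """All ancestors of item in the taxonomy DAG, memoised in cache."""
    if item not in cache:
        found: Set[str] = set()
        for parent in parents.get(item, ()):
            found.add(parent)
            found |= ancestors(parent, parents, cache)
        cache[item] = found
    return cache[item]


def extend_sequence(data: List[Set[str]],
                    parents: Dict[str, List[str]]) -> List[Set[str]]:
    """Add to every transaction the ancestors of each item it contains."""
    cache: Dict[str, Set[str]] = {}
    return [set(t).union(*(ancestors(x, parents, cache) for x in t))
            for t in data]


# Assumed taxonomy edges (child -> direct parents).
taxonomy = {
    "Foundation": ["Asimov"],
    "Asimov": ["Science Fiction"],
    "Perfect Spy": ["Le Carre"],
}
customer = [{"Foundation"}, {"Perfect Spy"}]
extended = extend_sequence(customer, taxonomy)
# The extended sequence supports "Science Fiction, then Le Carre".
print("Science Fiction" in extended[0] and "Le Carre" in extended[1])  # True
```

Because the map stores a list of parents per item, the same code handles a DAG in which an item has more than one parent.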

Performance Evaluation
  Comparison of GSP and AprioriAll
    Result: GSP is 2 to 20 times faster
    Contributing factors:
      Fewer candidates
      Directly finding the candidates contained in each data-sequence
  Scale-up: GSP scales linearly with the number of data-sequences
  Effects of time constraints and sliding windows: no performance degradation

Experiment Result

Experiment Result(contd.)

Conclusion
  GSP is a generalized sequence mining algorithm
  Discovers all the sequential patterns
  Good customizability
  Has been incorporated into IBM’s data mining product

Personal Opinion
  Hash-tree structure: limited by main memory
  Multiple passes over the database
  Apply GSP to CIS data
