Mining Sequential Patterns: Generalizations and Performance Improvements R. Srikant R. Agrawal IBM Almaden Research Center Advisor: Dr. Hsu Presented by:

Mining Sequential Patterns: Generalizations and Performance Improvements
R. Srikant, R. Agrawal
IBM Almaden Research Center
Advisor: Dr. Hsu
Presented by: M.H. Lin

Outline
- Motivation
- Objective
- Introduction
- Problem Statement
- The New Algorithm: GSP
- Performance Evaluation
- Conclusion
- Personal Opinion

Motivation
- The problem of mining sequential patterns was recently introduced.
- Limitations of AprioriAll [Agrawal, 1995]:
  - Absence of time constraints
  - Rigid definition of a transaction
  - Absence of taxonomies

Objective
- Present GSP, a new algorithm that discovers these generalized sequential patterns.
- Empirically compare the performance of GSP with the AprioriAll algorithm.

Introduction
- Instance:
  - A database of sequences, called data-sequences
  - Each sequence is a list of transactions ordered by transaction-time
  - Each transaction is a set of items
- Definitions:
  - A sequential pattern consists of a list of itemsets
  - Support: the number of data-sequences that contain the pattern
- Problem: to discover all the sequential patterns with a user-specified minimum support

Example of a Sequential Pattern
- Database of a book club: each data-sequence corresponds to all the book selections of a given customer, and each transaction contains the books selected by that customer in one order.
- A sequential pattern: 5% of customers bought ‘Foundation’, then ‘Foundation and Empire’ and ‘Ringworld’, then ‘Second Foundation’.

Features of a Sequential Pattern
- E.g.: 5% of customers bought ‘Foundation’, then ‘Foundation and Empire’ and ‘Ringworld’, then ‘Second Foundation’.
- Maximum and/or minimum time gaps between adjacent elements.
  E.g.: the time between buying ‘Foundation’ and then ‘Foundation and Empire’ and ‘Ringworld’ should be within 3 months.
- A sliding time window over the sequence-pattern elements.
  E.g., with a window of one week: Mon: BK-a; Sat: BK-b; next Sun: BK-c. This data-sequence supports the pattern “BK-a and BK-b, then BK-c”.
- User-defined taxonomies.
  Example: see the next slide.

A User-defined Taxonomy
A customer who bought Foundation, then Perfect Spy, would support the following patterns:
- Foundation, then Perfect Spy
- Asimov, then Perfect Spy
- Science Fiction, then Le Carre
- …

The Old Algorithm: AprioriAll
A 3-phase algorithm:
- Phase 1: finds all frequent itemsets with minimum support
- Phase 2: transforms the database so that each transaction contains only the frequent itemsets
- Phase 3: finds sequential patterns
Pros:
- Can discover all frequent sequential patterns
Cons:
- Computationally expensive in both space and time
- Not feasible to incorporate sliding windows

Problem Statement
Definitions:
- Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of literals, called items.
- Let T be a directed acyclic graph on the literals.
- An itemset is a non-empty set of items.
- A sequence is an ordered list of itemsets.
- We denote a sequence s by $\langle s_1 s_2 \ldots s_n \rangle$, where $s_j$ is an itemset.
- We denote an element of a sequence by $(x_1, x_2, \ldots, x_m)$, where $x_j$ is an item.
- A sequence $\langle a_1 a_2 \ldots a_n \rangle$ is a subsequence of another sequence $\langle b_1 b_2 \ldots b_m \rangle$ if there exist integers $i_1 < i_2 < \ldots < i_n$ such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots, a_n \subseteq b_{i_n}$.
- E.g.: [the two example sequence pairs on this slide, one positive and one negative, were lost in extraction]
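
The subsequence definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `is_subsequence` and the integer item values in the usage lines are assumptions chosen for the example. It relies on the fact that, without time constraints, greedily matching each pattern element to the earliest later element that is a superset is sufficient.

```python
def is_subsequence(a, b):
    """Return True if sequence `a` is a subsequence of sequence `b`.

    Both arguments are lists of sets (itemsets). Each element of `a`
    must be a subset of some element of `b`, with the matched elements
    of `b` appearing in strictly increasing positions.
    """
    j = 0  # next position in b that may still be used
    for elem in a:
        # Skip elements of b that do not contain the current itemset.
        while j < len(b) and not elem <= b[j]:
            j += 1
        if j == len(b):
            return False  # ran out of elements in b
        j += 1  # b[j] is consumed; the next match must come later
    return True


# Illustrative values (hypothetical, not taken from the slide):
print(is_subsequence([{3}, {4, 5}, {8}], [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]))
print(is_subsequence([{3}, {5}], [{3, 5}]))
```

The second call shows why order matters: two items bought in the same transaction do not support a pattern that requires them in separate, ordered transactions.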

Problem Statement (contd.)
- A data-sequence contains a sequence s if s is a subsequence of the data-sequence.
- Plus taxonomies: a transaction T contains an item $x \in I$ if x is in T or x is an ancestor of some item in T.
- Plus sliding windows: a data-sequence $d = \langle d_1 \ldots d_m \rangle$ contains a sequence $s = \langle s_1 \ldots s_n \rangle$ if there exist integers $l_1 \le u_1 < l_2 \le u_2 < \ldots < l_n \le u_n$ such that
  1. $s_i$ is contained in $\bigcup_{k=l_i}^{u_i} d_k$, $1 \le i \le n$, and
  2. $\text{transaction-time}(d_{u_i}) - \text{transaction-time}(d_{l_i}) \le \text{window-size}$, $1 \le i \le n$.
- Plus time constraints:
  3. $\text{transaction-time}(d_{l_i}) - \text{transaction-time}(d_{u_{i-1}}) > \text{min-gap}$, $2 \le i \le n$, and
  4. $\text{transaction-time}(d_{u_i}) - \text{transaction-time}(d_{l_{i-1}}) \le \text{max-gap}$, $2 \le i \le n$.
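
Conditions 1 to 4 can be checked directly by exhaustive search over the index pairs $(l_i, u_i)$. The sketch below is a brute-force illustration under stated assumptions, not the contains test used in GSP itself: the function name `contains`, the `(time, itemset)` tuple layout, and the day numbers in the usage lines are all hypothetical. It assumes the data-sequence is sorted by transaction time.

```python
def contains(data_seq, pattern, window_size=0, min_gap=0, max_gap=float("inf")):
    """Brute-force check of the generalized containment definition.

    data_seq: list of (time, set_of_items), sorted by time.
    pattern:  list of sets (the elements s_1 .. s_n).
    """
    times = [t for t, _ in data_seq]

    def search(i, start, prev_l, prev_u):
        if i == len(pattern):
            return True  # every element has been placed
        for l in range(start, len(data_seq)):
            # Condition 3: gap after the previous element must exceed min_gap.
            if prev_u is not None and times[l] - times[prev_u] <= min_gap:
                continue
            merged = set()
            for u in range(l, len(data_seq)):
                # Condition 2: transactions l..u must fit in the window.
                if times[u] - times[l] > window_size:
                    break
                # Condition 4: span from the previous start must fit in max_gap.
                if prev_l is not None and times[u] - times[prev_l] > max_gap:
                    break
                merged |= data_seq[u][1]
                # Condition 1: s_i is contained in the union of d_l .. d_u.
                if pattern[i] <= merged and search(i + 1, u + 1, l, u):
                    return True
        return False

    return search(0, 0, None, None)


# Hypothetical day numbers for the sliding-window example (Mon, Sat, next Sun).
d = [(1, {"BK-a"}), (6, {"BK-b"}), (14, {"BK-c"})]
p = [{"BK-a", "BK-b"}, {"BK-c"}]
print(contains(d, p, window_size=7))   # window merges the first two purchases
print(contains(d, p, window_size=0))   # without a window they stay separate
```

The search is exponential in the worst case, so it is only suitable for checking small examples against the definition.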

Problem Definition
Input:
- Database D: data-sequences
- Taxonomy T: a DAG, not a tree
- User-specified min-gap and max-gap time constraints
- A user-specified sliding window size
- A user-specified minimum support
Goal:
- To find all sequences whose support is greater than the user-specified minimum support

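Example
[The data-sequence table and the pattern listings on this slide were lost in extraction.]
- Minimum support: 2 data-sequences
- With AprioriAll: the sequential patterns found without any extensions
- A sliding window of 7 days adds one pattern
- A max-gap of 30 days: both patterns are dropped
- Adding the taxonomy, with no sliding window or time constraints: one pattern is added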

GSP: Basic Structure
- Phase 1: makes the first pass over the database to yield all the 1-element frequent sequences.
- Phase 2 (the kth pass):
  - Starts with the seed set found in the (k-1)th pass and uses it to generate candidate sequences, each of which has one more item than a seed sequence.
  - Makes a new pass over D to find the support for these candidate sequences.
  - The frequent candidates become the seed for the next pass.
- Phase 3: terminates when no more frequent sequences are found or no candidate sequences are generated.
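
The pass structure above can be sketched as a short level-wise loop. This is a deliberately simplified illustration, not GSP as published: it assumes every element holds a single item (so a sequence is just a tuple of items), ignores taxonomies, time constraints and the prune phase, and treats a candidate as frequent when its support is at least `min_support`. The names `gsp_single_items` and `contains` are hypothetical.

```python
from collections import Counter


def contains(data_seq, pattern):
    """True if `pattern` occurs in `data_seq` in order (gaps allowed)."""
    it = iter(data_seq)
    return all(item in it for item in pattern)


def gsp_single_items(database, min_support):
    """Level-wise mining of frequent sequences of single items.

    database: list of tuples of items; min_support: integer >= 1.
    """
    # Phase 1: one pass to find the frequent 1-sequences.
    counts = Counter(item for seq in database for item in set(seq))
    seeds = sorted((item,) for item, c in counts.items() if c >= min_support)
    frequent = list(seeds)

    # Phase 2: each pass grows the seeds by one item and recounts.
    while seeds:  # Phase 3: stop when no frequent sequences remain
        # Join: s1 and s2 combine when s1 minus its first item
        # equals s2 minus its last item.
        candidates = sorted(
            {s1 + s2[-1:] for s1 in seeds for s2 in seeds if s1[1:] == s2[:-1]}
        )
        seeds = [
            c for c in candidates
            if sum(contains(d, c) for d in database) >= min_support
        ]
        frequent.extend(seeds)
    return frequent


db = [("a", "b", "c"), ("a", "c"), ("a", "b", "c"), ("b",)]
print(gsp_single_items(db, min_support=2))
```

Each iteration of the `while` loop corresponds to one pass over the database, which is why the number of passes grows with the length of the longest frequent sequence.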

GSP: Implementation
- Generating candidates: to generate as few candidates as possible while maintaining completeness
- Counting candidates: to determine each candidate sequence’s support
- Implementing taxonomies

Candidate Generation
Definitions:
- k-sequence: a sequence with k items
- $L_k$: the set of frequent k-sequences
- $C_k$: the set of candidate k-sequences
Goal: given the set of all frequent (k-1)-sequences, generate a candidate set that contains all frequent k-sequences.
Algorithm:
- Join phase: join $L_{k-1}$ with $L_{k-1}$. A sequence $s_1$ can join with $s_2$ if $s_1$ without its first item is the same as $s_2$ without its last item.
- Prune phase: delete candidate sequences that have a contiguous (k-1)-subsequence whose support count is less than the minimum support.
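
The join and prune phases can be sketched as follows. This is a hedged illustration rather than the authors' code. It assumes sequences are tuples of sorted item tuples, that the input holds frequent sequences of at least two items (joining $L_1$ with $L_1$ needs a special case that is omitted here), and that a contiguous (k-1)-subsequence is obtained by dropping one item from the first element, the last element, or any element with at least two items. That last reading follows the definition in the GSP paper; the slide itself does not spell it out. All function names and the sample input are hypothetical.

```python
def drop_item(seq, elem_idx, item_idx):
    """Remove one item from one element; drop the element if it becomes empty."""
    elem = seq[elem_idx]
    new_elem = elem[:item_idx] + elem[item_idx + 1:]
    middle = (new_elem,) if new_elem else ()
    return seq[:elem_idx] + middle + seq[elem_idx + 1:]


def join(s1, s2):
    """Return the candidate built from s1 and s2, or None if they do not join."""
    without_first = drop_item(s1, 0, 0)
    without_last = drop_item(s2, len(s2) - 1, len(s2[-1]) - 1)
    if without_first != without_last:
        return None
    last = s2[-1][-1]
    if len(s2[-1]) == 1:
        # The item was a separate element in s2: append it as a new element.
        return s1 + ((last,),)
    # Otherwise it joins the last element of s1.
    return s1[:-1] + (tuple(sorted(s1[-1] + (last,))),)


def contiguous_subsequences(seq):
    """All sequences obtained by dropping one item as described in the lead-in."""
    subs = set()
    for i, elem in enumerate(seq):
        if i in (0, len(seq) - 1) or len(elem) >= 2:
            for j in range(len(elem)):
                subs.add(drop_item(seq, i, j))
    return subs


def generate_candidates(frequent):
    """Join phase followed by prune phase over a set of frequent (k-1)-sequences."""
    frequent = set(frequent)
    joined = set()
    for s1 in frequent:
        for s2 in frequent:
            candidate = join(s1, s2)
            if candidate is not None:
                joined.add(candidate)
    return {c for c in joined if contiguous_subsequences(c) <= frequent}


# Sample frequent 3-sequences, chosen for illustration.
L3 = [
    ((1, 2), (3,)), ((1, 2), (4,)), ((1,), (3, 4)),
    ((1, 3), (5,)), ((2,), (3, 4)), ((2,), (3,), (5,)),
]
print(generate_candidates(L3))
```

With this input the join produces two 4-sequences, and the prune phase removes the one whose contiguous subsequence `((1,), (3,), (5,))` is not frequent.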

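Candidate Generation: Example
[The sequences on this slide were lost in extraction; only the structure of the example survives.]
- Join phase: one frequent sequence joins with another to produce a candidate.
- Prune phase: a candidate is dropped because one of its contiguous subsequences is not in $L_3$.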

Counting Candidates
Problem: given a set of candidate sequences C and a data-sequence d, find all sequences in C that are contained in d.
Two techniques are used:
- Hash-tree data structure: to reduce the number of candidates in C that need to be checked.
- Transforming the representation of the data-sequence d: to efficiently determine whether a specific candidate is a subsequence of d.

Hash-Tree Structure
Purpose: reducing the number of candidates that need to be checked
- Leaf node: a list of sequences
- Interior node: a hash table
Operations:
- Adding candidate sequences to the hash-tree
- Finding the candidates contained in a data-sequence, taking into account:
  - Min-gap
  - Max-gap
  - Sliding window size

Representation Transformation
Purpose: to efficiently find the first occurrence of an element
- Transform the data-sequences into transaction-links, where each link is identified by one item.
- E.g.: max-gap = 30, min-gap = 5, window-size = 0 [example figure lost in extraction]
- E.g.: window-size = 7, find (2, 6) after time = 20 [example figure lost in extraction]

Implementing Taxonomies
Basic idea: replace each data-sequence d with an “extended sequence” d’, where each transaction $d_i'$ contains all the items in the corresponding transaction $d_i$, as well as all their ancestors.
- E.g.: [example sequences lost in extraction]
Optimizations:
- Pre-compute the ancestors of each item, and drop infrequent ancestors before a new pass.
- Do not count patterns with an element that contains both an item x and its ancestor y.
Problem: redundancy
- E.g.: [example lost in extraction]
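
The extended-sequence idea can be sketched as below. This is an assumption-laden illustration: the function names `ancestors` and `extend_sequence` are hypothetical, and the `parents` mapping encodes only the taxonomy edges implied by the earlier book example (Foundation under Asimov, Asimov under Science Fiction, Perfect Spy under Le Carre), since the taxonomy figure itself is not in the transcript. Because the taxonomy is a DAG, an item may have several parents, so each item maps to a list.

```python
def ancestors(item, parents):
    """Return all ancestors of `item` in a DAG given as item -> list of parents."""
    seen = set()
    stack = [item]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def extend_sequence(data_seq, parents):
    """Add every ancestor of every item to each transaction of a data-sequence."""
    return [
        set(transaction).union(*(ancestors(x, parents) for x in transaction))
        for transaction in data_seq
    ]


# Hypothetical taxonomy edges inferred from the book example.
parents = {
    "Foundation": ["Asimov"],
    "Asimov": ["Science Fiction"],
    "Perfect Spy": ["Le Carre"],
}
print(extend_sequence([{"Foundation"}, {"Perfect Spy"}], parents))
```

After this transformation, an ordinary subsequence test on the extended sequence finds patterns at any level of the taxonomy, at the cost of larger transactions. That cost is the motivation for pre-computing ancestors and dropping infrequent ones.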

Performance Evaluation
- Comparison of GSP and AprioriAll
  - Result: GSP is 2 to 20 times faster
  - Contributing factors: fewer candidates; directly finding the candidates
- Scale-up: GSP scales linearly with the number of data-sequences
- Effects of time constraints and sliding windows: there was no performance degradation

Experiment Result

Experiment Result(contd.)

Conclusion
- GSP is a generalized sequence mining algorithm
  - Discovers all the sequential patterns
  - Good customizability
- Has been incorporated into IBM’s data mining product

Personal Opinion
- Hash-tree structure: main memory limitation
- Multiple passes over the database
- Apply GSP to CIS data