The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Sequential Pattern Mining COMP 790-90 Seminar BCB 713 Module Spring 2011.



COMP Data Mining: Concepts, Algorithms, and Applications
Slide 2: Sequential Pattern Mining
- Why sequential pattern mining?
- GSP algorithm
- FreeSpan and PrefixSpan
- Border collapsing
- Constraints and extensions

Slide 3: Sequence Databases and Sequential Pattern Analysis
- (Temporal) order is important in many situations.
- Time-series databases and sequence databases
- Frequent patterns → (frequent) sequential patterns
- Applications of sequential pattern mining:
  - Customer shopping sequences: first buy a computer, then a CD-ROM, and then a digital camera, within 3 months.
  - Medical treatment, natural disasters (e.g., earthquakes), science and engineering processes, stocks and markets, telephone calling patterns, Weblog click streams, DNA sequences and gene structures

Slide 4: What Is Sequential Pattern Mining?
- Given a set of sequences, find the complete set of frequent subsequences.
- A sequence database is a table with columns SID and sequence. [Table and example sequences lost in extraction.]
- An element of a sequence may contain a set of items. Items within an element are unordered and are listed alphabetically.
- A sequence is a sequential pattern if at least min_sup sequences in the database contain it as a subsequence. The slide's example uses support threshold min_sup = 2.
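The definitions above (subsequence containment and support) can be sketched in a few lines. This is a minimal illustration, not the course's code: the function names and the four-sequence toy database are assumptions chosen for the example, since the slide's own table is not reproduced in the transcript. A sequence is modelled as a list of sets, one set per element.

```python
def is_subsequence(pattern, sequence):
    """True if every element of `pattern` is a subset of a distinct element
    of `sequence`, with the order of elements preserved."""
    pos = 0
    for element in pattern:
        # Greedily take the earliest element of `sequence` that contains it.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern, database):
    """Number of sequences in `database` that contain `pattern`."""
    return sum(is_subsequence(pattern, seq) for seq in database)


# Toy database (an assumption for illustration only).
db = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
]

pattern = [{"a", "b"}, {"c"}]          # written <(ab)c> in slide notation
is_pattern = support(pattern, db) >= 2  # compare against min_sup = 2
```

Greedy earliest matching is sufficient here: if any embedding of the pattern exists, the one that always picks the earliest possible element also exists.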

Slide 5: Challenges on Sequential Pattern Mining
- A huge number of possible sequential patterns are hidden in databases.
- A mining algorithm should:
  - Find the complete set of patterns satisfying the minimum support (frequency) threshold
  - Be highly efficient and scalable, involving only a small number of database scans
  - Be able to incorporate various kinds of user-specific constraints

Slide 6: A Basic Property of Sequential Patterns: Apriori
- A basic property: Apriori (Agrawal & Srikant '94)
- If a sequence S is not frequent, then none of the super-sequences of S is frequent.
- E.g., if a sequence is infrequent, so are all sequences that contain it. [Example sequences and the Seq. ID / Sequence table lost in extraction.]
- Given support threshold min_sup = 2

Slide 7: Basic Algorithm: Breadth-First Search (GSP)
  L = 1
  While (Result_L != NULL)
    Candidate generate
    Prune
    Test
    L = L + 1
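The generate / prune / test loop can be made concrete with a small sketch. To keep it short, this version assumes every element holds a single item, so a sequence is just a list of items; full GSP also handles multi-item elements, time constraints and taxonomies, none of which are modelled here. The function names and the toy input are illustrative assumptions.

```python
def is_subseq(pattern, seq):
    """Order-preserving containment test for single-item sequences."""
    it = iter(seq)
    return all(item in it for item in pattern)


def gsp(database, min_sup):
    """Level-wise (breadth-first) mining; returns {pattern tuple: support}."""
    items = sorted({x for seq in database for x in seq})
    candidates = [(x,) for x in items]   # level 1: all singleton sequences
    result = {}
    while candidates:
        # Test: one pass over the database counts every candidate.
        counts = {c: sum(is_subseq(c, s) for s in database) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_sup}
        result.update(frequent)
        # Candidate generate: join a and b when a minus its first item
        # equals b minus its last item.
        joined = {a + b[-1:] for a in frequent for b in frequent
                  if a[1:] == b[:-1]}
        # Prune (Apriori): dropping any one item must leave a frequent sequence.
        candidates = sorted(
            c for c in joined
            if all(c[:i] + c[i + 1:] in frequent for i in range(len(c)))
        )
    return result


# Hypothetical toy input: four sequences over the items a..d.
toy_db = [list("acd"), list("abc"), list("abcd"), list("bd")]
toy_patterns = gsp(toy_db, min_sup=2)
```

The loop stops as soon as a level produces no candidates, which mirrors the `While (Result_L != NULL)` condition on the slide.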

Slide 8: Finding Length-1 Sequential Patterns
- Initial candidates: all singleton sequences. [The candidate list was lost in extraction.]
- Scan the database once and count the support of each candidate.
- min_sup = 2
- [Tables lost in extraction: Seq. ID / Sequence, and Cand / Sup.]

Slide 9: The Mining Process
- 1st scan: 8 candidates, 6 length-1 sequential patterns
- 2nd scan: 51 candidates, 19 length-2 sequential patterns; 10 candidates not in the DB at all
- 3rd scan: 46 candidates, 19 length-3 sequential patterns; 20 candidates not in the DB at all
- 4th scan: 8 candidates, 6 length-4 sequential patterns
- 5th scan: 1 candidate, 1 length-5 sequential pattern
- Figure legend: candidates that cannot pass the support threshold; candidates not in the DB at all
- min_sup = 2 [Seq. ID / Sequence table lost in extraction.]

Slide 10: Generating Length-2 Candidates
- 51 length-2 candidates
- Without the Apriori property: 8*8 + 8*7/2 = 92 candidates
- Apriori prunes 44.57% of the candidates
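The counts on this slide can be re-derived with a little arithmetic. The sketch below assumes that the 51 candidates come from applying the same formula to the 6 frequent length-1 patterns reported for the first scan; the function name is illustrative.

```python
def length2_candidates(n_items):
    """Length-2 candidates over n_items items.

    n*n ordered pairs <x y> (two elements, <x x> included) plus
    n*(n-1)/2 unordered pairs <(x y)> (both items in one element).
    """
    return n_items * n_items + n_items * (n_items - 1) // 2


without_apriori = length2_candidates(8)  # all 8 items: 64 + 28
with_apriori = length2_candidates(6)     # only the 6 frequent items: 36 + 15
pruned_fraction = (without_apriori - with_apriori) / without_apriori

print(without_apriori, with_apriori, f"{pruned_fraction:.2%}")
```

So 41 of the 92 candidates are never generated, which is the 44.57% quoted on the slide.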

Slide 11: Generating Length-4 Candidates
- Table columns: frequent 3-sequences; candidates (after join); candidates (after pruning). [Table contents lost in extraction.]

Slide 12: Pattern Growth (PrefixSpan)
- Prefix and suffix (projection)
- Several example sequences are listed as prefixes of a given sequence. [Example sequences lost in extraction.]
- Table columns: Prefix; Suffix (prefix-based projection). [Table contents lost in extraction.]

Slide 13: Example
- An example (min_sup = 2): a sequence database table with columns Sequence_id and Sequence, and a result table with columns Prefix and Sequential Patterns. [Table contents lost in extraction.]

Slide 14: PrefixSpan (the example to be continued)
- Step 1: Find the length-1 sequential patterns. There are six, with supports 4, 4, 4, 3, 3, 3. [Pattern names lost in extraction.]
- Step 2: Divide the search space into six subsets according to the six prefixes.
- Step 3: Find the subsets of sequential patterns by constructing the corresponding projected databases and mining each recursively.

Slide 15: Example (continued)
- Table columns: Prefix; Projected (suffix) databases; Sequential Patterns. A second table lists Sequence_id, Sequence and Projected (suffix) databases. [Table contents lost in extraction.]

Slide 16: Example
Find the sequential patterns having a given length-1 prefix. [The prefix and the example sequences were lost in extraction.]
1. Scan the sequence database S once. Sequences in S containing the prefix are projected with respect to it to form the prefix-projected database.
2. Scan the projected database once to get the six length-2 sequential patterns having that prefix, with supports 2, 4, 2, 4, 2, 2.
3. Recursively, all sequential patterns having the prefix can be further partitioned into 6 subsets. Construct the respective projected databases and mine each. E.g., one of these projected databases has two sequences.

Slide 17: PrefixSpan Algorithm
PrefixSpan(α, i, S|α)
1. Scan S|α once and find the set of frequent items b such that b can be assembled to the last element of α to form a sequential pattern, or can be appended to α to form a sequential pattern.
2. For each frequent item b, append it to α to form a sequential pattern α', and output α'.
3. For each α', construct the α'-projected database S|α', and call PrefixSpan(α', i+1, S|α').
Main idea: use frequent prefixes to divide the search space and to project sequence databases, so that only the relevant sequences are searched.
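The three steps can be sketched as a short recursive function. This is an illustration under stated assumptions, not the authors' implementation: instead of copying suffixes, each projected database is kept as a list of (sequence index, position) pointers into the original database, the names `prefixspan` and `grow` are made up for the example, and the toy database is an assumption because the slide's table is missing from the transcript. Sequences are lists of sets; patterns are tuples of frozensets.

```python
from collections import defaultdict


def prefixspan(db, min_sup):
    """Return {pattern: support} for all frequent sequential patterns."""
    results = {}

    def grow(pattern, proj):
        # proj holds (sid, j): j is the index of the element of db[sid] where
        # the earliest occurrence of `pattern` ends (-1 for the empty pattern).
        last = pattern[-1] if pattern else None
        s_ext = defaultdict(set)  # item -> sids where it can start a new element
        i_ext = defaultdict(set)  # item -> sids where it can join the last element
        for sid, j in proj:
            seq = db[sid]
            for k in range(j + 1, len(seq)):
                for item in seq[k]:
                    s_ext[item].add(sid)
            if last is not None:
                for k in range(j, len(seq)):
                    if last <= seq[k]:
                        for item in seq[k]:
                            if item > max(last):  # keep elements in sorted order
                                i_ext[item].add(sid)
        for item, sids in sorted(s_ext.items()):
            if len(sids) >= min_sup:
                new = pattern + (frozenset([item]),)
                new_proj = [
                    (sid, next(k for k in range(j + 1, len(db[sid]))
                               if item in db[sid][k]))
                    for sid, j in proj if sid in sids
                ]
                results[new] = len(sids)
                grow(new, new_proj)
        for item, sids in sorted(i_ext.items()):
            if len(sids) >= min_sup:
                merged = last | {item}
                new = pattern[:-1] + (merged,)
                new_proj = [
                    (sid, next(k for k in range(j, len(db[sid]))
                               if merged <= db[sid][k]))
                    for sid, j in proj if sid in sids
                ]
                results[new] = len(sids)
                grow(new, new_proj)

    grow((), [(sid, -1) for sid in range(len(db))])
    return results


# Toy database (an assumption for illustration only).
db = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
]
patterns = prefixspan(db, min_sup=2)
```

The two dictionaries correspond to the two cases in step 1: `i_ext` collects items that can be assembled into the last element of the prefix, and `s_ext` collects items that can be appended as a new element. No candidate sequences are generated; only items that actually occur in the projected data are counted.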

Slide 18: Approximate Match
- When you observe d1, spread the count as d1: 90%, d2: 5%, d3: 5%.
- Compatibility matrix [matrix figure lost in extraction]

Slide 19: Match
- The match is the degree to which pattern P is retained/reflected in S.
- When $l_S = l_P$: $M(P,S) = P(P \mid S) = \prod_i C(p_i, s_i)$
- When $l_S > l_P$: $M(P,S)$ is the maximum over all possible subsequences of S of length $l_P$.
- Example:

  P      S        M
  d1d1   d1d3     0.9*0
  d1d2   d1d2     0.9*0.8
  d1d2   d1d3     0.9*0.05
  d1d2   d2d3     0.1*0.05
  d1d2   d1d2d3   0.9*0.8

Slide 20: Calculate Max over All
- Dynamic programming: $M(p_1 p_2 \ldots p_i,\ s_1 s_2 \ldots s_j)$ is the maximum of
  - $M(p_1 p_2 \ldots p_{i-1},\ s_1 s_2 \ldots s_{j-1}) \cdot C(p_i, s_j)$
  - $M(p_1 p_2 \ldots p_i,\ s_1 s_2 \ldots s_{j-1})$
- Cost: $O(l_P \cdot l_S)$
- When the compatibility matrix is sparse: $O(l_S)$
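The recurrence translates directly into a table-filling routine. The sketch below is illustrative: the function name, the dictionary representation of the compatibility matrix, and the base cases (an empty pattern matches with value 1, a non-empty pattern against an empty sequence with value 0) are assumptions, and the matrix entries are the ones that can be read off the products in the example on the previous slide.

```python
def match(pattern, sequence, compat):
    """Match of `pattern` in `sequence` using the two-case recurrence.

    compat[(p, s)] is the compatibility C(p, s) of pattern symbol p with
    observed symbol s; pairs that are absent count as 0.
    """
    n, m = len(pattern), len(sequence)
    # table[i][j] = match of pattern[:i] in sequence[:j]
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        table[0][j] = 1.0  # assumed base case: the empty pattern always matches
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            use = table[i - 1][j - 1] * compat.get((pattern[i - 1], sequence[j - 1]), 0.0)
            skip = table[i][j - 1]  # ignore s_j
            table[i][j] = max(use, skip)
    return table[n][m]


# Entries inferred from the worked example (assumed reading of the table).
C = {
    ("d1", "d1"): 0.9,
    ("d2", "d2"): 0.8,
    ("d1", "d2"): 0.1,
    ("d2", "d3"): 0.05,
    ("d1", "d3"): 0.0,
}

m_equal = match(["d1", "d2"], ["d1", "d3"], C)         # lengths equal: a plain product
m_longer = match(["d1", "d2"], ["d1", "d2", "d3"], C)  # best alignment inside S
```

Because the table has (l_P + 1) by (l_S + 1) cells and each cell takes constant time, the cost is the O(l_P * l_S) stated on the slide. When the lengths are equal the `skip` branch is always 0, so the result reduces to the product of the pairwise compatibilities.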

Slide 21: Match in D
- Average over all sequences in D

Slide 22: Spread of Match
- If the compatibility matrix is the identity matrix, match = support.

Slide 23: Anti-Monotone
- The match of a pattern P in a symbol sequence S is less than or equal to the match of any subpattern of P in S.
- The match of a pattern P in a sequence database D is less than or equal to the match of any subpattern of P in D.
- Can use any support-based algorithm.
- More patterns match, so an efficient solution is required:
  - Sample-based algorithms
  - Border collapsing of ambiguous patterns

Slide 24: Chernoff Bound
- Given sample size n and range R, with probability $1 - \delta$ the true value is $\mu \pm \epsilon$, where $\epsilon = \sqrt{R^2 \ln(1/\delta) / (2n)}$.
- Distribution free
- More conservative
- Sample size: fit in memory
- Restricted spread: for pattern $P = p_1 p_2 \ldots p_L$, $R = \min(\mathrm{match}[p_i])$ for all $1 \le i \le L$.
- Figure labels: frequent patterns (above min_match + $\epsilon$); infrequent patterns (below min_match - $\epsilon$).
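The bound and the resulting three-way split of patterns can be sketched as follows. The function names, the "ambiguous" label for the band between the two thresholds, and the numeric values in the usage lines are assumptions made for illustration.

```python
import math


def chernoff_epsilon(spread, n, delta):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)) for range R = spread."""
    return math.sqrt(spread * spread * math.log(1.0 / delta) / (2.0 * n))


def restricted_spread(pattern, symbol_match):
    """R = min(match[p_i]) over the symbols of the pattern."""
    return min(symbol_match[p] for p in pattern)


def classify(sample_match, min_match, eps):
    """Label a pattern from its match in the sample."""
    if sample_match > min_match + eps:
        return "frequent"
    if sample_match < min_match - eps:
        return "infrequent"
    return "ambiguous"  # needs to be resolved against the full database


# Hypothetical numbers: per-symbol matches from the first scan, 10,000 samples.
symbol_match = {"d1": 0.30, "d2": 0.10, "d3": 0.25}
R = restricted_spread(["d1", "d2"], symbol_match)
eps = chernoff_epsilon(R, n=10_000, delta=0.01)
label = classify(sample_match=0.05, min_match=0.04, eps=eps)
```

Using the restricted spread instead of R = 1 shrinks epsilon in proportion, which narrows the ambiguous band for patterns that contain a rare symbol.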

Slide 25: Algorithm
- Scan the DB: $O(N \cdot \min(L_S \cdot m,\ L_S + m^2))$
  - Find the match of each individual symbol.
  - Take a random sample of sequences.
- Identify borders that embrace the set of ambiguous patterns: $O(m^{L_P} \cdot |S| \cdot L_P \cdot n)$
  - min_match $\pm\ \epsilon$ ⇒ existing methods for association rule mining
- Locate the border of frequent patterns in the entire DB via border collapsing.

Slide 26: Border Collapsing
- Used when memory cannot hold the counters of all ambiguous patterns.
- Probe-and-collapse: binary search
- Probe the patterns with the highest collapsing power until memory is filled.
- If memory can hold all patterns up to the 1/x layer, the space of ambiguous patterns can be narrowed to at least 1/x of the original one, where x is a power of 2.
- If a level-wise search takes y scans of the DB, only $O(\log_x y)$ scans are necessary when the border collapsing technique is employed.
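A back-of-the-envelope count shows where the logarithm comes from. The sketch below is a simplified model, not the published procedure: it assumes each scan shrinks the number of ambiguous layers to 1/x of what it was (rounded up) and simply counts scans until one layer remains. The function name is illustrative.

```python
def scans_with_collapsing(y, x):
    """Scans needed if every scan narrows y ambiguous layers to ceil(y / x)."""
    scans, remaining = 0, y
    while remaining > 1:
        remaining = -(-remaining // x)  # ceiling division
        scans += 1
    return scans


level_wise_scans = 16                                          # one scan per layer
collapsing_x2 = scans_with_collapsing(level_wise_scans, x=2)   # halve each scan
collapsing_x4 = scans_with_collapsing(level_wise_scans, x=4)   # quarter each scan
```

With 16 layers, halving takes 4 scans and quartering takes 2, compared with 16 scans for a level-wise search.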

Slide 28: Studies on Sequential Pattern Mining
- Concept introduction and an initial Apriori-like algorithm [AgSr95]
- GSP: an Apriori-based, influential mining method [SrAg96]
- Mining sequential patterns with constraints [GRS99]
- Mining long sequential patterns [JWY02]

Slide 29: Periodic Pattern
- Full periodic pattern: ABC ABC ABC
- Partial periodic pattern: ABC ADC ACC ABC
- Pattern hierarchy: ABC ABC ABC DE DE DE DE ABC ABC ABC DE DE DE DE ABC ABC ABC DE DE DE DE

Slide 30: Periodic Pattern: Recent Achievements
- Partial periodic pattern
- Asynchronous periodic pattern
- Meta pattern
- InfoMiner / InfoMiner+ / STAMP

Slide 31: Clustering Sequential Data
- CLUSEQ
- ApproxMAP