Overview of Association Rule Mining - CENG 770 Advanced Data Mining


What Is Frequent Pattern Analysis?
Motivation: finding inherent regularities in data
– What products were often purchased together? Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set.
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining.

Basic Concepts: Frequent Patterns and Association Rules
Itemset X = {x1, …, xk}
Find all the rules X → Y with minimum support and confidence
– support, s: probability that a transaction contains X ∪ Y
– confidence, c: conditional probability that a transaction having X also contains Y

Transaction-id  Items bought
10              A, B, D
20              A, C, D
30              A, D, E
40              B, E, F
50              B, C, D, E, F

Let sup_min = 50%, conf_min = 50%
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
– A → D (support 60%, confidence 100%)
– D → A (support 60%, confidence 75%)
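To make support and confidence concrete, here is a minimal Python sketch (not part of the original slides; the helper names are illustrative) that recomputes the rule statistics for the toy database above:

# Toy transaction database from the slide above.
db = {10: {"A", "B", "D"}, 20: {"A", "C", "D"}, 30: {"A", "D", "E"},
      40: {"B", "E", "F"}, 50: {"B", "C", "D", "E", "F"}}

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in db.values() if itemset <= t) / len(db)

def confidence(X, Y):
    """Conditional probability that a transaction containing X also contains Y."""
    return support(X | Y) / support(X)

print(support({"A", "D"}))       # 0.6  -> {A, D} occurs in 3 of 5 transactions
print(confidence({"A"}, {"D"}))  # 1.0  -> A => D (60%, 100%)
print(confidence({"D"}, {"A"}))  # 0.75 -> D => A (60%, 75%)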

Use of Association Rules
Association rules do not represent any sort of causality or correlation between the two itemsets:
– X → Y does not mean X causes Y, so no causality
– X → Y can be different from Y → X, unlike correlation
Association rules assist in marketing, targeted advertising, floor planning, inventory control, churn management, and homeland security, as well as basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click-stream) analysis, and DNA sequence analysis.

Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 - 1 ≈ 1.27 × 10^30 sub-patterns!
Solution: mine closed patterns and max-patterns instead
– An itemset X is closed if it is frequent and there exists no super-pattern Y ⊃ X with the same support as X
– An itemset X is a max-pattern if it is frequent and there exists no frequent super-pattern Y ⊃ X
– An itemset X is a generator if it is frequent and there exists no sub-pattern Y ⊂ X with the same support as X
Closed patterns are a lossless compression of the frequent patterns, reducing the number of patterns and rules.
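As a small illustration (a sketch reusing the frequent itemsets and supports from the earlier toy example, not from the slides), the following Python snippet flags which frequent itemsets are closed and which are maximal:

supports = {frozenset("A"): 3, frozenset("B"): 3, frozenset("D"): 4,
            frozenset("E"): 3, frozenset({"A", "D"}): 3}

def is_closed(x):
    """X is closed if no frequent proper superset has the same support as X."""
    return all(not (x < y and s == supports[x]) for y, s in supports.items())

def is_max(x):
    """X is a max-pattern if no frequent proper superset exists at all."""
    return all(not (x < y) for y in supports)

for x, s in supports.items():
    labels = [name for name, test in (("closed", is_closed), ("max", is_max)) if test(x)]
    print(set(x), s, labels)
# {A} is neither (AD has the same support); {D} is closed but not max; B, E, AD are both.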

Closed Patterns and Max-Patterns: Example
DB = {⟨a1, …, a100⟩, ⟨a1, …, a50⟩}, with minimum support count 1.
– What is the set of closed itemsets? {a1, …, a100}: 1 and {a1, …, a50}: 2
– What is the set of max-patterns? {a1, …, a100}: 1
– What is the set of all frequent patterns? All 2^100 - 1 non-empty sub-patterns of {a1, …, a100}, far too many to enumerate.

Methods for Mining Frequent Patterns
Apriori rule (the downward closure property, DCP): any subset of a frequent itemset must be frequent
– If {beer, diaper, nuts} is frequent, so is {beer, diaper}, i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Basic frequent itemset mining (FIM) methods:
– Apriori (candidate generation and test)
– Frequent-pattern growth (FPgrowth and Closet; Han, Pei & Yin, SIGMOD 2000)
– Vertical data format approach (SPADE, CHARM)

Apriori: A Candidate Generation-and-Test Approach
Apriori pruning principle: if there is any itemset that is infrequent, its supersets should not be generated/tested! (Agrawal & Srikant @VLDB'94; Mannila et al. @KDD'94)
Method:
– Initially, scan the DB once to get the frequent 1-itemsets
– Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
– Test the candidates against the DB
– Terminate when no frequent or candidate set can be generated

The Apriori Algorithm: An Example (sup_min = 2)

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (frequent 1-itemsets): {A}:2, {B}:3, {C}:3, {E}:3

C2 (candidates): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (candidate): {B,C,E}
3rd scan → L3: {B,C,E}:2

The Apriori Algorithm (pseudo-code)
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
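The pseudo-code above translates almost directly into Python. The sketch below (an illustration, not the slides' own code; min_sup is an absolute count) runs the level-wise loop on the example TDB from the previous slide:

from itertools import combinations

def apriori(db, min_sup):
    """Level-wise Apriori: db is a list of transaction sets, min_sup an absolute count.
    Returns a dict mapping each frequent itemset (as a frozenset) to its support count."""
    items = {i for t in db for i in t}
    # L1: one scan for the frequent 1-itemsets.
    counts = {frozenset([i]): sum(1 for t in db if i in t) for i in items}
    freq = {s: c for s, c in counts.items() if c >= min_sup}
    all_freq, k = dict(freq), 1
    while freq:
        prev = set(freq)
        # Candidate generation: self-join L_k, then prune by downward closure.
        cands = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        cands = {c for c in cands if all(frozenset(s) in prev for s in combinations(c, k))}
        # One database scan to count the surviving candidates.
        counts = {c: sum(1 for t in db if c <= t) for c in cands}
        freq = {c: n for c, n in counts.items() if n >= min_sup}
        all_freq.update(freq)
        k += 1
    return all_freq

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(tdb, 2))   # includes {B, C, E}: 2, matching the 3rd-scan result above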

Important Details of Apriori
How to generate candidates?
– Step 1: self-join Lk
– Step 2: pruning
Example of candidate generation:
– L3 = {abc, abd, acd, ace, bcd}
– Self-joining L3 * L3: abcd from abc and abd; acde from acd and ace
– Pruning: acde is removed because ade is not in L3
– C4 = {abcd}

How to Generate Candidates?
Suppose the items in Lk-1 are listed in an order.
Step 1: self-join Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
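A Python version of the same join and prune steps may be easier to experiment with than the pseudo-SQL; the sketch below assumes each k-itemset is kept as a lexicographically sorted tuple (names such as apriori_gen are illustrative):

from itertools import combinations

def apriori_gen(Lk):
    """Generate C(k+1) from Lk by self-join and prune, mirroring the pseudo-SQL above.
    Lk: set of k-itemsets, each a sorted tuple of items."""
    k = len(next(iter(Lk)))
    Lk = set(Lk)
    candidates = set()
    # Self-join: p and q agree on the first k-1 items and p's last item < q's last item.
    for p in Lk:
        for q in Lk:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # Prune: drop any candidate with an infrequent (k)-subset.
    return {c for c in candidates
            if all(s in Lk for s in combinations(c, k))}

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")}
print(apriori_gen(L3))   # {('a', 'b', 'c', 'd')}; acde is pruned because ade is not in L3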

Challenges of Frequent Pattern Mining
Challenges:
– Multiple scans of the transaction database
– Huge number of candidates
– Tedious workload of support counting for the candidates
Improving Apriori: general ideas
– Reduce the number of transaction-database scans
– Shrink the number of candidates
– Facilitate support counting of candidates

Partition: Scan the Database Only Twice
Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
– Scan 1: partition the database and find local frequent patterns
– Scan 2: consolidate the global frequent patterns
A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95.

Transaction Reduction
A transaction that does not contain any frequent k-itemset cannot contain any frequent (k+1)-itemset. Such a transaction can therefore be marked or removed from further consideration, because subsequent scans of the database for j-itemsets, where j > k, will not require it.

Sampling for Frequent Patterns
Select a sample of the original database and mine frequent patterns within the sample using Apriori (PL: potentially large itemsets).
Find additional candidates using the negative border function: candidate set = PL ∪ BD⁻(PL), where BD⁻(PL) is the minimal set of itemsets that are not in PL but all of whose subsets are in PL.
Scan the database once to verify the frequent itemsets found in the sample and to check only the border of the closure of the frequent patterns (first pass).
If the scan uncovers new frequent itemsets, iteratively apply BD⁻(C) to expand the border of the closure and scan the database again to find missed frequent patterns (second pass, which may or may not be needed).
H. Toivonen. Sampling large databases for association rules. In VLDB'96.
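A small sketch of the negative border computation (an illustration, not from the slides; the item universe must be passed in explicitly because items absent from PL can still appear in the border):

from itertools import combinations

def negative_border(pl, items):
    """BD^-(PL): itemsets not in PL all of whose proper subsets are in PL.
    pl: collection of frozensets found frequent in the sample; items: full item universe."""
    pl = set(pl)
    max_k = max((len(x) for x in pl), default=0) + 1
    border = set()
    for k in range(1, max_k + 1):
        for cand in combinations(sorted(items), k):
            c = frozenset(cand)
            if c in pl:
                continue
            if k == 1 or all(frozenset(s) in pl for s in combinations(cand, k - 1)):
                border.add(c)
    return border

pl = [frozenset(x) for x in (["Bread"], ["Jelly"], ["Peanutbutter"],
                             ["Bread", "Jelly"], ["Bread", "Peanutbutter"],
                             ["Jelly", "Peanutbutter"],
                             ["Bread", "Jelly", "Peanutbutter"])]
items = {"Bread", "Jelly", "Peanutbutter", "Beer", "Milk"}
print(negative_border(pl, items))   # {frozenset({'Beer'}), frozenset({'Milk'})}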

Sampling for Frequent Patterns: Example
DB (support threshold 40%):
t1: {Bread, Jelly, Peanutbutter}
t2: {Bread, Peanutbutter}
t3: {Bread, Milk, Peanutbutter}
t4: {Beer, Bread}
t5: {Beer, Milk}
Sample (support threshold lowered to 10%):
t1: {Bread, Jelly, Peanutbutter}
t2: {Bread, Peanutbutter}

Sampling for Frequent Patterns (continued)
Sample (support threshold 10%): t1: {Bread, Jelly, Peanutbutter}, t2: {Bread, Peanutbutter}; on the two-transaction sample, a support count of 1 suffices.
PL = {{Bread}, {Jelly}, {Peanutbutter}, {Bread, Jelly}, {Bread, Peanutbutter}, {Jelly, Peanutbutter}, {Bread, Jelly, Peanutbutter}}
BD⁻(PL) = {{Beer}, {Milk}}
PL ∪ BD⁻(PL) = {{Bread}, {Jelly}, {Peanutbutter}, {Bread, Jelly}, {Bread, Peanutbutter}, {Jelly, Peanutbutter}, {Bread, Jelly, Peanutbutter}, {Beer}, {Milk}}
C = {{Bread}, {Peanutbutter}, {Bread, Peanutbutter}, {Beer}, {Milk}} (frequent on DB with the 40% support threshold)
Need one more pass.

Sampling for Frequent Patterns (continued)
C = {{Bread}, {Peanutbutter}, {Bread, Peanutbutter}, {Beer}, {Milk}}
BD⁻(C) = {{Bread, Beer}, {Bread, Milk}, {Beer, Milk}, {Beer, Peanutbutter}, {Milk, Peanutbutter}}; C = C ∪ BD⁻(C)
BD⁻(C) = {{Bread, Beer, Milk}, {Bread, Beer, Peanutbutter}, {Beer, Milk, Peanutbutter}, {Bread, Milk, Peanutbutter}}; C = C ∪ BD⁻(C)
BD⁻(C) = {{Bread, Beer, Milk, Peanutbutter}}; C = C ∪ BD⁻(C)
F = {{Bread}, {Peanutbutter}, {Bread, Peanutbutter}, {Beer}, {Milk}} (the frequent itemsets on DB with the 40% support threshold)

Mining Frequent Patterns Without Candidate Generation
Build a compact representation of the database (the FP-tree) so that the database does not need to be scanned repeatedly.
Find the frequent itemsets recursively from this structure.

Construct the FP-tree from a Transaction Database (min_support = 3)

TID  Items bought                (Ordered) frequent items
100  {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200  {a, b, c, f, l, m, o}       {f, c, a, b, m}
300  {b, f, h, j, o, w}          {f, b}
400  {b, c, k, s, p}             {c, b, p}
500  {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

1. Scan the DB once and find the frequent 1-itemsets (single-item patterns)
2. Sort the frequent items of each transaction in descending frequency order
3. Scan the DB again and construct the FP-tree

Header table: f:4, c:4, a:3, b:3, m:3, p:3 (each entry keeps the head of that item's node-links into the tree)
Resulting FP-tree (root {}): main path f:4 → c:3 → a:3 → m:2 → p:2; a:3 also has a child b:1 with child m:1; f:4 has a second child b:1; and the root has a second child c:1 → b:1 → p:1.
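For concreteness, a compact Python sketch of the two-scan FP-tree construction (illustrative, not the slides' code; ties in frequency are broken alphabetically here, so the exact tree shape can differ slightly from the slide's figure while all counts agree):

from collections import Counter, defaultdict

class FPNode:
    """FP-tree node: an item, a count, a parent link, and child links."""
    def __init__(self, item=None, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fptree(db, min_sup):
    """Two scans: count item frequencies, then insert each transaction in frequency order.
    Returns (root, header) where header maps each frequent item to its node-link list."""
    counts = Counter(i for t in db for i in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_sup}
    root, header = FPNode(), defaultdict(list)
    for t in db:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)   # node-link for this item
            node = node.children[item]
            node.count += 1
    return root, header

# The five transactions from the table above, min_support = 3.
db = [list("facdgimp"), list("abcflmo"), list("bfhjow"), list("bcksp"), list("afcelpmn")]
root, header = build_fptree(db, 3)
print({i: sum(n.count for n in header[i]) for i in header})   # f:4, c:4, a:3, b:3, m:3, p:3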

Find Patterns Having p From p's Conditional Pattern Base
Starting at the frequent-item header table of the FP-tree:
– Traverse the FP-tree by following the node-links of each frequent item p
– Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases:
item  conditional pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1

From Conditional Pattern Bases to Conditional FP-trees
For each pattern base:
– Accumulate the count for each item in the base
– Construct the FP-tree for the frequent items of the pattern base
Example: the m-conditional pattern base is {fca:2, fcab:1}; the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3.
All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam.

Recursion: Mining Each Conditional FP-tree
m-conditional FP-tree: {} → f:3 → c:3 → a:3
– Conditional pattern base of "am": (fc:3); am-conditional FP-tree: {} → f:3 → c:3
– Conditional pattern base of "cm": (f:3); cm-conditional FP-tree: {} → f:3
– Conditional pattern base of "cam": (f:3); cam-conditional FP-tree: {} → f:3

A Special Case: Single Prefix Path in the FP-tree
Suppose a (conditional) FP-tree T has a shared single prefix path P.
Mining can be decomposed into two parts:
– Reduce the single prefix path into one node
– Concatenate the mining results of the two parts
(Figure: a tree with prefix path a1:n1 → a2:n2 → a3:n3 and branching subtrees b1:m1, c1:k1, c2:k2, c3:k3, split into the prefix-path part and the branching part.)

Mining Frequent Patterns With FP-trees
Idea: frequent-pattern growth
– Recursively grow frequent patterns by pattern and database partition
Method:
– For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
– Repeat the process on each newly created conditional FP-tree
– Stop when the resulting FP-tree is empty, or when it contains only one path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern
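On real data, an off-the-shelf FP-growth implementation can be used instead of hand-rolled code; the sketch below assumes the third-party mlxtend library (not mentioned on the slides) is installed:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [list("facdgimp"), list("abcflmo"), list("bfhjow"),
                list("bcksp"), list("afcelpmn")]

# One-hot encode the transactions, then mine with FP-growth (min_support = 3/5).
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
frequent = fpgrowth(onehot, min_support=0.6, use_colnames=True)
print(frequent.sort_values("support", ascending=False))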

FP-Growth vs. Apriori: Scalability with the Support Threshold
(Figure: run time of FP-growth and Apriori as the support threshold decreases, on data set T25I20D10K.)

Mining by Exploring the Vertical Data Format
Vertical format: t(AB) = {T11, T25, …}
– tid-list: the list of transaction ids containing an itemset
Deriving closed patterns based on vertical intersections:
– t(X) = t(Y): X and Y always happen together
– t(X) ⊂ t(Y): a transaction having X always has Y
Using diffsets to accelerate mining:
– Only keep track of the differences of tid-lists
– t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
– Diffset(XY, X) = {T2}
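A tiny illustration of tid-list intersection and diffsets in Python, using the vertical view of the earlier five-transaction toy database (a sketch; the helper t is just illustrative):

# Vertical (tid-list) view of the toy database used for support/confidence earlier.
tids = {"A": {10, 20, 30}, "B": {10, 40, 50}, "C": {20, 50},
        "D": {10, 20, 30, 50}, "E": {30, 40, 50}, "F": {40, 50}}

def t(*items):
    """tid-list of an itemset: the intersection of the tid-lists of its items."""
    out = None
    for i in items:
        out = tids[i] if out is None else out & tids[i]
    return out

print(t("A", "D"))            # {10, 20, 30}: the support count of AD is 3
print(t("A") <= t("D"))       # True: a transaction having A always has D
print(t("A") - t("A", "D"))   # diffset(AD, A) = set(); store only this difference, not t(AD)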

Sequential Associations
Sequence: an ordered list of itemsets
– I = {i1, i2, …, im} is the set of items
– S = ⟨s1, s2, …, sn⟩, where each sj ⊆ I
A subsequence of a given sequence is one that can be obtained by removing some items (and any resulting empty itemsets) from the original sequence.
– For example, ⟨{A}, {B}⟩ is a subsequence of ⟨{A,C}, {B}, {D}⟩, whereas ⟨{A,C}, {B}⟩ is not a subsequence of ⟨{A}, {C}, {B}⟩ because A and C never occur in the same itemset.
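The subsequence test can be written as a short greedy check; a sketch (names are illustrative), with sequences represented as lists of item sets:

def is_subsequence(sub, seq):
    """True if sub can be obtained from seq by deleting items (and empty itemsets).
    Each itemset of sub must be contained in a distinct itemset of seq, in order."""
    i = 0
    for itemset in seq:
        if i < len(sub) and sub[i] <= itemset:
            i += 1
    return i == len(sub)

seq = [{"A", "C"}, {"B"}, {"D"}]
print(is_subsequence([{"A"}, {"B"}], seq))   # True
print(is_subsequence([{"A", "B"}], seq))     # False: A and B never occur in one itemset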

Sequential Associations
Support(s): the percentage of sessions (customers) whose session sequence contains the sequence s
Confidence(s → t): the ratio of the number of sessions that contain both sequences s and t to the number that contain s

Example:
Customer  Time  Itemset
C1        10    A, B
C1        20    B, C
C1        30    D
C2        15    A, B, C
C2        20    D
C3        15    A, C, D

s = ⟨…⟩, Support(s) = 1/3
u = ⟨…⟩, Support(u) = 2/3

Sequential Associations: Basic Algorithms
AprioriAll
– The basic algorithm for finding sequential associations
GSP
– More efficient than AprioriAll
– Includes extensions: sliding window, concept hierarchy (taxonomy)
SPADE
– Improves performance by using a vertical table structure and equivalence classes
Episode mining
– Events are ordered by occurrence time
– Uses a window

Sequential Associations
Support for a sequence: the fraction of total customers that support the sequence.
Large sequence: a sequence that meets min_sup.
Maximal sequence: a sequence that is not contained in any other large sequence.

Sequential Associations: Example (min_sup = 25%)

Customer ID  Transaction Time  Items Bought
1            June 25 '93       30
1            June 30 '93       90
2            June 10 '93       10, 20
2            June 15 '93       30
2            June 20 '93       40, 60, 70
3            June 25 '93       30, 50, 70
4            June 25 '93       30
4            June 30 '93       40, 70
4            July 25 '93       90
5            June 12 '93       90

Customer ID  Customer Sequence
1            { (30) (90) }
2            { (10 20) (30) (40 60 70) }
3            { (30 50 70) }
4            { (30) (40 70) (90) }
5            { (90) }

Maximal sequences with support > 25%: { (30) (90) } and { (30) (40 70) }
{ (10 20) (30) } does not satisfy min_sup (it is supported only by customer 2).
{ (30) }, { (70) }, { (30) (40) }, … are frequent but not maximal.

Frequent Sequence Generation (min_sup = 25%)
(Tables: the customer sequences and the large 1-, 2-, 3-, and 4-sequences, L1 through L4, with their supports.)
Maximal sequences: { (30) (90) } and { (30) (40 70) }.

Sequential Associations: Candidate Generation
– Initially find the 1-element frequent sequences.
– Each subsequent pass starts from the frequent sequences found in the previous pass.
– Each candidate sequence has one more item than a previously found frequent sequence, so all candidate sequences in the same pass have the same number of items.
– Candidate generation:
  – Join: s1 joins with s2 if the subsequence obtained by dropping the first item of s1 is the same as the subsequence obtained by dropping the last item of s2; s1 is then extended with the last item of s2 (see the sketch below).
  – Prune: delete a candidate if it has an infrequent contiguous subsequence.
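A minimal sketch of the GSP join step (illustrative, not the slides' code), with a sequence represented as a list of lexicographically sorted tuples:

from typing import List, Optional, Tuple

Sequence = List[Tuple[str, ...]]   # an ordered list of itemsets

def drop_first_item(s: Sequence) -> Sequence:
    """Remove the first item of the first itemset (drop the itemset if it becomes empty)."""
    head, rest = s[0], s[1:]
    return ([head[1:]] if len(head) > 1 else []) + rest

def drop_last_item(s: Sequence) -> Sequence:
    """Remove the last item of the last itemset (drop the itemset if it becomes empty)."""
    rest, tail = s[:-1], s[-1]
    return rest + ([tail[:-1]] if len(tail) > 1 else [])

def gsp_join(s1: Sequence, s2: Sequence) -> Optional[Sequence]:
    """Join s1 and s2 when s1 without its first item equals s2 without its last item."""
    if drop_first_item(s1) != drop_last_item(s2):
        return None
    last = s2[-1][-1]
    if len(s2[-1]) > 1:                              # the item extends s1's last itemset
        return s1[:-1] + [tuple(sorted(s1[-1] + (last,)))]
    return s1 + [(last,)]                            # the item starts a new itemset

# Joining <(1 2)(3)> with <(2)(3 4)> yields the candidate <(1 2)(3 4)>.
print(gsp_join([("1", "2"), ("3",)], [("2",), ("3", "4")]))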

Sequential Associations: GSP Example (min_sup = 2)
Initial candidates: the eight length-1 sequences, one per item.
Scan the database once and count the support of each candidate.
(Tables: the sequence database (Seq. ID, Sequence) and the candidate support counts.)

Sequential Associations: Length-2 Candidates
Without the Apriori property there would be 8×8 + 8×7/2 = 92 length-2 candidates; Apriori pruning removes 44.57% of them.

Sequential Associations: GSP Scans (min_sup = 2)
– 1st scan: 8 candidates → 6 length-1 sequential patterns
– 2nd scan: 51 candidates → 19 length-2 sequential patterns; 10 candidates do not appear in the DB at all
– 3rd scan: 46 candidates → 19 length-3 sequential patterns; 20 candidates not in the DB at all
– 4th scan: 8 candidates → 6 length-4 sequential patterns
– 5th scan: 1 candidate → 1 length-5 sequential pattern
Candidates are eliminated either because they cannot pass the support threshold or because they do not appear in the DB at all.

Sequential Associations: Candidate Generation Example
From the frequent 3-sequences L3, the candidate 4-sequences C4 are produced by the join step and then reduced by the prune step.
(Tables: L3, C4 after the join, and C4 after pruning.)

Sequential Associations: Extended Features
– Concept hierarchies (taxonomies)
– Sliding window
– Time constraints

Sequential Associations: Example with Extensions
Parameters: min-sup count = 2, sliding window = 7 days, max-gap = 30 days.
Taxonomy: Science Fiction covers Asimov (Foundation, Foundation and Empire, Second Foundation) and Niven (Ringworld, Ringworld Engineers); Spy covers Le Carre (Perfect Spy, Smiley's People).

Sequence-Id  Transaction Time  Items
C1           1                 Ringworld
C1           2                 Foundation
C1           15                Ringworld Engineers, Second Foundation
C2           1                 Foundation, Ringworld
C2           20                Foundation and Empire
C2           50                Ringworld Engineers

With the min-sup constraint alone, no patterns are found; the sliding window, max-gap, and taxonomy determine which sequential patterns qualify.

SPADE (Sequential PAttern Discovery using Equivalence classes), developed by Zaki (2001)
A vertical-format sequential pattern mining method: the sequence database is mapped to a large set of ⟨item: (sequence-id, event-id)⟩ entries.
Sequential pattern mining is performed by growing the subsequences (patterns) one item at a time, using Apriori-style candidate generation.

Sequential Associations - SPADE

Bottlenecks of Candidate Generate-and-Test
A huge set of candidates is generated
– Especially for 2-item candidate sequences
Multiple scans of the database during mining
– The length of each candidate grows by one at each database scan
Inefficient for mining long sequential patterns
– A long pattern grows up from short patterns
– An exponential number of short candidates

PrefixSpan (Prefix-Projected Sequential Pattern Growth)
– Projection-based
– But only prefix-based projection: fewer projections and quickly shrinking sequences
J. Pei, J. Han, et al. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01.

Mining Sequential Patterns by Prefix Projections
Step 1: find the length-1 sequential patterns ⟨a⟩, ⟨b⟩, ⟨c⟩, ⟨d⟩, ⟨e⟩, ⟨f⟩
Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets:
– the ones having prefix ⟨a⟩;
– …
– the ones having prefix ⟨f⟩
(Table: the example sequence database (SID, sequence).)

Finding Sequential Patterns with Prefix ⟨a⟩
Only need to consider projections with respect to ⟨a⟩
– The ⟨a⟩-projected database: the suffix of each sequence after the first occurrence of a
Find all the length-2 sequential patterns having prefix ⟨a⟩
– Then further partition into 6 subsets: those having prefix ⟨aa⟩; …; those having prefix ⟨af⟩
(Table: the example sequence database (SID, sequence).)

The PrefixSpan Algorithm
Input: a sequence database S and the minimum support threshold min_sup
Output: the complete set of sequential patterns
Method: call PrefixSpan(⟨⟩, 0, S)
Subroutine PrefixSpan(α, l, S|α)
Parameters:
– α: a sequential pattern
– l: the length of α
– S|α: the α-projected database if α ≠ ⟨⟩; otherwise, the sequence database S itself

The PrefixSpan Algorithm (continued)
Method:
1. Scan S|α once and find the set of frequent items b such that:
   a) b can be assembled into the last element of α to form a sequential pattern, or
   b) ⟨b⟩ can be appended to α to form a sequential pattern.
2. For each frequent item b, append it to α to form a sequential pattern α', and output α'.
3. For each α', construct the α'-projected database S|α' and call PrefixSpan(α', l+1, S|α').
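A simplified PrefixSpan sketch for sequences of single items (no multi-item events; illustrative only, with min_sup as an absolute count):

from collections import Counter
from typing import List, Tuple

def prefixspan(db: List[List[str]], min_sup: int) -> List[Tuple[List[str], int]]:
    """Return (pattern, support) pairs for all sequential patterns in db."""
    results: List[Tuple[List[str], int]] = []

    def mine(prefix: List[str], projected: List[List[str]]) -> None:
        # Count the items that appear in the projected (suffix) database.
        counts = Counter()
        for seq in projected:
            counts.update(set(seq))
        for item, sup in sorted(counts.items()):
            if sup < min_sup:
                continue
            pattern = prefix + [item]
            results.append((pattern, sup))
            # Build the <pattern>-projected database: suffixes after the first occurrence.
            new_projected = [seq[seq.index(item) + 1:]
                             for seq in projected if item in seq]
            mine(pattern, [s for s in new_projected if s])

    mine([], db)
    return results

toy_db = [["a", "b", "c"], ["a", "c"], ["a", "b"], ["b", "c"]]
print(prefixspan(toy_db, min_sup=2))
# [(['a'], 3), (['a', 'b'], 2), (['a', 'c'], 2), (['b'], 3), (['b', 'c'], 2), (['c'], 3)]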

PrefixSpan: Example
1. Find the length-1 sequential patterns.
2. Divide the search space by prefix.
(Table: each prefix with its projected database, for the example sequence database.)

PrefixSpan: Example (2)
Find the subsets of sequential patterns recursively, starting from the empty prefix ⟨⟩.
(Figure: the tree of prefixes and projected databases grown from ⟨⟩.)

Performance on Data Set C10T8S8I8
(Figure: run time versus support threshold on the synthetic data set C10T8S8I8.)

Performance on Data Set Gazelle
(Figure: run time versus support threshold on the Gazelle data set.)

Prefix and Suffix (Projection)
For a sequence, a prefix is formed by its leading itemsets (a final itemset may be taken partially, using its smallest items); the remainder of the sequence with respect to a prefix is its suffix, i.e., the prefix-based projection.
(Table: example prefixes of a sequence and the corresponding suffixes.)

Interestingness Measure: Correlation (Lift)
play basketball ⇒ eat cereal [40%, 66.7%] is misleading
– The overall percentage of students eating cereal is 75%, which is higher than 66.7%
play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence
Measure of dependent/correlated events: lift (also called correlation or interest):
lift(A, B) = P(A ∪ B) / (P(A) P(B))

            Basketball  Not basketball  Sum (row)
Cereal      2000        1750            3750
Not cereal  1000        250             1250
Sum (col.)  3000        2000            5000

lift(basketball, cereal) = (2000/5000) / ((3000/5000)(3750/5000)) ≈ 0.89 < 1 (negatively correlated)
lift(basketball, not cereal) = (1000/5000) / ((3000/5000)(1250/5000)) ≈ 1.33 > 1 (positively correlated)

Interestingness Measure: Conviction
Measures the independence of the negation of the implication:
Conviction(A → B) = P(A) P(¬B) / P(A, ¬B)
– Value 1: A and B are not related; value ∞: the implication always holds
Example (same contingency table as above):
Conviction(basketball → cereal) = (3000/5000) × (1250/5000) / (1000/5000) = 0.75
Conviction(basketball → not cereal) = (3000/5000) × (3750/5000) / (2000/5000) ≈ 1.12
Conviction(not basketball → cereal) = (2000/5000) × (1250/5000) / (250/5000) = 2

Interestingness Measure: Conviction (continued)
Conviction(A → B) = P(A) P(¬B) / P(A, ¬B)
Example:
t1: {Bread, Jelly, Peanutbutter}
t2: {Bread, Peanutbutter}
t3: {Bread, Milk, Peanutbutter}
t4: {Beer, Bread}
t5: {Beer, Milk}
conviction(Peanutbutter → Bread) = (3/5 × 1/5) / 0 = ∞
conviction(Bread → Peanutbutter) = (4/5 × 2/5) / (1/5) = 8/5 > 1
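A small Python sketch (illustrative names, not from the slides) that reproduces these numbers and also reports support, confidence, and lift for a rule A → B:

def measures(transactions, A, B):
    """Support, confidence, lift and conviction of the rule A -> B over a list of item sets."""
    n = len(transactions)
    def p(items):
        return sum(1 for t in transactions if items <= t) / n
    s = p(A | B)                        # support of the rule
    conf = s / p(A)                     # confidence = P(B | A)
    lift = s / (p(A) * p(B))            # > 1 means positively correlated
    p_a_not_b = p(A) - s                # P(A, not B)
    conviction = float("inf") if p_a_not_b == 0 else p(A) * (1 - p(B)) / p_a_not_b
    return s, conf, lift, conviction

db = [{"Bread", "Jelly", "Peanutbutter"}, {"Bread", "Peanutbutter"},
      {"Bread", "Milk", "Peanutbutter"}, {"Beer", "Bread"}, {"Beer", "Milk"}]
print(measures(db, {"Peanutbutter"}, {"Bread"}))   # conviction = inf
print(measures(db, {"Bread"}, {"Peanutbutter"}))   # conviction = 8/5 = 1.6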

Interestingness Measure: F-measure
f-measure = ((1 + α²) × support × confidence) / ((α × confidence) + support)
α takes a value in [0, 1]; when α = 0, support has minimal effect and the measure reduces to the confidence.
Another way of writing the same measure, with β in place of α:
F-metric = ((β² + 1) × confidence × support) / ((β × confidence) + support)