Data Mining Association Rules: Advanced Concepts and Algorithms Lecture Notes Introduction to Data Mining by Tan, Steinbach, Kumar.


1 Data Mining Association Rules: Advanced Concepts and Algorithms Lecture Notes Introduction to Data Mining by Tan, Steinbach, Kumar

2 Sequence Data
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004
Sequence Database:
Object  Timestamp  Events
A       10         2, 3, 5
A       20         6, 1
A       23         1
B       11         4, 5, 6
B       17         2
B       21         7, 8, 1, 2
B       28         1, 6
C       14         1, 8, 7

3 Examples of Sequence Data
Sequence Database | Sequence | Element (Transaction) | Event (Item)
Customer | Purchase history of a given customer | A set of items bought by a customer at time t | Books, dairy products, CDs, etc.
Web Data | Browsing activity of a particular Web visitor | A collection of files viewed by a Web visitor after a single mouse click | Home page, index page, contact info, etc.
Event data | History of events generated by a given sensor | Events triggered by a sensor at time t | Types of alarms generated by sensors
Genome sequences | DNA sequence of a particular species | An element of the DNA sequence | Bases A, T, G, C
[Figure: a sequence drawn as a series of elements (transactions), each containing one or more events (items) E1 to E4]

4 Formal Definition of a Sequence
• A sequence is an ordered list of elements (transactions): s = ⟨e_1 e_2 e_3 …⟩
  – Each element contains a collection of events (items): e_i = {i_1, i_2, …, i_k}
  – Each element is attributed to a specific time or location
• The length of a sequence, |s|, is the number of elements in the sequence
• A k-sequence is a sequence that contains k events (items)

5 Examples of Sequence
• Web sequence: [sequence not preserved in this transcript]
• Sequence of initiating events causing the nuclear accident at Three Mile Island: [sequence not preserved in this transcript] (http://stellar-one.com/nuclear/staff_reports/summary_SOE_the_initiating_event.htm)
• Sequence of books checked out at a library: [sequence not preserved in this transcript]

6 Formal Definition of a Subsequence
• A sequence ⟨a_1 a_2 … a_n⟩ is contained in another sequence ⟨b_1 b_2 … b_m⟩ (m ≥ n) if there exist integers i_1 < i_2 < … < i_n such that a_1 ⊆ b_{i_1}, a_2 ⊆ b_{i_2}, …, a_n ⊆ b_{i_n}
• The support of a subsequence w is defined as the fraction of data sequences that contain w
• A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup)
[Table: Data sequence | Subsequence | Contain? The three example sequences are not preserved in this transcript; their answers are Yes, No, Yes.]
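The containment test and the support definition on this slide can be sketched in Python. This is a minimal illustration rather than code from the lecture notes: the function names `contains` and `support`, the list-of-sets representation, and the example sequences are all assumptions made for the sketch.

```python
def contains(data_seq, sub):
    """Return True if `sub` is a subsequence of `data_seq`.

    Both arguments are lists of sets (one set of events per element).
    Matching each element as early as possible is sufficient when
    there are no timing constraints.
    """
    pos = 0
    for element in sub:
        # Skip data elements that are not supersets of the pattern element.
        while pos < len(data_seq) and not element <= data_seq[pos]:
            pos += 1
        if pos == len(data_seq):
            return False
        pos += 1  # the next pattern element must match strictly later
    return True


def support(database, sub):
    """Fraction of data sequences in `database` that contain `sub`."""
    return sum(contains(seq, sub) for seq in database) / len(database)


# Illustrative inputs (assumed for this sketch, not taken from the transcript).
database = [
    [{2, 4}, {3, 5, 6}, {8}],
    [{1, 2}, {3, 4}],
    [{2, 4}, {2, 4}, {2, 5}],
]
print(contains(database[0], [{2}, {3, 5}]))  # True
print(contains(database[1], [{1}, {2}]))     # False: 1 and 2 occur in the same element
print(contains(database[2], [{2}, {4}]))     # True
print(support(database, [{2}, {4}]))         # 2 of the 3 sequences contain it
```

The second case shows why order matters: both events are present, but not in two successive elements.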

7 Sequential Pattern Mining: Definition
• Given:
  – a database of sequences
  – a user-specified minimum support threshold, minsup
• Task:
  – Find all subsequences with support ≥ minsup

8 Sequential Pattern Mining: Challenge
• Given a sequence: [sequence not preserved in this transcript]
  – Examples of subsequences: [not preserved in this transcript]
• How many k-subsequences can be extracted from a given n-sequence?
  – Example with n = 9, k = 4, where Y marks an event that is kept and _ an event that is skipped: Y _ _ Y Y _ _ _ Y
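The counting question on this slide can be checked directly: choosing which k of the n events to keep gives C(n, k) selections, so n = 9 and k = 4 yield 126. The sketch below assumes a hypothetical 9-event sequence (the slide's own sequence is not preserved here) and a helper named `apply_mask`, both introduced only for illustration.

```python
from math import comb


def apply_mask(sequence, mask):
    """Keep the events whose mask character is 'Y'; drop elements left empty."""
    flags = iter(mask)
    result = []
    for element in sequence:
        kept = [event for event in element if next(flags) == "Y"]
        if kept:
            result.append(kept)
    return result


# Hypothetical sequence with 9 distinct events (assumed for this sketch).
sequence = [["a", "b"], ["c", "d", "e"], ["f"], ["g", "h", "i"]]
n = sum(len(element) for element in sequence)

print(comb(n, 4))                          # 126
print(apply_mask(sequence, "Y__YY___Y"))   # [['a'], ['d', 'e'], ['i']]
```

Because the count grows combinatorially with n, enumerating every subsequence of every data sequence is impractical, which motivates level-wise candidate generation.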

9 Sequential Pattern Mining: Example
Minsup = 50%
Examples of Frequent Subsequences (the subsequences themselves are not preserved in this transcript): s = 60%, s = 80%, s = 60%

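10 Extracting Sequential Patterns
• Given n events: i_1, i_2, i_3, …, i_n
• Candidate 1-subsequences: ⟨{i_1}⟩, ⟨{i_2}⟩, ⟨{i_3}⟩, …, ⟨{i_n}⟩
• Candidate 2-subsequences: ⟨{i_1, i_2}⟩, ⟨{i_1, i_3}⟩, …, ⟨{i_1} {i_1}⟩, ⟨{i_1} {i_2}⟩, …, ⟨{i_n} {i_n}⟩
• Candidate 3-subsequences: ⟨{i_1, i_2, i_3}⟩, ⟨{i_1, i_2, i_4}⟩, …, ⟨{i_1, i_2} {i_1}⟩, ⟨{i_1, i_2} {i_2}⟩, …, ⟨{i_1} {i_1, i_2}⟩, ⟨{i_1} {i_1, i_3}⟩, …, ⟨{i_1} {i_1} {i_1}⟩, ⟨{i_1} {i_1} {i_2}⟩, …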
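As a rough illustration of how quickly the candidate space grows, the sketch below enumerates all candidate 1- and 2-subsequences for a small event set. The function names and the tuple-of-tuples representation are assumptions made for this example. A 2-subsequence either places two distinct events in one element or places two events (possibly the same one) in two successive elements.

```python
from itertools import combinations, product


def candidate_1_subsequences(events):
    """One candidate per event: a single element holding a single event."""
    return [((e,),) for e in events]


def candidate_2_subsequences(events):
    """All candidates that contain exactly two events."""
    same_element = [((a, b),) for a, b in combinations(events, 2)]
    two_elements = [((a,), (b,)) for a, b in product(events, repeat=2)]
    return same_element + two_elements


events = ["i1", "i2", "i3"]
print(len(candidate_1_subsequences(events)))  # 3
print(len(candidate_2_subsequences(events)))  # 12 = C(3, 2) + 3 * 3
```

For n events this gives C(n, 2) + n² candidate 2-subsequences, compared with only C(n, 2) candidate 2-itemsets in ordinary association analysis.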

11 Generalized Sequential Pattern (GSP)
• Step 1:
  – Make the first pass over the sequence database D to yield all the 1-element frequent sequences
• Step 2: Repeat until no new frequent sequences are found
  – Candidate Generation: Merge pairs of frequent subsequences found in the (k-1)th pass to generate candidate sequences that contain k items
  – Support Counting: Make a new pass over the sequence database D to find the support for these candidate sequences
  – Candidate Elimination: Eliminate candidate k-sequences whose actual support is less than minsup
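The two steps above can be sketched as a small level-wise miner. This is a simplified sketch under stated assumptions, not the authors' implementation: sequences are tuples of sorted item tuples, all helper names are hypothetical, the toy database is invented, `minsup` is assumed to be greater than 0, and the usual extra pruning of candidates with an infrequent (k-1)-subsequence is omitted because the slide does not list it.

```python
from itertools import combinations


def contains(data_seq, pattern):
    """Greedy check that `pattern` is a subsequence of `data_seq`."""
    pos = 0
    for element in pattern:
        while pos < len(data_seq) and not set(element) <= set(data_seq[pos]):
            pos += 1
        if pos == len(data_seq):
            return False
        pos += 1
    return True


def support(db, pattern):
    return sum(contains(seq, pattern) for seq in db) / len(db)


def drop_first(seq):
    """Remove the first event of the first element."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]


def drop_last(seq):
    """Remove the last event of the last element."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())


def generate_candidates(frequent, k):
    if k == 2:  # special case: every pair of frequent events
        items = sorted(s[0][0] for s in frequent)
        cands = {((a,), (b,)) for a in items for b in items}
        return cands | {((a, b),) for a, b in combinations(items, 2)}
    cands = set()
    for s1 in frequent:
        for s2 in frequent:
            if drop_first(s1) == drop_last(s2):
                last = s2[-1][-1]
                if len(s2[-1]) == 1:   # last event forms its own element
                    cands.add(s1 + ((last,),))
                else:                  # last event joins s1's final element
                    cands.add(s1[:-1] + (s1[-1] + (last,),))
    return cands


def gsp(db, minsup):
    """Return {frequent sequence: support}; assumes minsup > 0."""
    items = sorted({i for seq in db for element in seq for i in element})
    frequent = {((i,),) for i in items if support(db, ((i,),)) >= minsup}
    result, k = {}, 1
    while frequent:
        result.update({p: support(db, p) for p in frequent})
        k += 1
        candidates = generate_candidates(frequent, k)
        frequent = {c for c in candidates if support(db, c) >= minsup}
    return result


# Toy database (assumed for this sketch).
db = [((1, 2), (3,)), ((1,), (3,)), ((1, 2), (2, 3)), ((2,), (3,))]
result = gsp(db, minsup=0.5)
for pattern in sorted(result, key=lambda p: (sum(map(len, p)), p)):
    print(pattern, result[pattern])
```

On this toy input the miner stops after the third pass, since the single frequent 3-sequence cannot be merged with itself.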

12 GSP Example

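13 Timing Constraints (I)
[Figure: pattern {A B} {C} {D E}; the gap between consecutive elements must be ≤ x_g and > n_g, and the whole pattern must span ≤ m_s]
x_g: max-gap
n_g: min-gap
m_s: maximum span
[Table for x_g = 2, n_g = 0, m_s = 4: Data sequence | Subsequence | Contain? The four example sequences are not preserved in this transcript; their answers are Yes, No, Yes, No.]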
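A containment check that honours the three constraints might look like the sketch below. It is an illustration under assumptions: the function names are hypothetical, each data element carries an explicit timestamp, the example sequences are invented, and element positions (1, 2, 3, …) stand in for timestamps. Greedy matching is no longer safe once gaps are bounded, so the sketch backtracks over all possible match positions.

```python
def contains_timed(data_seq, pattern, max_gap, min_gap, max_span):
    """data_seq: list of (timestamp, set of events) in increasing time order.
    pattern: list of sets. Returns True if some match satisfies
    min_gap < gap <= max_gap between consecutive matched elements
    and (last matched time - first matched time) <= max_span.
    """
    def search(p_idx, prev_time, start_time, from_idx):
        if p_idx == len(pattern):
            return True
        for j in range(from_idx, len(data_seq)):
            t, events = data_seq[j]
            if not pattern[p_idx] <= events:
                continue
            if prev_time is not None:
                gap = t - prev_time
                if gap <= min_gap or gap > max_gap:
                    continue
                if t - start_time > max_span:
                    continue
            first = t if start_time is None else start_time
            if search(p_idx + 1, t, first, j + 1):
                return True
        return False

    return search(0, None, None, 0)


def with_positions(seq):
    """Use 1-based element positions as timestamps (an assumption)."""
    return list(enumerate(seq, start=1))


limits = dict(max_gap=2, min_gap=0, max_span=4)
d1 = with_positions([{2, 4}, {3, 5, 6}, {4, 7}, {4, 5}, {8}])
d2 = with_positions([{1}, {2}, {3}, {4}, {5}])
d3 = with_positions([{1}, {2, 3}, {3, 4}, {4, 5}])
d4 = with_positions([{1, 2}, {3}, {2, 3}, {3, 4}, {2, 4}, {4, 5}])

print(contains_timed(d1, [{6}, {5}], **limits))       # True
print(contains_timed(d2, [{1}, {4}], **limits))       # False: gap of 3 exceeds max-gap
print(contains_timed(d3, [{2}, {3}, {5}], **limits))  # True
print(contains_timed(d4, [{1, 2}, {5}], **limits))    # False: gap of 5 exceeds max-gap
```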

14 Mining Sequential Patterns with Timing Constraints
• Approach 1:
  – Mine sequential patterns without timing constraints
  – Postprocess the discovered patterns
• Approach 2:
  – Modify GSP to directly prune candidates that violate timing constraints

15 Frequent Subgraph Mining
• Extend association rule mining to finding frequent subgraphs
• Useful for Web mining, computational chemistry, bioinformatics, spatial data sets, etc.

16 Graph Definitions

17 Representing Transactions as Graphs
• Each transaction is a clique of items
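To make the clique idea concrete, the short sketch below turns one transaction into the edge list of a complete graph over its items. The function name `transaction_to_clique` and the sample items are assumptions for illustration only.

```python
from itertools import combinations


def transaction_to_clique(transaction):
    """Return the edges of the complete graph over the transaction's items."""
    return list(combinations(sorted(transaction), 2))


print(transaction_to_clique({"A", "B", "C"}))
# [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

A transaction with m items therefore becomes a graph with m vertices and m(m-1)/2 edges.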

18 Representing Graphs as Transactions

19 Edge Growing

20 Apriori-like Algorithm
• Find frequent 1-subgraphs
• Repeat
  – Candidate generation: Use frequent (k-1)-subgraphs to generate candidate k-subgraphs
  – Candidate pruning: Prune candidate subgraphs that contain infrequent (k-1)-subgraphs
  – Support counting: Count the support of each remaining candidate
  – Eliminate candidate k-subgraphs that are infrequent
In practice, it is not as easy; there are many other issues.

21 Example: Dataset

22 Example

23 Adjacency Matrix Representation
The same graph can be represented in many ways
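A small example shows why one graph has many adjacency matrices: listing the vertices in a different order permutes the rows and columns. The graph (a three-vertex path a-b-c) and the helper `permute` are assumptions chosen for this sketch.

```python
def permute(adj, order):
    """Rewrite an adjacency matrix with its vertices listed in `order`."""
    return [[adj[i][j] for j in order] for i in order]


# Path a - b - c with vertices listed as (a, b, c).
adj = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

# The same path with vertices listed as (b, a, c).
print(permute(adj, [1, 0, 2]))  # [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
```

The two matrices differ entry by entry even though they describe the same graph, so comparing matrices directly cannot tell whether two candidates are the same subgraph.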

24 Graph Isomorphism
• Two graphs are isomorphic if they are topologically equivalent, i.e., there is a one-to-one mapping between their vertices that preserves adjacency

25 Graph Isomorphism
• Test for graph isomorphism is needed:
  – During the candidate generation step, to determine whether a candidate has already been generated
  – During the candidate pruning step, to check whether its (k-1)-subgraphs are frequent
  – During candidate counting, to check whether a candidate is contained within another graph
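A brute-force isomorphism test makes the definition concrete: try every vertex mapping and check whether adjacency is preserved. This is only a sketch for small, unlabeled graphs given as adjacency matrices. The function name is hypothetical and the factorial running time makes it unsuitable for real mining workloads.

```python
from itertools import permutations


def are_isomorphic(a, b):
    """Return True if some vertex relabeling maps matrix `a` onto matrix `b`."""
    n = len(a)
    if n != len(b):
        return False
    for perm in permutations(range(n)):
        if all(a[i][j] == b[perm[i]][perm[j]]
               for i in range(n) for j in range(n)):
            return True
    return False


path_abc = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # path with the middle vertex listed second
path_bac = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]   # same path, middle vertex listed first
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]

print(are_isomorphic(path_abc, path_bac))  # True
print(are_isomorphic(path_abc, triangle))  # False
```

Because the test is needed repeatedly in all three steps listed above, its cost is one of the main reasons frequent subgraph mining is harder than frequent itemset mining.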

