Download presentation

Presentation is loading. Please wait.

Published byHailey Tobin Modified over 4 years ago

1
Mining Sequence Data

2
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 2 Sequence Data ObjectTimestampEvents A102, 3, 5 A206, 1 A231 B114, 5, 6 B172 B217, 8, 1, 2 B281, 6 C141, 8, 7 Sequence Database:

3
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 3 Examples of Sequence Data Sequence Database SequenceElement (Transaction) Event (Item) CustomerPurchase history of a given customer A set of items bought by a customer at time t Books, diary products, CDs, etc Web DataBrowsing activity of a particular Web visitor A collection of files viewed by a Web visitor after a single mouse click Home page, index page, contact info, etc Event dataHistory of events generated by a given sensor Events triggered by a sensor at time t Types of alarms generated by sensors Genome sequences DNA sequence of a particular species An element of the DNA sequence Bases A,T,G,C Sequence E1 E2 E1 E3 E2 E3 E4 E2 Element (Transaction) Event (Item)

4
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 4 Formal Definition of a Sequence l A sequence is an ordered list of elements (transactions) s = –Each element contains a collection of events (items) e i = {i 1, i 2, …, i k } –Each element is attributed to a specific time or location l Length of a sequence, |s|, is given by the number of elements of the sequence l A k-sequence is a sequence that contains k events (items)

5
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 5 Examples of Sequence l Web sequence: l Sequence of initiating events causing the nuclear accident at 3-mile Island: (http://stellar-one.com/nuclear/staff_reports/summary_SOE_the_initiating_event.htm) l Sequence of books checked out at a library:

6
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 6 Formal Definition of a Subsequence l A sequence is contained in another sequence (m n) if there exist integers i 1 < i 2 < … < i n such that a 1 b i1, a 2 b i1, …, a n b in l The support of a subsequence w is defined as the fraction of data sequences that contain w l A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is minsup) Data sequenceSubsequenceContain? Yes No Yes

7
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 7 Sequential Pattern Mining: Definition l Given: –a database of sequences –a user-specified minimum support threshold, minsup l Task: –Find all subsequences with support minsup

8
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 8 Sequential Pattern Mining: Challenge l Given a sequence: –Examples of subsequences:,,, etc. l How many k-subsequences can be extracted from a given n-sequence? n = 9 k=4: Y _ _ Y Y _ _ _ Y

9
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 9 Sequential Pattern Mining: Example Minsup = 50% Examples of Frequent Subsequences: s=60% s=80% s=60%

10
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 10 Extracting Sequential Patterns l Given n events: i 1, i 2, i 3, …, i n l Candidate 1-subsequences:,,, …, l Candidate 2-subsequences:,, …,,, …, l Candidate 3-subsequences:,, …,,, …,,, …,,, …

11
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 11 Generalized Sequential Pattern (GSP) l Step 1: –Make the first pass over the sequence database D to yield all the 1- element frequent sequences l Step 2: Repeat until no new frequent sequences are found –Candidate Generation: Merge pairs of frequent subsequences found in the (k-1)th pass to generate candidate sequences that contain k items –Candidate Pruning: Prune candidate k-sequences that contain infrequent (k-1)-subsequences –Support Counting: Make a new pass over the sequence database D to find the support for these candidate sequences –Candidate Elimination: Eliminate candidate k-sequences whose actual support is less than minsup

12
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 12 Candidate Generation l Base case (k=2): –Merging two frequent 1-sequences and will produce two candidate 2-sequences: and l General case (k>2): –A frequent (k-1)-sequence w 1 is merged with another frequent (k-1)-sequence w 2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w 1 is the same as the subsequence obtained by removing the last event in w 2 The resulting candidate after merging is given by the sequence w 1 extended with the last event of w 2. –If the last two events in w 2 belong to the same element, then the last event in w 2 becomes part of the last element in w 1 –Otherwise, the last event in w 2 becomes a separate element appended to the end of w 1

13
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 13 Candidate Generation Examples l Merging the sequences w 1 = and w 2 = will produce the candidate sequence because the last two events in w 2 (4 and 5) belong to the same element l Merging the sequences w 1 = and w 2 = will produce the candidate sequence because the last two events in w 2 (4 and 5) do not belong to the same element l We do not have to merge the sequences w 1 = and w 2 = to produce the candidate because if the latter is a viable candidate, then it can be obtained by merging w 1 with

14
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 14 GSP Example

15
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 15 Timing Constraints (I) {A B} {C} {D E} <= m s <= x g >n g x g : max-gap n g : min-gap m s : maximum span Data sequenceSubsequenceContain? Yes No Yes No x g = 2, n g = 0, m s = 4

16
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 16 Mining Sequential Patterns with Timing Constraints l Approach 1: –Mine sequential patterns without timing constraints –Postprocess the discovered patterns l Approach 2: –Modify GSP to directly prune candidates that violate timing constraints –Question: Does Apriori principle still hold?

17
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 17 Apriori Principle for Sequence Data Suppose: x g = 1 (max-gap) n g = 0 (min-gap) m s = 5 (maximum span) minsup = 60% support = 40% but support = 60% Problem exists because of max-gap constraint No such problem if max-gap is infinite

18
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 18 Contiguous Subsequences l s is a contiguous subsequence of w = … if any of the following conditions hold: 1.s is obtained from w by deleting an item from either e 1 or e k 2.s is obtained from w by deleting an item from any element e i that contains more than 2 items 3.s is a contiguous subsequence of s and s is a contiguous subsequence of w (recursive definition) l Examples: s = –is a contiguous subsequence of,, and –is not a contiguous subsequence of and

19
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 19 Modified Candidate Pruning Step l Without maxgap constraint: –A candidate k-sequence is pruned if at least one of its (k-1)-subsequences is infrequent l With maxgap constraint: –A candidate k-sequence is pruned if at least one of its contiguous (k-1)-subsequences is infrequent

20
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 20 Timing Constraints (II) {A B} {C} {D E} <= m s <= x g >n g <= ws x g : max-gap n g : min-gap ws: window size m s : maximum span Data sequenceSubsequenceContain? No Yes Yes x g = 2, n g = 0, ws = 1, m s = 5

21
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 21 Modified Support Counting Step l Given a candidate pattern: –Any data sequences that contain, ( where time({c}) – time({a}) ws) (where time({a}) – time({c}) ws) will contribute to the support count of candidate pattern

Similar presentations

Presentation is loading. Please wait....

OK

Addition 1’s to 20.

Addition 1’s to 20.

© 2018 SlidePlayer.com Inc.

All rights reserved.

To make this website work, we log user data and share it with processors. To use this website, you must agree to our Privacy Policy, including cookie policy.

Ads by Google