Presentation is loading. Please wait.

Presentation is loading. Please wait.

Mining Sequence Data. © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 2 Sequence Data ObjectTimestampEvents A102, 3, 5 A206, 1 A231 B114,

Similar presentations


Presentation on theme: "Mining Sequence Data. © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 2 Sequence Data ObjectTimestampEvents A102, 3, 5 A206, 1 A231 B114,"— Presentation transcript:

1 Mining Sequence Data

2 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Sequence Data ObjectTimestampEvents A102, 3, 5 A206, 1 A231 B114, 5, 6 B172 B217, 8, 1, 2 B281, 6 C141, 8, 7 Sequence Database:

3 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Examples of Sequence Data Sequence Database SequenceElement (Transaction) Event (Item) CustomerPurchase history of a given customer A set of items bought by a customer at time t Books, diary products, CDs, etc Web DataBrowsing activity of a particular Web visitor A collection of files viewed by a Web visitor after a single mouse click Home page, index page, contact info, etc Event dataHistory of events generated by a given sensor Events triggered by a sensor at time t Types of alarms generated by sensors Genome sequences DNA sequence of a particular species An element of the DNA sequence Bases A,T,G,C Sequence E1 E2 E1 E3 E2 E3 E4 E2 Element (Transaction) Event (Item)

4 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Formal Definition of a Sequence l A sequence is an ordered list of elements (transactions) s = –Each element contains a collection of events (items) e i = {i 1, i 2, …, i k } –Each element is attributed to a specific time or location l Length of a sequence, |s|, is given by the number of elements of the sequence l A k-sequence is a sequence that contains k events (items)

5 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Examples of Sequence l Web sequence: l Sequence of initiating events causing the nuclear accident at 3-mile Island: (http://stellar-one.com/nuclear/staff_reports/summary_SOE_the_initiating_event.htm) l Sequence of books checked out at a library:

6 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Formal Definition of a Subsequence l A sequence is contained in another sequence (m n) if there exist integers i 1 < i 2 < … < i n such that a 1 b i1, a 2 b i1, …, a n b in l The support of a subsequence w is defined as the fraction of data sequences that contain w l A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is minsup) Data sequenceSubsequenceContain? Yes No Yes

7 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Sequential Pattern Mining: Definition l Given: –a database of sequences –a user-specified minimum support threshold, minsup l Task: –Find all subsequences with support minsup

8 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Sequential Pattern Mining: Challenge l Given a sequence: –Examples of subsequences:,,, etc. l How many k-subsequences can be extracted from a given n-sequence? n = 9 k=4: Y _ _ Y Y _ _ _ Y

9 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Sequential Pattern Mining: Example Minsup = 50% Examples of Frequent Subsequences: s=60% s=80% s=60%

10 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Extracting Sequential Patterns l Given n events: i 1, i 2, i 3, …, i n l Candidate 1-subsequences:,,, …, l Candidate 2-subsequences:,, …,,, …, l Candidate 3-subsequences:,, …,,, …,,, …,,, …

11 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Generalized Sequential Pattern (GSP) l Step 1: –Make the first pass over the sequence database D to yield all the 1- element frequent sequences l Step 2: Repeat until no new frequent sequences are found –Candidate Generation: Merge pairs of frequent subsequences found in the (k-1)th pass to generate candidate sequences that contain k items –Candidate Pruning: Prune candidate k-sequences that contain infrequent (k-1)-subsequences –Support Counting: Make a new pass over the sequence database D to find the support for these candidate sequences –Candidate Elimination: Eliminate candidate k-sequences whose actual support is less than minsup

12 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Candidate Generation l Base case (k=2): –Merging two frequent 1-sequences and will produce two candidate 2-sequences: and l General case (k>2): –A frequent (k-1)-sequence w 1 is merged with another frequent (k-1)-sequence w 2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w 1 is the same as the subsequence obtained by removing the last event in w 2 The resulting candidate after merging is given by the sequence w 1 extended with the last event of w 2. –If the last two events in w 2 belong to the same element, then the last event in w 2 becomes part of the last element in w 1 –Otherwise, the last event in w 2 becomes a separate element appended to the end of w 1

13 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Candidate Generation Examples l Merging the sequences w 1 = and w 2 = will produce the candidate sequence because the last two events in w 2 (4 and 5) belong to the same element l Merging the sequences w 1 = and w 2 = will produce the candidate sequence because the last two events in w 2 (4 and 5) do not belong to the same element l We do not have to merge the sequences w 1 = and w 2 = to produce the candidate because if the latter is a viable candidate, then it can be obtained by merging w 1 with

14 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ GSP Example

15 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Timing Constraints (I) {A B} {C} {D E} <= m s <= x g >n g x g : max-gap n g : min-gap m s : maximum span Data sequenceSubsequenceContain? Yes No Yes No x g = 2, n g = 0, m s = 4

16 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Mining Sequential Patterns with Timing Constraints l Approach 1: –Mine sequential patterns without timing constraints –Postprocess the discovered patterns l Approach 2: –Modify GSP to directly prune candidates that violate timing constraints –Question: Does Apriori principle still hold?

17 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Apriori Principle for Sequence Data Suppose: x g = 1 (max-gap) n g = 0 (min-gap) m s = 5 (maximum span) minsup = 60% support = 40% but support = 60% Problem exists because of max-gap constraint No such problem if max-gap is infinite

18 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Contiguous Subsequences l s is a contiguous subsequence of w = … if any of the following conditions hold: 1.s is obtained from w by deleting an item from either e 1 or e k 2.s is obtained from w by deleting an item from any element e i that contains more than 2 items 3.s is a contiguous subsequence of s and s is a contiguous subsequence of w (recursive definition) l Examples: s = –is a contiguous subsequence of,, and –is not a contiguous subsequence of and

19 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Modified Candidate Pruning Step l Without maxgap constraint: –A candidate k-sequence is pruned if at least one of its (k-1)-subsequences is infrequent l With maxgap constraint: –A candidate k-sequence is pruned if at least one of its contiguous (k-1)-subsequences is infrequent

20 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Timing Constraints (II) {A B} {C} {D E} <= m s <= x g >n g <= ws x g : max-gap n g : min-gap ws: window size m s : maximum span Data sequenceSubsequenceContain? No Yes Yes x g = 2, n g = 0, ws = 1, m s = 5

21 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Modified Support Counting Step l Given a candidate pattern: –Any data sequences that contain, ( where time({c}) – time({a}) ws) (where time({a}) – time({c}) ws) will contribute to the support count of candidate pattern


Download ppt "Mining Sequence Data. © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 2 Sequence Data ObjectTimestampEvents A102, 3, 5 A206, 1 A231 B114,"

Similar presentations


Ads by Google