Mining Time-Series Databases, Mohamed G. Elfeky


1 Mining Time-Series Databases Mohamed G. Elfeky

2 Introduction: A time-series database is a database that contains data for each point in time. Examples: weather data, stock prices.

3 What to Mine? Full periodic patterns: every point in time contributes to the cyclic behavior of the time series in each period, e.g., describing the weekly stock-price pattern over all the days of the week. Partial periodic patterns: these describe the behavior of the time series at some but not all points in time, e.g., discovering that stock prices are high every Saturday and low every Tuesday.

4 Mining Partial Periodic Patterns: Problem Definition; Methods: Apriori, Max-Subpattern Hit Set (Jiawei Han, Guozhu Dong, and Yiwen Yin, ICDE98)

5 Problem Definition: The time series is $S = D_1 D_2 \ldots D_n$. A pattern is $s = s_1 \ldots s_p$ over the set of features $L$ and the letter *. $|s| = p$ is the period of the pattern $s$. The L-length of $s$ is the number of $s_i$ that are not *. If $s$ has L-length $j$, it is called a $j$-pattern. A subpattern of $s$ is a pattern $s' = s'_1 \ldots s'_p$ such that, for each position $i$, $s'_i$ is either * or a subset of $s_i$.

6 Problem Definition (Cont.): Each segment of the form $D_{i|s|+1} \ldots D_{i|s|+|s|}$, with $i \ge 0$ and $i|s|+|s| \le n$, is called a period segment. A period segment matches $s$ if, for each position $j$, $s_j$ is either * or a subset of $D_{i|s|+j}$. The frequency count of $s$ in a time series $S$ is the number of period segments of $S$ that match $s$. The confidence of $s$ is its frequency count divided by the maximum number of periods of length $|s|$ in $S$, i.e., $\lfloor n/|s| \rfloor$. A pattern is called frequent if its confidence is not less than a minimum threshold.

7 Problem Definition (Example): The pattern a*{a,c}de has length 5 and L-length 4, so it is a 4-pattern. The patterns a*{a,c}** and **cde are subpatterns of it. In the series a{b,c}baebaced, the pattern a*b, whose period is 3, has frequency count 2: it matches the period segments a{b,c}b and aeb but not ace. Its confidence is 2/3, where 3 is the maximum number of periods of length 3 in the series.
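The definitions above can be checked mechanically. The following is a minimal Python sketch, not code from the presentation: it assumes a pattern or series is encoded as a tuple of frozensets, with the empty set standing for *, and the names `parse`, `l_length`, `matches`, and `frequency_and_confidence` are hypothetical helpers introduced only for illustration.

```python
from fractions import Fraction

STAR = frozenset()  # assumed encoding: the empty set stands for the letter *


def parse(text):
    """Turn a string such as 'a{b,c}*d' into a tuple of frozensets."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "{":
            j = text.index("}", i)
            out.append(frozenset(text[i + 1:j].split(",")))
            i = j + 1
        else:
            out.append(STAR if text[i] == "*" else frozenset(text[i]))
            i += 1
    return tuple(out)


def l_length(pattern):
    """Number of positions that are not *."""
    return sum(1 for s in pattern if s)


def matches(pattern, segment):
    """A segment matches if every s_j is * or a subset of the segment's j-th set."""
    return all(s <= d for s, d in zip(pattern, segment))


def frequency_and_confidence(pattern, series):
    p = len(pattern)
    m = len(series) // p  # maximum number of periods of length p
    if m == 0:
        raise ValueError("series is shorter than one period")
    count = sum(matches(pattern, series[i * p:(i + 1) * p]) for i in range(m))
    return count, Fraction(count, m)


series = parse("a{b,c}baebaced")
print(l_length(parse("a*{a,c}de")))                    # 4
print(frequency_and_confidence(parse("a*b"), series))  # (2, Fraction(2, 3))
```

Because the empty set is a subset of every set, the * positions need no special case in `matches`.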

8 Apriori Method: Apriori property: each subpattern of a frequent pattern of period p is itself a frequent pattern of period p. Method: 1. Find $F_1$, the set of frequent 1-patterns of period p. 2. Find all frequent i-patterns of period p, for i from 2 to p, using the Apriori idea of building candidate i-patterns from the frequent (i-1)-patterns, and terminate when the candidate i-pattern set is empty.
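A level-wise search of this kind might be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: each pattern is encoded as a set of (position, letter) pairs, levels are counted in letters rather than in non-* positions (a simplification), and `apriori_periodic` is a hypothetical name.

```python
from fractions import Fraction
from itertools import combinations


def apriori_periodic(series, p, min_conf):
    """Return all frequent patterns of period p as sets of (position, letter) pairs."""
    m = len(series) // p
    if m == 0:
        raise ValueError("series is shorter than one period")
    # Each period segment becomes the set of (position, letter) items it contains.
    segs = [
        frozenset((j, x) for j, d in enumerate(series[i * p:(i + 1) * p]) for x in d)
        for i in range(m)
    ]

    def frequent(candidates):
        return {c for c in candidates
                if Fraction(sum(c <= s for s in segs), m) >= min_conf}

    # Level 1: frequent single letters at a fixed position.
    level = frequent({frozenset([item]) for s in segs for item in s})
    result, k = set(level), 2
    while level:
        # Join step: unions of frequent (k-1)-item patterns that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-item subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in level for sub in combinations(c, k - 1))}
        level = frequent(candidates)
        result |= level
        k += 1
    return result


# The series a{b,c}baebaced with period 3 and a minimum confidence of 2/3.
series = [{"a"}, {"b", "c"}, {"b"}, {"a"}, {"e"}, {"b"}, {"a"}, {"c"}, {"e"}, {"d"}]
found = apriori_periodic(series, 3, Fraction(2, 3))
```

On this small series the search finds a**, *c*, **b, ac*, and a*b; the candidate acb is pruned because its subpattern *cb is not frequent.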

9 Max-Subpattern Hit Set Method: Definitions; Algorithm; Implementation Data Structure

10 Definitions: A candidate max-pattern $C_{max}$ is the maximal pattern that can be generated from $F_1$ (the set of frequent 1-patterns). Example: if $F_1$ = {a***, *b**, *c**, **d*}, then $C_{max}$ = a{b,c}d*.

11 Definitions (Cont.): A subpattern of $C_{max}$ is hit in a period segment $S_i$ if it is the maximal subpattern of $C_{max}$ that $S_i$ matches. Example: for $C_{max}$ = a{b,c}d* and $S_i$ = a{b,c}ce, the hit subpattern is a{b,c}**. The hit set $H$ is the set of all hit subpatterns of $C_{max}$ in $S$.
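Both definitions reduce to simple set operations. The sketch below is an illustration only: it assumes $F_1$ is given as (position, letter) pairs with 0-based positions, and `candidate_max_pattern`, `hit_subpattern`, and `show` are hypothetical helper names.

```python
def candidate_max_pattern(f1, p):
    """Merge the frequent 1-patterns, given as (position, letter) pairs, into C_max."""
    cmax = [set() for _ in range(p)]
    for pos, letter in f1:
        cmax[pos].add(letter)
    return tuple(frozenset(s) for s in cmax)


def hit_subpattern(cmax, segment):
    """Maximal subpattern of C_max matched by the segment: a position-wise intersection."""
    return tuple(c & frozenset(d) for c, d in zip(cmax, segment))


def show(pattern):
    """Render a pattern in the slides' notation; an empty position prints as *."""
    parts = []
    for s in pattern:
        if not s:
            parts.append("*")
        elif len(s) == 1:
            parts.append(next(iter(s)))
        else:
            parts.append("{" + ",".join(sorted(s)) + "}")
    return "".join(parts)


f1 = [(0, "a"), (1, "b"), (1, "c"), (2, "d")]  # a***, *b**, *c**, **d*
cmax = candidate_max_pattern(f1, 4)
segment = [{"a"}, {"b", "c"}, {"c"}, {"e"}]    # the period segment a{b,c}ce
print(show(cmax))                              # a{b,c}d*
print(show(hit_subpattern(cmax, segment)))     # a{b,c}**
```

The third position of the segment holds c while $C_{max}$ holds d there, so the intersection is empty and that position becomes *.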

12 Algorithm: 1. Scan $S$ once to find $F_1$ and form the candidate max-pattern $C_{max}$. 2. Scan $S$ again and, for each period segment, add its max-subpattern to the hit set with count 1 if it is not already there, or otherwise increase its count by 1. 3. Derive the frequent patterns from the hit set.
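The first two steps can be sketched in a few lines of Python. This is a hedged illustration rather than the published algorithm's code: the hit set is kept in a plain `Counter` instead of a tree, segments that hit nothing (an all-* subpattern) are skipped, and `build_hit_set` is a hypothetical name.

```python
from collections import Counter
from fractions import Fraction


def build_hit_set(series, p, min_conf):
    """Two scans over the series: find F_1 and C_max, then count the hit subpatterns."""
    m = len(series) // p
    if m == 0:
        raise ValueError("series is shorter than one period")
    segments = [series[i * p:(i + 1) * p] for i in range(m)]

    # Scan 1: count each (position, letter) pair and keep the frequent ones as F_1.
    letter_counts = Counter(
        (j, x) for seg in segments for j, d in enumerate(seg) for x in set(d)
    )
    f1 = {item for item, c in letter_counts.items() if Fraction(c, m) >= min_conf}
    cmax = tuple(frozenset(x for j, x in f1 if j == pos) for pos in range(p))

    # Scan 2: the hit of a segment is its position-wise intersection with C_max.
    hits = Counter()
    for seg in segments:
        hit = tuple(c & frozenset(d) for c, d in zip(cmax, seg))
        if any(hit):  # skip segments whose hit is all *
            hits[hit] += 1
    return cmax, hits


# The series a{b,c}baebaced with period 3 and a minimum confidence of 2/3.
series = [{"a"}, {"b", "c"}, {"b"}, {"a"}, {"e"}, {"b"}, {"a"}, {"c"}, {"e"}, {"d"}]
cmax, hits = build_hit_set(series, 3, Fraction(2, 3))
```

Here $C_{max}$ comes out as acb, and the three period segments hit acb, a*b, and ac* once each.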

13 Implementation Data Structure: Max-Subpattern Tree. The root node is $C_{max}$. A child node is a subpattern of its parent node with one non-* letter missing; the link between them is labeled with this letter. A node containing only 2 non-* letters has no children, since its children would be 1-patterns, which are already counted in $F_1$. Each node has a count field that records its number of hits.

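14 Max-Subpattern Tree (Example) [Figure: max-subpattern tree with hit counts, read left to right from the flattened figure. Root: a{b,c}d* (10). Level 2: *{b,c}d* (0), acd* (50), abd* (40), a{b,c}** (32). Level 3: *cd* (18), *bd* (8), a*d* (5), ab** (19), ac** (2). Each link is labeled with the letter missing from the child.]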

15 Max-Subpattern Tree (Construction): Find w, the max-subpattern of $C_{max}$ in the current segment. Search for w in the tree, starting from the root and following the path that corresponds to the missing non-* letters in order. If the node w is found, increase its count by 1. Otherwise, create a new node w (with count 1) together with any missing ancestors on the followed path (with count 0).
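The insertion procedure could look roughly like this. It is a minimal sketch under assumptions not stated in the slides: patterns are sets of (position, letter) pairs, "in order" is taken to mean sorted by position and then letter, and the `Node` class and `insert` function are hypothetical names.

```python
class Node:
    """One node of the max-subpattern tree."""

    def __init__(self, items):
        self.items = items   # frozenset of (position, letter) pairs still present
        self.count = 0       # number of hits recorded for exactly this pattern
        self.children = {}   # missing (position, letter) -> child Node


def insert(root, w):
    """Record one hit of subpattern w, creating missing ancestors with count 0."""
    if not w <= root.items:
        raise ValueError("w must be a subpattern of the root pattern")
    node = root
    # Follow (or create) one link per letter of C_max that is missing from w.
    for missing in sorted(root.items - w):
        child = node.children.get(missing)
        if child is None:
            child = Node(node.items - {missing})
            node.children[missing] = child
        node = child
    node.count += 1
    return node


# C_max = a{b,c}d*; insert the max-subpattern *cd* into an empty tree.
root = Node(frozenset({(0, "a"), (1, "b"), (1, "c"), (2, "d")}))
leaf = insert(root, frozenset({(1, "c"), (2, "d")}))
```

After this call the root and the intermediate node *{b,c}d* both have count 0, and the new node *cd* has count 1. Fixing the order of the missing letters means each subpattern is reached by exactly one path, so it is stored only once.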

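16 Max-Subpattern Tree (Construction) [Figure: inserting the max-subpattern *cd* into an empty tree. Root a{b,c}d* (count 0), link a to *{b,c}d* (count 0), link b to *cd* (count 1).]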

17 Max-Subpattern Tree (Traversal): After the second scan, the tree contains all the max-subpatterns hit in the time series. The tree must then be traversed to compute the frequency count, and hence the confidence value, of each subpattern.

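18 Max-Subpattern Tree (Traversal): The frequency count of each node is the sum of its own count and the counts of all its reachable ancestors, i.e., the nodes whose patterns contain it as a subpattern. For example, in the tree of slide 14, the frequency count of *cd* is 78 (18 + 0 + 50 + 10, from *cd*, *{b,c}d*, acd*, and the root), and the frequency count of a*d* is 105 (5 + 50 + 40 + 10, from a*d*, acd*, abd*, and the root).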
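The two totals can be reproduced with a short Python sketch. It makes two assumptions: the hit counts are the ones read left to right from the flattened example figure, and the tree is replaced by a flat dictionary, so "reachable ancestors" becomes "every recorded pattern that contains the query as a subpattern". The names `items` and `frequency_count` are hypothetical.

```python
def items(text):
    """Parse a pattern such as 'a{b,c}d*' into a set of (position, letter) pairs."""
    out, pos, i = set(), 0, 0
    while i < len(text):
        if text[i] == "{":
            j = text.index("}", i)
            out.update((pos, x) for x in text[i + 1:j].split(","))
            i = j + 1
        else:
            if text[i] != "*":
                out.add((pos, text[i]))
            i += 1
        pos += 1
    return frozenset(out)


# Hit counts as read from the example tree (an assumption about the garbled figure).
hit_counts = {
    "a{b,c}d*": 10,
    "*{b,c}d*": 0, "acd*": 50, "abd*": 40, "a{b,c}**": 32,
    "*cd*": 18, "*bd*": 8, "a*d*": 5, "ab**": 19, "ac**": 2,
}


def frequency_count(pattern, hits):
    """Own hit count plus the counts of every recorded superpattern."""
    wanted = items(pattern)
    return sum(count for hit, count in hits.items() if wanted <= items(hit))


print(frequency_count("*cd*", hit_counts))  # 78
print(frequency_count("a*d*", hit_counts))  # 105
```

Dividing such a count by the number of period segments gives the confidence that is compared with the minimum threshold.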

19 Max-Subpattern Tree (Example) [Figure: the same max-subpattern tree and hit counts as on slide 14.]

