
1 Subsequence Matching in Time Series Databases Xiaojin Xu 04-25-2006

2 Papers
Online Event-driven Subsequence Matching over Financial Data Streams
–Huanmei Wu, Betty Salzberg, Donghui Zhang
Fast Subsequence Matching in Time-Series Databases
–C. Faloutsos, M. Ranganathan, Y. Manolopoulos

3 Challenges of Subsequence Matching over Financial Data Streams
Existing techniques of subsequence matching
–Mainly focus on discovering the similarity between an online query subsequence and a traditional database
–The queried data are static
Subsequence similarity over financial data streams
–Data change constantly, so a single-pass search is required
–Movement can be predicted by observing a repetitive pattern of waves (zigzag shapes)
–The relative position of the upper and lower end points is important in subsequence similarity
–Subsequence similarity should be flexible with regard to time shifting and scaling, amplitude rescaling, etc.

4 Our online event-driven subsequence matching meets the requirements of financial data analysis
The database is a dynamic stream database that stores recent financial data
3-tier online segmentation and pruning
Similarity measure: a distance function defined over a permutation of the subsequence
Event-driven matching over an up-to-date database: a query is carried out only when there is a new end point
A new definition of trend for financial data streams

5 Processing the Online Data Stream
Translating massive data streams into manageable data for the database before matching:
Aggregation and smoothing
Piecewise linear representation
Online segmentation and pruning

6 Aggregation and Smoothing
One unique value for each time instance over a fixed time interval
Use a p-interval moving average to filter out noise and generate a clean trend signal:
–MA_p(i) = (X(i) + X(i−1) + … + X(i−p+1)) / p
–X(i) is the value for i = 1, 2, ..., n
–n is the number of periods
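A minimal sketch of the p-interval moving average above (the function name and the shrinking window near the start of the series are my assumptions, not from the slides):

    # Sketch: p-interval moving average over the aggregated values X(1..n).
    def moving_average(x, p):
        out = []
        for i in range(len(x)):
            window = x[max(0, i - p + 1): i + 1]  # last p values (fewer near the start)
            out.append(sum(window) / len(window))
        return out

    # Example: smooth a short price series with p = 3.
    print(moving_average([10.0, 10.4, 9.9, 10.6, 10.2], 3))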

7 Piecewise Linear Representation (PLR)
Segment over Bollinger Band Percent (%b)
%b indicator:
–middle_band = p-period moving average
–upper_band = middle_band + 2 * p-period standard deviation
–lower_band = middle_band − 2 * p-period standard deviation
–%b = (close price − lower_band) / (upper_band − lower_band)
Advantages of the %b indicator:
–Smoothed moving trend similar to the price movement
–Normalized value of the real price
–Sensitive to price change
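A sketch of the %b computation from the formulas above (computing from the last p closes and the guard for a flat window are my assumptions):

    import statistics

    def percent_b(closes, p):
        # %b of the most recent close from a p-period Bollinger Band
        window = closes[-p:]
        middle = sum(window) / len(window)           # p-period moving average
        sd = statistics.pstdev(window)               # p-period standard deviation
        upper = middle + 2 * sd
        lower = middle - 2 * sd
        if upper == lower:                           # flat window: bands collapse
            return 0.5
        return (closes[-1] - lower) / (upper - lower)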

8 Segmentation
Use a sliding window which:
–Contains at most m points
–Begins after the last identified end point and ends right before the current point
–Contains only the last m points if there are more than m points
Segmentation over %b finds a possible upper or lower end point in the current sliding window.
If the current point is P_j(X_j, t_j), the upper point P_i(X_i, t_i) is a point in the sliding window that satisfies:
1. X_i = max(X values of the current sliding window)
2. X_i > X_j + δ (δ is the given error threshold)
3. P_i(X_i, t_i) is the last point satisfying the above two conditions
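The three conditions translate directly into code; a sketch (representing the window as (X, t) pairs is an assumption):

    def find_upper_end_point(window, x_j, delta):
        # window: list of (X, t) pairs in the current sliding window
        # x_j: X value of the current point; delta: error threshold
        x_max = max(x for x, _ in window)            # condition 1
        if x_max <= x_j + delta:                     # condition 2 fails
            return None
        for x, t in reversed(window):                # condition 3: the LAST such point
            if x == x_max:
                return (x, t)

A lower end point is symmetric: the minimum X in the window, with X_i < X_j − δ.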

9 Segmentation (Cont’d)

10 Pruning
Purpose — smoothing over recently identified end points
Two steps:
–Filter: pruning on %b
–Refinement: pruning on the raw data stream
Pruning rule — if the absolute difference of the %b (or raw data) values of two adjacent end points is less than a certain threshold, that line segment is removed.

11 Pruning (Cont’d)

12 Online segmentation and pruning
Whenever an upper/lower point is identified, the previous line segment is checked for pruning:
–First check the need for pruning on %b
–If pruning on %b occurs, no pruning on raw data is done; the system waits for the next stream data to come in
–If no pruning on %b is done, the same line segment is checked for pruning on raw data
Which point to keep after pruning? (See the sketch below.)
–Compare the last end point with the third-last end point. If they are upper points, the one with the larger value is kept; otherwise, keep the point with the smaller value.
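A sketch of the keep-which-point rule (the list layout, with end point types alternating so the last and third-last points share a type, is an assumption):

    def point_to_keep(end_points, is_upper):
        # end_points: chronological (X, t) pairs; last and third-last share a type
        last, third_last = end_points[-1], end_points[-3]
        if is_upper:                                 # upper points: keep the larger X
            return last if last[0] >= third_last[0] else third_last
        return last if last[0] <= third_last[0] else third_last  # lower: smaller X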

13 Online segmentation and pruning (Cont’d)

14 Online segmentation and pruning
Strategy for identifying end points:
–A smaller threshold δ_s for segmentation over %b, to ensure sensitivity and reduce delay
–A larger threshold δ_pb for pruning over %b, to filter out noise
–A separate δ_pd for pruning over the raw stream data
Online segmentation and pruning run simultaneously.
At most three end points need to be kept for the segmentation and pruning procedure.
All fixed end points are updated into the database in real time.

15 Permutation
Subsequence matching
–Find the subsequences of end points that are similar to the query subsequence
Permutation
–Stream of end points S = {(X_1, t_1), (X_2, t_2), …, (X_n, t_n)}; divide it into two subsets of upper and lower end points respectively to get S′
–S′ = {[(X_1, t_1), (X_3, t_3), …, (X_{n−1}, t_{n−1})], [(X_2, t_2), (X_4, t_4), …, (X_n, t_n)]}; sort the X values of each subset to get S″
–S″ = {[X_{i1}, X_{i3}, …, X_{in−1}], [X_{i2}, X_{i4}, …, X_{in}]} where X_{i1} ≤ X_{i3} ≤ … ≤ X_{in−1} and X_{i2} ≤ X_{i4} ≤ … ≤ X_{in}
–{i_1, i_3, …, i_{n−1}, i_2, i_4, …, i_n} is the permutation of S
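A sketch of the permutation computation (I assume upper points sit at positions 1, 3, … and lower points at 2, 4, …; in an actual stream this depends on which end point type comes first):

    def permutation(end_points):
        # end_points: [(X_1, t_1), ..., (X_n, t_n)], positions 1-based conceptually
        upper = [(x, i + 1) for i, (x, _) in enumerate(end_points) if i % 2 == 0]
        lower = [(x, i + 1) for i, (x, _) in enumerate(end_points) if i % 2 == 1]
        upper.sort()                                 # sort each subset by X value
        lower.sort()
        return [i for _, i in upper] + [i for _, i in lower]

Subsequences are compared by this index list rather than by raw values, which is what buys invariance to time scaling and amplitude rescaling.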

16 Subsequence Similarity
Definition: S = {(X_1, t_1), (X_2, t_2), …, (X_n, t_n)} and S′ = {(X_1′, t_1′), (X_2′, t_2′), …, (X_n′, t_n′)} are similar if two conditions are satisfied:
(1) S and S′ have the same permutation
(2) d(S, S′) < γ, where d is a distance function weighted by α and β, and α, β, γ ≥ 0 are user-defined parameters
Permutation provides flexibility with regard to time scaling and amplitude rescaling.

17 Event-driven subsequence match
Stream data are massive and real-time; doing a similarity search only after a fixed time period may lose potentially important information.
Event — a new potential end point is identified and no pruning is needed.
Event-driven subsequence match:
–Performs the subsequence similarity search automatically only when there is a new event
–The generated query subsequence consists of the most recent n fixed and potential end points
Advantage: reduces the huge computation burden while maintaining sensitivity to changes.

18 Application - Trend Prediction
Trend of an end point: the tendency of the raw stream k end points after the current end point E. Let E_k denote that end point (ε is a user-defined parameter):
–If E_k.X ≥ E.X + ε, E.trend = UP
–If E_k.X ≤ E.X − ε, E.trend = DOWN
–If E.X − ε < E_k.X < E.X + ε, E.trend = NOTREND
–If E_k does not exist, E.trend = UNDEFINED
Predicting the trend of a query event: the subsequence similarity search returns a list of retrieved end points. Let
F(D) = (# of retrieved end points with trend D) / (total # of retrieved end points) × 100%
if |F(UP) − F(DOWN)| < F(NOTREND) + λ, predict NOTREND;
else if F(UP) > F(DOWN), predict UP;
else predict DOWN
(λ is a user-defined threshold)
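The voting rule above as a sketch (string trend labels and λ expressed in percentage points are assumptions; it assumes at least one end point was retrieved):

    def predict_trend(trends, lam):
        # trends: 'UP' / 'DOWN' / 'NOTREND' labels of the retrieved end points
        n = len(trends)
        F = {d: 100.0 * trends.count(d) / n for d in ('UP', 'DOWN', 'NOTREND')}
        if abs(F['UP'] - F['DOWN']) < F['NOTREND'] + lam:
            return 'NOTREND'
        return 'UP' if F['UP'] > F['DOWN'] else 'DOWN'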

19 Conclusion
The online simultaneous segmentation and pruning algorithm for PLR achieves quick identification of new end points while maintaining accurate segmentation.
The new similarity measure of a permutation plus a distance function performs better than measures based on Euclidean distance.
Experiments demonstrated that event-driven search outperformed searches with any fixed time period.

20 Fast Subsequence Matching in Time-Series Databases
Whole matching
–Given N data sequences S_1, S_2, …, S_N and a query sequence Q, find those sequences that are within distance ε from Q. S_i and Q have the same length.
Subsequence matching
–Given N data sequences S_1, S_2, …, S_N of arbitrary lengths, a query sequence Q, and a tolerance ε, find the data sequences S_i that contain matching subsequences (within distance ε from Q).

21 Whole matching
Use a distance-preserving transform (e.g. DFT) to extract f features from the sequences.
Map the f features into points in the f-dimensional feature space.
Use a spatial access method (e.g. R*-tree) to search for range/approximate queries.
Precondition: data sequences and query sequences all have the same length.
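A sketch of the feature-extraction step with the DFT (the unitary 1/√n scaling, which makes the transform distance-preserving by Parseval's theorem, is the standard choice; using numpy is an assumption):

    import numpy as np

    def dft_features(seq, f):
        # Unitary DFT preserves Euclidean distance (Parseval), so distances
        # among feature points never exceed true distances: no false dismissals.
        coeffs = np.fft.fft(np.asarray(seq, dtype=float)) / np.sqrt(len(seq))
        kept = coeffs[:f]                              # first f coefficients
        return np.concatenate([kept.real, kept.imag])  # complex pairs -> real point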

22 Subsequence Matching Defined
Given N data sequences of real numbers S_1, S_2, …, S_N of potentially different lengths.
The user specifies a query subsequence Q of length Len(Q) and a tolerance ε (maximum distance).
Find quickly all the sequences S_i and the correct offsets k such that the subsequence S_i[k : k+Len(Q)−1] matches the query sequence:
D(Q, S_i[k : k+Len(Q)−1]) ≤ ε
Sequential scan is not efficient: too much space/time overhead.

23 ST-index
Assume the minimum query length is w.
Use a sliding window of size w and place it at every possible position on every data sequence.
Extract the features of the subsequence inside the window for each placement.
A data sequence of length Len(S) is mapped to a trail in feature space.
The trail consists of Len(S) − w + 1 points; each point represents one possible offset of the sliding window.
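A sketch of trail construction, reusing the dft_features sketch from the whole-matching slide:

    def trail(seq, w, f):
        # One feature point per sliding-window placement: Len(S) - w + 1 points.
        return [dft_features(seq[k:k + w], f) for k in range(len(seq) - w + 1)]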

24 How to index the trails
A straightforward way — I-naive:
–Keep track of the individual points of each trail and store them in a spatial access method
Problems:
–Storing the individual points of a trail in an R*-tree is inefficient in space and speed
–Almost every point in a data sequence corresponds to a point in the f-dimensional feature space: a 1:f increase in storage

25 MBR
Divide the trail into sub-trails. Each sub-trail is represented by its minimum bounding (hyper-)rectangle (MBR).
Only a few MBRs need to be stored.
When a query arrives, retrieve all the MBRs that intersect the query region.
Some false alarms are included (their MBRs intersect the query region, but the sub-trails do not).
MBRs belonging to the same trail may overlap.

26 MBR (Cont’d)
Information stored with each MBR:
–t_start, t_end: offsets of the first and last positionings
–sequence_id: unique identifier of the data sequence
–(F1_low, F1_high, F2_low, F2_high, …): extent of the MBR

27 MBR (Cont’d)
Group the MBRs to form MBRs at a higher level.
Non-leaf nodes do not store sequence_id or offsets.

28 Insertion – How to divide trails into sub-trails
I-fixed method:
–The sub-trail size is a fixed number or a simple function of Len(S)
–The resulting MBRs are not good

29 I-adaptive method
Goal: adapt to the distribution of the points of the trail.
Cost function:
–L = (L_1, L_2, …, L_n): sides of the n-dimensional MBR of a node in an R-tree
–DA(L): average number of disk accesses for that MBR
Marginal cost of each point in a sub-trail of k points with MBR L:
–mc = DA(L) / k

30 I-adaptive method: Algorithm
Assign the first point of the trail to a trivial sub-trail.
FOR each successive point:
–IF it increases the marginal cost of the current sub-trail
–THEN start another sub-trail
–ELSE include it in the current sub-trail
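A sketch of the algorithm (the concrete DA estimate is left as a caller-supplied function, since the slides do not give its formula):

    def split_into_subtrails(points, disk_accesses):
        # points: trail points as f-dim tuples in feature space
        # disk_accesses(L): estimated DA for an MBR with sides L
        def marginal_cost(pts):
            lows = [min(c) for c in zip(*pts)]
            highs = [max(c) for c in zip(*pts)]
            return disk_accesses([h - l for l, h in zip(lows, highs)]) / len(pts)
        subtrails = [[points[0]]]                    # first point: trivial sub-trail
        for p in points[1:]:
            cur = subtrails[-1]
            if marginal_cost(cur + [p]) > marginal_cost(cur):
                subtrails.append([p])                # start a new sub-trail
            else:
                cur.append(p)
        return subtrails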

31 Searching: Len(Q) = w
Q is mapped to a point q_f in feature space; the query corresponds to a sphere in feature space with center q_f and radius ε.
Retrieve the sub-trails whose MBRs intersect the query region, using the index.
Examine the corresponding subsequences of the data sequences to discard the false alarms.

32 Searching: Len(Q) = p·w
If Q and S agree within tolerance ε, then at least one of the pairs (s_i, q_i) of corresponding subsequences agrees within tolerance ε/√p.
Q is broken into p sub-queries, which correspond to p spheres in feature space with radius ε/√p.
Retrieve the sub-trails whose MBRs intersect at least one sub-query region, using the ST-index.
Examine the corresponding subsequences of the data sequences to discard the false alarms.
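A sketch of the long-query search (range_search and verify stand in for the index lookup and the raw-data check; both names are assumptions, and dft_features is the sketch from the whole-matching slide):

    import math

    def search_long_query(q, w, f, eps, range_search, verify):
        # Split Q into p pieces of length w; each sub-query sphere gets
        # radius eps / sqrt(p), per the lemma above.
        p = len(q) // w
        candidates = set()                           # union of candidate matches
        for i in range(p):
            piece = q[i * w:(i + 1) * w]
            candidates |= range_search(dft_features(piece, f), eps / math.sqrt(p))
        return {c for c in candidates if verify(c, q, eps)}  # drop false alarms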

33 Conclusion
Designed a method that efficiently handles approximate queries for subsequence matching.
It fulfills the following requirements:
–Fast: experimental results showed orders-of-magnitude savings over sequential scanning
–Small space overhead
–Dynamic
–Correct: no false dismissals

34 Thank you!

