Indexing Time Series. Based on original slides by Prof. Dimitrios Gunopulos and Prof. Christos Faloutsos, with some slides from tutorials by Prof. Eamonn Keogh and Dr. Michalis Vlachos.


1 Indexing Time Series. Based on original slides by Prof. Dimitrios Gunopulos and Prof. Christos Faloutsos, with some slides from tutorials by Prof. Eamonn Keogh and Dr. Michalis Vlachos. Excellent tutorials (and not only) about time series, as well as a nice tutorial on Matlab and time series, can be found at the links given there.

2 Time Series Databases. A time series is a sequence of real numbers, representing the measurements of a real variable at equal time intervals: stock prices, volume of sales over time, daily temperature readings, ECG data. A time series database is a large collection of time series.

3 Time Series Data. A time series is a collection of observations made sequentially in time. (figure: a series plotted against a time axis and a value axis)

4 Time Series Problems (from a database perspective). The Similarity Problem: given X = x_1, x_2, ..., x_n and Y = y_1, y_2, ..., y_n, define and compute Sim(X, Y). E.g., do stocks X and Y have similar movements? Retrieve similar time series efficiently (indexing for similarity queries).

5 Types of queries: whole match vs. sub-pattern match; range query vs. nearest neighbors; all-pairs query.

6 Examples. Find companies with similar stock prices over a time interval. Find products with similar sales cycles. Cluster users with similar credit card utilization. Find similar subsequences in DNA sequences. Find scenes in video streams.

7 (figure: three stock-price series, $price vs. day over days 1-365) Distance function: chosen by an expert (e.g., Euclidean distance).

8 Problems. Define the similarity (or distance) function. Find an efficient algorithm to retrieve similar time series from a database (faster than a sequential scan). The similarity function depends on the application.

9 Metric Distances. What properties should a similarity distance have to allow (easy) indexing? D(A, B) = D(B, A) (symmetry); D(A, A) = 0 (constancy of self-similarity); D(A, B) >= 0 (positivity); D(A, B) <= D(A, C) + D(B, C) (triangle inequality). Sometimes the distance function that best fits an application is not a metric... then indexing becomes interesting...

10 Euclidean Similarity Measure. View each sequence as a point in n-dimensional Euclidean space (n = length of each sequence). Define the (dis-)similarity between sequences X and Y as D_p(X, Y) = (sum_{i=1..n} |x_i - y_i|^p)^(1/p): p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.

11 Euclidean model. The Euclidean distance between two time series Q = {q_1, q_2, ..., q_n} and S = {s_1, s_2, ..., s_n}, each of n datapoints, is D(Q, S) = sqrt(sum_{i=1..n} (q_i - s_i)^2). (figure: the series in the database ranked by their distance to the query Q)

12 Advantages. Easy to compute: O(n). Allows scalable solutions to other problems, such as indexing, clustering, etc.
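The L_p distances above can be sketched in a few lines. This is a minimal illustration, not code from the slides; the function name lp_distance is ours.

```python
def lp_distance(x, y, p=2):
    """L_p distance between two equal-length sequences viewed as points
    in n-dimensional space: p=1 is Manhattan, p=2 is Euclidean."""
    assert len(x) == len(y), "sequences must have equal length"
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)
```

A single pass over the n points, hence the O(n) cost noted on the slide.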

13 Dynamic Time Warping [Berndt, Clifford, 1994]. Allows acceleration-deceleration of signals along the time dimension. Basic idea: consider X = x_1, x_2, ..., x_n and Y = y_1, y_2, ..., y_n. We are allowed to extend each sequence by repeating elements; the Euclidean distance is then calculated between the extended sequences X and Y. Use a matrix M, where m_ij = d(x_i, y_j).

14 Example Euclidean distance vs DTW

15 Dynamic Time Warping [Berndt, Clifford, 1994]. (figure: the matrix of X against Y, with a warping path constrained to the band between j = i - w and j = i + w)

16 Restrictions on Warping Paths. Monotonicity: the path should not go down or to the left. Continuity: no elements may be skipped in a sequence. Warping window: |i - j| <= w.

17 Example. (figure: the matrix of pairwise distances d(s_i, q_j) between the elements s_1, ..., s_9 of S and q_1, ..., q_9 of Q)

18 Example. (figure: the same matrix filled in with dynamic programming) based on the recurrence dist(i, j) = d(s_i, q_j) + min{dist(i-1, j-1), dist(i, j-1), dist(i-1, j)}.

19 Formulation. Let D(i, j) denote the dynamic time warping distance between the subsequences x_1, x_2, ..., x_i and y_1, y_2, ..., y_j. Then D(i, j) = |x_i - y_j| + min{D(i - 1, j), D(i - 1, j - 1), D(i, j - 1)}.

20 Solution by Dynamic Programming. The basic implementation is O(n^2), where n is the length of the sequences: the problem must be solved for each (i, j) pair. If a warping window is specified, it is O(nw): only the (i, j) pairs with |i - j| <= w are solved.
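The recurrence and the warping window can be sketched directly. This is our own minimal implementation of the formulation above, not the authors' code; without a window it computes the unconstrained DTW.

```python
def dtw(x, y, w=None):
    """DTW distance via dynamic programming:
    D(i, j) = |x_i - y_j| + min(D(i-1, j), D(i-1, j-1), D(i, j-1)),
    optionally restricted to the warping window |i - j| <= w."""
    n, m = len(x), len(y)
    if w is None:
        w = max(n, m)  # no effective window constraint
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
    return D[n][m]
```

With a window of half-width w, only O(nw) cells are filled, matching the cost stated on the slide.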

21 Longest Common Subsequence Measures (Allowing for Gaps in Sequences). (figure: two series aligned with a gap skipped)

22 Longest Common Subsequence (LCSS). (figure: the majority of points match, while a noisy region is ignored) Advantages of LCSS: (a) outlying values are not matched; (b) the distance/similarity is distorted less; (c) constraints in time and space. Disadvantages of DTW: (a) all points are matched; (b) outliers can distort the distance; (c) one-to-many mappings. LCSS is more resilient to noise than DTW.

23 Longest Common Subsequence. A dynamic programming solution similar to DTW, but now we measure similarity, not distance. It can also be expressed as a distance.
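The dynamic program can be sketched as follows. This is our own minimal version, not the slides' code; for real-valued series we use the common variant where two elements "match" if they differ by less than a threshold eps (an assumption, since the slides do not fix the matching rule).

```python
def lcss(x, y, eps=0.5):
    """Length of the longest common subsequence, where elements match
    if they differ by less than eps. Non-matching elements are skipped
    (gaps), which is what makes LCSS robust to outliers."""
    n, m = len(x), len(y)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(x[i - 1] - y[j - 1]) < eps:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]

def lcss_distance(x, y, eps=0.5):
    """The similarity re-expressed as a distance in [0, 1]."""
    return 1.0 - lcss(x, y, eps) / min(len(x), len(y))
```

Note how the outlier 100.0 below is simply skipped rather than matched, unlike in DTW.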

24 Similarity Retrieval. Range query: find all time series S where D(Q, S) <= ε. Nearest-neighbor query: find the k time series most similar to Q. A method to answer the above queries: a linear scan... very slow. A better approach: GEMINI.

25 GEMINI. Solution: a 'quick-and-dirty' filter: extract m features (numbers, e.g., avg., etc.); map each series into a point in m-d feature space; organize the points with an off-the-shelf spatial access method (SAM); retrieve the answer using a NN query; discard the false alarms.

26 GEMINI Range Queries. Build an index for the database in a feature space using an R-tree. Algorithm RangeQuery(Q, ε): 1. Project the query Q into a point q in the feature space. 2. Find all candidate objects in the index within ε of q. 3. Retrieve the actual sequences from disk. 4. Compute the actual distances and discard false alarms.
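The filter-and-refine pattern can be sketched without a real spatial index. This is a toy, not the GEMINI implementation: instead of an R-tree over DFT features, we scan a list and use a deliberately simple feature map (the first k values of each series), which satisfies the lower bounding condition because dropping coordinates can only shrink a Euclidean distance.

```python
import math

def true_dist(x, y):
    """Exact Euclidean distance between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def features(x, k=2):
    """Toy feature map F(x): keep the first k values, so that
    D_feature(F(x), F(y)) <= D(x, y) holds (the GEMINI condition)."""
    return x[:k]

def range_query(query, db, eps):
    q_f = features(query)
    # Filter step: feature-space candidates form a superset of the answer
    # (no false dismissals, thanks to the lower bound).
    candidates = [s for s in db if true_dist(features(s), q_f) <= eps]
    # Refine step: compute the actual distances and discard false alarms.
    return [s for s in candidates if true_dist(s, query) <= eps]

db = [[0.0, 0.0, 0.0], [0.0, 0.0, 10.0], [5.0, 5.0, 5.0]]
```

Here [0.0, 0.0, 10.0] is a false alarm: it survives the filter (its features match the query's exactly) but is discarded in the refine step.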

27 GEMINI NN Query. Algorithm K_NNQuery(Q, K): 1. Project the query Q into the same feature space. 2. Find the candidate K nearest neighbors in the index. 3. Retrieve from disk the actual sequences pointed to by the candidates. 4. Compute the actual distances and record the maximum, ε_max. 5. Issue RangeQuery(Q, ε_max). 6. Compute the actual distances and return the best K.

28 GEMINI. GEMINI works when: D_feature(F(x), F(y)) <= D(x, y). Note that the closer the feature distance is to the actual one, the better.

29 Generic Search using Lower Bounding. (figure: the query and the DB are both simplified; searching the simplified DB yields an answer superset, which is verified against the original DB to give the final answer set)

30 Problem. How to extract the features? How to define the feature space? Fourier transform, wavelet transform, averages of segments (histograms or APCA), Chebyshev polynomials, ..., your favorite curve approximation...

31 Fourier transform. DFT (Discrete Fourier Transform): transforms the data from the time domain to the frequency domain; highlights the periodicities. SO?

32 DFT. A: several real sequences are periodic. Q: Such as? A: sales patterns follow seasons; the economy follows a 50-year cycle (or 10?); temperature follows daily and yearly cycles. Many real signals follow (multiple) cycles.

33 How does it work? It decomposes the signal into a sum of sine and cosine waves. Q: How to assess the similarity of x = {x_0, x_1, ..., x_{n-1}} with a (discrete) wave s = {s_0, s_1, ..., s_{n-1}}?

34 How does it work? A: consider the waves with frequency 0, 1, ...; use the inner product (~cosine similarity). (figure: the waves with frequency f = 0 and f = 1, i.e., sin(2π t / n); frequency = 1/period)

35 How does it work? A: consider the waves with frequency 0, 1, ...; use the inner product (~cosine similarity). (figure: the wave with frequency f = 2)

36 How does it work? (figure: the basis functions over 0..n: sines and cosines with frequencies 1, 2, ..., n)

37 How does it work? The basis functions are actually n-dimensional vectors, orthogonal to each other. The similarity of x with each of them is an inner product. The DFT is ~ all the similarities of x with the basis functions.

38 How does it work? Since e^{jf} = cos(f) + j sin(f) (j = sqrt(-1)), we finally have:

39 DFT: definition. Discrete Fourier Transform (n-point): X_f = (1/sqrt(n)) sum_{t=0..n-1} x_t e^{-j 2π t f / n}, for f = 0, 1, ..., n-1. Inverse DFT: x_t = (1/sqrt(n)) sum_{f=0..n-1} X_f e^{+j 2π t f / n}.

40 DFT: properties. Observation (SYMMETRY property): X_f = (X_{n-f})*, where * denotes the complex conjugate: (a + b j)* = a - b j. Thus we use only the first half of the coefficients.
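The definition and the symmetry property can be checked directly with a naive O(n^2) DFT. This is our own sketch using the slides' 1/sqrt(n) normalization, which also makes the transform energy-preserving; a real system would use an FFT library instead.

```python
import cmath
import math

def dft(x):
    """Naive n-point DFT with 1/sqrt(n) normalization:
    X_f = (1/sqrt(n)) * sum_t x_t * e^{-j 2 pi t f / n}."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * t * f / n) for t in range(n))
            / math.sqrt(n) for f in range(n)]

# For a real input, X_f = (X_{n-f})* (conjugate symmetry),
# so only the first half of the coefficients carries information.
X = dft([1.0, 2.0, 3.0, 4.0])
```

The same normalization makes sum_t x_t^2 equal to sum_f |X_f|^2, which is Parseval's theorem from a later slide.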

41 DFT: Amplitude spectrum. Amplitude A_f = |X_f|. Intuition: the strength of frequency f. (figure: a time-domain count series and its amplitude spectrum, which peaks at frequency 12)

42 DFT: Amplitude spectrum. An excellent approximation, with only 2 frequencies! So what?

43 Raw Data. The graphic shows a time series C with 128 points (n = 128). The raw data used to produce the graphic is also reproduced as a column of numbers (just the first 30 or so points are shown).

44 Fourier Coefficients. We can decompose the data into 64 pure sine waves using the Discrete Fourier Transform (just the first few sine waves are shown). The Fourier coefficients are reproduced as a column of numbers (just the first 30 or so coefficients are shown).

45 Truncated Fourier Coefficients. Keeping only N = 8 of the n = 128 coefficients (ratio = 1/16), we have discarded 15/16 of the data.

46 Sorted Truncated Fourier Coefficients. Instead of taking the first few coefficients, we could take the best coefficients.

47 DFT: Parseval's theorem. sum_t (x_t^2) = sum_f (|X_f|^2). I.e., the DFT preserves the energy; or, alternatively, it performs an axis rotation (e.g., of the point x = {x_0, x_1}).

48 Lower Bounding lemma. Using Parseval's theorem we can prove the lower bounding property! So: apply the DFT to each time series, keep the first 3-10 coefficients as a vector, and use an R-tree to index the vectors. The R-tree works with the Euclidean distance, OK.
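The lower bounding property is easy to verify numerically: with the energy-preserving DFT, the distance over the first k coefficients can never exceed the true Euclidean distance. A minimal sketch (our own code and example series, not from the slides):

```python
import cmath
import math

def dft(x):
    """n-point DFT with 1/sqrt(n) normalization (energy-preserving)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * t * f / n) for t in range(n))
            / math.sqrt(n) for f in range(n)]

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def feature_dist(x, y, k):
    """Euclidean distance over only the first k DFT coefficients.
    By Parseval, summing fewer squared terms can only give a smaller
    value, so feature_dist <= euclid: no false dismissals."""
    X, Y = dft(x), dft(y)
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(X[:k], Y[:k])))

# Two example series of length 8:
x = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 0.0, 2.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 1.0, 0.0]
```

Keeping all n coefficients recovers the true distance exactly; keeping 3-10 of them gives the cheap filter distance used by the R-tree.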

49 Time series collections Fourier and wavelets are the most prevalent and successful descriptions of time series. Next, we will consider collections of M time series, each of length N. What is the series that is most similar to all series in the collection? What is the second most similar, and so on…

50 Time series collections. Some notation: the i-th series is x^(i); its value at time t is x_t.

51 Principal Component Analysis. Example: exchange rates (vs. USD). (figure: the first four principal components u_1, u_2, u_3, u_4 over time; they cumulatively explain 48%, 48% + 33% = 81%, 81% + 11% = 92%, and 92% + 4% = 96% of the variance) The best basis is {u_1, u_2, u_3, u_4}; each time series is represented by its coefficients w.r.t. this basis, e.g., x^(2) = 49.1 u_1 + ...

52 Principal component analysis. (figure: scatter plot of the first two principal-component coefficients for each currency: FRF, BEF, DEM, NLG, ESP, GBP, CAD, JPY, AUD, SEK, NZL, CHF)

53 Principal Component Analysis. Matrix notation: the Singular Value Decomposition (SVD), X = U Σ V^T, where the rows of X are the M time series x^(1), x^(2), ..., x^(M), V^T is a basis for the time series, and U holds the coefficients of each series w.r.t. the basis {u_1, u_2, ..., u_k} (columns).

54 Principal Component Analysis. Matrix notation, SVD X = U Σ V^T: the rows v_1, v_2, ..., v_k of V^T (each of length N) form a basis for the measurements (rows), and the columns of U give the coefficients w.r.t. that basis.

55 Principal Component Analysis. Matrix notation, SVD X = U Σ V^T: Σ is diagonal with entries σ_1, σ_2, ..., σ_k, the scaling factors (singular values).

56 PCA gives another lower-dimensional transformation. It is easy to show that the lower bounding lemma holds, but it needs a collection of time series and is expensive to compute exactly.
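The factorization of the preceding slides can be sketched with an off-the-shelf SVD. This is a toy example (the collection X is invented for illustration), not the exchange-rate data from the slides.

```python
import numpy as np

# Toy collection: M = 3 time series (rows) of length N = 4 (columns).
# The rows are nearly collinear, so one principal component suffices.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0],
              [1.0, 2.1, 2.9, 4.2]])

# X = U * diag(s) * Vt: the rows of Vt are the basis series,
# and U scaled by s gives each series' coefficients w.r.t. that basis.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 1  # keep only the first principal component
X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]  # rank-k reconstruction
```

Note that the basis is computed from the whole collection, which is exactly why PCA needs all the series up front and is expensive to maintain exactly.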

57 Feature Spaces. (figure: a series X and its approximation X' in three feature spaces) DFT: Agrawal, Faloutsos, Swami 1993. DWT (Haar wavelets 0-7): Chan & Fu 1999. SVD (eigenwaves 0-7): Korn, Jagadish, Faloutsos 1997.

58 Piecewise Aggregate Approximation (PAA). Original time series (an n-dimensional vector): S = {s_1, s_2, ..., s_n}. N-segment PAA representation (an N-dimensional vector): S' = {sv_1, sv_2, ..., sv_N}, where sv_i is the mean of the i-th segment. The PAA representation satisfies the lower bounding lemma (Keogh, Chakrabarti, Mehrotra and Pazzani, 2000; Yi and Faloutsos, 2000).
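PAA and its lower-bounding distance can be sketched in a few lines. This is our own minimal version (assuming the series length is divisible by the number of segments), not the authors' code.

```python
import math

def paa(series, n_segments):
    """PAA: the mean of each of n_segments equal-length segments.
    Assumes len(series) is divisible by n_segments."""
    seg = len(series) // n_segments
    return [sum(series[i * seg:(i + 1) * seg]) / seg
            for i in range(n_segments)]

def paa_dist(x, y, n_segments):
    """Distance between PAA representations, scaled by sqrt(segment length)
    so that it lower-bounds the true Euclidean distance."""
    seg = len(x) // n_segments
    return math.sqrt(seg * sum((a - b) ** 2
                               for a, b in zip(paa(x, n_segments),
                                               paa(y, n_segments))))

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

The bound can be loose: two different series with equal segment means have PAA distance zero, yet they are never ranked past the true answer, only filtered too leniently.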

59 Can we improve upon PAA? Adaptive Piecewise Constant Approximation (APCA). Instead of the N-segment PAA representation {sv_1, sv_2, ..., sv_N}, use an M-segment APCA representation S' = {sv_1, sr_1, sv_2, sr_2, ..., sv_M, sr_M}, where M = N/2 is the number of segments, sv_i is the value of the i-th segment and sr_i its right endpoint; the segments may have different lengths.

60 Distance Measure. D(Q, S): the exact (Euclidean) distance between Q and S. D_LB(Q, S'): the lower bounding distance between Q and the approximation S'.

61 Lower Bounding the Dynamic Time Warping. Recent approaches use the Minimum Bounding Envelope (MBE) for bounding the constrained DTW: create an envelope (U, L) of the query Q and calculate the distance between the MBE of Q and any sequence A. One can show that D(MBE_δ(Q), A) < DTW(Q, A), where δ is the constraint. (figure: Q, A, and the envelope MBE(Q) of width 2δ, with upper boundary U and lower boundary L)

62 Lower Bounding the Dynamic Time Warping. LB by Keogh: approximate the MBE and the sequence using MBRs. LB by Zhu and Shasha: approximate the MBE and the sequence using PAA. (figures: Q and A under each approximation)

63 Computing the LB distance. Use PAA to approximate each time series A in the database, as well as the U and L of the query envelope, using k segments. Then LB_PAA can be computed as follows:

64 LB_PAA(Q, A) = sqrt( sum_{i=1..k} (n/k) c_i ), where c_i = (A'_i - U'_i)^2 if A'_i > U'_i, c_i = (L'_i - A'_i)^2 if A'_i < L'_i, and c_i = 0 otherwise; here A'_i is the average of the i-th segment of the time series A, i.e., A'_i = (k/n) sum_{t=(n/k)(i-1)+1}^{(n/k)i} A_t, and similarly we compute U'_i and L'_i from U and L.
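The envelope-based bound is easiest to see in its pointwise (non-PAA) form: build U and L from the query, then sum the squared amounts by which the sequence escapes the envelope. This is our own sketch of that version, assuming a Sakoe-Chiba band of half-width w and the squared-Euclidean DTW variant; the PAA version of the slides replaces points by segment averages.

```python
import math

def lb_keogh(q, a, w):
    """Pointwise envelope lower bound for the constrained DTW:
    U_i = max(q[i-w..i+w]), L_i = min(q[i-w..i+w]); only the parts of a
    outside [L, U] contribute. Zero whenever a stays inside the envelope."""
    n = len(q)
    total = 0.0
    for i in range(n):
        window = q[max(0, i - w):min(n, i + w + 1)]
        U, L = max(window), min(window)
        if a[i] > U:
            total += (a[i] - U) ** 2
        elif a[i] < L:
            total += (L - a[i]) ** 2
    return math.sqrt(total)
```

In the example below only the last point (5.0, above the envelope's U = 1.0) contributes, so the bound is sqrt((5 - 1)^2) = 4.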

