It's about time. Choosing Distance Measures for Mining Time Series Data Spencer Schnier 2/22/11.

1 It's about time. Choosing Distance Measures for Mining Time Series Data Spencer Schnier 2/22/11

2 Major Time Series Data Mining Tasks (Ratanamahatana et al., 2010)
Indexing, Clustering, Classification, Prediction, Summarization, Anomaly Detection, Segmentation.
Indexing and clustering make explicit use of a distance measure; the others make implicit use of one.

3 Popular Distance Measures (Ding et al., 2008)
Lock-step measures (one-to-one)
  o Minkowski distance: L1 norm (Manhattan distance), L2 norm (Euclidean distance), L∞ norm (supremum distance)
Elastic measures (one-to-many / one-to-none)
  o Dynamic Time Warping (DTW)
  o Edit-distance-based measures: Longest Common SubSequence (LCSS), Edit Distance on Real Sequence (EDR)
Threshold-based measures
  o Threshold query based similarity search (TQuEST)
Pattern-based measures
  o Spatial Assembling Distance (SpADe)

4 Minkowski Distance
h = 1: Manhattan (city block, L1 norm) distance. E.g., the Hamming distance: the number of bits that differ between two binary vectors.
h = 2: Euclidean (L2 norm) distance.
h → ∞: supremum (L_max norm, L∞ norm) distance. This is the maximum difference between any component (attribute) of the vectors.
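
The three cases on this slide can be illustrated with a minimal Python sketch. It assumes the standard definition of the Minkowski distance of order h, (Σ |x_i − y_i|^h)^(1/h); the function name `minkowski` and the sample vectors are illustrative and do not come from the slides.

```python
import math

def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("lock-step measures need equal-length sequences")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(h):
        # Limit h -> infinity: the largest per-component difference.
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1.0 / h)

# Illustrative vectors (an assumption, not data from the presentation).
x = [1, 2]
y = [3, 5]
manhattan = minkowski(x, y, 1)         # |1-3| + |2-5|
euclidean = minkowski(x, y, 2)         # sqrt(2^2 + 3^2)
supremum = minkowski(x, y, math.inf)   # max(2, 3)

# On binary vectors the L1 distance counts differing bits (Hamming distance).
hamming = minkowski([1, 0, 1, 1], [0, 0, 1, 0], 1)
```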

5 Minkowski Distance Examples: dissimilarity matrices for the Manhattan (L1), Euclidean (L2), and supremum distances.

6 What's wrong with Euclidean Distance?
Similar sequences may be shifted and have different scales. Remedy: normalize the time series before measuring the distance between them (Goldin and Kanellakis, 1995).
What if a sequence is stretched or compressed along the time axis?
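
A small sketch of the normalization remedy follows. It assumes z-normalization (subtract the mean, divide by the standard deviation), which is one common choice; the slide does not say which normalization is meant, and the helper names and sample data are hypothetical.

```python
import math

def z_normalize(series):
    """Rescale a series to zero mean and unit standard deviation."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    if std == 0:
        # A constant series has no shape to preserve; map it to zeros.
        return [0.0] * n
    return [(v - mean) / std for v in series]

def euclidean(x, y):
    """Lock-step Euclidean distance between equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Two series with the same shape; b is a shifted and scaled copy of a.
a = [1.0, 2.0, 3.0, 2.0, 1.0]
b = [10.0 + 5.0 * v for v in a]

raw = euclidean(a, b)                                    # large: offset and scale dominate
normalized = euclidean(z_normalize(a), z_normalize(b))   # close to zero
```

Normalization removes differences in offset and amplitude only. It does not help when one sequence is stretched or compressed in time, which is the case elastic measures address.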

7 Dynamic Time Warping (Berndt and Clifford, 1996)
Handles sequences that are similar but accelerate differently along the time axis.
Enforcing a temporal constraint δ on the warping window size improves computational efficiency and accuracy.
Application: speech recognition.
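
The idea can be sketched as a dynamic program. The version below is a minimal illustration, not the formulation from the cited paper: it assumes the absolute difference as the local cost and models the constraint δ as a band |i − j| ≤ window around the diagonal. The parameter name `window` and the sample sequences are assumptions.

```python
import math

def dtw(x, y, window=None):
    """DTW distance with an optional band constraint |i - j| <= window."""
    n, m = len(x), len(y)
    if window is None:
        window = max(n, m)                 # unconstrained warping
    window = max(window, abs(n - m))       # the band must reach the last cell
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells inside the band are filled, which is what saves time.
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # x advances alone
                                 cost[i][j - 1],      # y advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# Same shape, but the peak arrives one step earlier in `fast`.
slow = [0, 0, 1, 2, 1, 0]
fast = [0, 1, 2, 1, 0, 0]

unconstrained = dtw(slow, fast)
narrow_band = dtw(slow, fast, window=1)
lock_step = dtw(slow, fast, window=0)   # degenerates to a point-by-point L1 sum
```

With `window=0` every point can only be matched to the point at the same index, so the result equals the lock-step Manhattan distance; widening the band lets the two peaks align.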

8 Longest Common Subsequence Similarity (Vlachos et al., 2002)
Matches two sequences by allowing some elements to be unmatched.
Example: C = {1,2,3,4,5,1,7} and Q = {2,5,4,5,3,1,8}; the longest common subsequence is {2,4,5,1}.
Dissimilarity: [formula missing]. Tolerance: [formula missing].
Application: bioinformatics.

9 Longest Common Subsequence Similarity (Vlachos et al., 2002)
Input: sequences C[1..m] and Q[1..n].
Compute the LCS length between C[1..i] and Q[1..j] for all 1 ≤ i ≤ m and 1 ≤ j ≤ n, and store it in L[i,j]. L[m,n] is the length of the LCS.

    L[i,0] := 0 for all i; L[0,j] := 0 for all j
    for i := 1..m
        for j := 1..n
            if C[i] = Q[j]
                L[i,j] := L[i-1,j-1] + 1
            else
                L[i,j] := max(L[i,j-1], L[i-1,j])
    return L[m,n]
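
The recurrence on this slide translates directly into Python. This is a sketch using exact equality, as in the pseudocode; the function name `lcs_length` is illustrative, and the sample sequences are the C and Q from the preceding slide.

```python
def lcs_length(c, q):
    """Length of the longest common subsequence of sequences c and q."""
    m, n = len(c), len(q)
    # table[i][j] holds the LCS length of c[:i] and q[:j].
    # Row 0 and column 0 stay 0: an empty prefix shares nothing.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if c[i - 1] == q[j - 1]:
                # Matching elements extend the LCS of the shorter prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise drop the last element of one sequence.
                table[i][j] = max(table[i][j - 1], table[i - 1][j])
    return table[m][n]

C = [1, 2, 3, 4, 5, 1, 7]
Q = [2, 5, 4, 5, 3, 1, 8]
length = lcs_length(C, Q)   # the subsequence {2, 4, 5, 1} has length 4
```

For real-valued time series, exact equality is usually replaced by a tolerance test such as |C[i] − Q[j]| ≤ ε; that variant is not shown here.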

10 Edit Distance on Real Sequence (Chen et al., 2005)
Similar to LCSS.
Uses a threshold parameter ε to quantize the distance between a pair of points to 0 or 1.
Seeks the minimum number of edit operations needed to change one sequence into another.
Assigns penalties to the unmatched segments according to the lengths of the gaps.
Application: trajectories of moving objects.
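
A minimal sketch of EDR as an edit-distance dynamic program is given below. It assumes one-dimensional sequences and a unit cost for every insert, delete, or substitution, with two points counting as a match when they differ by at most ε; the original work targets multi-dimensional trajectories, which this simplification does not cover. The names `edr` and `eps` are illustrative.

```python
def edr(r, s, eps):
    """Edit Distance on Real sequence for 1-D sequences r and s."""
    n, m = len(r), len(s)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    # Matching against an empty sequence costs one edit per element,
    # so a gap is penalized in proportion to its length.
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # The threshold turns a real-valued difference into 0 or 1.
            subcost = 0 if abs(r[i - 1] - s[j - 1]) <= eps else 1
            dist[i][j] = min(dist[i - 1][j - 1] + subcost,  # match or substitute
                             dist[i - 1][j] + 1,            # skip a point of r
                             dist[i][j - 1] + 1)            # skip a point of s
    return dist[n][m]

noisy_copy = edr([1.0, 2.0, 3.0], [1.1, 2.1, 2.9], eps=0.2)   # all points match
one_outlier = edr([1, 2, 3], [1, 2, 100, 3], eps=0.5)         # one extra point
```

Because each mismatch costs exactly 1 regardless of its size, a single extreme outlier adds no more to the distance than any other unmatched point.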

11 TQuEST (Aßfalg et al., 2006)
Uses a threshold parameter τ to transform a time series into a sequence of threshold-crossing intervals (the points within each interval have a value greater than the given τ).
Each interval is treated as a 2D point: x = starting time, y = ending time.
The similarity between two time series is then defined as the Minkowski sum of the two sequences of time interval points.

SpADe (Chen et al., 2007)
A pattern-based similarity measure for time series.
Finds matching segments, called patterns, by allowing shifting and scaling.
Then finds the most similar set of matching patterns.
Disadvantage: requires many parameters (temporal and amplitude scale factors, pattern length, sliding step size, etc.).
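
The first TQuEST step, turning a series into threshold-crossing intervals, can be sketched as follows. This covers only the transformation, not the distance between the resulting point sets, and it assumes sample indices stand in for time stamps. The function name `threshold_intervals` and the sample series are hypothetical.

```python
def threshold_intervals(series, tau):
    """Return (start, end) index pairs of maximal runs with value > tau."""
    intervals = []
    start = None
    for t, value in enumerate(series):
        if value > tau and start is None:
            start = t                          # series rises above the threshold
        elif value <= tau and start is not None:
            intervals.append((start, t - 1))   # series falls back to or below it
            start = None
    if start is not None:
        # The series ended while still above the threshold.
        intervals.append((start, len(series) - 1))
    return intervals

series = [0, 2, 3, 1, 0, 4, 5, 5, 0]
points = threshold_intervals(series, tau=1)
```

Each returned pair can be read as a 2D point with x = starting time and y = ending time, as described on the slide.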

12 Comparison of Distance Measures (Ding et al., 2008)

13 Comparison of Distance Measures (Ding et al., 2008)
1. The accuracy of elastic measures converges with that of Euclidean distance as the training set grows. On small data sets, elastic measures can be significantly more accurate than lock-step measures.
2. Constraining the warping window size for elastic measures can reduce the computation cost and increase accuracy.
3. The accuracy of edit-distance-based similarity measures is very close to that of DTW. Only EDR is potentially slightly better than DTW.
4. The accuracy of several newer similarity measures, such as TQuEST and SpADe, is in general inferior to that of elastic measures.
5. To improve the accuracy of a similarity measure, get more training data.
6. If you can't get more data, trying the other measures might help; however, be careful to avoid overfitting.

14 ELKI 0.2 (Achtert et al., 2009)
Software for visualization and performance evaluation of distance measures for time series.
www.dbs.ifi.lmu.de/research/KDD/ELKI/

15 Research Questions
Is distance measure performance related to some intrinsic properties of the data set?
If so, can those properties be used to identify the most appropriate distance measure?

16 References
Achtert, E., T. Bernecker, H.-P. Kriegel, E. Schubert, and A. Zimek (2009). ELKI in Time: ELKI 0.2 for the Performance Evaluation of Distance Measures for Time Series. SSTD.
Aßfalg, J., H.-P. Kriegel, P. Kröger, P. Kunath, A. Pryakhin, and M. Renz (2006). Similarity search on time series based on threshold queries. EDBT.
Berndt, D., and J. Clifford (1996). Finding Patterns in Time Series: A Dynamic Programming Approach. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, Menlo Park, CA.
Chen, L., M. Ozsu, and V. Oria (2005). Robust and fast similarity search for moving object trajectories. SIGMOD 05.
Chen, Y., M. Nascimento, B. Ooi, and A. Tung (2007). SpADe: On Shape-based Pattern Detection in Streaming Time Series. ICDE.
Ding, H., G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh (2008). Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures. VLDB 08.
Goldin, D., and P. Kanellakis (1995). On Similarity Queries for Time-Series Data: Constraint Specification and Implementation. Proceedings of the 1st International Conference on the Principles and Practice of Constraint Programming.
Ratanamahatana, C., J. Lin, D. Gunopulos, E. Keogh, M. Vlachos, and G. Das (2010). Mining Time Series Data. Data Mining and Knowledge Discovery Handbook, Part 6.
Vlachos, M., D. Gunopulos, and G. Kollios (2002). Discovering similar multidimensional trajectories. ICDE.
