Choosing Distance Measures for Mining Time Series Data

It's about time.
Choosing Distance Measures for Mining Time Series Data
Spencer Schnier, 2/22/11

Major Time Series Data Mining Tasks
- Indexing
- Clustering
- Classification
- Prediction
- Summarization
- Anomaly Detection
- Segmentation
Indexing and clustering make explicit use of a distance measure; the other tasks make implicit use of one (Ratanamahatana et al., 2010).

Popular Distance Measures
- Lock-step measures (one-to-one)
  - Minkowski distance
    - L1 norm (Manhattan distance)
    - L2 norm (Euclidean distance)
    - L∞ norm (supremum distance)
- Elastic measures (one-to-many / one-to-none)
  - Dynamic Time Warping (DTW)
  - Edit-distance-based measures
    - Longest Common SubSequence (LCSS)
    - Edit Distance on Real Sequence (EDR)
- Threshold-based measures
  - Threshold query based similarity search (TQuEST)
- Pattern-based measures
  - Spatial Assembling Distance (SpADe)
Notes: "distance measure" and "similarity measure" are used interchangeably here. Lock-step means the measure compares the i-th point of one series to the i-th point of the other (one-to-one). Elastic measures allow one-to-many mappings (DTW) or one-to-many/one-to-none mappings (LCSS) (Ding et al., 2008).

Minkowski Distance
- h = 1: Manhattan (city block, L1 norm) distance. E.g., the Hamming distance: the number of bits that differ between two binary vectors.
- h = 2: Euclidean (L2 norm) distance.
- h → ∞: "supremum" (Lmax norm, L∞ norm) distance. This is the maximum difference between any component (attribute) of the vectors.
Borrowed from CS 412 Chp2 slides
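
The three cases above are instances of the usual Minkowski form $d(x, y) = \left( \sum_i |x_i - y_i|^h \right)^{1/h}$. The sketch below is a minimal illustration of that form, not code from the slides; the function name `minkowski` and the sample points are made up for the example.

```python
import math

def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length sequences.

    h = 1 gives Manhattan, h = 2 gives Euclidean, and h = math.inf gives
    the supremum distance (largest per-component difference).
    """
    if len(x) != len(y):
        raise ValueError("lock-step measures need sequences of equal length")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1.0 / h)

# Illustrative points (not taken from the slides).
x, y = [1, 2], [3, 5]
print(minkowski(x, y, 1))         # Manhattan
print(minkowski(x, y, 2))         # Euclidean
print(minkowski(x, y, math.inf))  # supremum

# On binary vectors the L1 distance counts differing bits (Hamming distance).
print(minkowski([1, 0, 1, 1], [0, 0, 1, 0], 1))
```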

Minkowski Distance Examples
[Figure: dissimilarity matrices under the Manhattan (L1), Euclidean (L2), and supremum distances]
Borrowed from CS 412 Chp2 slides

What's wrong with Euclidean Distance?
- Sequences can be similar in shape yet shifted and scaled differently.
- Fix: normalize each time series before measuring the distance between them: $x_i' = (x_i - \mu) / \sigma$, where μ and σ are the mean and standard deviation of the series.
- What if a sequence is stretched or compressed along the time axis?
(Goldin and Kanellakis, 1995)
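
A small sketch of the normalization step, to show why it removes shift and scale differences. The helper names and the sample series are illustrative, and using the population standard deviation (dividing by n) is an assumption; the slide does not say which variant is meant.

```python
import math

def z_normalize(series):
    """Return (x_i - mu) / sigma for every point of the series."""
    n = len(series)
    mu = sum(series) / n
    # Assumption: population standard deviation (divide by n).
    sigma = math.sqrt(sum((v - mu) ** 2 for v in series) / n)
    if sigma == 0:
        return [0.0] * n  # constant series: nothing to scale
    return [(v - mu) / sigma for v in series]

def euclidean(x, y):
    """Lock-step Euclidean distance between equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

a = [1.0, 2.0, 3.0, 2.0, 1.0]
b = [10.0 + 5.0 * v for v in a]  # same shape, shifted and scaled

print(euclidean(a, b))                            # large: raw values differ
print(euclidean(z_normalize(a), z_normalize(b)))  # close to zero
```

Normalization does nothing for the second problem on the slide: a series stretched along the time axis still compares point i with point i.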

Dynamic Time Warping
- Handles sequences that are similar but accelerate differently along the time axis.
- Enforcing a temporal constraint δ on the warping window size can improve both computational efficiency and accuracy.
- Application: speech recognition
(Berndt and Clifford, 1996)
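
A minimal dynamic-programming sketch of DTW with an optional warping window, to make the constraint δ concrete. Assumptions not stated on the slide: the point cost is the absolute difference, the window is a band |i - j| ≤ `window` around the diagonal, and the names `dtw` and `window` are illustrative.

```python
import math

def dtw(x, y, window=None):
    """DTW distance between sequences x and y.

    window=None means unconstrained warping; window=0 on equal-length
    sequences degenerates to a lock-step (L1) comparison.
    """
    n, m = len(x), len(y)
    # The band must be at least |n - m| wide or the last cell is unreachable.
    w = max(n, m) if window is None else max(window, abs(n - m))
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells inside the band are filled; the rest stay infinite.
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # x[i] matched again
                                 D[i][j - 1],      # y[j] matched again
                                 D[i - 1][j - 1])  # both advance
    return D[n][m]

# The same bump, occurring two time steps later in y.
x = [0, 1, 0, 0, 0]
y = [0, 0, 0, 1, 0]
print(dtw(x, y))            # warping absorbs the delay
print(dtw(x, y, window=0))  # lock-step comparison penalizes it
print(dtw(x, y, window=2))  # a band of 2 is wide enough here
```

The band also limits the work to roughly n · (2δ + 1) cells instead of n · m, which is where the efficiency gain comes from.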

Longest Common Subsequence Similarity
- Matches two sequences while allowing some elements to remain unmatched.
- Example: C = {1,2,3,4,5,1,7} and Q = {2,5,4,5,3,1,8}; the longest common subsequence is {2,4,5,1}, so l = 4.
- Application: bioinformatics
- This is an edit-distance-based measure.
- Two points from two time series are considered to match if their distance is less than ε. Tolerance: $c(1-\varepsilon) < q < c(1+\varepsilon)$
- Dissimilarity is the minimum number of elements that must be removed from and inserted into C to transform C into Q, normalized by the total length: $LCSS(C,Q) = \frac{m + n - 2l}{m + n}$, where m and n are the lengths of C and Q and l is the length of the LCS. For the example, (7 + 7 - 8) / 14 = 3/7.
- Specifying a matching window can improve accuracy.
[Figure: dynamic-programming table for C and Q; its final entry is 4 and the recovered subsequence is 2 4 5 1]
(Vlachos et al., 2002)
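
A sketch of the dissimilarity with a matching tolerance. One deliberate deviation, labeled here as an assumption: the match test is |q - c| ≤ ε·|c| (non-strict), so that ε = 0 reduces to exact equality; the strict inequality on the slide would match nothing at ε = 0. The function names are illustrative.

```python
def lcss_length(C, Q, eps=0.0):
    """Length of the longest common subsequence under relative tolerance eps."""
    m, n = len(C), len(Q)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Assumption: non-strict relative tolerance around c.
            if abs(Q[j - 1] - C[i - 1]) <= eps * abs(C[i - 1]):
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i][j - 1], L[i - 1][j])
    return L[m][n]

def lcss_dissimilarity(C, Q, eps=0.0):
    """(m + n - 2l) / (m + n): 0 for identical sequences, 1 for no matches."""
    m, n = len(C), len(Q)
    l = lcss_length(C, Q, eps)
    return (m + n - 2 * l) / (m + n)

C = [1, 2, 3, 4, 5, 1, 7]
Q = [2, 5, 4, 5, 3, 1, 8]
print(lcss_length(C, Q))         # exact matching
print(lcss_dissimilarity(C, Q))  # (7 + 7 - 2*4) / 14
print(lcss_length(C, Q, 0.15))   # 7 and 8 now match as well
```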

Longest Common Subsequence Similarity
- Input: sequences C[1..m] and Q[1..n].
- Compute the LCS length between C[1..i] and Q[1..j] for all 1 ≤ i ≤ m and 1 ≤ j ≤ n, and store it in L[i,j], with L[i,0] = L[0,j] = 0.
- L[m,n] is the length of the LCS.

    for i := 1..m
        for j := 1..n
            if C[i] = Q[j]
                L[i,j] := L[i-1,j-1] + 1
            else
                L[i,j] := max(L[i,j-1], L[i-1,j])
    return L[m,n]

Table L for C = {1,2,3,4,5,1,7} (rows) and Q = {2,5,4,5,3,1,8} (columns):

         2  5  4  5  3  1  8
    1    0  0  0  0  0  1  1
    2    1  1  1  1  1  1  1
    3    1  1  1  1  2  2  2
    4    1  1  2  2  2  2  2
    5    1  2  2  3  3  3  3
    1    1  2  2  3  3  4  4
    7    1  2  2  3  3  4  4

Recovered subsequence: 2 4 5 1
(Vlachos et al., 2002)
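
A runnable version of the pseudocode above, extended with a traceback that recovers one longest common subsequence from the table. The traceback and its tie-breaking rule are additions for illustration; the slide only computes the length.

```python
def lcs_table(C, Q):
    """Fill L so that L[i][j] is the LCS length of C[:i] and Q[:j]."""
    m, n = len(C), len(Q)
    L = [[0] * (n + 1) for _ in range(m + 1)]  # row 0 and column 0 stay 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if C[i - 1] == Q[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i][j - 1], L[i - 1][j])
    return L

def lcs_traceback(C, Q, L):
    """Walk back from L[m][n] to recover one longest common subsequence."""
    i, j = len(C), len(Q)
    out = []
    while i > 0 and j > 0:
        if C[i - 1] == Q[j - 1]:
            out.append(C[i - 1])
            i, j = i - 1, j - 1
        elif L[i - 1][j] >= L[i][j - 1]:  # tie-break: move up first
            i -= 1
        else:
            j -= 1
    return out[::-1]

C = [1, 2, 3, 4, 5, 1, 7]
Q = [2, 5, 4, 5, 3, 1, 8]
L = lcs_table(C, Q)
print(L[len(C)][len(Q)])       # length of the LCS
print(lcs_traceback(C, Q, L))  # one LCS
```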

Edit Distance on Real Sequence
- Similar to LCSS.
- Uses a threshold parameter ε to quantize the distance between a pair of points to 0 or 1.
- Seeks the minimum number of edit operations needed to change one sequence into the other.
- Assigns penalties to unmatched segments according to the lengths of the gaps.
- Application: trajectories of moving objects
(Chen et al., 2005)
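
A sketch of the EDR recurrence for one-dimensional sequences. Treating points as scalars is a simplification for illustration (the cited application is multi-dimensional trajectories), and the function name `edr` is made up for the example.

```python
def edr(R, S, eps):
    """Edit Distance on Real sequence for scalar series R and S.

    Two points match (cost 0) when they differ by at most eps, otherwise
    replacing one with the other costs 1. Each gap element costs 1, so a
    gap is penalized in proportion to its length.
    """
    m, n = len(R), len(S)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i  # delete all of R[:i]
    for j in range(n + 1):
        D[0][j] = j  # insert all of S[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            subcost = 0 if abs(R[i - 1] - S[j - 1]) <= eps else 1
            D[i][j] = min(D[i - 1][j - 1] + subcost,  # match or replace
                          D[i - 1][j] + 1,            # gap in S
                          D[i][j - 1] + 1)            # gap in R
    return D[m][n]

print(edr([1.0, 2.0, 3.0], [1.1, 2.0, 2.9], 0.2))  # all points within eps
print(edr([1, 2, 3, 4], [1, 2, 100, 3, 4], 0.5))   # one outlier costs one edit
```

Because every mismatch costs exactly 1, a single large outlier counts the same as a small one, which is why the measure suits noisy trajectory data.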

TQuEST (Assfalg et al., 2006)
- Uses a threshold parameter τ to transform a time series into a sequence of threshold-crossing intervals (the points within each interval have values greater than τ).
- Each interval is treated as a 2D point: x = starting time, y = ending time.
- The similarity between two time series is then defined as the Minkowski sum of the two sequences of time-interval points.

SpADe (Chen et al., 2007)
- A pattern-based similarity measure for time series.
- Finds matching segments, called patterns, by allowing shifting and scaling.
- Then finds the most similar set of matching patterns.
- Disadvantage: requires many parameters (temporal and amplitude scale factors, pattern length, sliding step size, etc.).
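
The first TQuEST step, turning a series into threshold-crossing intervals, can be sketched as follows. This covers only the transformation, not the distance between interval sequences. Assumptions for the illustration: time is the sample index, an interval is reported as (first index above τ, last index above τ), and the function name is hypothetical.

```python
def threshold_intervals(series, tau):
    """Return (start, end) index pairs of maximal runs with values > tau.

    Each pair can be read as a 2D point: x = starting time, y = ending time.
    """
    intervals = []
    start = None
    for t, v in enumerate(series):
        if v > tau and start is None:
            start = t                        # series rises above tau
        elif v <= tau and start is not None:
            intervals.append((start, t - 1)) # series fell back at time t
            start = None
    if start is not None:                    # still above tau at the end
        intervals.append((start, len(series) - 1))
    return intervals

series = [0, 2, 3, 1, 0, 4, 5, 6, 0]
print(threshold_intervals(series, 1.5))
```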

Comparison of Distance Measures (Ding et al., 2008)

Comparison of Distance Measures
- The accuracy of elastic measures converges with that of Euclidean distance as the training set grows.
- On small data sets, elastic measures can be significantly more accurate than lock-step measures.
- Constraining the warping window size for elastic measures can reduce the computation cost and increase accuracy.
- The accuracy of edit-distance-based similarity measures is very close to that of DTW. Only EDR is potentially slightly better than DTW.
- The accuracy of several newer similarity measures, such as TQuEST and SpADe, is in general inferior to that of elastic measures.
- To improve the accuracy of a similarity measure, get more training data. If you can't get more data, trying the other measures might help; however, be careful to avoid overfitting.
Notes: elastic measures include DTW, LCSS, EDR, and ERP; lock-step measures include the L1 norm, Euclidean distance, and DISSIM; edit-distance-based measures include LCSS, EDR, and ERP.
(Ding et al., 2008)

ELKI 0.2
Software for visualization and performance evaluation of distance measures for time series
www.dbs.ifi.lmu.de/research/KDD/ELKI/
(Achtert et al., 2009)

Research Questions
- Is distance measure performance related to some intrinsic properties of the data set?
- If so, can those properties be used to identify the most appropriate distance measure?

References
Achtert, E., T. Bernecker, H.-P. Kriegel, E. Schubert, and A. Zimek. 2009. "ELKI in Time: ELKI 0.2 for the Performance Evaluation of Distance Measures for Time Series." SSTD 2009.
Aßfalg, J., H.-P. Kriegel, P. Kröger, P. Kunath, A. Pryakhin, and M. Renz. 2006. "Similarity search on time series based on threshold queries." EDBT 2006.
Berndt, D., and J. Clifford. 1996. "Finding Patterns in Time Series: A Dynamic Programming Approach." Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, Menlo Park, CA. pp. 229-248.
Chen, L., M. Ozsu, and V. Oria. 2005. "Robust and fast similarity search for moving object trajectories." SIGMOD '05.
Chen, Y., M. Nascimento, B. Ooi, and A. Tung. 2007. "SpADe: On Shape-based Pattern Detection in Streaming Time Series." ICDE 2007.
Ding, H., G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh. 2008. "Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures." VLDB '08.
Goldin, D., and P. Kanellakis. 1995. "On Similarity Queries for Time-Series Data: Constraint Specification and Implementation." Proceedings of the 1st International Conference on the Principles and Practice of Constraint Programming. pp. 137-153.
Ratanamahatana, C., J. Lin, D. Gunopulos, E. Keogh, M. Vlachos, and G. Das. 2010. "Mining Time Series Data." Data Mining and Knowledge Discovery Handbook. Part 6, pp. 1049-1077.
Vlachos, M., D. Gunopulos, and G. Kollios. 2002. "Discovering similar multidimensional trajectories." ICDE 2002.