Keogh, E., Chakrabarti, K., Pazzani, M. & Mehrotra, S. (2001)


Locally adaptive dimensionality reduction for indexing large time series databases
Keogh, E., Chakrabarti, K., Pazzani, M. & Mehrotra, S. (2001). Locally adaptive dimensionality reduction for indexing large time series databases. In Proceedings of the ACM SIGMOD Conference on Management of Data, May, pp. 151-162.

Motivating Applications
Time series similarity retrieval. Typical query: find time series objects similar to a given pattern.
Example applications:
  Medical diagnosis: a doctor searching an ECG database for a pattern that implies heart irregularity.
  Financial data: a stock analyst searching for a stock price pattern for prediction.
  Time-series mining: similarity search as a component inside a mining algorithm.
Other applications: scientific, spatial/spatio-temporal.
[Figure: a time series and its representation as a high-dimensional vector, e.g. (.13, .19, .28, .38, .30, .25, .19, .32)]

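Similarity Search in Time Series Data
Given a query Q of n datapoints and a database of time series of n datapoints each, rank the database series by their distance to Q.
Example on the slide: four database series at distances 0.98, 0.07, 0.21 and 0.43 from Q receive ranks 4, 1, 2 and 3.
Euclidean distance between two time series Q = {q1, q2, …, qn} and S = {s1, s2, …, sn}:
D(Q,S) = \sqrt{\sum_{i=1}^{n} (q_i - s_i)^2}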
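
The ranking described on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the toy data are assumptions made for the example, not part of the original slides.

```python
import math

def euclidean(q, s):
    """Euclidean distance between two equal-length sequences."""
    if len(q) != len(s):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((qi - si) ** 2 for qi, si in zip(q, s)))

def rank_by_distance(query, database):
    """Return (index, distance) pairs, most similar series first."""
    scored = [(i, euclidean(query, s)) for i, s in enumerate(database)]
    return sorted(scored, key=lambda pair: pair[1])

# Toy data, made up for illustration.
query = [0.0, 1.0, 2.0, 1.0]
database = [
    [3.0, 3.0, 3.0, 3.0],
    [0.0, 1.0, 2.0, 1.5],
    [0.0, 0.0, 2.0, 1.0],
]
ranking = rank_by_distance(query, database)
```

This is the sequential scan that the rest of the talk tries to avoid: every query touches all n datapoints of every series.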

Mapping to Multidimensional Space
Each database series (S1, S2, S3, S4) of n datapoints, and the query Q of n datapoints, maps to a point in n-dimensional space.
Index the n-d space using a multidimensional index structure to avoid slow sequential scanning.
n ~ 100-1000; need to reduce dimensionality from n to n’ that can be handled by the index structure.

Dimensionality Reduction Techniques for Time Series Data
PCA (Korn, Jagadish, Faloutsos 1997): basis of eigenwaves 0 to 7.
DFT (Agrawal, Faloutsos, Swami 1993).
DWT (Chan & Fu 1999): basis of Haar wavelets 0 to 7.
[Figure: three panels (PCA, DFT, DWT), each showing a series X of about 140 points and its approximation X']

Piecewise Aggregate Approximation (PAA)
Original time series (n-dimensional vector): S = {s1, s2, …, sn}
n’-segment PAA representation (n’-d vector): S = {sv1, sv2, …, svn’}
PAA representation satisfies the lower bounding lemma (Keogh, Chakrabarti, Mehrotra and Pazzani, 2000; Yi and Faloutsos 2000).
[Figure: a time series over time and value axes with its 8-segment PAA approximation, sv1 to sv8]
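
A minimal sketch of PAA, assuming the usual definition in which each svi is the mean of one of n’ equal-length frames and n is a multiple of n’. The function names are hypothetical. The distance function uses the standard PAA lower bound, the square root of (n/n’) times the sum of (qvi - svi)^2, which is stated here from the cited PAA papers rather than from this slide.

```python
import math

def paa(series, n_segments):
    """Mean of each of n_segments equal-length frames of the series."""
    n = len(series)
    if n_segments <= 0 or n % n_segments != 0:
        raise ValueError("len(series) must be a positive multiple of n_segments")
    width = n // n_segments
    return [sum(series[i * width:(i + 1) * width]) / width
            for i in range(n_segments)]

def paa_distance(q_paa, s_paa, n):
    """Lower bound on the Euclidean distance of the original length-n series."""
    frame = n / len(q_paa)  # number of original points per PAA segment
    return math.sqrt(frame * sum((a - b) ** 2 for a, b in zip(q_paa, s_paa)))

def euclidean(q, s):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, s)))

# Toy data, made up for illustration.
s = [1, 3, 2, 4, 6, 8, 5, 5]
q = [0, 0, 1, 1, 2, 2, 3, 3]
s_paa = paa(s, 4)
q_paa = paa(q, 4)
lower = paa_distance(q_paa, s_paa, len(s))
exact = euclidean(q, s)
```

On this toy input the reduced-space distance stays below the exact one, which is the property that allows an index over the reduced space to discard candidates safely.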

They are all global
All the dimensionality reduction techniques (DFT, DWT, PCA and PAA) are global: they choose a common reduced representation for all items in the database.
Is it possible to devise a representation that adapts locally to each time-series item and chooses the best representation for that item?
Researchers considered such a representation to be infeasible as it “does not allow for indexing due to its irregularity” (Yi and Faloutsos, VLDB 2000).

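Locally Adaptive Representation
n’-segment PAA representation (n’-d vector): S = {sv1, sv2, …, svn’}
Adaptive Piecewise Constant Approximation (APCA), n’/2-segment APCA representation (n’-d vector): S = {sv1, sr1, sv2, sr2, …, svM, srM}, where M is the number of segments (M = n’/2), svi is the value of the ith segment and sri is its right endpoint on the time axis.
[Figure: the same series approximated by 8 equal-length PAA segments (sv1 to sv8) and by 4 variable-length APCA segments (sv1 to sv4, with right endpoints sr1 to sr4)]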
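
To make the {svi, sri} pairs concrete, here is a small sketch that builds an APCA representation from segment boundaries that are already known, and reconstructs the approximation from it. It assumes svi is the mean of segment i and sri is a 1-based right endpoint; the function names and the toy series are invented for the example. Choosing good boundaries is a separate problem (Question 2 in the talk).

```python
def apca_from_boundaries(series, right_endpoints):
    """Build [(sv1, sr1), ..., (svM, srM)] from given 1-based right endpoints."""
    if not right_endpoints or right_endpoints[-1] != len(series):
        raise ValueError("last right endpoint must equal len(series)")
    rep, start = [], 0
    for sr in right_endpoints:
        if sr <= start:
            raise ValueError("right endpoints must be strictly increasing")
        segment = series[start:sr]
        rep.append((sum(segment) / len(segment), sr))
        start = sr
    return rep

def reconstruct(rep):
    """Expand an APCA representation back into a piecewise-constant series."""
    out, start = [], 0
    for sv, sr in rep:
        out.extend([sv] * (sr - start))
        start = sr
    return out

def squared_error(series, approx):
    return sum((a - b) ** 2 for a, b in zip(series, approx))

# Toy series with a single step: 4 equal-length segments (PAA-style)
# versus 2 adaptive segments, i.e. the same number of stored values.
series = [0, 0, 0, 0, 0, 8, 8, 8]
equal = apca_from_boundaries(series, [2, 4, 6, 8])
adaptive = apca_from_boundaries(series, [5, 8])
err_equal = squared_error(series, reconstruct(equal))
err_adaptive = squared_error(series, reconstruct(adaptive))
```

With equal-length segments the step falls inside a segment and is averaged away, while two adaptive segments capture it exactly.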

Is it any good?
Question 1: Does APCA approximate the original signal better than PAA when M = n’/2?
Question 2: Can we compute the APCA representation efficiently?
Question 3: Can we come up with a lower bounding distance measure for APCA?
Question 4: Can we index APCA?
Question 5: If all the above, how does it perform compared to other dimensionality reduction techniques?
[Figure: the same series approximated by PAA and by APCA]

Answer 1: APCA approximates original signal better than PAA
Improvement factor = (reconstruction error of PAA) / (reconstruction error of APCA)
Improvement factors shown on the slide: 1.69, 3.77, 1.21, 1.03, 3.02, 1.75

Answer 2: APCA Representation can be computed efficiently
Near-optimal representation can be computed in O(n log n) time.
Optimal representation can be computed in O(n^2 M) time.
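
The slide gives only the complexity bounds. As an illustration of where an O(n^2 M) cost can come from, the sketch below uses the textbook dynamic program for the best M-segment piecewise-constant approximation under squared error. It is an assumption that this matches the optimal method the authors have in mind, and it is not the near-optimal O(n log n) algorithm. All names are hypothetical.

```python
def optimal_apca(series, m):
    """Best m-segment piecewise-constant fit; returns (representation, error)."""
    n = len(series)
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= len(series)")
    # Prefix sums give the squared error of any segment in O(1).
    ps, ps2 = [0.0], [0.0]
    for x in series:
        ps.append(ps[-1] + x)
        ps2.append(ps2[-1] + x * x)

    def sse(i, j):
        """Squared error of series[i:j] around its own mean (j > i)."""
        total = ps[j] - ps[i]
        return (ps2[j] - ps2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[k][j]: least error covering the first j points with k segments.
    cost = [[inf] * (n + 1) for _ in range(m + 1)]
    back = [[0] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for k in range(1, m + 1):          # M values of k
        for j in range(k, n + 1):      # n values of j
            for i in range(k - 1, j):  # n candidate split points
                c = cost[k - 1][i] + sse(i, j)
                if c < cost[k][j]:
                    cost[k][j], back[k][j] = c, i
    # Walk the back-pointers to recover the right endpoints.
    bounds, j = [], n
    for k in range(m, 0, -1):
        bounds.append(j)
        j = back[k][j]
    bounds.reverse()
    rep, start = [], 0
    for sr in bounds:
        rep.append(((ps[sr] - ps[start]) / (sr - start), sr))
        start = sr
    return rep, cost[m][n]

rep, err = optimal_apca([0, 0, 0, 0, 0, 8, 8, 8], 2)
```

The three nested loops are what produce the n * n * M term, which is why a faster near-optimal method matters for long series.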

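Answer 3: Lower Bounding Distance Measure
Exact (Euclidean) distance: D(Q,S), computed between the query Q and the original series S.
Lower bounding distance: DLB(Q’,S), computed between the APCA representation of S and Q’, the query re-expressed over the same segments as S.
[Figure: Q and S with the exact distance D(Q,S), next to Q’ and the APCA approximation of S with the lower bounding distance DLB(Q’,S)]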
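
The formula itself did not survive in the transcript. The sketch below assumes the definition from the APCA paper: Q’ takes, for each segment of S, the mean qvi of Q over that segment, and DLB(Q’,S) is the square root of the sum over segments of (sri - sr(i-1)) * (qvi - svi)^2. Function names and data are invented for the example.

```python
import math

def apca_from_boundaries(series, right_endpoints):
    """[(sv, sr), ...] with sv the segment mean and sr a 1-based right endpoint."""
    rep, start = [], 0
    for sr in right_endpoints:
        segment = series[start:sr]
        rep.append((sum(segment) / len(segment), sr))
        start = sr
    return rep

def dlb(query, s_apca):
    """Lower bounding distance between a raw query and an APCA representation."""
    total, start = 0.0, 0
    for sv, sr in s_apca:
        length = sr - start
        qv = sum(query[start:sr]) / length  # Q projected onto this segment of S
        total += length * (qv - sv) ** 2
        start = sr
    return math.sqrt(total)

def euclidean(q, s):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, s)))

# Toy data, made up for illustration.
s = [0, 0, 0, 0, 0, 8, 8, 8]
q = [1, 2, 3, 4, 5, 6, 7, 8]
s_apca = apca_from_boundaries(s, [5, 8])
lower = dlb(q, s_apca)
exact = euclidean(q, s)
```

Because each svi is a segment mean, the per-segment term can never exceed the true squared distance over that segment, so the bound holds for any query.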

Can we index APCA?
What does indexability mean? If we build the index on the 2M-dimensional APCA space, can we retrieve the nearest neighbors to a query time series using the index? Nearest neighbors are based on the true distance, i.e., Euclidean distance in the original n-dimensional space.
Any feature-based index structure can be used (e.g., R-tree, X-tree, Hybrid Tree).
[Figure: APCA points S1 to S9 in the 2M-dimensional APCA space, grouped into MBRs R1 to R4, and the corresponding index tree with R1 to R4 as nodes and S1 to S9 as leaf entries]

k-NN Algorithm
The k-nearest neighbor search traverses the nodes of the multidimensional index structure in order of their distance from the query. We define the node distance as MINDIST, which is the minimum distance from the query to any point in the node boundary. In case of Query Point Movement, MinDist is computed as the distance between the centroid of relevant points and the point P which is defined as in this equation.
Correctness criteria:
  Distance DLB(Q, S) between Q and APCA point S: DLB(Q, S) ≤ Euclidean (true) distance D(Q, original(S)).
  Distance MINDIST(Q,R) between Q and the MBR R of a node U of the index: MINDIST(Q,R) ≤ Euclidean distance D(Q, original(S)) for any data item S under U.
[Figure: query Q and points S1 to S9 grouped into MBRs R1 to R4, with MINDIST(Q,R2), MINDIST(Q,R3) and MINDIST(Q,R4) marked]
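
The traversal described above can be sketched as a generic best-first search over a priority queue. This is not the authors' code: the tree layout (plain dicts), the callback names and the toy 2-D data are all assumptions. To keep the example short, 2-D points stand in for time series, the x-difference stands in for DLB, and an x-interval stands in for an MBR with MINDIST. Both satisfy the two correctness criteria, which is all the search relies on.

```python
import heapq
import itertools
import math

def knn_best_first(root, k, node_bound, item_bound, true_dist):
    """Exact k-NN using only lower bounds to order the traversal."""
    tie = itertools.count()  # unique tie-breaker, so payloads are never compared
    heap = [(node_bound(root), next(tie), "node", root)]
    results = []
    while heap and len(results) < k:
        key, _, kind, payload = heapq.heappop(heap)
        if kind == "exact":
            # Every remaining key is a lower bound >= key, so this item is next.
            results.append((payload, key))
        elif kind == "item":
            # Fetch the original object and re-queue it under its true distance.
            heapq.heappush(heap, (true_dist(payload), next(tie), "exact", payload))
        else:
            for child in payload["children"]:
                if isinstance(child, dict):
                    heapq.heappush(heap, (node_bound(child), next(tie), "node", child))
                else:
                    heapq.heappush(heap, (item_bound(child), next(tie), "item", child))
    return results

# Toy stand-ins, made up for illustration.
qx, qy = 0.0, 0.0
fetched = []  # records which items needed a true-distance computation

def node_bound(node):
    return max(node["lo"] - qx, 0.0, qx - node["hi"])

def item_bound(p):
    return abs(p[0] - qx)

def true_dist(p):
    fetched.append(p)
    return math.hypot(p[0] - qx, p[1] - qy)

leaf1 = {"lo": 0.5, "hi": 2.0, "children": [(1.0, 0.0), (0.5, 3.0), (2.0, 0.0)]}
leaf2 = {"lo": 5.0, "hi": 6.0, "children": [(5.0, 5.0), (6.0, 0.0)]}
root = {"lo": 0.5, "hi": 6.0, "children": [leaf1, leaf2]}
neighbors = knn_best_first(root, 2, node_bound, item_bound, true_dist)
```

In this run the second leaf is never opened and its two items are never fetched, which is the saving the lower bounds are meant to deliver.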

Index Modification for MINDIST Computation
APCA point: S = {sv1, sr1, sv2, sr2, …, svM, srM}
APCA rectangle: S = (L, H), where L = {smin1, sr1, smin2, sr2, …, sminM, srM} and H = {smax1, sr1, smax2, sr2, …, smaxM, srM}
[Figure: a series with four segments (right endpoints sr1 to sr4), showing for each segment the value svi and the range from smini to smaxi, alongside index MBRs R1 to R4 containing S1 to S9]
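
A short sketch of the APCA rectangle, assuming smini and smaxi are the minimum and maximum of the original values inside segment i. The second function takes the bounding box of several rectangles, which is how an index node's MBR would be formed. Names and data are hypothetical.

```python
def apca_rectangle(series, right_endpoints):
    """Return (L, H) with L = [smin1, sr1, ...] and H = [smax1, sr1, ...]."""
    low, high, start = [], [], 0
    for sr in right_endpoints:
        segment = series[start:sr]
        low += [min(segment), sr]
        high += [max(segment), sr]
        start = sr
    return low, high

def mbr_of(rectangles):
    """Smallest (L, H) enclosing every APCA rectangle in the list."""
    lows = [low for low, _ in rectangles]
    highs = [high for _, high in rectangles]
    return ([min(col) for col in zip(*lows)],
            [max(col) for col in zip(*highs)])

# Two toy series with M = 2 segments each, made up for illustration.
rect1 = apca_rectangle([1, 2, 3, 7, 8, 9], [3, 6])
rect2 = apca_rectangle([0, 0, 5, 5, 5, 5], [2, 6])
L, H = mbr_of([rect1, rect2])
```

Storing the value range instead of the single mean is what lets a node's MBR bound every original value of every series beneath it.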

MBR Representation in time-value space
We can view the MBR R = (L, H) of any node U as two APCA representations L = {l1, l2, …, l(N-1), lN} and H = {h1, h2, …, h(N-1), hN}.
Example with three regions: L = {l1, l2, l3, l4, l5, l6} and H = {h1, h2, h3, h4, h5, h6}.
[Figure: REGION 1, REGION 2 and REGION 3 drawn in time-value space, bounded by l1 to l6 and h1 to h6]

Regions
M regions are associated with each MBR. Boundaries of the ith region: l(2i-1) and h(2i-1) on the value axis, l(2i-2)+1 and h2i on the time axis.
[Figure: REGION 1, REGION 2 and REGION 3 in time-value space, with REGION i bounded by l(2i-1), h(2i-1), l(2i-2)+1 and h2i]

Regions
The ith region is active at time instant t if it spans across t.
The value st of any time series S under node U at time instant t must lie in one of the regions active at t (Lemma 2).
[Figure: REGION 1, REGION 2 and REGION 3 in time-value space, with two time instants t1 and t2 marked]

MINDIST Computation
For time instant t: MINDIST(Q,R,t) = \min_{G \text{ active at } t} MINDIST(Q,G,t)
Example at t1, where Region 1 and Region 2 are active:
MINDIST(Q,R,t1) = min(MINDIST(Q, Region1, t1), MINDIST(Q, Region2, t1)) = min((q_{t1} - h_1)^2, (q_{t1} - h_3)^2) = (q_{t1} - h_1)^2
Over all time instants: MINDIST(Q,R) = \sqrt{\sum_{t=1}^{n} MINDIST(Q,R,t)}
Lemma 3: MINDIST(Q,R) ≤ D(Q,C) for any time series C under node U.
[Figure: REGION 1, REGION 2 and REGION 3 in time-value space, with the query value at t1 lying above h1 and h3]
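
The computation can be sketched as follows, under stated assumptions: region i spans times l(2i-2)+1 to h2i (with l0 = 0) and values l(2i-1) to h(2i-1); the per-instant distance to a region is the squared gap between qt and that value interval (zero if qt is inside); and the per-instant minima are summed and square-rooted. The code is an illustration with hypothetical names, not the authors' implementation.

```python
import math

def regions(L, H):
    """Split an MBR (L, H) into M regions (t_low, t_high, v_low, v_high)."""
    out = []
    for i in range(1, len(L) // 2 + 1):
        t_low = (L[2 * i - 3] if i > 1 else 0) + 1  # l(2i-2) + 1, with l0 = 0
        t_high = H[2 * i - 1]                       # h(2i)
        v_low, v_high = L[2 * i - 2], H[2 * i - 2]  # l(2i-1), h(2i-1)
        out.append((t_low, t_high, v_low, v_high))
    return out

def mindist(query, L, H):
    """Lower bound on the distance from query to any series under the MBR."""
    regs = regions(L, H)
    total = 0.0
    for t, q in enumerate(query, start=1):  # time instants are 1-based
        best = None
        for t_low, t_high, v_low, v_high in regs:
            if t_low <= t <= t_high:        # region is active at t
                if q < v_low:
                    d = (v_low - q) ** 2
                elif q > v_high:
                    d = (q - v_high) ** 2
                else:
                    d = 0.0
                best = d if best is None else min(best, d)
        if best is None:
            raise ValueError("no region is active at t=%d" % t)
        total += best
    return math.sqrt(total)

# MBR enclosing the APCA rectangles of two toy series (M = 2):
#   [1, 2, 3, 7, 8, 9] with right endpoints (3, 6)
#   [0, 0, 5, 5, 5, 5] with right endpoints (2, 6)
L = [0, 2, 5, 6]
H = [3, 3, 9, 6]
query = [4, 4, 4, 4, 4, 4]
bound = mindist(query, L, H)
```

Here the two regions overlap at t = 3, so both are active there and the smaller of the two gaps is used, as in the slide's example at t1.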

Other Queries
Range search
Approximate search

Answer 5: Comparison of APCA with other techniques (Pruning Power)
Dataset: electrocardiogram data, 100,000 objects, n (query length) varying from 256 to 1024.
Pruning power = number of objects examined for a 1-NN query.
[Figure: number of objects examined by Fourier, Wavelet/PAA and APCA, plotted against original dimensionality n (256-1024) and reduced dimensionality n’ (16-64)]

Comparison of APCA with other techniques (Index Performance)
Electrocardiogram data, Hybrid Tree used to index the reduced space.
[Figure: number of random disk accesses and CPU time (sec) for Linear Scan, Fourier, Wavelet/PAA and APCA, plotted against original dimensionality n (256-1024) and reduced dimensionality n’ (16-64)]

Summary of APCA
APCA is a new dimensionality reduction technique for time series data that:
  Approximates the original data better by adapting to each data item “locally”.
  Can be indexed using a multidimensional index structure.
  Outperforms existing techniques by one to two orders of magnitude in terms of search performance.
As of 2005, it has been referenced well over 100 times and implemented at least 20 times.