
1 Locally adaptive dimensionality reduction for indexing large time series databases
Keogh, E., Chakrabarti, K., Pazzani, M. & Mehrotra, S. (2001). Locally adaptive dimensionality reduction for indexing large time series databases. In Proceedings of the ACM SIGMOD Conference on Management of Data, May 2001.

2 Motivating Applications
Time Series Similarity Retrieval. Typical query: find time series objects similar to a given pattern.
Example applications:
- Medical diagnosis: a doctor searching an ECG database for a pattern that implies a heart irregularity
- Financial data: a stock analyst searching for a stock price pattern to use in prediction
- Time-series mining: similarity search as a component inside a mining algorithm
- Other applications: scientific, spatial/spatio-temporal data
[Figure: a time series (.13, .19, .28, .38, .30, .25, .32, …) viewed as a high-dimensional vector representation]

3 Similarity Search in Time Series Data
[Figure: a query Q and a database of time series, each with n datapoints; the series at distances 0.98, 0.07, 0.21 and 0.43 from Q receive ranks 4, 1, 2 and 3]
Euclidean distance between two time series Q = {q1, q2, …, qn} and S = {s1, s2, …, sn}: see the formula below.
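The Euclidean distance used throughout the slides is the standard one, restated here for reference:

```latex
D(Q, S) = \sqrt{\sum_{i=1}^{n} (q_i - s_i)^2}
```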

4 Mapping to Multidimensional Space
[Figure: the query Q and database sequences S1 to S4, each with n datapoints, mapped to points in an n-dimensional space]
Index the n-dimensional space using a multidimensional index structure to avoid a slow sequential scan.
The original dimensionality n is too high for an index structure to handle directly, so it must be reduced from n to a much smaller n' that the index structure can handle.

5 Dimensionality Reduction Techniques for Time Series Data
[Figure: a time series X and its reconstruction X' under three global dimensionality reduction techniques]
- PCA (basis: eigenwave 0 through eigenwave 7): Korn, Jagadish & Faloutsos 1997
- DFT (Fourier basis): Agrawal, Faloutsos & Swami 1993
- DWT (Haar basis: Haar 0 through Haar 7): Chan & Fu 1999

6 Piecewise Aggregate Approximation (PAA)
[Figure: a time series and its 8-segment PAA approximation, drawn on time and value axes]
Original time series (n-dimensional vector): S = {s1, s2, …, sn}
n'-segment PAA representation (n'-dimensional vector): S = {sv1, sv2, …, svn'}, where svi is the mean of the ith equal-length segment
The PAA representation satisfies the lower bounding lemma (Keogh, Chakrabarti, Mehrotra & Pazzani, 2000; Yi & Faloutsos, 2000)
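A minimal sketch of PAA as described above, assuming n is divisible by n' (variable names are illustrative, not from the paper):

```python
import numpy as np

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of each equal-length segment."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    assert n % n_segments == 0, "sketch assumes n is divisible by n'"
    # Reshape into (n_segments, segment_length) and average each row.
    return series.reshape(n_segments, -1).mean(axis=1)

# Example: a 128-point series reduced to an 8-dimensional PAA vector.
s = np.sin(np.linspace(0, 4 * np.pi, 128))
print(paa(s, 8))
```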

7 They are all global
All of these dimensionality reduction techniques (DFT, DWT, PCA and PAA) are global: they choose a common reduced representation for all items in the database.
Is it possible to devise a representation that adapts locally to each time-series item and chooses the best representation for that item?
Researchers considered such a representation infeasible because it "does not allow for indexing due to its irregularity" (Yi and Faloutsos, VLDB 2000).

8 Locally Adaptive Representation
[Figure: the same series approximated with 8 equal-length PAA segments and with 4 variable-length APCA segments]
n'-segment PAA representation (n'-dimensional vector): S = {sv1, sv2, …, svn'}
Adaptive Piecewise Constant Approximation (APCA): segments may have different lengths.
M-segment APCA representation (n'-dimensional vector, where M = n'/2 is the number of segments): S = {sv1, sr1, sv2, sr2, …, svM, srM}, where svi is the mean value of the ith segment and sri is its right endpoint (the position of its last datapoint).

9 Is it any good?
Question 1: Does APCA approximate the original signal better than PAA when M = n'/2?
Question 2: Can we compute the APCA representation efficiently?
Question 3: Can we come up with a lower bounding distance measure for APCA?
Question 4: Can we index APCA?
Question 5: If all of the above, how does it perform compared to other dimensionality reduction techniques?
[Figure: the same series under PAA and APCA]

10 Answer 1: APCA approximates original signal better than PAA
[Figure: reconstruction error of PAA vs. APCA on six datasets; APCA improves on PAA by factors of 1.69, 3.77, 1.21, 1.03, 3.02 and 1.75]

11 Answer 2: APCA Representation can be computed efficiently
A near-optimal representation can be computed in O(n log n) time; the optimal representation can be computed in O(n²M) time.
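The paper's near-optimal O(n log n) algorithm is wavelet-based and is not reproduced here. As an illustration of the optimal case only, the following is a sketch of a standard O(n²M) dynamic program for the best M-segment piecewise-constant fit, returning (svi, sri) pairs; all names are illustrative, not from the paper.

```python
import numpy as np

def optimal_apca(series, M):
    """Optimal M-segment piecewise-constant approximation via dynamic programming.

    Returns a list of (mean_value, right_endpoint) pairs, i.e. the (sv_i, sr_i)
    form used on the earlier slides. Runs in O(n^2 * M) time.
    """
    x = np.asarray(series, dtype=float)
    n = len(x)
    # Prefix sums give the squared error of fitting any run x[i:j) by its mean in O(1).
    ps = np.concatenate(([0.0], np.cumsum(x)))
    ps2 = np.concatenate(([0.0], np.cumsum(x * x)))

    def seg_err(i, j):  # squared error of one segment covering x[i:j)
        s, s2, m = ps[j] - ps[i], ps2[j] - ps2[i], j - i
        return s2 - s * s / m

    INF = float("inf")
    # cost[k][j]: best total error covering x[:j] with exactly k segments
    cost = [[INF] * (n + 1) for _ in range(M + 1)]
    cut = [[0] * (n + 1) for _ in range(M + 1)]
    cost[0][0] = 0.0
    for k in range(1, M + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = cost[k - 1][i] + seg_err(i, j)
                if c < cost[k][j]:
                    cost[k][j], cut[k][j] = c, i
    # Walk the cut points backwards to recover the segments.
    segments, j = [], n
    for k in range(M, 0, -1):
        i = cut[k][j]
        segments.append((float(x[i:j].mean()), j - 1))  # (sv, sr), 0-indexed endpoint
        j = i
    return segments[::-1]
```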

12 Answer 3: Lower Bounding Distance Measure
[Figure: the exact (Euclidean) distance D(Q,S) between Q and S, and the lower bounding distance DLB(Q',S), where Q' is Q projected onto the segments of S's APCA representation]
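A sketch of the lower bounding distance, following the paper's definition as recalled here (qvi is the mean of Q over the ith segment of S, and sr0 = 0):

```latex
D_{LB}(Q', S) = \sqrt{\sum_{i=1}^{M} (sr_i - sr_{i-1}) \, (qv_i - sv_i)^2}
```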

13 Is it any good? (recap)
Question 4: Can we index APCA?
[Same list of Questions 1-5 and the PAA/APCA figure as slide 9, now highlighting Question 4]

14 Can we index APCA?
What does indexability mean? If we build the index on the 2M-dimensional APCA space, can we retrieve the nearest neighbors of a query time series using the index?
Nearest neighbors are defined by the true distance, i.e., Euclidean distance in the original n-dimensional space.
Any feature-based index structure can be used (e.g., R-tree, X-tree, Hybrid Tree).
[Figure: APCA points S1-S9 in the 2M-dimensional APCA space, grouped under index nodes with MBRs R1-R4]
The k-nearest-neighbor search traverses the nodes of the multidimensional index structure in order of their distance from the query; the node distance MINDIST is the minimum distance from the query to any point within the node's boundary.

15 k-NN Algorithm
[Figure: the query Q, data points S1-S9 and MBRs R1-R4, with MINDIST(Q,R2), MINDIST(Q,R3) and MINDIST(Q,R4) shown]
Correctness criteria:
Distance DLB(Q,S) between Q and an APCA point S: DLB(Q,S) ≤ D(Q, original(S)), the Euclidean (true) distance.
Distance MINDIST(Q,R) between Q and the MBR R of an index node U: MINDIST(Q,R) ≤ D(Q, original(S)) for any data item S under U.
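A sketch of the standard best-first (GEMINI-style) k-NN traversal that these two conditions make exact; the index interface (children, is_leaf, mbr, original) is illustrative, not the paper's API:

```python
import heapq

def knn_search(root, query, k, mindist, d_lb, d_true):
    """Best-first k-NN over a feature index. Exact as long as mindist and d_lb
    lower-bound the true distance d_true (the two correctness criteria above)."""
    heap = [(0.0, 0, root, "node")]   # entries: (priority, tie_breaker, item, kind)
    counter = 1                        # tie-breaker so heapq never compares items
    results = []                       # (true distance, matching item), in order
    while heap and len(results) < k:
        dist, _, item, kind = heapq.heappop(heap)
        if kind == "node":
            for child in item.children:
                if item.is_leaf:
                    # children of a leaf are APCA points: queue them by lower bound
                    heapq.heappush(heap, (d_lb(query, child), counter, child, "apca"))
                else:
                    # children of an internal node: queue them by MINDIST to their MBR
                    heapq.heappush(heap, (mindist(query, child.mbr), counter, child, "node"))
                counter += 1
        elif kind == "apca":
            # Refinement step: fetch the original series and re-queue its exact distance.
            heapq.heappush(heap, (d_true(query, item.original), counter, item, "exact"))
            counter += 1
        else:
            # An exact distance popped from the heap cannot be beaten by anything still
            # queued, because every queued priority is a lower bound on a true distance.
            results.append((dist, item))
    return results
```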

16 Index Modification for MINDIST Computation
APCA point: S = {sv1, sr1, sv2, sr2, …, svM, srM}
For MINDIST computation, each APCA point is stored in the index as an APCA rectangle S = (L, H), where L = {smin1, sr1, smin2, sr2, …, sminM, srM} and H = {smax1, sr1, smax2, sr2, …, smaxM, srM}, with smini and smaxi the minimum and maximum values of the original series within the ith segment.
[Figure: the APCA rectangles of S1-S9 grouped into MBRs R1-R4 in time-value space]
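A minimal sketch of this modification, building (L, H) from the original series and its (svi, sri) segments; it continues the illustrative names used in the earlier sketches:

```python
import numpy as np

def apca_rectangle(series, segments):
    """Index entry for an APCA point: replace each segment's value with the
    min/max of the original data inside that segment, keeping the endpoints."""
    L, H, start = [], [], 0
    for _, right in segments:              # segments = [(sv_i, sr_i), ...], 0-indexed endpoints
        chunk = np.asarray(series[start:right + 1], dtype=float)
        L += [chunk.min(), float(right)]   # {smin_i, sr_i}
        H += [chunk.max(), float(right)]   # {smax_i, sr_i}
        start = right + 1
    return np.array(L), np.array(H)

# The segments could come from the optimal_apca sketch shown earlier.
```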

17 MBR Representation in time-value space
We can view the MBR R = (L, H) of any node U as two APCA representations L = {l1, l2, …, l(N-1), lN} and H = {h1, h2, …, h(N-1), hN}, where N = 2M.
[Figure: an MBR drawn in time-value space as three regions, with L = {l1, l2, l3, l4, l5, l6} and H = {h1, h2, h3, h4, h5, h6}]

18 Regions
Each MBR is associated with M regions. The boundaries of the ith region are:
value range [l(2i-1), h(2i-1)], time range [l(2i-2)+1, h(2i)] (taking l0 = 0).
[Figure: the three regions of the example MBR, bounded by l1-l6 and h1-h6 in time-value space]

19 Regions
The ith region is active at time instant t if it spans across t.
The value st of any time series S under node U at time instant t must lie within one of the regions active at t (Lemma 2).
[Figure: the example regions in time-value space, with two time instants t1 and t2 marked]

20 MINDIST Computation
For a time instant t, MINDIST(Q, R, t) = min over all regions G active at t of MINDIST(Q, G, t).
Example at time t1, where regions 1 and 2 are active and qt1 lies above both:
MINDIST(Q, R, t1) = min(MINDIST(Q, Region1, t1), MINDIST(Q, Region2, t1)) = min((qt1 - h1)², (qt1 - h3)²) = (qt1 - h1)²
MINDIST(Q, R) combines the per-instant values over all n time instants (see the sketch below).
Lemma 3: MINDIST(Q, R) ≤ D(Q, C) for any time series C under node U.
[Figure: the example regions and the query value at t1]
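A sketch of the full formula as recalled from the paper: the per-region term is the squared distance from qt to the region's value range, and MINDIST(Q, R) aggregates over all n time instants.

```latex
\mathrm{MINDIST}(Q, G_i, t) =
\begin{cases}
(l_{2i-1} - q_t)^2 & \text{if } q_t < l_{2i-1}\\
(q_t - h_{2i-1})^2 & \text{if } q_t > h_{2i-1}\\
0 & \text{otherwise}
\end{cases}
\qquad
\mathrm{MINDIST}(Q, R) = \sqrt{\sum_{t=1}^{n} \mathrm{MINDIST}(Q, R, t)}
```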

21 Other Queries
Range search
Approximate search

22 Answer 5: Comparison of APCA with other techniques (Pruning Power)
Dataset: electrocardiogram data, 100,000 objects, query length n varying from 256 to 1024.
Pruning power = number of objects examined for a 1-NN query.
[Figure: number of objects examined by Fourier, Wavelet/PAA and APCA as the original dimensionality n (256 to 1024) and reduced dimensionality n' (16-64) vary; APCA examines the fewest]

23 Comparison of APCA with other techniques (Index Performance)
Electrocardiogram data; the Hybrid Tree is used to index the reduced space.
[Figure: number of random disk accesses and CPU time (sec) for linear scan, Fourier, Wavelet/PAA and APCA, over original dimensionality n (256 to 1024) and reduced dimensionality n' (16-64)]

24 Summary of APCA
APCA is a new dimensionality reduction technique for time series data that:
- approximates the original data better by adapting to each data item "locally"
- can be indexed using a multidimensional index structure
- outperforms existing techniques by one to two orders of magnitude in terms of search performance
As of 2005, the paper has been referenced well over 100 times and implemented at least 20 times.

