Published by Keely Lyddon. Modified over 3 years ago.

1
**In Search of Meaning for Time Series Subsequence Clustering**

Dina Goldin, Brown University. Work done with Ricardo Mardales (UConn) and George Nagy (RPI). CIKM, Nov. 8, 2006

2
**The “Meaningless” Paper**

- [KLT03] Keogh, E., Lin, J., Truppel, W. Clustering of Time Series is meaningless. Proc. IEEE Conf. on Data Mining (2003)
- [KL05] Keogh, E. & Lin, J. Clustering of time-series subsequences is meaningless: implications for previous and future research. J. Knowledge and Inf. Sys. 8:2 (2005)

The claim: clustering of time series subsequences is meaningless [because] the result of clustering these subsequences is independent of the input.

November 8, 2006 CIKM’06

3
**Implications of “Meaningless” Result**

- It “cast a shadow over STS clustering”.
- Jeopardized the legitimacy of research that had used subsequence clustering.
- Led to a flurry of follow-up research:
  - Chen ’05 uses cyclical data and k-medoids
  - Simon et al. ’05 uses self-organizing maps
  - Denton ’04 uses density-based clustering
  - Struzik ’03 uses correlation for trivial matches
  - Bagnall ’03, Mahoney ’05, Rodrigues et al. ’04 moved away from STS
- No one had challenged the results head-on, i.e., shown that the output and input of STS clustering are not independent.

4
**Independence of Input and Output**

[Diagram: time series X and Y are fed to the STS clustering algorithm, which produces cluster sets A, B, and C.]

- Is there a way to match C to the right time series (X or Y) reliably?
- Before: NO; cluster_set_dist(C,B) / cluster_set_dist(C,A) is not small
- Our work: YES. Find a different distance measure!

5
**Outline**

- Introduction
- New Distance Measure for Cluster Sets (based on the notion of cluster shapes)
- STS Cluster Matching
- Observations and Conclusions

Note: STS = Subsequence Time Series.

6
**STS Clustering**

- Consider all subsequences of the same time series: time series T of length m, window size w
- Normalize each subsequence so that its average is 0 and its standard deviation is 1: Normalize(x) = (x - avg(x)) / stddev(x)
- Cluster the normalized subsequences using the K-means clustering algorithm
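The windowing and normalization steps can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `z_normalize` and `sts_subsequences` are made up here, the population standard deviation is assumed (the slides do not say which variant was used), and mapping a constant window to zeros is an added convention to avoid division by zero.

```python
import math

def z_normalize(window):
    """Shift a window to mean 0 and scale it to standard deviation 1."""
    n = len(window)
    mean = sum(window) / n
    # Assumption: population standard deviation (divide by n).
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / n)
    if std == 0.0:
        # Assumption: a constant window carries no shape; map it to zeros.
        return [0.0] * n
    return [(v - mean) / std for v in window]

def sts_subsequences(series, w):
    """Return all m - w + 1 normalized sliding windows of length w."""
    return [z_normalize(series[i:i + w]) for i in range(len(series) - w + 1)]

# Toy series, chosen only for illustration.
series = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]
windows = sts_subsequences(series, w=3)
print(len(windows))  # m - w + 1 = 6 - 3 + 1 = 4
```

Each normalized window is a point of dimension w, which is what the K-means step then receives as input.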

7
**K-means Clustering**

- Given a set of multidimensional points (of dimension w), partition it into K groups so that each point belongs to exactly one cluster.
- Compute the center of each cluster; it is the mean of all points in the cluster.
- Result: a set of K cluster centers.

Notes: an item is a member of one group only; a group must have at least one element; the cluster representative is the mean.
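A bare-bones version of this procedure (Lloyd's iteration) might look like the sketch below. The names `kmeans`, `squared_dist`, and `mean_point`, the random initialization from the data, and the iteration cap are all assumptions made for illustration; the slides do not describe the implementation that was used. As a simplification, a cluster that becomes empty keeps its previous center.

```python
import random

def squared_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, k, iters=50, seed=0):
    """Return k cluster centers found by alternating assign/update steps."""
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct points drawn from the data.
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_dist(p, centers[j]))
            groups[nearest].append(p)
        # Simplification: an empty group keeps its old center.
        new_centers = [mean_point(g) if g else centers[j]
                       for j, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

# Two well-separated toy groups of 2-D points.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers = sorted(kmeans(points, k=2))
print(centers)
```

Because the result depends on the random starting centers, different runs on the same series can return different center sets, which is why the later slides average over many runs.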

8
**Cluster Set Distance**

- Previous approach to measuring distance between cluster sets
- Returns the sum of Euclidean distances between cluster centers

[Figure: cluster sets A and B, with cluster_set_dist(B,A) marked between them]
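One way to read this definition is sketched below. The slides do not say how centers in one set are paired with centers in the other; this sketch assumes each center in A is paired with its nearest center in B, so treat the pairing rule as an assumption rather than the exact measure used in the earlier work.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cluster_set_dist(A, B):
    """Sum, over centers in A, of the distance to the nearest center in B.

    Assumption: nearest-center pairing; the slides only say "sum of
    Euclidean distances between cluster centers".
    """
    return sum(min(euclidean(a, b) for b in B) for a in A)

# Toy cluster sets: B is A shifted by (10, 10).
A = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
B = [[x + 10.0, y + 10.0] for x, y in A]

print(cluster_set_dist(A, A))  # identical sets are at distance zero
print(cluster_set_dist(A, B))  # a pure translation already gives a large value
```

The second value is large even though B has exactly the same geometry as A, which illustrates how this measure depends on the absolute positions of the centers.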

9
**Cluster Shape Distance**

- New distance measure for cluster sets
- Returns the Euclidean distance between cluster set shapes
- Cluster set shape: sorted list of pairwise distances between cluster centers; has K*(K-1)/2 values

[Figure: cluster set A with centers X, Y, Z, and cluster set B]

- Shape of cluster set A = [XZ, ZY, XY]
- A and B have the same shape (B is a rotated and translated copy of A), so cluster_shape_dist(A,B) = 0
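The shape-based measure can be sketched directly from this definition. The function names follow the slide's `cluster_shape_dist`; `cluster_shape` and `euclidean` are helper names introduced here, and ascending sort order is an assumption (the slide only says "sorted").

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cluster_shape(centers):
    """Sorted list of the K*(K-1)/2 pairwise distances between centers."""
    # Assumption: ascending order.
    return sorted(euclidean(p, q) for p, q in combinations(centers, 2))

def cluster_shape_dist(A, B):
    """Euclidean distance between the shapes of two cluster sets."""
    return euclidean(cluster_shape(A), cluster_shape(B))

# A is a 3-4-5 right triangle; B is A rotated by 90 degrees
# and then translated by (10, 10).
A = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
B = [[10.0, 10.0], [10.0, 13.0], [6.0, 10.0]]

print(cluster_shape(A))          # [3.0, 4.0, 5.0]
print(cluster_shape_dist(A, B))  # 0.0
```

Because only distances between centers enter the shape, any rotation or translation of a cluster set leaves it unchanged.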

10
**Cluster Shape Example**

STS clustering for the ocean series with K=3.

Note: all our datasets come from the UC Riverside repository.

[Figure: D’s are the pairwise distances between cluster centers]

11
**Cluster Structure**

- Sort the pairwise distances.
- Observation: for each K and w, the shapes obtained from different STS clustering runs are similar!
- Cluster structure D_T: the average of cluster set shapes from many clustering runs over T.
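Averaging shapes into a structure can be sketched as an element-wise mean. The function name `cluster_structure` and the shape values below are invented for illustration; the slides do not specify how the average is taken, so position-by-position averaging of the sorted lists is an assumption.

```python
def cluster_structure(shapes):
    """Element-wise average of cluster set shapes from several runs.

    Assumption: every shape has the same length K*(K-1)/2 and is sorted
    the same way, so values are averaged position by position.
    """
    runs = len(shapes)
    return [sum(values) / runs for values in zip(*shapes)]

# Hypothetical shapes from three clustering runs with K = 3.
shapes = [
    [1.0, 2.0, 3.0],
    [1.2, 2.2, 2.8],
    [0.8, 1.8, 3.2],
]
structure = cluster_structure(shapes)
print(structure)  # close to [1.0, 2.0, 3.0]
```

When the individual shapes are similar, as the observation above reports, their average is a stable summary of the series for the given K and w.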

12
**Cluster Structure: Example**

Cluster structures for datasets from the UCR repository.

[Figure panels: k=3, w=8; k=3, w=16; k=4, w=8]

13
**Outline**

- Introduction
- New Distance Measure for Cluster Sets
- STS Cluster Matching
- Observations and Conclusions

14
**STS Cluster Matching Problem**

- Given a dataset of multiple time series and a cluster center set from one of them (the “query”), match it to the series that produced it.
- Note: K and w are assumed to be fixed.
- Matching algorithm: outputs a guess as to which of the N time series in the dataset produced the query.
- Algorithm accuracy: percentage of times that the matching algorithm is correct.
- Note: no previous work succeeded in attaining high accuracy, even with a dataset of size 2!

15
**Matching Algorithm**

Pre-processing phase:

- For each sequence in the dataset, perform Q clustering runs with the given K and w, and calculate its cluster structure.
- Store all the structures in a master table.

Matching phase:

1. Given a query, find the Euclidean distance from its shape to each of the structures in the master table.
2. Return the sequence whose structure is the closest.

Pre-processing can be done offline and saved for later use.
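The matching phase can be sketched as a nearest-neighbor lookup in the master table. Everything named here is illustrative: `match`, the dictionary layout of the table, and the series names and structure values are hypothetical and not taken from the paper. The table is assumed to have been filled in during pre-processing.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cluster_shape(centers):
    """Sorted pairwise distances between cluster centers (ascending)."""
    return sorted(euclidean(p, q) for p, q in combinations(centers, 2))

def match(query_centers, master_table):
    """Return the name of the series whose structure is closest to the query's shape."""
    query_shape = cluster_shape(query_centers)
    return min(master_table,
               key=lambda name: euclidean(query_shape, master_table[name]))

# Hypothetical master table for K = 3: series name -> cluster structure.
master_table = {
    "series_a": [1.0, 2.0, 3.0],
    "series_b": [2.0, 2.5, 4.0],
}

# Hypothetical query: three collinear centers with pairwise distances 1, 2, 3.
query = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
print(match(query, master_table))  # series_a
```

Since the table holds one short vector per series, the lookup is cheap; the expensive part is the Q clustering runs per series, which is why the pre-processing can be done offline.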

16
**Example**

Master table for k=3, w=8. Look at triplets, not single numbers.

[Table: master table for k=3, w=8]

17
**Performance Evaluation**

- 10 datasets from the UCR time series repository
- 100 clustering runs per structure
- Algorithm evaluated with 3 values of K and 4 values of w (12 combinations)
- Result: 100% accuracy

18
**Outline**

- Introduction
- New Distance Measure for Cluster Sets
- STS Cluster Matching Algorithm
- Observations and Conclusions

19
**Conclusions**

- Previous work seemed to show that the output of STS clustering is independent of the input.
- The correct conclusion: cluster set distance is an inappropriate distance metric.
- Instead of the absolute positions of cluster centers, one needs to use relative positions (as represented by cluster shapes).
- STS clustering becomes meaningful: cluster centers are reliably matched to the original series.
- We also found a correlation between some characteristics (number of unique shapes, shape skew) and sequence smoothness.

20
**Future Work**

WHY?

- Difference in behavior between whole-sequence and subsequence clustering? (Some preliminary answers are in the paper.)
- Apparent presence of transformations among cluster sets?
- Dependency between smoothness, skew, number of unique clusters, etc.?

HOW?

- Find the expected accuracy of the matching algorithm for a given input and Q (number of clustering runs to compute each structure).

21
Questions?

22
Thank you!
