In Search of Meaning for Time Series Subsequence Clustering
Dina Goldin, Brown University
Work done with Ricardo Mardales, UConn, and George Nagy, RPI
CIKM, Nov. 8, 2006
The Meaningless Paper
[KLT03] Keogh, E., Lin, J., Truppel, W. Clustering of Time Series is meaningless. Proc. IEEE Conf. on Data Mining (2003)
[KL05] Keogh, E. & Lin, J. Clustering of time-series subsequences is meaningless: implications for previous and future research. J. Knowledge and Inf. Sys. 8:2 (2005)
Clustering of time series subsequences is meaningless [because] the result of clustering these subsequences is independent of the input.
Implications of Meaningless Result
- It cast a shadow over STS (subsequence time series) clustering.
- It jeopardized the legitimacy of research that had used subsequence clustering.
- It led to a flurry of follow-up research:
  - Chen 05 uses cyclical data and k-medoids
  - Simon et al. 05 uses self-organizing maps
  - Denton 04 uses density-based clustering
  - Struzik 03 uses correlation for trivial matches
  - Bagnall 03, Mahoney 05, Rodrigues et al. 04 moved away from STS
- No one had challenged the result head-on, i.e., by showing that the output and input of STS clustering are not independent.
Independence of Input and Output
[Diagram: time series X and Y are fed to the STS clustering algorithm, which outputs cluster sets A, C, and B.]
Is there a way to match C to the right time series (X or Y) reliably?
- Before: NO; cluster_set_dist(C,B) / cluster_set_dist(C,A) is not small.
- Our work: YES. Find a different distance measure!
Outline
1. Introduction
2. New Distance Measure for Cluster Sets, based on the notion of cluster shapes
3. STS Cluster Matching
4. Observations and Conclusions
STS Clustering
- Consider all subsequences of length w of the same time series (time series T of length m, window size w).
- Normalize each subsequence so that its average is 0 and its standard deviation is 1: Normalize(x) = (x - avg(x)) / stddev(x)
- Cluster the normalized subsequences using the K-means clustering algorithm.
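The first two steps on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `z_normalize` and `sts_subsequences` are hypothetical, the population standard deviation is an assumption (the slides do not say which variant was used), and mapping a constant window to zeros is one possible convention.

```python
import math

def z_normalize(window):
    """Shift a window to mean 0 and scale it to standard deviation 1."""
    n = len(window)
    mean = sum(window) / n
    # Assumption: population standard deviation (divide by n).
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / n)
    if std == 0:
        # Assumption: a constant window carries no shape, so map it to zeros.
        return [0.0] * n
    return [(v - mean) / std for v in window]

def sts_subsequences(series, w):
    """Return all m - w + 1 normalized sliding windows of length w."""
    return [z_normalize(series[i:i + w]) for i in range(len(series) - w + 1)]

# A toy series of length m = 6 with window size w = 3 yields 4 subsequences.
series = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]
windows = sts_subsequences(series, w=3)
```

Each entry of `windows` is a point of dimension w, ready to be passed to a clustering routine.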
K-means Clustering
- Given a set of multidimensional points (of dimension w), partition it into K groups, so that each point belongs to exactly one cluster.
- Compute the center of each cluster; it is the mean of all points in the cluster.
- Result: a set of K cluster centers.
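A small sketch of the standard K-means iteration (Lloyd's algorithm) shows how the partition and the centers are computed together. The slides do not say how the centers were initialized or when the iteration stopped, so the random initialization, the `iterations` cap, and the `seed` parameter below are assumptions made for illustration.

```python
import math
import random

def kmeans(points, k, iterations=50, seed=0):
    """Return k cluster centers for a list of equal-length points."""
    rng = random.Random(seed)
    # Assumption: initialize the centers with k distinct input points.
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = []
        for j, group in enumerate(groups):
            if group:
                new_centers.append([sum(col) / len(group) for col in zip(*group)])
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # no center moved, so the partition is stable
        centers = new_centers
    return centers

points = [[0.0], [1.0], [10.0], [11.0]]
centers = kmeans(points, k=2)
```

The order of the returned centers depends on the initialization, which is one reason a comparison of cluster sets has to decide how to pair centers up.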
Cluster Set Distance
- Previous approach to measuring the distance between cluster sets
- Returns the sum of Euclidean distances between cluster centers
[Figure: cluster sets A and B, with cluster_set_dist(B,A) marked between them]
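The slide does not spell out how centers of one set are paired with centers of the other. The sketch below assumes that each center in the first set is paired with its nearest center in the second set and that those distances are summed; that pairing rule is an assumption made for illustration, not something stated in the slides.

```python
import math

def cluster_set_dist(a, b):
    """Sum, over the centers in a, of the distance to the nearest center in b."""
    # Assumption: nearest-center pairing; the slide only says "sum of
    # Euclidean distances between cluster centers".
    return sum(min(math.dist(p, q) for q in b) for p in a)

set_a = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
# set_b is set_a translated by (10, 10): same arrangement, different position.
set_b = [[x + 10.0, y + 10.0] for x, y in set_a]

same = cluster_set_dist(set_a, set_a)      # 0.0
shifted = cluster_set_dist(set_a, set_b)   # positive, despite identical arrangement
```

Because this measure depends on the absolute positions of the centers, a translated copy of the same arrangement is reported as different.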
Cluster Shape Distance
- New distance measure for cluster sets
- Returns the Euclidean distance between cluster set shapes
- Cluster set shape: sorted list of pairwise distances between cluster centers; it has K*(K-1)/2 values
[Figure: cluster set A with centers X, Y, Z, and cluster set B]
Shape of cluster set A = [XZ, ZY, XY]
A and B have the same shape (B is a rotated and translated copy of A), so cluster_shape_dist(A,B) = 0.
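The definition above translates almost directly into code. In this sketch the function names follow the slide (`cluster_shape_dist`) or are hypothetical (`cluster_shape`), and the example coordinates are made up to reproduce the rotated-and-translated case.

```python
import math
from itertools import combinations

def cluster_shape(centers):
    """Sorted list of the K*(K-1)/2 pairwise distances between centers."""
    return sorted(math.dist(p, q) for p, q in combinations(centers, 2))

def cluster_shape_dist(a, b):
    """Euclidean distance between the shapes of two cluster sets (same K)."""
    return math.dist(cluster_shape(a), cluster_shape(b))

# Hypothetical cluster set A: a 3-4-5 right triangle of centers.
set_a = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
# B is A rotated by 90 degrees and translated by (10, 10): (x, y) -> (10 - y, 10 + x).
set_b = [(10.0 - y, 10.0 + x) for x, y in set_a]

shape_a = cluster_shape(set_a)                 # pairwise distances 3, 4, 5 (up to rounding)
distance = cluster_shape_dist(set_a, set_b)    # 0 (up to rounding)
```

Sorting makes the shape independent of the order in which the clustering run happened to return its centers.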
Cluster Shape Example
STS clustering for the ocean series with K=3
[Figure: cluster centers; Ds: pairwise distances between cluster centers]
Note: all our datasets come from the UC Riverside repository.
Cluster Structure
- Sort the pairwise distances.
- Observation: for each K and w, the shapes obtained from different STS clustering runs are similar!
- Cluster structure of T: the average of the cluster set shapes from many clustering runs over T.
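Since every shape for a given K is a sorted list of the same length, the averaging can be read as an element-wise mean. That reading is an assumption (the slide only says "average"), and the function name and shape values below are hypothetical.

```python
def cluster_structure(shapes):
    """Element-wise mean of cluster set shapes from several clustering runs."""
    runs = len(shapes)
    # Assumption: the i-th smallest distances of all runs are averaged together.
    return [sum(values) / runs for values in zip(*shapes)]

# Hypothetical shapes from three runs over the same series with K = 3.
shapes = [
    [1.0, 2.0, 3.0],
    [1.2, 2.1, 2.9],
    [0.8, 1.9, 3.1],
]
structure = cluster_structure(shapes)
```

With Q runs, `shapes` would hold Q lists, one per run.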
Cluster Structure: Example
Cluster structures for datasets from the UCR repository
[Figure panels: k=3, w=8; k=3, w=16; k=4, w=8]
Outline
1. Introduction
2. New Distance Measure for Cluster Sets
3. STS Cluster Matching
4. Observations and Conclusions
STS Cluster Matching Problem
- Given a dataset of multiple time series and a cluster center set from one of them (the query), match it to the series that produced it. Note: K and w are assumed to be fixed.
- Matching algorithm: outputs a guess as to which of the N time series in the dataset produced the query.
- Algorithm accuracy: percentage of times that the matching algorithm is correct.
- Note: no previous work succeeded in attaining high accuracy, even with a dataset of size 2!
Matching Algorithm
Pre-processing phase:
1. For each sequence in the dataset, perform Q clustering runs with the given K and w, and calculate its cluster structure.
2. Store all the structures in a master table.
Matching phase:
1. Given a query, find the Euclidean distance from its shape to each of the structures in the master table.
2. Return the sequence whose structure is the closest.
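The matching phase amounts to a nearest-neighbor lookup in the master table. The sketch below assumes the table is a dictionary from series name to cluster structure; the function name, series names, and numbers are hypothetical and only illustrate the lookup.

```python
import math

def match_query(query_shape, master_table):
    """Return the name of the series whose structure is closest to the query shape."""
    return min(
        master_table,
        key=lambda name: math.dist(query_shape, master_table[name]),
    )

# Hypothetical master table for K = 3 (three pairwise distances per structure).
master_table = {
    "series_a": [1.0, 2.0, 3.0],
    "series_b": [2.0, 2.5, 4.0],
    "series_c": [0.5, 0.6, 0.7],
}

# Shape of a query cluster set, computed with the same K and w as the table.
query_shape = [1.9, 2.6, 3.8]
guess = match_query(query_shape, master_table)  # "series_b"
```

Accuracy, as defined on the previous slide, would then be the fraction of queries for which `guess` names the series that actually produced the query.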
Example
[Table: master table for k=3, w=8]
Performance Evaluation
- 10 datasets from the UCR time series repository
- 100 clustering runs per structure
- Algorithm evaluated with 3 values of K and 4 values of w (12 combinations)
- Result: 100% accuracy
Outline
1. Introduction
2. New Distance Measure for Cluster Sets
3. STS Cluster Matching Algorithm
4. Observations and Conclusions
Conclusions
- Previous work seemed to show that the output of STS clustering is independent of the input.
- The correct conclusion: cluster set distance is an inappropriate distance metric.
- Instead of absolute positions of cluster centers, one needs to use relative positions (as represented by cluster shapes).
- STS clustering becomes meaningful: cluster centers are reliably matched to the original series.
- We also found a correlation between some characteristics (number of unique shapes, shape skew) and sequence smoothness.
Future Work
WHY?
- Difference in behavior between whole-sequence and subsequence clustering? (Some preliminary answers are in the paper.)
- Apparent presence of transformations among cluster sets?
- Dependency between smoothness, skew, number of unique clusters, etc.?
HOW?
- Find the expected accuracy of the matching algorithm for a given input and Q (the number of clustering runs used to compute each structure).