# Time Series Shapelets: A New Primitive for Data Mining

## Presentation on theme: "Time Series Shapelets: A New Primitive for Data Mining"— Presentation transcript:

Time Series Shapelets: A New Primitive for Data Mining
KDD 2009 Time Series Shapelets: A New Primitive for Data Mining Lexiang Ye and Eamonn Keogh University of California, Riverside Hello, everyone. I am Lexiang Ye. And this is the joint work with my advisor Dr. Eamonn Keogh. Presented by: Zhenhui Li

Classification in Time Series
Application: Finance, Medicine 1-Nearest Neighbor Pros: accurate, robust, simple Cons: time and space complexity (lazy learning); results are not interpretable As we all know, classification is an important problem in the data mining domain. Classification of time series has been extracting great interest over the past decade. And its applications are among a wide variety domains, such as medicine, finance and etc Recent research has suggested the nearest neighbor classification is very hard to beat for most problems. It is most accurate proved by extensive empirical tests. It is robust to noise. It is extremely simple and generic. 200 400 600 800 1000 1200

Solution Shapelets time series subsequence representative of a class
discriminative from other classes In this work, we introduce a new time series primitive, time series shapelet, which addresses these limitations. Informally, shapelets are time series subsequences which are in some sense maximally representative of a class. There are some similar problem in other domains. Like distinguishing substring selection or probe design. However, the time series shapalets covers much wider problem than those.

MOTIVATING EXAMPLE First, Let’s begin with a toy example in shape for visual clarity.

stinging nettles false nettles Shapelet Shapelet Dictionary 5.1 I
Leaf Decision Tree Shapelet Dictionary 5.1 yes no I 1 false nettles Shapelet stinging nettles Given two species of leaves, the first one is called stinging nettle and the second one is colloquially called “false nettle”, since it is very similar to the first group and is easily confused. Now, we want to build a classifier on these two species based on shape. First we convert the outline shapelet to one-dimension time series representation. Because the global difference of the two types of shapes are subtle, Instead, We use a particularly distinguishing subsection. By “blushing” the subsection back, we can gain the insight of the difference, the stinging nettle has a stem connect to the leaf at an angle close to 90 degree. While for the false nettle, the stem connect to the leaf at a much wider angle. This distinguishing subsection is called “shapelet”. Now we build the classifier, the outline time series has a subsection similar to this shapelet are classified as false nettle, otherwise it is stinging nettle. So how can we extract these distinguishing shapelet?

BRUTE-FORCE ALGORITHM

Extract subsequences of all possible lengths
Candidates Pool ca The first step is to construct the shapelet candidates pool. To exhaustively search all the possibilities, we extract subsequences of all possible length. Using a sliding window of certain size, we can extract the subsequence of that lengths. We use the sliding window, slide across each time series, and extract all the subsequences. Then we proceed to a sliding window of different size, and so on and so force. . . .

Testing the utility of a candidate shapelet
Information gain Arrange the time series objects based on the distance from candidate Find the optimal split point (maximal information gain) Pick the candidate achieving best utility as the shapelet After we have all the candidates in the pool, we test and compare the utility of each candidate shapelet in the pool. Here we use the information gain as the utility function. The higher the better. First, we arrange the objects by the distance from each object to the candidate. For visual illustration, in the figure, the leaves are arranged in the real number line according to the candidate. Then we test every possible split point between each two objects and find the optimal one that can best divide the leaves into two original classes. Here, this is the optimal split point for the candidate, it is not perfect but pretty good. After we testing all the candidate, we can pick the one achieving best utility as the shapelet Since the brute force algorithm exhaustively check every single candidate, it guarantee to find the best candidate as the shapelet. However, the brute force method suffers a vital problem. To get the distance from each of the object to the candidate, we need to find the best matching location for the the candidate in the time series which introduces a subsequence searching procedure for every distance calculation. Split Point candidate

Problem Total number of candidate
Candidates Pool Problem Total number of candidate Each candidate: compute the distance between this candidate and each training sample Trace dataset 200 instances, each of length 275 7,480,200 shapelet candidates approximately three days . . . There is really a huge number of candidate in the pool, even for a very small dataset. Take the well know trace dataset for example, it contains 200 instances, each of length 275. But checking every candidate about takes the brute force algorithm about three days. To enable to extend the shapelet method to larger dataset, we apply some speedup on the brute force algorithm.

Speedup Distance calculations from time series objects to shapelet candidates are the most expensive part Reduce the time in two ways Distance Early Abandon reduce the distance computation time between two time series Admissible Entropy Pruning reduce the number of distance calculatations candidate We realize the most expensive calculations in the brute force method is the distance calculation. The two speedups reduce the distance calculation time in two aspect; one is the distance early abandon, it is a known idea and have successfully applied to previous researches. And this method also works particularly efficient in our case. The other is our novel idea, called the admissible entropy prune.

DISTANCE EARLY ABANDON

10 20 30 40 50 60 70 80 90 100 T S

best matching location
10 20 30 40 50 60 70 80 90 100 best matching location S Dist= 0.4 T

calculation abandoned at this point T
Dist> 0.4 S calculation abandoned at this point T 10 20 30 40 50 60 70 80 90 100

Distance Early Abandon
We only need the minimum Dist Method Keep the best-so-far distance Abandon the calculation if the current distance is larger than best so far. Consider the subsequences of T in random order to reduce the best so far as much as possible in the early stage.

We now explain how this idea works

We only need the best shapelet for each class For a candidate shapelet We don’t need to calculate the distance for each training sample After calculating some training samples, the upper bound of information gain < best candidate shapelet Stop calculation Try next candidate As we mentioned before, we use the information gain as the utility function because, it is the tradition evaluation in the decision tree and easily generalized to the multi-class problem. And what’s more, we can exploit it to largely reduce the number of distance calculations from objects to candidates.

stinging nettles false nettles
Suppose now we are considering a candidate. The distances of the first five time series objects to the candidate have been calculated, and their corresponding positions in a one-dimensional representation are shown in the figure.

I=0.42 I= 0.29 Before continue to calculate the remaining distances, we can first give an upper bound of the information gain. The best case is that the rest leaves in two species are far away from each other. So we arrange them at the two side. In this case, the upper bound of the information gain is Suppose we have a previous best-so-far candidate, which has the entropy of 0.42. Our current information gain is lower than the best-so-far. Therefore, at this point, we can stop the distance calculation for the remaining objects and prune this candidate from consideration. So our speedup method still consider every single candidate as the brute force method, but try to avoid unnecessary distance calculations as many as possible.

Classification stinging nettles false nettles Shapelet
Leaf Decision Tree Shapelet Dictionary 5.1 yes no I 1 Classification After we find the shapelet, we can frame it as the decision tree. For each step of the decision tree induction, we determine the shapelet. For classification stage, we compare the time series with the shapelet at each step. Although the shapelet takes long in the training. It is very fast for classification. Since the number of the shapelets a time series should compare to is equal to the height of the decision tree.

EXPERIMENTAL EVALUATION
Now I will provide you several interesting examples and case studies.

Performance Comparison
Lets start with a large dataset, Lightning dataset. In the original dataset, the length of each time series object is And there are separated training and testing set. The size of training is 2000, and testing One sample from each class are plotted in the right below figure here. In our experiment, we use the different fraction of data from training set, and calculate the accuracy on the original testing set. The left figure shows the performance comparison. The y-axis indicates the computing time and x-axis represents the different size of the training. As you can see, our pruning method works far efficiently than the brute force one. When the data set is size of 160, the brute force method take 5 days to complete the learning procedure, while our method only takes 2 hours. On the right figure, we shows the testing accuracy, when only 10 objects (out of the original 2,000) are examined, the decision tree is slightly worse than the best known result on this dataset but after examining just 1% of the training data, it is significantly more accurate. Original Lightning Dataset Length 2000 Training Testing 18000

Projectile Points Projectile point classification is an important and interesting topic in anthropology. The diversity and broken projectile points increase the difficulty to classify the projectile point correctly. In this example, we use our shapelet to classify these three types of projectile points on the shapes.

11.24 85.47 Shapelet Dictionary (Clovis) (Avonlea) I II 200 400 1.0 Arrowhead Decision Tree 2 1 Clovis Avonlea We get the shapelet at each node of the decision tree. By looking at the corresponding portion in the shape, we found some interesting explanation of the classifiers. Clovis is distinguished from the others by an unnotched hafting area near the bottom connected by a deep concave bottom end. After the Clovis is distinguished, Avonlea is differentiated from the mixed class by a small notched hafting area connected by a shallow concave bottom end. Our method also gives better accuracy and much less time than the rotation invariant nearest neighbor method. Method Accuracy Time Shapelet 0.80 0.33 Rotation Invariant Nearest Neighbor 0.68 1013

one sample from each class
Wheat Spectrography 200 400 600 800 1000 1200 0.5 1 one sample from each class Now lets see a multi-class example, the wheat spectrography problem. The wheat samples are grown in Canada between 1998 and The data is labeled by the year in which the wheat was grown. In the figures, we show one sample from each class. The objects are separated in the y-axis for visual clarity. As we can see the global difference is subtle. Thus, we need to reply on the local differences. Wheat Dataset Length 1050 Training 49 Testing 276

Shapelet Dictionary Wheat Decision Tree
0.4 II 0.3 III 0.2 IV 0.1 0.0 V VI 100 200 300 Wheat Decision Tree 2 4 1 3 6 5 I II III IV V VI Using the shapelet method, we build the decision tree, found the shapelets represent the local differences and got the better accuracy than the nearest neighbor. So our method works well in the multi-class problem. Method Accuracy Time Shapelet 0.720 0.86 Nearest Neighbor 0.543 0.65

Coffee Coffee Decision Tree chlorogenic acid 1 Shapelet Dictionary
100 200 300 Coffee Decision Tree 0.5 1.0 1.5 100 200 300 I 11.14 (Robusta) 1 Shapelet Dictionary caffeine 100 200 300 Method Accuracy Shapelet (28/28) 1.0 Nearest Neighbor (leave-one-out) 0.98

the Gun/NoGun Problem No Gun Gun (No Gun) 38.94 I Shapelet Dictionary
2 I Shapelet Dictionary Lets we considered a well-studied gun/nogun. In our experiment, we use the same training/testing split as the previous study. The shapelet indicates a phenomenon known as “overshoot”. we can see that the NoGun class has a “dip” where the actor put her hand down by her side. That is because the inertia carries her hand a little too far and she is forced to correct for it. And we also achieve better accuracy using less classification time in this well-studied case. 50 100 Gun Decision Tree I 1 Method Accuracy Time Shapelet 0.933 0.016 Rotation Invariant Nearest Neighbor 0.913 0.064

Gait Analysis Gait Dataset Length different lengths Training 18
For all the above experiments, we only consider one dimensional time series. In the gait analysis example, we extend the usage of shapelet to multi-variate time series. Here, we want to distinguish the normal walk, in the left video, from the abnormal walk on the right. We record the time series for both feet. The video are taken from different actors, different walking style and different step numbers. Gait Dataset Length different lengths Training 18 Testing 143

Reduces the sensitivity of alignment
0.909 0.902 1.0 0.860 right toe I left toe (Normal Walk) 0.535 100 200 300 Walk Decision Tree I 1 The resulting shapelet on the left represent exactly one walking cycle. On the right figure, we shows the accuracy comparison between two different methods. Shapelet and Rotation invariant nearest neighbor. We have two types of data in the experiment, one is the raw data extracted from the video and the other is data after careful alignment and segmentation. Although the nearest neighbor method doesn’t lose too much on the well segmented data, the accuracy of nn suffers greatly on the raw data. so the shapelet method can reduce the sensitivity of alignment and thus alleviates human labors for preprocessing data. Time for Classification on 143 testing objects Shapelet 0.060s Rotation Invariant Nearest Neighbor 2.684s

Conclusions Interpretable results more accurate/robust
significantly faster at classification As we have seen, The shapelet provide interpretable results, which may help domain practitioners better understand their data. The shapelet generate more accurate result. Also it is more robust to the noise and unalignment. What’s more, the shapelet is significantly fast at classification than existing state-of-art approaches.

Discussions - Comparison
Hong Cheng, Xifeng Yan, Jiawei Han, and Chih-Wei Hsu , “Discriminative Frequent Pattern Analysis for Effective Classification” (ICDE'07) Hong Cheng, Xifeng Yan, Jiawei Han, and Philip S. Yu, "Direct Discriminative Pattern Mining for Effective Classification", (ICDE'08) Similarities: motivation: Discriminative frequent pattern = Shapelet technique: Use upper bound of information gain to speed up Differences: application: general feature selection v.s. time series (no explicit features) split node: binary (contain/not contain a pattern) v.s. numeric value (smaller/larger than a value)

Discussions – other topics
Similar ideas could be applied to other research topics graph image spatio-temporal social network ….

Discussions – other topics
Graph classification: Xifeng Yan, Hong Cheng, Jiawei Han, and Philip S. Yu, “Mining Significant Graph Patterns by Scalable Leap Search”, Proc ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'08), Vancouver, BC, Canada, June 2008.

Discussions – other topics
moving object classification Discriminative sub-movement

Discussions – other topics
Social network classify normal/spamming users

Discussions – other topics

Discussions – other topics
Social network classify normal/spamming users How to find discriminative features on social network? social network structure user behaviour

Discussions – other topics
For different applications, this idea could be adapted to improve the performance; but not easily adapted.

Thank You  Question? That’s concludes my talk! And I am happy to answer any question.