Presentation on theme: "Fair Use Agreement This agreement covers the use of all slides on this CD-Rom, please read carefully. You may freely use these slides for teaching, if."— Presentation transcript:
Fair Use Agreement This agreement covers the use of all slides on this CD-Rom, please read carefully. You may freely use these slides for teaching, if You send me an email telling me the class number/ university in advance. My name and email address appears on the first slide (if you are using all or most of the slides), or on each slide (if you are just taking a few slides). You may freely use these slides for a conference presentation, if You send me an email telling me the conference name in advance. My name appears on each slide you use. You may not use these slides for tutorials, or in a published work (tech report/ conference paper/ thesis/ journal etc). If you wish to do this, email me first, it is highly likely I will grant you permission. (c) Eamonn Keogh, firstname.lastname@example.org
Towards Parameter Free Data Mining Eamonn Keogh Computer Science & Engineering Department, University of California Riverside Stefano Lonardi Chotirat Ann Ratanamahatana
Outline of Talk Why parameter laden data mining algorithms are troublesome… A parameter free/lite approach Empirical evaluation Parameter free/lite clustering Parameter free/lite anomaly detection Parameter free/lite classifcation Conclusions
Parameters Anecdotal observations A data mining paper I once reviewed had 14 parameters… More than 5 parameters seems the norm for most papers… Some algorithms have only 4 parameters, but the algorithm is tested on synthetic data, created by a algorithm with 3 or 4 parameters…. Is a surfeit of parameters a bad thing?
Parameter laden algorithms are often hard to implement By definition, parameter laden algorithms tend to be hard to implement. But don’t we want people to use our work? This fact (combined with general laziness) probably accounts for the general dearth of strawman comparisons in most papers. See Keogh & Kasetty SIGKDD02.
Parameter laden algorithms make it hard to reproduce others results We re-implement many other papers (63 for this paper alone). Yet we are rarely able to reproduce others results, even when using the same datasets. One reason is that we typically don’t know how someone set the 5, 6, 7, 8, 9 parameters used in the original paper. Are irreproducible results scientifically meaningful?
Parameter laden algorithms make it hard not to overfit Over fitting is a problem in data mining. If we have 5 to 10 parameters to set, avoiding over fitting is nearly impossible. This makes it hard to evaluate the contribution of a paper. So you got 95% accuracy on the “bull/bear” classification problem. Did you… Spend 2 minutes adjusting the parameters (I am somewhat impressed) Spend 2 weeks adjusting the parameters (I am not impressed) Yet often we don’t know which! (See Pedro Domingos POE papers)
Parameter laden algorithms are nearly useless to the non-specialist If we ask a Cardiologist / COE / Lawyer / School Administrator to use an algorithm with 10 parameters, then the best we can hope for is that they give up in frustration!
The Central Claim of this Paper A very simple (12 lines of code) parameter free algorithm can outperform many (most/all?) other algorithms, for many data mining tasks.
Quick Review The Kolmogorov complexity K(x) of a string x is defined as the length of the shortest program capable of producing x on a universal computer — such as a Turing machine. The conditional Kolmogorov complexity K(x|y) of x to y is defined as the length of the shortest program that computes x when y is given as an auxiliary input to the program. The function K(xy) is the length of the shortest program that outputs y concatenated to x.
The Similarity Metric The Similarity Metric Ming Li et. al. Ming Li, Xin Chen, Xin Li, Bin Ma, Paul M. B. Vitányi: The similarity metric. SODA 2003: 863-872 This is the optimal similarity measure, but intractable… This is a cheap approximation… C(x) means the size of x compressed by an off-the- shelf compression algorithm like Winzip, Stuffit etc Ming Li
The (very) Minor Extension proposed here This is the optimal similarity measure, but intractable… This is a cheap approximation… This is a cheaper approximation… Note we flipped the numerator and denominator to go from similarity to dissimilarity
function dist = CDM(A, B) save A.txt A –ASCII % Save variable A as A.txt zip('A.zip', 'A.txt'); % Compress A.txt A_file = dir('A.zip'); % Get file information save B.txt B –ASCII % Save variable B as B.txt zip('B.zip', 'B.txt'); % Compress B.txt B_file = dir('B.zip'); % Get file information A_n_B = [A; B]; % Concatenate A and B save A_n_B.txt A_n_B –ASCII % Save A_n_B.txt zip('A_n_B.zip', 'A_n_B.txt'); % Compress A_n_B.txt A_n_B_file = dir('A_n_B.zip'); % Get file information % Return CDM dissimilarity dist = A_n_B_file.bytes / (A_file.bytes + B_file.bytes); Compression Based Distance Measure
But wait! Isn’t the choice of compression algorithm itself a parameter! No! We are trying to approximate the optimal Kolmogorov compression. So the best compression algorithm to use is the one that gives the smallest file size
Baboon Barbary Ape Chimpanzee Pygmy Chimpanzee Human Gorilla Orangutan Sumatran Orangutan Gibbon Capuchin Malayan Flying Lemur Ring-Tailed Lemur Oyster pongines prosimians cercopithecoids hominoids greater apes panines anthropoids catarrhines taxonomy level controversial primates The clustering achieved by CDM on 16,300 symbols from the mitochondrial DNA of 12 primates, and one “outlier” species. Sang-Hee Lee (Noted physical anthropologist) This is the correct clustering for these species
Argentina Mexico Spain Brazil Catalan Italy Denmark Norway Sweden Germany USA Argentina Mexico Spain Brazil Catalan Italy Denmark Norway Sweden Germany USA Danish Norwegian German French Dutch Italian Latin English Maori Danish Norwegian German French Dutch Italian Latin English Maori The clustering achieved on the text from various Yahoo portals (Jan-15 th 2004). The smallest webpage had 1,615 characters, excluding white spaces The clustering achieved on the text from the first fifty chapters of Genesis
But DNA strings and natural language text are inherently discrete. Much of data mining is on real valued data, like images, video and time series Yeah, what about time series?
1 2 3 0.13812500000000 0.51250000000000 0.49561523437690 0.04875000000000 0.50000000000000 0.49604248046834 0.10375000000000 0.50000000000000 0.49653076171875 0.17875000000000 0.47562500000000 0.49706481933594 0.24093750000000 0.45125000000000 0.49750732421875 0.29875000000000 0.45125000000000 0.49808715820312 0.37000000000000 0.47656250000000 0.49875854492187 0.48375000000000 0.50000000000000 0.49939941406230 0.55593750000000 0.48281250000000 0.50007080078125 0.64625000000000 0.48468750000000 0.50062011718750 0.70125000000000 0.46937500000000 0.50123046875826...... What about Time Series? Part I In the example below, 1 and 3 are very similar ECGS...
What about Time Series? Part II 00 20 406080100120 b b b a c c c a baabccbc SAXIFY! In the SAX representation, every bit has about the same importance… 1 2 3 0.13812500000000 0.51250000000000 0.49561523437690 0.04875000000000 0.50000000000000 0.49604248046834 0.10375000000000 0.50000000000000 0.49653076171875
How well does CDM time series clustering work? We compared to 51 other measures (we re-implemented every time series measure from SIGKDD, SIGMOD, ICDM, ICDE, VLDB, ICML, SSDM, PKDD and PAKDD conferences in the last decade). We optimized the parameters of the above We tested on diverse problems Homogenous datasets: All ECGs etc Heterogeneous datasets: A mixed bag of 18 diverse datasets We used a formal metric to score the clusterings (but here we will just show a few pictures)
CDM does best (by a huge margin) on the heterogeneous data problems
1 16 17 18 19 20 4 7 9 6 2 5 8 11 12 14 15 13 10 3 CDM’s clustering CDM does best (by a huge margin) on the homogenous data problems The second best clustering (of 51 approaches) MIT-BIH Noise Stress Test Database (nstdb): record 118e6 Long Term ST Database (ltstdb): record 20021 BIDMC Congestive Heart Failure Database (chfdb): record chf02 BIDMC Congestive Heart Failure Database (chfdb): record chf15
OK, you can cluster stuff, what about other data mining problems, like classification, motif discovery, anomaly detection Yes, what about anomaly detection?
Parameter-Free Anomaly Detection The connection between the notion of similarity and anomalies is inherent in the English idiom. When confronted with an anomalous object or occurrence, people usually exclaim "I have never seen anything like it!" or "It is like nothing you have ever seen".
function loc_of_anomaly = kolmogorov_anomaly(data) loc_of_anomaly = 1; while size(data,1) > 2 if CDM(data(1:floor(end/2),:),data); < CDM(data(ceil(end/2):end,:),data); loc_of_anomaly = loc_of_anomaly + size(data,1) / 2; data = data(ceil(end/2):end,:); else data = data(1:floor(end/2),:); end … A Parameter-Free Anomaly Detection Algorithm
A) Our Approach B) Support Vector Machine (SVM) based approach (6 parameters). C) Immunology (IMM) inspired approach (5 parameters). D) Association Rule (AR) based approach (5 parameters). E) TSA-tree Wavelet based approach (3 parameters). For each experiment we spend one hour of CPU time, and one hour of human time trying to find the best parameters and only reported the best results. A problem from SIGKDD 2003…
What happens when we change the anomaly? A) Our Approach B) Support Vector Machine (SVM) based approach (6 parameters). C) Immunology (IMM) inspired approach (5 parameters). D) Association Rule (AR) based approach (5 parameters). E) TSA-tree Wavelet based approach (3 parameters).
200030004000 5000 A small excerpt from dataset 118e06 from the MIT-BIH Noise Stress Test Database. The full dataset is 21,600 data points long. Here, we show only a subsection containing the two most interesting events detected by our algorithm The gray markers are independent annotations by a cardiologist indicating Premature Ventricular Contractions
L-1u L-1n L-1t L-1v The results of using our algorithm on various datasets from the Aerospace Corp collection. The bolder the line, the stronger the anomaly. Note that because of the way we plotted these there is a tendency to locate the beginning of the anomaly as opposed to the most anomalous part. Aerospace Corp Problems
Hand resting at side Hand above holster Aiming at target Actor misses holster Briefly swings gun at target, but does not aim Laughing and flailing hand 0100200300400500600 100 150 200 250 300 350 400 450 500 550 The parameter-free algorithm generalizes to multi-dimensional time series, without changing a single line of code!
What about classification? Yes, what about classification?
Euclidean DTWCDM ECG: signal 1 42.25 %16.25 %6.25 % ECG: signal 2 47.50 %11.25 %7.50 % Gun: 2 classes 5.00 %0.00 % Gun: 4 classes 37.50 %12.5 %5.00 % Parameter-Free (Structural) Time Series Classification
What are the limitations of parameter-free algorithms? The data cannot be arbitrarily small, but the amazing clustering results show here only used 1,000 datapoints per object!
Conclusions Parameter laden algorithms are bad! They are often hard to implement They make it hard to reproduce others results They make it difficult to judge the significance of a contribution They make it hard not to overfit They are next to useless to the non-specialist Parameter free/lite algorithms are good! They require little coding effort They make it trivial for others to reproduce your results They make it nearly impossible to overfit They have a real chance of being used by the non-specialist You should use a parameter free/lite approach before doing anything more complex!
Questions? All datasets and code used in this talk can be found atwww.cs.ucr.edu/~eamonn/TSDMA/index.html