Finding Approximately Repeated Data


1 Finding Approximately Repeated Data
Part I

2 New variant
The two most common similarity search variants are:
K nearest neighbor search: find me the five closest Starbucks to my office.
Range search: find me all Starbucks within 4 miles of my office.
The new variant, similarity join: given this set of 12,000 Starbucks, find me the pair that is closest to each other. Or: for all 12,000 Starbucks, find each one's 1-nearest neighbor.

3 Let us review the matrix view of the world
Many datasets naturally are, or can be converted into, sparse matrices.
[Figure: a sparse Boolean matrix with rows R1-R6 and columns C1-C15]

4 Examples:
The rows are patients, the columns are the drugs they have taken.
The rows are Netflix users, the columns are the movies they purchased.
The rows are animals, the columns are the genes they have.
The rows are documents, the columns are words (or shingles).
Note:
The dimensionality can be very high; there are 1.7 million movies on IMDB.
The numerosity can be very high; there are 44 million US Netflix users.
The data is generally very, very sparse (sparser than my example below).

5 Note: These matrices are sets, not lists.
You can permute the rows or columns; it makes no difference.

6 It is possible that some datasets are not Boolean
For example, the cells might contain the user's rankings of movies. Surprisingly, we rarely care! The Boolean version of the matrix is good enough for almost everything we want to do. If the cells contain counts rather than 0/1 values, we call the sets bags.
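As a concrete sketch of this point, converting a bag (counts) matrix to its Boolean version is a one-liner; the ratings below are invented for illustration.

```python
# A tiny made-up ratings matrix: rows are users, columns are movies,
# 0 means "not rated". The Boolean version just records "rated or not".
ratings = [
    [0, 3, 0, 1],
    [2, 0, 4, 0],
]

boolean = [[1 if cell > 0 else 0 for cell in row] for row in ratings]
print(boolean)  # [[0, 1, 0, 1], [1, 0, 1, 0]]
```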

7 We can look at the data in two different ways, by row or by column.
Note that User 3 and User 5 have very similar tastes in movies (we will define "similar" later). This could be an exploitable fact. For example, User 3 has not yet seen Movie C6, so we could suggest it to her as "you might also like…".
[Figure: Boolean matrix, rows R1-R6 (users), columns C1-C15 (movies)]

8 We can look at the data in two different ways, by row or by column.
Note that Movie 1 and Movie 15 are similar, because they are liked by the same people (we will define "similar" later). This is also exploitable in many ways.

9 Getting data in the matrix format
Some data are already intrinsically in Boolean format. For data that is not, we will have to convert it. This has been done for sounds, earthquakes, fingerprints, images, faces, genes, etc. However, we will mostly consider text as our motivating example, due to its importance. It is worth taking the time to contrast data mining of text with information retrieval of text….

10 We can place words in cells (as below) but we typically don't
In the example below, documents A and B seem related, but have nothing in common according to this naïve representation. Consider three short documents:
A = humans can swim
B = The man went swimming
C = dogs will bark
[Figure: document-by-word Boolean matrix, columns: humans, can, swim, the, man, went, swimming, dogs, will, bark]

11 Instead of words, we use Shingles
A k-shingle (or k-gram) for a document is a sequence of k characters that appears in the document.
Example: k = 2; doc = abcab. Set of 2-shingles = {ab, bc, ca}.
Option: regard shingles as a bag, and count ab twice.
Represent a doc by its set of k-shingles.
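The shingling step above can be written directly; a minimal sketch:

```python
def shingles(doc, k):
    """Return the set of k-shingles (length-k character substrings) of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

# The slide's example: k = 2, doc = abcab (the duplicate "ab" collapses).
print(shingles("abcab", 2))  # {'ab', 'bc', 'ca'}
```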

12 Representing a doc by its set of k-shingles.
A = humans can swim
B = The man went swimming
C = dogs will bark
[Figure: the same documents represented two ways: as a Boolean word matrix (columns: humans, can, swim, the, man, went, swimming, dogs, will, bark) and as a Boolean 2-shingle matrix (columns: hu, ma, an, do, sw, im, th, wi, ba)]

13 Why use Shingles instead of words?
Consider three short documents:
A = A human can swim
B = The man went swimming
C = A dog might bark
The 3-shingles that occur in both A and B are: {man, swi, wim}
So while A and B have no words in common, they do have shingles in common. (Note that stemming etc. could partly solve this, but it is domain dependent.)
English: {England, information, addresses}
Norwegian: {Storbritannia, informasjon, adressebok}
Danish: {Storbritannien, informationer, adressekartotek}
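A quick check of the example above. The lowercasing and space-stripping are assumed preprocessing steps (the slide does not say how it normalizes text), but with them the claimed shared shingles come out exactly.

```python
def shingles(doc, k):
    # Lowercase and drop spaces before shingling -- an assumed
    # preprocessing step, not specified on the slide.
    text = doc.lower().replace(" ", "")
    return {text[i:i + k] for i in range(len(text) - k + 1)}

A = "A human can swim"
B = "The man went swimming"
print(shingles(A, 3) & shingles(B, 3))  # {'man', 'swi', 'wim'}
```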

14 Basic Assumption
Documents that have lots of shingles in common have similar text, even if the text appears in a different order:
man made god
god made man
Careful: you must pick k large enough, or most documents will have most shingles in common. k = 5 is OK for short documents; k = 10 is better for long documents. We can use cross validation to find k.

15 Minutiae (Galton Details)
Sir Francis Galton's mathematical conclusions predicted the possible existence of some 64 billion different fingerprint patterns.
[Figure: minutiae types: Ridge Ending, Enclosure, Bifurcation, Island]

16 [Figure: Boolean matrix]

17 Jaccard Similarity of Sets
The Jaccard similarity of two sets is the size of their intersection divided by the size of their union.
Sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
Sim(U3, U5) = 6/7. Also written as J(U3, U5).
The Jaccard similarity is a metric (on finite sets). Its range is between zero and one. If both sets are empty, sim(A, B) = 1.
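The definition above, including the empty-set convention, is a few lines of code; a minimal sketch:

```python
def jaccard(a, b):
    """Jaccard similarity |A intersect B| / |A union B|; 1 if both empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def jdist(a, b):
    """Jaccard distance, 1 - similarity (a metric on finite sets)."""
    return 1.0 - jaccard(a, b)

print(jaccard({1, 2, 3}, {2, 3, 4}))  # 0.5
```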

18 Jaccard Similarity / Jaccard Distance
We can convert to a distance measure if we want: Jdist(C1, C2) = 1 - Sim(C1, C2).

19 The Search Problem
Given a query Q, find the most similar object (row), or: given a query Q, find the most similar feature (column).
We know how to solve this problem, but it might be slow….
Algorithm Sequential_Scan(Q)
1. best_so_far = infinity;
2. for all sequences in database
3.     true_dist = Jdist(Q, Ci)
4.     if true_dist < best_so_far
5.         best_so_far = true_dist;
6.         index_of_best_match = i;
7.     endif
8. endfor

20 Lower/upper bounding search
We need to actually do upper bounding search, because we have similarity, not distance. Can we create an upper bound for Jaccard?
Algorithm Upper_Bounding_Sequential_Scan(Q)
1. best_so_far = 0;
2. for all sequences in database
3.     UB_dist = upper_bound_distance(Ci, Q);
4.     if UB_dist > best_so_far
5.         true_dist = Jaccard(Ci, Q);
6.         if true_dist > best_so_far
7.             best_so_far = true_dist;
8.             index_of_best_match = i;
9.         endif
10.     endif
11. endfor
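The slide leaves upper_bound_distance abstract. A runnable sketch can use a simple size-based bound (the intersection is at most the smaller set, the union is at least the larger set, so min/max never underestimates the true Jaccard); the database below is invented.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0

def upper_bound(a, b):
    # |A & B| <= min(|A|,|B|) and |A | B| >= max(|A|,|B|),
    # so min/max is a valid upper bound on the Jaccard similarity.
    if not a and not b:
        return 1.0
    return min(len(a), len(b)) / max(len(a), len(b))

def best_match(query, database):
    best_so_far, index_of_best_match = 0.0, None
    for i, c in enumerate(database):
        if upper_bound(query, c) > best_so_far:   # cheap test first
            true_sim = jaccard(query, c)          # expensive test only if needed
            if true_sim > best_so_far:
                best_so_far, index_of_best_match = true_sim, i
    return index_of_best_match, best_so_far

db = [{1, 2}, {1, 2, 3}, {7, 8, 9}]
print(best_match({1, 2, 3, 4}, db))  # (1, 0.75)
```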

21 Upper Bounding Jaccard Similarity
Sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
The intersection can be at most the size of the smaller column, here 3. The union is at least the size of the larger column, here 4. So:
UpperBound(C1, C2) = 3/4 = 0.75
The true value, computed over all the rows, is Sim(C1, C2) = 2/5 = 0.4.

22 The Search Problem
The search problem is easy! Even without any "tricks" you can search millions of objects per second… However, the next problem we will consider, while superficially similar, is really hard.

23 Fundamental Data Mining Problem
The similarity join problem (motif problem): find the pair of objects that are most similar to each other.
Why is this useful?
Plagiarism detection
Mirror pages
Finding articles from the same source
Finding good candidates for a marketing campaign
Finding similar earthquakes
Finding similar faces in images (camera handoff)
etc.

24 Algorithm to Solve the Most Similar Pair Problem
Find the pair of users that are most similar to each other (or the pair of movies).
bestSoFar = inf;
for i = 1 to num_users
    for j = i+1 to num_users
        if Jdist(user_i, user_j) < bestSoFar
            bestSoFar = Jdist(user_i, user_j);
            disp('So far, the best pair is ', i, j)
        endif
    end
end
There are 44 million US Netflix users, so we must compute the Jaccard index 967,999,978,000,000 times (~968 trillion).
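The MATLAB-style loop above translates directly into Python; a minimal sketch on an invented set of users:

```python
def most_similar_pair(users):
    """Brute-force similarity join: O(n^2) Jaccard comparisons."""
    def jdist(a, b):
        return 1.0 - (len(a & b) / len(a | b) if (a or b) else 1.0)

    best_so_far, best_pair = float("inf"), None
    for i in range(len(users)):
        for j in range(i + 1, len(users)):
            d = jdist(users[i], users[j])
            if d < best_so_far:
                best_so_far, best_pair = d, (i, j)
    return best_pair, best_so_far

users = [{1, 2, 3}, {4, 5}, {1, 2, 3, 4}, {9}]
print(most_similar_pair(users))  # ((0, 2), 0.25)
```

At 44 million users the double loop is exactly the ~968 trillion comparisons the slide counts, which is why the rest of the lecture is about avoiding it.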

25 We are going to learn to solve the most similar pair problem for sets
Sets can be anything, but documents and movies/users are our running examples. The solution involves MinHashing and Locality Sensitive Hashing. However, before we do, we will spend the rest of this class solving a very similar problem, but for the special case of time series. The time series version will be the ideal warmup for us.

26 Time Series Motif Discovery (finding repeated patterns)
Winding Dataset (the angular speed of reel 2)
Are there any repeated patterns, of about this length, in the above time series?
Chiu, B., Keogh, E., & Lonardi, S. (2003). Probabilistic Discovery of Time Series Motifs. In the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington, DC, USA.

27 Time Series Motif Discovery (finding repeated patterns)
Winding Dataset (the angular speed of reel 2)
[Figure: the time series with three occurrences A, B, C of a repeated pattern marked, and the three subsequences shown aligned below]

28 Why Find Motifs?
· Mining association rules in time series requires the discovery of motifs. These are referred to as primitive shapes and frequent patterns.
· Several time series classification algorithms work by constructing typical prototypes of each class. These prototypes may be considered motifs.
· Many time series anomaly/interestingness detection algorithms essentially consist of modeling normal behavior with a set of typical shapes (which we see as motifs), and detecting future patterns that are dissimilar to all typical shapes.
· In robotics, Oates et al. have introduced a method to allow an autonomous agent to generalize from a set of qualitatively different experiences gleaned from sensors. We see these "experiences" as motifs.
· In medical data mining, Caraca-Valente and Lopez-Chavarrias have introduced a method for characterizing a physiotherapy patient's recovery based on the discovery of similar patterns. Once again, we see these "similar patterns" as motifs.
· Animation and video capture… (Tanaka and Uehara, Zordan and Celly)

29 An Example on Real Customer Data: Oil Refinery
In the next few slides I will show you a prototype motif discovery tool that we built in my lab to support exploitation of oil refinery data. Although this is real data, because of its proprietary nature I cannot give too many details. Let us just say we have a time series that measures one aspect of a machine process (say temperature, pressure, tank level, etc.).
There is a lot of data; how do we make sense of it? The most basic thing we can do is ask: what are the repeated patterns (motifs) that keep showing up?

30 Here is the software tool examining about 6 months of real data
[Screenshot. Callouts: this is the original time series; this is a derived meta-time series (where the blue value is low, the corresponding red time series is somewhat "typical"); this is the top motif; this is the second motif; this is the third motif; these are the three most unusual patterns.]

31 Note that there appear to be three regimes discovered
An 8-degree ascending slope
A 4-degree ascending slope
A 0-degree constant slope
We can now ask whether the regimes are associated with yield quality, by looking up the yield numbers on the days in question. We find:
A = {bad, bad, fair, bad, fair, bad, bad}
B = {bad, good, fair, bad, fair, good, fair}
C = {good, good, good, good, good, good, good}
So yes! These patterns appear to be precursors to the quality of yield (we have not fully teased out causality here). So now we can monitor for patterns "B" and "A", sound an alarm if we see them, take action, and improve quality / save costs, etc.

32 My lab made two fundamental contributions that make this possible.
Speed: done in a brute-force manner, this would take 144 days*. However, we can do it in just a few seconds.
Meaningfulness: without careful definitions and constraints, on many datasets we would find meaningless or degenerate solutions. For example, we might have "lumped" all three of these patterns together, and missed their subtle and important differences.
*Say each operation takes … seconds; we have to do 1000 * … * ((…)/2) operations

33 Motif Example
(Zebra Finch vocalizations in MFCC, 100-day-old male)
Motif discovery can often surprise you. While it is clear that this time series is not random, we did not expect the motifs to be so well conserved or repeated so many times.
[Figure: an 8000-point series with motif 1, motif 2, and motif 3 marked; each motif is about 2 seconds long]

34 Trivial Matches
[Figure: Space Shuttle STS-57 Telemetry (Inertial Sensor), showing a subsequence C and its trivial matches T]
Definition 1. Match: Given a positive real number R (called range) and a time series T containing a subsequence C beginning at position p and a subsequence M beginning at q, if D(C, M) ≤ R, then M is called a matching subsequence of C.
Definition 2. Trivial Match: Given a time series T, containing a subsequence C beginning at position p and a matching subsequence M beginning at q, we say that M is a trivial match to C if either p = q or there does not exist a subsequence M' beginning at q' such that D(C, M') > R, and either q < q' < p or p < q' < q.
Definition 3. K-Motif(n,R): Given a time series T, a subsequence length n and a range R, the most significant motif in T (hereafter called the 1-Motif(n,R)) is the subsequence C1 that has the highest count of non-trivial matches (ties are broken by choosing the motif whose matches have the lower variance). The Kth most significant motif in T (hereafter called the K-Motif(n,R)) is the subsequence CK that has the highest count of non-trivial matches, and satisfies D(CK, Ci) > 2R, for all 1 ≤ i < K.
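A brute-force sketch of the 1-Motif search these definitions imply. We use z-normalized Euclidean distance, and the common simplification of Definition 2 that any overlapping window counts as a trivial match; all names below are ours, and the planted example data is invented.

```python
import math

def znorm(s):
    """Z-normalize a subsequence (constant subsequences are left at zero)."""
    m = sum(s) / len(s)
    sd = math.sqrt(sum((x - m) ** 2 for x in s) / len(s)) or 1.0
    return [(x - m) / sd for x in s]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(znorm(a), znorm(b))))

def one_motif(T, n, R):
    """Start index with the highest count of non-trivial matches.
    Overlapping windows (|p - q| < n) are skipped as trivial matches."""
    starts = range(len(T) - n + 1)
    best, best_count = None, -1
    for p in starts:
        count = sum(1 for q in starts
                    if abs(p - q) >= n and dist(T[p:p+n], T[q:q+n]) <= R)
        if count > best_count:
            best, best_count = p, count
    return best, best_count

T = [0, 2, 0, 5, 7, 0, 2, 0, 9, 4, 0, 2, 0]   # pattern [0,2,0] planted 3 times
print(one_motif(T, n=3, R=0.001))  # (0, 2)
```

This is the quadratic algorithm that the random-projection method on the next slides is designed to avoid.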

35 OK, we can define motifs, but how do we find them?
The obvious brute-force search algorithm is just too slow… The most referenced algorithm is based on a hot idea from bioinformatics, random projection*, and the fact that SAX allows us to lower bound discrete representations of time series.
* J. Buhler and M. Tompa. Finding motifs using random projections. In RECOMB'

36 SAX allows (for the first time) a symbolic representation that allows:
Lower bounding of Euclidean distance
Dimensionality Reduction
Numerosity Reduction
[Figure: a time series discretized into the SAX word "baabccbc"]
Lin, J., Keogh, E., Lonardi, S. & Chiu, B. (2003). A Symbolic Representation of Time Series, with Implications for Streaming Algorithms. In proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. San Diego, CA, June 13.
Chiu, B., Keogh, E. & Lonardi, S. (2003). Probabilistic Discovery of Time Series Motifs. In the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington, DC, USA.
Lin, J., Keogh, E., Lonardi, S., Lankford, J. P. & Nystrom, D. M. (2004). Visually Mining and Monitoring Massive Time Series. In proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Seattle, WA, Aug 22-25. (This work also appears as a VLDB 2004 demo paper, under the title "VizTree: a Tool for Visually Mining and Monitoring Massive Time Series.")
Keogh, E., Lonardi, S. & Ratanamahatana, C. (2004). Towards Parameter-Free Data Mining. In proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Seattle, WA, Aug 22-25.
See also Celly, B. & Zordan, V. B. (2004). Animated People Textures. In proceedings of the 17th International Conference on Computer Animation and Social Agents (CASA 2004). July 7-9, Geneva, Switzerland.
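A minimal SAX sketch: z-normalize, average into equal-length segments (PAA), then map each segment mean to a letter. The breakpoints below are the standard Gaussian cut points for a 3-letter alphabet (±0.4307); the series length is assumed divisible by the number of segments.

```python
import math

def sax(series, segments, breakpoints=(-0.4307, 0.4307)):
    """Discretize a series into a SAX word over the alphabet 'abc'."""
    m = sum(series) / len(series)
    sd = math.sqrt(sum((x - m) ** 2 for x in series) / len(series)) or 1.0
    z = [(x - m) / sd for x in series]          # z-normalize
    n = len(z) // segments                      # PAA segment length
    means = [sum(z[i * n:(i + 1) * n]) / n for i in range(segments)]
    letters = "abc"
    def to_letter(v):
        # Count how many breakpoints the mean exceeds.
        return letters[sum(v > bp for bp in breakpoints)]
    return "".join(to_letter(v) for v in means)

print(sax([0, 0, 0, 0, 10, 10, 10, 10], segments=2))  # "ac"
```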

37 A simple worked example of the motif discovery algorithm
[Figure: a time series T of length m = 1000. A subsequence C1 is extracted and converted to the SAX word "acba", which is placed into a table of SAX words indexed by start position: acba (1), bcab (2), …, acca (58), …, bccc (985)]

38 (continuing the worked example)
Key observation: thanks to the Dimensionality Reduction and Cardinality Reduction of SAX, the SAX words that describe the two occurrences are almost the same. Could we make them more similar by changing the SAX parameters? Yes, and no. What can we do? Hash!
[Figure: the table of SAX words: acba (1), bcab (2), …, acca (58), …, bccc (985)]

39 A mask {1,2} was randomly chosen, so the values in columns {1,2} were used to project the matrix into buckets. Collisions are recorded by incrementing the appropriate location in the collision matrix.
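The projection-and-collision step can be sketched as follows. The four SAX words come from the running example; the masks are 0-indexed versions of the slide's columns {1,2} and a second round with {2,4} (here fixed rather than random, so the result is deterministic).

```python
from collections import defaultdict
from itertools import combinations

def project_and_collide(words, mask, collisions):
    """One round of random projection: bucket each SAX word by the
    characters at the masked positions, then record a collision for
    every pair of words that lands in the same bucket."""
    buckets = defaultdict(list)
    for idx, w in enumerate(words):
        key = "".join(w[c] for c in mask)
        buckets[key].append(idx)
    for ids in buckets.values():
        for i, j in combinations(ids, 2):
            collisions[(i, j)] = collisions.get((i, j), 0) + 1

words = ["acba", "bcab", "acca", "bccc"]   # SAX words from the worked example
collisions = {}
project_and_collide(words, mask=(0, 1), collisions=collisions)  # columns {1,2}
project_and_collide(words, mask=(1, 3), collisions=collisions)  # columns {2,4}
print(collisions)  # {(0, 2): 2, (1, 3): 1}
```

The pair (0, 2), i.e. "acba" and "acca", collides in both rounds, so it becomes the candidate motif to verify against the raw time series.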

40 Once again, collisions are recorded by incrementing the appropriate location in the collision matrix
This time a mask {2,4} was randomly chosen, so the values in columns {2,4} were used to project the matrix into buckets.

41 We can calculate the expected values in the matrix, assuming there are NO patterns…
[Figure: the collision matrix, with observed counts for cells such as (1, 27), (2, 58), and (3, 985) compared against their expected values]

42 A Simple Experiment
Let us embed two motifs into a random walk time series, and see if we can recover them.
[Figure: the random walk with the planted occurrences A, B, C, D marked]

43 Planted Motifs
[Figure: the four planted subsequences A, B, C, D]

44 “Real” Motifs
[Figure: the motifs actually discovered in the data]

45 Review
We can place many kinds of data into a Boolean matrix. A fundamental problem is to quickly find the closest pair of objects in that matrix. For a very similar problem in time series, a fast solution involves hashing multiple times into buckets, and hoping that the "closest pair of objects" will hash into the same bucket many times. Next time we will see that this hashing trick can be made to work for the general case.

46

47 Part II: Finding Similar Sets
Applications; Shingling; Minhashing; Locality-Sensitive Hashing
Adapted from slides by Jeffrey D. Ullman

48 Useful Advice Doubt Knock Shake

49 Problem Reminder (adversarial view)
I give you a million files. One of them is a copy of another. I want you to find the pair that includes the copy. For the copy I…
Did nothing (test for equality at the bit level)
Changed one letter (use Hamming distance)
Deleted the first word (use string edit distance)
Swapped paragraphs, or added my own extra paragraphs (treat as sets, or bags of words)
Changed tense / rewrote a little: "The man likes river swimming" vs. "The man likes to swim in rivers" (treat as sets, but use shingles)

50 Representing a doc by its set of k-shingles.
A = humans can swim; B = The man went swimming; C = dogs will bark. [Figure: two Boolean matrices with one row per document: the first has word columns (humans, can, swim, the, man, went, swimming, dogs, will, bark), the second has 2-shingle columns (hu, ma, an, do, sw, im, th, wi, ba).]
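The shingle sets above can be produced with a simple helper. A minimal sketch in Python (the function name and the choice of k = 2 are illustrative, not from the slides):

```python
def shingles(text, k=2):
    """Return the set of character k-shingles of a string (lowercased)."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

docs = {
    "A": "humans can swim",
    "B": "the man went swimming",
    "C": "dogs will bark",
}
# One shingle set per document; the union of all the sets gives the matrix columns.
sets = {name: shingles(t) for name, t in docs.items()}
```

Because shingles overlap, even a short document yields many of them, which is what makes the Boolean matrix representation work.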

51 Another advantage of shingles
You have seen that shingles give us an advantage over raw words. A = humans can swim; B = The man went swimming; C = dogs will bark. Another reason to use shingles is that they reward word order a little. Suppose we have two documents: one about coal mining, with data from Kentucky, and one about data mining in Ireland. Obviously both have the words "data" and "mining", but only the latter will have the shingle "a_m".

52 Jaccard Similarity of Sets
(Review) The Jaccard similarity of two sets is the size of their intersection divided by the size of their union: Sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|. Example: Sim(U3, U5) = 6/7, also written J(U3, U5). The Jaccard distance, 1 - J, is a metric (on finite sets). The similarity's range is between zero and one; by convention, if both sets are empty, sim(A, B) = 1.
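A minimal sketch of the definition in Python (the empty-set convention follows the slide):

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

For example, jaccard({1, 2, 3}, {2, 3, 4}) is 2/4 = 0.5.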

53 Goals Many data-mining problems can be expressed as finding "similar" sets: Pages with similar words, e.g., for classification by topic. Netflix users with similar tastes in movies, for recommendation systems. Dual: movies with similar sets of fans. Images of related things. Time series motifs. Fingerprints.

54 Important I use the word "documents" to be consistent with the literature. However, it is possible to see time series, DNA, videos, images, songs, etc. as "documents".

55 Similar Documents Given a body of documents, e.g., the Web, find pairs of documents with a lot of text in common, e.g.: Mirror sites, or approximate mirrors. Application: Don’t want to show both in a search. Plagiarism, including large quotations. Similar news articles at many news sites. Application: Cluster articles by “same story.”

56 Three Essential Techniques for Similar Documents
Shingling: convert documents, e-mails, etc., to sets. Minhashing: convert large sets to short signatures, while preserving similarity. Locality-sensitive hashing: focus on pairs of signatures likely to be similar.

57 Shingles and Similarity
Documents that are intuitively similar will have many shingles in common. Changing a word only affects k-shingles within distance k-1 from the word. Example: k = 3, "The dog which chased the cat" versus "The dog that chased the cat". The only 3-shingles replaced are g_w, _wh, whi, hic, ich, ch_, and h_c. All other shingles are the same for the two documents. Reordering paragraphs only affects the 2k shingles that cross paragraph boundaries. We need to assure ourselves that replacing a document by its shingles still lets us detect pairs of documents that are intuitively similar. In fact, similarity of shingle sets captures many of the kinds of document changes that we would regard as keeping the documents similar. [Speaker notes] For example, if we are using k-shingles and we change one word, only the k shingles to the left and right of the word, as well as the shingles within the word, can be affected. And we can reorder entire paragraphs without affecting any shingles except those that cross the boundaries between the paragraph we moved and the paragraphs just before and just after, in both the new and old positions. For example, suppose we use k = 3 and we correctly change the "which" in this sentence to "that". The only shingles that can be affected are the ones that begin at most two characters before "which" and end at most two characters after "which": g-blank-w, blank-w-h, and so on, up to h-blank-c, a total of seven shingles. These are replaced by different shingles, g-blank-t and so on, but all other shingles remain the same in the two sentences.
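The seven-shingle claim is easy to check mechanically. A sketch, with underscores standing in for spaces as in the slide's notation:

```python
def shingles(text, k=3):
    """Character k-shingles, with spaces written as underscores."""
    text = text.lower().replace(" ", "_")
    return {text[i:i + k] for i in range(len(text) - k + 1)}

s1 = shingles("The dog which chased the cat")
s2 = shingles("The dog that chased the cat")
changed = s1 ^ s2   # symmetric difference: shingles in one sentence but not the other
```

Exactly seven shingles of the first sentence disappear (the ones straddling "which"); note that a shingle such as _th survives because "_the" occurs elsewhere in both sentences.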

58 Working Assumption Documents that have lots of shingles in common have similar text, even if the text appears in a different order. Careful: you must pick k large enough, or most documents will have most shingles. k = 5 is OK for short documents; k = 10 is better for long documents. We can use cross-validation to find k.

59 MinHashing
Data as Sparse Matrices; Jaccard Similarity Measure; Constructing Signatures

60 Basic Data Model: Sets Many similarity problems can be couched as finding subsets of some universal set that have significant intersection. Examples include: Documents represented by their sets of shingles (or hashes of those shingles). Similar customers or products.

61 Important Point The number of rows is often very large. Imagine it is the set of all books. [Figure: two tall columns C1 and C2 of a Boolean matrix, with rows (0,1), (1,0), (1,1), (0,0), …]

62 When is this computationally challenging?
When the sets are so large or so many that they cannot fit in main memory. Or, when there are so many sets that comparing all pairs of sets takes too much time. Or both. Find the pair of users that are most similar to each other:

bestSoFar = inf;
for i = 1 : num_users
    for j = i+1 : num_users
        if Jdist(user(i), user(j)) < bestSoFar
            bestSoFar = Jdist(user(i), user(j));
            disp(['So far, the best pair is ', num2str(i), ' ', num2str(j)])
        end
    end
end

63 Outline: Finding Similar Columns (or rows)
The matrix is so big it lives on disk. Compute signatures of columns (or rows): small summaries of the columns (or rows). Signatures are small enough to live in main memory. Examine pairs of signatures to find similar signatures. Essential: similarities of signatures and columns are highly related. Optional: check that columns with similar signatures are really similar, making only a few disk accesses.

64 Warnings Comparing all pairs of signatures may take too much time, even if not too much space. A job for Locality-Sensitive Hashing. These methods can produce false negatives, and even false positives (if the optional check is not made).

65 Signatures Key idea: "hash" each column C to a small signature Sig(C), such that: 1. Sig(C) is small enough that we can fit a signature in main memory for each column. 2. Sim(C1, C2) is approximately the same as the "similarity" of Sig(C1) and Sig(C2). (The signatures live in main memory; the original matrix lives on disk.)

66 Four Types of Rows Given columns C1 and C2, rows may be classified as:
     C1  C2
a:    1   1   (I like it, you like it)
b:    1   0   (I like it, you don't)
c:    0   1   (I don't like it, you do)
d:    0   0   (I don't like it, you don't)
Also, a = # rows of type a, etc. Note Sim(C1, C2) = a / (a + b + c). Note d does not appear in the formula, yet most rows are of type d.
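Counting the four row types directly reproduces the Jaccard similarity. A small sketch (the example columns are made up for illustration):

```python
def row_types(c1, c2):
    """Count the four row types (a, b, c, d) for two Boolean columns."""
    a = sum(1 for x, y in zip(c1, c2) if (x, y) == (1, 1))
    b = sum(1 for x, y in zip(c1, c2) if (x, y) == (1, 0))
    c = sum(1 for x, y in zip(c1, c2) if (x, y) == (0, 1))
    d = sum(1 for x, y in zip(c1, c2) if (x, y) == (0, 0))
    return a, b, c, d

c1 = [0, 1, 1, 0, 1, 0, 0, 0]
c2 = [1, 0, 1, 0, 1, 0, 0, 0]
a, b, c, d = row_types(c1, c2)
sim = a / (a + b + c)   # rows of type d are ignored, as on the slide
```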

67 Minhashing Imagine the rows permuted randomly.
Define "hash" function h(C) = the number of the first (in the permuted order) row in which column C has a 1. Use several (e.g., 100) independent hash functions to create a signature. In practice, we don't really permute the rows; that would take too long.
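A minimal sketch of minhashing by explicit permutation. This is fine for small matrices; as the slide notes, real implementations simulate the permutations with random hash functions instead of materializing them. The function name and example matrix are illustrative:

```python
import random

def minhash_signature(matrix, num_hashes=100, seed=0):
    """matrix[r][c] is 0/1; returns a num_hashes x num_columns signature.
    Each 'hash function' is a random permutation of the rows; the signature
    entry is the position of the first permuted row where the column has a 1."""
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    sig = [[None] * n_cols for _ in range(num_hashes)]
    for h in range(num_hashes):
        order = list(range(n_rows))
        rng.shuffle(order)
        for c in range(n_cols):
            for rank, r in enumerate(order):
                if matrix[r][c] == 1:
                    sig[h][c] = rank
                    break
    return sig

m = [[1, 0, 1],
     [0, 1, 1],
     [1, 1, 0]]
sig = minhash_signature(m, num_hashes=4)
```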

68 Minhashing Example
[Figure: an input Boolean matrix next to an empty signature matrix M; one random permutation of the rows (3 4 7 6 1 2 5) is shown.]

69 Minhashing Example
[Figure: the same matrices, with the first signature row filled in from the first permutation.]

70 Minhashing Example
[Figure: a second random permutation of the rows, filling in the second signature row.]

71 Minhashing Example
[Figure: the input matrix with three row permutations and the completed three-row signature matrix M.]

72 Surprising Property The probability (over all permutations of the rows) that h (C1) = h (C2) is the same as Sim (C1, C2). Both are a /(a +b +c )! Why? Look down the permuted columns C1 and C2 until we see a 1. If it’s a type-a row, then h (C1) = h (C2). If a type-b or type-c row, then not.
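The property is easy to check empirically. A sketch (the column vectors and trial count are made up for illustration):

```python
import random

def minhash_agreement(c1, c2, trials=20000, seed=1):
    """Estimate P[h(C1) == h(C2)] over random row permutations, where h(C)
    is the first row (in permuted order) in which the column has a 1."""
    rng = random.Random(seed)
    rows = list(range(len(c1)))
    hits = 0
    for _ in range(trials):
        rng.shuffle(rows)
        h1 = next(r for r in rows if c1[r] == 1)
        h2 = next(r for r in rows if c2[r] == 1)
        hits += (h1 == h2)
    return hits / trials

c1 = [1, 0, 1, 0, 1, 1, 0, 0]
c2 = [1, 1, 1, 0, 0, 1, 0, 0]
a = sum(1 for x, y in zip(c1, c2) if x == 1 and y == 1)   # type-a rows: 3
bc = sum(1 for x, y in zip(c1, c2) if x != y)             # type-b + type-c rows: 2
jacc = a / (a + bc)                                       # 0.6
est = minhash_agreement(c1, c2)                           # should be close to jacc
```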

73 Similarity for Signatures
The similarity of two signatures is the fraction of the hash functions in which they agree (one minus this fraction gives a distance).

74 Min Hashing – Example [Figure: the input matrix, three permutations, and the signature matrix M, with a table comparing column/column Jaccard similarities to signature/signature agreement fractions.]

75 Locality-Sensitive Hashing
Focusing on Similar Minhash Signatures

76 Checking All Pairs is Hard
While the signatures of all columns may fit in main memory, comparing the signatures of all pairs of columns is quadratic in the number of columns. Example: 10^6 columns implies 5×10^11 column-comparisons. At 1 microsecond per comparison: about 6 days.
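The back-of-the-envelope numbers can be reproduced directly:

```python
cols = 10**6
pairs = cols * (cols - 1) // 2     # all distinct pairs: ~5 * 10^11
seconds = pairs * 1e-6             # at 1 microsecond per comparison
days = seconds / (60 * 60 * 24)    # roughly 5.8 days
```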

77 Locality-Sensitive Hashing
General idea: Use a function f(x, y) that tells whether or not x and y are a candidate pair: a pair of elements whose similarity must be evaluated. For minhash matrices: hash columns to many buckets, and make elements of the same bucket candidate pairs.

78 Candidate Generation From Minhash Signatures
Pick a similarity threshold s, a fraction < 1. A pair of columns c and d is a candidate pair if their signatures agree in at least fraction s of the rows. I.e., M (i, c ) = M (i, d ) for at least fraction s values of i.

79 LSH for Minhash Signatures
Big idea: hash columns of signature matrix M several times. Arrange that (only) similar columns are likely to hash to the same bucket (plus maybe a few false positives). Candidate pairs are those that hash at least once to the same bucket.

80 We have, say, a million columns. They are all pretty much different from each other, except the pair highlighted below. However, that pair differs in two places…
[Figure: a wide signature matrix; the two highlighted columns agree in every row but two.]

81 However, suppose you just looked at a "band" of rows.
For some bands (including the one below) we do have a perfect match between the two similar columns… [Figure: the same matrix with one band of rows highlighted; within that band the two columns match exactly.]

82 However, for some bands (including the one below) we do not have a perfect match between the two similar columns. So this slide and the last suggest we can get lucky or unlucky… We can improve our luck with many bands… [Figure: the same matrix with a band highlighted in which the two columns disagree.]

83 [Figure: close-up of a band in which the two columns match: 3 2 1 versus 3 2 1.]

84 Partition into Bands – (2)
Divide matrix M into b bands of r rows each. For each band, hash its portion of each column to a hash table with k buckets; make k as large as possible. Candidate column pairs are those that hash to the same bucket for ≥ 1 band. Tune b and r to catch most similar pairs, but few non-similar pairs.
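A sketch of the banding step. Here Python's dictionary keyed by the band's tuple of values stands in for the explicit k-bucket hash table; the function name and example signatures are illustrative:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(sig, b, r):
    """sig: one signature per column, each of length b*r.
    Columns whose r band values match exactly (i.e., fall in the same
    bucket) for at least one band become candidate pairs."""
    assert all(len(col) == b * r for col in sig)
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for c, col in enumerate(sig):
            buckets[tuple(col[band * r:(band + 1) * r])].append(c)
        for members in buckets.values():
            candidates.update(combinations(members, 2))
    return candidates

# Three columns, 2 bands of 2 rows: columns 0 and 1 collide in band 0,
# columns 1 and 2 collide in band 1.
cand = lsh_candidates([(1, 2, 3, 4), (1, 2, 9, 9), (5, 6, 9, 9)], b=2, r=2)
```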

85 Key insight: Most buckets will have only one entry
Such items are almost certainly unique documents that don't need to be checked.

86 Simplifying Assumption
There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band.

87 Example: Effect of Bands
Suppose 100,000 columns. Signatures of 100 integers. Therefore, signatures take 40 MB. Want all 80%-similar pairs. 5,000,000,000 pairs of signatures can take a while to compare. Choose 20 bands of 5 integers/band.

88 Suppose C1, C2 are 80% Similar
Choose 20 bands of 5 integers/band. Probability C1, C2 are identical in one particular band: (0.8)^5 ≈ 0.328. This seems low, but it is only one of 20 chances we have to get a collision. Probability C1, C2 are not identical in any of the 20 bands: (1 - 0.328)^20 ≈ 0.00035, i.e., about 1/3000th of the 80%-similar column pairs are false negatives.
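These numbers follow directly from the band parameters:

```python
b, r, s = 20, 5, 0.8          # bands, rows per band, column similarity

p_band = s ** r               # two 80%-similar columns match in one band: ~0.328
p_miss = (1 - p_band) ** b    # they match in no band at all (a false negative): ~0.00035
```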

89 LSH Involves a Tradeoff
Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives. Example: if we had only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up. Because we have to check these false positives, we slow down.

90 LSH Summary We have taken a quadratic algorithm and made it linear (with high constants). We can have very few false negatives. We can have very few false positives. We can have no false positives at all, if we do a follow-up check on disk with the original data. This is a truly nice idea, but it does need careful parameter tuning.

91 Summary for Data Mining
At all costs, avoid algorithms that have to go to disk a lot. We typically need to make changes of representation, both to achieve speed-up and to obtain good, meaningful results. We probably need to learn to be content with high-quality, but approximate, solutions. If we insist on exact solutions, we are condemned to have very slow algorithms.

92 If you enjoyed this class, like me on yelp!

