Similarity Searches in Sequence Databases


Similarity Searches in Sequence Databases Sang-Hyun Park KMeD Research Group Computer Science Department University of California, Los Angeles

Contents Introduction Whole Sequence Searches Subsequence Searches Segment-Based Subsequence Searches Multi-Dimensional Subsequence Searches Conclusion

What is a Sequence? A sequence is an ordered list of elements. Sequences are a principal data format in many applications. [Figure: temperature (°C) plotted against time, sampled from 8AM to 10PM]

What is Similarity Search? Similarity search finds sequences whose changing patterns are similar to that of a query sequence. Examples: detect stocks with similar growth patterns; find persons with similar voice clips; find patients whose brain tumors have similar evolution patterns. Similarity search helps in clustering, data mining, and rule discovery.

Classification of Similarity Search Similarity searches are classified as whole sequence searches and subsequence searches. Example: S = ⟨1,2,3⟩, Subsequences(S) = { ⟨1⟩, ⟨2⟩, ⟨3⟩, ⟨1,2⟩, ⟨2,3⟩, ⟨1,2,3⟩ }. In whole sequence searches, the sequence S itself is compared with a query sequence Q. In subsequence searches, every possible subsequence of S can be compared with a query sequence q.

Similarity Measure: Lp Distance Metric. L1: Manhattan (city-block) distance. L2: Euclidean distance. L∞: maximum distance over all element pairs. Lp metrics require that the two sequences have the same length.

Similarity Measure (2): Time Warping Distance. Originally introduced in the area of speech recognition, it allows sequences to be stretched along the time axis: ⟨3,5,6⟩ → ⟨3,3,5,6⟩ → ⟨3,3,3,5,6⟩ → ⟨3,3,3,5,5,6⟩ → … Each element of a sequence can be mapped to one or more neighboring elements of the other sequence. This is useful in applications where sequences may have different lengths or different sampling rates, e.g. Q = ⟨10, 15, 20⟩ and S = ⟨10, 15, 16, 20⟩.

Similarity Measure (3): Time Warping Distance (2). Defined recursively and computed by the dynamic programming technique in O(|S||Q|):
DTW(S, Q) = DBASE(S[1], Q[1]) + min { DTW(S, Q[2:-]), DTW(S[2:-], Q), DTW(S[2:-], Q[2:-]) }
DBASE(S[1], Q[1]) = |S[1] − Q[1]|^p

Similarity Measure (4): Time Warping Distance (3). Example: S = ⟨4,5,6,7,6,6⟩, Q = ⟨3,4,3⟩. Each cell of the cumulative distance table holds |S[i] − Q[j]| + min(V1, V2, V3), where V1, V2, V3 are the three already-computed neighboring cells. When using L1 as DBASE:

        Q=3  Q=4  Q=3
S=4      1    1    2
S=5      3    2    3
S=6      6    4    5
S=7     10    7    8
S=6     13    9   10
S=6     16   11   12

DTW(S, Q) = 12 (the final cell).
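As a hedged illustration of the recursion and the table above, a short Python sketch (function names are ours, not from the slides):

```python
def dtw(s, q, base=lambda a, b: abs(a - b)):
    """Time warping distance via dynamic programming, O(|s||q|).

    Each cell adds the base distance of the current element pair to the
    minimum of the three neighboring subproblems, as in the recursion.
    """
    n, m = len(s), len(q)
    INF = float("inf")
    D = [[INF] * m for _ in range(n)]  # D[i][j]: distance of s[:i+1] vs q[:j+1]
    for i in range(n):
        for j in range(m):
            d = base(s[i], q[j])
            if i == 0 and j == 0:
                D[i][j] = d
            else:
                prev = min(D[i - 1][j] if i > 0 else INF,
                           D[i][j - 1] if j > 0 else INF,
                           D[i - 1][j - 1] if i > 0 and j > 0 else INF)
                D[i][j] = d + prev
    return D[n - 1][m - 1]

# Worked example from the slide, with L1 as the base distance:
print(dtw([4, 5, 6, 7, 6, 6], [3, 4, 3]))  # -> 12
```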

False Alarm and False Dismissal. False alarms are candidates that are not actually similar to the query; minimize false alarms for efficiency. False dismissals are similar sequences not retrieved by the index search; avoid false dismissals for correctness.

Contents Introduction Whole Sequence Searches Subsequence Searches Segment-Based Subsequence Searches Multi-Dimensional Subsequence Searches Conclusion

Problem Definition. Input: a set of data sequences {S}, a query sequence Q, and a distance tolerance ε. Output: the set of data sequences whose distances to Q are within ε. Similarity measure: the time warping distance function DTW, with L∞ as the distance function for each element pair; if the distance of every element pair is within ε, then DTW(S,Q) ≤ ε.

Previous Approaches. Naïve Scan [Ber96]: read every data sequence from the database and apply the dynamic programming technique; for m data sequences with average length L, O(mL|Q|). FastMap-Based Technique [Yi98]: use the FastMap technique for feature extraction, map features into multi-dimensional points, and use Euclidean distance in index space for filtering; it cannot guarantee "no false dismissal".

Previous Approaches (2). LB-Scan [Yi98]: read every data sequence from the database and apply the lower-bound distance function Dlb, which satisfies the lower-bounding theorem: Dlb(S,Q) > ε ⇒ DTW(S,Q) > ε. It is faster than the original time warping distance function (O(|S|+|Q|) vs. O(|S||Q|)) and guarantees no false dismissal, but it is still based on sequential scanning.

Proposed Approach. Goal: no false dismissal and high query processing performance. Sketch: extract a time-warping invariant feature vector, build a multi-dimensional index, and use a lower-bound distance function for filtering.

Proposed Approach (2): Feature Extraction. F(S) = ⟨First(S), Last(S), Max(S), Min(S)⟩; F(S) is invariant to the time warping transformation. Distance function for feature vectors:
DFT(F(S), F(Q)) = max { |First(S) − First(Q)|, |Last(S) − Last(Q)|, |Max(S) − Max(Q)|, |Min(S) − Min(Q)| }

Proposed Approach (3): Distance Function for Feature Vectors (2). DFT satisfies the lower-bounding theorem: DFT(F(S),F(Q)) > ε ⇒ DTW(S,Q) > ε. It is more accurate than the Dlb proposed in LB-Scan, and faster (O(1) vs. O(|S|+|Q|)).
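A minimal sketch of the feature vector and its lower-bounding behavior, assuming (as the problem-definition slide suggests) that whole-sequence DTW here aggregates element-pair distances with max, i.e. L∞; names are illustrative, not from the slides:

```python
def features(s):
    # Time-warping invariant feature vector: repeating/stretching elements
    # never changes the first, last, max, or min value of a sequence.
    return (s[0], s[-1], max(s), min(s))

def d_ft(fs, fq):
    # Max of the four coordinate differences: cheap O(1) filter distance.
    return max(abs(a - b) for a, b in zip(fs, fq))

def dtw_inf(s, q):
    # Time warping with L_inf aggregation: the largest element-pair
    # distance along the best warping path.
    INF = float("inf")
    n, m = len(s), len(q)
    D = [[INF] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = abs(s[i] - q[j])
            if i == 0 and j == 0:
                D[i][j] = d
            else:
                prev = min(D[i - 1][j] if i else INF,
                           D[i][j - 1] if j else INF,
                           D[i - 1][j - 1] if i and j else INF)
                D[i][j] = max(d, prev)
    return D[-1][-1]

s, q = [10, 15, 16, 20], [10, 15, 20]
assert d_ft(features(s), features(q)) <= dtw_inf(s, q)  # lower bound holds
```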

Proposed Approach (4): Indexing and Query Processing. Indexing: build a multi-dimensional index from the set of feature vectors; each index entry is ⟨First(S), Last(S), Max(S), Min(S), Identifier(S)⟩. Query processing: extract the feature vector F(Q); perform a range query in index space to find data points inside the query rectangle ⟨[First(Q)−ε, First(Q)+ε], [Last(Q)−ε, Last(Q)+ε], [Max(Q)−ε, Max(Q)+ε], [Min(Q)−ε, Min(Q)+ε]⟩; then perform post-processing to discard false alarms.

Performance Evaluation. Implementation: C++ on the UNIX operating system, with an R-tree as the multi-dimensional index. Experimental setup: the S&P 500 stock data set (m=545, L=232) and a random-walk synthetic data set, on a Sun SPARC Ultra-5.

Performance Evaluation (2): Filtering Ratio. Better than LB-Scan.

Performance Evaluation (3): Query Processing Time. Faster than LB-Scan and Naïve Scan.

Contents Introduction Whole Sequence Searches Subsequence Searches Segment-Based Subsequence Searches Multi-Dimensional Subsequence Searches Conclusion

Problem Definition. Input: a set of data sequences {S}, a query sequence q, and a distance tolerance ε. Output: the set of subsequences whose distances to q are within ε. Similarity measure: the time warping distance function DTW, with any Lp metric as the distance function for element pairs.

Previous Approaches. Naïve-Scan [Ber96]: read every data subsequence from the database and apply the dynamic programming technique; for m data sequences with average length L, O(mL²|q|).

Previous Approaches (2). ST-Index [Fal94]: assumes that the minimum query length w is known in advance; locates a sliding window of size w at every possible location, extracts a feature vector inside the window, maps each feature vector into a point, and groups trails of points into MBRs (Minimum Bounding Rectangles); uses Euclidean distance in index space for filtering. It cannot guarantee "no false dismissal".

Proposed Approach. Goal: no false dismissal, high performance, and support for diverse similarity measures. Sketch: convert sequences into sequences of discrete symbols, build a sparse suffix tree, use a lower-bound distance function for filtering, and apply branch-pruning to reduce the search space.

Proposed Approach (2): Conversion. Generate categories from the distribution of element values (maximum-entropy method, equal-interval method, or DISC method), then convert each element to the symbol of the corresponding category. Example: with categories A = [0, 1.0], B = [1.1, 2.0], C = [2.1, 3.0], D = [3.1, 4.0], the sequence S = ⟨1.3, 1.6, 2.9, 3.3, 1.5, 0.1⟩ converts to SC = ⟨B, B, C, D, B, A⟩.
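The conversion step can be sketched in a few lines of Python, using the category boundaries from the example above (the function name is ours):

```python
# Category boundaries from the slide's example.
CATEGORIES = {"A": (0.0, 1.0), "B": (1.1, 2.0), "C": (2.1, 3.0), "D": (3.1, 4.0)}

def to_symbols(s):
    """Convert each numeric element to the symbol of its category."""
    out = []
    for v in s:
        for sym, (lo, hi) in sorted(CATEGORIES.items()):
            if lo <= v <= hi:
                out.append(sym)
                break
        else:
            raise ValueError(f"no category for {v}")
    return "".join(out)

print(to_symbols([1.3, 1.6, 2.9, 3.3, 1.5, 0.1]))  # -> BBCDBA
```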

Proposed Approach (3): Indexing. Extract suffixes from the sequences of discrete symbols. Example: from S1C = ⟨A, B, B, A⟩, we extract four suffixes: ABBA, BBA, BA, A.

Proposed Approach (4): Indexing (2). Build a suffix tree. The suffix tree was originally proposed to retrieve substrings exactly matching a query string. It consists of nodes and edges; each suffix is represented by the path from the root node to a leaf node, and the labels on the path from the root to an internal node Ni represent the longest common prefix of the suffixes under Ni. The suffix tree is built with O(mL) computation and space complexity.

Proposed Approach (4): Indexing (3). Example: the suffix tree built from S1C = ⟨A, B, B, A⟩ and S2C = ⟨A, B⟩. [Figure: tree with terminator symbol $; its six leaves correspond to the suffixes S1C[1:-], S1C[2:-], S1C[3:-], S1C[4:-], S2C[1:-], S2C[2:-]]
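A real suffix tree compresses unary paths and is built in O(mL); as a hedged illustration only, here is a naive suffix trie (one node per symbol, no edge compression, our own naming) over the two example sequences:

```python
def build_suffix_trie(sequences):
    """Naive suffix trie: insert every suffix of every sequence,
    terminated by '$'. Illustrative only; not the O(mL) construction."""
    root = {}
    for seq_id, seq in sequences.items():
        for start in range(len(seq)):
            node = root
            for sym in seq[start:] + "$":
                node = node.setdefault(sym, {})
            node["leaf"] = (seq_id, start + 1)  # marks suffix S[start+1:-]
    return root

def leaves(node):
    """Yield the (sequence id, suffix start) pairs stored at the leaves."""
    if "leaf" in node:
        yield node["leaf"]
    for sym, child in node.items():
        if sym != "leaf":
            yield from leaves(child)

trie = build_suffix_trie({"S1C": "ABBA", "S2C": "AB"})
print(sorted(leaves(trie)))  # six suffixes, as in the example tree
```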

Proposed Approach (5): Query Processing. Given a query (q, ε), index searching over the suffix tree produces candidates; post-processing against the data sequences discards false alarms and returns the answers.

Proposed Approach (6): Index Searching. Visit each node of the suffix tree in depth-first traversal, build the lower-bound distance table for q and the edge labels, and inspect the last columns of newly added rows to find candidates. Apply branch-pruning to reduce the search space. Branch-pruning theorem: if all columns of the last row of the distance table have values larger than the distance tolerance ε, adding more rows to this table cannot yield new values less than or equal to ε.

Proposed Approach (7): Index Searching (2). Example: q = ⟨2, 2, 1⟩, ε = 1.5. [Figure: lower-bound distance tables for q accumulated along the edge labels of nodes N1–N4; branches whose last table row entirely exceeds ε are pruned]

Proposed Approach (8): Lower-Bound Distance Function DTW-LB. For a category A with range [A.min, A.max] and a query element v:
DBASE-LB(A, v) = 0 if A.min ≤ v ≤ A.max (possible minimum distance is 0); (A.min − v)^p if v < A.min; (v − A.max)^p if v > A.max.

Proposed Approach (9): Lower-Bound Distance Function DTW-LB (2). DTW-LB satisfies the lower-bounding theorem, DTW-LB(sC, q) > ε ⇒ DTW(s, q) > ε, and has computation complexity O(|sC||q|):
DTW-LB(sC, q) = DBASE-LB(sC[1], q[1]) + min { DTW-LB(sC, q[2:-]), DTW-LB(sC[2:-], q), DTW-LB(sC[2:-], q[2:-]) }

Proposed Approach (10): Computation Complexity. The cost has two terms: one for index searching and one for post-processing. Here m is the number of data sequences, L is the average length of data sequences, RP (≥ 1) is the reduction factor from branch-pruning, RD (≥ 1) is the reduction factor from sharing distance tables, and n is the number of subsequences requiring post-processing.

Proposed Approach (11): Sparse Indexing. The index size is linear in the number of suffixes stored. To reduce the index size, we build a sparse suffix tree (SST): we store the suffix SC[i:-] only if SC[i] ≠ SC[i−1]. Example: for SC = ⟨A, A, A, A, C, B, B⟩ we store only three suffixes (SC[1:-], SC[5:-], and SC[6:-]), giving compaction ratio C = 7/3.
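The stored-suffix rule and the compaction ratio from the example above can be sketched directly (our naming; positions are 1-indexed as in the slides):

```python
from fractions import Fraction

def sparse_suffix_starts(sc):
    """1-indexed start positions of the suffixes kept in the sparse
    suffix tree: position i is stored only if SC[i] != SC[i-1]."""
    return [i + 1 for i in range(len(sc)) if i == 0 or sc[i] != sc[i - 1]]

sc = "AAAACBB"
starts = sparse_suffix_starts(sc)
compaction = Fraction(len(sc), len(starts))  # C = |SC| / #stored suffixes
print(starts, compaction)  # -> [1, 5, 6] 7/3
```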

Proposed Approach (12): Sparse Indexing (2). When traversing the suffix tree, we need to handle non-stored suffixes and compute their distances to q. Assume the first k elements of sC have the same value; then sC[1:-] is stored but sC[i:-] (i = 2, 3, …, k) is not. For non-stored suffixes we introduce another lower-bound distance function:
DTW-LB2(sC[i:-], q) = DTW-LB(sC, q) − (i − 1) × DBASE-LB(sC[1], q[1])
DTW-LB2 satisfies the lower-bounding theorem and is O(1) once DTW-LB(sC, q) is given.

Proposed Approach (13): Sparse Indexing (3). With sparse indexing, the index-searching term of the complexity is further reduced by the compaction ratio C. As before, m is the number of data sequences, L is the average length of data sequences, n is the number of subsequences requiring post-processing, RP (≥ 1) is the reduction factor from branch-pruning, and RD (≥ 1) is the reduction factor from sharing distance tables.

Performance Evaluation. Implementation: C++ on the UNIX operating system, with maximum-entropy (ME) categorization and a disk-based suffix tree construction algorithm. Experimental setup: the S&P 500 stock data set (m=545, L=232) and a random-walk synthetic data set, on a Sun SPARC Ultra-5.

Performance Evaluation (2): Comparison with Naïve-Scan, for increasing distance tolerances (S&P 500 stock data set, |q| = 20).

Performance Evaluation (3): Scalability Test, for increasing average length of data sequences (random-walk data set, |q| = 20, m = 200).

Performance Evaluation (4): Scalability Test (2), for increasing total number of data sequences (random-walk data set, |q| = 20, L = 200).

Contents Introduction Whole Sequence Searches Subsequence Searches Segment-Based Subsequence Searches Multi-Dimensional Subsequence Searches Conclusion

Introduction. We extend the proposed subsequence searching method to large sequence databases. In the retrieval of similar subsequences with the time warping distance function, sequential scanning is O(mL²|q|) and the proposed method is O(mL²|q| / R) (R ≥ 1). The quadratic dependence on L makes both search algorithms suffer severe performance degradation when L is very large, so for a database with long sequences we need a new searching scheme linear in L.

SBASS. We propose a new searching scheme: the Segment-Based Subsequence Searching scheme (SBASS). Sequences are divided into a series of piece-wise segments, whose lengths may differ. When a query sequence q with k segments is submitted, q is compared with those subsequences which consist of k consecutive data segments. SS denotes the segmented sequence of S. Example: S = ⟨4,5,8,9,11,8,4,3⟩ with |S| = 8 gives SS = ⟨⟨4,5,8,9,11⟩, ⟨8,4,3⟩⟩ with |SS| = 2.

SBASS (2). Example: when SS has five segments SS[1], …, SS[5] and qS has two segments, only four subsequences of SS are compared with qS: ⟨SS[1],SS[2]⟩, ⟨SS[2],SS[3]⟩, ⟨SS[3],SS[4]⟩, ⟨SS[4],SS[5]⟩.
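The candidate enumeration above is just a sliding window over segments; a minimal sketch (the values for segments SS[3]–SS[5] below are hypothetical, invented only to make the example runnable):

```python
def segment_windows(segments, k):
    """All subsequences made of k consecutive segments; these are the
    only candidates SBASS compares against a k-segment query."""
    return [segments[i:i + k] for i in range(len(segments) - k + 1)]

# Five-segment data sequence (first two segments from the slides,
# the rest hypothetical) and a two-segment query:
ss = [[4, 5, 8, 9, 11], [8, 4, 3], [5, 7], [9, 6, 2], [3, 4]]
windows = segment_windows(ss, 2)
print(len(windows))  # -> 4
```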

SBASS (3). For the SBASS scheme, we define the piece-wise time warping distance function over the aligned segment pairs (where k = |qS| = |sS|). Sequential scanning for the SBASS scheme is O(mL|q|); we introduce an indexing technique with O(mL|q| / R) (R ≥ 1).

Sketch of Proposed Approach. Indexing: convert sequences to segmented sequences; extract a feature vector from each segment; categorize the feature vectors; convert segmented sequences to sequences of symbols; construct the suffix tree from the sequences of symbols. Query processing: traverse the suffix tree to find candidates, then discard false alarms in post-processing.

Segmentation Approach. Divide at peak points; divide further if the maximum deviation from the interpolation line is too large; eliminate noise. Compaction ratio: C = |S| / |SS|.

Feature Extraction. From each subsequence segment, extract a feature vector (V1, VL, L, δ+, δ−): the first value V1, the last value VL, the segment length L, and the maximum positive and negative deviations δ+ and δ− of the segment from the straight line interpolated between V1 and VL.
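A sketch of this feature extraction, under the assumption (inferred from the worked example on the next slide) that δ+ and δ− are the largest deviations above and below the line interpolated between the segment's endpoints; the function name is ours:

```python
def segment_features(seg):
    """Feature vector (V1, VL, L, dev+, dev-): first value, last value,
    length, and maximum positive/negative deviations from the straight
    line interpolated between V1 and VL."""
    v1, vl, n = seg[0], seg[-1], len(seg)
    step = (vl - v1) / (n - 1) if n > 1 else 0.0
    devs = [seg[i] - (v1 + step * i) for i in range(n)]
    return (v1, vl, n, max(max(devs), 0.0), max(-min(devs), 0.0))

# Example segments from the slides:
print(segment_features([4, 5, 8, 8, 8, 8, 9, 11]))  # -> (4, 11, 8, 2.0, 1.0)
print(segment_features([8, 4, 3]))                  # -> (8, 3, 3, 0.0, 1.5)
```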

Categorization and Index Construction. Group similar feature vectors together using multi-dimensional categorization methods such as the Multi-attribute Type Abstraction Hierarchy (MTAH); assign a unique symbol to each category; convert segmented sequences to sequences of symbols; then construct the suffix tree from the sequences of symbols. Example: S = ⟨4,5,8,8,8,8,9,11,8,4,3⟩, SS = ⟨⟨4,5,8,8,8,8,9,11⟩, ⟨8,4,3⟩⟩, SF = ⟨(4,11,8,2,1), (8,3,3,0,1.5)⟩, SC = ⟨A,B⟩.

Query Processing. For query processing, we calculate the lower-bound distances between symbols and keep them in a table. Given a query sequence q and a distance tolerance ε: convert q to qS and then to qC; search the suffix tree to find those subsequences whose lower-bound distances to qC are within ε; discard false alarms in post-processing.

Query Processing (2). Flow: the query (q, ε) is converted to qS and then qC; index searching over the suffix tree produces candidates; post-processing against the data sequences discards false alarms and returns the answers.

Computation Complexity. Sequential scanning is O(mL|q|). In the complexity of the proposed search algorithm, n is the number of subsequences contained in the candidates, C is the compaction ratio (the average number of elements per segment), and RD (≥ 1) is the reduction factor from sharing edges of the suffix tree.

Performance Evaluation. Test set: pseudo-periodic synthetic sequences (m = 100, L = 10,000). We achieved up to 6.5 times speed-up compared to sequential scanning. [Figure: query time (sec) vs. distance tolerance (0.2–1.0) for SeqScan and our approach]

Contents Introduction Whole Sequence Searches Subsequence Searches Segment-Based Subsequence Searches Multi-Dimensional Subsequence Searches Conclusion

Introduction. So far, we have assumed that elements have single-dimensional numeric values. Now we consider multi-dimensional sequences, such as image sequences, video streams, and medical image sequences.

Introduction (2). In multi-dimensional sequences, elements are represented by feature vectors: S = ⟨S[1], …, S[N]⟩ with S[i] = (S[i][1], …, S[i][F]). Our proposed subsequence searching techniques extend to the retrieval of similar multi-dimensional subsequences.

Introduction (3): Multi-Dimensional Time Warping Distance.
DMTW(S, Q) = DMBASE(S[1], Q[1]) + min { DMTW(S, Q[2:-]), DMTW(S[2:-], Q), DMTW(S[2:-], Q[2:-]) }
DMBASE(S[1], Q[1]) = Σ_{i=1..F} Wi · |S[1][i] − Q[1][i]|
where F is the number of features in each element and Wi is the weight of the i-th dimension.
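The multi-dimensional variant differs from single-dimensional DTW only in its base distance; a hedged sketch with our own names and made-up example vectors:

```python
def dmbase(x, y, w):
    # Weighted L1 distance between two F-dimensional elements.
    return sum(wi * abs(a - b) for wi, a, b in zip(w, x, y))

def dmtw(s, q, w):
    """Multi-dimensional time warping distance via dynamic programming."""
    INF = float("inf")
    n, m = len(s), len(q)
    D = [[INF] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dmbase(s[i], q[j], w)
            prev = 0.0 if i == 0 and j == 0 else min(
                D[i - 1][j] if i else INF,
                D[i][j - 1] if j else INF,
                D[i - 1][j - 1] if i and j else INF)
            D[i][j] = d + prev
    return D[-1][-1]

# Two-feature elements, equal weights; stretching a matching element
# along the time axis costs nothing.
assert dmtw([(1, 2), (2, 3)], [(1, 2), (2, 3), (2, 3)], (0.5, 0.5)) == 0.0
```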

Sketch of Our Approach. Indexing: categorize multi-dimensional element values using MTAH; assign unique symbols to the categories; convert multi-dimensional sequences into sequences of symbols; construct the suffix tree from the set of symbol sequences. Query processing: traverse the suffix tree, find candidates whose lower-bound distances to q are within ε, and discard false alarms in post-processing.

Application to KMeD. In the KMeD environment, the proposed technique is applied to the retrieval of medical image sequences whose spatio-temporal characteristics are similar to those of the query sequence. KMeD [CCT:95] has the following features: query by both image and alphanumeric content; modeling of the temporal, spatial, and evolutionary nature of objects; query formulation using conceptual and imprecise terms; support for cooperative processing.

Application to KMeD (2). A query consists of a medical image sequence; attribute names with their relative weights, e.g. DistFromLV (0.6), Size (0.3), Circularity (0.1); and a distance tolerance.

Application to KMeD (3). Query flow: the query is interpreted through query analysis with a user model; contour extraction and feature extraction are applied to the medical image sequences; similarity searches use the distance function and the index structure to find matching sequences, which are shown in a visual presentation that accepts user feedback.

Contents Introduction Whole Sequence Searches Subsequence Searches Segment-Based Subsequence Searches Multi-Dimensional Subsequence Searches Conclusion

Summary. A sequence is an ordered list of elements, and similarity search helps in clustering and data mining. For sequences of different lengths or different sampling rates, the time warping distance is useful. We proposed a whole sequence searching method based on a spatial access method and a lower-bound distance function; a subsequence searching method based on a suffix tree and lower-bound distance functions; a segment-based subsequence searching method for large sequence databases; and an extension of the subsequence searching method to the retrieval of similar multi-dimensional subsequences.

Contribution. We proposed a tighter and faster lower-bound distance function for efficient whole sequence searches without false dismissal; demonstrated the feasibility of using the time warping similarity measure on a suffix tree; introduced the branch-pruning theorem and a fast lower-bound distance function for efficient subsequence searches without false dismissal; applied categorization and sparse indexing for scalability; and applied the proposed technique to a real application (KMeD).