Similarity Search: A Matching Based Approach

Presentation transcript:

Similarity Search: A Matching Based Approach
Anthony K. H. Tung, National University of Singapore
Rui Zhang, University of Melbourne
Nick Koudas, University of Toronto
Beng Chin Ooi, National University of Singapore

Outline
Traditional approach to similarity search
Deficiencies of the traditional approach
Our proposal: the n-match query
Algorithms to process the n-match query
Experimental results
Conclusions and future work

Similarity Search: Traditional Approach
Objects are represented as multidimensional vectors. The traditional approach to similarity search is the kNN query, e.g., Q = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1). For instance, a record may be described by features such as Elevation, Aspect, Slope, and Hillshade (9am, noon, 3pm), with values like 2596, 51, 3, 221, 232, 148. [Slide table: example points P1-P6 over dimensions d1-d10, with their distances to Q.]
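
For concreteness, here is a minimal Python sketch of the traditional approach: a kNN query answered by a brute-force linear scan. The function name knn and the scan are illustrative assumptions only; they are not the indexing methods discussed later in the talk.

```python
import numpy as np

def knn(db, q, k):
    """Traditional similarity search: return the k database vectors with the
    smallest Euclidean distance to the query vector q (brute-force scan)."""
    db = np.asarray(db, dtype=float)
    dists = np.linalg.norm(db - np.asarray(q, dtype=float), axis=1)
    return db[np.argsort(dists)[:k]]
```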

Deficiencies of the Traditional Approach
The distance is affected by a few dimensions with high dissimilarity, so partial similarities cannot be discovered. With the kNN query Q = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1), a point that matches Q closely in most dimensions but takes values near 100 in a few dimensions still receives a large aggregate distance. [Slide table: the example points P1-P6 over dimensions d1-d10 and their distances to Q; a few values near 100 dominate the distances.]

Thoughts
Aggregating too many dimensional differences into a single value results in too much information loss. Can we try to reduce that loss? While high-dimensional data typically cause problems when it comes to similarity search, can we turn what is against us into an advantage? Our approach: since we have so many dimensions, we can compute more complex statistics over these dimensions to overcome some of the "noise" introduced by the scaling of dimensions, outliers, etc.

The N-Match Query: Warm-Up
Description: two objects match in n of the d dimensions (n ≤ d), and the n dimensions are chosen dynamically so that the two objects match best. A "match" in a dimension can be an exact match or a match with tolerance δ. In the similarity search example, Q = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1) and n = 6. [Slide table: the example points P1-P6 over dimensions d1-d10.]

The N-Match Query: The Definition
The n-match difference: given two d-dimensional points P(p1, p2, ..., pd) and Q(q1, q2, ..., qd), let δi = |pi - qi|, i = 1, ..., d. Sort the array {δ1, ..., δd} in increasing order and let the sorted array be {δ1', ..., δd'}. Then δn' is the n-match difference between P and Q.
The n-match query: given a d-dimensional database DB, a query point Q and an integer n (n ≤ d), find the point P ∈ DB that has the smallest n-match difference to Q. P is called the n-match of Q.
The similarity search example: Q = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1), with n = 6, 7, 8; the 1-match is A and the 2-match is B. [Slide table: the example points P1-P6 over dimensions d1-d10 and their n-match differences to Q.]
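
The definition translates directly into code. Below is a minimal Python sketch (the function names are ours, not from the paper): the n-match difference is the n-th smallest per-dimension absolute difference, and the n-match query returns the database point minimizing it, here by a plain linear scan.

```python
import numpy as np

def n_match_difference(p, q, n):
    """n-match difference: the n-th smallest value among |p_i - q_i|, i = 1..d."""
    diffs = np.sort(np.abs(np.asarray(p, dtype=float) - np.asarray(q, dtype=float)))
    return diffs[n - 1]  # n is 1-based

def n_match(db, q, n):
    """n-match query by linear scan: the point of db with the smallest
    n-match difference to q."""
    return min(db, key=lambda p: n_match_difference(p, q, n))
```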

The N-Match Query: Extensions
The k-n-match query: given a d-dimensional database DB, a query point Q, an integer k, and an integer n, find a set S consisting of k points from DB so that for any point P1 ∈ S and any point P2 ∈ DB - S, P1's n-match difference is smaller than P2's n-match difference. S is called the k-n-match of Q.
The frequent k-n-match query: given a d-dimensional database DB, a query point Q, an integer k, and an integer range [n0, n1] within [1, d], let S0, ..., Si be the answer sets of the k-n0-match, ..., k-n1-match queries, respectively. Find a set T of k points so that for any point P1 ∈ T and any point P2 ∈ DB - T, P1's number of appearances in S0, ..., Si is larger than or equal to P2's number of appearances in S0, ..., Si.
The similarity search example: Q = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1), n = 6; the 2-1-match is {A, D} and the 2-2-match is {A, B}. [Slide table: the example points P1-P6 over dimensions d1-d10.]
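
As a baseline, the k-n-match query can be answered by a linear scan using the n-match difference sketched above; this hypothetical helper only makes the definition concrete, and the efficient algorithm follows on the next slides.

```python
import heapq

def k_n_match_scan(db, q, k, n):
    """k-n-match query by linear scan: the k points of db with the smallest
    n-match differences to q (uses n_match_difference from the sketch above)."""
    return heapq.nsmallest(k, db, key=lambda p: n_match_difference(p, q, n))
```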

Cost Model
The multiple system information retrieval model: objects are stored in different systems and scored by each system; each system can sort the objects according to their scores. A query retrieves the scores of objects from the different systems and then combines them using some aggregation function. The cost is the retrieval of scores, proportional to the number of scores retrieved, so the goal is to minimize the number of scores retrieved.
Example query Q: color = "red" & shape = "round" & texture = "cloud". Each system lists (object ID, score) pairs sorted by score:
System 1 (Color): (1, 0.4), (2, 2.8), (5, 3.5), (3, 6.5), (4, 9.0)
System 2 (Shape): (1, 1.0), (5, 1.5), (2, 5.5), (3, 7.8), (4, 9.0)
System 3 (Texture): (1, 1.0), (2, 2.0), (3, 5.0), (5, 8.0), (4, 9.0)

The AD Algorithm
The AD algorithm for the k-n-match query: locate the query's attribute in every dimension, then retrieve the objects' attributes from the query's attributes in both directions. The objects' attributes are retrieved in Ascending order of their Differences to the query's attributes. An n-match is found when an object has appeared n times. Auxiliary structures: the next attribute to retrieve, g[2d]; the number of appearances, appear[c]; and the answer set S.
Example: for the query color = "red" & shape = "round" & texture = "cloud", i.e., Q = (3.0, 7.0, 4.0), find the 2-2-match of Q over the three sorted systems of the cost-model slide. Retrieval proceeds per dimension in ascending order of difference, d1: (2, 0.2), (5, 0.5), (1, 2.6), (3, 3.5), ...; d2: (3, 0.8), (2, 1.5), (4, 2.0), ...; d3: (3, 1.0), (2, 2.0), (5, 4.0), ...; the appearance counts reach 2 first for objects 3 and 2, so the answer set grows { } → { 3 } → { 3, 2 }.
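
The following Python sketch implements the AD idea under the multiple-system model described above, with assumptions of ours: each dimension is a pre-sorted list of (value, object ID) pairs, and a heap holds the 2d frontier pointers (one on each side of the query attribute per dimension) instead of the paper's g[2d] array. Attributes are popped in ascending order of difference, and an object enters the answer set once it has appeared n times.

```python
import bisect
import heapq
from collections import defaultdict

def ad_k_n_match(sorted_dims, q, k, n):
    """Sketch of the AD algorithm for the k-n-match query.

    sorted_dims[i] is a list of (value, object_id) pairs sorted by value,
    one list per dimension ("system"); q is the query point. Attributes are
    retrieved across all dimensions in ascending order of |value - q[i]|,
    and an object joins the answer once it has been retrieved n times.
    """
    frontier = []  # heap of (difference, dimension, index into that dimension)
    for i, column in enumerate(sorted_dims):
        pos = bisect.bisect_left(column, (q[i],))  # first value >= q[i]
        for idx in (pos - 1, pos):                 # pointers on both sides of q[i]
            if 0 <= idx < len(column):
                heapq.heappush(frontier, (abs(column[idx][0] - q[i]), i, idx))

    appear = defaultdict(int)   # appear[c]: how many times object c was retrieved
    answer = []                 # answer set S
    while frontier and len(answer) < k:
        diff, i, idx = heapq.heappop(frontier)
        obj = sorted_dims[i][idx][1]
        appear[obj] += 1
        if appear[obj] == n:
            answer.append(obj)
        # Advance this pointer away from q[i] and push the next attribute.
        step = -1 if sorted_dims[i][idx][0] < q[i] else 1
        nxt = idx + step
        if 0 <= nxt < len(sorted_dims[i]):
            heapq.heappush(frontier, (abs(sorted_dims[i][nxt][0] - q[i]), i, nxt))
    return answer

# The example from the slides: objects 1-5 scored by three systems, Q = (3.0, 7.0, 4.0).
color = [(0.4, 1), (2.8, 2), (3.5, 5), (6.5, 3), (9.0, 4)]
shape = [(1.0, 1), (1.5, 5), (5.5, 2), (7.8, 3), (9.0, 4)]
texture = [(1.0, 1), (2.0, 2), (5.0, 3), (8.0, 5), (9.0, 4)]
print(ad_k_n_match([color, shape, texture], (3.0, 7.0, 4.0), k=2, n=2))  # [3, 2]
```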

The AD Algorithm: Extensions
The AD algorithm for the frequent k-n-match query: given an integer range [n0, n1], find the k-n0-match, k-(n0+1)-match, ..., k-n1-match of the query, S0, S1, ..., Si, and then find the k objects that appear most frequently in S0, S1, ..., Si. It retrieves the same number of attributes as processing a single k-n1-match query.
Disk-based solutions for the (frequent) k-n-match query: the disk-based AD algorithm sorts each dimension and stores it sequentially on disk; when reaching the end of a disk page, it reads the next page from disk. Existing indexing techniques for comparison: tree-like structures (R-trees, k-d-trees), mapping-based indexing (space-filling curves, iDistance), sequential scan, and the compression-based approach (VA-file).
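
A frequent k-n-match query can be answered on top of the AD algorithm by counting how often each object appears in the k-n-match answers for n = n0, ..., n1. The sketch below is a simplification of ours that simply reuses ad_k_n_match once per n; the algorithm described on the slide is smarter and retrieves no more attributes than a single k-n1-match query, but it produces the same answers (up to ties).

```python
from collections import Counter

def frequent_k_n_match(sorted_dims, q, k, n_range):
    """Frequent k-n-match: the k objects appearing most often in the
    k-n-match answer sets for n in [n0, n1] (naively reusing ad_k_n_match)."""
    n0, n1 = n_range
    counts = Counter()
    for n in range(n0, n1 + 1):
        counts.update(ad_k_n_match(sorted_dims, q, k, n))
    return [obj for obj, _ in counts.most_common(k)]
```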

Experiments: Effectiveness
Searching by k-n-match: COIL-100 database, 54 features extracted (such as color histograms and area moments), k-n-match query with k = 4.
n = 5: images 36, 42, 78, 94
n = 10: images 27, 35, 42, 78
n = 15: images 3, 38, 42, 78
n = 20: images 27, 38, 42, 78
n = 25: images 35, 40, 42, 94
n = 30: images 10, 35, 42, 94
n = 35: images 35, 42, 94, 96
n = 40, 45, 50: (not shown)
kNN query with k = 10: images 13, 35, 36, 40, 42, 64, 85, 88, 94, 96
Searching by frequent k-n-match: datasets from the UCI Machine Learning Repository; competitors are IGrid and Human-Computer Interactive NN search (HCINN).
Ionosphere (d = 34): IGrid 80.1%, HCINN 86%, Freq. k-n-match 87.5%
Segmentation (d = 19): IGrid 79.9%, HCINN 83%, Freq. k-n-match 87.3%
Wdbc (d = 30): IGrid 87.1%, HCINN N.A., Freq. k-n-match 92.5%
Glass (d = 9): IGrid 58.6%, Freq. k-n-match 67.8%
Iris (d = 4): IGrid 88.9%, Freq. k-n-match 89.6%

Experiments: Efficiency
Disk-based algorithms for the frequent k-n-match query. Texture dataset (68,040 records); uniform dataset (100,000 records). Competitors: the AD algorithm, the VA-file, and sequential scan.

Experiments: Efficiency (continued)
Comparison with other similarity search techniques. Texture dataset; synthetic dataset. Competitors: the frequent k-n-match query using the AD algorithm, IGrid, and sequential scan.

Conclusion
We proposed a new approach to similarity search, the k-n-match query. It has the advantage of being tolerant to noise and able to discover partial similarity. Since a poorly chosen n value may keep the k-n-match query from capturing full similarity, we further proposed the frequent k-n-match query to address this problem. We proposed the AD algorithm, which is optimal for both the k-n-match query and the frequent k-n-match query under the multiple system information retrieval model, and we also applied it in a disk-based model. An extensive experimental study shows that the frequent k-n-match query is more effective for similarity search than existing techniques such as IGrid and Human-Computer Interactive NN search, and that it can be processed more efficiently by our AD algorithm than other techniques in a disk-based model.

Questions?