Similarity Search: A Matching Based Approach Rui Zhang The University of Melbourne July 2006.

Presentation transcript:

Similarity Search: A Matching Based Approach Rui Zhang The University of Melbourne July 2006

Outline
- Traditional approach to similarity search
- Deficiencies of the traditional approach
- Our proposal: the n-match query
- Algorithms to process the n-match query
- Experimental results
- Conclusions and future work

Similarity Search: Traditional Approach
- Objects are represented by multidimensional vectors.
- The traditional approach to similarity search is the kNN query.
- Example query: Q = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
[Table: sample attributes Elevation, Aspect, Slope, Hillshade (9am), Hillshade (noon), Hillshade (3pm), …]
[Table: example points; columns ID, d1 to d10, Dist]

Deficiencies of the Traditional Approach
- The traditional approach to similarity search is the kNN query.
- Deficiencies
  - Distance is dominated by a few dimensions with high dissimilarity.
  - Partial similarities cannot be discovered.
- Example query: Q = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
[Table: example points; columns ID, d1 to d10, Dist]
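The first deficiency can be illustrated with a small sketch. The two points below are hypothetical (the values in the slide's table did not survive), and Euclidean distance is assumed as the kNN metric.

```python
import math

def euclidean(p, q):
    """Euclidean distance, assumed here as the metric behind the kNN query."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical data, chosen only to illustrate the effect.
query = [1.0] * 10
one_bad_dimension = [1.0] * 9 + [100.0]  # identical to Q in 9 of 10 dimensions
mediocre_everywhere = [20.0] * 10        # identical to Q in no dimension

d_outlier = euclidean(query, one_bad_dimension)
d_mediocre = euclidean(query, mediocre_everywhere)

# The point that matches Q almost everywhere is ranked farther away,
# because its single dissimilar dimension dominates the sum.
print(d_outlier, d_mediocre)
```

A kNN query over these two points would return the uniformly mediocre point first, even though the other point agrees with Q in nine dimensions.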

The N-Match Query: Warm-Up
- Description
  - Two objects are matched in n dimensions (n ≤ d).
  - The n dimensions are chosen dynamically so that the two objects match best.
- How to define a "match"
  - Exact match
  - Match with tolerance δ
- The similarity search example: Q = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1), n = 6
[Table: example points; columns ID, d1 to d10, Dist]

The N-Match Query: The Definition
- The n-match difference
  Given two d-dimensional points P(p_1, p_2, …, p_d) and Q(q_1, q_2, …, q_d), let δ_i = |p_i − q_i|, i = 1, …, d. Sort the array {δ_1, …, δ_d} in increasing order and let the sorted array be {δ_1', …, δ_d'}. Then δ_n' is the n-match difference between P and Q.
- The n-match query
  Given a d-dimensional database DB, a query point Q and an integer n (n ≤ d), find the point P ∈ DB that has the smallest n-match difference to Q. P is called the n-match of Q.
- The similarity search example: Q = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
[Table: example points; columns ID, d1 to d10, Dist; annotations n = 6, n = 7, n = …]
[Figure labels: …-match = A, 2-match = B]
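The definition translates directly into a brute-force sketch. The function names and the three sample points are assumptions made for illustration; ties are broken by dictionary order, which the slides do not specify.

```python
def n_match_difference(p, q, n):
    """Return the n-th smallest per-dimension difference |p_i - q_i|."""
    if not 1 <= n <= len(p):
        raise ValueError("n must satisfy 1 <= n <= d")
    diffs = sorted(abs(a - b) for a, b in zip(p, q))
    return diffs[n - 1]

def n_match(db, q, n):
    """Return the ID of the point in db with the smallest n-match difference."""
    return min(db, key=lambda pid: n_match_difference(db[pid], q, n))

# Hypothetical 4-dimensional database, not the slide's data.
db = {
    "A": [1.0, 1.0, 1.0, 9.0],   # exact match in three dimensions
    "B": [1.5, 1.5, 1.5, 1.5],   # close in every dimension
    "C": [3.0, 3.0, 3.0, 3.0],
}
q = [1.0, 1.0, 1.0, 1.0]

print(n_match(db, q, 3))  # A: its worst dimension is ignored when n = 3
print(n_match(db, q, 4))  # B: with n = d every dimension counts
```

Changing n from 3 to 4 changes the answer, which is why the choice of n matters for this query.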

The N-Match Query: Extensions
- The k-n-match query
  Given a d-dimensional database DB, a query point Q, an integer k, and an integer n, find a set S which consists of k points from DB so that for any point P1 ∈ S and any point P2 ∈ DB − S, P1's n-match difference is smaller than P2's n-match difference. S is called the k-n-match of Q.
- The frequent k-n-match query
  Given a d-dimensional database DB, a query point Q, an integer k, and an integer range [n_0, n_1] within [1, d], let S_0, …, S_i be the answer sets of k-n_0-match, …, k-n_1-match, respectively. Find a set T of k points so that for any point P1 ∈ T and any point P2 ∈ DB − T, P1's number of appearances in S_0, …, S_i is larger than or equal to P2's number of appearances in S_0, …, S_i.
- The similarity search example: Q = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
[Table: example points; columns ID, d1 to d10, Dist; annotation n = …]
[Figure labels: …-match = {A, D}, 2-2-match = {A, B}]
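Both extensions can be sketched by brute force over the whole database. This is only a reference implementation of the definitions, not the processing algorithm from the slides; the names, the sample data, and the tie-breaking by ID are assumptions.

```python
from collections import Counter

def n_match_difference(p, q, n):
    """n-th smallest per-dimension difference between p and q."""
    return sorted(abs(a - b) for a, b in zip(p, q))[n - 1]

def k_n_match(db, q, k, n):
    """IDs of the k points with the smallest n-match difference (ties by ID)."""
    ranked = sorted(db, key=lambda pid: (n_match_difference(db[pid], q, n), pid))
    return ranked[:k]

def frequent_k_n_match(db, q, k, n0, n1):
    """IDs of the k points that appear most often in the k-n-match answer
    sets for n = n0, ..., n1 (ties by ID)."""
    counts = Counter()
    for n in range(n0, n1 + 1):
        counts.update(k_n_match(db, q, k, n))
    ranked = sorted(counts, key=lambda pid: (-counts[pid], pid))
    return ranked[:k]

# Hypothetical 4-dimensional database, not the slide's data.
db = {
    "A": [1.0, 1.0, 1.0, 9.0],
    "B": [1.5, 1.5, 1.5, 1.5],
    "C": [3.0, 3.0, 3.0, 3.0],
    "D": [1.0, 1.0, 8.0, 9.0],
}
q = [1.0, 1.0, 1.0, 1.0]

print(k_n_match(db, q, 2, 3))              # ['A', 'B']
print(k_n_match(db, q, 2, 4))              # ['B', 'C']
print(frequent_k_n_match(db, q, 2, 2, 4))  # ['A', 'B']
```

In this sample, A and B each appear in two of the three answer sets for n = 2, 3, 4, while C and D appear in one each, so the frequent query settles on {A, B} without committing to a single n.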

Cost Model
- The multiple system information retrieval model
  - Objects are stored in different systems and scored by each system.
  - Each system can sort the objects according to their scores.
  - A query retrieves the scores of objects from the different systems and then combines them using some aggregation function.
- The cost
  - Retrieval of scores: proportional to the number of scores retrieved.
- The goal
  - To minimize the number of scores retrieved.
- Example query: Q: color="red" & shape="round" & texture="cloud"
[Figure: System 1: Color, System 2: Shape, System 3: Texture; each lists Object ID and Score]

The AD Algorithm
- The AD algorithm for the k-n-match query
  - Locate the query's attributes in every dimension.
  - Retrieve the objects' attributes starting from the query's attributes, in both directions.
  - The objects' attributes are retrieved in Ascending order of their Differences to the query's attributes. The first object to appear n times is the n-match.
- Auxiliary structures
  - Next attribute to retrieve: g[2d]
  - Number of appearances: appear[c]
  - Answer set: S
- Example: 2-2-match of Q: (3.0, 7.0, 4.0) over dimensions d1, d2, d3. The answer set S grows from { } to { 3 } to { 3, 2 }.
[Figure: the three sorted systems (Color, Shape, Texture; columns Object ID, Score) shown as dimensions d1, d2, d3, with the retrieved attributes marked step by step]
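The steps above can be sketched in memory as follows. This is a sketch under assumptions, not the authors' implementation: a heap stands in for the g[2d] array of next attributes, the count of consumed attributes is only a rough proxy for the slide's cost measure, and the five sample points are hypothetical values chosen to reproduce the answer set { 3, 2 } for Q = (3.0, 7.0, 4.0).

```python
import bisect
import heapq

def ad_k_n_match(db, q, k, n):
    """Scan attributes in ascending order of difference to q; a point joins
    the answer once it has appeared n times. Returns (answer, consumed)."""
    d = len(q)
    # One sorted list per dimension, like one sorted system per attribute.
    columns = [sorted((p[j], pid) for pid, p in db.items()) for j in range(d)]
    heap = []  # entries: (difference, dimension, step, position)

    def push(dim, step, pos):
        # Register the next attribute of one cursor, if the cursor is in range.
        if 0 <= pos < len(columns[dim]):
            heapq.heappush(heap, (abs(columns[dim][pos][0] - q[dim]), dim, step, pos))

    # Locate the query in every dimension and open two cursors around it.
    for j in range(d):
        values = [value for value, _ in columns[j]]
        split = bisect.bisect_left(values, q[j])
        push(j, -1, split - 1)  # cursor moving toward smaller values
        push(j, +1, split)      # cursor moving toward larger values

    appear = {pid: 0 for pid in db}
    answer, consumed = [], 0
    while heap and len(answer) < k:
        _, dim, step, pos = heapq.heappop(heap)
        consumed += 1
        pid = columns[dim][pos][1]
        appear[pid] += 1
        if appear[pid] == n:
            answer.append(pid)
        push(dim, step, pos + step)  # advance the cursor that was consumed
    return answer, consumed

# Hypothetical data; the slide's attribute values are not recoverable.
db = {
    1: (0.4, 1.0, 1.0),
    2: (2.8, 5.5, 2.0),
    3: (6.5, 7.8, 5.0),
    4: (9.0, 9.0, 9.0),
    5: (3.5, 1.5, 8.0),
}
answer, consumed = ad_k_n_match(db, (3.0, 7.0, 4.0), k=2, n=2)
print(answer, consumed)  # [3, 2] 5
```

On this sample the scan stops after consuming 5 of the 15 stored attributes, which is the kind of saving the cost model is meant to capture.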

The AD Algorithm: Extensions
- The AD algorithm for the frequent k-n-match query
  - The frequent k-n-match query: given an integer range [n_0, n_1], find the k-n_0-match, k-(n_0+1)-match, ..., k-n_1-match of the query, S_0, S_1, ..., S_i. Then find the k objects that appear most frequently in S_0, S_1, ..., S_i.
  - This retrieves the same number of attributes as processing a k-n_1-match query.
- Disk based solutions for the (frequent) k-n-match query
  - Disk based AD algorithm
    - Sort each dimension and store it sequentially on disk.
    - When reaching the end of a disk page, read the next page from disk.
  - Existing indexing techniques
    - Tree-like structures: R-trees, k-d-trees
    - Mapping based indexing: space-filling curves, iDistance
    - Sequential scan
    - Compression based approach (VA-file)
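A single ascending-difference scan can fill every answer set at once, which is one way to read the claim that the frequent query costs no more than a k-n_1-match query. The sketch below is an in-memory illustration under assumptions: the function name, the heap-based cursors, the tie-breaking by ID, and the sample data are not from the slides.

```python
import bisect
import heapq
from collections import Counter

def ad_frequent_k_n_match(db, q, k, n0, n1):
    """One scan in ascending order of difference. The answer set for each n
    in [n0, n1] collects the first k points that reach n appearances.
    Returns (k most frequent IDs, answer sets keyed by n)."""
    d = len(q)
    columns = [sorted((p[j], pid) for pid, p in db.items()) for j in range(d)]
    heap = []  # entries: (difference, dimension, step, position)

    def push(dim, step, pos):
        if 0 <= pos < len(columns[dim]):
            heapq.heappush(heap, (abs(columns[dim][pos][0] - q[dim]), dim, step, pos))

    for j in range(d):
        split = bisect.bisect_left([value for value, _ in columns[j]], q[j])
        push(j, -1, split - 1)
        push(j, +1, split)

    appear = Counter()
    answer_sets = {n: [] for n in range(n0, n1 + 1)}
    # When the set for n1 is full, the sets for all smaller n are full too,
    # because every point with n1 appearances also has n < n1 appearances.
    while heap and len(answer_sets[n1]) < k:
        _, dim, step, pos = heapq.heappop(heap)
        pid = columns[dim][pos][1]
        appear[pid] += 1
        hits = appear[pid]
        if n0 <= hits <= n1 and len(answer_sets[hits]) < k:
            answer_sets[hits].append(pid)
        push(dim, step, pos + step)

    votes = Counter(pid for members in answer_sets.values() for pid in members)
    ranked = sorted(votes, key=lambda pid: (-votes[pid], pid))
    return ranked[:k], answer_sets

# Hypothetical data, for illustration only.
db = {
    1: (0.4, 1.0, 1.0),
    2: (2.8, 5.5, 2.0),
    3: (6.5, 7.8, 5.0),
    4: (9.0, 9.0, 9.0),
    5: (3.5, 1.5, 8.0),
}
result, sets = ad_frequent_k_n_match(db, (3.0, 7.0, 4.0), k=2, n0=1, n1=3)
print(result)  # [2, 3]
print(sets)    # {1: [2, 5], 2: [3, 2], 3: [2, 3]}
```

For the disk based variant described above, each sorted column would live in sequential disk pages and the cursors would advance page by page; that part is not modeled here.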

Experiments: Effectiveness
- Searching by k-n-match
  - COIL-100 database
  - 54 features extracted, such as color histograms and area moments
- Searching by frequent k-n-match
  - UCI Machine Learning Repository
  - Competitors: IGrid, Human-Computer Interactive NN search (HCINN)

k-n-match query, k = 4
[Table: columns n, Images returned]

kNN query
k     Images returned
10    13, 35, 36, 40, 42, 64, 85, 88, 94, 96

Data set (d)        IGrid   HCINN   Freq. k-n-match
Ionosphere (34)     80.1%   86%     87.5%
Segmentation (19)   79.9%   83%     87.3%
Wdbc (30)           87.1%   N.A.    92.5%
Glass (9)           58.6%   N.A.    67.8%
Iris (4)            88.9%   N.A.    89.6%

Experiments: Efficiency
- Disk based algorithms for the frequent k-n-match query
- Datasets: texture dataset (68,040 records); uniform dataset (100,000 records)
- Competitors
  - The AD algorithm
  - VA-file
  - Sequential scan

Experiments: Efficiency (continued)
- Comparison with other similarity search techniques
- Datasets: texture dataset; synthetic dataset
- Competitors
  - Frequent k-n-match query using the AD algorithm
  - IGrid
  - Human-Computer Interactive NN search (HCINN)

Conclusions and Future Work
- Conclusions
  - We proposed a new approach to similarity search, the k-n-match query. It is tolerant to noise and able to discover partial similarity.
  - If n is chosen poorly, the results of the k-n-match query may not be good enough to find full similarity, so we further proposed the frequent k-n-match query to address this problem.
  - We proposed the AD algorithm, which is optimal for both the k-n-match query and the frequent k-n-match query under the multiple system information retrieval model. We also applied it in a disk based model.
  - Based on an extensive experimental study, the frequent k-n-match query is more effective in similarity search than existing techniques such as IGrid and Human-Computer Interactive NN search. It can also be processed more efficiently than other techniques by our proposed AD algorithm in a disk based model.
- Future work
  - We may perform more experiments to see whether the traditional kNN search can always be replaced by frequent k-n-match search and, if not, in which scenarios it should be used.

Questions? My contact Website: