# Similarity Search on Bregman Divergence: Towards Non-Metric Indexing


Zhenjie Zhang, Beng Chin Ooi, Srinivasan Parthasarathy, Anthony K. H. Tung

Metric vs. Non-Metric
- Euclidean distance dominates DB queries
- Similarity in human perception
- Metric distance is not enough!

2015-4-29 Similarity Search on Bregman Divergence: Towards Non-Metric Indexing

Outline
- Bregman Divergence
- Solution
  - Basic solution
  - Better pruning bounds
  - Query distribution
- Experiments
- Conclusion

Bregman Divergence

[Figure: a convex function f(x) with the points (p, f(p)) and (q, f(q)) marked; labels: Euclidean dist., Bregman divergence D_f(p,q), h]

Bregman Divergence: Mathematical Interpretation
- The distance between p and q is defined as the difference between f(p) and the first-order Taylor expansion of f at q, evaluated at p: D_f(p,q) = f(p) - f(q) - ⟨∇f(q), p - q⟩

[Figure: the original f(x) and the first-order Taylor expansion of f(x) at q]
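As a concrete reading of this definition, here is a minimal Python sketch that computes the divergence as f(p) minus the first-order Taylor expansion of f at q, evaluated at p. The names `bregman`, `sq_norm` and `sq_norm_grad` are illustrative and do not come from the slides; the caller supplies both f and its gradient.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def bregman(f: Callable[[Vector], float],
            grad_f: Callable[[Vector], Vector],
            p: Vector, q: Vector) -> float:
    """D_f(p, q) = f(p) - [f(q) + <grad f(q), p - q>]."""
    # First-order Taylor expansion of f around q, evaluated at p.
    taylor_at_q = f(q) + sum(g * (pi - qi) for g, pi, qi in zip(grad_f(q), p, q))
    return f(p) - taylor_at_q


# Example generator: f(x) = sum of squares, which yields the squared Euclidean distance.
def sq_norm(x: Vector) -> float:
    return sum(xi * xi for xi in x)


def sq_norm_grad(x: Vector) -> Vector:
    return [2.0 * xi for xi in x]


print(bregman(sq_norm, sq_norm_grad, [0.0, 1.0], [0.5, 0.5]))  # 0.5
```

With this generator the result equals (0 - 0.5)^2 + (1 - 0.5)^2, which is the squared Euclidean distance between the two points.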

Bregman Divergence: General Properties
- Uniqueness: a function f(x) uniquely determines D_f(p,q)
- Non-negativity: D_f(p,q) ≥ 0 for any p, q
- Identity: D_f(p,p) = 0 for any p
- Symmetry and the triangle inequality do NOT hold in general
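These properties can be checked on small numbers. The sketch below uses two one-dimensional divergences, Itakura-Saito (generated by f(x) = -log x) and squared Euclidean (generated by f(x) = x^2); the function names and sample values are chosen here for illustration only.

```python
import math


def itakura_saito(p: float, q: float) -> float:
    """1-D Itakura-Saito divergence, generated by f(x) = -log x (p, q > 0)."""
    ratio = p / q
    return ratio - math.log(ratio) - 1.0


def squared_euclidean(p: float, q: float) -> float:
    """1-D squared Euclidean distance, generated by f(x) = x^2."""
    return (p - q) ** 2


# Identity: the divergence of a point from itself is zero.
print(itakura_saito(3.0, 3.0))  # 0.0

# Symmetry fails: swapping the arguments changes the value.
print(itakura_saito(1.0, 2.0), itakura_saito(2.0, 1.0))

# Triangle inequality fails: the direct step costs more than the detour via 1.0.
direct = squared_euclidean(0.0, 2.0)
detour = squared_euclidean(0.0, 1.0) + squared_euclidean(1.0, 2.0)
print(direct, detour)  # 4.0 2.0
```

The last line is why metric indexes that prune with the triangle inequality cannot be applied directly.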

Examples

| Distance | f(x) | D_f(p,q) | Usage |
|---|---|---|---|
| KL-Divergence | x log x | p log(p/q) | distribution, color histogram |
| Itakura-Saito Distance | -log x | p/q - log(p/q) - 1 | signal, speech |
| Squared Euclidean | x^2 | (p-q)^2 | traditional queries |
| Von Neumann Entropy | tr(X log X - X) | tr(X log X - X log Y - X + Y) | symmetric matrix |
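The closed forms in the table can be cross-checked against the general definition D_f(p,q) = f(p) - f(q) - ⟨∇f(q), p - q⟩. The sketch below does this for the KL and Itakura-Saito rows, summing the per-coordinate formulas over a vector. All function names are illustrative. One caveat: for f(x) = Σ x log x the general definition yields Σ p log(p/q) - Σ p + Σ q, which reduces to the table entry when p and q have the same total mass, as probability distributions do.

```python
import math


def bregman(f, grad_f, p, q):
    """General definition: f(p) - f(q) - <grad f(q), p - q>."""
    return f(p) - f(q) - sum(g * (pi - qi) for g, pi, qi in zip(grad_f(q), p, q))


# Generator of the KL divergence: f(x) = sum_i x_i log x_i.
def neg_entropy(x):
    return sum(xi * math.log(xi) for xi in x)


def neg_entropy_grad(x):
    return [math.log(xi) + 1.0 for xi in x]


# Generator of the Itakura-Saito distance: f(x) = -sum_i log x_i.
def neg_log(x):
    return -sum(math.log(xi) for xi in x)


def neg_log_grad(x):
    return [-1.0 / xi for xi in x]


def kl_closed_form(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def is_closed_form(p, q):
    return sum(pi / qi - math.log(pi / qi) - 1.0 for pi, qi in zip(p, q))


# Two strictly positive distributions (both sum to 1).
p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]

print(bregman(neg_entropy, neg_entropy_grad, p, q), kl_closed_form(p, q))
print(bregman(neg_log, neg_log_grad, p, q), is_closed_form(p, q))
```

Each printed pair should agree up to floating-point rounding.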

Why in a DB system?
- Database applications
  - Retrieval of similar images, speech signals, or time series
  - Optimization on matrices in machine learning
  - Efficiency is important!
- Query types
  - Nearest neighbor query
  - Range query
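For reference, both query types can be answered without an index by scanning every point. The sketch below is such a baseline, using the Itakura-Saito distance as an example divergence; `knn_query`, `range_query` and the sample data are illustrative and not taken from the slides.

```python
import heapq
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
Divergence = Callable[[Sequence[float], Sequence[float]], float]


def itakura_saito(p: Sequence[float], q: Sequence[float]) -> float:
    """Itakura-Saito distance summed over coordinates (all values > 0)."""
    return sum(pi / qi - math.log(pi / qi) - 1.0 for pi, qi in zip(p, q))


def knn_query(data: List[Point], q: Point, k: int, div: Divergence) -> List[Point]:
    """Return the k data points with the smallest divergence D(p, q)."""
    return heapq.nsmallest(k, data, key=lambda p: div(p, q))


def range_query(data: List[Point], q: Point, theta: float, div: Divergence) -> List[Point]:
    """Return every data point p with D(p, q) <= theta."""
    return [p for p in data if div(p, q) <= theta]


data = [(1.0, 1.0), (2.0, 2.0), (4.0, 1.0), (0.5, 3.0)]
q = (2.0, 1.0)

print(knn_query(data, q, 1, itakura_saito))       # [(1.0, 1.0)]
print(range_query(data, q, 0.31, itakura_saito))  # [(1.0, 1.0), (2.0, 2.0), (4.0, 1.0)]
```

Every query here costs one divergence evaluation per data point, which is the cost an index tries to avoid.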

Euclidean Space: how to answer the queries
- R-Tree

Euclidean Space: how to answer the queries
- VA File

Our goal
- Re-use the infrastructure of existing DB systems to support Bregman divergence
  - Storage management
  - Indexing structures
  - Query processing algorithms


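Basic Solution: Extended Space
- Convex function f(x) = x^2; each point gets an extra attribute A3 = f(point) = A1^2 + A2^2

Original points:

| point | A1 | A2 |
|---|---|---|
| p | 0 | 1 |
| q | 0.5 | 0.5 |
| r | 1 | 0.8 |
| t | 1.5 | 0.3 |

Extended points:

| point | A1 | A2 | A3 |
|---|---|---|---|
| p+ | 0 | 1 | 1 |
| q+ | 0.5 | 0.5 | 0.5 |
| r+ | 1 | 0.8 | 1.64 |
| t+ | 1.5 | 0.3 | 2.34 |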
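The extension step can be sketched in a few lines of Python: each d-dimensional point is mapped to a (d+1)-dimensional point whose last coordinate is f evaluated at the point. The helper name `extend` is hypothetical, and f is taken to be the sum of squared coordinates, matching f(x) = x^2 applied per attribute.

```python
from typing import Callable, Sequence, Tuple


def f(x: Sequence[float]) -> float:
    """Convex generator used on the slide: sum of squared coordinates."""
    return sum(xi * xi for xi in x)


def extend(point: Sequence[float], f: Callable[[Sequence[float]], float]) -> Tuple[float, ...]:
    """Append f(point) as one extra coordinate."""
    return tuple(point) + (f(point),)


points = {"p": (0.0, 1.0), "q": (0.5, 0.5), "r": (1.0, 0.8), "t": (1.5, 0.3)}
extended = {name + "+": extend(pt, f) for name, pt in points.items()}

for name, pt in extended.items():
    print(name, pt)
```

The extended points lie on the surface of f, so an index built over them stores the value of f alongside the original attributes.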

Basic Solution: After the extension
- Index the extended points with an R-Tree or VA File
- Re-use existing algorithms with new lower and upper bound computations
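The following is a rough sketch of how an existing filter-and-refine search could be re-used once only the bound computation changes. It rests on several assumptions that the slides do not spell out: extended points are ranked by the inner product with a query vector v_q = (-∇f(q), 1), which orders points the same way as D_f(p,q) for a fixed q; the "index" is a flat list of leaf pages with bounding rectangles rather than a real R-Tree; and all names (`make_leaf`, `box_lower_bound`, `nearest_neighbor`) are hypothetical.

```python
import math


def sq_norm(x):
    return sum(xi * xi for xi in x)


def sq_norm_grad(x):
    return [2.0 * xi for xi in x]


def extend(p, f):
    return tuple(p) + (f(p),)


def query_vector(q, grad_f):
    # Assumed form: v_q = (-grad f(q), 1).
    return tuple(-g for g in grad_f(q)) + (1.0,)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def make_leaf(points_ext):
    """A leaf page: its extended points plus their bounding rectangle."""
    lo = tuple(min(c) for c in zip(*points_ext))
    hi = tuple(max(c) for c in zip(*points_ext))
    return {"lo": lo, "hi": hi, "points": points_ext}


def box_lower_bound(leaf, v):
    """Smallest value of <x, v> over the rectangle (pick lo or hi per dimension)."""
    return sum(vj * (l if vj >= 0 else h) for vj, l, h in zip(v, leaf["lo"], leaf["hi"]))


def nearest_neighbor(leaves, q, grad_f):
    v = query_vector(q, grad_f)
    # Visit leaves in increasing order of their lower bound.
    ranked = sorted(leaves, key=lambda leaf: box_lower_bound(leaf, v))
    best, best_score = None, math.inf
    for leaf in ranked:
        if box_lower_bound(leaf, v) >= best_score:
            break  # no remaining leaf can contain a better point
        for p_ext in leaf["points"]:
            score = dot(p_ext, v)
            if score < best_score:
                best, best_score = p_ext[:-1], score
    return best


data = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.8), (1.5, 0.3)]
ext = [extend(p, sq_norm) for p in data]
leaves = [make_leaf(ext[:2]), make_leaf(ext[2:])]
print(nearest_neighbor(leaves, (1.0, 1.0), sq_norm_grad))  # (1.0, 0.8)
```

The search loop itself is the usual one; only `box_lower_bound` is specific to the divergence.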

How to improve?
- Reformulation of Bregman divergence
- Tighter bounds are derived
- No change to index construction or the query processing algorithm

A New Formulation

[Figure: points p and q with the query vector v_q; labels: D_f(p,q) + Δ, D*_f(p,q), h, h']

Math. Interpretation: reformulation of similarity search queries
- k-NN query: given a query q, a data set P, and a divergence D_f, find the point p in P minimizing D_f(p,q)
- Range query: given a query q, a threshold θ, and a data set P, return every point p in P with D_f(p,q) ≤ θ
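One way to write the reformulation out, using the symbols that appear on the slides (the extended point p+, the query vector v_q, D*_f and Δ), is shown below. The exact definitions of v_q and Δ are an assumption here, chosen so that D*_f(p,q) = D_f(p,q) + Δ holds.

```latex
\begin{align*}
D_f(p,q) &= f(p) - f(q) - \langle \nabla f(q),\, p - q \rangle \\
         &= \underbrace{f(p) - \langle \nabla f(q),\, p \rangle}_{D^*_f(p,q)}
            \;-\; \underbrace{\bigl(f(q) - \langle \nabla f(q),\, q \rangle\bigr)}_{\Delta} \\[4pt]
p^{+} &= \bigl(p,\, f(p)\bigr), \qquad
v_q = \bigl(-\nabla f(q),\, 1\bigr), \qquad
D^*_f(p,q) = \langle p^{+},\, v_q \rangle \\[4pt]
\arg\min_{p \in P} D_f(p,q) &= \arg\min_{p \in P} \langle p^{+},\, v_q \rangle \\
D_f(p,q) \le \theta \;&\Longleftrightarrow\; \langle p^{+},\, v_q \rangle \le \theta + \Delta
\end{align*}
```

Since Δ depends only on q, it is a constant for a given query: the k-NN ranking is unchanged, and a range query only shifts its threshold.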

Naïve Bounds
- Check the corners of the bounding rectangles
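A minimal sketch of the corner check follows, assuming the score of an extended point is the inner product with a query vector v = (-∇f(q), 1). Because that score is linear in the extended point, its minimum and maximum over a rectangle are attained at corners. The function name and the sample rectangle are illustrative.

```python
from itertools import product
from typing import Sequence, Tuple


def corner_bounds(lo: Sequence[float], hi: Sequence[float],
                  v: Sequence[float]) -> Tuple[float, float]:
    """Min and max of <x, v> over the rectangle [lo, hi], found by visiting every corner."""
    values = [
        sum(c * vj for c, vj in zip(corner, v))
        # zip(lo, hi) gives one (low, high) pair per dimension; product picks one of each.
        for corner in product(*zip(lo, hi))
    ]
    return min(values), max(values)


# Rectangle around two extended points under f(x) = x^2, query q = (1, 1), so v = (-2, -2, 1).
lo = (1.0, 0.3, 1.64)
hi = (1.5, 0.8, 2.34)
v = (-2.0, -2.0, 1.0)
print(corner_bounds(lo, hi, v))
```

Enumerating corners costs 2^(d+1) evaluations; the same extremes can be found in O(d) by picking, per dimension, the low or high edge according to the sign of the matching entry of v.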

Tighter Bounds
- Take the curve f(x) into consideration
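One possible way to exploit the curve, sketched below, applies when f decomposes into a sum of one-dimensional convex functions. This is an assumption, and the slides may derive their bounds differently. In that case the per-dimension divergence is convex in the data coordinate and is zero at the query coordinate, so over an interval its minimum sits at the query value clamped into the interval and its maximum at one of the two endpoints. The names `tight_bounds` and `sq_euclid_1d` are hypothetical.

```python
from typing import Callable, Sequence, Tuple


def sq_euclid_1d(x: float, q: float) -> float:
    """Per-dimension divergence for f(x) = x^2."""
    return (x - q) ** 2


def tight_bounds(lo: Sequence[float], hi: Sequence[float], q: Sequence[float],
                 d1: Callable[[float, float], float]) -> Tuple[float, float]:
    """Bounds on sum_i d1(x_i, q_i) for any x inside the rectangle [lo, hi].

    Assumes d1(x, q) is convex in x with its minimum (zero) at x = q.
    """
    lower = 0.0
    upper = 0.0
    for l, h, qi in zip(lo, hi, q):
        nearest = min(max(qi, l), h)            # clamp q_i into [l, h]
        lower += d1(nearest, qi)                # convex minimum on the interval
        upper += max(d1(l, qi), d1(h, qi))      # convex maximum is at an endpoint
    return lower, upper


# Rectangle in the original (non-extended) space and query q = (1, 1).
print(tight_bounds((1.0, 0.3), (1.5, 0.8), (1.0, 1.0), sq_euclid_1d))
```

Under these assumptions both bounds are attained by actual points of the rectangle, so they are never looser than bounds that ignore the shape of f.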

Query distribution
- Distortion of rectangles: the difference between the maximum and minimum distances from inside the rectangle to the query

Can we improve it more?
- When building an R-Tree in Euclidean space
  - Minimize the volume/edge length of MBRs
  - Does it remain valid?

Query distribution
- Distortion of bounding rectangles
  - Invariant in Euclidean space (triangle inequality)
  - Query-dependent for Bregman divergence
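The contrast can be illustrated numerically. In the sketch below, distortion is computed as the maximum minus the minimum distance from a rectangle to a query. For Euclidean distance the triangle inequality caps this at the rectangle's diagonal for every query; for the Itakura-Saito distance (used here only as an example of a decomposable Bregman divergence) the value changes by orders of magnitude with the query position. Function names and sample values are illustrative.

```python
import math


def clamp(x, lo, hi):
    return min(max(x, lo), hi)


def euclidean_distortion(lo, hi, q):
    """Max minus min Euclidean distance from the rectangle [lo, hi] to q."""
    near = math.sqrt(sum((clamp(qi, l, h) - qi) ** 2 for l, h, qi in zip(lo, hi, q)))
    far = math.sqrt(sum(max((l - qi) ** 2, (h - qi) ** 2) for l, h, qi in zip(lo, hi, q)))
    return far - near


def is_1d(x, q):
    """1-D Itakura-Saito divergence D(x, q), for x, q > 0."""
    return x / q - math.log(x / q) - 1.0


def itakura_saito_distortion(lo, hi, q):
    """Max minus min Itakura-Saito divergence from the rectangle [lo, hi] to q."""
    near = sum(is_1d(clamp(qi, l, h), qi) for l, h, qi in zip(lo, hi, q))
    far = sum(max(is_1d(l, qi), is_1d(h, qi)) for l, h, qi in zip(lo, hi, q))
    return far - near


lo, hi = (1.0, 1.0), (2.0, 2.0)
diagonal = math.sqrt(sum((h - l) ** 2 for l, h in zip(lo, hi)))

for q in [(0.1, 0.1), (1.5, 1.5), (10.0, 10.0)]:
    print(q, euclidean_distortion(lo, hi, q), itakura_saito_distortion(lo, hi, q))
print("diagonal:", diagonal)
```

The same rectangle is therefore "good" or "bad" for pruning depending on where the queries fall, which is what motivates taking the query distribution into account.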

Utilize Query Distribution
- Summarize the query distribution with O(d) real numbers
- Estimate the expected distortion of any bounding rectangle in O(d) time
- Allows a better index to be constructed for both R-Tree and VA File
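The slides do not say which statistics are kept, so the following is only one plausible reading, under stated assumptions. Suppose scores are inner products of extended points with v_q = (-∇f(q), 1) and bounds come from rectangle corners. The distortion of a rectangle for one query is then Σ_j |v_j| · width_j, and its expectation over queries needs only the mean absolute gradient per dimension: d numbers, combined with the rectangle widths in O(d) time. The names `summarize_queries` and `expected_distortion` are hypothetical.

```python
from typing import Callable, List, Sequence


def sq_norm_grad(x: Sequence[float]) -> List[float]:
    """Gradient of f(x) = sum of squares."""
    return [2.0 * xi for xi in x]


def summarize_queries(queries: Sequence[Sequence[float]],
                      grad_f: Callable[[Sequence[float]], List[float]]) -> List[float]:
    """Mean absolute gradient per dimension over a query sample: d real numbers."""
    d = len(queries[0])
    totals = [0.0] * d
    for q in queries:
        for j, g in enumerate(grad_f(q)):
            totals[j] += abs(g)
    return [t / len(queries) for t in totals]


def expected_distortion(lo_ext: Sequence[float], hi_ext: Sequence[float],
                        summary: Sequence[float]) -> float:
    """Expected (max - min) of <x, v_q> over an extended rectangle, in O(d) time."""
    widths = [h - l for l, h in zip(lo_ext, hi_ext)]
    # The last coordinate of v_q is always 1, so the f-range contributes its full width.
    return sum(m * w for m, w in zip(summary, widths[:-1])) + widths[-1]


queries = [(1.0, 1.0), (2.0, 0.5)]
summary = summarize_queries(queries, sq_norm_grad)
print(summary)  # [3.0, 1.5]
print(expected_distortion((1.0, 0.3, 1.64), (1.5, 0.8, 2.34), summary))
```

An index builder could compare candidate splits by this estimate instead of by volume or edge length, which is the kind of use the slide describes.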


Experiments: Data Sets
- KDD'99 data: network data, the proportion of packets in 72 different TCP/IP connection types
- DBLP data: uses the co-authorship graph to generate the probabilities of authors being related to 8 different areas

Experiments: Data Sets (cont.)
- Uniform synthetic data: generated with a uniform distribution
- Clustered synthetic data: generated with a Gaussian Mixture Model

Experiments: Methods to compare

| | Basic | Improved Bounds | Query Distribution |
|---|---|---|---|
| R-Tree | R | R-B | R-BQ |
| VA File | V | V-B | V-BQ |
| Linear Scan | LS | | |
| BB-Tree | BBT | | |

Experiments: Index Construction Time

Experiments: Varying dimensionality

Experiments: Varying dimensionality (cont.)

Experiments: Varying k for nearest neighbor query

Conclusion
- A general technique for similarity search on Bregman divergence
- All techniques are based on the existing infrastructure of commercial databases
- Extensive experiments compare the performance of R-Tree and VA File under different optimizations

Acknowledgment

Zhenjie Zhang, Anthony K. H. Tung and Beng Chin Ooi were supported by Singapore NRF grant R-252-000-376-279. Srinivasan Parthasarathy was supported by NSF IIS-0347662 (CAREER) and NSF CCF-0702587.

Q & A
