Published by Yvonne Wragg. Modified about 1 year ago.

1
Similarity Search on Bregman Divergence: Towards Non-Metric Indexing
Zhenjie Zhang, Beng Chin Ooi, Srinivasan Parthasarathy, Anthony K. H. Tung

2
Metric vs. Non-Metric
  Euclidean distance dominates DB queries
  Similarity in human perception
  Metric distance is not enough!
2015-4-29 Similarity Search on Bregman Divergence: Towards Non-Metric Indexing 2

3
Outline
  Bregman Divergence
  Solution
    Basic solution
    Better pruning bounds
    Query distribution
  Experiments
  Conclusion

4
Bregman Divergence
[Figure labels: p, q, Euclidean dist., convex function f(x), Bregman divergence D_f(p,q), (q, f(q)), (p, f(p)), h]

5
Bregman Divergence
Mathematical Interpretation
  The distance between p and q is defined as the difference between f(p) and the first-order Taylor expansion of f at q, evaluated at p:
    D_f(p,q) = f(p) - [ f(q) + <∇f(q), p - q> ]
  Here f(p) is the original f(x), and the bracketed term is the first-order Taylor expansion of f(x) at q.
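The definition above can be sketched in a few lines of plain Python. This is only an illustration: the names `bregman_divergence`, `sq_norm` and `sq_norm_grad` are hypothetical, and the generator f(x) = sum of x_i^2 is chosen because its divergence is the familiar squared Euclidean distance.

```python
def bregman_divergence(f, grad_f, p, q):
    """D_f(p, q) = f(p) - f(q) - <grad f(q), p - q> for vectors p and q."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_f(q), p, q))
    return f(p) - f(q) - inner


# Illustrative generator (an assumption for this sketch): f(x) = sum(x_i^2).
def sq_norm(x):
    return sum(xi * xi for xi in x)


def sq_norm_grad(x):
    return [2.0 * xi for xi in x]


p = [0.0, 1.0]
q = [0.5, 0.5]
d_pq = bregman_divergence(sq_norm, sq_norm_grad, p, q)
print(d_pq)  # equals the squared Euclidean distance between p and q
```

Any other convex, differentiable generator can be plugged in by passing its value and gradient functions.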

6
Bregman Divergence
General Properties
  Uniqueness: a function f(x) uniquely determines D_f(p,q)
  Non-Negativity: D_f(p,q) ≥ 0 for any p, q
  Identity: D_f(p,p) = 0 for any p
  Symmetry and Triangle Inequality: do NOT hold in general
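A small numeric check may make these properties concrete. The sketch below uses two well-known Bregman divergences, KL divergence on probability vectors and squared Euclidean distance; the sample points are arbitrary choices for illustration.

```python
import math


def kl(p, q):
    """KL divergence between two probability vectors with positive entries."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))


def sq_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


# Asymmetry: KL(p, q) and KL(q, p) differ.
p, q = [0.9, 0.1], [0.5, 0.5]
kl_pq, kl_qp = kl(p, q), kl(q, p)
print(kl_pq, kl_qp)

# Triangle inequality violation: going a -> c directly costs more
# than the detour a -> b -> c under squared Euclidean distance.
a, b, c = [0.0], [1.0], [2.0]
direct = sq_euclidean(a, c)
detour = sq_euclidean(a, b) + sq_euclidean(b, c)
print(direct, detour)
```

Squared Euclidean distance happens to be symmetric, so it is the triangle inequality, not symmetry, that fails in the second check.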

7
Examples

Distance                  f(x)               D_f(p,q)                        Usage
KL-Divergence             x log x            p log(p/q)                      distribution, color histogram
Itakura-Saito Distance    -log x             p/q - log(p/q) - 1              signal, speech
Squared Euclidean         x^2                (p-q)^2                         traditional queries
Von Neumann Entropy       tr(X log X - X)    tr(X log X - X log Y - X + Y)   symmetric matrix
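The scalar rows of the table can be checked against the general definition D_f(p,q) = f(p) - f(q) - f'(q)(p - q), summed over coordinates. The following is a sketch with hypothetical names (`generators`, `closed_forms`). For KL, the generic formula yields p log(p/q) - p + q per coordinate, which reduces to the table entry when p and q are probability vectors, as assumed here. The matrix case (Von Neumann entropy) is left out.

```python
import math


def bregman(f, df, p, q):
    """Coordinate-wise Bregman divergence for a separable generator f."""
    return sum(f(a) - f(b) - df(b) * (a - b) for a, b in zip(p, q))


# Generator f(x) and its derivative for each scalar row of the table.
generators = {
    "kl": (lambda x: x * math.log(x), lambda x: math.log(x) + 1.0),
    "itakura_saito": (lambda x: -math.log(x), lambda x: -1.0 / x),
    "sq_euclidean": (lambda x: x * x, lambda x: 2.0 * x),
}

# Closed forms as listed in the table.
closed_forms = {
    "kl": lambda p, q: sum(a * math.log(a / b) for a, b in zip(p, q)),
    "itakura_saito": lambda p, q: sum(a / b - math.log(a / b) - 1.0 for a, b in zip(p, q)),
    "sq_euclidean": lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)),
}

# Two probability vectors with positive entries (arbitrary sample data).
p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]

results = {
    name: (bregman(f, df, p, q), closed_forms[name](p, q))
    for name, (f, df) in generators.items()
}
for name, (generic, closed) in results.items():
    print(name, generic, closed)
```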

8
Why in a DB system?
  Database applications
    Retrieval of similar images, speech signals, or time series
    Optimization on matrices in machine learning
  Efficiency is important!
  Query Types
    Nearest Neighbor Query
    Range Query

9
Euclidean Space
How to answer the queries: R-Tree

10
Euclidean Space
How to answer the queries: VA File

11
Our goal
Re-use the infrastructure of existing DB systems to support Bregman divergence
  Storage management
  Indexing structures
  Query processing algorithms

12
Outline
  Bregman Divergence
  Solution
    Basic solution
    Better pruning bounds
    Query distribution
  Experiments
  Conclusion

13
Basic Solution
Extended Space
  Convex function f(x) = x^2

Original points:
point   A1    A2
p       0     1
q       0.5   0.5
r       1     0.8
t       1.5   0.3

Extended points (A3 = A1^2 + A2^2):
point   A1    A2    A3
p+      0     1     1
q+      0.5   0.5   0.5
r+      1     0.8   1.64
t+      1.5   0.3   2.34
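The extension step can be sketched as follows: each point gets one extra coordinate holding f applied to the point, which for f(x) = x^2 is the sum of squared coordinates. The helper name `extend` is hypothetical, and the sketch assumes the slide's row for q reads (0.5, 0.5).

```python
def extend(point, f=lambda x: x * x):
    """Append the extra coordinate sum_i f(x_i) to a point."""
    return list(point) + [sum(f(x) for x in point)]


# Example points from the slide (q assumed to be (0.5, 0.5)).
points = {
    "p": [0.0, 1.0],
    "q": [0.5, 0.5],
    "r": [1.0, 0.8],
    "t": [1.5, 0.3],
}

extended = {name + "+": extend(pt) for name, pt in points.items()}
for name, pt in extended.items():
    print(name, pt)
```

The extended points are ordinary (d+1)-dimensional vectors, so they can be handed to an existing R-Tree or VA File implementation unchanged.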

14
Basic Solution
After the extension
  Index extended points with R-Tree or VA File
  Re-use existing algorithms with new lower and upper bound computation

15
How to improve?
  Reformulation of Bregman divergence
  Tighter bounds are derived
  No change to index construction or query processing algorithms

16
A New Formulation
[Figure labels: p, q, D_f(p,q) + Δ, query vector v_q, D*_f(p,q), h, h']

17
Math. Interpretation
Reformulation of similarity search queries
  k-NN query: query q, data set P, divergence D_f
    Find the point p minimizing [formula not preserved in the transcript]
  Range query: query q, threshold θ, data set P
    Return any point p that [condition not preserved in the transcript]

18
Naïve Bounds
Check the corners of the bounding rectangles
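One way to see why corners suffice, offered as a sketch and not necessarily the authors' exact procedure: for a fixed query q, D_f(p,q) = f(p) - <∇f(q), p> + (<∇f(q), q> - f(q)), which is linear in the extended point (p, f(p)). A linear function over an axis-aligned box reaches its minimum and maximum at corners, and the right corner can be picked per dimension from the sign of each coefficient. The names `corner_bounds`, `extend` and `divergence` are hypothetical, and f(x) = x^2 is assumed throughout.

```python
def f(x):
    return x * x


def df(x):
    return 2.0 * x


def divergence(p, q):
    """Exact Bregman divergence for the separable generator f."""
    return sum(f(a) - f(b) - df(b) * (a - b) for a, b in zip(p, q))


def extend(p):
    return list(p) + [sum(f(a) for a in p)]


def corner_bounds(lo, hi, q):
    """Bounds on D_f(p, q) over all extended points inside the box [lo, hi].

    lo and hi have d+1 coordinates (extended space); q has d coordinates.
    """
    coeffs = [-df(b) for b in q] + [1.0]       # weights on (p, f(p))
    const = sum(df(b) * b - f(b) for b in q)   # query-only constant
    lower = const + sum(c * (l if c >= 0 else h) for c, l, h in zip(coeffs, lo, hi))
    upper = const + sum(c * (h if c >= 0 else l) for c, l, h in zip(coeffs, lo, hi))
    # Divergences are non-negative, so a negative lower bound is clamped to 0.
    return max(lower, 0.0), upper


points = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.8], [1.5, 0.3]]
ext = [extend(p) for p in points]
lo = [min(col) for col in zip(*ext)]   # bounding rectangle of extended points
hi = [max(col) for col in zip(*ext)]
query = [1.0, 0.5]
lower, upper = corner_bounds(lo, hi, query)
print(lower, upper)
```

The bounds are valid but loose, because most corners of the box do not lie on the surface (p, f(p)).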

19
Tighter Bounds
Take the curve f(x) into consideration

20
Query distribution
Distortion of rectangles
  The difference between the maximum and minimum distances from inside the rectangle to the query

21
Can we improve it more?
When building an R-Tree in Euclidean space
  Minimize the volume/edge length of MBRs
  Does it remain valid?

22
Query distribution
Distortion of bounding rectangles
  Invariant in Euclidean space (triangle inequality)
  Query-dependent for Bregman divergence
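A one-dimensional toy case illustrates the contrast. It is a simplification for intuition only: the "rectangle" is an interval [a, b], the query lies outside it, and the Itakura-Saito distance stands in for a generic Bregman divergence. Under these assumptions both distances are monotone over the interval, so the maximum and minimum sit at the endpoints. The name `distortion_1d` is hypothetical.

```python
import math


def distortion_1d(dist, a, b, q):
    """Max minus min distance from points of [a, b] to a query q outside [a, b].

    Assumes dist(., q) is monotone on the interval, which holds for |p - q|
    and for any Bregman divergence when q is not inside [a, b].
    """
    return abs(dist(a, q) - dist(b, q))


def euclidean(p, q):
    return abs(p - q)


def itakura_saito(p, q):
    return p / q - math.log(p / q) - 1.0


a, b = 1.0, 2.0
queries = [3.0, 5.0, 10.0]

euclid_distortion = [distortion_1d(euclidean, a, b, q) for q in queries]
is_distortion = [distortion_1d(itakura_saito, a, b, q) for q in queries]
print(euclid_distortion)  # the same value for every query
print(is_distortion)      # changes with the query
```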

23
Utilize Query Distribution
  Summarize the query distribution with O(d) real numbers
  Estimate the expected distortion of any bounding rectangle in O(d) time
  Allows a better index to be constructed for both R-Tree and VA File

24
Outline
  Bregman Divergence
  Solution
    Basic solution
    Better pruning bounds
    Query distribution
  Experiments
  Conclusion

25
Experiments
Data Sets
  KDD'99 data
    Network data: the proportion of packets in 72 different TCP/IP connection types
  DBLP data
    Uses the co-authorship graph to generate the probabilities of authors being related to 8 different areas

26
Experiments
Data Sets
  Uniform Synthetic data
    Generate synthetic data with a uniform distribution
  Clustered Synthetic data
    Generate synthetic data with a Gaussian Mixture Model

27
Experiments
Methods to compare

               Basic   Improved Bounds   Query Distribution
R-Tree         R       R-B               R-BQ
VA File        V       V-B               V-BQ
Linear Scan    LS
BB-Tree        BBT

28
Experiments
Index Construction Time

29
Experiments
Varying dimensionality

30
Experiments
Varying dimensionality (cont.)

31
Experiments
Varying k for nearest neighbor query

32
Conclusion
  A general technique for similarity search on Bregman divergence
  All techniques are based on the existing infrastructure of commercial databases
  Extensive experiments compare the performance of R-Tree and VA File under different optimizations

33
Acknowledgment
Zhenjie Zhang, Anthony K. H. Tung and Beng Chin Ooi were supported by Singapore NRF grant R-252-000-376-279. Srinivasan Parthasarathy was supported by NSF IIS-0347662 (CAREER) and NSF CCF-0702587.

34
Q & A
