Redundant Bit Vectors for the Audio Fingerprinting Server John Platt Jonathan Goldstein Chris Burges.


1 Redundant Bit Vectors for the Audio Fingerprinting Server John Platt Jonathan Goldstein Chris Burges

2 Structure of Talk 1. Problem Statement 2. Problems with Existing Techniques 3. Bit Vectors 4. Partitioning the Query Space 5. Results 6. Future Extensions

3 Problem Statement
Find all hyperspheres that overlap a query point.
- Center of sphere = undistorted fingerprint of a song
- Radius of sphere = acceptable distortion of the fingerprint
- Sphere overlaps query = query song is the same as the database song
[Figure: data spheres S1 to S5 and a query point; the query is song 2.]

4 Existing Techniques
Linear scan
- For each sphere, check if the query is inside
- Linear effort in the size of the database
- Previous best known technique
Data partitioning (R-trees, SS-trees, etc.)
- Store data in a tree of shapes
- Shapes chop up space
- Descend and backtrack in the tree, finding all leaf nodes that could match the query
- Performs worse than linear scan!
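The linear-scan baseline can be sketched as follows. This is only an illustration, not the authors' code: the `Sphere` struct, `contains`, and `linear_scan` are hypothetical names, and the early exit is one plausible reading of the "early bailing" mentioned with the results. The inner loop does one subtract, one multiply, and one add per dimension.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record: a stored fingerprint (sphere center) and the square of
// the acceptable distortion (sphere radius).
struct Sphere {
    std::vector<float> center;
    float radius_sq;
};

// Returns true if `query` lies inside sphere `s`. Bails out as soon as the
// partial squared distance already exceeds the squared radius.
bool contains(const Sphere& s, const std::vector<float>& query) {
    float dist_sq = 0.0f;
    for (std::size_t d = 0; d < s.center.size(); ++d) {
        const float diff = query[d] - s.center[d];
        dist_sq += diff * diff;
        if (dist_sq > s.radius_sq) return false;  // early bail
    }
    return true;
}

// Linear scan: indices of every sphere that contains the query.
std::vector<std::size_t> linear_scan(const std::vector<Sphere>& db,
                                     const std::vector<float>& query) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < db.size(); ++i) {
        if (contains(db[i], query)) hits.push_back(i);
    }
    return hits;
}
```

The cost grows linearly with the database size, which is the effort the bit-vector index is meant to cut down.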

5 Key New Ideas
New algorithm: Redundant Bit Vectors (RBV)
1. Partition the queries, not the data
   - Avoids hopeless task
2. Store each data point redundantly
   - Combats high dimensionality
3. Use bit vectors to index the database
   - Small & fast representation

6 Data Structure: Bit Vectors
Bit vectors represent every data point as one bit (Point 1 through Point N):
- 0 means exclude this example from the linear search
- 1 means the linear search must still look at this example
Example of ANDing four bit vectors:
10111111 & 11011011 → 10011011
10011011 & 11110011 → 10010011
10010011 & 10011101 → 10010001
Most examples are excluded from the final linear scan.
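A minimal sketch of the AND step, assuming each bit vector is packed into 32-bit words. The names `BitVector`, `and_all`, and `survivors` are hypothetical, and the bit order (bit b of word w stands for example 32*w + b) is an assumption made for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed packing: 32 examples per word, least significant bit first.
using BitVector = std::vector<std::uint32_t>;

// ANDs together all given bit vectors (they must have equal length).
BitVector and_all(const std::vector<BitVector>& vectors) {
    BitVector result = vectors.at(0);
    for (std::size_t v = 1; v < vectors.size(); ++v) {
        for (std::size_t w = 0; w < result.size(); ++w) {
            result[w] &= vectors[v][w];  // one AND handles 32 examples
        }
    }
    return result;
}

// Indices of the examples whose bit is still 1 after the ANDs; only these
// need to be visited by the final linear scan.
std::vector<std::size_t> survivors(const BitVector& bits) {
    std::vector<std::size_t> out;
    for (std::size_t w = 0; w < bits.size(); ++w) {
        for (unsigned b = 0; b < 32; ++b) {
            if (bits[w] & (std::uint32_t{1} << b)) out.push_back(w * 32 + b);
        }
    }
    return out;
}
```

Feeding in the four 8-bit patterns from the slide (0xBF, 0xDB, 0xF3, 0x9D) leaves 0x91, that is 10010001, so only three of the eight examples reach the linear scan.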

7 Why Bit Vectors are Good
Small memory footprint
- 1 bit per example per bit vector
Fast on modern CPUs
- One AND operation processes 32 examples per clock cycle
- Compare to Euclidean distance: 3 operations/example/dimension
- Potential speed-up: 32 × 3 = 96x
We use 1 bit vector per dimension in lookup.

8 Partition the Queries, not the Data
For each query dimension:
- The dimension is indexed into bins
- A bit vector is associated with each bin
- When the query falls into a bin, use the corresponding bit vector
AND together the bit vectors for some or all dimensions
- Each dimension trims examples
- Perform a linear scan on the survivors
[Figure: a query falling into one bin of a dimension; example bin bit vectors 10111110, 11011111, 11011111, 11010110.]
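The lookup described above could look roughly like this. It is a sketch under stated assumptions: `DimIndex`, `bin_of`, and `candidates` are hypothetical names, bins are delimited by sorted interior edges, and entry d of the index is assumed to cover query dimension d.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using BitVector = std::vector<std::uint32_t>;

// Hypothetical per-dimension index: sorted interior bin edges plus one bit
// vector per bin, so bins.size() == edges.size() + 1.
struct DimIndex {
    std::vector<float> edges;
    std::vector<BitVector> bins;
};

// Bin that a coordinate falls into: the number of edges <= value.
std::size_t bin_of(const DimIndex& dim, float value) {
    return static_cast<std::size_t>(
        std::upper_bound(dim.edges.begin(), dim.edges.end(), value) -
        dim.edges.begin());
}

// ANDs the bit vector that the query selects in each indexed dimension.
// Assumes index[d] describes query dimension d and all vectors share a length.
BitVector candidates(const std::vector<DimIndex>& index,
                     const std::vector<float>& query) {
    BitVector result = index.at(0).bins[bin_of(index[0], query[0])];
    for (std::size_t d = 1; d < index.size(); ++d) {
        const BitVector& bv = index[d].bins[bin_of(index[d], query[d])];
        for (std::size_t w = 0; w < result.size(); ++w) result[w] &= bv[w];
    }
    return result;
}
```

The set bits of the returned vector mark the survivors that still need the full distance check.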

9 How We Compute the RBVs
At index building time, construct the bit vectors:
- Project spheres into each dimension
- i-th vector, j-th bit = does sphere j overlap bin i?
[Figure: spheres S1 to S6 projected onto one dimension; bin bit vectors 110000, 101000, 001101, 000111.]
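Index construction for a single dimension might be sketched as below. The `Interval` struct (a sphere's projection, center minus radius to center plus radius), the half-open bin convention, and the function name `build_bit_vectors` are assumptions for illustration. A sphere sets its bit in every bin it overlaps, which is where the redundancy comes from.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

using BitVector = std::vector<std::uint32_t>;

// Hypothetical 1-D projection of a sphere onto one dimension: [lo, hi].
struct Interval {
    float lo;
    float hi;
};

// Builds one bit vector per bin for a single dimension.
// `edges` are sorted interior bin edges, giving edges.size() + 1 bins; bin i
// covers [edges[i-1], edges[i]), with the outer bins open-ended.
// Bit j of vector i is 1 when sphere j overlaps bin i.
std::vector<BitVector> build_bit_vectors(const std::vector<Interval>& spheres,
                                         const std::vector<float>& edges) {
    const std::size_t words = (spheres.size() + 31) / 32;
    const std::size_t num_bins = edges.size() + 1;
    const float inf = std::numeric_limits<float>::infinity();
    std::vector<BitVector> bins(num_bins, BitVector(words, 0u));
    for (std::size_t i = 0; i < num_bins; ++i) {
        const float bin_lo = (i == 0) ? -inf : edges[i - 1];
        const float bin_hi = (i + 1 == num_bins) ? inf : edges[i];
        for (std::size_t j = 0; j < spheres.size(); ++j) {
            if (spheres[j].lo < bin_hi && spheres[j].hi >= bin_lo) {
                bins[i][j / 32] |= std::uint32_t{1} << (j % 32);
            }
        }
    }
    return bins;
}
```

Running this once per indexed dimension yields the full set of redundant bit vectors.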

10 How We Decide on the Bin Edges
Two equivalent heuristics:
- Each bin should have ~ the same number of spheres
- Adjacent bit vectors should be ~ a constant Hamming distance apart
Place bin edges an equal number of sphere edges apart.
[Figure: spheres S1 to S6 with bin edges at cumulative sphere-edge counts 3, 6, 9, 12.]
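One way to realize "bin edges an equal number of sphere edges apart" is sketched here. The function name `choose_bin_edges` and the `Interval` struct are hypothetical, and placing each bin edge midway between two neighboring sphere edges is an assumption the slide does not specify.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1-D projection of a sphere onto one dimension: [lo, hi].
struct Interval {
    float lo;
    float hi;
};

// Returns num_bins - 1 interior bin edges so that roughly the same number of
// sphere edges (interval endpoints) falls between consecutive bin edges.
// Returns an empty vector when there are too few sphere edges to split.
std::vector<float> choose_bin_edges(const std::vector<Interval>& spheres,
                                    std::size_t num_bins) {
    std::vector<float> ends;
    ends.reserve(spheres.size() * 2);
    for (const Interval& s : spheres) {
        ends.push_back(s.lo);
        ends.push_back(s.hi);
    }
    std::sort(ends.begin(), ends.end());
    std::vector<float> edges;
    if (num_bins < 2 || ends.size() < num_bins) return edges;
    for (std::size_t b = 1; b < num_bins; ++b) {
        // Cut after every (ends.size() / num_bins) sphere edges; the midpoint
        // placement is an assumption made for this sketch.
        const std::size_t cut = b * ends.size() / num_bins;
        edges.push_back(0.5f * (ends[cut - 1] + ends[cut]));
    }
    return edges;
}
```

With 6 spheres (12 sphere edges) and 4 bins, the cuts fall after the 3rd, 6th, and 9th sphere edge, matching the 3, 6, 9, 12 ticks in the figure.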

11 Improve Selectivity by Shrinking Boxes
Fingerprinting with hyperspheres (L2 norm)
- Low, but non-zero false negative rate
Bit vectors implement hyperrectangles (L∞ norm)
- Bit vectors are guaranteed to never introduce false positives, since the final linear scan still checks every survivor
- Shrunken hyperrectangles were empirically found to introduce no extra false negatives
We shrink the hyperrectangles to speed up the final linear search.

12 Speed Comparisons
Ran 1000 queries against the fingerprint database
- Database size = 240K 64-dimensional points
14 bit vector dimensions used
- Chosen to optimize bit vector + linear scan speed
- More dimensions: bit vectors slow down, linear scan speeds up
32 bins per dimension
- Chosen to optimize the memory/speed tradeoff
Pentium 4, 2.2 GHz

13 Results
Method                              Queries/sec   Factor slower than RBVs
14-D RBVs + 64-D limited L2 scan    1490          1
64-D full L2 scan                   31            48
14-D Hilbert R-trees                10            149
- All linear scans used early bailing
- Measured the code by itself, not in the context of SQL or IIS

14 Code Details ~600 lines of C++ (pretty simple) Not integrated with SQL Server or IIS, etc. Running as part of audio fingerprinting demo

15 How Fast Is It Really? You tell us: it depends on several factors  Linear time in size of database  Linear time in amount of resistance to cropping  Sorting by popularity may help substantially

16 Resistance to Cropping
How much “crop slop” do you need?
- Current system = 5.4 traces / second of slop
- Possible to reduce by 2x
- Server load is linear in the number of traces
[Figure: song timeline marking the true start of the song, the largest acceptable crop to recognize, the location of the fingerprint, and the portion of the song that needs to be searched.]

17 Popularity Sorting May Help
Order the database by approximate popularity of music, then split the search into sections:
1. 5,000 most popular: search here first
2. 50,000 next most popular: if not found, search here
3. The rest: if not found, search here
May yield a substantial speed gain.
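The tiered search could be organized along these lines. This is a hypothetical sketch: `tiered_search` and the `search` callback are illustrative names, the database is assumed to be stored in popularity order, and the callback stands in for whatever per-section lookup is used.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Searches popularity-ordered tiers in turn and stops at the first hit.
// `tier_ends` holds the exclusive end index of each tier, for example
// {5000, 55000, N} for a database of N records sorted by popularity.
// `search` is a hypothetical callback that looks for the query among records
// [begin, end) and returns the matching index, or -1 if there is none.
long tiered_search(const std::vector<std::size_t>& tier_ends,
                   const std::function<long(std::size_t, std::size_t)>& search) {
    std::size_t begin = 0;
    for (std::size_t end : tier_ends) {
        const long hit = search(begin, end);
        if (hit >= 0) return hit;  // found: skip the less popular tiers
        begin = end;
    }
    return -1;  // not found in any tier
}
```

If most queries are for popular songs, most lookups would end in the first, smallest tier; how much that helps in practice is not measured in the talk.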

18 Memory Performance
Bit vectors can stay resident in memory
- For 240K songs
- All fingerprints live in 128 MB
- Bit vector indices only require 13 MB
- We can store fingerprints as 2-byte shorts and save 2x memory
Bit vector search blows out of cache
- Speed depends on the memory bandwidth of the server
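The index size can be checked with a little arithmetic, assuming the figures given elsewhere in the talk (14 indexed dimensions, 32 bins per dimension, 1 bit per example per bit vector) and reading 240K as 240,000. The helper name `index_bytes` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for the bit-vector index: one bit per example, for every bin
// of every indexed dimension (each vector rounded up to whole bytes).
constexpr std::uint64_t index_bytes(std::uint64_t dims, std::uint64_t bins,
                                    std::uint64_t examples) {
    return dims * bins * ((examples + 7) / 8);
}
```

`index_bytes(14, 32, 240000)` evaluates to 13,440,000 bytes, which is consistent with the roughly 13 MB quoted for the indices.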

19 Summary Bit Vectors are Simple, Small, and Fast Must be used to get good server-side performance

