Redundant Bit Vectors for the Audio Fingerprinting Server
John Platt, Jonathan Goldstein, Chris Burges


Structure of Talk
1. Problem Statement
2. Problems with Existing Techniques
3. Bit Vectors
4. Partitioning the Query Space
5. Results
6. Future Extensions

Problem Statement
Find all hyperspheres that overlap a query point
• Center of sphere = undistorted fingerprint of a song
• Radius of sphere = acceptable distortion of fingerprint
• Sphere overlaps query = query song is the same as the database song
[Figure: data spheres S1 to S5 and a query point; the query is song 2]

Existing Techniques
Linear scan
• For each sphere, check if the query is inside
• Linear effort in size of database
• Previous best known technique
Data partitioning (R-trees, SS-trees, etc.)
• Store data in a tree of shapes
• Shapes chop up space
• Descend and backtrack in tree, finding all leaf nodes that could match the query
• Performs worse than linear scan!
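
The linear-scan baseline described above can be sketched as follows. This is a minimal illustration, not the code from the talk: the function name `linear_scan`, the list-of-tuples layout, and the toy 2-D data are all assumptions made for the example.

```python
def linear_scan(query, centers, radii):
    """Return indices of all spheres that contain the query point."""
    hits = []
    for j, (center, radius) in enumerate(zip(centers, radii)):
        # Squared Euclidean (L2) distance from the query to the sphere center.
        dist_sq = sum((q - c) ** 2 for q, c in zip(query, center))
        if dist_sq <= radius ** 2:
            hits.append(j)
    return hits


# Toy 2-D database (hypothetical values; real fingerprints are 64-dimensional).
centers = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0)]
radii = [1.0, 1.0, 2.0]
print(linear_scan((0.5, 0.5), centers, radii))  # prints [0, 2]
```

Every query touches every sphere, which is the "linear effort in size of database" that the rest of the talk tries to reduce.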

Key New Ideas
New algorithm: Redundant Bit Vectors (RBV)
1. Partition the queries, not the data
   Avoids hopeless task
2. Store each data point redundantly
   Combats high dimensionality
3. Use bit vectors to index database
   Small & fast representation

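Data Structure: Bit Vectors
Bit vectors represent every data point as one bit
• 0 means exclude this example from the linear search
• 1 means the linear search must still look at this example
[Figure: several bit vectors, each running from Point 1 to Point N, are ANDed together (& & & →) into a single result vector]
Most examples are excluded from the final linear scan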
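
As a rough sketch of the idea on this slide, the snippet below ANDs a few bit vectors together and lists the points that survive. Python integers stand in for the machine-word arrays a real implementation would use; the name `surviving_points` and the sample vectors are assumptions for illustration only.

```python
def surviving_points(bit_vectors, num_points):
    """AND the bit vectors together and return the indices still set to 1."""
    combined = (1 << num_points) - 1  # start with every point as a candidate
    for bv in bit_vectors:
        combined &= bv  # a 0 in any vector excludes that point
    return [j for j in range(num_points) if (combined >> j) & 1]


# Bit j (counting from the least significant bit) stands for data point j.
vectors = [0b10110, 0b00111, 0b01110]
print(surviving_points(vectors, 5))  # prints [1, 2]
```

Only points 1 and 2 have a 1 in all three vectors, so only those two would be handed to the final linear scan.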

Why Bit Vectors are Good
Small memory footprint
• 1 bit per example per bit vector
Fast on modern CPUs
• One CPU operation processes 32 examples per clock cycle
• Compare to Euclidean distance: 3 operations/example/dimension
• Potential speedup: 96x (32 × 3)
We use 1 bit vector per dimension in lookup
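
The 96x figure can be reproduced with a short calculation. This is one reading of the slide rather than a measured result. It assumes a 32-bit machine word, counts one AND as one operation, and uses one bit vector per dimension.

```latex
\text{speedup} \approx
\frac{3\ \text{operations per example per dimension (Euclidean distance)}}
     {\tfrac{1}{32}\ \text{operations per example per dimension (bit vector AND)}}
= 3 \times 32 = 96
```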

Partition the Queries, not the Data
For each query dimension:
• The dimension is divided into bins
• A bit vector is associated with each bin
• When the query falls into a bin, use the corresponding bit vector
AND together the bit vectors for some or all dimensions
• Each dimension trims examples
• Perform linear scan on survivors
[Figure: one dimension divided into bins, with the query falling into one bin]
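
The lookup steps above might be sketched like this. It is a simplified illustration under several assumptions: bit vectors are stored as Python integers, bins are located with `bisect_right` on a sorted list of interior edges, and the tiny hand-built index below (names `bin_edges`, `bit_vectors`, `query_rbv`) is hypothetical.

```python
from bisect import bisect_right


def query_rbv(query, bin_edges, bit_vectors, centers, radii):
    """Filter with one bit vector per indexed dimension, then scan survivors.

    bin_edges[d]      : sorted interior bin edges for dimension d
    bit_vectors[d][b] : int whose bit j is 1 if sphere j may overlap bin b
    """
    candidates = (1 << len(centers)) - 1
    for d, edges in enumerate(bin_edges):
        b = bisect_right(edges, query[d])  # which bin the query falls into
        candidates &= bit_vectors[d][b]    # each dimension trims examples
    hits = []
    for j in range(len(centers)):          # linear scan on survivors only
        if (candidates >> j) & 1:
            dist_sq = sum((q - c) ** 2 for q, c in zip(query, centers[j]))
            if dist_sq <= radii[j] ** 2:
                hits.append(j)
    return hits


# Three unit spheres in 2-D, two bins per dimension split at 2.5.
centers = [(1.0, 1.0), (4.0, 1.0), (4.0, 4.0)]
radii = [1.0, 1.0, 1.0]
bin_edges = [[2.5], [2.5]]
bit_vectors = [[0b001, 0b110], [0b011, 0b100]]
print(query_rbv((4.2, 1.1), bin_edges, bit_vectors, centers, radii))  # prints [1]
```

The second test query in this setup, (2.9, 1.0), passes the bit vector filter for sphere 1 but fails the exact distance check, which is why the final scan is still needed.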

How We Compute the RBVs
At index building time, construct the bit vectors:
• Project spheres into each dimension
• i-th vector, j-th bit = does sphere j overlap bin i?
[Figure: spheres S1 to S6 projected onto one dimension]
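
A minimal sketch of that index-building step is shown below. It assumes each sphere projects to the interval [center - radius, center + radius] in a dimension, and that bins are half-open intervals delimited by sorted interior edges. The function name `build_bit_vectors` and the data layout are illustrative choices, not the talk's implementation.

```python
from bisect import bisect_right


def build_bit_vectors(centers, radii, bin_edges):
    """Return vectors[d][i], an int whose bit j says: does sphere j overlap bin i?"""
    all_vectors = []
    for d, edges in enumerate(bin_edges):
        vectors = [0] * (len(edges) + 1)
        for j, (center, radius) in enumerate(zip(centers, radii)):
            # Project the sphere onto dimension d.
            lo = center[d] - radius
            hi = center[d] + radius
            first = bisect_right(edges, lo)
            last = bisect_right(edges, hi)
            # Redundancy: the sphere is recorded in every bin it touches.
            for i in range(first, last + 1):
                vectors[i] |= 1 << j
        all_vectors.append(vectors)
    return all_vectors


centers = [(1.0, 1.0), (4.0, 1.0), (4.0, 4.0)]
radii = [1.0, 1.0, 1.0]
print(build_bit_vectors(centers, radii, [[2.5], [2.5]]))  # prints [[1, 6], [3, 4]]
```

A sphere that straddles a bin edge gets its bit set in both neighbouring bins, which is where the "redundant" in the name comes from.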

How We Decide on the Bin Edges
Two equivalent heuristics
• Each bin should have roughly the same number of spheres
• Adjacent bit vectors should be roughly a constant Hamming distance apart
Place bin edges an equal number of sphere edges apart
[Figure: spheres S1 to S6 along one dimension with bin edges]
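
One way to read "an equal number of sphere edges apart" is sketched below: collect the lower and upper edge of every projected sphere, sort them, and place a bin edge at every k-th one. Placing the bin edge exactly on a sphere edge and the name `choose_bin_edges` are assumptions of this sketch; the talk does not spell out these details.

```python
def choose_bin_edges(centers, radii, dim, num_bins):
    """Pick interior bin edges spaced an equal number of sphere edges apart."""
    sphere_edges = sorted(
        edge
        for center, radius in zip(centers, radii)
        for edge in (center[dim] - radius, center[dim] + radius)
    )
    step = len(sphere_edges) / num_bins
    # num_bins bins need num_bins - 1 interior edges.
    return [sphere_edges[int(i * step)] for i in range(1, num_bins)]


centers = [(1.0,), (2.0,), (5.0,), (9.0,)]
radii = [0.5, 0.5, 0.5, 0.5]
print(choose_bin_edges(centers, radii, dim=0, num_bins=4))  # prints [1.5, 4.5, 8.5]
```

Each crossing of a sphere edge flips one bit between neighbouring bit vectors, so spacing bin edges by a fixed count of sphere edges keeps the Hamming distance between adjacent vectors roughly constant.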

Improve Selectivity by Shrinking Boxes
Fingerprinting with hyperspheres (L2 norm)
• Low, but non-zero false negative rate
Bit vectors implement hyperrectangles (L∞ norm)
• With bounding hyperrectangles, bit vectors are guaranteed never to introduce false negatives
• With shrunken hyperrectangles, bit vectors were empirically found to introduce no extra false negatives
We shrink the hyperrectangles to speed up the final linear search

Speed Comparisons
Ran 1000 queries against fingerprint database
Database size = 240K 64-dimensional points
14 bit vector dimensions used
• Chosen to optimize bit vector + linear scan speed
• More dimensions: bit vectors slow down, linear scan speeds up
32 bins per dimension
• Chosen to optimize memory/speed tradeoff
Pentium 4, 2.2 GHz

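Results
[Table: columns are Method, Queries/sec, and Factor Slower Than RBVs; the numeric values and some dimension prefixes did not survive in this transcript]
Methods compared:
• 14-D RBVs + 64-D limited L2 scan
• Full L2 scan
• Hilbert R-trees
All linear scans used early bailing
Measured code by itself, not in context of SQL or IIS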
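
The talk does not define "early bailing". A common interpretation, assumed here, is to stop accumulating the squared distance as soon as it exceeds the squared radius. The helper name `inside_with_early_bail` is hypothetical.

```python
def inside_with_early_bail(query, center, radius):
    """Sphere membership test that stops once the point is provably outside."""
    limit = radius ** 2
    dist_sq = 0.0
    for q, c in zip(query, center):
        dist_sq += (q - c) ** 2
        if dist_sq > limit:
            # Remaining dimensions can only increase the distance, so bail out.
            return False
    return True


print(inside_with_early_bail((0.0, 0.0, 0.0), (0.1, 0.1, 0.1), 1.0))  # prints True
print(inside_with_early_bail((5.0, 0.0, 0.0), (0.0, 0.0, 0.0), 1.0))  # prints False
```

In the second call the first dimension alone already exceeds the radius, so the other dimensions are never examined.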

Code Details
~600 lines of C++ (pretty simple)
Not integrated with SQL Server or IIS, etc.
Running as part of audio fingerprinting demo

How Fast Is It Really?
You tell us: it depends on several factors
• Linear time in size of database
• Linear time in amount of resistance to cropping
• Sorting by popularity may help substantially

Resistance to Cropping
How much “crop slop” do you need?
Current system = 5.4 traces / second of slop
• Possible to reduce by 2x
Server load is linear in number of traces
[Figure: song timeline marking the true start of the song, the largest acceptable crop to recognize, the location of the fingerprint, and the portion of the song that needs to be searched]

Popularity Sorting May Help
Order database by approximate popularity of music
Split search into different sections
• 5000 most popular: first search here
• Next most popular: if not found, search here
• The rest: if not found, search here
May yield substantial speed gain
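
The tiered search suggested on this slide could look roughly like the sketch below. The tier contents, the song names, and the helper names `tiered_search` and `scan_tier` are made up for illustration; any per-tier search routine (for example an RBV index per tier) could be plugged in instead of the plain scan used here.

```python
def tiered_search(query, tiers, search_tier):
    """Search tiers in popularity order and stop at the first tier with a match."""
    for tier_number, tier in enumerate(tiers):
        hits = search_tier(query, tier)
        if hits:
            return tier_number, hits
    return None, []


def scan_tier(query, tier):
    """Plain linear scan over one tier of (name, center, radius) entries."""
    return [
        name
        for name, center, radius in tier
        if sum((q - c) ** 2 for q, c in zip(query, center)) <= radius ** 2
    ]


# Hypothetical tiers, most popular first.
tiers = [
    [("hit_single", (0.0, 0.0), 1.0)],
    [("album_track", (5.0, 5.0), 1.0)],
    [("rare_b_side", (9.0, 9.0), 1.0)],
]
print(tiered_search((5.2, 4.9), tiers, scan_tier))  # prints (1, ['album_track'])
```

If most queries are for popular songs, most searches end in the first, smallest tier, which is the source of the hoped-for speed gain.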

Memory Performance
Bit vectors can stay resident in memory
• For 240K songs
• All fingerprints live in 128M
• Bit vector indices only require 13M
• We can store fingerprints as 2-byte shorts, saving 2x memory
Bit vector search blows out of cache
• Speed depends on memory bandwidth of server
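
The 13M index size is consistent with the parameters given on the Speed Comparisons slide, as the following back-of-the-envelope check suggests. It assumes one indexed point per song (240K points), 14 indexed dimensions, 32 bins per dimension, and one bit per point per bin; the variable names are illustrative.

```python
num_points = 240_000        # database size from the speed comparison
rbv_dimensions = 14         # bit vector dimensions used
bins_per_dimension = 32     # one bit vector per bin

# One bit per point in every bit vector.
index_bits = num_points * rbv_dimensions * bins_per_dimension
index_megabytes = index_bits / 8 / 1_000_000
print(f"{index_megabytes:.2f} MB")  # prints 13.44 MB
```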

Summary
Bit vectors are simple, small, and fast
Must be used to get good server-side performance