Audio Fingerprinting Wes Hatch MUMT-614 Mar.13, 2003.

What is Audio Fingerprinting?
A small, unknown segment of audio data (it can be as short as a couple of seconds) is used to identify the original audio file from which it came.

Applications
Broadcast monitoring
  – playlist generation
  – royalty collection
  – ad verification
Connected Audio
  – general term for consumer applications
Other
  – Napster: use of fingerprinting systems to prohibit the transmission of copyrighted materials
  – finding desired content efficiently in “an overwhelming amount of audio material”

“Benefits”
Automated search for illegal content on the Internet
  – examines the real audio information rather than just tag information
For the consumer
  – makes the meta-data of songs in a library consistent, allowing for easy organization
  – can guarantee that what is downloaded is actually what it says it is
  – will allow the consumer to record signatures of sound and music on small handheld devices

Two principal components
Compute the fingerprint
Compare it to a database of previously computed fingerprints
  – a text example: “…in a box. I will not eat them with a fox. I…”

Details to worry about
Robustness (to noise, distortion)
Reliability
Fingerprint size (reduced dimensionality)
Granularity
Search speed and scalability
Computational efficiency
Resulting features must be informative about the audio content
Semantic or non-semantic features?
Hash table or vector representation?

Computing the fingerprint
Compare to hash functions…?
  – compare the computed hash value with that stored in a database
Drawback
  – need to worry about perceptual similarity, not mathematical similarity
      PCM audio vs. MP3: both sound alike but mathematically (i.e. in spectral content) are quite different
  – perceptual similarity is not transitive
      not possible to design a system which computes mathematically equal fingerprints for perceptually similar objects
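The drawback above can be illustrated with a minimal sketch. It assumes SHA-256 as a stand-in for "a hash function" and a made-up random signal; the helper names are hypothetical and not from the presentation. A change of one least-significant bit in one sample is inaudible, yet the hash no longer matches, which is why fingerprints are compared by similarity (for example the fraction of differing bits) rather than by equality.

```python
import hashlib
import numpy as np

def sha256_hex(samples: np.ndarray) -> str:
    """Conventional hash of the raw sample bytes (illustrative choice)."""
    return hashlib.sha256(samples.tobytes()).hexdigest()

def bit_error_rate(a: int, b: int, n_bits: int = 32) -> float:
    """Fraction of the n_bits positions in which a and b differ."""
    mask = (1 << n_bits) - 1
    return bin((a ^ b) & mask).count("1") / n_bits

# Made-up 16-bit PCM-like signal.
rng = np.random.default_rng(0)
original = rng.integers(-2000, 2000, size=1024).astype(np.int16)

# "Distort" it by one least-significant bit in a single sample.
distorted = original.copy()
distorted[0] += 1

# The two signals are perceptually identical, but the hashes differ.
hash_matches = sha256_hex(original) == sha256_hex(distorted)

# A similarity measure degrades gracefully instead: two 32-bit
# values that differ in one bit have a bit-error rate of 1/32.
example_ber = bit_error_rate(0b1011, 0b1001)
```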

Techniques (general)
Any ‘x’ number of seconds may be used to compute the fingerprint
Audio gets separated into frames
  – features computed for each frame:
      Fourier coefficients
      MFCC, LPC
      spectral flatness
      sharpness
“features mapped into a more compact representation by using …HMM, or quantization”
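The framing step and two of the listed features (Fourier magnitudes and spectral flatness) can be sketched as follows. The frame length, hop size, sample rate and test signals are arbitrary assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np

def split_into_frames(signal: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Cut a 1-D signal into overlapping frames of frame_len samples."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])

def frame_features(frames: np.ndarray):
    """Return per-frame Fourier magnitudes and spectral flatness."""
    window = np.hanning(frames.shape[1])
    magnitudes = np.abs(np.fft.rfft(frames * window, axis=1))
    power = magnitudes ** 2 + 1e-12  # small offset avoids log(0)
    # Flatness = geometric mean / arithmetic mean of the power spectrum:
    # near 0 for a pure tone, larger for noise-like frames.
    flatness = np.exp(np.mean(np.log(power), axis=1)) / np.mean(power, axis=1)
    return magnitudes, flatness

sr = 8000                                   # assumed sample rate
t = np.arange(sr) / sr                      # one second of audio
tone = np.sin(2 * np.pi * 440 * t)          # tonal test signal
noise = np.random.default_rng(0).standard_normal(sr)

tone_frames = split_into_frames(tone, frame_len=512, hop=256)
noise_frames = split_into_frames(noise, frame_len=512, hop=256)
tone_mag, tone_flat = frame_features(tone_frames)
noise_mag, noise_flat = frame_features(noise_frames)
```

Each row of `tone_mag` is one frame's feature vector; a later stage would map these into a more compact representation.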

Techniques (Haitsma, Kalker)
One 32-bit sub-fingerprint every 11.6 ms
  – a block consists of 256 sub-fingerprints
      corresponds to a granularity of only 3 seconds
  – large overlap (31/32), so subsequent sub-fingerprints are similar and vary slowly in time
  – worst-case scenario: the frame boundaries used during identification are 5.8 ms off from those in the database
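The timing figures on this slide follow from one another. The short calculation below uses only the numbers quoted above; the frame length is derived from them (a 31/32 overlap means the hop is 1/32 of the frame) rather than stated on the slide.

```python
HOP_S = 0.0116            # one sub-fingerprint every 11.6 ms
SUBFPS_PER_BLOCK = 256    # sub-fingerprints per block
OVERLAP_FACTOR = 32       # overlap of 31/32 -> hop is 1/32 of a frame

# Audio covered by one block: the "granularity" of about 3 seconds.
block_duration_s = HOP_S * SUBFPS_PER_BLOCK

# Frame length implied by the hop size and the 31/32 overlap.
frame_length_s = HOP_S * OVERLAP_FACTOR

# Worst case, the query's frame grid sits halfway between two
# database frame positions, i.e. half a hop away from either.
worst_case_offset_s = HOP_S / 2

print(f"block: {block_duration_s:.2f} s, frame: {frame_length_s * 1000:.0f} ms, "
      f"worst-case offset: {worst_case_offset_s * 1000:.1f} ms")
```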

Techniques (Haitsma, Kalker)
Data from each frame is sent through a filterbank
  – 33 filters, logarithmically spaced (to correspond roughly to the Bark scale) between 300 and 2000 Hz
  – phase is neglected (perceptual reasons)
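A rough sketch of how 33 band energies could yield a 32-bit sub-fingerprint. The slide does not say how the bits are derived; the rule used here (sign of the energy difference between neighbouring bands, differenced again against the previous frame) is an assumption for illustration, as are the FFT-based filterbank, the function names and the test values.

```python
import numpy as np

def band_edges(n_bands: int = 33, f_lo: float = 300.0, f_hi: float = 2000.0) -> np.ndarray:
    """Logarithmically spaced band edges between f_lo and f_hi (Hz)."""
    return np.geomspace(f_lo, f_hi, n_bands + 1)

def band_energies(frame: np.ndarray, sr: float, edges: np.ndarray) -> np.ndarray:
    """Energy per band from the magnitude spectrum; phase is discarded."""
    power = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return np.array([power[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(edges[:-1], edges[1:])])

def sub_fingerprint(energies_now: np.ndarray, energies_prev: np.ndarray) -> int:
    """Assumed bit rule: 33 band energies -> 32 sign bits packed into an int."""
    diff_now = energies_now[:-1] - energies_now[1:]      # neighbouring bands
    diff_prev = energies_prev[:-1] - energies_prev[1:]
    bits = (diff_now - diff_prev) > 0                    # change over time
    value = 0
    for bit in bits:
        value = (value << 1) | int(bit)
    return value

# Two made-up consecutive noise frames at an assumed sample rate.
sr, frame_len = 5512, 2048
rng = np.random.default_rng(1)
edges = band_edges()
e_prev = band_energies(rng.standard_normal(frame_len), sr, edges)
e_now = band_energies(rng.standard_normal(frame_len), sr, edges)
fp = sub_fingerprint(e_now, e_prev)
```

Because only the signs of energy differences are kept, moderate changes in level or spectral shape leave most bits unchanged, which is the kind of robustness the earlier slides ask for.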

System overview

Techniques (Burges, Platt)
Audio is downsampled to 11.025 kHz and split into frames with an overlap factor of 2
  – MCLT is then applied to each frame
  – a 128-sample log spectrum is generated by taking the log modulus of each MCLT coefficient

Techniques (Burges, Platt)
Use prior knowledge to define the form of the feature extractor
Features computed by a “linear, convolutional” neural network
Convert the signal into a feature vector
  – uses principal component analysis (PCA) to find a set of projections
  – generates a vector of 128 values for every 11.6 ms interval
Dimensionality-reduction method (i.e. lots of math)

Techniques (Burges, Platt)
3 layers of Oriented PCA (OPCA)
  – operates on a frame of 128 values
      layer 1: generates 10 values for each frame
      layer 2: takes 42 ‘layer 1 outputs’ and produces 20 values
      layer 3: takes 40 ‘layer 2 outputs’ and produces 64 values
      (11K inputs --> 64 outputs)

Searching the Database
Look for the most similar (not necessarily exact) fingerprint
  – 10,000 5-min. songs --> roughly 250 million sub-fingerprints
  – brute force takes in excess of 20 minutes on a very fast PC
      brute force computes the bit-error rate for every possible position in the database
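The size estimate is consistent with the earlier numbers: 10,000 songs x 300 s / 11.6 ms is about 259 million sub-fingerprints. The brute-force search described above can be sketched as below, on a small made-up database of random 32-bit values; the function names and sizes are hypothetical.

```python
import numpy as np

def count_bit_errors(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits between two uint32 arrays of equal length."""
    xor = np.bitwise_xor(a, b)
    return int(np.unpackbits(xor.view(np.uint8)).sum())

def brute_force_search(database: np.ndarray, query: np.ndarray):
    """Compute the bit-error rate at every position; return the best one."""
    n_bits = query.size * 32
    best_pos, best_ber = -1, 1.0
    for pos in range(len(database) - len(query) + 1):
        ber = count_bit_errors(database[pos:pos + len(query)], query) / n_bits
        if ber < best_ber:
            best_pos, best_ber = pos, ber
    return best_pos, best_ber

# Made-up database of 5,000 random sub-fingerprints.
rng = np.random.default_rng(0)
database = rng.integers(0, 2 ** 32, size=5000, dtype=np.uint64).astype(np.uint32)

# Query: a 256-sub-fingerprint block taken from position 1234,
# with one bit flipped in every 8th sub-fingerprint (32 bit errors).
query = database[1234:1234 + 256].copy()
query[::8] ^= np.uint32(1)

best_pos, best_ber = brute_force_search(database, query)
```

The loop visits every database position, so the cost grows linearly with the number of stored sub-fingerprints, which is what makes this approach impractical at 250 million entries.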

Searching the Database
Make the assumption that at least 1 (of the 256) sub-fingerprints is error-free
  – then, use a hash table (as opposed to a more memory-intensive look-up table)
  – 800,000 times faster
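A minimal sketch of the hash-table idea, assuming a Python dict that maps each 32-bit sub-fingerprint value to the database positions where it occurs. Only positions suggested by an exactly matching sub-fingerprint are checked in full. The acceptance threshold `max_ber` and all names are hypothetical choices for illustration.

```python
from collections import defaultdict
import numpy as np

def build_index(database: np.ndarray) -> dict:
    """Map each sub-fingerprint value to the list of positions holding it."""
    index = defaultdict(list)
    for pos, value in enumerate(database):
        index[int(value)].append(pos)
    return index

def count_bit_errors(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits between two uint32 arrays of equal length."""
    return int(np.unpackbits(np.bitwise_xor(a, b).view(np.uint8)).sum())

def indexed_search(database, index, query, max_ber: float = 0.25):
    """Check only candidate positions where some query value matches exactly."""
    n_bits = query.size * 32
    for offset, value in enumerate(query):
        for pos in index.get(int(value), []):
            start = pos - offset              # block start implied by this hit
            if start < 0 or start + len(query) > len(database):
                continue
            block = database[start:start + len(query)]
            ber = count_bit_errors(block, query) / n_bits
            if ber <= max_ber:                # hypothetical threshold
                return start, ber
    return None                               # no error-free sub-fingerprint led to a match

# Same made-up setup as a brute-force test: random database, noisy query.
rng = np.random.default_rng(0)
database = rng.integers(0, 2 ** 32, size=5000, dtype=np.uint64).astype(np.uint32)
query = database[1234:1234 + 256].copy()
query[::8] ^= np.uint32(1)                    # corrupt every 8th sub-fingerprint

index = build_index(database)
result = indexed_search(database, index, query)
```

If every sub-fingerprint in the query contains at least one bit error, this lookup finds no candidates and returns `None`, which is exactly the assumption the slide calls out.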

Results
False-positive rate of 3.6 x 10^-2 (Haitsma, Kalker)
On tests with a large (500,000) set of input traces
  – has a “low” false-positive and false-negative rate (Burges, Platt)
  – didn’t test on time compression, expansion
Can withstand distortions occurring from transmission over mobile phones
