Presentation on theme: "National Taiwan University"— Presentation transcript:

1 National Taiwan University
Audio Fingerprinting
J.-S. Roger Jang (張智星)
MIR Lab, CSIE Dept., National Taiwan University

2 Intro to Audio Fingerprinting (AFP)
Goal
  Identify a noisy version of a given audio clip
Also known as…
  “Query by exact example” → no “cover versions”
Can also be used to…
  Align two different-speed audio clips of the same source
  Dan Ellis used AFP for aligned annotation on the Beatles dataset

3 AFP Challenges
Music variations
  Encoding/compression (MP3 encoding, etc.)
Channel variations
  Speakers & microphones, room acoustics
  Environmental noise
Efficiency (6M tags/day for Shazam)
Database collection (15M tracks for Shazam)

4 AFP Applications
Commercial applications of AFP
  Music identification & purchase
  Royalty assignment (over radio)
  TV show or commercial ID (over TV)
  Copyright violation detection (over the web)
Major commercial players
  Shazam, Soundhound, IntoNow, Viggle…

5 Company: Shazam
Facts
  First commercial product of AFP
  Since 2002, UK
Technology
  Audio fingerprinting
Founder
  Avery Wang (PhD at Stanford, 1994)

6 Company: Soundhound
Facts
  First product with multi-modal music search
  AKA: midomi
Technologies
  Audio fingerprinting
  Query by singing/humming
  Speech recognition
Founder
  Keyvan Mohajer (PhD at Stanford, 2007)

7 Two Stages in AFP
Offline
  Feature extraction
  Hash table construction for songs in the database (inverted indexing)
Online
  Feature extraction
  Hash table search
  Ranked list of the retrieved songs/music

8 Robust Feature Extraction
Various kinds of features for AFP
  Invariance along time and frequency
  Landmarks formed from pairs of local maxima
  Wavelets
Extensive tests are required for choosing the best features

9 Representative Approaches to AFP
Philips
  J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system”, ISMIR, 2002.
Shazam
  A. Wang, “An industrial-strength audio search algorithm”, ISMIR, 2003.
Google
  S. Baluja and M. Covell, “Content fingerprinting using wavelets”, Euro. Conf. on Visual Media Production, 2006.
  V. Chandrasekhar, M. Sharifi, and D. A. Ross, “Survey and evaluation of audio fingerprinting schemes for mobile query-by-example applications”, ISMIR, 2011.

10 Philips: Thresholding as Features
Observation
  The sign of energy differences is robust to various operations:
  lossy encoding, range compression, added noise
Thresholding as features
  Magnitude spectrum S(t, f) → fingerprint F(t, f)
(Source: Dan Ellis)
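The sign-of-energy-difference idea can be sketched in a few lines (a minimal sketch, not Philips' exact implementation: it assumes band energies are already available as a frames × bands array, whereas the real system derives 33 log-spaced bands from overlapping frames):

```python
import numpy as np

def philips_fingerprint(energies):
    """Fingerprint bits from the sign of band-energy differences.

    energies: 2-D array of shape (frames, bands).
    Returns a (frames-1, bands-1) array of 0/1 bits: the energy
    difference along frequency is differenced again along time, and
    only its sign is kept, which is what makes the bits robust to
    gain changes, lossy encoding, and mild noise.
    """
    e = np.asarray(energies, dtype=float)
    df = e[:, :-1] - e[:, 1:]      # E(t, f) - E(t, f+1)
    dt = df[1:, :] - df[:-1, :]    # minus the same difference at frame t-1
    return (dt > 0).astype(np.uint8)
```

Because only the sign survives, multiplying all energies by a constant gain leaves every bit unchanged, which is exactly the robustness the slide describes.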

11 Philips: Thresholding as Features (II)
Robust to low-bitrate MP3 encoding
  Original fingerprint vs. fingerprint after MP3 encoding: BER = 0.078
Sensitive to “frame time difference” → hop size is kept small!

12 Philips: Robustness of Features
BER of the features after various operations
  Generally low
  High for speed and time-scale changes (which are not likely to occur under query by example)

13 Philips: Search Strategies
Via hashing
Inverted indexing

14 Shazam’s Method
Ideas
  Take advantage of local structures in music
  Find salient peaks on the spectrogram
  Pair peaks to form landmarks for comparison
Efficient search by hash tables
  Use positions of landmarks as hash keys
  Use song ID and offset time as hash values
  Use time constraints to find matched landmarks

15 Database Preparation
Compute spectrogram
  Perform mean subtraction & high-pass filtering
Detect salient peaks
  Find the initial threshold
  Update the threshold along time
Pair salient peaks to form landmarks
  Define the target zone
  Form landmarks and save them to a hash table

16 Query Match
Identify landmarks
Find matched landmarks
  Retrieve landmarks from the hash table
  Keep only time-consistent landmarks
Rank the database items
  Via matched landmark counts
  Via other confidence measures

17 Shazam: Landmarks as Features
Pair peaks in the target zone to form landmarks
  Landmark: [t1, f1, t2, f2] over salient peaks of the spectrogram
24-bit hash key
  f1: 9 bits
  Δf = f2 - f1: 8 bits
  Δt = t2 - t1: 7 bits
Hash value
  Song ID
  Landmark’s start time t1
(Avery Wang, 2003)

18 How to Find Salient Peaks
We need peaks that are salient along both the frequency and time axes
  Frequency axis: Gaussian local smoothing
  Time axis: decaying threshold over time

19 How to Find the Initial Threshold?
Goal
  Suppress neighboring peaks
Ideas
  Find the local maxima of the magnitude spectra of the initial 10 frames
  Superimpose a Gaussian on each local max
  Take the max of all Gaussians as the threshold
Example
  Based on “Bad Romance” (envelopeGen.m)
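The steps above can be sketched as follows (a sketch of the idea behind envelopeGen.m, not its actual code; the Gaussian width `sigma` and the bin-indexed frequency axis are assumptions):

```python
import numpy as np

def initial_threshold(mag_spec, n_frames=10, sigma=8.0):
    """Build the initial frequency-dependent threshold.

    mag_spec: (frames, bins) magnitude spectrogram.
    For every interior local max in the first n_frames frames,
    superimpose a Gaussian centred at its bin and scaled to its
    magnitude; the threshold is the pointwise max of all Gaussians,
    so neighbouring (weaker) peaks fall below it and are suppressed.
    """
    spec = np.asarray(mag_spec, dtype=float)[:n_frames]
    bins = spec.shape[1]
    freq = np.arange(bins)
    thr = np.zeros(bins)
    for frame in spec:
        # Interior local maxima of this frame's magnitude spectrum.
        peaks = np.where((frame[1:-1] > frame[:-2]) &
                         (frame[1:-1] > frame[2:]))[0] + 1
        for p in peaks:
            gauss = frame[p] * np.exp(-0.5 * ((freq - p) / sigma) ** 2)
            thr = np.maximum(thr, gauss)
    return thr
```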

20 How to Update the Threshold along Time?
Decay the threshold
Find local maxima larger than the threshold → salient peaks
Define the new threshold as the max of the old threshold and the Gaussians passing through the active local maxima
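One update step can be sketched like this (again an illustrative sketch; the `decay` factor and `sigma` are assumed values, not the toolbox's):

```python
import numpy as np

def update_threshold(thr, frame, decay=0.998, sigma=8.0):
    """One time step of the decaying-threshold peak picker.

    1. Decay the previous threshold.
    2. Local maxima of `frame` above the threshold become salient peaks.
    3. New threshold = max(decayed threshold, Gaussians through the
       newly accepted peaks).
    Returns (new_threshold, list of peak bins).
    """
    thr = np.asarray(thr, dtype=float) * decay            # step 1
    frame = np.asarray(frame, dtype=float)
    freq = np.arange(frame.size)
    local_max = np.where((frame[1:-1] > frame[:-2]) &
                         (frame[1:-1] > frame[2:]))[0] + 1
    peaks = [p for p in local_max if frame[p] > thr[p]]   # step 2
    for p in peaks:                                       # step 3
        gauss = frame[p] * np.exp(-0.5 * ((freq - p) / sigma) ** 2)
        thr = np.maximum(thr, gauss)
    return thr, peaks
```

A peak accepted at one frame raises the threshold around its bin, so a slightly weaker peak at the same bin in the next frame is rejected until the threshold has decayed.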

21 How to Control the Number of Salient Peaks?
To decrease the number of salient peaks:
  Perform forward and backward sweeps to find peaks that are salient along both directions
  Use Gaussians with a larger standard deviation

22 Time-decaying Thresholds
Forward and backward threshold sweeps (see landmarkFind01.m)

23 How to Pair Salient Peaks?
Target zone
  A target zone is created immediately following each salient peak
  The leading peak is paired with each peak in the target zone to form landmarks
  Each landmark is denoted by [t1, f1, t2, f2]
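A sketch of the pairing step (the target-zone bounds `dt_max`, `df_max` and the `fanout` limit are assumed parameters, chosen here only to respect the 7-bit Δt and 8-bit Δf budget of the hash key):

```python
def pair_landmarks(peaks, dt_max=63, df_max=127, fanout=3):
    """Pair salient peaks into landmarks [t1, f1, t2, f2].

    peaks: iterable of (t, f) in frame/bin units.
    Each leading peak is paired with up to `fanout` later peaks inside
    its target zone (0 < t2 - t1 <= dt_max, |f2 - f1| <= df_max).
    """
    peaks = sorted(peaks)
    landmarks = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            if t2 - t1 > dt_max:
                break                        # peaks are time-sorted
            if t2 > t1 and abs(f2 - f1) <= df_max:
                landmarks.append((t1, f1, t2, f2))
                paired += 1
                if paired == fanout:
                    break
    return landmarks
```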

24 Salient Peaks and Landmarks
Peak picking after forward smoothing; matched landmarks shown in green
(Source: Dan Ellis)

25 Time Skew
Time skew: query frame boundaries are out of sync with the reference frame boundaries
Solution: increase the frame size
(Figure: repeated LM extraction over time, reference frames vs. query frames, showing the time skew)

26 To Avoid Time Skew
To avoid time skew, query landmarks are extracted at various time shifts
Example: 4 shifts with step = hop/4; the four LM sets are merged (union & unique) into the query landmark set
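The shift-and-merge procedure might look like this (a sketch; `extract` stands for any landmark-extraction routine and is an assumed interface, not the toolbox API):

```python
def shifted_landmark_set(samples, extract, hop=512, n_shifts=4):
    """Extract query landmarks at several time shifts and merge them.

    samples: the query waveform (any sliceable sequence).
    extract: function mapping samples -> list of (t1, f1, t2, f2).
    Shifting the start by hop/n_shifts steps reduces the frame-boundary
    time skew between query and reference landmarks.
    """
    step = hop // n_shifts
    merged = set()
    for k in range(n_shifts):
        for lm in extract(samples[k * step:]):
            merged.add(tuple(lm))        # union & unique
    return sorted(merged)
```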

27 Landmarks for Hash Table Access
Convert each landmark to a hash key (and value)
  Landmark from the database → hash table creation
  Landmark from the query → hash table lookup
Use f1, f2, and t2 − t1 to generate the hash key for hash table lookup
Use t1 to find matched (time-consistent) landmarks
For a landmark [t1, f1, t2, f2] of a given songId:
  24-bit hash key = f1 (9 bits) + Δf (8 bits) + Δt (7 bits)
  32-bit hash value = songId (18 bits) + t1 (14 bits)
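The bit layout above can be expressed directly (a sketch that follows the stated field widths; masking out-of-range values is an assumption about how overflow is handled):

```python
def pack_key(f1, f2, t1, t2):
    """24-bit hash key: f1 (9 bits) | Δf (8 bits) | Δt (7 bits)."""
    df = (f2 - f1) & 0xFF          # 8-bit frequency difference
    dt = (t2 - t1) & 0x7F          # 7-bit time difference
    return (f1 & 0x1FF) << 15 | df << 7 | dt

def pack_value(song_id, t1):
    """32-bit hash value: songId (18 bits) | t1 (14 bits)."""
    return (song_id & 0x3FFFF) << 14 | (t1 & 0x3FFF)

def unpack_value(v):
    """Recover (songId, t1) from a packed 32-bit hash value."""
    return v >> 14, v & 0x3FFF
```

Note that t1 appears only in the value, never in the key: the key is built from differences, so the same landmark shape matches at any position in the song.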

28 Parameters in Our Implementation
Landmarks
  Sample rate = 8000 Hz
  Frame size = 1024, overlap = 512
  Frame rate = 8000/512 ≈ 15.6 frames/sec
  Landmark rate ≈ 400 LM/sec
Hash table
  Hash key size = 2^24 = 16.78M
  Max song ID = 2^18 = 262K
  Max start time = 2^14/frameRate ≈ 17.5 minutes
Our implementation is based on Dan Ellis’ work: Robust Landmark-Based Audio Fingerprinting

29 Structure of Hash Table
Collision happens when landmarks share the same hash key, i.e., the same [f1, t2, f2] differences

30 Hash Table Lookup
Hash keys computed from the query landmarks index into the hash table (keys 0 to 2^24 − 1); the hash values stored in each bucket are the retrieved landmarks
(Figure: query hash keys, hash table buckets, and the retrieved landmarks)

31 How to Find Query Offset Time?
The offset time of the query can be derived from a retrieved and matched landmark:
  query offset time = database LM start time − query LM start time

32 Find Matched Landmarks
Start-time plot for landmarks
  X axis: start time of the database LM
  Y axis: start time of the query LM
  Query offset time ≈ x − y
Example: a query LM starting at t = 9.5 sec retrieves 3 LMs from the hash table, but only the time-consistent one is matched

33 Find Matched Landmarks
We can determine the offset time by plotting the histogram of start-time differences (x − y); the histogram peak gives the query offset time
(Avery Wang, 2003)
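The histogram step can be sketched as follows (a minimal sketch; `retrieved` holds the (database start time, query start time) pairs of one song's retrieved landmarks):

```python
from collections import Counter

def matched_count(retrieved):
    """Offset time and time-consistent landmark count for one song.

    retrieved: list of (db_t1, query_t1) pairs.
    The query offset is the mode of db_t1 - query_t1; the matched
    landmark count is the height of that histogram bin.
    """
    hist = Counter(db_t - q_t for db_t, q_t in retrieved)
    offset, count = hist.most_common(1)[0]
    return offset, count
```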

34 Matched Landmark Count
To find the matched (time-consistent) landmark count of a song, histogram the offset times of all retrieved landmarks from that song
Example (song 2286):
  Offset time 555±1: count 18 → matched landmark count of song 2286
  Offset time 795±1: count 1
  (other offsets: 1035±1, 2715±1, …)

35 Final Ranking
A common way to obtain the final ranking
  Based on each song’s matched landmark count
  Can also be converted into scores between 0 and 100
Song ID | Matched landmark count | Offset time
2286    | 18                     | 555±1
2746    | 13                     | 5002±1
2255    | 9                      | 1681±1
2033    | 5                      | 2347±1
2019    | 4                      | 527±1
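A sketch of the ranking step (the count-to-score normalization against the best match is one plausible 0-100 mapping, not necessarily the one used here):

```python
def rank_songs(matches, max_score=100):
    """Rank songs by matched landmark count.

    matches: dict song_id -> matched landmark count.
    Returns a list of (song_id, count, score) sorted by count,
    with scores normalised so the best match gets max_score.
    """
    ranked = sorted(matches.items(), key=lambda kv: kv[1], reverse=True)
    best = ranked[0][1] if ranked and ranked[0][1] > 0 else 1
    return [(sid, cnt, round(max_score * cnt / best)) for sid, cnt in ranked]
```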

36 Matched Landmarks vs. Noise
Matched landmarks under increasing noise: Original, Noisy01, Noisy02, Noisy03
Run goLmVsNoise.m in the AFP toolbox to create this example.

37 Optimization Strategies for AFP
Several ways to optimize AFP
  Strategy for query landmark extraction
  Confidence measure
  Incremental retrieval
  Better use of the hash table
  Re-ranking for better performance

38 Strategy for LM Extraction (1)
Goal
  Trade computation for accuracy
Steps
  Construct a classifier to determine whether a 10-sec query is a “hit” or a “miss”
  Increase the landmark counts of “miss” queries for better accuracy
(Flowchart: 10-sec query → classifier → regular LM extraction for “hit” / dense LM extraction for “miss” → AFP engine → retrieved songs)

39 Strategy for LM Extraction (2)
Classifier construction
  Training data: “hit” and “miss” queries
  Classifier: SVM
  Features
    Mean volume
    Standard deviation of volume
    Standard deviation of the absolute sum of the high-order difference
Requirements
  Fast in evaluation
    Simple or readily available features
    Efficient classifier
  Adaptive
    Effective threshold for detecting miss queries
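The three features can be computed per query roughly as follows (a sketch; the frame size, hop, and difference order are assumed values, not the ones used in the experiments):

```python
import numpy as np

def query_features(samples, frame_size=1024, hop=512, diff_order=4):
    """Frame-level features for the hit/miss classifier.

    Returns [mean volume, std of volume, std of the per-frame absolute
    sum of the high-order difference], as listed on the slide.
    """
    x = np.asarray(samples, dtype=float)
    vols, diffs = [], []
    for start in range(0, len(x) - frame_size + 1, hop):
        frame = x[start:start + frame_size]
        vols.append(np.abs(frame).sum())          # frame volume
        d = np.diff(frame, n=diff_order)          # high-order difference
        diffs.append(np.abs(d).sum())
    return [float(np.mean(vols)), float(np.std(vols)), float(np.std(diffs))]
```

These three numbers then feed the SVM that routes each query to regular or dense landmark extraction.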

40 Strategy for LM Extraction (3)
To increase landmarks for “miss” queries:
  Use more time-shifted queries for LM extraction (our test uses 4 shifts vs. 8 shifts)
  Decay the thresholds more rapidly to reveal more salient peaks

41 Strategy for LM Extraction (4)
Song database (44.1 kHz, 16 bits; 1500 songs)
  1000 songs (30 seconds each) from the GTZAN dataset
  500 songs (3~5 minutes each) from our own collection of English/Chinese songs
Datasets: 10-sec clips recorded by mobile phones
  Training data: 1412 clips (1223:189)
  Test data: 1062 clips

42 Strategy for LM Extraction (5)
AFP accuracy vs. computing time

43 Confidence Measure (1)
Confusion matrix (groundtruth rows No/Yes vs. predicted columns No/Yes):
  c00 c01
  c10 c11
Performance indices
  False acceptance rate: FAR = c01 / (c00 + c01)
  False rejection rate: FRR = c10 / (c10 + c11)
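In code (a direct transcription of the two formulas above):

```python
def far_frr(c00, c01, c10, c11):
    """FAR and FRR from a 2x2 confusion matrix with groundtruth rows
    (No, Yes) and predicted columns (No, Yes).

    FAR = c01 / (c00 + c01)  -- "No" queries wrongly accepted
    FRR = c10 / (c10 + c11)  -- "Yes" queries wrongly rejected
    """
    far = c01 / (c00 + c01)
    frr = c10 / (c10 + c11)
    return far, frr
```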

44 Confidence Measure (2)
Factors for confidence measure
  Matched landmark count
  Landmark count
  Salient peak count
How to use these factors
  Take a value of the factor and use it as a threshold
  Normalize the threshold by dividing it by the query duration
  Vary the threshold to identify FAR & FRR

45 Dataset for Confidence Measure
Song database (44.1 kHz, 16 bits)
  1000 songs (30 seconds each) from the GTZAN dataset
  16284 songs (3~5 minutes each) from our own collection of English songs
Datasets: 10-sec clips recorded by mobile phones
  In the database: 1062 clips
  Not in the database: 1412 clips

46 Confidence Measure (3)
DET (Detection Error Tradeoff) curve
Accuracy vs. tolerance (no OOV queries)
  Tolerance of matched landmarks | Accuracy
  ±0 | 79.19%
  ±1 | 79.66%
  ±2 | 79.57%

47 Incremental Retrieval
Goal
  Take additional query input if the confidence measure is not high enough
Implementation issues
  Use only the forward mode for landmark extraction → no. of landmarks ↗ → computation time ↗
  Use statistics of matched landmarks to restrict the number of extracted landmarks for comparison

48 Hash Table Optimization
Possible directions for hash table optimization
  To increase song capacity → 20 bits for songId
    Song capacity = 2^20 = 1M
    Max start time = 2^12/frameRate = 4.37 minutes → longer songs are split into shorter segments
  To increase efficiency → 80/20 rule
    Put the 20% most likely songs in fast memory
    Put the 80% less likely songs in slow memory
  To avoid collision → better hashing strategies

49 Re-ranking for Better Performance
Features that can be used to rank the matched songs
  Matched landmark count
  Matched frequency count 1
  Matched frequency count 2

50 Our AFP Engine
Music database
  260K tracks currently; 1M tracks in the future
Driving forces
  Fundamental issues in computer science (hashing, indexing…)
  Requests from local companies
Methods
  Landmarks as features (Shazam’s method)
  Speedup by GPU
Platform
  Single CPU + 3 GPUs

51 Specs of Our AFP Engine
Platform
  OS: CentOS 6
  CPU: Intel Xeon X5670, six cores, 2.93 GHz
  Memory: 96 GB
Database
  Please refer to this page.

52 Experiments
Corpora
  Database: 2550 tracks
  Test files: 5 mobile-recorded songs chopped into segments of 5, 10, 15, and 20 seconds
  Test songs: stereo, 16 bits
Accuracy test
  5-sec clips: 161/275 = 58.6%
  10-sec clips: 121/136 = 89.0%
  15-sec clips: 88/90 = 97.8%
  20-sec clips: 65/66 = 98.5%
Plots: accuracy vs. duration, computing time vs. duration, accuracy vs. computing time

53 MATLAB Prototype for AFP
Toolboxes
  Audio fingerprinting
  SAP
  Utility
Dataset
  Russian songs
Instructions
  Download the toolboxes
  Modify afpOptSet.m (in the audio fingerprinting toolbox) to add toolbox paths
  Run goDemo.m

54 Demos of Audio Fingerprinting
Commercial apps
  Shazam
  Soundhound
Our demo

55 QBSH vs. AFP
            QBSH                         AFP
Goal        MIR                          MIR
Feature     Pitch (perceptible,          Landmarks (not perceptible,
            small data size)             big data size)
Method      LS                           Matched LM
Database    Harder to collect,           Easier to collect,
            small storage                large storage
Bottleneck  CPU/GPU-bound                I/O-bound

56 Conclusions for AFP
Conclusions
  Landmark-based methods are effective
  Machine learning is indispensable for further improvement
Future work: scale up
  Shazam: 15M tracks in database, 6M tags/day
  Our goal: 1M tracks with a single PC and GPU; 10M tracks with cloud computing over 10 PCs

57 References (I)
Dan Ellis, Robust Landmark-Based Audio Fingerprinting.
Avery Wang (Shazam)
  “An industrial-strength audio search algorithm”, ISMIR, 2003.
  “The Shazam music recognition service”, Comm. ACM 49(8), 44–48, 2006.
J. Haitsma and T. Kalker (Philips)
  “A highly robust audio fingerprinting system”, ISMIR, 2002.
  “A highly robust audio fingerprinting system with an efficient search strategy”, J. New Music Research 32(2), 2003.

58 References (II)
Google
  S. Baluja and M. Covell, “Content fingerprinting using wavelets”, Proc. CVMP, 2006.
  V. Chandrasekhar, M. Sharifi, and D. A. Ross, “Survey and evaluation of audio fingerprinting schemes for mobile query-by-example applications”, ISMIR, 2011.
Y. Ke, D. Hoiem, and R. Sukthankar, “Computer vision for music identification”, CVPR, 2005.
