
Audio Fingerprinting J.-S. Roger Jang ( 張智星 ) MIR Lab, CSIE Dept. National Taiwan University 1.




1 Audio Fingerprinting J.-S. Roger Jang ( 張智星 ) MIR Lab, CSIE Dept. National Taiwan University 1

2 Intro to Audio Fingerprinting (AFP)
- Goal
  - Identify a noisy version of a given audio clip
- Also known as...
  - "Query by exact example" → no "cover versions"
- Can also be used to...
  - Align two different-speed audio clips of the same source
    - Dan Ellis used AFP for aligned annotation of the Beatles dataset

3 AFP Challenges
- Music variations
  - Encoding/compression (MP3 encoding, etc.)
  - Channel variations
    - Speakers & microphones, room acoustics
  - Environmental noise
- Efficiency (6M tags/day for Shazam)
- Database collection (15M tracks for Shazam)

4 AFP Applications
- Commercial applications of AFP
  - Music identification & purchase
  - Royalty assignment (over radio)
  - Identification of TV shows or commercials (over TV)
  - Copyright violation detection (over web)
- Major commercial players
  - Shazam, SoundHound, IntoNow, Viggle...

5 Company: Shazam
- Facts
  - First commercial AFP product
  - Since 2002, UK
- Technology
  - Audio fingerprinting
- Founder
  - Avery Wang (PhD at Stanford, 1994)

6 Company: SoundHound
- Facts
  - First product with multi-modal music search
  - AKA: midomi
- Technologies
  - Audio fingerprinting
  - Query by singing/humming
  - Speech recognition
- Founder
  - Keyvan Mohajer (PhD at Stanford, 2007)

7 Two Stages in AFP
- Offline
  - Feature extraction
  - Hash table construction for songs in the database
  - Inverted indexing
- Online
  - Feature extraction
  - Hash table search
  - Ranked list of the retrieved songs/music

8 Robust Feature Extraction
- Various kinds of features for AFP
  - Invariance along time and frequency
  - Landmarks formed from pairs of local maxima
  - Wavelets
  - ...
- Extensive testing is required to choose the best features

9 Representative Approaches to AFP
- Philips
  - J. Haitsma and T. Kalker, "A highly robust audio fingerprinting system", ISMIR 2002
- Shazam
  - A. Wang, "An industrial-strength audio search algorithm", ISMIR 2003
- Google
  - S. Baluja and M. Covell, "Content fingerprinting using wavelets", Euro. Conf. on Visual Media Production (CVMP), 2006
  - V. Chandrasekhar, M. Sharifi, and D. A. Ross, "Survey and evaluation of audio fingerprinting schemes for mobile query-by-example applications", ISMIR 2011

10 Philips: Thresholding as Features
- Observation
  - The sign of energy differences is robust to various operations
    - Lossy encoding
    - Dynamic range compression
    - Added noise
- Thresholding as features: fingerprint F(t, f) derived from the magnitude spectrum S(t, f) (Source: Dan Ellis)
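The sign-of-energy-differences idea above can be sketched in a few lines of NumPy. This is a reconstruction from the slide's description (differentiate band energies along frequency, then along time, and keep only the sign), not the Philips reference implementation:

```python
import numpy as np

def philips_bits(energy):
    """energy: 2-D array (frames x bands) of band energies E(t, f).
    Returns a boolean fingerprint of shape (frames-1, bands-1):
    bit(t, f) = sign of [E(t,f)-E(t,f+1)] - [E(t-1,f)-E(t-1,f+1)]."""
    # Difference along the frequency axis
    df = energy[:, :-1] - energy[:, 1:]
    # Difference of that along the time axis; keep only the sign
    return (df[1:, :] - df[:-1, :]) > 0
```

With 33 bands this yields a 32-bit sub-fingerprint per frame. Because only signs are kept, scaling the energies by any positive constant (e.g. volume changes) leaves the bits unchanged.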

11 Philips: Thresholding as Features (II)
- Robust to low-bitrate MP3 encoding: the fingerprint after MP3 encoding differs from the original by BER = 0.078
- Sensitive to "frame time difference" → hop size is kept small!

12 Philips: Robustness of Features
- BER of the features after various operations
  - Generally low
  - High for speed and time-scale changes (which are not likely to occur under query by example)

13 Philips: Search Strategies
- Via hashing
- Inverted indexing

14 Shazam’s Method zIdeas yTake advantage of music local structures xFind salient peaks on spectrogram xPair peaks to form landmarks for comparison yEfficient search by hash tables xUse positions of landmarks as hash keys xUse song ID and offset time as hash values xUse time constraints to find matched landmarks

15 Database Preparation
- Compute spectrogram
  - Perform mean subtraction & high-pass filtering
- Detect salient peaks
  - Find initial threshold
  - Update the threshold along time
- Pair salient peaks to form landmarks
  - Define target zone
  - Form landmarks and save them to a hash table

16 Query Match
- Identify landmarks
- Find matched landmarks
  - Retrieve landmarks from the hash table
  - Keep only time-consistent landmarks
- Rank the database items
  - Via matched landmark counts
  - Via other confidence measures

17 Shazam: Landmarks as Features
- Spectrogram → salient peaks of the spectrogram → pair peaks in the target zone to form landmarks
- Landmark: [t1, f1, t2, f2]
- 24-bit hash key:
  - f1: 9 bits
  - Δf = f2 - f1: 8 bits
  - Δt = t2 - t1: 7 bits
- Hash value:
  - Song ID
  - Landmark's start time t1
- (Avery Wang, 2003)
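The 24-bit key layout above can be packed with plain bit shifts. The field order (f1 in the top bits) is an assumption; the slide only specifies the field widths:

```python
def pack_key(f1, f2, t1, t2):
    """Pack a landmark [t1, f1, t2, f2] into the 24-bit hash key:
    f1 (9 bits) | Δf = f2-f1 (8 bits) | Δt = t2-t1 (7 bits)."""
    df = (f2 - f1) & 0xFF   # 8 bits; out-of-range pairs wrap (should be filtered earlier)
    dt = (t2 - t1) & 0x7F   # 7 bits
    return ((f1 & 0x1FF) << 15) | (df << 7) | dt

def unpack_key(key):
    """Recover (f1, Δf, Δt) from a packed 24-bit key."""
    return (key >> 15) & 0x1FF, (key >> 7) & 0xFF, key & 0x7F
```

The key always fits in 2^24 slots, matching the hash table size quoted later in the deck.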

18 How to Find Salient Peaks
- We need peaks that are salient along both the frequency and time axes
  - Frequency axis: Gaussian local smoothing
  - Time axis: decaying threshold over time

19 How to Find the Initial Threshold?
- Goal
  - To suppress neighboring peaks
- Ideas
  - Find the local maxima of the magnitude spectra of the initial 10 frames
  - Superimpose a Gaussian on each local maximum
  - Take the max of all the Gaussians
- Example
  - Based on Bad Romance
  - envelopeGen.m

20 How to Update the Threshold along Time?
- Decay the threshold
- Find local maxima larger than the threshold → salient peaks
- Define the new threshold as the max of the old threshold and the Gaussians passing through the active local maxima
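The threshold initialization and update steps above can be sketched as follows. This is a reconstruction of the slides' procedure, not the envelopeGen.m / landmarkFind01.m code; the decay rate and Gaussian width are assumed values, and the initial threshold is built from the first frame only rather than the first 10:

```python
import numpy as np

def gaussian_envelope(bins, peaks, mags, sigma):
    """Pointwise max of Gaussians centered at the given peak bins."""
    env = np.zeros(len(bins))
    for f, m in zip(peaks, mags):
        env = np.maximum(env, m * np.exp(-0.5 * ((bins - f) / sigma) ** 2))
    return env

def salient_peaks(spec, decay=0.998, sigma=8.0):
    """spec: magnitude spectrogram (bins x frames). Returns [(frame, bin), ...]."""
    n_bins, n_frames = spec.shape
    bins = np.arange(n_bins)
    # Initial threshold: Gaussians on the local maxima of the first frame
    first = spec[:, 0]
    init = [b for b in range(1, n_bins - 1)
            if first[b] > first[b - 1] and first[b] >= first[b + 1]]
    thr = gaussian_envelope(bins, init, first[init], sigma)
    peaks = []
    for t in range(1, n_frames):
        col = spec[:, t]
        # Local maxima that rise above the (decayed) threshold are salient
        hits = [b for b in range(1, n_bins - 1)
                if col[b] > col[b - 1] and col[b] >= col[b + 1] and col[b] > thr[b]]
        peaks.extend((t, b) for b in hits)
        # New threshold: max of the decayed old one and Gaussians through the new peaks
        thr = np.maximum(thr * decay, gaussian_envelope(bins, hits, col[hits], sigma))
    return peaks
```

A strong peak raises the threshold in its frequency neighborhood, so weaker peaks nearby (in frequency and in the following frames) are suppressed, which is exactly the "suppress neighboring peaks" goal of slide 19.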

21 How to Control the Number of Salient Peaks?
- To decrease the number of salient peaks
  - Perform forward and backward sweeps to find peaks that are salient along both directions
  - Use Gaussians with a larger standard deviation
  - ...

22 Time-decaying Thresholds
- landmarkFind01.m
- (Plots: forward and backward sweeps)

23 How to Pair Salient Peaks?
- Each peak is paired with the peaks inside its target zone (figure)
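The pairing step can be sketched like this. The target-zone shape (a window of dt_max frames ahead and ±df_max bins) and the fan-out limit are assumptions for illustration; the slide does not specify the exact zone:

```python
def pair_peaks(peaks, dt_max=63, df_max=31, fanout=3):
    """peaks: list of (t, f) salient peaks. Returns landmarks (t1, f1, t2, f2)
    by pairing each peak with up to `fanout` later peaks in its target zone."""
    peaks = sorted(peaks)
    landmarks = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            if t2 - t1 > dt_max:        # beyond the zone in time: stop
                break
            if t2 > t1 and abs(f2 - f1) <= df_max:
                landmarks.append((t1, f1, t2, f2))
                paired += 1
                if paired == fanout:    # cap the fan-out per anchor peak
                    break
    return landmarks
```

The dt_max and df_max bounds keep Δt and Δf within the 7-bit and 8-bit hash-key fields of slide 17.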

24 Salient Peaks and Landmarks
- Peak picking after forward smoothing
- Matched landmarks (green)
- (Source: Dan Ellis)

25 Time Skew
- Frame boundaries of the query can be out of sync with those of the reference (time skew)
- Solution
  - Increase frame size
  - Repeat LM extraction at shifted positions
- (Figure: reference frames vs. query frames, illustrating the time skew)

26 To Avoid Time Skew
- To avoid time skew, query landmarks are extracted at various time shifts
  - Example: 4 shifts with step = hop/4, giving LM sets 1-4
- Query landmark set = union of the unique landmarks from all shifts
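The shift-and-union trick above is a one-liner around whatever landmark extractor is in use. Here `extract_landmarks` is a placeholder for the pipeline sketched earlier, passed in as a function:

```python
def query_landmarks(samples, extract_landmarks, hop=512, n_shifts=4):
    """Extract landmarks from `n_shifts` copies of the query, each shifted
    by hop/n_shifts samples, and return the union of unique landmarks."""
    step = hop // n_shifts
    all_lms = set()                     # set handles the "union & unique" step
    for k in range(n_shifts):
        all_lms.update(extract_landmarks(samples[k * step:]))
    return sorted(all_lms)
```

One of the shifted copies is guaranteed to be within hop/8 samples of the reference frame grid, which is why this bounds the worst-case skew.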

27 Landmarks for Hash Table Access

28 Parameters in Our Implementation
- Landmarks
  - Sample rate = 8000 Hz
  - Frame size = 1024
  - Overlap = 512
  - Frame rate = 8000/512 = 15.625 frames/sec
  - Landmark rate ≈ 400 LM/sec
- Hash table
  - Hash key size = 2^24 = 16.78M
  - Max song ID = 2^18 = 262K
  - Max start time = 2^14/frameRate ≈ 17.5 minutes
- Our implementation is based on Dan Ellis' work: Robust Landmark-Based Audio Fingerprinting

29 Structure of Hash Table

30 Hash Table Lookup
- Query: hash keys computed from landmarks
- Hash table: hash keys → hash values
- Result: retrieved landmarks

31 How to Find the Query Offset Time?
- The offset time of the query can be derived by comparing database landmarks with query landmarks:
  - query offset time = database LM start time - query LM start time
- Only retrieved landmarks with a consistent offset count as matched

32 Find Matched Landmarks
- Example: a given LM starting at t = 9.5 sec retrieves 3 LMs from the hash table, but only the one consistent with the query offset time is matched

33 Find Matched Landmarks
- We can determine the offset time by plotting the histogram of the start time differences (x - y), where x is the database start time and y is the query start time
- (Figures: start time scatter plot and histogram of start time differences; Avery Wang, 2003)

34 Matched Landmark Count
- To find the matched (time-consistent) landmark count of a song:
  - Group all retrieved landmarks (song ID, offset time, hash value) by song ID
  - For each song, histogram the offset times
  - The matched landmark count is the height of the tallest histogram bin
- (Example in the figure: histogram of offset times of the LMs retrieved for song 2286)
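The group-and-histogram scoring above fits in a few lines. This is a sketch of the idea, assuming the hash lookup yields (song_id, database time, query time) triples:

```python
from collections import Counter, defaultdict

def score_songs(hits):
    """hits: iterable of (song_id, db_time, query_time) for matching hash keys.
    Returns {song_id: matched landmark count}, where the count is the height
    of the tallest bin in that song's histogram of offsets db_time - query_time."""
    offsets = defaultdict(Counter)
    for song_id, db_t, q_t in hits:
        offsets[song_id][db_t - q_t] += 1   # histogram the offset per song
    return {sid: max(c.values()) for sid, c in offsets.items()}
```

Landmarks retrieved by chance scatter across many offsets, while a true match piles up in one bin, so the tallest-bin count is a time-consistency filter as well as a score.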

35 Final Ranking
- A common way to obtain the final ranking
  - Based on each song's matched landmark count
  - Counts can also be converted into scores between 0 and 100

36 Matched Landmarks vs. Noise
- (Panels: original, noisy01, noisy02, noisy03)
- Run goLmVsNoise.m in the AFP toolbox to create this example.

37 Optimization Strategies for AFP
- Several ways to optimize AFP
  - Strategy for query landmark extraction
  - Confidence measure
  - Incremental retrieval
  - Better use of the hash table
  - Re-ranking for better performance

38 Strategy for LM Extraction (1)
- Goal
  - To trade computation for accuracy
- Steps
  1. Construct a classifier to determine whether a query is a "hit" or a "miss".
  2. Increase the landmark counts of "miss" queries for better accuracy.
- Flow: 10-sec query → classifier → dense LM extraction ("miss") or regular LM extraction ("hit") → AFP engine → retrieved songs

39 Strategy for LM Extraction (2)
- Classifier construction
  - Training data: "hit" and "miss" queries
  - Classifier: SVM
  - Features
    - Mean volume
    - Standard deviation of volume
    - Standard deviation of the absolute sum of high-order differences
    - ...
- Requirements
  - Fast in evaluation
    - Simple or readily available features
    - Efficient classifier
  - Adaptive
    - Effective threshold for detecting miss queries

40 Strategy for LM Extraction (3)
- To increase landmarks for "miss" queries
  - Use more time-shifted queries for LM extraction
    - Our test compares 4 shifts vs. 8 shifts
  - Decay the thresholds more rapidly to reveal more salient peaks
  - ...

41 Strategy for LM Extraction (4)
- Song database
  - 44.1 kHz, 16 bits
  - 1500 songs
    - 1000 songs (30 seconds each) from the GTZAN dataset
    - 500 songs (3~5 minutes each) from our own collection of English/Chinese songs
- Datasets
  - 10-sec clips recorded by mobile phones
  - Training data: 1412 clips (1223:189)
  - Test data: 1062 clips

42 Strategy for LM Extraction (5)
- (Plot: AFP accuracy vs. computing time)

43 Confidence Measure (1)
- Confusion matrix:

                  Predicted
                  No      Yes
  Groundtruth No  C00     C01
             Yes  C10     C11

44 Confidence Measure (2)
- Factors for the confidence measure
  - Matched landmark count
  - Landmark count
  - Salient peak count
  - ...
- How to use these factors
  - Take a value of the factor and use it as a threshold
  - Normalize the threshold by dividing it by the query duration
  - Vary the threshold to identify FAR & FRR
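The threshold sweep described above can be sketched as follows; this is an illustrative implementation, assuming one (already duration-normalized) confidence score per query and a ground-truth flag for whether the query is actually in the database:

```python
def far_frr(scores, in_db, thresholds):
    """scores: one confidence value per query (e.g. matched LM count / duration).
    in_db: True if the query's song is in the database.
    Returns [(threshold, FAR, FRR), ...] where FAR is the fraction of
    out-of-DB queries accepted and FRR the fraction of in-DB queries rejected."""
    n_out = max(1, sum(1 for g in in_db if not g))
    n_in = max(1, sum(1 for g in in_db if g))
    result = []
    for th in thresholds:
        far = sum(1 for s, g in zip(scores, in_db) if not g and s >= th) / n_out
        frr = sum(1 for s, g in zip(scores, in_db) if g and s < th) / n_in
        result.append((th, far, frr))
    return result
```

Plotting FAR against FRR over the swept thresholds gives the DET curve shown on the next slides.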

45 Dataset for Confidence Measure
- Song database
  - 44.1 kHz, 16 bits
  - 1000 songs (30 seconds each) from the GTZAN dataset
  - 16284 songs (3~5 minutes each) from our own collection of English songs
- Datasets
  - 10-sec clips recorded by mobile phones
  - In the database: 1062 clips
  - Not in the database: 1412 clips

46 Confidence Measure (3)
- DET (Detection Error Tradeoff) curve
- Accuracy vs. tolerance of matched landmarks (no OOV queries): 79.19%, 79.66%, 79.57%

47 Incremental Retrieval
- Goal
  - Take additional query input if the confidence measure is not high enough
- Implementation issues
  - Use only the forward mode for landmark extraction → number of landmarks ↗ → computation time ↗
  - Use statistics of matched landmarks to restrict the number of extracted landmarks for comparison

48 Hash Table Optimization
- Possible directions for hash table optimization
  - To increase song capacity → 20 bits for songId
    - Song capacity = 2^20 = 1M
    - Max start time = 2^12/frameRate ≈ 4.37 minutes
    - Longer songs are split into shorter segments
  - To increase efficiency → 80/20 rule
    - Put the 20% most likely songs in fast memory
    - Put the 80% less likely songs in slow memory
  - To avoid collisions → better hashing strategies

49 Re-ranking for Better Performance
- Features that can be used to re-rank the matched songs
  - Matched landmark count
  - Matched frequency count 1
  - Matched frequency count 2
  - ...

50 Our AFP Engine
- Music database
  - 260K tracks currently
  - 1M tracks in the future
- Driving forces
  - Fundamental issues in computer science (hashing, indexing...)
  - Requests from local companies
- Methods
  - Landmarks as features (Shazam's method)
  - Speedup by GPU
- Platform
  - Single CPU + 3 GPUs

51 Specs of Our AFP Engine
- Platform
  - OS: CentOS 6
  - CPU: Intel Xeon X5670, six cores, 2.93 GHz
  - Memory: 96 GB
- Database
  - Please refer to this page.

52 Experiments
- Corpora
  - Database: 2550 tracks
  - Test files: 5 mobile-recorded songs chopped into segments of 5, 10, 15, and 20 seconds
- Accuracy test
  - 5-sec clips: 161/275 = 58.6%
  - 10-sec clips: 121/136 = 89.0%
  - 15-sec clips: 88/90 = 97.8%
  - 20-sec clips: 65/66 = 98.5%
- (Plots: accuracy vs. duration, computing time vs. duration, accuracy vs. computing time)

53 MATLAB Prototype for AFP
- Toolboxes
  - Audio fingerprinting
  - SAP
  - Utility
- Dataset
  - Russian songs
- Instructions
  - Download the toolboxes
  - Modify afpOptSet.m (in the audio fingerprinting toolbox) to add the toolbox paths
  - Run goDemo.m

54 Demos of Audio Fingerprinting
- Commercial apps
  - Shazam
  - SoundHound
- Our demo
  - http://mirlab.org/demo/audioFingerprinting

55 QBSH vs. AFP
- QBSH
  - Goal: MIR
  - Feature: pitch (perceptible; small data size)
  - Method: LS
  - Database: harder to collect; small storage
  - Bottleneck: CPU/GPU-bound
- AFP
  - Goal: MIR
  - Features: landmarks (not perceptible; big data size)
  - Method: matched LMs
  - Database: easier to collect; large storage
  - Bottleneck: I/O-bound

56 Conclusions for AFP
- Conclusions
  - Landmark-based methods are effective
  - Machine learning is indispensable for further improvement
- Future work: scale up
  - Shazam: 15M tracks in database, 6M tags/day
  - Our goal:
    - 1M tracks with a single PC and GPU
    - 10M tracks with cloud computing on 10 PCs

57 References (I)
- Dan Ellis, Robust Landmark-Based Audio Fingerprinting
- Avery Wang (Shazam)
  - "An Industrial-Strength Audio Search Algorithm", ISMIR, 2003
  - "The Shazam music recognition service", Comm. ACM 49(8), 44-48, 2006
- J. Haitsma and T. Kalker (Philips)
  - "A highly robust audio fingerprinting system", ISMIR, 2002
  - "A highly robust audio fingerprinting system with an efficient search strategy", J. New Music Research 32(2), 2003

58 References (II)
- Google
  - S. Baluja and M. Covell, "Content Fingerprinting Using Wavelets", Proc. CVMP, 2006
  - V. Chandrasekhar, M. Sharifi, and D. A. Ross, "Survey and Evaluation of Audio Fingerprinting Schemes for Mobile Query-by-Example Applications", ISMIR, 2011
- Y. Ke, D. Hoiem, and R. Sukthankar, "Computer Vision for Music Identification", CVPR, 2005

