Two Paradigms for Music IR: Query by Singing/Humming and Audio Fingerprinting
J.-S. Roger Jang (張智星), Multimedia Information Retrieval Lab

-1- Two Paradigms for Music IR: Query by Singing/Humming and Audio Fingerprinting
J.-S. Roger Jang (張智星)
Multimedia Information Retrieval Lab, CS Dept., Tsing Hua Univ., Taiwan
http://mirlab.org/jang
2015/9/15

-2- Outline
* Introduction to MIR
* QBSH (query by singing/humming)
  - Intro, demos, conclusions
* AFP (audio fingerprinting)
  - Intro, demos, conclusions

-3- Introduction to QBSH
* QBSH: Query by Singing/Humming
  - Input: Singing or humming from microphone
  - Output: A ranking list retrieved from the song database
* Progression
  - First paper: Around 1994
  - Extensive studies since 2001
  - State of the art: QBSH tasks at ISMIR/MIREX

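-4- The QBSH Workflow
* Preprocessing:
  - Collect single-track reference melodies (usually MIDI files)
  - Convert them into an intermediate format suitable for comparison
* Online processing:
  - Convert the user's audio input into a pitch vector
  - Convert the pitch vector into notes (optional)
  - Compare against the reference melodies
  - List the ranking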

-5- Pitch Tracking for QBSH
* Two categories of pitch tracking algorithms
  - Time domain
    + ACF (Autocorrelation function)
    + AMDF (Average magnitude difference function)
    + SIFT (Simplified inverse filter tracking)
  - Frequency domain
    + Harmonic product spectrum method
    + Cepstrum method

-6- Frame Blocking for Pitch Tracking
* Frame size = 256 points
* Overlap = 84 points
* Frame rate = 11025/(256-84) ≈ 64 pitch points/sec
[Figure: zoomed-in waveform with a frame and its overlap marked]
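The frame-rate arithmetic on this slide can be checked with a short sketch. The function names `frame_blocking` and `frame_rate` are illustrative, not from the slides, and the input is a silent dummy signal of one second at 11025 Hz.

```python
def frame_blocking(signal, frame_size=256, overlap=84):
    """Split a sequence of samples into overlapping frames."""
    hop = frame_size - overlap  # samples advanced from one frame to the next
    if hop <= 0:
        raise ValueError("overlap must be smaller than frame_size")
    frames = []
    start = 0
    # Only full frames are kept; trailing samples are dropped.
    while start + frame_size <= len(signal):
        frames.append(signal[start:start + frame_size])
        start += hop
    return frames


def frame_rate(sample_rate, frame_size, overlap):
    """Number of frames (pitch points) produced per second."""
    return sample_rate / (frame_size - overlap)


one_second = [0.0] * 11025  # dummy signal, 1 second at 11025 Hz
frames = frame_blocking(one_second)
print(len(frames), round(frame_rate(11025, 256, 84), 1))
```

With a hop of 256 - 84 = 172 samples, the rate is 11025/172, slightly above 64 frames per second.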

-7- ACF: Auto-correlation Function
* Frame: s(i)
* Shifted frame: s(i+τ), e.g. τ = 30
* acf(30) = inner product of the overlapping part of s(i) and s(i+30)
* The lag τ at which acf(τ) peaks gives the pitch period
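A minimal sketch of ACF-based pitch estimation for a single frame follows. It assumes the unnormalized ACF described on the slide and restricts the lag search to the 82 Hz to 1047 Hz range quoted later in the deck; `estimate_pitch` and the synthetic test frame are illustrative, not the author's code.

```python
import math


def acf(frame, lag):
    """Inner product of the frame with itself shifted by `lag` samples."""
    return sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))


def estimate_pitch(frame, sample_rate, f_min=82.0, f_max=1047.0):
    """Return the pitch (Hz) at the ACF peak within the lag range for [f_min, f_max]."""
    min_lag = max(1, int(sample_rate / f_max))
    max_lag = min(len(frame) - 1, int(sample_rate / f_min))
    best_lag = max(range(min_lag, max_lag + 1), key=lambda lag: acf(frame, lag))
    return sample_rate / best_lag  # pitch period in samples -> frequency


# Synthetic frame of roughly 32 ms (352 samples) holding a sine with a
# period of 50 samples, i.e. 11025 / 50 = 220.5 Hz.
fs = 11025
frame = [math.sin(2 * math.pi * i / 50) for i in range(352)]
print(estimate_pitch(frame, fs))
```

Real recordings usually need extra steps (volume thresholding, smoothing of the pitch contour) that this sketch leaves out.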

-8- Pitch Tracking via ACF
* Specs
  - Sample rate = 11025 Hz
  - Frame size = 32 ms
  - Overlap = 0
  - Frame rate = 31.25 frames/sec
* Playback
  - soo.wav
  - sooPitch.wav

-9- Frequency to Semitone Conversion
* Semitone: a musical pitch scale referenced to A440 (A4 = 440 Hz)
* Reasonable pitch range:
  - E2 - C6
  - 82 Hz - 1047 Hz (MIDI note numbers 40 - 84)
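The conversion formula itself is not in the transcript. A sketch using the standard MIDI convention (A440 maps to note number 69, 12 semitones per octave) is given below; treating this as the slide's formula is an assumption.

```python
import math


def hz_to_semitone(freq_hz):
    """Convert frequency in Hz to a (fractional) MIDI-style semitone number."""
    return 69 + 12 * math.log2(freq_hz / 440.0)


def semitone_to_hz(semitone):
    """Inverse conversion: semitone number back to frequency in Hz."""
    return 440.0 * 2 ** ((semitone - 69) / 12)


print(hz_to_semitone(440.0))          # 69.0
print(round(hz_to_semitone(82.0)))    # E2
print(round(hz_to_semitone(1047.0)))  # C6
```

Working in semitones makes key transposition a simple additive shift of the pitch vector, which the comparison methods later in the deck rely on.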

-10- Typical Result of Pitch Tracking
[Figure: pitch tracking via autocorrelation for the song 茉莉花 (Jasmine Flower)]

-11- Comparison of Pitch Vectors
[Figure: pitch vectors overlaid for comparison; the yellow line is the target pitch vector]

-12- Comparison Methods of QBSH
* Categories of approaches to QBSH
  - Histogram/statistics-based
  - Note vs. note
    + Edit distance
  - Frame vs. note
    + HMM
  - Frame vs. frame
    + Linear scaling, DTW, recursive alignment

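-13- Linear Scaling (LS)
* Concept
  - Scale the query linearly to match the candidates
* Example:
[Figure: a query pitch vector stretched or compressed by several scaling factors against a candidate]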

-14- Linear Scaling (II)
* Strength
  - One-shot for dealing with key transposition
  - Efficient and effective
  - Indexing methods available
* Weakness
  - Cannot deal with non-uniform tempo variations
* Typical mapping path
[Figure: typical LS mapping path]

-15- Linear Scaling (III)
* Distance function for LS
  - Normalized L1-norm
  - Normalized L2-norm
* Rest handling
  - Extend previous non-zero note
* Alignment example
[Figure: LS alignment example]
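The three LS slides can be sketched as follows. This is an illustration, not the MIRACLE implementation. The set of scaling factors, linear-interpolation resampling, and the mean shift used for one-shot key transposition are all assumptions. Pitch values are in semitones with 0 marking a rest, and the query is compared against the beginning of the target.

```python
def fill_rests(pitch):
    """Rest handling: replace each rest (0) with the previous non-zero pitch."""
    out, last = [], 0.0
    for p in pitch:
        if p != 0:
            last = p
        out.append(last)
    return out


def resample(pitch, new_len):
    """Linearly interpolate a pitch vector (length >= 2) to new_len >= 2 points."""
    step = (len(pitch) - 1) / (new_len - 1)
    out = []
    for k in range(new_len):
        pos = k * step
        lo = min(int(pos), len(pitch) - 2)
        frac = pos - lo
        out.append(pitch[lo] * (1 - frac) + pitch[lo + 1] * frac)
    return out


def ls_distance(query, target, factors=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0)):
    """Smallest normalized L1 distance over all scaling factors."""
    query, target = fill_rests(query), fill_rests(target)
    best = float("inf")
    for factor in factors:
        n = int(round(len(query) * factor))
        if n < 2 or n > len(target):
            continue  # scaled query would not fit inside the target
        scaled = resample(query, n)
        segment = target[:n]
        # One-shot key transposition: shift the query so that both means agree.
        shift = sum(segment) / n - sum(scaled) / n
        dist = sum(abs(s + shift - t) for s, t in zip(scaled, segment)) / n
        best = min(best, dist)
    return best


target = [60, 60, 62, 62, 64, 64, 65, 65]
query = [p + 5 for p in target]  # same melody sung 5 semitones higher
print(ls_distance(query, target))
```

Because every scaling factor stretches the whole query uniformly, a singer who speeds up halfway through cannot be matched well, which is the weakness noted on slide 14.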

-16- Dynamic Time Warping (DTW)
* Goal:
  - Allow comparison with high tolerance to tempo variation
* Characteristics:
  - Robust to irregular tempo variations
  - Key transposition is handled by trial and error
  - Computationally expensive
  - Does not satisfy the triangle inequality
  - Some indexing algorithms do exist
* Top-ranked method for task 2 of QBSH at MIREX 2006

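-17- Dynamic Time Warping: Type 1
* t: input pitch vector (8 sec, 128 points)
* r: reference pitch vector
* Local paths: 27-45-63 degrees
* DTW recurrence: D(i, j) = d(t(i), r(j)) + min{ D(i-1, j-2), D(i-1, j-1), D(i-2, j-1) }, where d is the local distance between two pitch values
[Figure: DTW grid with indices i (into t) and j (into r), showing the three local paths into cell (i, j)]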

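-18- Dynamic Time Warping: Type 2
* t: input pitch vector (8 sec, 128 points)
* r: reference pitch vector
* Local paths: 0-45-90 degrees
* DTW recurrence: D(i, j) = d(t(i), r(j)) + min{ D(i-1, j), D(i-1, j-1), D(i, j-1) }, where d is the local distance between two pitch values
[Figure: DTW grid with indices i (into t) and j (into r), showing the three local paths into cell (i, j)]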

-19- Local Path Constraints
* Type 1: 27-45-63 local paths
* Type 2: 0-45-90 local paths
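Both local path constraints can be expressed with one dynamic-programming routine that takes the allowed predecessor steps as a parameter. The sketch below assumes absolute difference as the local distance and anchors the path at both ends. The step tables are derived from the stated angles, not copied from the slides.

```python
INF = float("inf")

# Each step (di, dj) means cell (i, j) may be reached from (i - di, j - dj).
TYPE1_STEPS = ((1, 2), (1, 1), (2, 1))  # 27-45-63 degree local paths
TYPE2_STEPS = ((1, 0), (1, 1), (0, 1))  # 0-45-90 degree local paths


def dtw(t, r, steps):
    """DTW distance between t and r, with the path anchored at both ends."""
    m, n = len(t), len(r)
    D = [[INF] * n for _ in range(m)]
    D[0][0] = abs(t[0] - r[0])
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            best = INF
            for di, dj in steps:
                pi, pj = i - di, j - dj
                if pi >= 0 and pj >= 0:
                    best = min(best, D[pi][pj])
            if best < INF:  # cell is reachable under this constraint
                D[i][j] = abs(t[i] - r[j]) + best
    return D[m - 1][n - 1]


t = [60, 62, 64, 65]
r7 = [60, 60, 62, 62, 64, 64, 65]
r8 = [60, 60, 62, 62, 64, 64, 65, 65]
print(dtw(t, r7, TYPE1_STEPS), dtw(t, r8, TYPE2_STEPS), dtw(t, r8, TYPE1_STEPS))
```

Type 1 limits the tempo ratio to between 1/2 and 2, so `t` cannot reach the end of `r8` (more than twice as long minus one) and the distance stays infinite. Type 2 has no such limit.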

-20- DTW Path of “Match Beginning”

-21- DTW Path of “Match Anywhere”

-22- DTW Path of “Match Anywhere”
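The slides show the two path shapes only as figures. One plausible reading, assumed here, is that "match beginning" fixes the start of the path at the first reference frame and leaves the end free, while "match anywhere" also lets the path start at any reference frame. The sketch uses Type 2 local paths and ignores key transposition.

```python
INF = float("inf")


def dtw_query(t, r, match_anywhere=False):
    """DTW cost of aligning query t against reference r with a free end point.

    match_anywhere=False: the path must start at r[0] ("match beginning").
    match_anywhere=True:  the path may start at any r[j].
    """
    m, n = len(t), len(r)
    D = [[INF] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            cost = abs(t[i] - r[j])
            if i == 0:
                if j == 0 or match_anywhere:
                    D[i][j] = cost  # a path may begin here
                else:
                    D[i][j] = cost + D[i][j - 1]
            else:
                prev = D[i - 1][j]
                if j > 0:
                    prev = min(prev, D[i - 1][j - 1], D[i][j - 1])
                D[i][j] = cost + prev
    return min(D[m - 1])  # free end: best cell in the last query row


r = [60, 60, 62, 64, 65, 67, 67, 69]
print(dtw_query([60, 62, 64], r))                       # sung from the start
print(dtw_query([64, 65, 67], r, match_anywhere=True))  # sung from the middle
print(dtw_query([64, 65, 67], r))                       # forced to the start
```

Match anywhere is more flexible for users but enlarges the search space, which is one reason retrieval efficiency becomes a concern on the following slides.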

-23- Challenges in QBSH Systems
* Song database preparation
  - MIDIs, singing clips, or audio music
* Reliable pitch tracking for acoustic input
  - Input from mobile devices or noisy karaoke bar
* Efficient/effective retrieval
  - Karaoke machine: ~10,000 songs
  - Internet music search engine: ~500,000,000 songs

-25- Goal and Approach
* Goal: To retrieve songs effectively within a given response time, say 5 seconds or so
* Our strategy
  - Multi-stage progressive filtering
  - Indexing for different comparison methods
  - Repeating pattern identification
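Multi-stage progressive filtering can be sketched as a chain of increasingly expensive distance functions, each keeping only its best candidates for the next stage. The two distance functions, the keep counts, and the toy database below are hypothetical; the slides do not say which methods MIRACLE uses at each stage.

```python
def l1_after_mean_removal(a, b):
    """Normalized L1 distance after removing each vector's mean (key shift)."""
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum(abs((x - mean_a) - (y - mean_b)) for x, y in zip(a, b)) / n


def coarse_distance(query, target):
    """Cheap first stage: compare only every 4th pitch point."""
    return l1_after_mean_removal(query[::4], target[::4])


def fine_distance(query, target):
    """More expensive later stage: compare every pitch point."""
    return l1_after_mean_removal(query, target)


def progressive_filter(query, database, stages):
    """stages: list of (distance_fn, keep_count), cheapest first."""
    candidates = list(database.items())
    for distance_fn, keep_count in stages:
        ranked = sorted(candidates, key=lambda item: distance_fn(query, item[1]))
        candidates = ranked[:keep_count]  # survivors go on to the next stage
    return [song_id for song_id, _ in candidates]


database = {
    "song_a": [60, 60, 62, 62, 64, 64, 65, 65],
    "song_b": [72, 71, 69, 67, 65, 64, 62, 60],
    "song_c": [60, 60, 60, 60, 67, 67, 67, 67],
}
query = [p + 3 for p in database["song_a"]]  # song_a sung 3 semitones higher
print(progressive_filter(query, database, [(coarse_distance, 2), (fine_distance, 1)]))
```

The trade-off is that an early stage that is too aggressive can discard the correct song before the accurate method ever sees it, so the keep counts have to be tuned against the response-time budget.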

-26- MIRACLE
* MIRACLE
  - Music Information Retrieval Acoustically via Clustered and paralleL Engines
* Database (~13,000 songs)
  - MIDI files
  - Solo vocals (<100)
  - Melody extracted from polyphonic music (<100)
* Comparison methods
  - Linear scaling
  - Dynamic time warping
* Top-10 accuracy
  - 70~75%
* Platform
  - Single CPU + GPU

-27- MIRACLE Before Oct. 2011
* Client-server distributed computing
* Cloud computing via clustered PCs
* Database size: ~12,000
[Figure: clients (PC, PDA/smartphone, cellular) send a pitch vector as the request to a master server, which dispatches to clustered slave servers and returns the search result as the response]

-28- Current MIRACLE
* Single server with GPU
  - NVIDIA 560 Ti, 384 cores (speedup factor = 66)
* Database size: ~13,000
[Figure: clients (PC, PDA/smartphone, cellular) send a pitch vector as the request to a single master server, which returns the search result as the response]

-29- MIRACLE in the Future
* Multi-modal retrieval
  - Singing, humming, speech, audio, tapping…
[Figure: clients (PC, PDA/smartphone, cellular) send a feature vector as the request to a master server, which dispatches to clustered slave servers and returns the search result as the response]

-30- Outlook of MIRACLE
* Web version
* Stand-alone version

-31- QBSH for Other Platforms
* Embedded systems
  - Karaoke machines
* Smartphones
  - iPhone/iPad
  - Android phone
* Toys

-32- Returned Results
* Typical results of MIRACLE
[Figure: list of songs returned by MIRACLE for a query]

-33- QBSH Demo
* Demo page of MIR Lab:
  - http://mirlab.org/mir_products.asp
* MIRACLE demo:
  - http://mirlab.org/demo/miracle
* Existing commercial QBSH systems
  - www.midomi.com
  - www.soundhound.com

-34- To Make QBSH More Efficient
* Algorithms
  - Indexing of LS/DTW
  - Progressive filtering
* New platforms
  - GPU (10 times faster for QBSH!)
  - Grid/clustered computing
  - Multi-core platforms

-35- Conclusions for QBSH
* QBSH
  - Fun and interesting way to retrieve music
  - Can be extended to singing scoring
  - Commercial applications are maturing
* Challenges
  - How to deal with massive music databases?
  - How to extract melody from audio music?

-36- Audio Fingerprinting (AFP)
* Goal
  - Identify a noisy version of a given audio clip (no "cover versions")
* Technical barriers
  - Robustness
  - Efficiency (6M tags/day for Shazam)
  - Database collection (15M tracks for Shazam)
* Applications
  - Song purchase
  - Royalty assignment (over radio)
  - Confirmation of commercials (over TV)
  - Copyright violation (over web)
  - TV program ID

-37- Company: Shazam
* Facts
  - First commercial product of audio fingerprinting
  - Since 2002, UK
* Technology
  - Audio fingerprinting
* Founder
  - Avery Wang (PhD at Stanford, 1994)

-38- Company: Soundhound
* Facts
  - First product with multi-modal music search
  - AKA: midomi
* Technologies
  - Audio fingerprinting
  - Query by singing/humming
  - Speech recognition
* Founder
  - Keyvan Mohajer (PhD at Stanford, 2007)

-39- Two Stages in AFP
* Offline: Database construction
  - Robust feature extraction (audio fingerprinting)
  - Hash table construction
  - Inverted indexing
* Online: Application
  - Robust feature extraction
  - Hash table search
  - Ranked list of the retrieved songs/music
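The offline and online stages can be sketched with a dictionary acting as the hash table and inverted index. Feature extraction is assumed to have already produced `(hash_key, time)` pairs per song. The toy keys, song IDs, and the plain vote count used for ranking are illustrative only.

```python
from collections import Counter, defaultdict


def build_index(fingerprints):
    """Offline stage: map each hash key to every (song_id, time) where it occurs."""
    index = defaultdict(list)
    for song_id, hashes in fingerprints.items():
        for hash_key, time in hashes:
            index[hash_key].append((song_id, time))
    return index


def search(index, query_hashes):
    """Online stage: look up each query hash and rank songs by number of hits."""
    votes = Counter()
    for hash_key, _ in query_hashes:
        for song_id, _ in index.get(hash_key, ()):
            votes[song_id] += 1
    return votes.most_common()  # ranked list of (song_id, hit_count)


fingerprints = {
    "song_a": [(101, 0), (202, 3), (303, 7), (404, 9)],
    "song_b": [(202, 1), (505, 4), (606, 8)],
}
index = build_index(fingerprints)
print(search(index, [(202, 0), (303, 4), (404, 6)]))
```

Counting raw hits is the simplest possible ranking; it does not yet check that the matched hashes occur in a consistent time order.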

-40- Robust Feature Extraction
* Various kinds of features for AFP
  - Invariance along time and frequency
  - Landmark of a pair of local maxima
  - Wavelets
  - …
* Extensive tests are required to choose the best features

-41- Representative Approaches to AFP
* Philips
  - J. Haitsma and T. Kalker, "A highly robust audio fingerprinting system", ISMIR 2002.
* Shazam
  - A. Wang, "An industrial-strength audio search algorithm", ISMIR 2003.
* Google
  - S. Baluja and M. Covell, "Content fingerprinting using wavelets", Euro. Conf. on Visual Media Production, 2006.
  - V. Chandrasekhar, M. Sharifi, and D. A. Ross, "Survey and evaluation of audio fingerprinting schemes for mobile query-by-example applications", ISMIR 2011.

-42- Philips: Thresholding as Features
* Observation
  - The sign of energy differences is robust to various operations
    + Lossy encoding
    + Range compression
    + Added noise
* Thresholding as features
[Figure: magnitude spectrum S(t, f) and the resulting fingerprint F(t, f). Source: Dan Ellis]
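The thresholding step can be sketched as below, following the bit definition in the Haitsma and Kalker paper cited on slide 41: a bit is 1 when the energy difference between adjacent bands increases from the previous frame to the current one. The input is assumed to be a precomputed matrix of band energies indexed as `energy[t][band]`. The band layout and framing are outside this sketch.

```python
def fingerprint_bits(energy):
    """Derive fingerprint bits F(t, f) from band energies E(t, f).

    F(t, f) = 1 if (E(t, f) - E(t, f+1)) - (E(t-1, f) - E(t-1, f+1)) > 0, else 0.
    """
    bits = []
    for t in range(1, len(energy)):
        row = []
        for f in range(len(energy[t]) - 1):
            diff = (energy[t][f] - energy[t][f + 1]) - (
                energy[t - 1][f] - energy[t - 1][f + 1]
            )
            row.append(1 if diff > 0 else 0)
        bits.append(row)
    return bits


energy = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 5.0],
]
louder = [[2 * e for e in frame] for frame in energy]  # same audio, doubled gain
print(fingerprint_bits(energy))
print(fingerprint_bits(energy) == fingerprint_bits(louder))
```

Only the sign of the difference is kept, so a uniform gain change leaves the bits untouched, which illustrates the robustness claimed on the slide.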

-43- Philips: Thresholding as Features (II)
* Robust to low-bitrate MP3 encoding (see the figure)
* Sensitive to "frame time difference" → hop size is kept small!
[Figure: original fingerprint vs. fingerprint after MP3 encoding, BER = 0.078]

-44- Philips: Robustness of Features
* BER of the features after various operations
  - Generally low
  - High for speed and time-scale changes (which are not likely to occur under query by example)
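BER (bit error rate) here is the fraction of fingerprint bits that differ between two versions of the same audio. A minimal sketch, assuming both fingerprints are equally sized lists of bit rows:

```python
def bit_error_rate(bits_a, bits_b):
    """Fraction of positions at which two bit matrices differ."""
    total = 0
    errors = 0
    for row_a, row_b in zip(bits_a, bits_b):
        for a, b in zip(row_a, row_b):
            total += 1
            errors += a != b  # True counts as 1
    return errors / total


original = [[1, 0, 1, 1], [0, 0, 1, 0]]
degraded = [[1, 1, 1, 0], [0, 0, 1, 0]]
print(bit_error_rate(original, degraded))  # 2 of 8 bits differ -> 0.25
```

Two unrelated clips would be expected to give a BER near 0.5, so a value such as the 0.078 quoted for the MP3 example indicates a strong match.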

-45- Philips: Search Strategies
* Via hashing
* Inverted indexing

-46- Shazam: Landmarks as Features
* Spectrogram → local peaks of spectrogram → pair peaks to form landmarks
* Landmark: [t1, f1, t2, f2]
* 20-bit hash key:
  - f1: 8 bits
  - Δf = f2-f1: 6 bits
  - Δt = t2-t1: 6 bits
* Hash value: song ID and offset time
(Source: Dan Ellis)
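Packing a landmark into the 20-bit key can be sketched as follows. The bit widths come from the slide. The field order (f1 in the high bits) and the wrap-around encoding of a negative Δf are assumptions made for illustration.

```python
def pack_landmark(t1, f1, t2, f2):
    """Pack landmark [t1, f1, t2, f2] into a 20-bit key: f1 (8) | df (6) | dt (6)."""
    df = f2 - f1
    dt = t2 - t1
    if not (0 <= f1 < 256 and -32 <= df < 32 and 0 <= dt < 64):
        raise ValueError("landmark out of range for a 20-bit key")
    # df may be negative, so it is stored modulo 64 (assumed encoding).
    return (f1 << 12) | ((df & 0x3F) << 6) | dt


def unpack_landmark(key):
    """Recover (f1, df, dt) from a 20-bit key produced by pack_landmark."""
    f1 = key >> 12
    df = (key >> 6) & 0x3F
    if df >= 32:
        df -= 64  # undo the modulo-64 wrap for negative differences
    dt = key & 0x3F
    return f1, df, dt


key = pack_landmark(t1=10, f1=100, t2=25, f2=90)
print(key, unpack_landmark(key))
```

The absolute time t1 is deliberately left out of the key. It is stored with the song ID as the hash value, so the same landmark matches wherever the query was cut from the song.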

-47- Shazam: Landmarks as Features (II)
* Pick peaks based on a locally decaying surface
* Matched landmarks
[Figure: matched landmarks between query and reference. Source: Dan Ellis]

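-48- Shazam: Time-justified Landmarks
* Valid landmarks are selected by their offset time: true matches share one consistent offset between query time and song time, which filters out accidental hash collisions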
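Offset-based validation can be sketched as a per-song histogram of `song_time - query_time` over all matched hash keys; the score of a song is the height of its tallest histogram bin. The index layout and the toy data are illustrative assumptions.

```python
from collections import Counter, defaultdict


def rank_by_offset(index, query_landmarks):
    """Rank songs by the number of matched landmarks that agree on one time offset.

    index: dict mapping hash_key -> list of (song_id, song_time)
    query_landmarks: list of (hash_key, query_time)
    """
    histograms = defaultdict(Counter)
    for key, query_time in query_landmarks:
        for song_id, song_time in index.get(key, ()):
            histograms[song_id][song_time - query_time] += 1
    scores = {}
    for song_id, hist in histograms.items():
        offset, count = hist.most_common(1)[0]  # dominant offset for this song
        scores[song_id] = (count, offset)
    return sorted(scores.items(), key=lambda item: item[1][0], reverse=True)


index = {
    1: [("song_a", 100), ("song_b", 5)],
    2: [("song_a", 110), ("song_b", 50)],
    3: [("song_a", 120), ("song_b", 7)],
}
query = [(1, 0), (2, 10), (3, 20)]
print(rank_by_offset(index, query))
```

Both songs contain all three hash keys, but only `song_a` has them at a single consistent offset (100), so it scores 3 while `song_b` scores 1.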

-49- Our AFP Engine
* Database (~2500)
  - Personally collected MP3 (currently)
  - Music collected from YouTube (in the future)
* Driving forces
  - Fundamental issues in CS (hashing, indexing…)
  - Requests from local companies
* Methods
  - Landmarks as features (Shazam)
  - Speedup by hash tables
* Platform
  - Single CPU over MS Windows

-50- Experiments
* Corpora
  - Database: 2550 tracks
  - Test files: 5 songs recorded with mobile phones in a noisy environment, then chopped into segments of 5, 10, 15, and 20 seconds
* Accuracy
  - 5-second clips: 161/275 (58.5%)
  - 10-second clips: 121/136 (89.0%)
  - 15-second clips: 88/90 (97.8%)
  - 20-second clips: 65/66 (98.5%)

-51- Accuracy and Efficiency
[Figures: accuracy vs. query duration; computing time vs. query duration; accuracy vs. computing time]

-52- Demos of Audio Fingerprinting
* Commercial apps
  - Shazam
  - Soundhound
* Ours
  - http://mirlab.org/demo/afpFarmer2550

-53- Conclusions for AFP
* Conclusions
  - Landmark-based methods are effective
* Future work
  - Scale-up
    + 15M tracks in database, 6M tags/day
  - Speed-up
    + Progressive filtering
    + GPU
