
Two Paradigms for Music IR: Query by Singing/Humming and Audio Fingerprinting (2015/10/25). J.-S. Roger Jang (張智星), Multimedia Information Retrieval Lab.




2 Two Paradigms for Music IR: Query by Singing/Humming and Audio Fingerprinting (2015/10/25). J.-S. Roger Jang (張智星), Multimedia Information Retrieval Lab, CS Dept., Tsing Hua Univ., Taiwan. http://mirlab.org/jang

3 Outline
- Introduction to MIR
- QBSH (query by singing/humming): intro, demos, conclusions
- AFP (audio fingerprinting): intro, demos, conclusions

4 Content-based Music Information Retrieval (MIR) via Acoustic Inputs
- Melody
  - Query by humming (usually "ta" or "da")
  - Query by singing
  - Query by whistling
- Note onsets
  - Query by tapping (at the onsets of notes)
- Metadata
  - Query by speech (for metadata such as title, artist, lyrics)
- Audio contents
  - Query by examples (noisy versions of original clips)
- Drums
  - Query by beatboxing

5 Introduction to QBSH
- QBSH: Query by Singing/Humming
  - Input: singing or humming from a microphone
  - Output: a ranked list retrieved from the song database
- Progression
  - First paper: around 1994
  - Extensive studies since 2001
  - State of the art: QBSH tasks at ISMIR/MIREX

6 Two Stages in QBSH
- Offline stage
  - Database preparation
    - From MIDI files
    - From audio music (e.g., MP3)
    - From human vocals
  - Indexing (if necessary)
- Online stage
  - Perform pitch tracking on the user's query
  - Compare the query pitch with songs in the database
  - Return the ranked list according to similarity

7 Frame Blocking for Pitch Tracking
- Frame size = 256 points
- Overlap = 84 points
- Frame rate = 11025/(256-84) ≈ 64 pitch values/sec
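The frame-blocking arithmetic above can be sketched as follows; this is a minimal illustration using the slide's numbers (256-point frames, 84-point overlap, 11025 Hz sampling rate), with hypothetical function names:

```python
import numpy as np

def frame_blocking(signal, frame_size=256, overlap=84):
    """Split a 1-D signal into overlapping frames (hop = frame_size - overlap)."""
    hop = frame_size - overlap                     # 172 samples between frame starts
    n_frames = 1 + (len(signal) - frame_size) // hop
    return np.stack([signal[i * hop : i * hop + frame_size]
                     for i in range(n_frames)])

fs = 11025                                         # sampling rate from the slide
frame_rate = fs / (256 - 84)                       # ~64 pitch estimates per second
frames = frame_blocking(np.zeros(fs))              # one second of audio
```

One second of audio yields 63 full frames, matching the ~64 pitch values/sec rate quoted on the slide.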

8 ACF: Auto-correlation Function
- Frame s(i) and its shifted copy s(i+τ), e.g., τ = 30
- acf(τ) = inner product of the overlapping part of the two frames
- The lag τ that maximizes acf(τ) gives the pitch period
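The ACF computation described above can be sketched as follows; the sine-wave check is an illustrative assumption, not from the slide:

```python
import numpy as np

def acf(frame, tau):
    """Autocorrelation at lag tau: inner product of the overlapping parts
    of the frame and its shifted copy."""
    return np.dot(frame[:len(frame) - tau], frame[tau:])

# A 100 Hz sine at fs = 11025 Hz has a period of ~110 samples, so the ACF
# should peak near lag 110 (small lags near 0 are excluded from the search).
fs = 11025
t = np.arange(1024) / fs
frame = np.sin(2 * np.pi * 100 * t)
pitch_period = max(range(50, 200), key=lambda tau: acf(frame, tau))
pitch_hz = fs / pitch_period
```

In practice the search range for τ is set by the expected pitch range, which is why lags near zero (where the ACF always peaks) are skipped.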

9 Frequency to Semitone Conversion
- Semitone: a musical scale based on A440, i.e., semitone = 69 + 12·log2(freq/440)
- Reasonable pitch range: E2 to C6, i.e., 82 Hz to 1047 Hz (semitone 40 to 84)
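The conversion above is the standard MIDI-style mapping with A440 as semitone 69; a minimal sketch:

```python
import math

def freq_to_semitone(freq_hz):
    """Map frequency (Hz) to a (fractional) semitone number, with A440 = 69."""
    return 69 + 12 * math.log2(freq_hz / 440.0)

# E2 (~82.4 Hz) maps to semitone 40 and C6 (~1046.5 Hz) to semitone 84,
# matching the pitch range quoted on the slide.
```

Working in semitones rather than Hz makes pitch vectors additive under key transposition: shifting a melody up one key adds a constant to every element.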

10 Typical Result of Pitch Tracking
- Pitch tracking via autocorrelation for 茉莉花 (Jasmine Flower)

11 Comparison of Pitch Vectors
- Yellow line: target pitch vector

12 Comparison Methods for QBSH
- Categories of approaches to QBSH
  - Histogram/statistics-based
  - Note vs. note: edit distance
  - Frame vs. note: HMM
  - Frame vs. frame: linear scaling, DTW, recursive alignment
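Of the frame-vs-frame methods listed above, DTW is the classic one; a minimal sketch of the textbook dynamic-programming recurrence (the pitch values and function name are illustrative):

```python
import numpy as np

def dtw_distance(query, target):
    """Frame-vs-frame comparison by classic dynamic time warping (DTW)
    between two 1-D pitch vectors, with |a - b| as the local cost."""
    n, m = len(query), len(target)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(query[i - 1] - target[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # skip a query frame
                                 D[i, j - 1],      # skip a target frame
                                 D[i - 1, j - 1])  # match both
    return D[n, m]

# A query sung slightly slower than the target still matches perfectly.
dist = dtw_distance([60, 62, 64], [60, 62, 62, 64])
```

DTW tolerates local tempo variation that linear scaling cannot, at the price of O(n·m) time per comparison.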

13 Linear Scaling
- Scale the query pitch linearly to match the candidates
- The original input pitch is stretched (e.g., by 1.25, 1.5) and compressed (e.g., by 0.75, 0.5); the scaled version closest to the target pitch in the database is the best match
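The linear-scaling comparison above can be sketched as follows; the mean-absolute-distance scoring and the comparison against the start of the target are plausible assumptions, not details given on the slide:

```python
import numpy as np

def linear_scaling_distance(query, target,
                            factors=(0.5, 0.75, 1.0, 1.25, 1.5)):
    """Resample the query pitch vector by each scaling factor and keep the
    best (lowest) mean absolute distance to the start of the target."""
    best = np.inf
    for f in factors:
        n = int(round(len(query) * f))
        if n < 2 or n > len(target):
            continue                               # scaled query would not fit
        # Stretch/compress the query to n points by linear interpolation.
        scaled = np.interp(np.linspace(0, len(query) - 1, n),
                           np.arange(len(query)), query)
        best = min(best, np.mean(np.abs(scaled - target[:n])))
    return best

# A query that is exactly the first half of the target matches at factor 1.0.
dist = linear_scaling_distance(np.arange(10.0), np.arange(20.0))
```

Linear scaling is far cheaper than DTW (one pass per factor), which is why it works well as an early filtering stage before more expensive comparisons.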

14 Challenges in QBSH Systems
- Song database preparation
  - MIDIs, singing clips, or audio music
- Reliable pitch tracking for acoustic input
  - Input from mobile devices or noisy karaoke bars
- Efficient/effective retrieval
  - Karaoke machine: ~10,000 songs
  - Internet music search engine: ~500,000,000 songs

15 Goal and Approach
- Goal: to retrieve songs effectively within a given response time, say 5 seconds or so
- Our strategy
  - Multi-stage progressive filtering
  - Indexing for different comparison methods
  - Repeating-pattern identification

16 MIRACLE
- MIRACLE: Music Information Retrieval Acoustically via Clustered and paralleL Engines
- Database (~13,000 entries)
  - MIDI files
  - Solo vocals (<100)
  - Melodies extracted from polyphonic music (<100)
- Comparison methods
  - Linear scaling
  - Dynamic time warping
- Top-10 accuracy: 70~75%
- Platform: single CPU + GPU

17 Current MIRACLE
- Single server with GPU: NVIDIA 560 Ti, 384 cores (speedup factor = 10)
- Clients (PCs, PDAs/smartphones over cellular) send a request (the pitch vector) to the master server, which responds with the search result
- Database size: ~13,000

18 QBSH for Various Platforms
- PC: web version
- Embedded systems: karaoke machines
- Smartphones: iPhone/iPad, Android phones
- Toys

19 QBSH Demo
- Demo page of MIR Lab: http://mirlab.org/mir_products.asp
- MIRACLE demo: http://mirlab.org/demo/miracle
- Existing commercial QBSH systems: www.midomi.com, www.soundhound.com

20 Conclusions for QBSH
- QBSH
  - A fun and interesting way to retrieve music
  - Can be extended to singing scoring
  - Commercial applications are maturing
- Challenges
  - How to deal with massive music databases?
  - How to extract melody from audio music?

21 Audio Fingerprinting (AFP)
- Goal
  - Identify a noisy version of a given audio clip (query by example, not by "cover versions")
- Technical barriers
  - Robustness
  - Efficiency (6M tags/day for Shazam)
  - Effectiveness (15M tracks for Shazam)
- Applications
  - Song purchase
  - Royalty assignment (over radio)
  - Confirmation of commercials (over TV)
  - Copyright violation (over the web)
  - TV program identification

22 Two Stages in AFP
- Offline
  - Robust feature extraction (audio fingerprinting)
  - Hash table construction
  - Inverted indexing
- Online
  - Robust feature extraction
  - Hash table search
  - Ranked list of the retrieved songs/music

23 Representative Approaches to AFP
- Philips: J. Haitsma and T. Kalker, "A highly robust audio fingerprinting system", ISMIR 2002.
- Shazam: A. Wang, "An industrial-strength audio search algorithm", ISMIR 2003.
- Google:
  - S. Baluja and M. Covell, "Content fingerprinting using wavelets", Euro. Conf. on Visual Media Production, 2006.
  - V. Chandrasekhar, M. Sharifi, and D. A. Ross, "Survey and evaluation of audio fingerprinting schemes for mobile query-by-example applications", ISMIR 2011.

24 Shazam: Landmarks as Features (source: Dan Ellis)
- Compute the spectrogram and find its local peaks
- Pair peaks to form landmarks: [t1, f1, t2, f2]
- 20-bit hash key: f1 (8 bits), Δf = f2-f1 (6 bits), Δt = t2-t1 (6 bits)
- Hash value: song ID & offset time
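The 20-bit key described above can be packed as follows; the field widths are from the slide, but the exact bit layout and quantization are assumptions for illustration:

```python
def landmark_hash(t1, f1, t2, f2):
    """Pack one landmark [t1, f1, t2, f2] into a 20-bit key:
    f1 in 8 bits, delta-f in 6 bits, delta-t in 6 bits (layout assumed)."""
    df = (f2 - f1) & 0x3F            # 6-bit frequency difference
    dt = (t2 - t1) & 0x3F            # 6-bit time difference
    return ((f1 & 0xFF) << 12) | (df << 6) | dt

# The hash table then maps each key to (song ID, offset time t1) entries,
# so inputs here are quantized spectrogram bin/frame indices.
key = landmark_hash(t1=10, f1=100, t2=20, f2=120)
```

Using peak *pairs* rather than single peaks makes the key far more discriminative, so each hash bucket stays short even with millions of tracks.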

25 Shazam: Landmarks as Features (II)
- Peak picking after smoothing
- Matched landmarks shown in green (source: Dan Ellis)

26 Shazam: Time-justified Landmarks
- Valid landmarks are selected based on offset time, which maintains robustness even with hash collisions
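The offset-time justification above amounts to a voting scheme; a minimal sketch (the data and function name are illustrative assumptions):

```python
from collections import Counter

def best_offset(matches):
    """Each matched landmark votes for the offset (database time - query time).
    A genuine hit produces one dominant offset, while spurious hash
    collisions scatter their votes across many different offsets."""
    votes = Counter(db_t - q_t for q_t, db_t in matches)
    return votes.most_common(1)[0]    # (offset, vote count)

# Four consistent matches at offset 50 plus two random collisions.
matches = [(0, 50), (10, 60), (25, 75), (40, 90), (5, 200), (30, 31)]
offset, count = best_offset(matches)
```

The vote count for the winning offset then serves as the match score, which is why occasional hash collisions do not hurt accuracy.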

27 Our AFP Engine
- Database
  - ~2,500 tracks currently
  - 50K tracks soon
  - 1M tracks in the future
- Driving forces
  - Fundamental issues in computer science (hashing, indexing, ...)
  - Requests from local companies
- Methods
  - Landmarks as features (Shazam)
  - Speedup by hash tables and inverted files
- Platform
  - Currently: single CPU
  - In the future: multiple CPUs & GPUs

28 Experiments
- Corpora
  - Database: 2,550 tracks
  - Test files: 5 mobile-recorded songs chopped into segments of 5, 10, 15, and 20 seconds
- Accuracy test
  - 5-sec clips: 161/275 = 58.6%
  - 10-sec clips: 121/136 = 89.0%
  - 15-sec clips: 88/90 = 97.8%
  - 20-sec clips: 65/66 = 98.5%
- (Figures: accuracy vs. duration, computing time vs. duration, accuracy vs. computing time)

29 Demos of Audio Fingerprinting
- Commercial apps: Shazam, SoundHound
- Our demo: http://mirlab.org/demo/afpFarmer2550

30 Conclusions for AFP
- Conclusions
  - Landmark-based methods are effective
  - Machine learning is indispensable for further improvement
- Future work: scale up
  - Shazam: 15M tracks in database, 6M tags/day
  - Our goal: 50K tracks with a single PC and GPU; 1M tracks with cloud computing on 10 PCs

