Two Paradigms for Music IR: Query by Singing/Humming and Audio Fingerprinting. J.-S. Roger Jang (張智星), Multimedia Information Retrieval Lab. (2015/9/15)

Presentation transcript:

-1- Two Paradigms for Music IR: Query by Singing/Humming and Audio Fingerprinting
J.-S. Roger Jang (張智星), Multimedia Information Retrieval Lab, CS Dept., Tsing Hua Univ., Taiwan
2015/9/15

-2- Outline
- Introduction to MIR
- QBSH (query by singing/humming)
  - Intro, demos, conclusions
- AFP (audio fingerprinting)
  - Intro, demos, conclusions

-3- Introduction to QBSH
- QBSH: Query by Singing/Humming
  - Input: Singing or humming from a microphone
  - Output: A ranked list retrieved from the song database
- Progression
  - First paper: Around 1994
  - Extensive studies since 2001
  - State of the art: QBSH tasks at ISMIR/MIREX

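-4- QBSH Workflow
- Preprocessing:
  - Collect single-track ground-truth melodies (usually MIDI files)
  - Convert them to an intermediate format suitable for comparison
- Online processing:
  - Convert the user's audio input into a pitch vector
  - Convert the pitch vector into notes (optional)
  - Compare against the ground-truth melodies
  - List the ranked results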

-5- Pitch Tracking for QBSH
- Two categories of pitch tracking algorithms
  - Time domain
    - ACF (Autocorrelation function)
    - AMDF (Average magnitude difference function)
    - SIFT (Simple inverse filtering tracking)
  - Frequency domain
    - Harmonic product spectrum method
    - Cepstrum method

-6- Frame Blocking for Pitch Tracking
- Frame size = 256 points
- Overlap = 84 points
- Frame rate = 11025/(256-84) ≈ 64 pitch values/sec
[Figure: zoomed-in waveform with a frame and its overlap marked]
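The frame-rate arithmetic above can be checked with a small sketch. The function names `frame_blocks` and `frame_rate` are hypothetical, and dropping the trailing partial frame is an assumption; the slide does not say how the last frame is handled.

```python
def frame_blocks(signal, frame_size=256, overlap=84):
    """Split a sample sequence into overlapping frames.

    The trailing partial frame is dropped (an assumption, not from the slide).
    """
    hop = frame_size - overlap  # samples between consecutive frame starts
    if hop <= 0:
        raise ValueError("overlap must be smaller than frame_size")
    return [signal[start:start + frame_size]
            for start in range(0, len(signal) - frame_size + 1, hop)]


def frame_rate(sample_rate, frame_size=256, overlap=84):
    """Number of frames (and hence pitch values) produced per second."""
    return sample_rate / (frame_size - overlap)


# One second of silence at 11025 Hz, the sample rate used on the slide.
one_second = [0.0] * 11025
frames = frame_blocks(one_second)
print(len(frames), round(frame_rate(11025), 1))
```

The hop size (frame size minus overlap) is what sets the pitch resolution in time: a larger overlap yields more pitch values per second at the cost of more computation.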

-7- ACF: Auto-correlation Function
- Frame: s(i)
- Shifted frame: s(i+τ), shown for τ = 30
- acf(30) = inner product of the overlapping part of s(i) and s(i+30)
- Pitch period: the lag τ > 0 at which acf(τ) peaks
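A minimal sketch of ACF-based pitch estimation for a single frame, following the "inner product of the overlapping part" description. The search range (82 Hz to 1047 Hz), the 8000 Hz sample rate of the test tone, and the names `acf` and `acf_pitch` are assumptions for illustration, not the author's implementation.

```python
import math


def acf(frame, lag):
    """Inner product of the frame with itself shifted by `lag` samples,
    taken over the overlapping part only."""
    return sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))


def acf_pitch(frame, sample_rate, min_hz=82.0, max_hz=1047.0):
    """Estimate pitch in Hz as sample_rate / (lag that maximizes the ACF).

    The lag search is limited to periods of plausible singing pitch
    (the default range is an assumption).
    """
    min_lag = max(1, int(sample_rate / max_hz))
    max_lag = min(len(frame) - 1, int(sample_rate / min_hz))
    best_lag = max(range(min_lag, max_lag + 1), key=lambda lag: acf(frame, lag))
    return sample_rate / best_lag


# Synthetic 200 Hz sine sampled at 8000 Hz: the true period is 40 samples.
sr = 8000
tone = [math.sin(2 * math.pi * 200 * n / sr) for n in range(256)]
estimate = acf_pitch(tone, sr)
print(estimate)
```

Because the overlapping part shrinks as the lag grows, this unnormalized ACF tapers with lag, which favors the shortest period over its multiples.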

-8- Pitch Tracking via ACF
- Specs
  - Sample rate = Hz
  - Frame size = 32 ms
  - Overlap = 0
  - Frame rate =
- Playback
  - soo.wav
  - sooPitch.wav

-9- Frequency to Semitone Conversion
- Semitone: A musical scale based on A440
- Reasonable pitch range:
  - E2 - C6
  - 82 Hz Hz ( - )
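A sketch of the conversion, assuming the MIDI convention in which A440 maps to semitone 69 with 12 semitones per octave. The slide only says the scale is "based on A440", so the offset of 69 is an assumption, as are the function names.

```python
import math


def hz_to_semitone(freq_hz):
    """Semitone number on the MIDI scale (assumed): A440 -> 69."""
    return 69 + 12 * math.log2(freq_hz / 440.0)


def semitone_to_hz(semitone):
    """Inverse mapping from semitone number back to frequency in Hz."""
    return 440.0 * 2 ** ((semitone - 69) / 12)


# E2 and C6 are the range limits named on the slide; 40 and 84 are
# their MIDI note numbers.
for name, midi in (("E2", 40), ("C6", 84)):
    print(name, midi, round(semitone_to_hz(midi), 1), "Hz")
```

Working in semitones rather than Hz makes key transposition a simple additive shift of the pitch vector, which is what the comparison methods later in the deck rely on.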

-10- Typical Result of Pitch Tracking
[Figure: pitch tracking via autocorrelation for 茉莉花 (Jasmine)]

-11- Comparison of Pitch Vectors
[Figure: pitch-vector comparison; yellow line: target pitch vector]

-12- Comparison Methods of QBSH
- Categories of approaches to QBSH
  - Histogram/statistics-based
  - Note vs. note
    - Edit distance
  - Frame vs. note
    - HMM
  - Frame vs. frame
    - Linear scaling, DTW, recursive alignment

-13- Linear Scaling (LS)
- Concept
  - Scale the query linearly to match the candidates
- Example:

-14- Linear Scaling (II)
- Strength
  - One-shot for dealing with key transposition
  - Efficient and effective
  - Indexing methods available
- Weakness
  - Cannot deal with non-uniform tempo variations
- Typical mapping path

-15- Linear Scaling (III)
- Distance function for LS
  - Normalized L1-norm
  - Normalized L2-norm
- Rest handling
  - Extend previous non-zero note
- Alignment example
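The linear scaling idea from the three slides above can be sketched as follows. This is an illustrative sketch only: the set of scaling factors, the linear-interpolation resampler, mean subtraction as the one-shot treatment of key transposition, and all function names are assumptions, and rest handling is omitted.

```python
def resample(vec, new_len):
    """Linearly interpolate `vec` to `new_len` points."""
    if new_len == 1:
        return [vec[0]]
    out = []
    for k in range(new_len):
        pos = k * (len(vec) - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, len(vec) - 1)
        frac = pos - lo
        out.append(vec[lo] * (1 - frac) + vec[hi] * frac)
    return out


def mean_shifted(vec):
    """Subtract the mean so that a constant key shift cancels out."""
    mean = sum(vec) / len(vec)
    return [v - mean for v in vec]


def linear_scaling_distance(query, reference,
                            factors=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0)):
    """Smallest normalized L1 distance over all tested scaling factors.

    The scaled query is compared with the beginning of the reference.
    """
    best = float("inf")
    for factor in factors:
        n = round(len(query) * factor)
        if n < 2 or n > len(reference):
            continue  # scaled query would not fit inside the reference
        q = mean_shifted(resample(query, n))
        r = mean_shifted(reference[:n])
        best = min(best, sum(abs(a - b) for a, b in zip(q, r)) / n)
    return best


# Reference melody (semitones) and a query that sings its first half
# at half tempo, transposed up by 5 semitones.
reference = [float(p) for p in range(60, 80)]
query = [p + 5 for p in resample(reference[:10], 20)]
other = [70.0, 60.0] * 10
print(linear_scaling_distance(query, reference))
print(linear_scaling_distance(query, other))
```

Since each factor stretches the whole query uniformly, the sketch also shows the stated weakness: a query whose tempo drifts partway through cannot be matched by any single factor.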

-16- Dynamic Time Warping (DTW)
- Goal:
  - Allow comparison with high tolerance to tempo variation
- Characteristics:
  - Robust to irregular tempo variations
  - Trial-and-error for dealing with key transposition
  - Computationally expensive
  - Does not satisfy the triangle inequality
  - Some indexing algorithms do exist
- #1 method for task 2 in QBSH/MIREX 2006

-17- Dynamic Time Warping: Type 1
- t: input pitch vector (8 sec, 128 points)
- r: reference pitch vector
- Local paths: degrees
- DTW recurrence:
[Figure: DTW grid with axes i and j, marking t(i-1), t(i), r(j-1), r(j)]

-18- Dynamic Time Warping: Type 2
- t: input pitch vector (8 sec, 128 points)
- r: reference pitch vector
- Local paths: degrees
- DTW recurrence:
[Figure: DTW grid with axes i and j, marking t(i-1), t(i), r(j-1), r(j)]

-19- Local Path Constraints
- Type 1:
  - local paths
- Type 2:
  - local paths
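The slide text does not preserve the path angles or the recurrences, so the following is only a sketch under stated assumptions: Type 1 is taken to use the 27-45-63 degree step pattern and Type 2 the 0-45-90 degree pattern, both common DTW choices, with the absolute pitch difference as the local cost. The path starts at (0, 0) and may end anywhere in the reference ("match beginning"); key transposition is not handled here.

```python
INF = float("inf")

# Allowed predecessor offsets (di, dj) for each path type.
# ASSUMPTION: these step patterns are not recoverable from the slides.
LOCAL_PATHS = {
    1: ((1, 2), (1, 1), (2, 1)),  # assumed 27-45-63 degree local paths
    2: ((0, 1), (1, 1), (1, 0)),  # assumed 0-45-90 degree local paths
}


def dtw_distance(t, r, path_type=2):
    """DTW distance between input pitch vector t and reference r.

    The path is anchored at (0, 0) and is free to end at any index of r.
    """
    steps = LOCAL_PATHS[path_type]
    D = [[INF] * len(r) for _ in t]
    D[0][0] = abs(t[0] - r[0])
    for i in range(len(t)):
        for j in range(len(r)):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                (D[i - di][j - dj] for di, dj in steps
                 if i - di >= 0 and j - dj >= 0),
                default=INF,
            )
            D[i][j] = abs(t[i] - r[j]) + best_prev
    return min(D[-1])


slow_query = [60, 60, 62, 62, 64, 64]  # each note held twice as long
reference = [60, 62, 64, 65, 67]
print(dtw_distance(slow_query, reference, path_type=2))
print(dtw_distance([60, 62, 64], [60, 61, 62, 63, 64], path_type=1))
```

Under these assumptions, Type 1 limits local tempo change to between half and double speed, while Type 2 allows a frame to repeat arbitrarily often. The table has len(t) × len(r) cells, which is the computational cost noted on the DTW slide.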

-20- DTW Path of “Match Beginning”

-21- DTW Path of “Match Anywhere”

-22- DTW Path of “Match Anywhere”

-23- Challenges in QBSH Systems
- Song database preparation
  - MIDIs, singing clips, or audio music
- Reliable pitch tracking for acoustic input
  - Input from mobile devices or noisy karaoke bars
- Efficient/effective retrieval
  - Karaoke machine: ~10,000 songs
  - Internet music search engine: ~500,000,000 songs

-24-

-25- Goal and Approach
- Goal: To retrieve songs effectively within a given response time, say 5 seconds or so
- Our strategy
  - Multi-stage progressive filtering
  - Indexing for different comparison methods
  - Repeating pattern identification

-26- MIRACLE
- MIRACLE
  - Music Information Retrieval Acoustically via Clustered and paralleL Engines
- Database (~13000)
  - MIDI files
  - Solo vocals (<100)
  - Melody extracted from polyphonic music (<100)
- Comparison methods
  - Linear scaling
  - Dynamic time warping
- Top-10 accuracy
  - 70~75%
- Platform
  - Single CPU+GPU

-27- MIRACLE Before Oct
- Client-server distributed computing
- Cloud computing via clustered PCs
- Database size: ~12,000
[Figure: clients (PC, PDA/smartphone, cellular) send a request (pitch vector) to the master server, which is backed by clustered slave servers and returns a response (search result)]

-28- Current MIRACLE
- Single server with GPU
  - NVIDIA 560 Ti, 384 cores (speedup factor = 66)
- Database size: ~13,000
[Figure: clients (PC, PDA/smartphone, cellular) send a request (pitch vector) to a single master server, which returns a response (search result)]

-29- MIRACLE in the Future
- Multi-modal retrieval
  - Singing, humming, speech, audio, tapping…
[Figure: clients (PC, PDA/smartphone, cellular) send a request (feature vector) to the master server, which is backed by clustered slave servers and returns a response (search result)]

-30- Outlook of MIRACLE
- Web version
- Stand-alone version

-31- QBSH for Other Platforms
- Embedded systems
  - Karaoke machines
- Smartphones
  - iPhone/iPad
  - Android phone
- Toys

-32- Returned Results
- Typical results of MIRACLE

-33- QBSH Demo
- Demo page of MIR Lab:
  - http://mirlab.org/mir_products.asp
- MIRACLE demo:
  - http://mirlab.org/demo/miracle
- Existing commercial QBSH systems
  - www.midomi.com
  - www.soundhound.com

-34- To Make QBSH More Efficient
- Algorithms
  - Indexing of LS/DTW
  - Progressive filtering
- New platforms
  - GPU (10 times faster for QBSH!)
  - Grid/clustered computing
  - Multi-core platforms

-35- Conclusions for QBSH
- QBSH
  - A fun and interesting way to retrieve music
  - Can be extended to singing scoring
  - Commercial applications are maturing
- Challenges
  - How to deal with massive music databases?
  - How to extract melody from audio music?

-36- Audio Fingerprinting (AFP)
- Goal
  - Identify a noisy version of a given audio clip (no “cover versions”)
- Technical barriers
  - Robustness
  - Efficiency (6M tags/day for Shazam)
  - Database collection (15M tracks for Shazam)
- Applications
  - Song purchase
  - Royalty assignment (over radio)
  - Confirmation of commercials (over TV)
  - Copyright violation (over web)
  - TV program ID

-37- Company: Shazam
- Facts
  - First commercial product of audio fingerprinting
  - Since 2002, UK
- Technology
  - Audio fingerprinting
- Founder
  - Avery Wang (PhD at Stanford, 1994)

-38- Company: SoundHound
- Facts
  - First product with multi-modal music search
  - AKA: midomi
- Technologies
  - Audio fingerprinting
  - Query by singing/humming
  - Speech recognition
- Founder
  - Keyvan Mohajer (PhD at Stanford, 2007)

-39- Two Stages in AFP
- Offline: Database construction
  - Robust feature extraction (audio fingerprinting)
  - Hash table construction
  - Inverted indexing
- Online: Application
  - Robust feature extraction
  - Hash table search
  - Ranked list of the retrieved songs/music

-40- Robust Feature Extraction
- Various kinds of features for AFP
  - Invariance along time and frequency
  - Landmark of a pair of local maxima
  - Wavelets
  - …
- Extensive tests are required to choose the best features

-41- Representative Approaches to AFP
- Philips
  - J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system”, ISMIR
- Shazam
  - A. Wang, “An industrial-strength audio search algorithm”, ISMIR 2003
- Google
  - S. Baluja and M. Covell, “Content fingerprinting using wavelets”, Euro. Conf. on Visual Media Production,
  - V. Chandrasekhar, M. Sharifi, and D. A. Ross, “Survey and evaluation of audio fingerprinting schemes for mobile query-by-example applications”, ISMIR 2011

-42- Philips: Thresholding as Features
- Observation
  - The sign of energy differences is robust to various operations
    - Lossy encoding
    - Range compression
    - Added noise
- Thresholding as features
[Figure: magnitude spectrum S(t, f) and the resulting fingerprint F(t, f). Source: Dan Ellis]
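A sketch of the "sign of energy differences" idea. The exact rule below (difference between adjacent bands, then difference between consecutive frames, thresholded at zero) follows the formulation in the Haitsma and Kalker paper cited earlier; the slide itself does not spell it out, and the function name and the toy energy values are hypothetical.

```python
def fingerprint_bits(energy):
    """energy[t][f] is the energy of band f in frame t.

    Returns one row of bits per frame (from frame 1 on), each bit being the
    sign of the band-to-band energy difference relative to the previous frame.
    """
    bits = []
    for t in range(1, len(energy)):
        row = []
        for f in range(len(energy[t]) - 1):
            diff = ((energy[t][f] - energy[t][f + 1])
                    - (energy[t - 1][f] - energy[t - 1][f + 1]))
            row.append(1 if diff > 0 else 0)
        bits.append(row)
    return bits


# Toy band energies: 3 frames x 4 bands (made-up numbers).
energy = [
    [1.0, 2.0, 3.0, 1.0],
    [3.0, 2.0, 1.0, 4.0],
    [0.5, 2.5, 1.0, 1.0],
]
bits = fingerprint_bits(energy)
louder = [[2.0 * e for e in frame] for frame in energy]
print(bits, fingerprint_bits(louder) == bits)
```

Because only the sign of the difference is kept, a uniform gain change leaves every bit unchanged, which is one reason the feature tolerates range compression and lossy encoding.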

-43- Philips: Thresholding as Features (II)
- Robust to low-bitrate MP3 encoding (see the right)
- Sensitive to “frame time difference” → Hop size is kept small!
[Figure: original fingerprint vs. fingerprint after MP3 encoding, BER = 0.078]
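The bit error rate (BER) quoted above is the fraction of fingerprint bits that differ between two versions of the same clip. A minimal sketch, with a hypothetical function name and made-up bit patterns:

```python
def bit_error_rate(fp_a, fp_b):
    """Fraction of differing bits between two equally sized bit matrices."""
    total = 0
    errors = 0
    for row_a, row_b in zip(fp_a, fp_b):
        for a, b in zip(row_a, row_b):
            total += 1
            errors += int(a != b)
    return errors / total


original = [[1, 0, 1, 1], [0, 0, 1, 0]]
degraded = [[1, 1, 1, 1], [0, 0, 0, 0]]
print(bit_error_rate(original, degraded))
```

Two unrelated clips would be expected to give a BER near 0.5, so a value such as 0.078 indicates that the fingerprints still agree on the large majority of bits.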

-44- Philips: Robustness of Features
- BER of the features after various operations
  - Generally low
  - High for speed and time-scale changes (which are unlikely to occur in query by example)

-45- Philips: Search Strategies
- Via hashing
- Inverted indexing

-46- Shazam: Landmarks as Features
- Spectrogram
- Local peaks of spectrogram
- Pair peaks to form landmarks
- Landmark: [t1, f1, t2, f2]
- 20-bit hash key:
  - f1: 8 bits
  - Δf = f2-f1: 6 bits
  - Δt = t2-t1: 6 bits
- Hash value: Song ID & offset time
(Source: Dan Ellis)
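The 20-bit key layout can be sketched with bit packing. The field order (f1 in the high bits, then Δf, then Δt), the wrap-around masking of out-of-range values, and the function names are assumptions; the slide only gives the bit widths.

```python
def landmark_hash(t1, f1, t2, f2):
    """Pack a landmark [t1, f1, t2, f2] into a 20-bit integer key.

    Layout (assumed): f1 in 8 bits | delta-f in 6 bits | delta-t in 6 bits.
    Values are in quantized bin/frame units; out-of-range values wrap.
    """
    df = (f2 - f1) & 0x3F  # 6 bits, wraps negative differences
    dt = (t2 - t1) & 0x3F  # 6 bits
    return ((f1 & 0xFF) << 12) | (df << 6) | dt


def unpack_hash(key):
    """Recover (f1, delta-f, delta-t) from a key built by landmark_hash."""
    return (key >> 12) & 0xFF, (key >> 6) & 0x3F, key & 0x3F


key = landmark_hash(t1=10, f1=100, t2=15, f2=110)
print(key, unpack_hash(key))
```

Note that t1 itself is not part of the key: it is stored alongside the song ID as the hash value, so that the same landmark produces the same key wherever it occurs in a recording.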

-47- Shazam: Landmark as Features (II)
- Pick peaks based on local decaying surface
- Matched landmarks
(Source: Dan Ellis)

-48- Shazam: Time-justified Landmarks
- Valid landmarks based on offset time (which avoids hash collision)
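A sketch of hash-table search with offset-time validation. Matching hash keys vote for a (song, time offset) pair, and only votes that agree on the same offset count toward a song's score, so accidental hash collisions are largely filtered out. The data layout, function names, and toy landmarks are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(songs):
    """songs: {song_id: [(hash_key, time), ...]}
    Returns an inverted index {hash_key: [(song_id, time), ...]}."""
    index = defaultdict(list)
    for song_id, landmarks in songs.items():
        for key, t in landmarks:
            index[key].append((song_id, t))
    return index


def rank_songs(index, query_landmarks):
    """Rank songs by the largest number of matches sharing one time offset."""
    votes = Counter()
    for key, t_query in query_landmarks:
        for song_id, t_song in index.get(key, ()):
            votes[(song_id, t_song - t_query)] += 1
    best = {}
    for (song_id, _offset), count in votes.items():
        best[song_id] = max(best.get(song_id, 0), count)
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Toy database: the query is an excerpt of song "A" starting at time 5.
songs = {
    "A": [(1, 0), (2, 5), (3, 9), (4, 12)],
    "B": [(2, 1), (3, 30), (9, 4)],
}
index = build_index(songs)
ranked = rank_songs(index, [(2, 0), (3, 4), (4, 7)])
print(ranked)
```

In the toy data, song "B" shares two hash keys with the query but at inconsistent offsets, so its score stays at 1 while the three consistent matches for "A" accumulate.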

-49- Our AFP Engine
- Database (~2500)
  - Personally collected MP3 (currently)
  - Music collected from YouTube (in the future)
- Driving forces
  - Fundamental issues in CS (hashing, indexing…)
  - Requests from local companies
- Methods
  - Landmarks as features (Shazam)
  - Speedup by hash tables
- Platform
  - Single CPU over MS Windows

-50- Experiments
- Corpora
  - Database: 2550 tracks
  - Test files: 5 songs recorded with mobile phones in noisy environments, then chopped into segments of 5, 10, 15, and 20 seconds
- Accuracy
  - 5-second clips: 161/275
  - 10-second clips: 121/136
  - 15-second clips: 88/90
  - 20-second clips: 65/66
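For easier comparison, the hit counts above can be converted to percentages. The snippet only restates the slide's figures; the variable names are hypothetical.

```python
# (correctly identified clips, total clips) per query duration in seconds
results = {5: (161, 275), 10: (121, 136), 15: (88, 90), 20: (65, 66)}

accuracy = {sec: 100.0 * hits / total for sec, (hits, total) in results.items()}
for sec, pct in accuracy.items():
    print(f"{sec:2d}-second clips: {pct:.1f}%")
```

The percentages rise from roughly 58.5% for 5-second clips to about 98.5% for 20-second clips, showing how strongly recognition depends on query length.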

-51- Accuracy and Efficiency
[Figures: accuracy vs. query duration; computing time vs. query duration; accuracy vs. computing time]

-52- Demos of Audio Fingerprinting
- Commercial apps
  - Shazam
  - SoundHound
- Ours
  - http://mirlab.org/demo/afpFarmer2550

-53- Conclusions for AFP
- Conclusions
  - Landmark-based methods are effective
- Future work
  - Scale-up
    - 15M tracks in database, 6M tags/day
  - Speed-up
    - Progressive filtering
    - GPU