
1 Modeling Music with Words a multi-class naïve Bayes approach Douglas Turnbull Luke Barrington Gert Lanckriet Computer Audition Laboratory UC San Diego ISMIR 2006 October 11, 2006 Image from vintageguitars.org.uk

2 1 People use words to describe music How would one describe "I'm a Believer" by The Monkees? We might use words related to: Genre: 'Pop', 'Rock', '60's' Instrumentation: 'tambourine', 'male vocals', 'electric piano' Adjectives: 'catchy', 'happy', 'energetic' Usage: 'getting ready to go out' Related Sounds: 'The Beatles', 'The Turtles', 'Lovin' Spoonful' We learn to associate certain words with the music we hear. Image: www.twang-tone.de/45kicks.html

3 2 Modeling music and words Our goal is to design a statistical system that learns a relationship between music and words. Given such a system, we can: 1. Annotation: given the audio content of a song, we can 'annotate' the song with semantically meaningful words (song → words). 2. Retrieval: given a text-based query, we can 'retrieve' relevant songs based on their audio content (words → songs). Image from: http://www.lacoctelera.com/

4 3 Modeling images and words Content-based image annotation and retrieval has been a hot topic in recent years [CV05, FLM04, BJ03, BDF+02, …]. This application has benefited from and inspired recent developments in machine learning. How can MIR benefit from and inspire new developments in machine learning? [Figure: example annotation and retrieval results, including the retrieval query string 'jet'. Images from [CV05], www.oldies.com]

5 4 Related work: Images from www.sixtiescity.com Modeling music and words is at the heart of MIR research: jointly modeling semantic labels and audio content; genre, emotion, style, and usage classification; music similarity analysis. Whitman et al. have produced a large body of work that is closely related to ours [Whi05, WE04, WR05]. Others have looked at joint models of words and sound effects; most focus on non-parametric models (kNN) [SAR-Sla02, AudioClas-CK04].

6 5 Representing music and words Consider a vocabulary and a heterogeneous data set of song-caption pairs: Vocabulary - predefined set of words; Song - set of audio feature vectors (X = {x_1, …, x_T}); Caption - binary document vector (y). Example: "I'm a Believer" by The Monkees is a happy pop song that features tambourine. Given the vocabulary {pop, jazz, tambourine, saxophone, happy, sad}: X = set of MFCC vectors extracted from the audio track; y = [1, 0, 1, 0, 1, 0]. Image from www.bluesforpeace.com

7 6 Overview of our system: Representation [System diagram: given the vocabulary, the captions of the training data are converted into document vectors (y) and the audio tracks pass through audio-feature extraction (X).]

8 7 Probabilistic model for music and words Consider a vocabulary and a set of song-caption pairs: Vocabulary - predefined set of words; Song - set of audio feature vectors (X = {x_1, …, x_T}); Caption - binary document vector (y). For the i-th word in our vocabulary, we estimate a 'word' distribution, P(x|i): a probability distribution over the audio feature vector space, modeled with a Gaussian Mixture Model (GMM) and estimated using Expectation Maximization (EM). Key idea: the training data for each 'word' distribution is the set of all feature vectors from all songs that are labeled with that word. Multiple Instance Learning: includes some irrelevant feature vectors. Weakly Labeled Data: excludes some relevant feature vectors. Our probabilistic model is a set of 'word' distributions (GMMs). Image from www.freewebs.com
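To make the training procedure concrete, here is a minimal Python sketch of the key idea: pool every feature vector from every song whose caption contains word i, then fit a GMM to that pool with EM. This is an illustration, not the authors' code; the data layout and the helper name fit_word_models are assumptions, and scikit-learn's GaussianMixture stands in for the EM estimation step.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_word_models(songs, captions, n_components=8):
    """Fit one GMM per vocabulary word.

    songs    : list of (T_s, d) numpy arrays of audio feature vectors X
    captions : list of binary document vectors y, each of length |V|
    Returns a dict mapping word index i -> fitted GaussianMixture for P(x | i).
    """
    captions = np.asarray(captions)
    models = {}
    for i in range(captions.shape[1]):
        # Training data for word i: all feature vectors from all songs whose
        # caption includes word i (weak / multiple-instance labels).
        pool = [X for X, y in zip(songs, captions) if y[i] == 1]
        if not pool:
            continue  # word never appears in the training captions
        data = np.vstack(pool)
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='diag', max_iter=100)
        gmm.fit(data)  # EM estimation of the 'word' distribution P(x | i)
        models[i] = gmm
    return models
```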

9 8 Overview of our system: Modeling [System diagram: as before, plus parameter estimation with the EM algorithm produces the parametric model, a set of GMMs.]

10 9 Overview of our system: Annotation [System diagram: a novel song is passed through audio-feature extraction, and inference against the set of GMMs produces a caption (annotation).]

11 10 Inference: Annotation Given 'word' distributions P(x|i) and a query song (x_1, …, x_T), we annotate with the word i* that best explains the song. Naïve Bayes assumption: we assume the feature vectors x_s and x_t are conditionally independent given the word i. Assuming a uniform prior over words and taking a log transform, each word is scored by the sum of the log-likelihoods of the feature vectors. Using this score, we annotate the query song with the top N words (see the reconstructed equations below). www.cascadeblues.org
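The equations on this slide are not preserved in the transcript; the following LaTeX reconstructs them from the surrounding description (word distributions P(x|i), naïve Bayes factorization, uniform prior, log transform), so treat it as a reconstruction rather than a verbatim copy of the slide:

```latex
i^{*} = \arg\max_{i} P(i \mid x_1,\dots,x_T)
      = \arg\max_{i} P(x_1,\dots,x_T \mid i)\,P(i)

\text{Na\"ive Bayes: } P(x_1,\dots,x_T \mid i) = \prod_{t=1}^{T} P(x_t \mid i)

\text{Uniform } P(i)\text{, log transform: } i^{*} = \arg\max_{i} \sum_{t=1}^{T} \log P(x_t \mid i)
```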

12 11 Overview of our system: Annotation [System diagram repeated: a novel song is annotated by inference against the set of GMMs.]

13 12 Overview of our system: Retrieval [System diagram: a text query is matched against novel songs by inference against the set of GMMs (retrieval).]

14 13 Inference: Retrieval We would like to rank test songs by P(x_1, …, x_T | q), the probability of a song's audio features given a query word q. Problem: this results in almost the same ranking for all query words. There are two reasons: 1. Length bias: longer songs have proportionately lower log-likelihood because of the sum over additional log terms; this results from the naïve Bayes assumption of conditional independence between audio feature vectors [RQD00]. Image from www.rockakademie-owl.de

15 14 Inference: Retrieval We would like to rank test songs by P(x_1, …, x_T | q), the probability of a song's audio features given a query word q. Problem: this results in almost the same ranking for all query words. There are two reasons: 1. Length bias. 2. Song bias: many conditional word distributions P(x|q) are similar to the generic song distribution P(x), and high-probability (i.e., generic) songs under P(x) often have high probability under P(x|q). Solution: rank by P(q | x_1, …, x_T) instead, i.e., normalize P(x_1, …, x_T | q) by P(x_1, …, x_T). Image from www.rockakademie-owl.de

16 15 Overview of our system [System diagram: the complete pipeline, covering representation, modeling, annotation, and retrieval.]

17 16 Overview of our system: Evaluation [System diagram: annotation and retrieval outputs for novel songs are compared against held-out captions in the evaluation stage.]

18 17 Experimental Setup Data: 2131 song-review pairs. – Audio: popular western music from the last 60 years; dMFCC feature vectors [MB03]; each feature vector summarizes 3/4 of a second of audio content; each song is represented by between 320 and 1920 feature vectors. – Text: song reviews from the AMG Allmusic database; we create a vocabulary of 317 'musically relevant' unigrams and bigrams; a review is a natural-language document written by a music expert; each review is converted into a binary document vector. – 80% training set: used for parameter estimation. – 20% testing set: used for model evaluation. Image from www.chrisbarber.net

19 18 Experimental Setup Tasks: 1. Annotation: annotate each test song with 10 words. 2. Retrieval: rank-order all test songs given a query word. Metrics: we adopt evaluation metrics developed for image annotation and retrieval [CV05]. 1. Annotation: mean per-word precision and recall. 2. Retrieval: mean average precision and mean area under the ROC curve. Image from www.chrisbarber.net

20 19 Quantitative Results
             Annotation           Retrieval
             Recall   Precision   maPrec   AROC
Our Model    .072     .119        .109     0.61
Baseline     .032     .060        .072     0.50
Our model performs significantly better than random for all metrics (one-sided paired t-test with α = 0.1). Recall and precision are bounded by a value less than 1; AROC is perhaps the most intuitive metric. Image from sesentas.ururock.com

21 20 Discussion 1. Music is inherently subjective: different people will use different words to describe the same song. 2. We are learning and evaluating using a very noisy text corpus: reviewers do not make explicit decisions about the relationships between individual words when reviewing a song ("This song does not rock."). Mining the web may not suffice. Solution: manually label data (e.g., MoodLogic, Pandora). Image from www.16-bits.com.ar

22 21 Discussion 3. Our system performs much better when we annotate and retrieve sound effects (BBC sound effect library): a more objective task and a cleaner text corpus; area under the ROC = 0.80 (compare with 0.61 for music). 4. The best results for content-based image annotation and retrieval are comparable to our sound effect results. Image from www.16-bits.com.ar

23 “Talking about music is like dancing about architecture” - origins unknown Please send your questions and comments to Douglas Turnbull - dturnbul@cs.ucsd.edu Image from vintageguitars.org.uk

24 23 References

25 24 References

26 25 Related work: Whitman et al. have produced a large body of work that is closely related to ours [Whi05, WE04, WR05]. It uses web documents associated with artists, not songs, and focuses on vocabulary selection: learn a binary classifier for each word; a word is 'grounded' if the classifier can separate the audio data. It produces some tools for a "query-by-description" system, but no quantitative results on a complete system. How do we combine the outputs of the binary classifiers? The approach would be sensitive to 'weakly labeled' data. Others have looked at joint models of words and sound effects; most focus on non-parametric models (kNN) [AudioClas - CK04, Sla02].

27 26 Qualitative Annotation Results

28 27 Qualitative Retrieval Results

29 28 Text-Feature Extraction Let our vocabulary V be a set of unigram and bigram tokens. For each song review, we: 1) parse the review string into a set of tokens; 2) apply a custom stemming algorithm to the tokens; 3) create a binary document vector d in {0,1}^|V|, where d_i is 1 if the i-th token is present in the review and 0 otherwise. Example: Vocab = {blues, guitar, jazz, blues;guitar, banjo, lick}; Review = "This is a great blues song filled with sweet guitar licks."; Document Vector = [1,1,0,0,0,1]. Discussion: Latent Semantic Analysis (LSA) offers an alternative representation that captures notions of synonymy and polysemy.
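A small Python sketch of steps 1-3 above. The helper names are hypothetical, and the custom stemming algorithm is reduced to a trivial plural-stripping rule purely for illustration; the example reproduces the document vector from the slide.

```python
import re

VOCAB = ['blues', 'guitar', 'jazz', 'blues;guitar', 'banjo', 'lick']

def stem(token, vocab_terms):
    # Stand-in for the custom stemming algorithm: keep known vocabulary terms
    # as-is, otherwise strip a trailing plural 's' (e.g. 'licks' -> 'lick').
    if token in vocab_terms:
        return token
    return token[:-1] if token.endswith('s') else token

def document_vector(review, vocab=VOCAB):
    vocab_terms = set(vocab)
    tokens = [stem(t, vocab_terms) for t in re.findall(r"[a-z']+", review.lower())]
    bigrams = {a + ';' + b for a, b in zip(tokens, tokens[1:])}
    present = set(tokens) | bigrams
    return [1 if term in present else 0 for term in vocab]

print(document_vector("This is a great blues song filled with sweet guitar licks."))
# -> [1, 1, 0, 0, 0, 1]
```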

30 29 Audio-Feature Extraction Each song is a time series of samples representing 1-12 minutes of high-fidelity audio content. – CD audio: if we consider a 5-minute song sampled at 44,100 samples per second, our song lives in a 13.2-million-dimensional space. – We generally consider downsampled, single-channel audio signals. We reduce the dimension of our song by 1. extracting a d-dimensional feature vector for each ¾-second window of audio, and 2. applying a linear transform to each d-dimensional feature vector and retaining a d'-dimensional feature vector. The linear transform is found using principal component analysis (PCA). The resulting representation is a matrix with d' rows and a varying number of columns. – Returning to the example above, we extract about 60 features for each ¾-second window, reduce the dimensionality to 12 features, and output a feature matrix in R^(12 x 400).
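A hedged sketch of the dimensionality-reduction step using scikit-learn's PCA. The 60-D random window features are stand-ins for whatever per-window descriptor is extracted, and d' = 12 simply follows the example above; in practice the PCA transform would be learned once on the training set rather than per song.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_song_features(window_features, d_prime=12, pca=None):
    """window_features : (T, d) array, one d-dimensional vector per 3/4 s window.
    Returns a (d_prime, T) matrix, i.e. d' rows and one column per window."""
    if pca is None:
        # Fitting per song is only for the demo; normally fit on training data.
        pca = PCA(n_components=d_prime).fit(window_features)
    return pca.transform(window_features).T

# Example: a 400-window song with ~60 raw features per window -> R^(12 x 400)
song = np.random.randn(400, 60)
print(reduce_song_features(song).shape)   # (12, 400)
```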

31 30 Audio-Feature Extraction We consider two perceptually motivated feature sets that have shown superior performance on the task of music classification by genre [McKinney & Breebaart '03]: 1. Dynamic Mel-frequency cepstral coefficients (dMFCC). 2. Auditory filterbank temporal envelopes (AFTE) - not discussed today.

32 31 Audio-Feature Extraction Dynamic MFCC (dMFCC) features: 1) For each short-time window (23 msec), extract MFCCs [Logan '00]: 1. Find the spectrum using the Discrete Fourier Transform (DFT); early stages of the auditory system perform analysis in the frequency domain. 2. Calculate the log spectrum (known as the cepstrum); perceptual loudness has been found to be related to the log(magnitude) of a signal. 3. Apply Mel-scaling: a mapping between true frequency and perceived frequency. 4. Separate the frequency components into 40 bins. 5. Apply the discrete cosine transform (DCT): reduces dimensionality - efficient coding. 2) For each time series of 64 13-D MFCC vectors, compute the power spectrum and integrate the power within frequency bands: DC (0 Hz) - average of each feature; 1-2 Hz - rate of musical beats; 3-15 Hz - speech syllabic rates; 20-43 Hz - range used to detect perceptual 'roughness'. The result is a 52-D vector for each ¾-second window of audio content.
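A rough Python sketch of the dMFCC pipeline described above. librosa computes the short-time MFCCs; the window sizes, hop length, and band edges follow the slide, but the exact settings here are assumptions for illustration rather than the authors' configuration.

```python
import numpy as np
import librosa

def dmfcc(y, sr=22050, n_mfcc=13, frames_per_window=64):
    # 1) Short-time MFCCs: ~23 ms analysis windows, ~12 ms hop, 40 mel bins, DCT.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=256, n_mels=40)
    frame_rate = sr / 256.0                                   # MFCC frames per second
    mod_freqs = np.fft.rfftfreq(frames_per_window, d=1.0 / frame_rate)
    bands = [(0.0, 0.5), (1.0, 2.0), (3.0, 15.0), (20.0, 43.0)]   # modulation bands, Hz

    features = []
    # 2) For each block of 64 MFCC frames (~3/4 s), take the modulation power
    #    spectrum of every coefficient and integrate it within each band.
    for start in range(0, mfcc.shape[1] - frames_per_window + 1, frames_per_window):
        block = mfcc[:, start:start + frames_per_window]          # (13, 64)
        power = np.abs(np.fft.rfft(block, axis=1)) ** 2           # (13, 33)
        vec = [power[:, (mod_freqs >= lo) & (mod_freqs <= hi)].sum(axis=1)
               for lo, hi in bands]
        features.append(np.concatenate(vec))                      # 4 bands x 13 = 52-D
    return np.array(features)                                     # (num_windows, 52)
```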

33 32 Audio-Feature Extraction Auditory Filterbank Temporal Envelope (AFTE) features: 1) Each half-overlapping, ¾-second window of audio is passed through a bank of 18 gammatone filters [Patterson et al. 1988]. The gammatone filterbank output is an array of filtered waves that simulate the motion of the basilar membrane in the cochlea as a function of time. Center frequencies are spaced logarithmically from 26 to 11025 Hz. 2) The windowed FFT spectrum of each gammatone filter is summarized by summing the energy in 4 bands: 0 Hz (DC), 3-15 Hz, 20-150 Hz, 150-1000 Hz. The result is a 72-D vector for each ¾-second window of audio content.

34 33 Inference: Retrieval To correct for length and song bias, we normalize P(x_1, …, x_T | q) by P(x_1, …, x_T). Since P(q) is constant across songs for a fixed query, this normalization can be interpreted as ranking songs by P(q | x_1, …, x_T). Intuition: normalizing by P(x_1, …, x_T) allows each song to place emphasis (i.e., weight) on words that increase the likelihood of its audio features, {x_1, …, x_T}. Image from www.rockakademie-owl.de
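The relation behind this normalization, written out in LaTeX (a reconstruction from the surrounding text, since the equation itself is not in the transcript):

```latex
P(q \mid x_1,\dots,x_T)
  = \frac{P(x_1,\dots,x_T \mid q)\,P(q)}{P(x_1,\dots,x_T)}
  \;\propto\; \frac{P(x_1,\dots,x_T \mid q)}{P(x_1,\dots,x_T)}
\quad \text{(for a fixed query } q,\ P(q)\ \text{is the same for every song)}
```

So ranking songs by P(q | x_1, …, x_T) is equivalent to ranking by the word-conditional likelihood divided by the generic song likelihood.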

35 34 Intellectual Motivation 1.Novel Computer Audition Techniques Sound subdomains - sound effects, animal vocalizations, speech Audition problem - monitoring, identification, characterization 2.Musical Knowledge Discovery Finding semantically meaningful words that we use to describe music Learning compact representations of an audio track 3.Models of Human Audition Low-level feature extraction and high-level modeling 4.Introducing results from machine learning, computer vision, and natural language processing to audition researchers 5.Improving existing commercial applications.

36 35 Related Research This work has been inspired by research on image annotation and music classification by genre. Content-based Image Annotation: 1) segment an image into blocks or 'blobs'; 2) extract image features from each segment; 3) model the joint probability of words and image features. Recent work on this problem: Object Recognition as Machine Translation [Duygulu, Barnard, de Freitas, Forsyth '02]; Correspondence-Latent Dirichlet Allocation [Blei & Jordan '03]; Supervised M-ary Model [Carneiro and Vasconcelos '05].

37 36 Related Research Music classification by genre: given a novel song, is the song a rock, rap, reggae, classical, country, blues, or disco song? – Research focus was on audio feature extraction. 1) Feature Extraction - a low-dimensional representation of audio information. Short-time feature design - extracted over ~25 msec of audio: – Fourier-based: timbral, pitch, and rhythm features [Tzanetakis & Cook '02] – Wavelet-based audio features [Li, Ogihara, and Li '03] – Models based on human perception - loudness, roughness, etc. [McKinney & Breebaart '03]. Feature Integration - merging short-time feature vectors over a medium-time (~1 sec) window: – Simple statistics - mean, variance, skewness, kurtosis [Tzanetakis & Cook '02] (see the sketch below) – Filterbank transform on a time series of feature vectors [McKinney & Breebaart '03] – Autocorrelation and Linear Predictive Coding [Meng, Ahrendt, and Larsen '05]. Dimensionality reduction using Principal Component Analysis (PCA).
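As a concrete example of the "simple statistics" integration step mentioned above, a minimal sketch that collapses a run of short-time feature vectors into one medium-time vector; the window length of 43 frames (~1 s of ~25 ms frames) is an assumption, and scipy supplies the skewness and kurtosis.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def integrate_features(short_time, window=43):
    """short_time : (T, d) array of short-time (~25 ms) feature vectors.
    Concatenates mean, variance, skewness, and kurtosis over ~1 s windows."""
    out = []
    for start in range(0, short_time.shape[0] - window + 1, window):
        block = short_time[start:start + window]
        out.append(np.concatenate([block.mean(axis=0), block.var(axis=0),
                                   skew(block, axis=0), kurtosis(block, axis=0)]))
    return np.array(out)   # (num_windows, 4 * d)
```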

38 37 Related Research Music classification by genre 2) Supervised Learning: use labeled feature vectors (x,y) to train a model. The model can then be used to predict labels (ŷ) for an unlabeled song (x). –Labels: ‘rock’, ‘country’, ‘jazz’, ‘blues’, ’classical’, … –Models in practice include SVMs, KNNs, GMMs, LDA, etc. Warning: The concept of genre is ill-defined since it is a subjective concept. Since authors make varying assumptions about genre (number of genres, names of genres, hierarchical vs. flat taxonomy), it is hard to directly compare classification results.

39 38 Query-by-text: a novel approach Music Information Retrieval (MIR) research involves the retrieval, classification, and management of music [Goto & Hirata '04]. Retrieval methods have focused on 1. query by humming, 2. query by fragment, 3. query by similarity, and 4. query using collaborative filtering. Our approach can be described as "query by text". One other line of MIR research that uses a heterogeneous dataset of text reviews and audio content is [Whitman & Ellis '04]; they focus on finding semantically meaningful words and unbiased sentences that describe audio content.

40 39 Four Parameter Estimation Techniques Direct Model for word w: 1) merge all of the ¾-second feature vectors from all the songs that have w in the associated song reviews; 2) learn a GMM (via EM) using this set of feature vectors. Naïve Average Model for word w: 1) estimate a 'song-level' GMM for each song that has w in the associated song review; 2) merge the sets of GMM components and rescale the component priors (see the sketch below). [Diagram legend: word-level GMM distribution (width represents # of components); feature vectors for a song (width represents # of vectors); song-level GMM distribution (width represents # of components); arrows labeled 'EM for GMM'.]
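A hedged sketch of the naïve average estimator: fit a song-level GMM for each song tagged with w, then pool the components into one word-level mixture with rescaled priors. The function names and the choice of 4 components per song are illustrative assumptions; the merged model is returned as plain arrays and scored with logsumexp.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def naive_average_word_model(songs_with_word, k_song=4):
    """songs_with_word : list of (T_s, d) arrays for songs whose review contains w.
    Returns (weights, means, covariances) of the merged word-level GMM."""
    weights, means, covs = [], [], []
    for X in songs_with_word:
        gmm = GaussianMixture(n_components=k_song, covariance_type='diag').fit(X)
        weights.append(gmm.weights_)      # components keep their song-level priors...
        means.append(gmm.means_)
        covs.append(gmm.covariances_)
    n_songs = len(songs_with_word)
    # ...rescaled so the merged mixture weights still sum to one.
    return (np.concatenate(weights) / n_songs,
            np.concatenate(means, axis=0),
            np.concatenate(covs, axis=0))

def log_density(x, model):
    # log P(x | w) under the merged word-level GMM
    w, mu, var = model
    comps = [np.log(w[c]) + multivariate_normal.logpdf(x, mu[c], np.diag(var[c]))
             for c in range(len(w))]
    return logsumexp(comps)
```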

41 40 Four Parameter Estimation Techniques Center Model for word w: 1) estimate song-level GMMs; 2) estimate the word-level model using the centers from the song-level GMMs. Mixture Hierarchy Model for word w: 1) estimate 'song-level' GMMs; 2) estimate the 'word-level' model using the means and covariances from the song mixtures, using 'mixture hierarchies EM' to soft-cluster the mixture components [Vasconcelos '01]. [Diagram legend: component centers from a song mixture (width represents # of vectors); arrows labeled 'MixHier EM' and 'EM for GMM'.]

42 41 Evaluation Evaluating the performance of our system is difficult because music is inherently subjective. Annotation: Ask two people to review the same song, and you will find that the reviews may have little in common. [Whitman & Ellis ‘04] Did the reviewer “forget” to use a relevant word in a human review? Retrieval: How do we measure “song similarity”? A jazz purist may think that “Oops I did it again” by Britney Spears and “Hanging Tough” by NKOTB are similar when, in fact, they are not.

43 42 Annotation Evaluation Annotation: mean per-word precision and recall. 1. Annotate each test song with N words. 2. For each word w in our vocabulary compute: |w_H| = # of human annotations with word w; |w_A| = # of automatic annotations with word w; |w_C| = # of correct automatic annotations. Recall = |w_C| / |w_H|; Precision = |w_C| / |w_A|. 3. Calculate the average over all words in V.
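A minimal sketch of this per-word precision/recall computation; binary matrices of human vs. automatic annotations are assumed as inputs, and words with no usable counts are simply skipped.

```python
import numpy as np

def mean_per_word_precision_recall(human, auto):
    """human, auto : (n_songs, |V|) binary matrices of annotations.
    Returns (mean precision, mean recall) averaged over words, skipping words
    with no automatic (precision) or no human (recall) annotations."""
    precisions, recalls = [], []
    for w in range(human.shape[1]):
        wH, wA = human[:, w].sum(), auto[:, w].sum()
        wC = np.logical_and(human[:, w], auto[:, w]).sum()
        if wA > 0:
            precisions.append(wC / wA)
        if wH > 0:
            recalls.append(wC / wH)
    return np.mean(precisions), np.mean(recalls)
```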

44 43 Retrieval Evaluation Retrieval: mean per-word area under the ROC curve (mAROC) and mean per-word average precision (mAP). 1. Using each word as a query q, rank the songs in the test set. 2. Calculate: a. Area under the ROC curve: the ROC plots the true positive rate as a function of the false positive rate. b. Average precision: record the precision each time we correctly identify a song that matches the query, then average the precisions. 3. Calculate the average over all words in V.
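And a sketch of the retrieval metrics using scikit-learn's implementations of ROC area and average precision, averaged over the query words; the relevance/score matrix layout is an assumption for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def retrieval_metrics(relevance, scores):
    """relevance : (n_songs, |V|) binary ground truth (song matches word).
    scores      : (n_songs, |V|) ranking scores, e.g. P(q | x_1..x_T) per word.
    Returns (mean AROC, mean average precision) over usable query words."""
    arocs, aps = [], []
    for w in range(relevance.shape[1]):
        y = relevance[:, w]
        if y.min() == y.max():      # need both relevant and irrelevant songs
            continue
        arocs.append(roc_auc_score(y, scores[:, w]))
        aps.append(average_precision_score(y, scores[:, w]))
    return np.mean(arocs), np.mean(aps)
```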

45 44 Discussion Representing Music with Words: our reviews represent a "noisy" version of our ideal human annotations – valid words are missing; erroneous words appear (e.g., 'this song does not rock'). What is a good annotation? – commercial databases: MoodLogic, Pandora – psychological experiments. Vocabulary: discover semantically meaningful words [Whitman & Ellis '04][Barnard '06]; we may be able to "artificially" enlarge our vocabulary using synonyms and antonyms (e.g., WordNet). Dataset: we extract features from our dataset of compressed MP3 files. This is like extracting image features from JPEG files. Are we introducing a potential "data leak" based on the compression algorithm?

46 45 Discussion Modeling/Inference: for existing GMM-based models, how do we best estimate the parameters? – Specific: four parameter estimation techniques for estimating each word model. – General: numerous heuristics for learning a GMM using EM. There are other ways to model audio and text features: 1. Hidden Markov models (HMMs) - by using GMMs, we are ignoring longer-term temporal information; one idea is to use HMMs to model the trajectories of acoustic features over time. 2. Audio segmentation - we used a "block-based" music decomposition, but employing automatic segmentation may prove useful. 3. Latent Semantic Analysis - an alternative representation that is useful for uncovering synonymy and polysemy.


