Presentation on theme: "People use words to describe music"— Presentation transcript:

0 Modeling Music with Words: a multi-class naïve Bayes approach
Douglas Turnbull, Luke Barrington, Gert Lanckriet
Computer Audition Laboratory, UC San Diego
ISMIR 2006, October 11, 2006
Image from vintageguitars.org.uk

1 People use words to describe music
How would one describe “I’m a Believer” by The Monkees? We might use words related to:
Genre: ‘Pop’, ‘Rock’, ‘60s’
Instrumentation: ‘tambourine’, ‘male vocals’, ‘electric piano’
Adjectives: ‘catchy’, ‘happy’, ‘energetic’
Usage: ‘getting ready to go out’
Related Sounds: ‘The Beatles’, ‘The Turtles’, ‘Lovin’ Spoonful’
We learn to associate certain words with the music we hear.

2 Modeling music and words
Our goal is to design a statistical system that learns a relationship between music and words. Given such a system, we can do two things:
Annotation: given the audio content of a song, we can ‘annotate’ the song with semantically meaningful words (song → words).
Retrieval: given a text-based query, we can ‘retrieve’ relevant songs based on their audio content (words → songs).

3 Modeling images and words
Content-based image annotation and retrieval has been a hot topic in recent years [CV05, FLM04, BJ03, BDF+02, …]. This application has benefited from and inspired recent developments in machine learning.
[Example images from [CV05]: retrieval results for the query string ‘jet’, and automatic image annotation.]
How can MIR benefit from and inspire new developments in machine learning?

4 Related work
Modeling music and words is at the heart of MIR research:
jointly modeling semantic labels and audio content
genre, emotion, style, and usage classification
music similarity analysis
Whitman et al. have produced a large body of work that is closely related to ours [Whi05, WE04, WR05]. Others have looked at joint models of words and sound effects. Most focus on non-parametric models (kNN) [SAR-Sla02, AudioClas-CK04].

5 Representing music and words
Consider a vocabulary and a heterogeneous data set of song-caption pairs:
Vocabulary - a predefined set of words
Song - a set of audio feature vectors (X = {x1, …, xT})
Caption - a binary document vector (y)
Example: “I’m a Believer” by The Monkees is a happy pop song that features tambourine. Given the vocabulary {pop, jazz, tambourine, saxophone, happy, sad}:
X = the set of MFCC vectors extracted from the audio track
y = [1, 0, 1, 0, 1, 0]
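As a concrete illustration of the caption representation, here is a minimal sketch of turning a caption into a binary document vector; the helper name and the naive word matching are my own simplification, not the paper's text-processing pipeline.

```python
# Hypothetical sketch: encode a caption as a binary document vector y
# over a fixed vocabulary (simple word matching; the paper's actual
# processing of AMG reviews is more involved).
vocabulary = ["pop", "jazz", "tambourine", "saxophone", "happy", "sad"]

def caption_to_vector(caption, vocabulary):
    """Return y with y[i] = 1 if the i-th vocabulary word occurs in the caption."""
    words = set(caption.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

y = caption_to_vector("a happy pop song that features tambourine", vocabulary)
print(y)  # [1, 0, 1, 0, 1, 0]
```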

6 Overview of our system: Representation
[System diagram: the training data (songs with captions) and the vocabulary are converted into caption document vectors (y) and, via audio-feature extraction, audio feature vectors (X).]

7 Probabilistic model for music and words
Consider a vocabulary and a set of song-caption pairs:
Vocabulary - a predefined set of words
Song - a set of audio feature vectors (X = {x1, …, xT})
Caption - a binary document vector (y)
For the i-th word in our vocabulary, we estimate a ‘word’ distribution P(x|i):
a probability distribution over the audio feature vector space
modeled with a Gaussian Mixture Model (GMM)
the GMM is estimated using Expectation Maximization (EM)
Key idea: the training data for each ‘word’ distribution is the set of all feature vectors from all songs that are labeled with that word.
Multiple-instance learning: the pooled set includes some irrelevant feature vectors
Weakly labeled data: the pooled set excludes some relevant feature vectors
Our probabilistic model is a set of ‘word’ distributions (GMMs); a sketch of the training step follows below.
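A minimal sketch of this per-word training step, assuming scikit-learn's GaussianMixture for the EM-estimated GMMs; the function name, the number of mixture components, and the diagonal covariance choice are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of per-word GMM training with scikit-learn;
# the paper's own EM setup and model sizes may differ.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_word_models(songs, captions, n_components=8):
    """songs: list of (T_i x D) arrays of audio feature vectors.
    captions: list of binary vectors y over a shared vocabulary.
    Returns one GMM per vocabulary word, fit on the pooled feature
    vectors of every song labeled with that word."""
    n_words = len(captions[0])
    models = {}
    for i in range(n_words):
        # Pool feature vectors from all songs whose caption contains word i.
        pooled = [X for X, y in zip(songs, captions) if y[i] == 1]
        if not pooled:
            continue  # word never appears in the training captions
        data = np.vstack(pooled)
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(data)  # EM estimation of the 'word' distribution P(x|i)
        models[i] = gmm
    return models
```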

8 Overview of our system: Modeling
[System diagram: the caption document vectors (y) and extracted audio features (X) feed parameter estimation (EM algorithm), which produces the parametric model: a set of GMMs.]

9 Overview of our system: Annotation
[System diagram: the trained set of GMMs is used in the inference step to annotate a novel song with a caption.]

10 Inference: Annotation
Given ‘word’ distributions P(x|i) and a query song (x1, …, xT), we annotate with the word
i* = argmax_i P(i | x1, …, xT).
Naïve Bayes assumption: we assume the feature vectors are conditionally independent given the word i, so
P(x1, …, xT | i) = P(x1 | i) × … × P(xT | i).
Assuming a uniform prior over words and taking a log transform, we have
i* = argmax_i Σ_t log P(xt | i).
Using this equation, we annotate the query song with the top N words.
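A minimal sketch of this annotation rule, reusing the hypothetical per-word GMMs from the previous sketch; `score_samples` returning per-vector log-likelihoods is scikit-learn behavior, and the rest is an illustrative assumption rather than the authors' code.

```python
# Hypothetical sketch: naive Bayes annotation with the per-word GMMs.
import numpy as np

def annotate(X, models, vocabulary, top_n=10):
    """X: (T x D) array of feature vectors for the query song.
    Rank words by sum_t log P(x_t | i) (uniform word prior) and
    return the top_n vocabulary words."""
    scores = {i: np.sum(gmm.score_samples(X)) for i, gmm in models.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [vocabulary[i] for i in ranked[:top_n]]
```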

11 Overview of our system: Annotation
[System diagram, repeated from slide 9: inference annotates a novel song with a caption using the trained GMMs.]

12 Overview of our system: Retrieval
[System diagram: a text query enters the inference step, which now also performs retrieval, ranking songs for the query.]

13 Inference: Retrieval
We would like to rank the test songs for a query word q by the likelihood P(x1, …, xT | q). Problem: this results in almost the same ranking for all query words. There are two reasons:
Length bias: longer songs have proportionately lower likelihood, resulting from the sum of additional log terms. This follows from the naïve Bayes assumption of conditional independence between audio feature vectors [RQD00].

14 Inference: Retrieval
We would like to rank the test songs for a query word q by the likelihood P(x1, …, xT | q). Problem: this results in almost the same ranking for all query words. There are two reasons:
Length bias (previous slide)
Song bias: many conditional word distributions P(x|q) are similar to the generic song distribution P(x), so high-probability (i.e., generic) songs under P(x) often have high probability under P(x|q).
Solution: rank by the posterior P(q | x1, …, xT) instead, i.e., normalize P(x1, …, xT | q) by P(x1, …, xT); a sketch follows below.
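A minimal sketch of the normalized ranking, again reusing the hypothetical per-word GMMs; approximating the generic song distribution P(x) as a uniform mixture of the word models is my assumption for illustration, not necessarily how the paper estimates it.

```python
# Hypothetical sketch: retrieval with the length/song-bias correction,
# ranking songs by sum_t [log P(x_t | q) - log P(x_t)], where P(x_t) is
# approximated by a uniform mixture of the per-word GMMs.
import numpy as np
from scipy.special import logsumexp

def retrieve(query_word, test_songs, models):
    """test_songs: list of (T x D) arrays. Returns song indices ranked
    for the query word, best match first."""
    word_ids = list(models.keys())
    q_row = word_ids.index(query_word)
    scores = []
    for X in test_songs:
        # Per-frame log P(x_t | i) for every word i: shape (V, T).
        frame_ll = np.stack([models[i].score_samples(X) for i in word_ids])
        # Per-frame log P(x_t) under a uniform mixture of the word models.
        frame_px = logsumexp(frame_ll, axis=0) - np.log(len(word_ids))
        scores.append(np.sum(frame_ll[q_row] - frame_px))
    return list(np.argsort(scores)[::-1])
```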

15 Overview of our system
[Full system diagram: training data and vocabulary → caption document vectors (y) and audio-feature extraction (X) → parameter estimation (EM algorithm) → parametric model (set of GMMs) → inference, which annotates a novel song with a caption and retrieves songs for a text query.]

16 Overview of our system: Evaluation
[System diagram: an evaluation block is added, assessing both the annotation of novel songs and the retrieval results for text queries.]

17 Experimental Setup
Data: 2131 song-review pairs
Audio: popular western music from the last 60 years
DMFCC feature vectors [MB03]
Each feature vector summarizes 3/4 of a second of audio content
Each song is represented by between … feature vectors
Text: song reviews from the AMG Allmusic database
We create a vocabulary of 317 ‘musically relevant’ unigrams and bigrams
A review is a natural-language document written by a musical expert
Each review is converted into a binary document vector
80% training set: used for parameter estimation
20% testing set: used for model evaluation
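The DMFCC audio front end [MB03] is not reproduced here; as a rough, hypothetical stand-in, this sketch extracts plain MFCC vectors with librosa and averages them over roughly 3/4-second blocks, which is only loosely analogous to the paper's features.

```python
# Hypothetical stand-in for the audio front end: plain MFCCs averaged
# over ~3/4-second blocks via librosa. The paper uses DMFCC features
# [MB03], which this sketch does not reproduce.
import librosa
import numpy as np

def song_features(path, win_seconds=0.75, n_mfcc=13):
    """Return a (T x n_mfcc) array, one summary vector per ~win_seconds."""
    audio, sr = librosa.load(path, sr=22050, mono=True)
    hop = 512
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
    frames_per_win = max(1, int(win_seconds * sr / hop))
    n_win = mfcc.shape[1] // frames_per_win
    blocks = mfcc[:, :n_win * frames_per_win].reshape(n_mfcc, n_win, frames_per_win)
    return blocks.mean(axis=2).T  # shape (n_win, n_mfcc)
```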

18 Experimental Setup
Tasks:
Annotation: annotate each test song with 10 words
Retrieval: rank-order all test songs given a query word
Metrics: we adopt evaluation metrics developed for image annotation and retrieval [CV05].
Annotation: mean per-word precision and recall
Retrieval: mean average precision; mean area under the ROC curve
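A minimal sketch of the two retrieval metrics, assuming scikit-learn's implementations; the dictionary layout of rankings and ground truth is an illustrative assumption, not the paper's evaluation code.

```python
# Hypothetical sketch of the retrieval metrics named on this slide.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def retrieval_metrics(rankings, relevance):
    """rankings: dict word -> score per test song (higher = better match).
    relevance: dict word -> binary ground-truth labels per test song.
    Returns mean average precision and mean area under the ROC curve."""
    ap, auc = [], []
    for w in rankings:
        y_true, y_score = np.asarray(relevance[w]), np.asarray(rankings[w])
        if y_true.min() == y_true.max():
            continue  # word is all-relevant or all-irrelevant; metrics undefined
        ap.append(average_precision_score(y_true, y_score))
        auc.append(roc_auc_score(y_true, y_score))
    return float(np.mean(ap)), float(np.mean(auc))
```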

19 Quantitative Results

             Annotation             Retrieval
             Recall     Precision   maPrec    AROC
Our Model    .072       .119        .109      0.61
Baseline     .032       .060        …         0.50

Our model performs significantly better than random for all metrics (one-sided paired t-test with α = 0.1).
Recall and precision are bounded above by a value less than 1.
AROC is perhaps the most intuitive metric.
Image from sesentas.ururock.com

20 Discussion
1. Music is inherently subjective
Different people will use different words to describe the same song.
2. We are learning and evaluating using a very noisy text corpus
Reviewers do not make explicit decisions about the relationships between individual words when reviewing a song (“This song does not rock.”).
Mining the web may not suffice. Solution: manually label data (e.g., MoodLogic, Pandora).

21 Discussion
3. Our system performs much better when we annotate and retrieve sound effects (BBC sound effects library)
a more objective task
a cleaner text corpus
area under the ROC = 0.80 (compare with 0.61 for music)
4. The best results for content-based image annotation and retrieval are comparable to our sound-effect results.

22 Douglas Turnbull - dturnbul@cs.ucsd.edu
“Talking about music is like dancing about architecture” (origins unknown)
Please send your questions and comments to Douglas Turnbull - dturnbul@cs.ucsd.edu
Image from vintageguitars.org.uk

23 References

24 References

