1 Identifying Words that are Musically Meaningful
David Torres, Douglas Turnbull, Luke Barrington, Gert Lanckriet
Computer Audition Lab, UC San Diego
ISMIR, September 25, 2007

2 Introduction
Our goal: create a content-based music search engine for natural-language queries (the CAL Music Search Engine [SIGIR07]).
Problem: how do we pick a vocabulary of musically meaningful words?
- A word is meaningful if its presence corresponds to a pattern in the audio content.
Solution: find words that are correlated with a set of acoustic signals.

3 Two-View Representation
Consider a set of annotated songs. Each song is represented by:
1. An annotation vector in a semantic space
2. An audio feature vector (or vectors) in an acoustic space
[Figure: example songs ('Mustang Sally' by The Commitments, 'Riverdance' by Bill Whelan, 'Hot Pants' by James Brown) plotted in a 2-D semantic space with axes 'funky' and 'Ireland', and in a 2-D acoustic space with axes x and y.]

4 Semantic Representation
Vocabulary of words:
1. CAL500: 174 phrases from a human survey (instrumentation, genre, emotion, usages, vocal characteristics)
2. LastFM: ~15,000 tags from a social music site
3. Web mining: 100,000+ words mined from text documents
Annotation vector, denoted s:
- Each element represents the 'semantic association' between a word and the song.
- Its dimension D_S is the size of the vocabulary.
- Example: Frank Sinatra's 'Fly Me to the Moon'
  Vocabulary = {funk, jazz, guitar, female vocals, sad, passionate}
  Annotation s_i = [0/4, 3/4, 4/4, 0/4, 2/4, 1/4]
The data set is represented by an N x D_S matrix S whose rows are the annotation vectors s_1, ..., s_N.
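To make the representation concrete, here is a minimal Python sketch (with hypothetical listener votes, not actual CAL500 data) of building annotation vectors as the fraction of listeners who applied each word to each song:

```python
import numpy as np

# Hypothetical vocabulary and listener votes (4 listeners per song),
# mirroring the 'Fly Me to the Moon' example on this slide.
vocab = ["funk", "jazz", "guitar", "female vocals", "sad", "passionate"]
votes = {
    "Fly Me to the Moon": [0, 3, 4, 0, 2, 1],
    "Mustang Sally":      [4, 0, 3, 0, 0, 2],
}
n_listeners = 4

# S is N x D_S: one row (annotation vector) per song.
S = np.array([votes[song] for song in votes]) / n_listeners
print(S[0])  # [0.   0.75 1.   0.   0.5  0.25]
```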

5 Acoustic Representation
Each song is represented by an audio feature vector a that is automatically extracted from the audio content.
The data set is represented by an N x D_A matrix A whose rows are the feature vectors a_1, ..., a_N.
[Figure: the same songs shown in the 2-D semantic and acoustic spaces, as on the Two-View Representation slide.]

6 Canonical Correlation Analysis (CCA)
CCA is a technique for exploring dependencies between two related spaces:
- a generalization of PCA to multiple spaces,
- posed as a constrained optimization problem.
Find weight vectors w_s and w_a such that:
- Sw_s is a 1-D projection of the data in the semantic space,
- Aw_a is a 1-D projection of the data in the acoustic space.
Maximize the correlation of the two projections, constraining w_s and w_a to prevent arbitrarily large correlation:

max_{w_a, w_s} (Sw_s)^T (Aw_a)
subject to: (Sw_s)^T (Sw_s) = 1
            (Aw_a)^T (Aw_a) = 1
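For intuition, here is a minimal one-component CCA sketch using scikit-learn on random stand-in matrices; this is a generic library routine, not the solver used in the paper:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Toy data: N songs, D_S semantic dims, D_A acoustic dims (random stand-ins).
rng = np.random.default_rng(0)
N, D_S, D_A = 100, 6, 13
S = rng.random((N, D_S))
A = rng.random((N, D_A))

cca = CCA(n_components=1)
u, v = cca.fit_transform(S, A)          # u ~ S w_s, v ~ A w_a
corr = np.corrcoef(u[:, 0], v[:, 0])[0, 1]
print(f"first canonical correlation: {corr:.3f}")
print("semantic weights w_s:", cca.x_weights_[:, 0])
```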

7 CCA Visualization
[Figure: a worked toy example. Four songs (a, b, c, d) are plotted in a 2-D semantic space with axes 'funky' and 'Ireland' and in a 2-D audio feature space with axes x and y. The sparse solution keeps only the 'funky' dimension of w_s (the 'Ireland' weight is zero), and the resulting projections give correlation (Sw_s)^T (Aw_a) = 4.]

8 What Sparsity Means
In the previous example:
- w_s,'funky' ≠ 0 → 'funky' is correlated with the audio signals → a musically meaningful word.
- w_s,'Ireland' = 0 → 'Ireland' is not correlated: it has no linear relationship with the acoustic representation.
In practice, w_s is dense (many non-zero values) even if most words are uncorrelated, due to random variability in the data.
Key idea: reformulate CCA to produce a sparse solution.

9 Introducing Sparse CCA [ICML07]
Plan: penalize the objective function for each non-zero semantic dimension.
Pick a penalty function f(w_s) that penalizes each non-zero dimension:
- Take 1: cardinality, f(w_s) = |w_s|_0 — a combinatorial problem, NP-hard.
- Take 2: L1 relaxation, f(w_s) = |w_s|_1 — still non-convex here, and not a very tight approximation.
- Take 3: SDP relaxation — prohibitively expensive for large problems.
Solution: f(w_s) = Σ_i log |w_s,i| — still non-convex, but a tight approximation that can be solved efficiently with a DC (difference of convex functions) program.

10 Introducing Sparse CCA [ICML07] (continued)
With the penalty f(w_s) = Σ_i log |w_s,i|, use a tuning parameter η to control the importance of sparsity: increasing η → a smaller set of 'musically relevant' words.

max_{w_a, w_s} (Sw_s)^T (Aw_a) − η f(w_s)
subject to: (Sw_s)^T (Sw_s) = 1
            (Aw_a)^T (Aw_a) = 1
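The paper solves this with a DC program; as a loose illustration only, the sketch below substitutes a much simpler scheme: whiten each view so the covariance constraints become unit-norm constraints (S^T S ≈ I, A^T A ≈ I), then alternate updates, mimicking the log penalty with iteratively reweighted soft-thresholding. All function names and parameter choices are hypothetical.

```python
import numpy as np

def sparse_cca_sketch(S, A, eta=0.1, n_iter=50, eps=1e-3):
    """Toy sparse CCA via alternating maximization.

    Simplifications (NOT the paper's DC program):
    - assumes whitened views, so S^T S ~ I and A^T A ~ I and the
      constraints reduce to unit-norm weight vectors;
    - mimics the log penalty sum_i log(|w_i| + eps) with iteratively
      reweighted soft-thresholding of the semantic weights.
    """
    C = S.T @ A                          # cross-covariance between the views
    d_s, d_a = C.shape
    w_s = np.ones(d_s) / np.sqrt(d_s)
    w_a = np.ones(d_a) / np.sqrt(d_a)
    for _ in range(n_iter):
        # acoustic update: plain CCA step (no sparsity on this view)
        w_a = C.T @ w_s
        w_a /= np.linalg.norm(w_a) + 1e-12
        # semantic update: reweighted soft-threshold; the weight
        # 1/(|w_i| + eps) is the gradient of log(|w_i| + eps)
        z = C @ w_a
        lam = eta / (np.abs(w_s) + eps)
        w_s = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
        norm = np.linalg.norm(w_s)
        if norm == 0:                    # eta too large: everything pruned
            break
        w_s /= norm
    return w_s, w_a
```

Increasing eta drives more entries of w_s exactly to zero, which is the pruning behavior the experiment slides exploit.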

11 Experimental Setup
CAL500 data set [SIGIR07]: 500 songs by 500 artists.
Semantic representation:
- 173 words: genre, instrumentation, usages, emotions, vocals, etc.
- Each annotation vector is the average over 3+ listeners.
- A word agreement score measures how consistently listeners apply a word to songs.
Acoustic representation:
- Bag of Dynamic MFCC vectors [McKinney03]: 52-D vectors of spectral modulation intensities (sketched below).
- 160 vectors per minute of audio content.
- The annotation vector is duplicated for each Dynamic MFCC vector.
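For a rough sense of the feature extraction, here is a sketch of McKinney-style Dynamic MFCCs (modulation energies of the MFCC trajectories). It assumes the librosa library, and the window size and modulation-band edges are illustrative guesses, not the paper's exact settings:

```python
import numpy as np
import librosa  # assumption: features built on standard MFCCs via librosa

def dynamic_mfcc(y, sr, n_mfcc=13, win=32, hop=16):
    """Sketch of 52-D Dynamic MFCC vectors: for each ~3/4 s window,
    the modulation spectrum of each MFCC trajectory is summed into
    4 bands, giving 4 x 13 = 52 dimensions per vector."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)      # (13, T)
    frame_rate = sr / 512                                       # librosa's default hop
    freqs = np.fft.rfftfreq(win, d=1.0 / frame_rate)            # modulation freqs (Hz)
    bands = [(0, 1), (1, 2), (2, 8), (8, 22)]                   # illustrative edges
    feats = []
    for start in range(0, mfcc.shape[1] - win + 1, hop):
        seg = mfcc[:, start:start + win]
        spec = np.abs(np.fft.rfft(seg, axis=1))                 # per-coefficient modulation
        v = [spec[:, (freqs >= lo) & (freqs < hi)].sum(axis=1) for lo, hi in bands]
        feats.append(np.concatenate(v))                         # one 52-D vector
    return np.array(feats)                                      # rows go into A

# usage sketch: y, sr = librosa.load("song.wav"); A_i = dynamic_mfcc(y, sr)
```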

12 Experiment 1: Qualitative Results
Words with high acoustic correlation: hip-hop, arousing, sad, drum machine, heavy beat, at a party, rapping.
Words with no acoustic correlation: classic rock, normal, constant energy, going to sleep, falsetto.

13 Experiment 2: Vocabulary Pruning
AMG2131 text corpus [ISMIR06]:
- AMG Allmusic song reviews for most of the CAL500 songs.
- 315-word vocabulary.
- Annotation vectors are based on the presence or absence of a word in the review (sketched below).
- Noisier word-song relationships than CAL500.
Experimental design:
1. Merge the vocabularies: 173 + 315 = 488 words.
2. Prune noisy words as the amount of sparsity in Sparse CCA increases.
Hypothesis: AMG words will be pruned before CAL500 words.
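As referenced in the list above, a minimal sketch of the presence/absence annotation (hypothetical review text and vocabulary; crude substring matching, for illustration only):

```python
# Hypothetical example: a binary annotation vector from one AMG-style review.
vocab = ["funk", "jazz", "guitar", "sad"]
review = "A late-night jazz ballad with brushed drums and a sad, smoky guitar."
s = [1.0 if word in review.lower() else 0.0 for word in vocab]
print(s)  # [0.0, 1.0, 1.0, 1.0]
```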

14 Experiment 2: Vocabulary Pruning (results)
Design recap: merge the two vocabularies (488 words), then prune as sparsity increases.
Result: as Sparse CCA becomes more aggressive, proportionally more AMG words are pruned.

Vocabulary size   # CAL500 words   # AMG2131 words   AMG2131 fraction
            488              173               315               0.64
            249              118               131               0.52
            149               85                64               0.42
             50               39                11               0.22

15 Experiment 3: Vocabulary Selection
Experimental design:
1. Rank words either by how aggressive Sparse CCA must be before the word is pruned (sketched below), or by how consistently humans use the word across the CAL500 corpus.
2. As the vocabulary size decreases, calculate the average AROC.
Result: Sparse CCA does select words with better AROC.
[Plot: average AROC versus vocabulary size; as the vocabulary shrinks from 173 words, the average AROC of the retained words rises from roughly 0.68 toward 0.76.]
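A sketch of the first ranking criterion, reusing the hypothetical sparse_cca_sketch defined earlier: sweep the sparsity parameter and record the level at which each word's weight first reaches zero.

```python
import numpy as np

def pruning_order(S, A, vocab, etas):
    """Rank words by the sparsity level at which sparse_cca_sketch
    (defined above) first prunes them; words that survive the largest
    eta are taken to be the most 'musically meaningful'."""
    pruned_at = {w: np.inf for w in vocab}
    for eta in sorted(etas):
        w_s, _ = sparse_cca_sketch(S, A, eta=eta)
        for i, w in enumerate(vocab):
            if w_s[i] == 0.0 and pruned_at[w] == np.inf:
                pruned_at[w] = eta
    # never-pruned words (pruned_at = inf) sort first, i.e. most meaningful
    return sorted(vocab, key=lambda w: -pruned_at[w])
```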

16 Recap
Constructing a meaningful vocabulary is the first step in building a content-based, natural-language search engine for music.
Given a semantic representation and an acoustic representation, Sparse CCA can be used to find 'musically meaningful' words, i.e., semantic dimensions that are linearly correlated with the audio features.
Automatically pruning words is important when using noisy sources of semantic information, e.g., LastFM tags or web documents.

17 Future Work
Theory: move beyond linear correlation with kernel methods.
Application: Sparse CCA can also find 'musically meaningful' audio features, by imposing sparsity in the acoustic space instead.
Practice: handle large, noisy, semantically annotated music corpora.

18 (closing slide)
Identifying Words that are Musically Meaningful
David Torres, Douglas Turnbull, Luke Barrington, Gert Lanckriet
Computer Audition Lab, UC San Diego
ISMIR, September 25, 2007

19 Experiment 3: Vocabulary Selection (backup)
Our content-based music search engine rank-orders songs given a text-based query [SIGIR07].
- The area under the ROC curve (AROC) measures the quality of each ranking: 0.5 is random, 1.0 is perfect.
- 0.68 is the average AROC over all one-word queries.
Can Sparse CCA pick words that will have higher AROC?
- Idea: words with high correlation have more signal in the audio representation and should be easier to model.
- How does this compare to picking words that humans consistently use to label songs?
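For reference, a minimal sketch of computing AROC for a single one-word query, using scikit-learn (hypothetical relevance labels and ranking scores):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical one-word query (e.g., "funky"):
# relevant[i] = 1 if listeners applied the word to song i, else 0;
# score[i] = the search engine's relevance score for song i.
relevant = np.array([1, 0, 1, 1, 0, 0, 0, 1])
score    = np.array([0.9, 0.2, 0.7, 0.4, 0.5, 0.1, 0.3, 0.8])
print(roc_auc_score(relevant, score))  # 1.0 = perfect ranking, 0.5 = random
```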

