Collective Annotation of Music from Multiple Semantic Categories

Zhiyao Duan 1,2, Lie Lu 1, and Changshui Zhang 2
1. Microsoft Research Asia (MSRA), Beijing, China. 2. Department of Automation, Tsinghua University, Beijing, China.
Summary

 Two collective semantic annotation methods for music, modeling not only individual labels but also label correlations.
 50 musically relevant labels are manually selected for music annotation, covering 10 aspects of music perception.
 Normalized mutual information is employed to measure the correlation between two semantic labels; label pairs with strong correlation are selected and modeled.
 Generative: Gaussian Mixture Model (GMM)-based method. Discriminative: Conditional Random Field (CRF)-based method.
 Experimental results show slight but consistent improvements over the corresponding individual annotation methods.

Experiments

Results:
 Per-category performance (the performance for each category):
1. CRF-based methods outperform GMM-based methods.
2. Collective annotation methods slightly but consistently improve on their individual counterparts, for both the GMM-based and the CRF-based approaches.
 Per-song performance (the average performance for a song):
1. While the recalls are similar, precision improves significantly from the generative models to the discriminative models.
2. The collective methods slightly outperform their individual counterparts.

Open question:
 The improvement from individual to collective modeling is small. A possible reason: in the individual methods, "correlated" labels share many songs in their training sets (since each song has multiple labels), so the trained models of correlated labels are themselves "correlated"; in other words, the correlation is already modeled implicitly.

Motivation

 Semantic annotation of music is an important research direction.
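The NormMI-based pair selection described in the Summary can be sketched in a few lines. The exact normalization of Eq. (1) is not reproduced in this transcript, so the sqrt(H(X)·H(Y)) denominator below is an assumption, chosen only because it satisfies the three properties the poster states (values in [0, 1], zero for independent labels, one for identical labels); `norm_mi` and the toy label sequences are illustrative, not from the poster.

```python
import math

def entropy(probs):
    """Entropy (in bits) of a discrete distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def norm_mi(x, y):
    """Normalized mutual information between two binary (+1/-1) label sequences.

    Normalizing I(X; Y) by sqrt(H(X) * H(Y)) is an assumption; the poster only
    states the properties 0 <= NormMI <= 1, NormMI = 0 for independent labels,
    and NormMI(X; X) = 1, which this form satisfies.
    """
    n = len(x)
    # Joint and marginal distributions over the label values {1, -1}.
    joint = {}
    for xi, yi in zip(x, y):
        joint[(xi, yi)] = joint.get((xi, yi), 0) + 1
    px = {v: sum(c for (a, _), c in joint.items() if a == v) / n for v in (1, -1)}
    py = {v: sum(c for (_, b), c in joint.items() if b == v) / n for v in (1, -1)}
    hx, hy = entropy(px.values()), entropy(py.values())
    mi = sum((c / n) * math.log2((c / n) / (px[a] * py[b]))
             for (a, b), c in joint.items())
    if hx == 0 or hy == 0:
        return 0.0
    return mi / math.sqrt(hx * hy)
```

With this measure, pair selection is just a threshold test over all label pairs, e.g. keeping those with `norm_mi(...) > 0.1` as in the experiments.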
Semantic labels (text, words) are a more compact and efficient representation than raw audio or low-level features, and potentially facilitate applications such as music retrieval and recommendation.
 Disadvantages of previous methods:
1. The vocabulary has no structured labels, so annotations do not cover sufficient musical aspects.
2. Only audio-label relations are modeled, without label-label relations, e.g. "hard rock" & "electric guitar", "happy" & "minor key".
 Therefore, we divide the semantic vocabulary into categories and attempt to model label correlations.

Semantic Vocabulary

Properties of NormMI:
 0 <= NormMI(X; Y) <= 1;
 NormMI(X; Y) = 0 when X and Y are statistically independent;
 NormMI(X; X) = 1.
5. Only the label pairs whose NormMI values are larger than a threshold are selected to be modeled.

Audio Feature Extraction

A bag of beat-level feature vectors is used to represent a song:
1. Each song is divided into beat segments.
2. Each segment contains a number of frames of 20 ms length with 10 ms overlap.
3. Timbre features (94-d) and rhythm features (8-d) are extracted to compose a 102-d feature vector for each segment.
4. PCA reduces the dimensionality to 65, preserving 95% of the energy.
 Timbre features: means and standard deviations of 8-order MFCCs, spectral shape features, and spectral contrast features.
 Rhythm features: average tempo, average onset frequency, rhythm regularity, rhythm contrast, rhythm strength, and average drum frequency, amplitude, and confidence [1].

Proposed methods: consider the relations between labels.
1) Collective GMM-based method: approximates the posterior (Eq. 4) by combining the single-label posteriors with the posteriors of the selected label pairs, weighted by a trade-off parameter between the two terms. The single-label and label-pair likelihoods are estimated with 8-kernel GMMs from training data.
2) Collective CRF-based method: a Conditional Random Field (CRF) is an undirected graphical model; nodes are label variables, and edges are relations between labels.
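The collective GMM-based combination just described can be illustrated with a toy decoder. Eq. (4) itself is not reproduced in the transcript, so the weighted sum of single-label and label-pair log-posteriors below, with trade-off `eta`, is an assumed form; the probability tables and the names `collective_score` and `annotate` are hypothetical stand-ins for the GMM outputs.

```python
import itertools
import math

# Toy posteriors for a single song, standing in for GMM outputs.
# single[i][a] ~ P(a_i = a | song); pair[(i, j)][(a, b)] ~ P(a_i = a, a_j = b | song).
single = {
    0: {1: 0.6, -1: 0.4},
    1: {1: 0.3, -1: 0.7},
    2: {1: 0.5, -1: 0.5},
}
pair = {  # only the strongly correlated label pairs are modeled
    (0, 2): {(1, 1): 0.45, (1, -1): 0.15, (-1, 1): 0.1, (-1, -1): 0.3},
}

def collective_score(labels, eta=0.5):
    """Assumed Eq. (4)-style score: trade-off between node and edge terms."""
    node = sum(math.log(single[i][a]) for i, a in enumerate(labels))
    edge = sum(math.log(pair[(i, j)][(labels[i], labels[j])]) for (i, j) in pair)
    return (1 - eta) * node + eta * edge

def annotate(eta=0.5):
    # Brute-force MAP over all label vectors (feasible only for a toy vocabulary).
    return max(itertools.product((1, -1), repeat=3),
               key=lambda y: collective_score(y, eta))
```

Brute-force enumeration is only workable here because the toy vocabulary has three labels; with the poster's 50 labels an approximate inference scheme would be needed.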
Multi-label classification CRF model [2] (Eq. 5): the probability of an output label vector, given a song represented by an input feature vector, is an exponential of weighted feature functions divided by a normalizing factor. The node and edge features of the CRF are predefined real-valued functions; their weights are the parameters estimated from training data.
Note: different from the GMM-based method, a "bag of features" cannot be used here; instead, each song is represented by a 115-d feature vector: 115-d = 65-d (mean of the beat-level features) + 50-d (word likelihoods).

Data set:
 ~5,000 Western popular songs;
 manually annotated with semantic labels from the vocabulary in Table 1, according to the label-number limitations;
 25% for training, 75% for testing;
 the 49 label pairs with NormMI > 0.1 are selected to be modeled.

Compared methods:
1. Collective GMM-based method
2. Individual GMM-based method
3. Collective CRF-based method
4. Individual CRF-based method: the CRF framework in Eq. (5) without the "overall potential of edges".

Semantic vocabulary (Table 1):
1. Consists of 50 labels, manually selected from web-parsed musically relevant words.
2. 10 semantic categories (aspects).
3. A limit on the number of labels in each category for annotation.

Problem: find semantic words to describe a song. This can be viewed as a multi-label binary classification problem.
Input: a vocabulary of labels (words); a bag of feature vectors of a song.
Output: an annotation vector of binary variables, one per label (1: presence, -1: absence).
Solution: Maximum A Posteriori (MAP).

Previous methods treat the labels as independent.
 Individual GMM-based method (Eqs. 2-3): each label is annotated from its own posterior, which is proportional to the likelihood times the prior. The likelihood is estimated with a GMM from training data; the prior probability can be set to a uniform distribution.

[Equation annotations: single-label posterior; label-pair posterior; overall potential of nodes; overall potential of edges. Table 1: Vocabulary; Table 2: Selected pairs; Table 3; Table 4.]

Future Work
1. Further explore better methods to model label correlations.
2. Explore better features, especially the song-level feature vector for the CRF-based methods.
3. Apply the obtained annotations in various applications, such as music similarity measurement, music search, and recommendation.

From the Semantic Vocabulary section:
4. Normalized Mutual Information (NormMI) is used to measure the correlation of each label pair (Eq. 1), where H(X) denotes the entropy of X and I(X; Y) the mutual information between X and Y.

References
[1] Lu, L., Liu, D., and Zhang, H.J., "Automatic mood detection and tracking of music audio signals," IEEE Trans. on Audio, Speech and Lang. Process., vol. 14, no. 1, pp. 5-18, 2006.
[2] Ghamrawi, N. and McCallum, A., "Collective multi-label classification," in Proc. 14th ACM International Conference on Information and Knowledge Management (CIKM), 2005, pp. 195-200.
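For concreteness, the multi-label CRF of Eq. (5) [2] can be sketched on a two-label toy problem. The feature functions, weights, and per-label scores below are illustrative assumptions; only the overall shape follows the description above: weighted node features and edge features inside an exponential, normalized by the factor Z(x).

```python
import itertools
import math

EDGES = [(0, 1)]  # modeled label pairs

def node_feat(x, i, yi):
    # Illustrative real-valued node feature: label agrees with a per-label score of x.
    return yi * x[i]

def edge_feat(yi, yj):
    # Illustrative edge feature: reward the pair taking the same value.
    return 1.0 if yi == yj else 0.0

def unnorm(x, y, lam=1.0, mu=0.5):
    """Unnormalized potential: exp of weighted node and edge features."""
    s = sum(lam * node_feat(x, i, yi) for i, yi in enumerate(y))
    s += sum(mu * edge_feat(y[i], y[j]) for (i, j) in EDGES)
    return math.exp(s)

def prob(x, y, lam=1.0, mu=0.5):
    """P(y | x): unnormalized potential divided by the normalizing factor Z(x)."""
    z = sum(unnorm(x, v, lam, mu)
            for v in itertools.product((1, -1), repeat=len(y)))
    return unnorm(x, y, lam, mu) / z

x = [0.8, -0.2]  # hypothetical per-label scores from the song's feature vector
probs = {y: prob(x, y) for y in itertools.product((1, -1), repeat=2)}
```

Setting `mu = 0` removes the edge term, which mirrors how the individual CRF-based baseline drops the "overall potential of edges" from Eq. (5).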

