1 Timbral Analysis of Music Audio Signals with Convolutional Neural Networks
Jordi Pons, Olga Slizovskaia, Rong Gong, Emilia Gómez, Xavier Serra. Hello everyone, I am Rong Gong, a PhD student in the Music Technology Group, Universitat Pompeu Fabra, Barcelona, and the third author of the paper presented here, "Timbral Analysis of Music Audio Signals with Convolutional Neural Networks". The first author, my colleague Jordi Pons, is currently doing an internship in the US, so he is unable to come and present the paper.

2 Research goal To discover novel deep learning architectures that can efficiently learn timbre representations. Previous work on learning temporal features: Jordi Pons, Xavier Serra, "Designing efficient architectures for modeling temporal features with convolutional neural networks", ICASSP 2017. Our goal is to discover novel deep learning architectures that can efficiently learn timbre representations. My colleague previously designed efficient convolutional neural networks to model temporal features such as tempo and rhythm, as referenced here. In this work, we focus on learning timbre representations.

3 Presentation structure
Motivation. State of the art. Architecture design strategy. Three experiments and results. We will follow this structure: first the motivation, explaining why we use convolutional neural networks to learn timbral representations; then the state-of-the-art CNN design strategies for timbral representation learning; after that, the proposed design strategy; and finally, three experiments whose results show that our strategy learns timbral representations efficiently.

4 Motivation

5 Timbral descriptions – traditional approaches
Bag-of-features: statistics of frame-based features (spectral centroid, flatness, MFCCs, etc.) that do NOT consider the temporal evolution. Temporal modeling: hidden Markov models, or time-frequency patterns learned by NMD bases; descriptors and temporal models are NOT jointly designed. Traditionally, musical timbre was described with bags of features: statistics of frame-based features such as spectral centroid, spectral flatness, and MFCCs. Their drawback is that they ignore the temporal evolution of timbre. On the other hand, there are temporal modeling methods for these frame-based features, for example hidden Markov models, or time-frequency patterns learned by NMD (non-negative matrix deconvolution) bases. The drawback there is that the features and the temporal models are not jointly learned. A minimal code sketch of the bag-of-features approach follows.
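For illustration, a minimal sketch of this bag-of-features pipeline, assuming librosa is available ("song.wav" is a placeholder path): per-feature statistics over time summarize the track, so the order of frames, and hence the temporal evolution, is lost.

```python
# Minimal bag-of-features sketch (assumption: librosa; "song.wav" is a placeholder).
import numpy as np
import librosa

y, sr = librosa.load("song.wav", sr=22050, mono=True)

centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # shape (1, n_frames)
flatness = librosa.feature.spectral_flatness(y=y)         # shape (1, n_frames)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # shape (13, n_frames)

frames = np.vstack([centroid, flatness, mfcc])
# Statistics over the time axis discard the temporal evolution of timbre:
bag_of_features = np.concatenate([frames.mean(axis=1), frames.std(axis=1)])
```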

6 Deep learning and convolutional neural networks
Advantages of deep learning: no strong assumptions over the input descriptor (a log-mel magnitude spectrogram), and the ability to learn spectro-temporal descriptors when the input patch is wider than 1 frame. Convolutional neural networks (CNNs): able to learn spectro-temporal filters, and able to exploit invariances by sharing parameters. We consider deep learning, especially convolutional neural networks, for representing timbre because, first, it does not require much feature engineering: we can use a perceptually motivated log-mel magnitude spectrogram as the network input. Second, it can learn spectral and temporal descriptors when the input patch is wider than one frame. CNNs provide further benefits: they can learn spectro-temporal filters, and they can exploit invariances, such as time or frequency invariance in the mel spectrogram, by sharing parameters.
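A minimal sketch of this input descriptor, again assuming librosa; the window, hop, and band settings are illustrative assumptions, not values from the paper.

```python
# Monoaural log-mel magnitude spectrogram as network input (parameters are assumptions).
import librosa

y, sr = librosa.load("song.wav", sr=22050, mono=True)     # placeholder path
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=96)
log_mel = librosa.power_to_db(mel)                        # shape (96 bands, n_frames)
# A patch wider than 1 frame lets the network learn spectro-temporal descriptors:
patch = log_mel[:, :128]                                  # e.g. a 96x128 patch
```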

7 State of the art

8 CNN filter design: small-rectangular filters vs. high filters
Small-rectangular filters (e.g., 3x3 or 5x5): in the first layer, NOT able to learn spectro-temporal patterns with a large frequency spread (e.g., cymbals, snare drum). High filters: many weights, prone to overfitting or to fitting noise. Having decided to use CNNs for timbral representation learning, let us look at the state of the art. We identify two general trends in designing CNN architectures, particularly in the choice of filter shapes. The first trend uses small rectangular filters, such as 3x3 or 5x5. Its drawback is that the first layer cannot learn spectro-temporal patterns with a large frequency spread. The other trend uses high filters with a large frequency spread. Although high filters can capture most timbral patterns, they have many weights and are therefore prone to overfitting or to fitting noise. A rough weight count follows.
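A back-of-the-envelope weight count for the two trends (128 filters per layer is an illustrative assumption; biases and input channels omitted):

```python
n_filters = 128
small_rect = n_filters * 3 * 3   # 3x3 small-rectangular filters:  1,152 weights
tall = n_filters * 12 * 8        # 12x8 high filters:             12,288 weights
print(small_rect, tall)          # ~10x more weights: more prone to overfit or fit noise
```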

9 Noise-fitting example – high filters
12x8 filters learn redundant information. Left: an onset pattern with redundancy along the frequency axis; a small 1x3 filter would suffice. Right: a harmonic pattern with redundancy along the temporal axis that captures only three harmonics. Here is a noise-fitting example with high filters. Below are two learned filters of size 12x8. The left one shows an onset pattern, but with much redundancy along the frequency axis; a small 1x3 filter, like the red box shown, would be enough in this case. The right one learned a harmonic pattern, but with much redundancy along the temporal axis; moreover, this filter is too short and captured only three harmonics.

10 Architecture design strategy

11 Timbre definition Timbre is defined by what it is NOT:
a set of auditory attributes of sound events IN ADDITION TO pitch, loudness, duration, and spatial position. After presenting the drawbacks of previous architecture design strategies, let us introduce the proposed one. First, consider the definition of timbre. In the literature, timbre is defined by what it is not: a set of auditory attributes of sound events in addition to pitch, loudness, duration, and spatial position. This means timbre is invariant to these attributes, so we can design CNN architectures to capture these invariances.

12 Invariance Pitch invariance: convolve and max-pool over the frequency axis. Loudness invariance: L2-normalization of the filter weights. Duration invariance: m x n filters learn fixed-duration patterns. Spatial position invariance: use a monoaural down-mixed input. We argue that convolving over the frequency axis of a mel spectrogram learns pitch-independent timbre representations, and that max-pooling over the frequency axis removes the frequency resolution of the feature map. We use L2-norm regularization to keep the filter weights at low energy, which achieves loudness invariance. We use m x n filters to learn representations of a fixed duration. And we use a monoaural down-mixed input to remove spatial traits. A code sketch of these four choices follows.
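A hedged PyTorch sketch of how each invariance maps onto an architectural choice; the framework, filter size, and hyperparameters are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 1, 96, 128)                # monoaural down-mix: spatial invariance

conv = nn.Conv2d(1, 32, kernel_size=(50, 5))  # m x n filter: fixed-duration patterns
h = F.elu(conv(x))                            # convolving over frequency: a pitch shift
                                              # becomes a translation of the feature map
h = F.max_pool2d(h, kernel_size=(h.shape[2], 1))  # max-pool out the frequency axis

# Loudness invariance: keep filter weights at low energy via an L2 penalty
# (weight decay is one standard way to impose the slide's L2-norm regularization):
optimizer = torch.optim.SGD(conv.parameters(), lr=0.01, weight_decay=1e-4)
```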

13 Use domain knowledge and different filter shapes
Ex.: to capture phoneme patterns, we use a 70x10 filter for unvoiced consonants, a 50x1 filter for low-pitch harmonics, a 50x5 filter for voiced consonants, and a 70x1 filter for high-pitch harmonics. The first core element of our design strategy is to use musical domain knowledge to design the filter shapes; this example uses different filter shapes to capture phoneme patterns. The second core element is to use different filter shapes within the first layer, which the same example illustrates: different filter shapes in the first layer capture different timbre characteristics, as the sketch below shows.
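A sketch of this core idea under the same PyTorch assumptions: several domain-informed filter shapes sit in one first layer, and their frequency-pooled outputs are concatenated (channel counts are illustrative).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiShapeLayer(nn.Module):
    """First layer with several filter shapes, merged after frequency pooling."""
    def __init__(self, shapes=((70, 10), (50, 5), (50, 1), (70, 1)), n_filters=32):
        super().__init__()
        # (height, width) = (frequency, time); shapes taken from the slide.
        self.convs = nn.ModuleList([nn.Conv2d(1, n_filters, s) for s in shapes])

    def forward(self, x):                          # x: (batch, 1, mel_bands, frames)
        outs = []
        for conv in self.convs:
            h = F.elu(conv(x))
            h = F.max_pool2d(h, (h.shape[2], 1))   # pool the whole frequency axis
            outs.append(h.flatten(start_dim=1))
        return torch.cat(outs, dim=1)              # concatenate all filter shapes

features = MultiShapeLayer()(torch.randn(2, 1, 80, 21))  # e.g. 80x21 input patches
```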

14 Experiments

15 Experiments Three experiments to validate the design strategy
Singing voice phoneme classification. Musical instrument recognition. Music auto-tagging. We assess the proposed design strategy with three timbre modeling experiments. Through them, we will show that a shallow network architecture with multiple filter shapes in the first layer can be highly expressive.

16 Common configuration Input: Monoaural log-mel magnitude spectrogram
Activation function: Exponential linear units (ELUs) Regularization: L2-norm of the weights Loss function: cross-entropy Here are the common configurations used in the three experiments:
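A hedged PyTorch rendering of these shared settings; the original framework and the exact regularization strength and learning rate are not stated on the slide, so the values below are assumptions.

```python
import torch
import torch.nn as nn

activation = nn.ELU()                 # exponential linear units
loss_fn = nn.CrossEntropyLoss()       # cross-entropy loss

def make_optimizer(model, lr=1e-3, l2=1e-4):
    # L2-norm regularization of the weights, expressed here as weight decay.
    return torch.optim.Adam(model.parameters(), lr=lr, weight_decay=l2)
```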

17 Singing voice phoneme classification
Music style: Beijing opera singing (Chinese). Small dataset: 2 hours of audio, split into train, validation, and test sets. Two role-types: dan (young woman) and laosheng (old man). Input patch size: 80x21. 32 phoneme classes. Single layer: 128 filters of 50x1 and 70x1, 64 of 50x5 and 70x5, 32 of 50x10 and 70x10. Max-pooling coef. 2 over the frequency axis. The goal of this experiment is to classify Beijing opera singing phoneme excerpts into 32 phoneme classes. This is a small-dataset problem: only 2 hours of audio form the training, validation, and test sets. The data contains two role-types, which can be understood as two different speakers, so the dataset is divided into two parts, one per role-type, which further reduces the training set. We use a single-layer architecture whose only CNN layer contains six different filter shapes to capture different phoneme patterns, with max-pooling over the frequency axis; a sketch follows.
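The slide's single-layer network instantiates the multi-shape pattern sketched earlier; a compact version under the same PyTorch assumptions (the linear classifier head is an assumption).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

shape_counts = {(50, 1): 128, (70, 1): 128,    # (freq, time): number of filters
                (50, 5): 64,  (70, 5): 64,
                (50, 10): 32, (70, 10): 32}
convs = nn.ModuleList([nn.Conv2d(1, n, s) for s, n in shape_counts.items()])

x = torch.randn(1, 1, 80, 21)                  # one 80x21 input patch
feats = torch.cat([F.max_pool2d(F.elu(c(x)), (2, 1)).flatten(1)  # pool coef. 2 over freq
                   for c in convs], dim=1)
logits = nn.Linear(feats.shape[1], 32)(feats)  # 32 phoneme classes
```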

18 Singing voice phoneme classification
Small-rectangular: 3x3 filters, 5-layer VGG-net. GMMs: 40 components, MFCC input. MLP: 2 layers. For comparison, we take a small-rectangular-filter architecture (a 5-layer VGG-net), GMMs with 40 components, and a 2-layer multilayer perceptron. As the table shows, the proposed multi-filter architecture achieves the best performance for both role-types. And the number of parameters matters: with fewer parameters, we are less prone to overfitting given the same amount of training data.

19 Musical instrument recognition
Dataset: IRMAS, 6705 training samples, 3 s each. Input patch size: 96x128. Batch normalization after each convolutional layer. 11 instrument classes. Single layer: 128 filters of 5x1 and 80x1, 64 of 5x3 and 80x3, 32 of 5x5 and 80x5; max-pooling over the frequency axis. Multi-layer: the same multi-filter layer; max-pooling 12,16; two layers of 128 3x3 filters; max-pooling 2,2; a 256-node dense layer. The goal of this experiment is to recognize the predominant musical instrument. The training set contains 6705 samples, each labeled with a single instrument; the samples in the test set have multiple labels. We experiment with two architectures: the first has a single CNN layer with six different filter shapes; the second, which we call multi-layer, contains the same multi-filter layer plus two small-rectangular-filter layers and one dense layer.

20 Musical instrument recognition
Bosch: bag-of-features + SVM. Han: 9-layer VGG-net with 3x3 filters. For comparison we take two baselines: Bosch, which uses bag-of-features plus an SVM classifier, and Han, the state-of-the-art 9-layer small-rectangular VGG-net. We use two types of evaluation metrics. The micro metric takes per-class sample support into account; on it, the proposed method performs almost on par with the state-of-the-art Han method, yet with only half the parameters, meaning the proposed architecture is as powerful as the state of the art while much less prone to overfitting. On the macro metric, which ignores class support, the proposed architecture performs best, again with only half the parameters. A small example of the two metrics follows.
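To make the two metrics concrete, a toy scikit-learn example (an illustration, not the paper's evaluation code): micro averaging weights classes by their sample support, macro averaging counts every class equally.

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 2]   # class 0 dominates the support
y_pred = [0, 0, 0, 0, 2, 1]   # majority class right, minority classes wrong
print(f1_score(y_true, y_pred, average="micro"))  # ~0.67: dominated by class 0
print(f1_score(y_true, y_pred, average="macro"))  # ~0.33: each class counts equally
```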

21 Music auto-tagging
Dataset: MagnaTagATune, 30 s clips. Predicting the top-50 tags: instruments, genres, and others. Input patch size: 96x187. Batch normalization after each convolutional layer. Multi-filter layer: 10 filters of 100x1, 6 of 100x3, 3 of 100x5, 3 of 100x7; 15 of 75x1, 10 of 75x3, 5 of 75x5, 5 of 75x7; 15 of 25x1, 10 of 25x3, 5 of 25x5, 5 of 25x7. Max-pooling over the frequency axis. 100-node dense layer. The goal of the last experiment is to predict the top-50 tags on the MagnaTagATune dataset. Each sample has multiple tags, which makes this a multi-label auto-tagging problem. We use one CNN layer with 12 different filter shapes plus one dense layer.
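Because each clip carries several tags, the output layer is typically 50 independent sigmoid units trained with per-tag binary cross-entropy; a sketch of that assumption in PyTorch (the slides only say "cross-entropy").

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 50)                     # a batch of 4 clips, 50 tag scores
targets = torch.randint(0, 2, (4, 50)).float()  # multi-hot tag labels
loss = nn.BCEWithLogitsLoss()(logits, targets)  # sigmoid + binary cross-entropy per tag
```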

22 Music auto-tagging Choi: 5 CNN layers, 3x3 filters
Small-rectangular: an adaptation of Choi with fewer parameters. Dieleman: 2 CNN layers, high filters. We run two experiments with different numbers of parameters. On the left of the table, we fix the parameter count of all three architectures to 75k; there, the proposed architecture performs best. We then increase the number of filters in the proposed architecture and compare it with Choi's original architecture: with twice as many filters, the performance is almost equivalent to Choi's, yet with far fewer parameters. Increasing the filter count further degrades performance.

23 Conclusion The proposed architecture uses
different filter shapes in the first layer, with the shapes designed using domain knowledge. It achieved the state-of-the-art result on a small dataset, and results equivalent to the state of the art on the larger datasets, with fewer parameters.

24 Thank you! Any questions?

