
2 Content-Based Retrieval of Music and Audio
Seminar: CS591k Multimedia Systems
By Rahul Parthe and Anirudha Vaidya
Instructor: Dr. Donald Adjeroh

3 Introduction
An audio search engine that retrieves, from a large database, sound files similar to an input query sound. Sounds are characterized by "templates" derived from a tree-based vector quantizer trained to maximize mutual information (MMI).

4 Basic Operation
- A corpus containing different classes of audio files is parameterized into feature vectors.
- A tree-based quantizer is constructed.
- An audio template is generated from the parameterized data; the template captures the salient characteristics of the input audio.
- A template is constructed for the query audio and matched against the templates in the database.

5 Basic Operation [cont.]
[Fig 1: Audio template construction]

6 Audio Parameterization
The basic objective is to parameterize the audio files into mel-frequency cepstral coefficients (MFCCs). The audio waveform, sampled at 16 kHz, is transformed into a sequence of 13-dimensional feature vectors (12 MFCC coefficients plus an energy term).

7 Audio Parameterization: Steps
1. The audio is Hamming-windowed in overlapping steps. The window is 25 ms wide; hence 1 s of audio contains 500 overlapped windows.
2. Compute the log power spectrum of each window using a DFT.
3. Apply mel scaling, which emphasizes the mid-frequency bands in proportion to their perceptual importance.
4. Transform the mel-scaled coefficients into cepstral coefficients using a discrete cosine transform, yielding dimensionally uncorrelated features.
The audio waveform is thus transformed into 13-dimensional feature vectors (12 MFCCs + energy). A code sketch of this pipeline appears below.
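The pipeline above can be sketched in a few lines of Python. This is an illustration using librosa, not the original implementation (which predates it); the window and hop settings are assumptions chosen to match the 25 ms / 500-frames-per-second figures on this slide.

```python
import librosa
import numpy as np

def parameterize(path):
    # Load audio resampled to 16 kHz, as in the slides.
    y, sr = librosa.load(path, sr=16000)
    # 25 ms Hamming windows; hop chosen so ~500 frames cover 1 s of audio.
    n_fft = int(0.025 * sr)   # 400 samples per window
    hop = sr // 500           # 32-sample hop -> ~500 frames/s
    # 12 cepstral coefficients from the log mel power spectrum.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                                n_fft=n_fft, hop_length=hop,
                                window="hamming")
    # Append a per-frame log-energy term as the 13th dimension.
    frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop)
    energy = np.log(np.sum(frames ** 2, axis=0) + 1e-10)
    n = min(mfcc.shape[1], energy.shape[0])
    return np.vstack([mfcc[:, :n], energy[:n]]).T  # shape: (frames, 13)
```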

8 Audio Parameterization [cont.]
[Fig 2: Audio parameterization into mel cepstral coefficients]

9 Tree-Structured Quantization
The quantization tree (Q-tree) is grown offline using as much training data as possible. The quantization is supervised: the quantizer learns the critical distinctions between classes while ignoring other variability. The advantage of this technique is that it can find similarities between similar slides, intervals, or scales, despite lumping the time-dependent vectors into one time-independent template.

10 Tree Construction
- The quantizer tree partitions the feature space into distinct regions.
- Each threshold in the tree is chosen to maximize the mutual information I(X;C) between the data X and the associated class C.
- The best MMI split is found by considering all possible thresholds in all possible dimensions.
- Consider an MMI split along dimension d at threshold value t: the hyperplane divides the set of N training vectors X into two sets.
- At the first split (the root node), the left child Xb gets the training samples below the threshold, while the right child inherits those above it.
- The splitting process is repeated recursively on each child, yielding more nodes and splits in the tree (a code sketch follows the I(X;C) estimate two slides below).

11 Tree Construction [cont.]
Each node in the tree corresponds to a hyper-rectangular cell in the feature space. The leaves of the tree partition the feature space into non-overlapping regions, as shown.
[Fig 4: Nearest-neighbor MMI tree]

12 Estimating I(X;C)
For a candidate split with binary outcome A (which side of the threshold a sample falls on), the mutual information with the class C can be estimated from counts as

I(A;C) = H2(Pr(a)) - Σi Pr(ci) · H2(Pr(a | ci))

where H2 is the binary entropy function, H2(p) = -p log2 p - (1-p) log2(1-p). The probabilities Pr(ci) and Pr(a) are estimated from the data in the cell: Pr(ci) = Ni/N, the fraction of samples belonging to class ci, and Pr(a) = Na/N, the fraction of samples falling on side a of the split.
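Below is a minimal Python/NumPy sketch of this estimate together with the exhaustive split search described on slide 10. The function names, the midpoint choice of candidate thresholds, and the structure are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def split_mutual_information(labels, mask):
    """Estimate I(A;C) between a binary split A (boolean mask) and class
    labels C, via I(A;C) = H2(Pr(a)) - sum_i Pr(c_i) * H2(Pr(a | c_i))."""
    def h2(p):
        # Binary entropy; clipping treats 0*log(0) as 0.
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    n = len(labels)
    info = h2(mask.mean())                 # H2(Pr(a))
    for c in np.unique(labels):
        in_c = labels == c
        info -= (in_c.sum() / n) * h2(mask[in_c].mean())
    return info

def best_mmi_split(X, labels):
    """Search all dimensions and all candidate thresholds for the split
    that maximizes the estimated mutual information I(A;C)."""
    best = (-np.inf, None, None)           # (info, dimension, threshold)
    for d in range(X.shape[1]):
        values = np.unique(X[:, d])
        # Candidate thresholds: midpoints between consecutive values.
        for t in (values[:-1] + values[1:]) / 2:
            mask = X[:, d] < t             # True -> left child
            info = split_mutual_information(labels, mask)
            if info > best[0]:
                best = (info, d, t)
    return best
```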

13 Stopping Condition
The stopping rule decides when further splits are unnecessary and halts the recursive splitting process. The best-split mutual information is weighted by the probability mass inside the cell to be split, giving the stopping metric for cell lj:

stop(lj) = (Nj / N) · I(A;C)

where Nj is the number of data points in cell j and N is the total number of data points. Splitting stops when this metric falls below a threshold.
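Continuing the sketch above, recursive tree growth with this weighted stopping rule might look as follows; the stopping threshold value and the nested-dict tree representation are assumptions for illustration.

```python
def grow(X, labels, N_total, min_stop=0.01):
    """Recursively split a cell until the weighted best-split information
    (N_j / N_total) * I falls below min_stop. Returns a nested-dict tree."""
    info, d, t = best_mmi_split(X, labels)
    stop_metric = (len(X) / N_total) * info
    if d is None or stop_metric < min_stop:
        return {"leaf": True}              # this cell becomes a leaf
    mask = X[:, d] < t
    return {
        "leaf": False, "dim": d, "threshold": t,
        "left":  grow(X[mask], labels[mask], N_total, min_stop),
        "right": grow(X[~mask], labels[~mask], N_total, min_stop),
    }
```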

14 Template Generation
The tree partitions the feature space into L non-overlapping regions, or cells, each of which corresponds to a leaf of the tree. One approach is to label each leaf with a class name and use the tree as a classifier, but this is not robust, since classes overlap and a cell may contain data from many of them. The approach the paper suggests instead is to measure the ensemble of leaf probabilities for the quantized class data; in short, to use the histogram of leaf counts over a sequence of frames. The resulting histogram captures the essential class qualities so that it can be compared with other histograms (a sketch follows).
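A sketch of template construction under the same assumed tree representation: assign each leaf a dense integer id, quantize every frame to a leaf, and normalize the leaf counts into a histogram.

```python
import numpy as np

def number_leaves(tree, counter=None):
    """Assign a dense integer id to every leaf (in place); return L."""
    if counter is None:
        counter = [0]
    if tree["leaf"]:
        tree["id"] = counter[0]
        counter[0] += 1
    else:
        number_leaves(tree["left"], counter)
        number_leaves(tree["right"], counter)
    return counter[0]                      # total number of leaves L

def leaf_index(tree, x):
    """Quantize one feature vector to its leaf id.
    One scalar comparison per level: O(depth) work."""
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["dim"]] < tree["threshold"] else tree["right"]
    return tree["id"]

def template(tree, frames, n_leaves):
    """Histogram of leaf occupancies over all frames, normalized to sum 1."""
    hist = np.zeros(n_leaves)
    for x in frames:
        hist[leaf_index(tree, x)] += 1
    return hist / hist.sum()
```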

15 Template Generation [cont.]
Since the size of the tree determines the size of the templates, the tree can easily be pruned to give a variable number of free parameters per application, allowing better characterization of the data. Because each comparison is one-dimensional, quantization is rapid, taking only O(log N) time for an N-leaf tree.
[Figure: visual approximation of the vectors]

16 Distance Metrics
The generated templates must be compared with reference templates to determine which class they belong to; comparing them measures acoustic similarity. Several distance measures have been proposed, but the two main ones in use are Euclidean distance and cosine distance. The Euclidean measure treats the histograms as N-dimensional vectors and computes the L2 norm between them. The cosine measure also treats the histograms as N-dimensional vectors but measures the relative angle between them; it is often more effective because it is independent of the magnitudes of the vectors.

17 Distance Metrics [cont.]
Euclidean distance: DE(p, q) = sqrt( Σi (pi - qi)² )
Cosine measure: DC(p, q) = (p · q) / (‖p‖ ‖q‖), the cosine of the angle between p and q.
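Both measures are a few lines of NumPy; here p and q are the histogram templates built above.

```python
import numpy as np

def euclidean_distance(p, q):
    # L2 norm between histogram templates (smaller = more similar).
    return np.linalg.norm(p - q)

def cosine_similarity(p, q):
    # Angle-based comparison; invariant to template magnitude
    # (larger = more similar).
    return np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
```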

18 Classification
The query template is matched against the corpus templates using the distance measures discussed previously. The results are sorted into a list in order of similarity; they can be thought of as the output of a search engine like Google. The search must scan the full database, since the ranking requires computing the distance to every reference template.
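A sketch of the exhaustive ranking, using the cosine measure defined above; representing the corpus as a name-to-template dict is an assumption for illustration.

```python
def rank_corpus(query_template, corpus):
    """Exhaustive search: score the query against every corpus template
    and return (name, similarity) pairs sorted most-similar first."""
    scores = [(name, cosine_similarity(query_template, t))
              for name, t in corpus.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)
```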

19 Experiments & Results
1. Sound Retrieval
A simple test was conducted to compare the performance of the system with the Muscle Fish system on the web. Two types of trees were used: one quasi-supervised and the other supervised. Quasi-supervised means the tree was trained to separate the whole sample space into distinct classes, one per sample; this results in a number of cells in the feature space equal to the size of the sample space. The supervised tree classified each sample into a subclass or group with similar properties, which gives better results.

20 Experiments & Results [cont.]
Retrieval Average Precision (AP) for different schemes; the quantization-tree results use the unweighted cosine distance measure.

Sound class  | Q-tree (DC), unsupervised | Q-tree (DC), supervised | Muscle Fish (no DPL) | Muscle Fish (+DPL)
Laughter (M) | 0.68 | 0.82  | 1.00  | 1.00
Oboe         | 0.11 | 0.43  | 0.69  | 0.94
Agogo        | 1.00 | 1.00  | 0.53  | 0.58
Speech (F)   | 0.77 | 0.87  | 0.69  | 0.94
Touchtone    | 0.61 | 1.00  | 0.44  | 0.73
Rain/Thunder | 0.22 | 0.35  | 0.30  | 0.42
Mean AP      | 0.58 | 0.772 | 0.608 | 0.768

Both kinds of distance measures were tried but, as mentioned, cosine performed considerably better.
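The slides do not define how AP is computed; the sketch below uses a common definition (the mean of precision at each relevant rank), which may differ in detail from the variant used in the experiments.

```python
def average_precision(ranked_relevance):
    """ranked_relevance: list of booleans over the ranked results, True
    where the item at that rank belongs to the query's class. Returns the
    mean of precision@k taken at each relevant rank (an assumption about
    the exact AP variant used)."""
    hits, total = 0, 0.0
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0
```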

21 Experiments & Results [cont.]
2. Music Retrieval
In this application, music clips were used for classification. Genres included jazz, pop, rock, rap, etc. Clips from the same artist were treated as belonging to the same class, with each artist contributing 5 clips to the corpus. The corpus consisted of 255 seven-second clips, 5 clips per artist with 40 artists.

Retrieval Average Precision (AP) for the music retrieval experiment:

Distance | Euclidean (DE), supervised | Euclidean (DE), unsupervised | Cosine (DC), supervised | Vector distance
AP       | 0.35 | 0.32 | 0.40 | 0.31

22 Conclusions
- The retrieval works effectively for complex data and measures acoustic similarity.
- The sorted comparison results give the order of similarity between the query data and the references in the corpus.
- The computational and storage requirements are modest, since the templates are just arrays of integers and the Q-tree quantization and classification operate one dimension at a time.
- The method can be used to automatically segment multimedia data based on changes in speaker, pauses, musical interludes, etc.
- Finally, by varying the number of free parameters and ignoring dimensions that are never used, the templates can be optimized to the application's requirements.

23 Limitations
- The classifier is a simple separating-plane classifier that distinguishes between two subspaces with a single plane; real-world vector distributions may not be separable this way.
- Only simple acoustic parameters are used for matching; more sophisticated systems could use other parameters such as pitch and speaker-dependent properties.
- Recorded music clips are needed to identify genres; with distortion or losses in the clips, the system will not work well.
- There is no option for dynamic training, i.e., the system is not self-updating.

24 Suggestions
- Neural networks could be used to divide the feature space in a more complex fashion, allowing curved, concave, and hybrid decision surfaces.
- If the dimensionality of the feature vectors is increased, simple Euclidean distance will not suffice, and other reasoning methods are needed.
- Additional features such as pitch, brightness, and speaker-dependent parameters could be used to achieve good classification with a smaller database.
- Dynamic training should be added so the system incorporates new samples into the database as it encounters them.
- Speech recognition could possibly be added to search for data based on a query uttered by the user.

25 References
[1] S. Pfeiffer, S. Fischer, and W. Effelsberg, "Automatic audio content analysis," Tech. Rep. TR-96-008, University of Mannheim, D-68131 Mannheim, Germany, April 1996.
[2] E. Wold, T. Blum, D. Keislar, and J. Wheaton, "Content-based classification, search, and retrieval of audio," IEEE Multimedia, pp. 27-36, Fall 1996.
[3] T. Blum, D. Keislar, J. Wheaton, and E. Wold, "Audio analysis for content-based retrieval," Tech. Rep., Muscle Fish LLC, 2550 Ninth St., Suite 207B, Berkeley, CA 94710, USA, May 1996.
[4] B. Feiten and S. Günzel, "Automatic indexing of a sound database using self-organizing neural nets," Computer Music Journal 18(3), pp. 53-65, 1994.

26 Questions & Comments

27 Thanks

