MFCC for Music Modeling

MFCC for Music Modeling
- Brief summary of the paper: goals, algorithms, conclusions
- Introduction to some key concepts in DSP: sampling, FT, DFT, loudness, dB, frequency vs pitch, the mel scale
- Literature review and motivation
- Walk through the paper in detail

Paper Summary
- Examines the effectiveness of using MFCCs to model music
- The mel scale is "at least not harmful" for speech/music classification; more tests are needed to show whether this is due to better modeling of speech, of music, or both
- Examines the use of the DCT to decorrelate the mel-spectral vectors:
  - Effectively reduces the dimensionality of the data
  - A good approximation of PCA (the KL transform)
  - The decorrelating basis vectors are similar for speech and music (cosine waves as basis functions)

Some Concepts: Sampling and discrete signals
- Sound waves are continuous signals; digital signals are discrete
- Aliasing: a sampler only reads values at particular instants, so input components above half the sampling rate become indistinguishable from lower frequencies
- Nyquist rate: 2 x the highest frequency in the input signal; sampling at or above this rate avoids aliasing
- Why 44.1 kHz: humans hear roughly 20 Hz to 20 kHz, so CD audio samples at slightly more than twice 20 kHz
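
A minimal sketch of aliasing (illustrative numbers, not from the slides): a 3 Hz sine sampled at only 4 Hz, below its Nyquist rate of 6 Hz, yields exactly the same sample values as a 1 Hz sine.

    # Aliasing: sampling a 3 Hz sine at 4 Hz (below the 6 Hz Nyquist rate)
    import numpy as np

    f_signal = 3.0                       # true frequency (Hz)
    fs = 4.0                             # sampling rate (Hz), too low
    t = np.arange(0, 2, 1 / fs)          # sample instants over 2 seconds
    samples = np.sin(2 * np.pi * f_signal * t)

    # The same samples are produced by the 1 Hz alias (3 - 4 = -1 Hz):
    alias = np.sin(2 * np.pi * (f_signal - fs) * t)
    print(np.allclose(samples, alias))   # True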

Some Concepts: dB, a unit for the intensity of sound
- Intensity is proportional to distance^(-2)
- Sound pressure level: Lp = 20 log10(Prms / Pref) dB, where Pref is the reference sound pressure (20 micropascals in air) and Prms is the RMS sound pressure being measured
- Examples:
  Jackhammer at 1 m (2 Pa): 100 dB
  Leaves rustling, calm breathing: 10 dB
  Auditory threshold at 1 kHz: 0 dB
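
A quick check of the examples above using the formula, with the standard airborne reference pressure:

    # Sound pressure level: Lp = 20 * log10(Prms / Pref), with Pref = 20 uPa
    import math

    P_REF = 20e-6                        # reference sound pressure (Pa)

    def spl_db(p_rms):
        return 20 * math.log10(p_rms / P_REF)

    print(spl_db(2.0))                   # jackhammer at 1 m (2 Pa) -> 100.0 dB
    print(spl_db(20e-6))                 # auditory threshold at 1 kHz -> 0.0 dB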

Some Concepts: Loudness
- A subjective measure, log-scaled
- A widely used rule of thumb: a sound must be increased in intensity by a factor of ten to be perceived as twice as loud
- A common way of stating it: it takes 10 violins to sound twice as loud as one violin

Some Concepts: Frequency vs Pitch
- A linear pitch space in which octaves have size 12, semitones (the distance between adjacent keys on the piano keyboard) have size 1, and A440 is assigned the number 69 (this is MIDI note numbering)
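
This pitch space is the standard MIDI numbering, given by p = 69 + 12 log2(f / 440); a small conversion sketch:

    # Frequency (Hz) <-> linear pitch space (MIDI note numbers)
    import math

    def hz_to_pitch(f):
        return 69 + 12 * math.log2(f / 440.0)

    def pitch_to_hz(p):
        return 440.0 * 2 ** ((p - 69) / 12.0)

    print(hz_to_pitch(440.0))            # 69.0 (A440)
    print(hz_to_pitch(880.0))            # 81.0 (one octave up = +12)
    print(pitch_to_hz(60))               # ~261.63 Hz (middle C)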

Some Concepts: The Mel Scale
- Proposed by Stevens, Volkmann, and Newman in 1937
- A perceptual scale of pitches
- Reference point: a 1000 Hz tone, 40 dB above the listener's threshold, is defined as 1000 mels
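
The original scale was defined empirically, but a common analytic approximation (O'Shaughnessy's formula, used here only for illustration) is m = 2595 log10(1 + f / 700):

    # Hz -> mel, using O'Shaughnessy's approximation of the mel scale
    import math

    def hz_to_mel(f):
        return 2595.0 * math.log10(1.0 + f / 700.0)

    print(hz_to_mel(1000.0))             # ~1000 mels, matching the 1 kHz reference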

Some Concepts: Mel vs Hz
(figure: mel value plotted against frequency in Hz)

Some Concepts: Discrete Fourier Transform (DFT)
- Maps a time-domain function to the frequency domain
- The sequence of N complex numbers x_0, ..., x_{N-1} is transformed into the sequence of N complex numbers X_0, ..., X_{N-1} by the DFT according to the formula
  X_k = \sum_{n=0}^{N-1} x_n e^{-2\pi i k n / N},  k = 0, ..., N-1
- The number of frequency components equals the number of input samples, N

Some Concepts: Discrete Fourier Transform (DFT)
- The time-domain function equals a sum of (complex coefficient x wave function) terms
- Makes spectral information easier to visualize; see the demo

Some Concepts: DFT demo
- y = sine_1 + sine_2 + noise (standard normal), with the two sine waves known
- Use the FFT to recover the frequencies of the two sine waves; a sketch follows
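
A sketch of the demo (the frequencies 5 Hz and 12 Hz and the sampling rate are made up for illustration):

    # Recover two sine frequencies from a noisy mixture with the FFT
    import numpy as np

    fs = 100                                   # sampling rate (Hz)
    t = np.arange(0, 4, 1 / fs)                # 4 seconds of samples
    y = (np.sin(2 * np.pi * 5 * t) + np.sin(2 * np.pi * 12 * t)
         + np.random.standard_normal(t.size))

    spectrum = np.abs(np.fft.rfft(y))          # one-sided magnitude spectrum
    freqs = np.fft.rfftfreq(t.size, 1 / fs)    # frequency axis (Hz)

    # The two largest peaks land at (about) 5 Hz and 12 Hz:
    print(sorted(freqs[np.argsort(spectrum)[-2:]]))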

Some Concepts: Hamming Window
- The DFT assumes the input frame contains exactly one period of the signal
- Components whose wavelengths do not evenly divide the frame size smear across DFT bins (spectral leakage)
- This error can be reduced by multiplying the signal by a Hamming window
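
A sketch of the leakage and the windowing fix, with made-up numbers:

    # Spectral leakage: an off-bin tone smears; a Hamming window reduces it
    import numpy as np

    fs, n = 1000, 256
    t = np.arange(n) / fs
    y = np.sin(2 * np.pi * 52.7 * t)     # 52.7 Hz falls between DFT bins

    raw = np.abs(np.fft.rfft(y))
    windowed = np.abs(np.fft.rfft(y * np.hamming(n)))

    # Energy far from the peak is much lower in the windowed frame:
    print(raw[100:].max(), windowed[100:].max())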

(figure from: Bojan Kotnik, Damjan Vlaj, Zdravko Kačič, "Robust MFCC Feature Extraction Algorithm Using Efficient Additive and Convolutional Noise Reduction Procedures")

Relevant Work and Motivation
Keith Martin et al., 1998: Music Content Analysis through Models of Audition
- Conventional music-analysis systems rely on notes, chords, rhythm, and harmonic progressions; so far, not very successful
- Calls for a change in direction: focus on how non-musicians listen to music, turning to psychoacoustics, auditory scene analysis (perception), and DSP
- Case studies: speech/music discrimination (identified useful features), acoustic beat and tempo tracking, timbre classification, and music perception systems (making machines judge music like an untrained listener)

Relevant Work and Motivation
Scheirer and Slaney, 1997: Construction and evaluation of a robust multifeature speech/music discriminator
- A real-time computer system to distinguish speech from music
- Uses frame-by-frame data
- 13 features, 5 of which are variance features measuring how fast a feature changes across one-second frames; others include the spectral centroid, zero-crossing rate, etc.
- Uses Gaussian mixture models and MAP classification
- High accuracy

Relevant Work and Motivation
Martin, 1999: Toward automatic sound source recognition: identifying musical instruments
- Experiments based on a set of orchestral musical instruments
- Uses frame-by-frame data
- Features: pitch, frequency modulation, spectral centroid, intensity, spectral envelope, ...
- The log-lag correlogram is a good representation that encodes most of the features' information

Relevant Work and Motivation
Foote, 1997: Content-based retrieval of music and audio
- One of the first systems to retrieve audio documents by acoustic similarity
- Does not depend on subjective features (brightness, pitch, ...); data-driven, statistical methods rather than matching audio characteristics
- Inexpensive in computation and storage
- Uses MFCCs to represent audio files, with a supervised tree-based quantizer (decision trees?)
- Experiments: retrieving simple sounds (laughter, thunder, animal cries, ...) and retrieving sounds from a corpus of musical clips
- The supervised quantizer with cosine distance performed best for both

MFCC features
MFCC feature extraction (a sketch of the whole pipeline follows this list):
1. Divide the signal into frames (~20 ms)
2. Take the Discrete Fourier Transform (DFT) of each frame
3. Take the log of the amplitude spectrum ("pull up")
4. Apply mel-scaling and smoothing ("pull to the right")
5. Take the Discrete Cosine Transform (DCT)
6. Obtain the MFCC features
Each frame of the time-domain signal is thus represented/encoded by a vector of 13 features.
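
A minimal from-scratch sketch of the pipeline, assuming a mono signal y at sampling rate fs, a power spectrum, and 40 triangular mel filters; real implementations (the MA toolbox below, HTK, librosa) differ in details such as pre-emphasis, filter normalization, and liftering.

    # MFCC extraction: frame -> window -> DFT -> mel filterbank -> log -> DCT
    import numpy as np
    from scipy.fftpack import dct

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_filterbank(n_filters, n_fft, fs):
        # Triangular filters spaced evenly on the mel scale
        mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(1, n_filters + 1):
            left, center, right = bins[i - 1], bins[i], bins[i + 1]
            for k in range(left, center):
                fbank[i - 1, k] = (k - left) / max(center - left, 1)
            for k in range(center, right):
                fbank[i - 1, k] = (right - k) / max(right - center, 1)
        return fbank

    def mfcc(y, fs, frame_len=256, hop=128, n_filters=40, n_ceps=13):
        fbank = mel_filterbank(n_filters, frame_len, fs)
        window = np.hamming(frame_len)
        coeffs = []
        for start in range(0, len(y) - frame_len, hop):
            frame = y[start:start + frame_len] * window       # ~23 ms @ 11 kHz
            power = np.abs(np.fft.rfft(frame)) ** 2           # DFT -> power spectrum
            log_mel = np.log(fbank @ power + 1e-10)           # mel-scaling, then log
            coeffs.append(dct(log_mel, type=2, norm='ortho')[:n_ceps])  # DCT, keep 13
        return np.array(coeffs)    # one n_ceps-dimensional vector per frame

    y = np.random.standard_normal(11025)   # stand-in for 1 s of audio @ 11 kHz
    print(mfcc(y, 11025).shape)            # (number of frames, 13)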

MFCC features
Demo: ma_mfcc(wav, p), from the MA toolbox
INPUT:
- wav (vector): obtained from wavread or ma_mp3read (use mono input! 11 kHz recommended)
- p (struct): parameters, e.g.

    p.fs = 11025;             %% sampling frequency of given wav (unit: Hz)
    p.visu = 0;               %% create some figures
    p.fft_size = 256;         %% (unit: samples) 256 samples are about 23 ms @ 11 kHz
    p.hopsize = 128;          %% (unit: samples) aka overlap
    p.num_ceps_coeffs = 20;
    p.use_first_coeff = 1;    %% aka 0th coefficient (contains information
                              %% on average loudness)
    p.mel_filt_bank = 'auditory-toolbox';
                              %% mel filter bank choice:
                              %% {'auditory-toolbox' | [f_min f_max num_bands]}
                              %% e.g. [20 16000 40] (default)
                              %% note: auditory-toolbox is optimized for
                              %% speech (133 Hz ... 6.9 kHz)
    p.dB_max = 96;            %% max dB of input wav (for 16-bit input, 96 dB is SPL)

MFCC features
Cosine basis functions (figure: the DCT basis, drawn as bands of white and black):

MFCC features
Basis functions in the graph (a sketch generating this basis follows):
- White-to-black = half a cycle
- Basis 1: no cycle; basis 2: half a cycle; basis 3: one cycle; etc.
- Normally 13 coefficients are used
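
A sketch generating the orthonormal DCT-II basis; rows here are 0-indexed, so row 0 is constant (no cycle), row 1 completes half a cycle, row 2 one cycle, and so on.

    # Build the orthonormal DCT-II matrix D: row k has k half-cycles
    import numpy as np

    def dct_matrix(n):
        k = np.arange(n)[:, None]        # row index: number of half-cycles
        m = np.arange(n)[None, :]        # column index: input position
        D = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
        D[0] *= np.sqrt(1.0 / n)         # orthonormal scaling
        D[1:] *= np.sqrt(2.0 / n)
        return D

    D = dct_matrix(8)
    print(np.allclose(D @ D.T, np.eye(8)))   # True: the basis is orthonormal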

MFCC features
Questions? Strengths? Weaknesses?

MFCC features
- Natural to use the mel scale and log amplitude, since they relate to how we perceive sounds
- Models small (20 ms) windows that are statistically stationary
- Assumption: phase information is less important than amplitude
- The DFT assumes each frame of the signal is exactly one period

Mel vs Linear, via Speech/Music Classification
- 2 hours of training data and 40 minutes of test data; music is 10% of the training set and 14% of the test set
- Bag of frames: each song becomes a bunch of feature vectors
- The EM algorithm trains Gaussian classifiers
- Compare the likelihoods of a new point x, P(x|music) vs P(x|speech), and choose the maximum

Mel vs Linear
- Speech and music are each modeled using a GMM
- Both mel and linear features are 13-dimensional:
  Mel: 40 bins --> DCT --> 13 features
  Linear: 256 bins --> DCT --> 13 features
- Speech frames and music frames in the training data train the speech GMM and the music GMM respectively, via the EM algorithm

EM Algorithm
- The expectation-maximization (EM) algorithm finds maximum likelihood estimates of parameters in probabilistic models that depend on unobserved latent variables
- Expectation (E) step: compute the expectation of the log likelihood with respect to the current estimate of the distribution of the latent variables
- Maximization (M) step: compute the parameters that maximize the expected log likelihood found in the E step; these parameters then determine the distribution of the latent variables in the next E step
- Animation: http://upload.wikimedia.org/wikipedia/commons/a/a7/Em_old_faithful.gif

Mel vs Linear: speech/music discriminator
- GMMs in the 13-dimensional feature space
- Given a new data point x, compute the component likelihoods P(x|speech_1), P(x|speech_2), ..., P(x|music_1), P(x|music_2), ...
- Find P(x|speech) and P(x|music) by summing the products of the mixture coefficients and the component likelihoods
- x belongs to Y if Y = argmax_Y P(x|Y), Y = speech or music
A sketch of such a discriminator follows.
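
A sketch using scikit-learn's GaussianMixture (not the paper's code; fit() runs the EM algorithm internally, and the frame arrays are stand-ins for real MFCC data):

    # Train one GMM per class; classify by the larger average log likelihood
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_discriminator(speech_frames, music_frames, n_components=4):
        gmm_speech = GaussianMixture(n_components).fit(speech_frames)
        gmm_music = GaussianMixture(n_components).fit(music_frames)
        return gmm_speech, gmm_music

    def classify(frames, gmm_speech, gmm_music):
        # score_samples gives per-frame log likelihoods; average over the segment
        ll_speech = gmm_speech.score_samples(frames).mean()
        ll_music = gmm_music.score_samples(frames).mean()
        return "speech" if ll_speech > ll_music else "music"

    # Toy usage with random stand-ins for 13-dimensional MFCC frames:
    speech = np.random.randn(500, 13)
    music = np.random.randn(500, 13) + 2.0
    models = train_discriminator(speech, music)
    print(classify(np.random.randn(20, 13) + 2.0, *models))   # "music"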

Mel vs Linear
Questions? Strengths? Weaknesses?

Mel vs Linear
- Uses well-known algorithms: GMM, EM
- Considers the average likelihood over a test segment (many frames), but how long a segment is appropriate?
- The explanation in paragraph 2 was very confusing
- How is the segmentation error computed? (Table 1)

DCT to Approximate PCA
- Known: the KL transform decorrelates speech data
- Try: the DCT to decorrelate speech data, and the DCT to decorrelate music data
- Result: the basis functions for speech and music data are similar

DCT and PCA
- The DCT breaks a function into a sum of cosine basis functions
- PCA is a common technique for finding patterns in high-dimensional data, used in face recognition, image compression, etc.
- PCA transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components; it reduces dimensionality

PCA
1. Start with linearly correlated data
2. Adjust to the mean
3. Find the eigenvectors of the covariance matrix

PCA
- The eigenvector with the highest eigenvalue is the principal component: it accounts for most of the variation in the data
- Translate the data to the new coordinates
- If the original data is multivariate Gaussian, each resulting component follows a univariate Gaussian distribution
A sketch of these steps follows.
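
A sketch of the steps above on synthetic data, chosen so the next slides' claim can also be checked: for strongly correlated data (an AR(1) process across dimensions, standing in here for log mel spectra), the PCA eigenvectors are known to come out cosine-like.

    # PCA via the covariance matrix, on correlated synthetic data
    import numpy as np

    rng = np.random.default_rng(0)

    # 1. start with linearly correlated data (neighboring dims correlate ~0.95)
    n, d, rho = 5000, 40, 0.95
    X = np.zeros((n, d))
    X[:, 0] = rng.standard_normal(n)
    for j in range(1, d):
        X[:, j] = rho * X[:, j - 1] + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)

    # 2. adjust to the mean
    Xc = X - X.mean(axis=0)

    # 3. find the eigenvectors of the covariance matrix
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    components = eigvecs[:, np.argsort(eigvals)[::-1]]   # principal component first

    # Project onto the top 13 components: 40 dimensions -> 13
    print((Xc @ components[:, :13]).shape)               # (5000, 13)

    # The second component resembles DCT row 1 (half a cosine cycle):
    dct_row1 = np.cos(np.pi * (2 * np.arange(d) + 1) / (2 * d))
    print(abs(np.corrcoef(components[:, 1], dct_row1)[0, 1]))   # typically close to 1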

DCT and PCA
- c = Du, where u is the higher-dimensional input vector (the log mel spectrum derived from the DFT coefficients?) and c is the column vector of MFCC features
- Each row of D is a cosine basis function
- Analogous to the orthonormalized eigenvectors in the KL transform matrix O?

DCT and PCA
- For speech data: the KL transform gives "cos-like" basis functions, so the DCT approximates PCA on speech data
- For music data: the KL transform again gives cos-like basis functions, so the DCT approximates PCA on music data as well
Questions? Strengths? Weaknesses?