Audio Processing for Ubiquitous Computing Uichin Lee KAIST KSE.

Audio processing apps
Speech recognition (e.g., Google voice search)
Situation awareness
– Conversation detection
– Location (environment) classification, e.g., home, office, bar, beach, car, street
– Dietary intake of a person (eating habits)
– Everyday sound logging (and event detection), as in SoundSense

Discrete representation of signal: represent a continuous signal in discrete form.

Time domain audio waveform
Vertical axis: amplitude (relative sound pressure)
– typical unit: µPa (micropascals); a digital signal is usually unitless
– quantization (-32768 to 32767)
Horizontal axis: time
– typical unit: msec (milliseconds)
– sampling (8000, 16000, or 44,100 samples/sec)

Digitizing audio signal: sampling
– Measuring the amplitude of the signal at time t
– 16,000 Hz (samples/sec): microphone (“wideband”)
– 8,000 Hz (samples/sec): telephone
– Why? Need at least 2 samples per cycle (Nyquist), so the maximum measurable frequency is half the sampling rate
– Human speech < 10,000 Hz, so a sampling rate of at most 20 kHz is needed
– Telephone speech is filtered at 4 kHz, so 8 kHz is enough
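The Nyquist limit can be checked numerically. The sketch below is illustrative only (the function names are made up for this example): it samples two sine tones at 8,000 Hz and shows that a 5,000 Hz tone, which is above the 4,000 Hz limit, produces the same samples as a 3,000 Hz tone with the sign flipped.

```python
import math


def sample_sine(freq_hz, sample_rate_hz, num_samples):
    """Sample sin(2*pi*f*t) at times t = n / sample_rate_hz."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(num_samples)]


def nyquist_frequency(sample_rate_hz):
    """Highest frequency that can be measured at this sampling rate."""
    return sample_rate_hz / 2


def alias_frequency(freq_hz, sample_rate_hz):
    """Apparent frequency of a tone after sampling, folded into [0, fs/2]."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


fs = 8000                                 # telephone sampling rate
tone_5k = sample_sine(5000, fs, 16)       # above the Nyquist limit
tone_3k = sample_sine(3000, fs, 16)       # below the Nyquist limit

# The 5 kHz samples equal the negated 3 kHz samples: the tone has aliased.
max_diff = max(abs(a + b) for a, b in zip(tone_5k, tone_3k))
print(nyquist_frequency(fs), alias_frequency(5000, fs), max_diff < 1e-9)
```

This is why telephone speech is low-pass filtered at 4 kHz before sampling at 8 kHz: anything above the limit would fold back onto lower frequencies.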

Digitizing audio signal: quantization
– Representing the real value of each amplitude as an integer
– 8-bit (-128 to 127) or 16-bit (-32768 to 32767)
Formats:
– 16-bit PCM
– 8-bit mu-law (log compression)
– LSB (Intel) vs. MSB (Sun, Apple) byte order, i.e., little-endian vs. big-endian
Headers: meta info such as sampling rate, recording conditions
– Raw (no header)
– Microsoft .wav
– Sun .au (40-byte header)
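A minimal sketch of these ideas, assuming amplitudes normalized to [-1, 1]. The helper names are hypothetical. The mu-law functions use the continuous companding formula with mu = 255, not the segmented 8-bit encoding used in telephony codecs.

```python
import math
import struct


def quantize_16bit(x):
    """Map a float in [-1.0, 1.0] to a signed 16-bit integer."""
    x = max(-1.0, min(1.0, x))            # clip out-of-range input
    return int(round(x * 32767))


def mu_law_compress(x, mu=255):
    """Continuous mu-law companding: log compression of the amplitude."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=255):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


# Quantization: full-scale amplitudes map to the ends of the 16-bit range.
print(quantize_16bit(1.0), quantize_16bit(0.0), quantize_16bit(-1.0))

# Byte order: the same 16-bit sample stored little-endian vs. big-endian.
value = 258                               # 0x0102
little = struct.pack("<h", value)         # low byte first
big = struct.pack(">h", value)            # high byte first
print(little.hex(), big.hex())

# Mu-law spends more of its output range on quiet samples.
print(round(mu_law_compress(0.01), 3), round(mu_law_compress(0.5), 3))
```

Reading raw (headerless) audio with the wrong byte order produces noise, which is one reason headers record the format.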

Visualization of audio signals
What makes one phoneme, /aa/, sound different from another phoneme, /iy/? Different shapes of the vocal tract: /aa/ is produced with the tongue low and in the back of the mouth; /iy/ is produced with the tongue high and toward the front. The different shapes of the vocal tract produce different “resonant frequencies”, or frequencies at which energy in the signal is concentrated. (Simple example of resonant energy: a tuning fork may have a resonant frequency of 440 Hz, the note “A”.) Resonant frequencies in speech (or other sounds) can be displayed by computing a “power spectrum” or “spectrogram”, showing the power in the signal at different frequencies.

Power spectrum
A time-domain signal can be expressed in terms of sinusoids at a range of frequencies using the Fourier transform:
Continuous: $X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt$
Discrete (N samples): $X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \quad k = 0, \dots, N-1$
Power Spectral Density (PSD) shows the power distribution in the frequency domain
– PSD = Fourier transform of the autocorrelation function of the signal
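As a rough illustration, the sketch below computes a direct (slow) discrete Fourier transform of a pure 1,000 Hz tone and estimates its power spectrum as |X[k]|^2 / N, a simple periodogram. The signal, frame length, and function names are assumptions made for this example; real systems use an FFT and a window function.

```python
import cmath
import math


def dft(x):
    """Direct O(N^2) discrete Fourier transform of a real or complex list."""
    n_samples = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                for n in range(n_samples))
            for k in range(n_samples)]


def power_spectrum_db(x, sample_rate_hz):
    """Return (frequency in Hz, power in dB) pairs for bins 0..N/2."""
    n_samples = len(x)
    spectrum = dft(x)
    result = []
    for k in range(n_samples // 2 + 1):
        power = abs(spectrum[k]) ** 2 / n_samples
        # Small floor avoids log10(0) for empty bins.
        result.append((k * sample_rate_hz / n_samples,
                       10 * math.log10(power + 1e-12)))
    return result


fs = 8000                                  # samples/sec
num_samples = 64                           # frame length
signal = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(num_samples)]

spectrum = power_spectrum_db(signal, fs)
peak_freq = max(spectrum, key=lambda pair: pair[1])[0]
print(peak_freq)                           # frequency of the strongest bin
```

The highest bin shown is fs/2, matching the 0 Hz to 4000 Hz axis of a spectrum computed from 8 kHz audio.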

Power spectrum
The power spectrum can be plotted like this (vowel /aa/):
[Figure: time-domain amplitude of a 512-sample frame, and its spectral power (dB) against frequency (Hz) from 0 Hz to 4000 Hz]

Automatic speech recognition: noisy channel analogy
Search through the space of all possible “source” sentences
Choose the one which has the highest probability of generating the “noisy” sentence
[Figure: the source sentence “If music be the food of love..” passing through a noisy channel]

Noisy channel model
What is the most likely sentence out of all sentences in the language L given some acoustic input O?
Treat the acoustic input O as a sequence of individual observations
– $O = o_1, o_2, o_3, \dots, o_t$
Define a sentence as a sequence of words:
– $W = w_1, w_2, w_3, \dots, w_n$

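Noisy channel model
Probabilistic implication: pick the sentence with the highest probability:
$\hat{W} = \arg\max_{W \in L} P(W \mid O)$
We can use Bayes' rule to rewrite this:
$\hat{W} = \arg\max_{W \in L} \frac{P(O \mid W)\, P(W)}{P(O)}$
Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:
$\hat{W} = \arg\max_{W \in L} P(O \mid W)\, P(W)$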

Noisy channel model
$\hat{W} = \arg\max_{W \in L} \underbrace{P(O \mid W)}_{\text{likelihood}}\; \underbrace{P(W)}_{\text{prior}}$

Noisy channel model
Ignoring the denominator leaves us with two factors: P(Source) and P(Signal|Source)
[Figure: the source sentence “If music be the food of love..” passing through a noisy channel]
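A toy sketch of this decision rule: score each candidate source sentence by likelihood times prior and keep the best. The candidate sentences and all probabilities below are invented purely for illustration; a real recognizer gets P(Signal|Source) from an acoustic model and P(Source) from a language model, and works in log space to avoid underflow.

```python
import math

# Hypothetical candidate sentences with made-up probabilities.
# likelihood = P(Signal | Source), prior = P(Source)
CANDIDATES = {
    "if music be the food of love": {"likelihood": 0.020, "prior": 1e-6},
    "if music bee the foot of dove": {"likelihood": 0.025, "prior": 1e-10},
    "of music be the food of glove": {"likelihood": 0.010, "prior": 1e-9},
}


def log_score(entry):
    """log P(Signal|Source) + log P(Source); P(Signal) is ignored."""
    return math.log(entry["likelihood"]) + math.log(entry["prior"])


def decode(candidates):
    """Return the candidate sentence with the highest combined score."""
    return max(candidates, key=lambda sentence: log_score(candidates[sentence]))


best = decode(CANDIDATES)
print(best)
```

In this made-up data the second candidate fits the acoustics slightly better, but its very low prior means it loses overall.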

Speech recognition
Feature extraction (or signal processing):
– The acoustic waveform is sampled into frames (usually of 10, 15, or 20 milliseconds) and transformed into spectral features (mostly MFCC)
Acoustic model (or phone recognition):
– Compute the likelihood of the observed spectral feature vectors given linguistic units (e.g., words, phones); used for decoding (and also training)
– Example: Gaussian Mixture Model (GMM) classifier. For each HMM state q, corresponding to a phone or subphone, it computes the likelihood of a given feature vector given this phone, p(o|q)
Decoding:
– Take: acoustic model (sequence of acoustic likelihoods) + HMM dictionary of word pronunciations + language model P(W)
– Output: the most likely sequence of words
[Figure: speech recognizer block diagram. Feature extraction produces O; the acoustic model supplies P(O|W) and the language model supplies P(W) to decoding, which outputs W]

Feature extraction
Mel-Frequency Cepstral Coefficients (MFCC)
– Most widely used spectral representation

Feature extraction
Window size: 25 ms
Window shift: 10 ms
Pre-emphasis coefficient: 0.97
MFCC:
– 12 MFCC (mel frequency cepstral coefficients)
– 1 energy feature
– 12 delta MFCC features
– 12 double-delta MFCC features
– 1 delta energy feature
– 1 double-delta energy feature
Total: 39-dimensional features
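The front end of this pipeline can be sketched with the parameters above. The code covers only pre-emphasis and framing, not the mel filterbank or cepstral steps. It assumes the common first-order pre-emphasis filter y[n] = x[n] - 0.97 x[n-1] and a 16,000 Hz sampling rate; the function names are hypothetical.

```python
def pre_emphasis(signal, coeff=0.97):
    """Boost high frequencies: y[n] = x[n] - coeff * x[n-1]."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]


def frame_signal(signal, sample_rate_hz, window_ms=25, shift_ms=10):
    """Split a signal into overlapping frames; incomplete tail is dropped."""
    window = int(sample_rate_hz * window_ms / 1000)   # samples per frame
    shift = int(sample_rate_hz * shift_ms / 1000)     # samples between frames
    return [signal[start:start + window]
            for start in range(0, len(signal) - window + 1, shift)]


def feature_dimension(num_mfcc=12, num_energy=1):
    """Static features plus their delta and double-delta versions."""
    static = num_mfcc + num_energy
    return static * 3


fs = 16000
one_second = [0.0] * fs                   # placeholder audio (silence)
frames = frame_signal(pre_emphasis(one_second), fs)
print(len(frames), len(frames[0]), feature_dimension())
```

With a 25 ms window and a 10 ms shift, consecutive frames overlap by 15 ms, so each second of audio yields roughly 100 feature vectors.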

Hidden Markov models

[Figure: an ergodic (fully-connected) network and a left-to-right (Bakis) network]

Hidden Markov models: example
State: Hot or Cold day
Emission: number of ice creams eaten

Three basic problems for HMMs
Problem 1 (Evaluation): Given the observation sequence $O = (o_1 o_2 \dots o_T)$ and an HMM $\lambda = (A, B)$, how do we efficiently compute $P(O \mid \lambda)$, the probability of the observation sequence given the model?
Problem 2 (Decoding): Given the observation sequence $O = (o_1 o_2 \dots o_T)$ and an HMM $\lambda = (A, B)$, how do we choose a corresponding state sequence $Q = (q_1 q_2 \dots q_T)$ that is optimal in some sense (i.e., best explains the observations)?
Problem 3 (Learning): How do we adjust the model parameters $\lambda = (A, B)$ to maximize $P(O \mid \lambda)$?

Evaluation
Evaluation: how likely is the sequence 3 1 3?
Computing the observation likelihood for a given hidden state sequence
– Suppose we knew the weather and wanted to predict how much ice cream Jason would eat, i.e., P(3 1 3 | H H C)
Summing over all possible hidden state sequences
– But with N states and T observations there are O(N^T) combinations (intractable)
– Dynamic programming: use a table to store intermediate values (called the forward algorithm)
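A minimal sketch of the forward algorithm for the Hot/Cold ice cream example. The start, transition, and emission probabilities are assumed values chosen for illustration, since the slides do not give them. The brute-force function enumerates all N^T state sequences and is included only to check that the table-based computation gives the same answer.

```python
from itertools import product

STATES = ("H", "C")                        # Hot day, Cold day
# Assumed illustrative parameters (not from the slides).
START = {"H": 0.8, "C": 0.2}
TRANS = {"H": {"H": 0.6, "C": 0.4}, "C": {"H": 0.5, "C": 0.5}}
EMIT = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}


def forward(obs):
    """P(O | model), computed in O(N^2 * T) with a table of partial sums."""
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for o in obs[1:]:
        alpha = {s: sum(alpha[p] * TRANS[p][s] for p in STATES) * EMIT[s][o]
                 for s in STATES}
    return sum(alpha.values())


def brute_force(obs):
    """Same quantity by summing over every hidden state sequence."""
    total = 0.0
    for path in product(STATES, repeat=len(obs)):
        prob = START[path[0]] * EMIT[path[0]][obs[0]]
        for t in range(1, len(obs)):
            prob *= TRANS[path[t - 1]][path[t]] * EMIT[path[t]][obs[t]]
        total += prob
    return total


likelihood = forward([3, 1, 3])
print(likelihood, abs(likelihood - brute_force([3, 1, 3])) < 1e-12)
```

For 2 states and 3 observations the brute force visits only 8 paths, but the count grows exponentially with T while the forward table grows linearly.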

Decoding
Given
– an observation sequence, e.g., 3 1 3
– an HMM (N states, T observations)
The task of the decoder is to find the best hidden state sequence
Again, the number of possible state sequences is N^T (intractable)
– E.g., P(3 1 3 | * * *)
Instead:
– Viterbi algorithm (dynamic programming)
– Uses a very similar technique to the one used in Evaluation
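A minimal Viterbi sketch for the same Hot/Cold example. As before, the model parameters are assumed illustrative values, not taken from the slides. Compared with the forward algorithm, the sum over previous states becomes a max, and back-pointers record which previous state achieved it.

```python
STATES = ("H", "C")                        # Hot day, Cold day
# Assumed illustrative parameters (not from the slides).
START = {"H": 0.8, "C": 0.2}
TRANS = {"H": {"H": 0.6, "C": 0.4}, "C": {"H": 0.5, "C": 0.5}}
EMIT = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}


def viterbi(obs):
    """Return (best hidden state sequence, its joint probability)."""
    # delta[s] = probability of the best path so far that ends in state s
    delta = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    backpointers = []
    for o in obs[1:]:
        new_delta, pointer = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: delta[p] * TRANS[p][s])
            pointer[s] = best_prev
            new_delta[s] = delta[best_prev] * TRANS[best_prev][s] * EMIT[s][o]
        delta = new_delta
        backpointers.append(pointer)

    # Trace back from the best final state.
    last = max(STATES, key=lambda s: delta[s])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path, delta[last]


best_path, best_prob = viterbi([3, 1, 3])
print(best_path, best_prob)
```

Under these assumed parameters, the single day with one ice cream is decoded as Cold between two Hot days.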

Digit recognition example
Based on the lexicon, build an HMM for each digit
– HMM states are trained, e.g., likelihoods p(o|q) and transition probabilities
Given the input observations, use Viterbi to find the best matching digit
[Figure: lexicon]

Speech recognition: summary
[Figure: feature extraction, acoustic model, and decoding stages, with extracted features passed on from feature extraction]
