
1 Analysis of multidimensional signals for classification and recognition Politecnico di Bari Corresponding person: Pietro Guccione – Assistant Professor (guccione@poliba.it)

2 Purpose Many different signals are involved in, and influence, ordinary everyday life. Examples are speech, music, video, pictures, climate data, biomedical images, sociological data, data from household appliances, etc. From a scientific point of view, processing these signals for interpretation, investigation, forecasting, storage, and fruition, among other applications, is an ongoing challenge. The purpose of this talk is to: • present a few examples of signals with which our group is or will be involved, and their main features, • present the tools with which such signals are currently handled and processed, • describe the current state of the art and the applications implemented by proper processing of such signals, • describe a line of research along which it could be interesting to investigate them.

3 Summary Speech Signal • Basis and main features • Spectral/time-domain representations • Parameterization and feature extraction • A general model for recognition of speaker / speech / emotional state

4 Speech Basis /1 The human voice production mechanism can be roughly divided into three parts: lungs, vocal folds, and vocal tract. The lungs function as a source of air flow and pressure. When a voiced speech sound is being produced, the vocal folds (vocal cords) open and close periodically and thus convert the air flow from the lungs into a train of flow pulses, which functions as an acoustic excitation and source of voiced speech. The vocal tract is a set of cavities above the vocal folds up to the mouth and nostrils. It functions as an acoustic filter that shapes the spectrum of the sound. Finally, sound is radiated to the surrounding air at the lips and nostrils.

5 Speech Basis /2 The vibration of the vocal folds provides a spectrally rich acoustic excitation that is shaped by the cavities above the glottal source. The tube formed by the larynx, pharynx, and oral cavity is called the vocal tract. Sometimes the nasal cavity is included in the definition of the vocal tract. The vocal tract is an adjustable acoustic filter that modifies the spectrum of the excitation signal. Each vowel sound has its characteristic spectral profile produced by vocal tract resonances, or formants. The formant frequencies depend on the shape of the vocal tract, which in turn is determined by the positions of the soft palate, tongue, jaw, and lips.

6 Speech Basis /3 Fant (1960) introduced the source-filter theory of human speech production. The theory states that the voice production mechanism can be modeled as a series connection of an excitation source and a filter system. The source and filter are considered independent of each other. In the case of voiced speech sounds, the excitation is provided by the air flow through the vibrating vocal folds, the voice source. The vocal tract works as a phone-dependent filter. Acoustic analysis of the speech production mechanism commonly utilizes two physical variables: sound pressure and volume velocity of air flow. The glottal flow is usually expressed in terms of volume velocity, whereas speech is typically recorded at some distance from the speaker using a pressure microphone. The volume velocity waveform at the mouth determines the pressure signal propagating into the surrounding free field. [Applications: analysis, speech recognition, speaker recognition, …]

7 Speech Basis /4 The important parts of the discrete-time speech production model, in the field of speech recognition and signal processing, are u(n), the gain b0, and H(z). The impulse generator acts like the lungs, exciting the glottal filter G(z) and producing u(n). G(z) is to be regarded as the vocal cords in the human vocal mechanism. The signal u(n) can be seen as the excitation signal entering the vocal tract and the nasal cavity, formed by exciting the vocal cords with air from the lungs. The gain b0 is a factor related to the volume of the speech being produced: a larger gain b0 gives louder speech, and vice versa. The vocal tract filter H(z) is a model of the vocal tract and the nasal cavity. The lip radiation filter R(z) is a model of how the human lips are shaped to produce different sounds. In cascade, the model output is S(z) = b0 U(z) H(z) R(z).
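A minimal sketch of this source-filter cascade in Python (assuming numpy and scipy); the pitch, gain, and filter coefficients are illustrative placeholders, not measured vocal tract values:

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000            # sampling rate (Hz)
f0 = 120             # pitch: impulse train period ~ fs/f0 samples

# Impulse train: the lungs + periodic glottal pulses (excitation source)
excitation = np.zeros(fs)
excitation[::fs // f0] = 1.0

# G(z): crude glottal smoothing filter (illustrative one-pole low-pass)
u = lfilter([1.0], [1.0, -0.95], excitation)

# b0: gain controlling loudness
b0 = 0.5

# H(z): all-pole vocal tract filter with two resonances (formants)
def resonator_poles(freq, bw, fs):
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    return np.array([r * np.exp(1j * theta), r * np.exp(-1j * theta)])

poles = np.concatenate([resonator_poles(700, 100, fs),    # F1 ~ 700 Hz
                        resonator_poles(1200, 120, fs)])  # F2 ~ 1200 Hz
a = np.real(np.poly(poles))                # denominator of H(z)
speech = lfilter([1.0], a, b0 * u)

# R(z): lip radiation, often approximated as a first-order differentiator
s_out = lfilter([1.0, -0.98], [1.0], speech)
```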

8 Speech Basis /5 Non-continuant sounds: the airstream enters the oral cavity and is stopped for a short period; these are called stops. Stops are transient, non-continuant sounds produced by building up pressure behind a total constriction somewhere along the vocal tract and suddenly releasing it. This sudden explosion and aspiration of air characterizes the stop consonants. Semivowels are classified as either liquids or glides. Liquids have spectral characteristics similar to vowels, but are normally weaker than most vowels due to their more constricted vocal tract. A glide consists of one target position, with associated formant transitions toward and away from the target. Glides can be viewed as transient sounds, as they maintain the target position for much less time than vowels. Nasal consonants are voiced sounds produced by the glottal waveform exciting an open nasal cavity and a closed oral cavity. Their waveforms resemble vowels, but are normally weaker in energy due to the limited ability of the nasal cavity to radiate sound. Affricates are formed by transitions from a stop to a fricative. Fricatives are produced by exciting the vocal tract with a steady airstream that becomes turbulent at some point of constriction. Vowels are phonated and are normally among the phonemes with the highest amplitude. Vowels can vary widely in duration (typically 40-400 ms) and are spectrally well defined. Vowels are produced by exciting a fixed vocal tract shape with quasi-periodic pulses of air caused by the vibration of the vocal cords, and are differentiated by the position of the tongue hump.

9 Speech Basis /6 The pitch is the fundamental frequency of the voiced phonemes (the excitation element). It is the frequency at which the vocal cords oscillate during the passage of the air flow. The pitch usually lies within the interval 100-300 Hz; variations are due to sex (male/female/child), length of the vocal cords, and intonation of the speech (interrogative, assertive, …). [Intonation is really also a function of the emotional state, which affects even the intensity and speed of the utterance.] Pitch can be used as a basic element for a rough classification of speakers. Unvoiced emissions are usually generated by the passage of air through the opened vocal cords. The relative opening/closure of the vocal tract, the relative positions of the tongue and lips, and the use of the nasal tract produce the different unvoiced phonemes. [Figure: the vocal channel modeled as a reverberating structure of about 10 sections, mismatched at the output (lips) and matched at the input (glottis); length < 20 cm. With a sound speed of 330 m/s, the vocal channel crossing time is ~0.6 ms; the maximum frequency of the vocal signal is ~4 kHz, hence a minimum sampling frequency of 8 kHz.]
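As a hedged illustration of using pitch for a rough speaker classification, a simple autocorrelation-based pitch estimator over one voiced frame (function name and search range are illustrative, following the 100-300 Hz interval above):

```python
import numpy as np

def estimate_pitch(frame, fs, fmin=100.0, fmax=300.0):
    """Estimate the pitch of a voiced frame from the autocorrelation peak."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / fmax)           # shortest period to consider
    lag_max = int(fs / fmin)           # longest period to consider
    lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])
    return fs / lag

# Usage on a synthetic 150 Hz "voiced" frame (30 ms at 8 kHz)
fs = 8000
t = np.arange(int(0.03 * fs)) / fs
frame = np.sign(np.sin(2 * np.pi * 150 * t))   # crude periodic signal
print(estimate_pitch(frame, fs))                # ~150 Hz
```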

10 Speech Basis /7 Vowels are basically periodic sounds; their spectrum includes a continuous part and a set of spectral peaks. The peculiar feature of a vowel is given by its spectral peaks, where most of the energy of the vowel is concentrated. The peaks correspond to the main resonance frequencies of the vocal tract (formants). 3-4 formants are usually sufficient to characterize the vowels. Establishing thresholds for discriminating formants is important for determining the baseline ability of the auditory system to process small differences between vowels. [Table: formant frequencies of the English language vowels.]

11 Speech signal representation Human speech is commonly modeled as a non-stationary random process. Its first- and second-order statistics (pdfs) were suggested many decades ago (Gaussian, Laplacian, Gamma distributions) for optimal quantization, but they are actually unable to describe the richness of the speech signal. The non-stationary character is usually handled by segmenting the signal into short-time frames (up to a few hundred ms) within which the speech can be considered stationary. The speech signal and all its characteristics can be represented in two different domains, the time domain and the frequency domain. A speech signal is a slowly time-varying signal in the sense that, when examined over a short period of time (between 5 and 100 ms), its characteristics are short-time stationary. This is not the case if we look at a speech signal over a longer time perspective (approximately t > 0.5 s). In that case the signal's characteristics are non-stationary, meaning that they change to reflect the different sounds spoken by the talker, different talkers, and the emotional state. To use a speech signal and interpret its characteristics in a proper manner, some kind of representation of the speech signal is preferred. The speech representation can exist in either the time or frequency domain, and in three different ways: a three-state representation (voiced, unvoiced, silence), a spectral representation, and a parameterization of the spectral activity.
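A minimal sketch of the short-time segmentation described above (the 25 ms frame and 10 ms hop are illustrative values within the ranges cited in the text):

```python
import numpy as np

def frame_signal(x, fs, frame_ms=25.0, hop_ms=10.0):
    """Split a signal into overlapping, windowed short-time frames.

    Within each frame (tens of ms) the speech is treated as stationary.
    """
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    window = np.hamming(frame_len)           # taper the frame edges
    return x[idx] * window                   # shape: (n_frames, frame_len)
```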

12 Spectral representation /1 Speech spectral envelope. The rate at which power spectral densities are generated is called the frame rate. The typical frame rate is between 100 and 200 frames per second, which corresponds to a frame every 5 to 10 ms. Example of short-time Fourier transform (STFT): 1. The data in a frame are windowed. 2. The PSD is computed (spectrogram/Welch). 3. The PSDs of several frames are averaged to reduce the variance of the estimate (this accounts for the problem of non-stationarity).
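A hedged sketch of steps 1-3 with scipy (window each frame, compute per-frame PSDs, average them); the random signal stands in for real speech:

```python
import numpy as np
from scipy.signal import welch, spectrogram

fs = 8000
x = np.random.randn(fs)  # stand-in for one second of speech

# Welch: windows overlapping frames, computes per-frame PSDs,
# and averages them to reduce the variance of the estimate.
f, psd = welch(x, fs=fs, window="hamming", nperseg=int(0.025 * fs),
               noverlap=int(0.015 * fs))

# Spectrogram: the same per-frame PSDs, kept separate along time.
f_s, t_s, sxx = spectrogram(x, fs=fs, window="hamming",
                            nperseg=int(0.025 * fs),
                            noverlap=int(0.015 * fs))
```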

13 Spectral representation /2 Problems: 1. Pre-processing to discard the unvoiced/silent parts of the speech (reducing the nuisance frames). 2. Optimal segmentation (correct size of a frame): onset detection, i.e. recognition of the transient phases. 3. Optimal number of segments to average to reduce the variance of the estimate. [Figure: a waveform with silence (S) and onset (O) regions marked, illustrating the difficulty in recognizing the different phonemes.]

14 Spectral representation /3 Problems (continued): 1. Pre-processing to discard the unvoiced/silent parts of the speech: thresholds on the short-time averaged energy and on the LPC coefficients. 2. Optimal segmentation (correct size of a frame): a given resolution is required for the PSD. 3. Optimal number of segments to average to reduce the variance of the estimate. Solution: the spectrogram. A spectrogram is a time-varying spectral representation that shows how the PSD of a signal varies with time.

15 Spectral representation /4 The spectrogram can aid in selecting the corruptive noise to remove before applying ASR (Automatic Speech Recognition) systems. [Figure, left: mel spectrogram of an utterance of speech corrupted by additive white noise. Right: mel spectrogram of the same utterance after all regions with a local SNR less than 0 dB have been deleted; the white regions represent the deleted regions of the spectrogram.]
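A hedged sketch of the masking step itself (the 0 dB threshold follows the slide; the mel spectrogram and the per-band noise estimate are assumed to be available):

```python
import numpy as np

def mask_low_snr(spec, noise_psd, threshold_db=0.0):
    """Delete time-frequency cells whose local SNR is below a threshold.

    spec:      mel (or linear) power spectrogram, shape (n_bands, n_frames)
    noise_psd: estimated noise power per band, shape (n_bands,)
    Returns a copy with low-SNR cells set to NaN (the "white regions").
    """
    snr_db = 10.0 * np.log10(spec / noise_psd[:, None] + 1e-12)
    masked = spec.astype(float).copy()
    masked[snr_db < threshold_db] = np.nan
    return masked
```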

16 Spectral representation /5 Linear Prediction Coefficients. The main idea behind linear prediction is to extract the vocal tract parameters. Given a speech sample at time n, s(n) can be modeled as a linear combination of the past p speech samples: $\hat{s}(n) = \sum_{k=1}^{p} a_k\, s(n-k)$. All the coefficients $a_k$ are assumed constant over the speech analysis frame (stationarity of the signal within the frame is assumed).
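A minimal sketch of estimating the coefficients $a_k$ from one frame via the autocorrelation (Yule-Walker) method; scipy's Toeplitz solver stands in for the usual Levinson-Durbin recursion, and the synthetic frame is illustrative:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, p):
    """Autocorrelation-method LPC: solve the Toeplitz system R a = r."""
    frame = frame * np.hamming(len(frame))
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz(ac[:p], ac[1:p + 1])
    return a  # s_hat(n) = sum_k a[k-1] * s(n-k)

# Usage: p = 10 poles is a common choice for 8 kHz speech
fs = 8000
t = np.arange(int(0.03 * fs)) / fs
frame = np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
print(lpc(frame, 10))
```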

17 Spectral representation /6 [Block diagram of an LPC vocoder. Encoder: s(n) → pre-emphasis → segmentation (10-30 ms) → LPC analysis of the coefficients and pitch estimation → LPC coding of the filter coefficients, gain, and pitch/noise (voiced/unvoiced) decision → LPC-coded stream. Decoder: decoding → pitch/noise generator driving the synthesis filter with the decoded filter coefficients and gain → de-emphasis → s_out(n).]

18 Spectral representation /7 Cepstrum (or cepstral) coefficients. The cepstrum method is a way of investigating the vocal tract filter H(z) through a "homomorphic transformation". Homomorphic signal processing is generally concerned with the transformation to a linear domain of signals combined in a nonlinear way. In this case the two signals are not combined linearly (a convolution cannot be described as a simple linear combination). Since the excitation, U(z), and the vocal tract, H(z), are combined by a multiplicative law, S(z) = U(z) H(z), it is difficult to separate them. If the log operation is applied, the task becomes additive: $\log|S(z)| = \log|U(z)| + \log|H(z)|$. The additive property of the log spectrum also applies when an inverse transform is applied to it. The result of this operation is called a cepstrum. To avoid taking the logarithm of complex numbers, an abs operation is applied to S(z).
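A minimal sketch of the real cepstrum of a frame (FFT → log magnitude → inverse FFT), matching the abs-then-log pipeline just described:

```python
import numpy as np

def real_cepstrum(frame, n_fft=512):
    """Real cepstrum: inverse FFT of the log magnitude spectrum.

    Low quefrencies capture the slowly varying vocal tract envelope;
    high quefrencies capture the fast-varying excitation (pitch).
    """
    spectrum = np.fft.rfft(frame, n=n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # abs avoids the complex log
    return np.fft.irfft(log_mag, n=n_fft)
```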

19 Spectral representation /8 Cepstrum coefficients: motivations. The vocal tract filter has "slow" spectral variations, while the excitation signal has "fast" variations. This property corresponds to low quefrency for the vocal tract filter and high quefrency for the excitation signal. The peak at quefrency P in the cepstrum of the excitation corresponds to the pitch period. Liftering is filtering in the cepstrum domain and can help in selecting and detecting the pitch frequency.
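A hedged continuation of the previous sketch: pick the cepstral peak in the pitch quefrency range, and keep only the low quefrencies (low-time liftering) for the envelope; names, ranges, and the cutoff are illustrative:

```python
import numpy as np

def cepstral_pitch(cep, fs, fmin=100.0, fmax=300.0):
    """Pitch from the cepstral peak in the quefrency range fs/fmax..fs/fmin."""
    q_min, q_max = int(fs / fmax), int(fs / fmin)
    peak = q_min + np.argmax(cep[q_min:q_max + 1])
    return fs / peak

def low_time_lifter(cep, cutoff=30):
    """Keep only low quefrencies: the smooth vocal tract envelope."""
    liftered = np.zeros_like(cep)
    liftered[:cutoff] = cep[:cutoff]
    return liftered
```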

20 Spectral representation /9 Mel-frequency cepstral coefficients. A number of "variants" of the cepstrum exist. Psychophysical studies (studies of human auditory perception) have shown that the human perception of the frequency content of sound, for speech signals, does not follow a linear scale. Thus for each tone with an actual frequency F, measured in Hz, a subjective pitch is measured on a scale called the "mel" scale. As a reference for the mel scale, 1000 Hz is usually said to be 1000 mels. The nonlinear transformation of the frequency scale is commonly written as $\mathrm{mel}(F) = 2595 \log_{10}(1 + F/700)$.

21 Spectral representation /10 Mel-frequency cepstral coefficients. Filters equispaced on the mel-frequency axis correspond, on the ordinary frequency axis, to filters whose spacing and bandwidth grow exponentially with frequency.
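A hedged sketch of building such a triangular filterbank: the band edges are equispaced in mel and mapped back to Hz (the 2595/700 constants follow the common convention quoted above; the filter count and FFT size are illustrative):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs, f_low=0.0, f_high=None):
    """Triangular filters equispaced in mel, widening in Hz."""
    f_high = f_high or fs / 2.0
    mel_points = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):          # rising slope
            fbank[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope
            fbank[i, k] = (right - k) / max(right - center, 1)
    return fbank

# Usage: 26 filters over an 8 kHz signal, 512-point FFT
fb = mel_filterbank(26, 512, 8000)
```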

22 Spectral representation /11 The band-limited MFCC space may be modeled as a mixture of K Gaussian classes. Cepstrum/MFCC features are generally assumed uncorrelated. In fact, this is one of the key points for their extended use in ASR systems: they allow using diagonal covariance matrices in Gaussian mixture models without significant performance loss. A good parametric representation for a speech recognition system tries to eliminate the influence of the source (the system must give the same "answer" for a high-pitched female voice and for a low-pitched male voice) and to characterize the filter. The problem is that the source and the filter impulse response are convolved, so we need deconvolution in speech recognition applications. This deconvolution is achieved by the cepstral coefficients.
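A minimal sketch of fitting such a K-class, diagonal-covariance GMM to MFCC frames with scikit-learn (the random matrix stands in for real MFCC vectors; K = 8 is illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in for MFCC features: 1000 frames x 13 coefficients
features = np.random.randn(1000, 13)

# K Gaussian classes; diagonal covariances exploit the assumed
# decorrelation of the MFCC dimensions with little performance loss.
gmm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(features)

log_likelihood = gmm.score(features)           # average per-frame log-likelihood
posteriors = gmm.predict_proba(features[:5])   # class responsibilities
```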

23 Spectral representation /12 Possible application: accurate estimation of the PSD of a speech signal. LPC and LPCC (Linear Prediction Cepstral Coefficients) can be related to each other. The LPCC are less correlated along their dimensions, and the spectral envelope has undergone a logarithmic compression. Mel-frequency cepstral coefficients warp the Fourier frequency axis in addition to the magnitude during computation. The shape of the warping function tends to resolve the spectral peaks in speech with fewer coefficients. MFCCs are extremely effective as a basis for phonetic classification in speech recognizers. This can be explained by the way they compactly describe the broad shape of the short-time spectrum using just a few values, and by the fact that these values tend to be relatively independent, meaning there is little redundant information in the feature vectors.
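A hedged sketch of one common way LPC and LPCC are related: the standard recursion converting prediction coefficients a_k into cepstral coefficients c_n (sign conventions for a_k vary between texts, so treat this as illustrative):

```python
import numpy as np

def lpc_to_lpcc(a, n_ceps):
    """Convert LPC coefficients a[0..p-1] into n_ceps cepstral coefficients.

    Recursion: c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k} for n <= p,
    and c_n = sum_{k=n-p}^{n-1} (k/n) c_k a_{n-k} for n > p.
    """
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c
```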

24 Parameterization of the Speech Activity Parameterization of the speech consists in transforming the speech into a set of parameters that describe it. Since speech is basically non-stationary, these parameters are usually extracted on a frame-by-frame basis. This way a set of parameters (modeled as random variables) is provided, and a set of samples is available for them. Parameters can be divided into: - frequency-domain parameters (LPC, LPCC, MFCC, STFT coefficients, FrFT coefficients, bandwidth, spectral roll-off, etc.); - time-domain parameters (short-time average energy or magnitude, zero-crossing rate, autocorrelation function, magnitude difference function, pitch, formants, etc.); - low-level features (i-th harmonic total amplitude, residual amplitude, i-th harmonic amplitude, disharmonicity, spectrum brightness or centroid, etc.). We usually obtain a feature vector for each frame or group of frames, yielding a matrix whose rows vary with the frame and whose columns are the different features. A multivariate approach [linear/nonlinear PCA, ICA, CCA, HMM] then performs dimensionality reduction ahead of speech recognition or emotion recognition (see the sketch below).
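A minimal sketch of that frame-by-feature matrix followed by PCA-based dimensionality reduction with scikit-learn (the random matrix and the 16-feature layout stand in for real per-frame parameters):

```python
import numpy as np
from sklearn.decomposition import PCA

# Rows vary with the frame, columns are the different features
# (e.g. 13 MFCCs + pitch + zero-crossing rate + energy = 16 columns).
frames = np.random.randn(500, 16)

# Multivariate step: linear PCA keeping the top components,
# to be fed into a recognition back-end (GMM, HMM, ANN, ...).
pca = PCA(n_components=8).fit(frames)
reduced = pca.transform(frames)                 # shape (500, 8)
print(pca.explained_variance_ratio_.sum())      # variance retained
```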

25 General model for speech/speaker recognition. • Recognition of a speaker, regardless of the talk or the language used. • Recognition of a speech, regardless of the speaker. • Classification of emotional states, regardless of the speech. • Classification of emotional states, regardless of the speech and the speaker. [Block diagram: speaker/word → pre-processing → parameterization → classifier (GMM / ICA / HMM / ANN, trained on a training set) → speaker recognition, word/phoneme recognition, language recognition, or emotion classification.]
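A hedged end-to-end sketch of this model as a per-speaker GMM classifier (every stage simplified; the random matrices stand in for pre-processed, parameterized utterances):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Training set: parameterized frames (e.g. MFCCs) for each known speaker.
train = {"speaker_a": np.random.randn(800, 13),
         "speaker_b": np.random.randn(800, 13) + 0.5}

# One diagonal-covariance GMM per speaker, as in the general model.
models = {name: GaussianMixture(n_components=4, covariance_type="diag",
                                random_state=0).fit(feats)
          for name, feats in train.items()}

def identify(frames):
    """Pick the speaker whose model gives the highest log-likelihood."""
    return max(models, key=lambda name: models[name].score(frames))

print(identify(np.random.randn(100, 13) + 0.5))  # likely "speaker_b"
```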

