Presentation on theme: "(Artificial Neural Networks)"— Presentation transcript:
1 (Artificial Neural Networks) Term Project
Close-Set, Text-Dependent, Twelve (12) People, Speaker Identification Using ANN (Artificial Neural Networks)
Jay Desai, Kuang-Tao Chiao
2 The structure of Pattern Recognition System
Introduction
Overview: Closed Set/Open Set; Text Dependent/Text Independent; Speaker Identification/Speaker Verification
Speaker recognition and speech recognition are subsets of a more general area known as pattern recognition. Given the features that describe the properties of an object, a pattern recognition system aims to recognize the object based on its previous knowledge of the object. Three stages are generally involved in building a pattern recognition system: training, testing and implementation.
In the training stage, a set of parameters of the model is estimated so that, in some sense, the model learns the correspondence between the features and the labels of the objects. One such learning criterion is to minimize the overall estimation error. In the testing stage, the parameters of the model are then adjusted using a set of cross-validation data to achieve good generalization of the system's performance. The cross-validation data usually consists of a set of features and labels that are different from the training data. The task of recognition is carried out in the implementation stage, where a feature with an unknown label is passed through the system and assigned a label at the output.
A pattern recognition system basically consists of a front-end feature extractor and a classifier. The feature extractor normalizes the collected data and transforms them to the feature space. In feature space, the data are compressed and represented in such a way that objects from the same class behave similarly and a clear distinction exists among objects from different classes. The classifier takes the features computed by the feature extractor and performs either template matching or probabilistic likelihood computation on them, depending on the type of algorithm employed.
Before it can be used for classification, the classifier has to be trained so that a mapping from the feature to the label of a particular class is established. The implicit assumption of this approach is that the training and testing conditions are comparable.
The closed-set problem is to identify a speaker from a group of N known speakers. Naturally, the larger N is, the more difficult the task. Our system can identify 12 speakers. The speaker that scores best on the test utterance is identified. Alternatively, one may want to decide whether the speaker of a test utterance belongs to a group of N known speakers. This is called the open-set problem, since the speaker to be identified may not be one of the N known speakers. If a speaker scores well enough on the basis of a test utterance, then the target speaker is accepted as being known.
By text independent we mean that the identification procedure should work for any text in either training or testing. This is a different problem from text-dependent recognition, where the text in both training and testing is the same or is known.
Speaker verification is a special case of the open-set problem and refers to the task of deciding whether a speaker is who he or she claims to be. Often, however, speaker verification systems must verify not only the voice but also the text, using a speech recognizer, in order to prevent impostors from using recordings.
[Figure: The structure of Pattern Recognition System. data -> Feature Extractor -> feature -> Classifier -> recognized label]
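The difference between the closed-set and open-set decisions described above can be sketched in a few lines. This is only an illustration: the function names, the score values and the threshold are hypothetical and are not taken from the project.

```python
def identify_closed_set(scores):
    """Closed set: return the index of the best-scoring known speaker."""
    return max(range(len(scores)), key=lambda i: scores[i])


def identify_open_set(scores, threshold):
    """Open set: accept the best-scoring speaker only if the score is
    high enough; otherwise report the speaker as unknown (None)."""
    best = identify_closed_set(scores)
    return best if scores[best] >= threshold else None


# Hypothetical classifier scores for four known speakers.
scores = [0.05, 0.10, 0.70, 0.15]

closed_choice = identify_closed_set(scores)      # always picks someone
open_accept = identify_open_set(scores, 0.5)     # best score passes
open_reject = identify_open_set(scores, 0.9)     # best score too low
```

In the closed-set case some speaker is always returned, whereas the open-set decision needs a threshold, whose value would have to be tuned on held-out data.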
3 System Architecture Block Diagram
[Block diagram: Speaker #1 through Speaker #12, each speaking the password ten times (first time he speaks ... tenth time he speaks) -> frame extraction based on short-time energy (3 frames/vowel, 12 frames) -> LPC cepstral coefficients (p = 14; 14 coefficients/frame, 14*12 = 168*1) -> parameter optimization of neural networks]
An important step in the speaker identification process is to extract sufficient information for good discrimination and, at the same time, to capture that information in a form and size that is amenable to effective modeling. The amount of data generated by even short utterances is quite large. Speech is digitized at a rate of 8 kHz or higher using 8 bits or more per sample, requiring tens of thousands of bytes for a few seconds.
Whereas these large amounts of information are needed to characterize the speech waveform, the essential characteristics of the speech process change relatively slowly, permitting a representation requiring significantly less data. Speech signals can be parameterized over relatively long time periods, on the order of tens of milliseconds, called frames. If the speech from a 20 ms frame can be reduced to, say, a 14-dimensional vector, then a data reduction ratio of 160/14 = 11.4 is achieved at an 8 kHz sampling rate (a 20 ms frame at 8 kHz contains 160 samples).
The process of reducing data while retaining classification information falls under the general heading of feature extraction. The vectors extracted are termed features. The n-dimensional feature space is referred to as the speaker space.
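The frame arithmetic and the idea of picking frames by short-time energy can be sketched as follows. The slides do not say exactly how frames are chosen, so selecting the highest-energy non-overlapping frames is an assumption, as are the synthetic test signal and all function names.

```python
import math

SAMPLE_RATE = 8000   # Hz, as on the slide
FRAME_MS = 20        # frame length used in the slide's example
N_COEFFS = 14        # coefficients kept per frame

frame_len = SAMPLE_RATE * FRAME_MS // 1000   # samples per frame
reduction_ratio = frame_len / N_COEFFS       # samples per coefficient


def short_time_energy(signal, frame_len):
    """Sum of squared samples in each non-overlapping frame."""
    n_frames = len(signal) // frame_len
    return [
        sum(x * x for x in signal[i * frame_len:(i + 1) * frame_len])
        for i in range(n_frames)
    ]


def top_energy_frames(signal, frame_len, count):
    """Indices (in time order) of the `count` highest-energy frames."""
    energies = short_time_energy(signal, frame_len)
    ranked = sorted(range(len(energies)), key=lambda i: energies[i], reverse=True)
    return sorted(ranked[:count])


# Synthetic signal: ten frames of a 200 Hz tone, loud only in frames 3..5.
signal = [
    (1.0 if 3 <= n // frame_len < 6 else 0.01)
    * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE)
    for n in range(10 * frame_len)
]
selected = top_energy_frames(signal, frame_len, 3)
```

Here `selected` picks out the three loud frames, mirroring the "3 frames/vowel" figure in the block diagram; a real system would likely use overlapping, windowed frames.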
7 Password /u/ /i/ /æ/ /a/
Why the choice of password? Vowel Plane; The phonetician's vowel trapezium
After trial and error we found that 4 vowels are required for 12 speakers. We chose the above password because, as the vowel plane shows, the four vowels lie in different regions of the plane and are hence easy to distinguish.
The most important class of articulatory gestures in any language is the vowels. These are static gestures, that is, fixed positions of the articulators, normally made using voiced excitation. It has been shown that two degrees of freedom of tongue movement can account for most of the variation in vowels. This simplification implies that for vowels the tongue is used only in its forward/backward and raised/lowered modes of movement, the tongue shape remaining fixed. With these two degrees of freedom the vowels can be plotted as points in a plane whose axes represent actual tongue positions.
The phonetician's version of this diagram is the vowel trapezium. The axes of the figure are back/front and open/close, referring to the position of the narrowest part of the oral tract. The vowel trapezium is based on a subjective assessment, and is a useful representation of relative vowel quality.
There are 2 additional factors which complicate the characterization of vowels: lip position and degree of nasality. These constitute, in effect, two extra dimensions. Fortunately these 2 variables do not seem to be continuous to any great extent. Thus vowels tend to be perceived as either nasalised or not. Similarly, only 3 lip positions seem to have perceptual significance: rounded, lax and tense. Thus for each point in the graphs there are 6 possible conditions of the lip/nasality variable.
[Figure: The phonetician's vowel trapezium. Horizontal axis: Front, Central, Back. Vertical axis: Close, Half-close, Half-open, Open. Tongue movements: front raising, front lowering, back raising, back lowering. Vowels marked: u, i, a, æ.]
8 Linear Predictive Coding
Why LP analysis?; Feature Extraction; Computational aspects; LPC Cepstrum
It is well known that vowels involve vocal-tract configurations that are acoustically resonant and are therefore appropriately modeled by all-pole structures. An interesting fact: the human ear is fundamentally phase deaf. Whatever information is aurally gleaned from speech is extracted from its magnitude spectrum. Further, the magnitude spectrum, but not the phase spectrum, can be modeled with stable poles. Therefore the LP model can preserve the magnitude spectral dynamics in speech, but might not retain the phase characteristics. If the objective is to code, store, resynthesize and so on, the LP model is perfectly valid and useful.
The general feature extraction step of interest here can be divided into two parts. First, LP analysis of speech is carried out to produce a set of predictor coefficients. Second, the predictor coefficients are transformed into feature vectors.
The single all-pole transfer function is
$$H(z) = \frac{G}{1 - \sum_{i=1}^{p} a_i z^{-i}}$$
With this transfer function we get a difference equation for synthesizing the speech samples s(n):
$$s(n) = \sum_{i=1}^{p} a_i s(n-i) + G u(n)$$
s(n) is predicted as a linear combination of the previous p samples. Therefore, the speech production model is often called the LP model. The predictor coefficients describing the autoregressive model must be computed from the speech signal. Since speech is time varying, in that the vocal-tract configuration changes over time, an accurate set of predictor coefficients is adaptively determined over short intervals called frames, during which time invariance is assumed. The gain G is usually ignored to allow the parameterization to be independent of the signal intensity.
The autocorrelation approach is based on minimizing the mean-square value of the estimation error
$$e(n) = s(n) - \sum_{i=1}^{p} a_i s(n-i)$$
The MSE is minimized over N samples. It is assumed that speech samples are identically zero outside the frame of interest.
If the autocorrelation of the signal is
$$r(k) = \sum_{n=0}^{N-1-k} s(n) s(n+k)$$
then, with R the autocorrelation matrix, a the predictor coefficient vector, and r the vector of autocorrelation coefficients, the predictor coefficients satisfy
$$R a = r$$
The Levinson-Durbin recursion is used to solve this system of equations. Once H(z) has been solved for, its magnitude response represents the spectral envelope of the speech.
The LPC cepstrum is the cepstrum of the all-pole model obtained from the autocorrelation sequence of a speech frame.
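The computation outlined above (autocorrelation, Levinson-Durbin solution of Ra = r, conversion to cepstral coefficients) can be sketched as below. This is a minimal illustration, not the project's code. The function names are hypothetical. The LPC-to-cepstrum step uses the standard recursion c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k}, which is not stated on the slides, and the gain term is ignored, as the text suggests.

```python
def autocorrelation(frame, max_lag):
    """r(k) = sum_{n=0}^{N-1-k} s(n) s(n+k) for k = 0..max_lag."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + k] for i in range(n - k))
        for k in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve R a = r for a_1..a_p; returns (coefficients, residual error).
    Assumes r[0] > 0 and a non-degenerate autocorrelation sequence."""
    a = [0.0] * (order + 1)   # a[0] is unused so indices match the math
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for model order i.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err


def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c_1..c_n_ceps of 1 / (1 - sum a_i z^-i)."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] (the gain term) is left at zero
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]


# Autocorrelation of an idealized first-order process with a_1 = 0.5.
r = [1.0, 0.5, 0.25]
coeffs, residual = levinson_durbin(r, 2)
cepstrum = lpc_to_cepstrum(coeffs, 3)
```

For this toy input the recursion recovers a_1 = 0.5 and a_2 = 0, and the cepstral coefficients follow 0.5^n / n, which is the series expansion of -ln(1 - 0.5 z^-1). With p = 14 the same functions would yield the 14 coefficients per frame mentioned in the system architecture.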
11 Potential Applications
Meetings, Conferences, Conversations; Law enforcement; Security applications; Human-Machine Interface; Gender recognition; Others
The potential for applications of speaker recognition systems exists any time speakers are unknown and their identities are important.
In meetings, conferences, or conversations the technology makes machine identification of participants possible.
In law enforcement, speaker recognition systems can be used to help identify suspects.
Security applications abound. Access to cars, buildings, bank accounts and other services may be voice controlled in the future.
The technology has applications to human-machine interfaces, where intelligent machines would be programmed to adapt and respond to the current user.
Gender recognition based on a variant of speaker recognition techniques is already in use in many speaker-independent speech recognizers to improve performance.
The above list is by no means complete, but provides an indication of the types and variety of applications.
12 Scope of Improvement
Robustness: Additive Noise; Co-channel Interference; Increasing the number of users
Additive Noise
The random noise arising from the background and the fluctuation of the transmission channel is generally assumed to be additive white noise (AWN). The noisy observation of the original speech signal is
$$s'(n) = s(n) + q(n), \quad E[q(n)] = 0, \quad E[q^2(n)] = \sigma^2$$
The predictor coefficients of the noise-corrupted speech are
$$a' = (R_s + \sigma^2 I)^{-1} R_s a$$
Thus the addition of AWN to the speech is equivalent to taking a linear transformation of the predictor coefficients. The linear transformation depends on the autocorrelation of the speech and thus, in a spectrum-based model, all the spectrally similar predictors will be mapped by a similar linear transform.
Co-channel Interference
The co-channel interference due to a second speaker can also be interpreted as an affine transformation. In the case of interference due to another speaker talking on the same channel, the observed signal will be
$$s = s_1 + s_2$$
Again, the co-channel interference carries out an affine transformation on the predictor coefficients.
Our method is not very robust across a wide variety of environmental conditions. However, a better model that is robust and computationally tractable can be realized. The LP coefficients are converted to cepstral features. Effort can be directed toward finding features that achieve very high recognition performance, especially under severe channel conditions and very low signal-to-noise ratio.
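A small numerical illustration of the additive-noise relation a' = (R_s + σ²I)^{-1} R_s a, for predictor order p = 2 so that the 2x2 system can be solved in closed form. The autocorrelation values and noise variance are made up for illustration. The sketch also assumes an expected-value (normalized) autocorrelation, so that white noise adds σ² only at lag 0.

```python
def noisy_predictor_order2(r0, r1, r2, sigma2):
    """Solve (R_s + sigma2 * I) a' = r_s for p = 2, where
    R_s = [[r0, r1], [r1, r0]] and r_s = [r1, r2] (equal to R_s a)."""
    d0 = r0 + sigma2                 # the noise only changes the diagonal
    det = d0 * d0 - r1 * r1
    a1 = (d0 * r1 - r1 * r2) / det
    a2 = (d0 * r2 - r1 * r1) / det
    return a1, a2


# Made-up clean-speech autocorrelation values r(0), r(1), r(2).
r0, r1, r2 = 1.0, 0.5, 0.25

clean = noisy_predictor_order2(r0, r1, r2, 0.0)   # no noise
noisy = noisy_predictor_order2(r0, r1, r2, 1.0)   # noise power equal to r(0)
```

With no noise the clean coefficients (0.5, 0) are recovered. With noise the coefficients change to roughly (0.233, 0.067), showing how the same speech maps to different predictor coefficients once noise is added.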