1 Speaker Authentication
Qi Li and Biing-Hwang Juang, Pattern Recognition in Speech and Language Processing, Chap. 7
Reporter: Chang Chih Hao

2 Outline
– Introduction
– Pattern Recognition Techniques
– Speaker Verification System
– Verbal Information Verification
– Combining SV and VIV

3 Introduction
To ensure the security of, and proper access to, private information, important transactions, and computer and communication networks, passwords and personal identification numbers (PINs) have been used extensively in daily life. To further enhance both security and convenience, biometric features have also been considered. Among all biometric features, a person's voice is the most convenient for personal identification because it is easy to produce, capture, and transmit over the ubiquitous telephone network, and it is supported without requiring special devices.

4 Introduction
Speaker recognition:
– Speaker verification: verify whether an unknown speaker is the person s/he claims to be, i.e. a yes-no hypothesis-testing problem.
– Speaker identification: the process of associating an unknown speaker with a member of a pre-registered group, i.e. a multiple-choice classification problem.

5 Introduction
There are two approaches to speaker authentication:
– Speaker verification (SV) verifies a speaker's identity based on his/her voice characteristics.
– Verbal information verification (VIV) verifies a speaker's identity through verification of the content of his/her utterances.

6 Introduction: Speaker Verification
Two sessions (a direct method):
– Enrollment session: the user's identity, together with a pass-phrase, is assigned to the speaker, and a speaker-dependent (SD) model that registers the speaker's speech characteristics is trained.
– Test session: the speaker claims his/her identity, and the system prompts the speaker to say the pass-phrase; the pass-phrase utterance is compared against the stored SD model.
Obviously, successful verification of a speaker relies upon correct recognition of the speech input.

7 Introduction: Speaker Verification

8 Introduction: Speaker Verification
Enrollment is an inconvenience to the user as well as to the system developer, who often has to supervise and ensure the quality of the collected data. The quality of the collected training data has a critical effect on the performance of an SV system:
– The speaker may make a mistake.
– There may be an acoustic mismatch between the training and testing environments.

9 Introduction: Verbal Information Verification
VIV is the process of verifying spoken utterances against the information stored in a given personal data profile. A VIV system may use a dialogue procedure to verify a user by asking questions (an indirect method).
Differences between SV and VIV:
– Model: SD model vs. acoustic-phonetic models
– Stored data: voice data vs. personal data profile
– Rejecting an impostor: pre-trained SD model vs. the user's responsibility

10 Introduction: Verbal Information Verification

11 Pattern Recognition: Bayesian Decision Theory
M-class recognition problem:
– Given an observation o and a set of classes designated as {C_1, C_2, ..., C_M}.
– We are asked to make a decision to classify o into class C_i; denote this as action α_i.

12 Pattern Recognition: Bayesian Decision Theory
The zero-one loss function describes the loss incurred for taking action α_i when the true class is C_j:
λ(α_i | C_j) = 0 if i = j, and 1 otherwise.
The expected loss associated with taking action α_i is
R(α_i | o) = Σ_j λ(α_i | C_j) P(C_j | o) = 1 − P(C_i | o).

13 Pattern Recognition: Bayesian Decision Theory
Minimum-error-rate classification:
– To minimize the expected loss, we take the action α_i that maximizes the posterior probability P(C_i | o).
For a sequence of observations:
– Assume the observations are independent and identically distributed (i.i.d.), so the class likelihood of the sequence is the product of the per-observation likelihoods.
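The decision rule above can be sketched in a few lines. This is a minimal illustration, not code from the chapter: the per-frame log-likelihoods and uniform priors below are placeholder values.

```python
import numpy as np

def map_decision(log_likelihoods, log_priors):
    """log_likelihoods: (T, M) array of log p(o_t | C_i);
    log_priors: (M,) array of log P(C_i).
    Returns the index of the class with the highest posterior."""
    # Under the i.i.d. assumption the sequence log-likelihood is a sum
    # over frames; adding the log prior gives an unnormalized posterior.
    scores = log_likelihoods.sum(axis=0) + log_priors
    return int(np.argmax(scores))

ll = np.array([[-1.0, -0.2, -0.5]] * 4)           # four identical frames
print(map_decision(ll, np.log(np.full(3, 1/3))))  # → 1
```

With uniform priors this reduces to maximum-likelihood classification, which is the form used for speaker identification later in the chapter.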

14 Pattern Recognition: Stochastic Models for a Stationary Process
Gaussian mixture model (GMM):
– Characterizes speech probability density functions (pdfs) as a weighted sum of Gaussian components, p(o) = Σ_k w_k N(o; μ_k, Σ_k), with Σ_k w_k = 1.

15 Pattern Recognition: Stochastic Models for a Stationary Process
The GMM parameters can be estimated iteratively using the Baum-Welch or EM algorithm.
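As a hedged sketch of how such a model is used for the speaker-identification application discussed next: fit one GMM per speaker with EM (here via scikit-learn's `GaussianMixture`, not the chapter's own implementation), then pick the speaker whose model gives the highest average log-likelihood. The 2-D synthetic "features" stand in for real cepstral vectors.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
train = {
    "alice": rng.normal(0.0, 1.0, size=(200, 2)),
    "bob":   rng.normal(3.0, 1.0, size=(200, 2)),
}
# One GMM per speaker, trained by EM.
models = {spk: GaussianMixture(n_components=4, random_state=0).fit(X)
          for spk, X in train.items()}

test_frames = rng.normal(3.0, 1.0, size=(50, 2))   # actually from "bob"
# score() returns the average per-frame log-likelihood.
scores = {spk: m.score(test_frames) for spk, m in models.items()}
print(max(scores, key=scores.get))  # → bob
```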

16 Pattern Recognition: Stochastic Models for a Stationary Process
One application of the above model is context-independent speaker identification, where we assume that each speaker's speech is characterized only acoustically and is represented by one class. When a spoken utterance is long enough, it is reasonable to assume that its acoustic characteristics are independent of its content.

17 Pattern Recognition: Stochastic Models for a Non-Stationary Process
Hidden Markov model (HMM):
– Applied to characterize both the temporal structure and the corresponding statistical variations along the parameter trajectory of an utterance.
– An N-state, left-to-right model.
– Within each state, a GMM is used to characterize the observed speech feature vectors as a multivariate distribution.
– Three parameter sets: A, the state transition probabilities; B, the observation densities; π, the initial state probabilities.

18 Pattern Recognition: Stochastic Models for a Non-Stationary Process

19 Speaker Verification System

20 Speaker Verification System

21 Speaker Verification System
Test session:
– After a speaker claims his/her identity, the system expects the user to speak the same phrase as in the enrollment session.
– The voice waveform is converted to the feature representation.
– The forced-alignment block: a sequence of speaker-independent phoneme models is constructed, and the model sequence is then used to segment and align the feature vector sequence using the Viterbi algorithm.
– The cepstral-mean-subtraction block: silence frames are removed, and the mean vector is computed from the remaining speech frames.
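The forced-alignment step can be sketched as a Viterbi search over a left-to-right model. This is a toy illustration with made-up per-frame likelihoods, not the system's actual decoder: from each state the path may only stay or advance by one state.

```python
import numpy as np

def viterbi_left_to_right(log_b):
    """log_b: (T, N) per-frame log observation likelihoods.
    Returns the best state sequence (the segmentation)."""
    T, N = log_b.shape
    delta = np.full((T, N), -np.inf)     # best score ending in each state
    psi = np.zeros((T, N), dtype=int)    # backpointers
    delta[0, 0] = log_b[0, 0]            # must start in state 0
    for t in range(1, T):
        for s in range(N):
            prev = delta[t - 1, max(s - 1, 0):s + 1]   # stay or advance
            k = int(np.argmax(prev))
            psi[t, s] = max(s - 1, 0) + k
            delta[t, s] = prev[k] + log_b[t, s]
    path = [N - 1]                        # must end in the final state
    for t in range(T - 1, 0, -1):
        path.append(psi[t, path[-1]])
    return path[::-1]

log_b = np.log(np.array([[0.9, 0.1, 0.1],
                         [0.8, 0.2, 0.1],
                         [0.2, 0.9, 0.1],
                         [0.1, 0.2, 0.9]]))
print(viterbi_left_to_right(log_b))  # → [0, 0, 1, 2]
```

The returned state sequence gives the frame-to-phoneme segmentation that the cepstral-mean-subtraction block then uses to discard silence frames.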

22 Speaker Verification System
Fixed-phrase system:
– A user-selected phrase is easy to remember.
– Performs better than a text-prompted system.
Model:
– SD left-to-right HMM.
– Whole-word or whole-phrase models.
Feature extraction:
– Sampled at 8 kHz.
– 30 ms frames, overlapping 10 ms.
– Pre-emphasis and Hamming windowing.
– 10th-order LPC.
– Converted to cepstral coefficients plus delta cepstral coefficients (24 dimensions).
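The framing and windowing steps above can be sketched as follows. A 10 ms frame shift and a 0.97 pre-emphasis coefficient are assumed here (common conventions, not stated on the slide; "overlapping 10 ms" could also be read as a 20 ms shift), and the LPC-cepstrum computation itself is omitted.

```python
import numpy as np

fs = 8000
frame_len, frame_shift = int(0.030 * fs), int(0.010 * fs)  # 240 and 80 samples

signal = np.random.default_rng(0).normal(size=fs)          # 1 s of fake audio
# First-order pre-emphasis filter: y[t] = x[t] - 0.97 * x[t-1]
emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
frames = np.stack([emphasized[i * frame_shift : i * frame_shift + frame_len]
                   for i in range(n_frames)])
frames = frames * np.hamming(frame_len)                    # window each frame
print(frames.shape)  # → (98, 240)
```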

23 Speaker Verification System
Test session:
– Target score computation
– Background score computation
– Likelihood-ratio test
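The three steps above reduce to a single comparison. This sketch uses placeholder per-frame log-likelihoods and a zero threshold; in the real system the target score comes from the claimed speaker's SD HMM and the background score from a background (or cohort) model.

```python
import numpy as np

def verify(target_loglik, background_loglik, threshold=0.0):
    """Accept the claimed identity when the mean log-likelihood ratio
    (target minus background) exceeds the threshold."""
    ratio = np.mean(target_loglik) - np.mean(background_loglik)
    return bool(ratio > threshold)

print(verify(np.array([-2.1, -1.9, -2.0]),
             np.array([-3.0, -2.8, -3.2])))  # → True
```

The threshold trades off false rejections against false acceptances; setting it where the two error rates meet gives the equal-error rate reported in the experiments.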

24 Speaker Verification System: Experiments
Database:
– Training: 100 speakers (51 male, 49 female); average utterance length of 2 seconds; five utterances from each speaker recorded in one enrollment session.
– Testing: 50 utterances recorded from each true speaker in different sessions, and 200 utterances recorded from the 51 or 49 same-gender impostors in different sessions.

25 Speaker Verification System: Experiments
Models:
– SD model: left-to-right HMMs; the number of states depends on the total number of phonemes in the phrase; 4 Gaussian mixture components are associated with each state.
– SI model: 43 HMMs corresponding to 43 phonemes; 3 states per model, 32 mixture components per state; a common variance is shared by all Gaussians.
– Adaptation: the second, fourth, sixth, and eighth test utterances from the true speaker, recorded at different times, are used to update the means and mixture weights of the SD HMM for verifying subsequent test utterances.

26 Speaker Verification System: Experiments
In general, the longer the pass-phrase, the higher the accuracy. Actual system performance would be better when users choose their own, most likely different, pass-phrases.

27 Verbal Information Verification
– Automatic speech recognition (ASR): the spoken input is transcribed into a sequence of words, and the transcribed words are then compared to the information pre-stored in the claimed speaker's personal profile.
– Utterance verification (UV): the spoken input is verified against an expected sequence of words or subwords, taken from the personal data profile of the claimed individual.

28 Verbal Information Verification
Utterance verification (single question):
– Keyword spotting and non-keyword rejection.
Three key modules:
– Utterance segmentation by forced decoding
– Subword testing
– Utterance-level confidence measure

29 Verbal Information Verification

30 Verbal Information Verification
Utterance segmentation:
– Each piece of key information is represented by a sequence of words, S, which in turn is equivalently characterized by a concatenation of a sequence of phonemes or subwords, where N is the total number of subwords in the keyword sequence.

31 Verbal Information Verification
Subword hypothesis testing:
– H_0: the observed speech O_n consists of the actual sound of subword S_n.
– H_1: the alternative hypothesis.
– Target model: trained using the data of subword S_n.
– Anti-HMMs: trained using the data of a set of competing subwords.

32 Verbal Information Verification
Confidence measure calculation:
– Decisions are made at both the subword and the utterance level. At the subword level, a likelihood-ratio test can be conducted to accept or reject each subword. At the utterance level, a simple utterance score can be computed as the percentage of accepted subwords.
– A normalized confidence measure can also be used.
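The two-level decision can be sketched in a few lines. The per-subword ratios and the 0.5 threshold below are illustrative placeholders, not values from the chapter.

```python
def utterance_confidence(subword_ratios, subword_threshold):
    """Subword level: accept each subword whose likelihood ratio clears
    the threshold. Utterance level: fraction of accepted subwords."""
    accepted = [r > subword_threshold for r in subword_ratios]
    return sum(accepted) / len(accepted)

ratios = [1.2, 0.4, 2.0, -0.3, 0.9]       # per-subword log-likelihood ratios
print(utterance_confidence(ratios, 0.5))  # → 0.6
```

The utterance is then accepted when this score clears a second, utterance-level threshold.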

33 Verbal Information Verification
Sequential utterance verification:
– Definition 1: the false rejection error on J utterances is the error made when the system rejects a correct response in any one of the J hypothesis subtests.
– Definition 2: the false acceptance error on J utterances is the error made when the system accepts an incorrect set of responses after all J hypothesis subtests.
– Definition 3: the equal-error rate on J utterances is the rate at which the false rejection and false acceptance error rates on J utterances are equal.

34 Verbal Information Verification
Example:
– A bank operator usually asks two kinds of personal questions when verifying a customer. When automatic VIV is applied to this procedure, suppose the average individual error rates on the two subtests are ε_r(1) = 0.1%, ε_a(1) = 5% and ε_r(2) = 0.2%, ε_a(2) = 6%, respectively. The sequential-test error rates are then E_r(2) ≈ 0.3% and E_a(2) = 0.3%.
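The example's numbers follow directly from the definitions above: a false rejection occurs if ANY subtest rejects the true speaker, while a false acceptance requires EVERY subtest to accept the impostor.

```python
def sequential_errors(eps_r, eps_a):
    """eps_r, eps_a: per-subtest false-rejection / false-acceptance rates.
    Returns (E_r(J), E_a(J)) for the sequential test."""
    pass_all = 1.0
    fool_all = 1.0
    for r, a in zip(eps_r, eps_a):
        pass_all *= (1.0 - r)  # true speaker must pass every subtest
        fool_all *= a          # impostor must fool every subtest
    return 1.0 - pass_all, fool_all

er, ea = sequential_errors([0.001, 0.002], [0.05, 0.06])
print(round(er * 100, 2), round(ea * 100, 2))  # → 0.3 0.3
```

This is why the sequential test is so effective: individual false-acceptance rates of several percent multiply down to a fraction of a percent, while the false-rejection rates only add approximately.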

35 Verbal Information Verification
VIV experiments: database
– 26% of the speakers have birth years in the 1950s and 24% in the 1960s.
– Among the city and state names, 39% are "New Jersey", and 5% of the speakers used exactly the same answer.
– 38% of the telephone numbers start with "908 582...", which means that at least 60% of the digits in those answers are identical.
– Each speaker is also used as an impostor, with his/her utterances verified against other speakers' profiles. Thus, for each true speaker, we have three utterances from the speaker and 99 × 3 utterances from the other 99 speakers as impostors.

36 Verbal Information Verification
VIV experiments:
– Features: 12 LPC cepstral coefficients + 12 delta + 12 delta-delta (39).
– Models: the target phone models are 1117 right-context-dependent HMMs; the anti-models are 41 context-independent anti-phone HMMs.
– Three sequential subtests (J = 3):
  – "In which year were you born?"
  – "In which city and state did you grow up?"
  – "May I have your telephone number, please?"

37 Verbal Information Verification

38 Speaker Authentication by Combining SV and VIV
Motivation:
– SV: users often make mistakes during enrollment.
– VIV: no speaker-specific voice characteristics are used in the verification process.
Procedure:
– The uttered pass-phrase must pass the VIV tests; otherwise, the user is prompted to repeat it.
– Verified utterances of the pass-phrase are then saved and used to train an SD model for SV.
– The authentication system can then be switched from VIV to SV.
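The procedure above can be sketched as a simple collection loop. `viv_verify` and `train_sd_model` are hypothetical stand-ins for the real VIV and HMM-training components, and the requirement of five verified utterances mirrors the training setup in the experiments.

```python
def enroll_via_viv(utterances, viv_verify, train_sd_model, needed=5):
    """Collect pass-phrase utterances that pass VIV; once enough verified
    data accumulates, train the SD model and switch from VIV to SV."""
    verified = []
    for utt in utterances:
        if viv_verify(utt):          # content check against the profile
            verified.append(utt)     # keep only verified training data
        if len(verified) == needed:
            return train_sd_model(verified)
    return None                      # not enough verified data yet

model = enroll_via_viv(
    ["u1", "bad", "u2", "u3", "u4", "u5"],
    viv_verify=lambda u: u != "bad",
    train_sd_model=lambda data: ("SD-HMM", len(data)),
)
print(model)  # → ('SD-HMM', 5)
```

Because rejected utterances are simply discarded and re-prompted, mistaken or mismatched recordings never enter the SD training set.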

39 Speaker Authentication by Combining SV and VIV

40 Speaker Authentication by Combining SV and VIV: Experiments
– Training data: 100 speakers (51 male, 49 female). The fixed phrase, common to all speakers, is "I pledge allegiance to the flag" (about 2 seconds). Five utterances of the pass-phrase, recorded in five separate VIV sessions (i.e. in different environments and over different channels), were used to train the SD HMM.
– Test data: 40 utterances recorded from each true speaker in different sessions, and 192 utterances recorded from 50 same-gender impostors in different sessions. For model adaptation, the second, fourth, sixth, and eighth test utterances from the tested true speaker were used to update the associated HMMs incrementally for verifying subsequent test utterances.

41 Speaker Authentication by Combining SV and VIV

42 Speaker Authentication by Combining SV and VIV
Advantages:
– The system is convenient for users.
– The acoustic mismatch problem is mitigated to a certain degree.
– The quality of the training data is ensured.
– Better authentication performance.

43 Summary
Theoretical foundations:
– Bayesian decision theory
– Hypothesis testing
Speaker verification:
– Verifies speakers by their voice characteristics.
– A fixed phrase gives good performance (easy to remember and convenient to use).
Verbal information verification:
– Verifies a speaker by the verbal content of his/her responses.
– Achieves very good accuracy by applying a sequential verification technique.
– It is the users' responsibility to protect their personal information from impostors.

44 Summary
SV + VIV:
– User convenience: no formal enrollment session and no waiting for model training.
– System performance: verified training data are collected over different channels and environments, so the acoustic mismatch problem is mitigated.
A good speaker authentication system for real applications could come from a proper integration of speaker verification, verbal information verification, speech recognition, and text-to-speech systems.

45 Thanks

