1 CS 224S / LINGUIST 285 Spoken Language Processing Dan Jurafsky Stanford University Spring 2014 Lecture 15: Speaker Recognition Lots of slides thanks to Douglas Reynolds

2 Why speaker recognition? Access Control: physical facilities, websites, computer networks. Transaction Authentication: telephone banking, remote credit card purchase. Law Enforcement: forensics, surveillance. Speech Data Mining: meeting summarization, lecture transcription. slide text from Douglas Reynolds

3 Three Speaker Recognition Tasks slide from Douglas Reynolds

4 Two kinds of speaker verification Text-dependent: users have to say something specific (easier for the system). Text-independent: users can say whatever they want (more flexible but harder).

5 Two phases to speaker detection slide from Douglas Reynolds

6 Detection: Likelihood Ratio Two-class hypothesis test: H0: X is not from the hypothesized speaker; H1: X is from the hypothesized speaker. Choose the more likely hypothesis via a likelihood ratio test. slide from Douglas Reynolds

7 Speaker ID Log-Likelihood Ratio Score LLR: Λ = log p(X|H1) − log p(X|H0) Need two models: Hypothesized speaker model for H1 Alternative (background) model for H0 slide from Douglas Reynolds
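
A minimal sketch of how this score could be computed, assuming the hypothesized speaker model and background model are GMMs trained elsewhere with scikit-learn; the names speaker_gmm, ubm, mfcc_frames, and theta are illustrative, not from the slides.

import numpy as np
from sklearn.mixture import GaussianMixture

def llr_score(X, speaker_gmm: GaussianMixture, ubm: GaussianMixture) -> float:
    """Average per-frame log-likelihood ratio: log p(X|H1) - log p(X|H0)."""
    log_p_h1 = speaker_gmm.score_samples(X)  # log p(x_t | speaker model) per frame
    log_p_h0 = ubm.score_samples(X)          # log p(x_t | background model) per frame
    return float(np.mean(log_p_h1 - log_p_h0))

# accept the speaker claim if the score exceeds a tuned threshold:
# decision = llr_score(mfcc_frames, speaker_gmm, ubm) > theta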

8 How do we get H0? Pool speech from several speakers and train a single model: a universal background model (UBM). Can train one UBM and use it as H0 for all speakers. Should be trained on speech representing the expected impostor speech: the same type of speech as speaker enrollment (modality, language, channel). Slide adapted from Chu, Bimbot, Bonastre, Fredouille, Gravier, Magrin-Chagnolleau, Meignier, Merlin, Ortega-Garcia, Petrovska-Delacretaz, Reynolds

9 How to compute P(H|X)? Gaussian Mixture Models (GMM): the traditional best model for text-independent speaker recognition. Support Vector Machines (SVM): more recent use of a discriminative model.

10 Form of GMM/HMM depends on application slide from Douglas Reynolds

11 GMMs for speaker recognition A Gaussian mixture model (GMM) represents the feature distribution as a weighted sum of multiple Gaussian distributions. Each Gaussian component i has a mean, covariance, and weight. Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
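
To make the weighted sum concrete, here is a small illustrative sketch of the GMM density; the arguments weights, means, and covs are placeholders for the per-component parameters, not values from the slides.

import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """p(x) = sum_i  w_i * N(x; mu_i, Sigma_i)"""
    return sum(w * multivariate_normal.pdf(x, mean=mu, cov=cov)
               for w, mu, cov in zip(weights, means, covs))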

12 Recognition Systems: Gaussian Mixture Models (figure: example GMM over a two-dimensional feature space, with its parameters labeled) Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

13 Recognition Systems: Gaussian Mixture Models (figure: the model's components and parameters) Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

14 GMM training During training, the system learns about the data it uses to make decisions: a set of features is collected from a speaker (or language or dialect) and a model is fit to them. (figure: training features and the resulting model) Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
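
A hedged sketch of fitting such a speaker GMM with EM using scikit-learn; the feature matrix (frames x dimensions) is assumed to come from an MFCC front end not shown here, and the component count is just one common choice.

from sklearn.mixture import GaussianMixture

def train_speaker_gmm(mfcc_frames, n_components=64):
    # EM fit of a diagonal-covariance GMM to a (frames x dims) feature matrix;
    # this estimates the weights, means, and covariances of every component.
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=100, random_state=0)
    gmm.fit(mfcc_frames)
    return gmm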

15 Recognition Systems for Language, Dialect, Speaker ID In LID, DID, and SID, we train a set of target models, one for each language, dialect, or speaker. (figure: per-target model components and parameters) Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

16 Recognition Systems: Universal Background Model We also train a universal background model representing all speech. (figure: UBM components and parameters) Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

17 Recognition Systems: Hypothesis Test Given a set of test observations, we perform a hypothesis test to determine whether a certain class produced it. Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

18 Recognition Systems: Hypothesis Test (figure: the test observations compared against candidate models) Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

19 Recognition Systems: Hypothesis Test (figure: the test observations scored against the speaker model, "Dan?", and the UBM, "not Dan?") Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

20 More details on GMMs Instead of training the speaker model on only the speaker's data, adapt the UBM to that speaker; this takes advantage of all the data. MAP adaptation: the new mean of each Gaussian is a weighted mix of the UBM mean and the speaker's data. Weigh the speaker more if we have more data: μ_i = α E_i(x) + (1 − α) μ_i, with α = n/(n + 16), where n is the (soft) count of speaker frames assigned to Gaussian i.
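
A sketch of this MAP mean adaptation, assuming the UBM is a scikit-learn GaussianMixture and X is the speaker's feature matrix; only the means are adapted, and the relevance factor of 16 follows the formula above.

import numpy as np

def map_adapt_means(ubm, X, relevance=16.0):
    post = ubm.predict_proba(X)                 # frame-level responsibilities, shape (T, M)
    n = post.sum(axis=0)                        # soft frame count per Gaussian, shape (M,)
    # E_i(x): posterior-weighted mean of the speaker's frames for each Gaussian
    E = (post.T @ X) / np.maximum(n[:, None], 1e-10)
    alpha = n / (n + relevance)                 # alpha = n / (n + 16)
    return alpha[:, None] * E + (1.0 - alpha[:, None]) * ubm.means_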

21 Gaussian mixture models Features are normally MFCCs; can use more dimensions (20 + deltas). UBM (background model): 512–2048 mixtures. Speaker's GMM: 64–256 mixtures. Often combined with other classifiers in a mixture-of-experts.

22 SVM Train a one-versus-all discriminative classifier Various kernels Combine with GMM

23 Other features Prosody Phone sequences Language Model features

24 Doddington (2001) Word bigrams can be very informative about speaker identity

25 Evaluation Metric Trial: are a pair of audio samples spoken by the same person? Two types of errors: False reject = Miss: incorrectly reject a true trial (Type-I error). False accept: incorrectly accept a false trial (Type-II error). Performance is a trade-off between these two errors, controlled by adjustment of the decision threshold. slide from Douglas Reynolds

26 ROC and DET curves slide from Douglas Reynolds P(false reject) vs. P(false accept) shows system performance

27 DET curve slide from Douglas Reynolds Application operating point depends on relative costs of the two errors

28 Evaluation tasks slide from Douglas Reynolds Performance numbers depend on evaluation conditions

29 Rough historical trends in performance slide from Douglas Reynolds

30 Milestones in the NIST SRE Program 1992 – DARPA: limited speaker id evaluation 1996 – First SRE in current series 2000 – AHUMADA Spanish data, first non-English speech 2001 – Cellular data 2001 – ASR transcripts provided 2002 – FBI forensic database 2005 – Multiple languages with bilingual speakers 2005 – Room mic recordings, cross-channel trials 2008 – Interview data 2010 – New decision cost function: lower FA rate region 2010 – High and low vocal effort, aging 2011 – Broad range of conditions, including noise and reverb From Alvin Martin's 2012 talk on the NIST SR Evaluations

31 Metrics Equal Error Rate: easy to understand, but not the operating point of interest. FA rate at a fixed miss rate (e.g., 10%): may be viewed as the cost of listening to false alarms. Decision Cost Function. From Alvin Martin's 2012 talk on the NIST SR Evaluations
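
A minimal sketch of computing the equal error rate from detection scores, assuming numpy arrays of scores for target (same-speaker) and non-target (impostor) trials; the simple threshold sweep is illustrative, not the NIST scoring tool.

import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    miss = np.array([np.mean(target_scores < t) for t in thresholds])    # P(miss) at each threshold
    fa = np.array([np.mean(nontarget_scores >= t) for t in thresholds])  # P(false alarm) at each threshold
    idx = np.argmin(np.abs(miss - fa))        # threshold where the two error rates cross
    return (miss[idx] + fa[idx]) / 2.0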

32 Decision Cost Function C_Det Weighted sum of miss and false alarm error probabilities: C_Det = C_Miss × P_Miss|Target × P_Target + C_FalseAlarm × P_FalseAlarm|NonTarget × (1 − P_Target) Parameters are the relative costs of detection errors, C_Miss and C_FalseAlarm, and the a priori probability of the specified target speaker, P_Target: 1996–2008: C_Miss = 10, C_FalseAlarm = 1, P_Target = 0.01; 2010: C_Miss = 1, C_FalseAlarm = 1, P_Target = 0.001. From Alvin Martin's 2012 talk on the NIST SR Evaluations
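
A direct transcription of the cost function above into code, using the parameter values from the table; the example miss and false-alarm rates in the comment are made up purely for illustration.

def decision_cost(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    # C_Det = C_Miss * P(miss|target) * P(target) + C_FA * P(fa|nontarget) * (1 - P(target))
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

# e.g. 1996-2008 parameters: decision_cost(0.10, 0.02)  -> 0.0298
#      2010 parameters:      decision_cost(0.10, 0.02, c_miss=1.0, p_target=0.001)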

33 Accuracies From Alvin Martin's 2012 talk on the NIST SR Evaluations

34 How good are humans? Survey of 2000 voice IDs made by trained FBI employees: select similarly pronounced words, use spectrograms (comparing formants, pitch, timing), listen back and forth. Evaluated based on "interviews and other evidence in the investigation" and legal conclusions. No decision: 65.2% (1304). Non-match: 18.8% (378), with false rejects = 0.53% (2). Match: 15.9% (318), with false accepts = 0.31% (1). Bruce E. Koenig. 1986. Spectrographic voice identification: A forensic survey. J. Acoust. Soc. Am. 79(6).

35 Speaker diarization Conversational telephone speech: 2 speakers. Broadcast news: many speakers, although often in dialogue (interviews) or in sequence (broadcast segments). Meeting recordings: many speakers, lots of overlap and disfluencies. Tranter and Reynolds 2006

36 Speaker diarization Tranter and Reynolds 2006

37 Step 1: Speech Activity Detection Meetings or broadcast: use supervised GMMs with two models (speech/non-speech), or extra models for music, etc.; then do Viterbi segmentation, possibly with minimum-length constraints or smoothing rules. Telephone: simple energy/spectrum speech activity detection. State of the art: Broadcast: 1% miss, 1-2% false alarm. Meeting: 2% miss, 2-3% false alarm. Tranter and Reynolds 2006
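
A very simple energy-based detector of the kind used for telephone data, shown only as a sketch; the 25 ms frame length (400 samples at 16 kHz) and the 35 dB threshold below the loudest frame are arbitrary assumptions, not values from the slides.

import numpy as np

def energy_vad(signal, frame_len=400, threshold_db=-35.0):
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return energy_db > (energy_db.max() + threshold_db)   # True = frame labeled as speech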

38 Step 2: Change Detection 1. Look at adjacent windows of data. 2. Calculate the distance between them. 3. Decide whether the windows come from the same source. Two common methods: (a) to look for change points within a window, use a likelihood ratio test to see whether the data is better modeled by one distribution or two; if two, insert a change point and start a new window there; if one, expand the window and check again. (b) represent each window by a Gaussian, compare neighboring windows with a KL distance, find peaks in the distance function, and threshold. Tranter and Reynolds 2006
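
A sketch of the second method, modeling each window as a single diagonal-covariance Gaussian and comparing neighbors with a symmetrised KL distance; the diagonal assumption and the variance floor are simplifications for illustration.

import numpy as np

def gaussian_kl_diag(mu1, var1, mu2, var2):
    # KL( N(mu1, var1) || N(mu2, var2) ) for diagonal covariances
    return 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def window_distance(win_a, win_b):
    # win_a, win_b: (frames x dims) feature arrays for two adjacent windows
    mu1, var1 = win_a.mean(axis=0), win_a.var(axis=0) + 1e-6
    mu2, var2 = win_b.mean(axis=0), win_b.var(axis=0) + 1e-6
    return gaussian_kl_diag(mu1, var1, mu2, var2) + gaussian_kl_diag(mu2, var2, mu1, var1)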

39 Step 3: Gender Classification Supervised GMMs If doing Broadcast news, also do bandwidth classification (studio wideband speech versus narrowband telephone speech) Tranter and Reynolds 2006

40 Step 4: Clustering Hierarchical agglomerative clustering: 1. initialize the leaf clusters of the tree with the speech segments; 2. compute pair-wise distances between clusters; 3. merge the closest clusters; 4. update the distances of the remaining clusters to the new cluster; 5. iterate steps 2-4 until the stopping criterion is met. Tranter and Reynolds 2006
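
A hedged sketch of that agglomerative loop, with clusters represented simply by their stacked feature frames; the distance function and stopping threshold are parameters (the window_distance above could stand in), and nothing here is tied to a particular toolkit.

import numpy as np

def agglomerative_diarization(segments, distance, stop_threshold):
    clusters = [np.asarray(s) for s in segments]               # step 1: one leaf cluster per segment
    while len(clusters) > 1:
        dists = {(i, j): distance(clusters[i], clusters[j])    # step 2: pairwise distances
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))}
        (i, j), d = min(dists.items(), key=lambda kv: kv[1])
        if d > stop_threshold:                                 # step 5: stopping criterion
            break
        merged = np.vstack([clusters[i], clusters[j]])         # step 3: merge the closest pair
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]  # step 4
    return clusters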

41 Step 5: Resegmentation Use the final clusters and non-speech models to resegment the data via Viterbi decoding. Goal: refine the original segmentation and fix short segments that may have been removed. Tranter and Reynolds 2006

42 TDOA features For meetings with multiple microphones: Time-Delay-of-Arrival (TDOA) features. Correlate signals from the microphones and figure out the time shift; used to sync up multiple microphones and as a feature for speaker localization (assume the speaker doesn't move, so they stay near the same microphone).
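
An illustrative time-delay estimate between two microphone channels using plain cross-correlation; real systems typically use a variant such as GCC-PHAT, and the sign convention noted in the comment is an assumption of this sketch.

import numpy as np

def tdoa_samples(mic_a, mic_b):
    mic_a, mic_b = np.asarray(mic_a, dtype=float), np.asarray(mic_b, dtype=float)
    corr = np.correlate(mic_a, mic_b, mode="full")
    lag = int(np.argmax(corr)) - (len(mic_b) - 1)   # positive lag: mic_a is delayed relative to mic_b
    return lag                                       # divide by the sample rate to get seconds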

43 Evaluation Systems give start-stop times of speech segments with speaker labels; a non-scoring collar of 250 ms on either side. DER (Diarization Error Rate): missed speech (% of speech in the ground truth but not in the hypothesis) + false alarm speech (% of speech in the hypothesis but not in the ground truth) + speaker error (% of speech assigned to the wrong speaker). Recent mean DER for Multiple Distant Microphones (MDM): 8-10%. Recent mean DER for a Single Distant Microphone (SDM): 12-18%.
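
A toy illustration of the DER arithmetic, with all quantities expressed as durations in seconds; the example numbers in the comment are invented purely to show the calculation.

def diarization_error_rate(missed, false_alarm, speaker_error, total_ref_speech):
    # all three error terms are durations, normalized by the total reference speech time
    return (missed + false_alarm + speaker_error) / total_ref_speech

# e.g. 3 s missed + 2 s false alarm + 5 s wrong speaker over 100 s of speech -> DER = 0.10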

44 Summary: Speaker Recognition Tasks slide from Douglas Reynolds

