1 CS 224S / LINGUIST 285 Spoken Language Processing Dan Jurafsky Stanford University Spring 2014 Lecture 15: Speaker Recognition Lots of slides thanks to Douglas Reynolds

2 Why speaker recognition? Access Control: physical facilities, websites, computer networks. Transaction Authentication: telephone banking, remote credit card purchase. Law Enforcement: forensics, surveillance. Speech Data Mining: meeting summarization, lecture transcription. slide text from Douglas Reynolds

3 Three Speaker Recognition Tasks slide from Douglas Reynolds

4 Two kinds of speaker verification Text-dependent: users have to say something specific (easier for the system). Text-independent: users can say whatever they want (more flexible but harder).

5 Two phases to speaker detection slide from Douglas Reynolds

6 Detection: Likelihood Ratio Two-class hypothesis test: H0: X is not from the hypothesized speaker; H1: X is from the hypothesized speaker. Choose the more likely hypothesis via a likelihood ratio test. slide from Douglas Reynolds

7 Speaker ID Log-Likelihood Ratio Score LLR: Λ = log p(X|H1) − log p(X|H0) Need two models: Hypothesized speaker model for H1 Alternative (background) model for H0 slide from Douglas Reynolds
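
A minimal sketch of how this score could be computed, assuming the hypothesized speaker model and background model are GMMs trained elsewhere with scikit-learn; the names speaker_gmm, ubm, mfcc_frames, and theta are illustrative, not from the slides.

import numpy as np
from sklearn.mixture import GaussianMixture

def llr_score(X, speaker_gmm: GaussianMixture, ubm: GaussianMixture) -> float:
    """Average per-frame log-likelihood ratio: log p(X|H1) - log p(X|H0)."""
    log_p_h1 = speaker_gmm.score_samples(X)  # log p(x_t | speaker model) per frame
    log_p_h0 = ubm.score_samples(X)          # log p(x_t | background model) per frame
    return float(np.mean(log_p_h1 - log_p_h0))

# accept the speaker claim if the score exceeds a tuned threshold:
# decision = llr_score(mfcc_frames, speaker_gmm, ubm) > theta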

8 How do we get H0? Pool speech from several speakers and train a single model: a universal background model (UBM). Can train one UBM and use it as H0 for all speakers. Should be trained on speech representing the expected impostor speech: the same type of speech as speaker enrollment (modality, language, channel). Slide adapted from Chu, Bimbot, Bonastre, Fredouille, Gravier, Magrin-Chagnolleau, Meignier, Merlin, Ortega-Garcia, Petrovska-Delacretaz, Reynolds

9 How to compute P(H|X)? Gaussian Mixture Models (GMM): the traditional best model for text-independent speaker recognition. Support Vector Machines (SVM): more recent use of a discriminative model.

10 Form of GMM/HMM depends on application slide from Douglas Reynolds

11 GMMs for speaker recognition A Gaussian mixture model (GMM) represents the feature distribution as a weighted sum of multiple Gaussian distributions. Each Gaussian component i has a mean, covariance, and weight. Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
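
To make the weighted sum concrete, here is a small illustrative sketch of the GMM density; the arguments weights, means, and covs are placeholders for the per-component parameters, not values from the slides.

import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """p(x) = sum_i  w_i * N(x; mu_i, Sigma_i)"""
    return sum(w * multivariate_normal.pdf(x, mean=mu, cov=cov)
               for w, mu, cov in zip(weights, means, covs))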

12 Recognition Systems: Gaussian Mixture Models (figure: example GMM over a two-dimensional feature space, with its parameters labeled) Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

13 Recognition Systems: Gaussian Mixture Models (figure: the model's components and parameters) Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

14 GMM training During training, the system learns about the data it uses to make decisions: a set of features is collected from a speaker (or language or dialect) and a model is fit to them. (figure: training features and the resulting model) Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
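
A hedged sketch of fitting such a speaker GMM with EM using scikit-learn; the feature matrix (frames x dimensions) is assumed to come from an MFCC front end not shown here, and the component count is just one common choice.

from sklearn.mixture import GaussianMixture

def train_speaker_gmm(mfcc_frames, n_components=64):
    # EM fit of a diagonal-covariance GMM to a (frames x dims) feature matrix;
    # this estimates the weights, means, and covariances of every component.
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=100, random_state=0)
    gmm.fit(mfcc_frames)
    return gmm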

15 Recognition Systems for Language, Dialect, Speaker ID In LID, DID, and SID, we train a set of target models, one for each language, dialect, or speaker. (figure: per-target model components and parameters) Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

16 Recognition Systems: Universal Background Model We also train a universal background model representing all speech. (figure: UBM components and parameters) Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

17 Recognition Systems: Hypothesis Test Given a set of test observations, we perform a hypothesis test to determine whether a certain class produced it. Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

18 Recognition Systems: Hypothesis Test (figure: the test observations compared against candidate models) Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

19 Recognition Systems: Hypothesis Test (figure: the test observations scored against the speaker model, "Dan?", and the UBM, "not Dan?") Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner

20 More details on GMMs Instead of training the speaker model on only the speaker's data, adapt the UBM to that speaker; this takes advantage of all the data. MAP adaptation: the new mean of each Gaussian is a weighted mix of the UBM mean and the speaker's data. Weigh the speaker more if we have more data: μ_i = α E_i(x) + (1 − α) μ_i, with α = n/(n + 16), where n is the (soft) count of speaker frames assigned to Gaussian i.
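
A sketch of this MAP mean adaptation, assuming the UBM is a scikit-learn GaussianMixture and X is the speaker's feature matrix; only the means are adapted, and the relevance factor of 16 follows the formula above.

import numpy as np

def map_adapt_means(ubm, X, relevance=16.0):
    post = ubm.predict_proba(X)                 # frame-level responsibilities, shape (T, M)
    n = post.sum(axis=0)                        # soft frame count per Gaussian, shape (M,)
    # E_i(x): posterior-weighted mean of the speaker's frames for each Gaussian
    E = (post.T @ X) / np.maximum(n[:, None], 1e-10)
    alpha = n / (n + relevance)                 # alpha = n / (n + 16)
    return alpha[:, None] * E + (1.0 - alpha[:, None]) * ubm.means_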

21 Gaussian mixture models Features are normally MFCCs; can use more dimensions (20 + deltas). UBM (background model): 512–2048 mixtures. Speaker's GMM: 64–256 mixtures. Often combined with other classifiers in a mixture-of-experts.

22 SVM Train a one-versus-all discriminative classifier Various kernels Combine with GMM

23 Other features Prosody Phone sequences Language Model features

24 Doddington (2001) Word bigrams can be very informative about speaker identity

25 Evaluation Metric Trial: are a pair of audio samples spoken by the same person? Two types of errors: False reject = Miss: incorrectly reject a true trial (Type-I error). False accept: incorrectly accept a false trial (Type-II error). Performance is a trade-off between these two errors, controlled by adjustment of the decision threshold. slide from Douglas Reynolds

26 ROC and DET curves slide from Douglas Reynolds P(false reject) vs. P(false accept) shows system performance

27 DET curve slide from Douglas Reynolds Application operating point depends on relative costs of the two errors

28 Evaluation tasks slide from Douglas Reynolds Performance numbers depend on evaluation conditions

29 Rough historical trends in performance slide from Douglas Reynolds

30 Milestones in the NIST SRE Program 1992 – DARPA: limited speaker id evaluation 1996 – First SRE in current series 2000 – AHUMADA Spanish data, first non-English speech 2001 – Cellular data 2001 – ASR transcripts provided 2002 – FBI forensic database 2005 – Multiple languages with bilingual speakers 2005 – Room mic recordings, cross-channel trials 2008 – Interview data 2010 – New decision cost function: lower FA rate region 2010 – High and low vocal effort, aging 2011 – Broad range of conditions, including noise and reverb From Alvin Martin's 2012 talk on the NIST SR Evaluations

31 Metrics Equal Error Rate: easy to understand, but not the operating point of interest. FA rate at a fixed miss rate (e.g., 10%): may be viewed as the cost of listening to false alarms. Decision Cost Function. From Alvin Martin's 2012 talk on the NIST SR Evaluations
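
A minimal sketch of computing the equal error rate from detection scores, assuming numpy arrays of scores for target (same-speaker) and non-target (impostor) trials; the simple threshold sweep is illustrative, not the NIST scoring tool.

import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    miss = np.array([np.mean(target_scores < t) for t in thresholds])    # P(miss) at each threshold
    fa = np.array([np.mean(nontarget_scores >= t) for t in thresholds])  # P(false alarm) at each threshold
    idx = np.argmin(np.abs(miss - fa))        # threshold where the two error rates cross
    return (miss[idx] + fa[idx]) / 2.0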

32 Decision Cost Function C_Det Weighted sum of miss and false alarm error probabilities: C_Det = C_Miss × P_Miss|Target × P_Target + C_FalseAlarm × P_FalseAlarm|NonTarget × (1 − P_Target) Parameters are the relative costs of detection errors, C_Miss and C_FalseAlarm, and the a priori probability of the specified target speaker, P_Target: 1996–2008: C_Miss = 10, C_FalseAlarm = 1, P_Target = 0.01; 2010: C_Miss = 1, C_FalseAlarm = 1, P_Target = 0.001. From Alvin Martin's 2012 talk on the NIST SR Evaluations
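
A direct transcription of the cost function above into code, using the parameter values from the table; the example miss and false-alarm rates in the comment are made up purely for illustration.

def decision_cost(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    # C_Det = C_Miss * P(miss|target) * P(target) + C_FA * P(fa|nontarget) * (1 - P(target))
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

# e.g. 1996-2008 parameters: decision_cost(0.10, 0.02)  -> 0.0298
#      2010 parameters:      decision_cost(0.10, 0.02, c_miss=1.0, p_target=0.001)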

33 Accuracies From Alvin Martin's 2012 talk on the NIST SR Evaluations

34 How good are humans? Survey of 2000 voice IDs made by trained FBI employees: select similarly pronounced words, use spectrograms (comparing formants, pitch, timing), listen back and forth. Evaluated based on "interviews and other evidence in the investigation" and legal conclusions. No decision: 65.2% (1304). Non-match: 18.8% (378), with false rejects = 0.53% (2). Match: 15.9% (318), with false accepts = 0.31% (1). Bruce E. Koenig. 1986. Spectrographic voice identification: A forensic survey. J. Acoust. Soc. Am. 79(6).

35 Speaker diarization Conversational telephone speech: 2 speakers. Broadcast news: many speakers, although often in dialogue (interviews) or in sequence (broadcast segments). Meeting recordings: many speakers, lots of overlap and disfluencies. Tranter and Reynolds 2006

36 Speaker diarization Tranter and Reynolds 2006

37 Step 1: Speech Activity Detection Meetings or broadcast: use supervised GMMs with two models (speech/non-speech), or extra models for music, etc.; then do Viterbi segmentation, possibly with minimum-length constraints or smoothing rules. Telephone: simple energy/spectrum speech activity detection. State of the art: Broadcast: 1% miss, 1-2% false alarm. Meeting: 2% miss, 2-3% false alarm. Tranter and Reynolds 2006
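
A very simple energy-based detector of the kind used for telephone data, shown only as a sketch; the 25 ms frame length (400 samples at 16 kHz) and the 35 dB threshold below the loudest frame are arbitrary assumptions, not values from the slides.

import numpy as np

def energy_vad(signal, frame_len=400, threshold_db=-35.0):
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return energy_db > (energy_db.max() + threshold_db)   # True = frame labeled as speech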

38 Step 2: Change Detection 1. Look at adjacent windows of data. 2. Calculate the distance between them. 3. Decide whether the windows come from the same source. Two common methods: (a) to look for change points within a window, use a likelihood ratio test to see whether the data is better modeled by one distribution or two; if two, insert a change point and start a new window there; if one, expand the window and check again. (b) represent each window by a Gaussian, compare neighboring windows with a KL distance, find peaks in the distance function, and threshold. Tranter and Reynolds 2006
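
A sketch of the second method, modeling each window as a single diagonal-covariance Gaussian and comparing neighbors with a symmetrised KL distance; the diagonal assumption and the variance floor are simplifications for illustration.

import numpy as np

def gaussian_kl_diag(mu1, var1, mu2, var2):
    # KL( N(mu1, var1) || N(mu2, var2) ) for diagonal covariances
    return 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def window_distance(win_a, win_b):
    # win_a, win_b: (frames x dims) feature arrays for two adjacent windows
    mu1, var1 = win_a.mean(axis=0), win_a.var(axis=0) + 1e-6
    mu2, var2 = win_b.mean(axis=0), win_b.var(axis=0) + 1e-6
    return gaussian_kl_diag(mu1, var1, mu2, var2) + gaussian_kl_diag(mu2, var2, mu1, var1)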

39 Step 3: Gender Classification Supervised GMMs If doing Broadcast news, also do bandwidth classification (studio wideband speech versus narrowband telephone speech) Tranter and Reynolds 2006

40 Step 4: Clustering Hierarchical agglomerative clustering: 1. initialize the leaf clusters of the tree with the speech segments; 2. compute pair-wise distances between clusters; 3. merge the closest clusters; 4. update the distances of the remaining clusters to the new cluster; 5. iterate steps 2-4 until the stopping criterion is met. Tranter and Reynolds 2006
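
A hedged sketch of that agglomerative loop, with clusters represented simply by their stacked feature frames; the distance function and stopping threshold are parameters (the window_distance above could stand in), and nothing here is tied to a particular toolkit.

import numpy as np

def agglomerative_diarization(segments, distance, stop_threshold):
    clusters = [np.asarray(s) for s in segments]               # step 1: one leaf cluster per segment
    while len(clusters) > 1:
        dists = {(i, j): distance(clusters[i], clusters[j])    # step 2: pairwise distances
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))}
        (i, j), d = min(dists.items(), key=lambda kv: kv[1])
        if d > stop_threshold:                                 # step 5: stopping criterion
            break
        merged = np.vstack([clusters[i], clusters[j]])         # step 3: merge the closest pair
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]  # step 4
    return clusters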

41 Step 5: Resegmentation Use the final clusters and non-speech models to resegment the data via Viterbi decoding. Goal: refine the original segmentation and fix short segments that may have been removed. Tranter and Reynolds 2006

42 TDOA features For meetings with multiple microphones: Time-Delay-of-Arrival (TDOA) features. Correlate signals from the microphones and figure out the time shift; used to sync up multiple microphones and as a feature for speaker localization (assume the speaker doesn't move, so they stay near the same microphone).
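
An illustrative time-delay estimate between two microphone channels using plain cross-correlation; real systems typically use a variant such as GCC-PHAT, and the sign convention noted in the comment is an assumption of this sketch.

import numpy as np

def tdoa_samples(mic_a, mic_b):
    mic_a, mic_b = np.asarray(mic_a, dtype=float), np.asarray(mic_b, dtype=float)
    corr = np.correlate(mic_a, mic_b, mode="full")
    lag = int(np.argmax(corr)) - (len(mic_b) - 1)   # positive lag: mic_a is delayed relative to mic_b
    return lag                                       # divide by the sample rate to get seconds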

43 Evaluation Systems give start-stop times of speech segments with speaker labels; a non-scoring collar of 250 ms on either side. DER (Diarization Error Rate): missed speech (% of speech in the ground truth but not in the hypothesis) + false alarm speech (% of speech in the hypothesis but not in the ground truth) + speaker error (% of speech assigned to the wrong speaker). Recent mean DER for Multiple Distant Microphones (MDM): 8-10%. Recent mean DER for a Single Distant Microphone (SDM): 12-18%.
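
A toy illustration of the DER arithmetic, with all quantities expressed as durations in seconds; the example numbers in the comment are invented purely to show the calculation.

def diarization_error_rate(missed, false_alarm, speaker_error, total_ref_speech):
    # all three error terms are durations, normalized by the total reference speech time
    return (missed + false_alarm + speaker_error) / total_ref_speech

# e.g. 3 s missed + 2 s false alarm + 5 s wrong speaker over 100 s of speech -> DER = 0.10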

44 Summary: Speaker Recognition Tasks slide from Douglas Reynolds

