Douglas A. Reynolds, PhD Senior Member of Technical Staff

Uploaded: 2017-10-19T14:42:50+00:00
Duration: PTM20S12
Channel: Brenda McCarthy

Automatic Speaker Recognition: Recent Progress, Current Applications, and Future Trends
Douglas A. Reynolds, PhD, Senior Member of Technical Staff, M.I.T. Lincoln Laboratory
Larry P. Heck, PhD, Speaker Verification R&D, Nuance Communications
Describe the positions of the speakers of this talk.

Outline
  Introduction and applications
  General theory
  Performance
  Conclusion and future directions
The aim of this talk is to provide an overview of automatic speaker recognition. We will first present the definitions and terminology for the core tasks underlying all speaker recognition applications and then provide a few concrete examples of speaker recognition applications. Next, we will present some general details of the technology behind speaker recognition systems and provide an overview of the performance of automatic systems. Finally, we will present some ideas on current limitations and future directions for research and applications.

Extracting Information from Speech
Goal: Automatically extract information transmitted in the speech signal
  Speech signal -> speech recognition -> words: “How are you?”
  Speech signal -> language recognition -> language name: English
  Speech signal -> speaker recognition -> speaker name: James Wilson
Speech conveys several levels of information. In addition to the message being spoken (the words), there is also information about the language and the speaker. The aim of all automatic speech processing techniques is to extract these levels of information for further processing (database query, information for synthesis, etc.). Although shown as independent extraction paths, knowledge gained from the different levels of information can be used together for different application goals.

Introduction: Identification
Determines who is talking from a set of known voices
No identity claim from the user (many-to-one mapping)
Often assumed that the unknown voice must come from the set of known speakers, referred to as closed-set identification
Whose voice is this?
Definition of the identification task and the closed-set condition.

Introduction: Verification/Authentication/Detection
Determines whether a person is who they claim to be
User makes an identity claim (one-to-one mapping)
Unknown voice could come from a large set of unknown speakers, referred to as open-set verification
Adding a “none-of-the-above” option to closed-set identification gives open-set identification
Is this Bob’s voice?
Definition of the verification/authentication/detection task and the open-set condition.
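
The difference between closed-set identification, open-set identification, and verification can be sketched in a few lines. This is only an illustration: the function names, the per-speaker scores, and the threshold values are hypothetical and do not come from the talk. It assumes that some scoring step has already produced one match score per enrolled speaker, with higher meaning a better match.

```python
def identify_closed_set(scores):
    """Closed-set identification: no identity claim, pick the best known voice."""
    return max(scores, key=scores.get)


def identify_open_set(scores, threshold):
    """Open-set identification: closed-set plus a "none-of-the-above" option."""
    best = identify_closed_set(scores)
    # Returning None stands for "none of the above" (assumed convention).
    return best if scores[best] > threshold else None


def verify(scores, claimed, threshold):
    """Verification: one-to-one check of the claimed identity only."""
    return scores[claimed] > threshold


# Hypothetical match scores for one unknown utterance.
scores = {"Bob": 1.8, "Sally": -0.4, "James": 0.3}

print(identify_closed_set(scores))       # best-scoring enrolled speaker
print(identify_open_set(scores, 2.0))    # None when no score clears the threshold
print(verify(scores, "Sally", 0.0))      # is this Sally's voice?
```

Identification compares the utterance against every enrolled speaker, while verification looks only at the claimed speaker's score.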

Introduction: Speech Modalities
Application dictates different speech modalities:
Text-dependent recognition
  Recognition system knows the text spoken by the person
  Examples: fixed phrase, prompted phrase
  Used for applications with strong control over user input
  Knowledge of the spoken text can improve system performance
Text-independent recognition
  Recognition system does not know the text spoken by the person
  Examples: user-selected phrase, conversational speech
  Used for applications with less control over user input
  More flexible system but also a more difficult problem
  Speech recognition can provide knowledge of the spoken text
Definition of the speech modalities of the input speech.

Introduction: Voice as a Biometric
Biometric: a human-generated signal or attribute for authenticating a person’s identity
Voice is a popular biometric:
  Natural signal to produce
  Does not require a specialized input device
  Ubiquitous: telephones and microphone-equipped PCs
Voice biometric can be combined with other forms of security:
  Something you have, e.g., a badge
  Something you know, e.g., a password
  Something you are, e.g., a voice
  Strongest security: all three combined (Have, Know, Are)
Definition of voice as a biometric. Potential lead-in to the slide describing the integration of voice and knowledge verification.

Introduction: Applications
Access control
  Physical facilities
  Data and data networks
Transaction authentication
  Toll fraud prevention
  Telephone credit card purchases
  Bank wire transfers
Monitoring
  Remote time and attendance logging
  Home parole verification
  Prison telephone usage
Information retrieval
  Customer information for call centers
  Audio indexing (speech skimming device)
Forensics
  Voice sample matching
List of application areas of speaker recognition technology. Define two or three that will be described in more detail.

Outline
  Introduction and applications
  General theory
  Performance
  Conclusion and future directions

General Theory: Components of Speaker Verification System
Diagram: input speech (“My name is Bob”) -> feature extraction -> speaker model (Bob’s “voiceprint”, selected by the identity claim “Bob”) and impostor model (impostor “voiceprints”) -> Σ -> decision -> ACCEPT or REJECT
Outline the three main components of all speaker recognition systems: feature extraction, speaker modeling (voiceprint creation), and verification decision.

General Theory: Phases of Speaker Verification System
Two distinct phases to any speaker verification system
Enrollment phase: enrollment speech for each speaker (Bob, Sally) -> feature extraction -> model training -> voiceprints (models) for each speaker
Verification phase: input speech with claimed identity (Sally) -> feature extraction -> verification decision -> Accepted!
There are two distinct phases in automatic speaker verification systems: enrollment, which creates a voiceprint (model) for the specific speaker, and verification, which checks the unknown voice against the proffered identity.

General Theory: Features for Speaker Recognition
Humans use several levels of perceptual cues for speaker recognition
Hierarchy of perceptual cues:
  High-level cues (learned traits): difficult to automatically extract
  Low-level cues (physical traits): easy to automatically extract
There are no exclusive speaker-identifying cues
Low-level acoustic cues are most applicable for automatic systems
Describe the understanding of how humans recognize speakers from speech and how this leads to information that is suitable for automatic systems.

Desirable attributes of features for an automatic system (Wolf ‘72)
  Practical: occur naturally and frequently in speech; easily measurable
  Robust: do not change over time or with the speaker’s health; not affected by reasonable background noise, nor dependent on specific transmission characteristics
  Secure: not subject to mimicry
No feature has all these attributes
Features derived from the spectrum of speech have proven to be the most effective in automatic systems
Desirable attributes for automatic features.

General Theory: Speech Production
Speech production model: source-filter interaction
Anatomical structure (vocal tract/glottis) is conveyed in the speech spectrum
Diagram: glottal pulses -> vocal tract -> speech signal
Describe the link between the speech signal and speaker-specific information (anatomical structure).

Speech is a continuous evolution of the vocal tract
Need to extract a time series of spectra
Use a sliding window (ms window, 10 ms shift)
Apply the Fourier transform to each window and take the magnitude
Produces the time-frequency evolution of the spectrum
Briefly describe how automatic systems extract spectral information from the speech signal using the Fourier transform of a sliding window over continuous speech. (Same technique as used for speech and language recognition.)
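
The sliding-window analysis described above can be sketched as follows. This is a minimal, standard-library-only illustration, not the front end of any particular system. The slide gives the 10 ms shift but the window length is missing from the transcript, so `window_ms` is left as a required parameter; the 20 ms value in the usage example, the 8 kHz sample rate, the Hamming taper, and the function name `magnitude_spectrogram` are all assumptions made for the example.

```python
import cmath
import math


def magnitude_spectrogram(signal, sample_rate, window_ms, shift_ms=10.0):
    """Return one magnitude spectrum per window position (list of lists)."""
    win = int(round(sample_rate * window_ms / 1000.0))   # samples per window
    hop = int(round(sample_rate * shift_ms / 1000.0))    # samples per shift
    # Hamming taper to reduce spectral leakage (an assumed, common choice).
    taper = [0.54 - 0.46 * math.cos(2 * math.pi * n / (win - 1)) for n in range(win)]
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = [signal[start + n] * taper[n] for n in range(win)]
        spectrum = []
        # Direct DFT, keeping only the non-negative frequency bins.
        for k in range(win // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / win)
                      for n in range(win))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Usage: 0.1 s of a 1 kHz tone sampled at 8 kHz (synthetic test signal).
rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / rate) for n in range(800)]
spec = magnitude_spectrogram(tone, rate, window_ms=20.0)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), len(spec[0]), peak_bin * rate / 160)  # frames, bins, peak frequency in Hz
```

A direct DFT is used only to keep the sketch dependency-free; a real front end would use an FFT and usually further processing of each spectrum.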

General Theory: Speaker Models
Diagram (repeated): components of a speaker verification system, here introducing the speaker model (Bob’s “voiceprint”) and the impostor model (impostor “voiceprints”).

Speaker models (voiceprints) represent the voice biometric in a compact and generalizable form
Modern speaker verification systems use Hidden Markov Models (HMMs)
  HMMs are statistical models of how a speaker produces sounds
  HMMs represent the underlying statistical variations in the speech state (e.g., phoneme) and the temporal changes of speech between the states
  Fast training algorithms (EM) exist for HMMs with guaranteed convergence properties
Diagram: example HMM for the sounds h-a-d
Describe the basic model used to capture the speaker characteristics: the HMM.

Form of HMM depends on the application
  Fixed phrase (e.g., “Open sesame”): word/phrase models
  Prompted phrases/passwords: phoneme models (e.g., /s/ /i/ /x/)
  General speech (text-independent): single-state HMM
Describe the basic forms of how HMMs are used for different speech modalities.
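
To make the idea of scoring speech against an HMM concrete, the sketch below computes the log-likelihood of a feature sequence under an HMM with the forward algorithm. It is a simplified illustration under stated assumptions: features are one-dimensional, each state emits from a single Gaussian, and all names and parameter values are invented for the example. It also shows that a single-state HMM, the text-independent case, reduces to scoring every frame against one distribution with no temporal structure.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """log(p) that maps a zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def logsumexp(values):
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def hmm_log_likelihood(obs, start, trans, means, variances):
    """Forward algorithm in the log domain; returns log p(obs | model)."""
    n = len(means)
    alpha = [safe_log(start[i]) + log_gauss(obs[0], means[i], variances[i])
             for i in range(n)]
    for x in obs[1:]:
        alpha = [logsumexp([alpha[i] + safe_log(trans[i][j]) for i in range(n)])
                 + log_gauss(x, means[j], variances[j])
                 for j in range(n)]
    return logsumexp(alpha)


# Hypothetical left-to-right model for a fixed phrase with three sounds.
start = [1.0, 0.0, 0.0]
trans = [[0.5, 0.5, 0.0],
         [0.0, 0.5, 0.5],
         [0.0, 0.0, 1.0]]
means, variances = [0.0, 5.0, 10.0], [1.0, 1.0, 1.0]

in_order = [0.0, 0.0, 5.0, 5.0, 10.0, 10.0]
reversed_order = list(reversed(in_order))
print(hmm_log_likelihood(in_order, start, trans, means, variances))
print(hmm_log_likelihood(reversed_order, start, trans, means, variances))

# Single-state HMM (text-independent case): frame order no longer matters.
print(hmm_log_likelihood([0.0, 1.0], [1.0], [[1.0]], [0.0], [1.0]))
```

The multi-state model rewards the expected order of sounds, which is why knowledge of the spoken text helps in the text-dependent case.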

General Theory: Verification Decision
Diagram (repeated): components of a speaker verification system, here introducing the decision stage, where the speaker model and impostor model scores are combined (Σ) into an ACCEPT or REJECT decision.

General Theory: Verification Decision
Verification decision approaches have roots in signal detection theory
2-class hypothesis test:
  H0: the speaker is an impostor
  H1: the speaker is indeed the claimed speaker
Statistic computed on test utterance S as a likelihood ratio:
$$\Lambda = \log \frac{\text{likelihood } S \text{ came from speaker HMM}}{\text{likelihood } S \text{ did not come from speaker HMM}}$$
  $\Lambda > \theta$: accept
  $\Lambda < \theta$: reject
Diagram: input speech -> feature extraction -> speaker model (+) and impostor model (-) -> Σ -> Λ -> decision
Describe the basic verification decision: the likelihood ratio.
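
The likelihood-ratio decision can be sketched as below. To keep the example short, the speaker and impostor models are replaced by one-dimensional Gaussians rather than HMMs; that substitution, the function names, the model parameters, and the threshold are all assumptions made for illustration. The slide does not say what happens when the statistic exactly equals the threshold, so the sketch rejects in that case.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian (stand-in for a model likelihood)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def log_likelihood(features, model):
    """Log-likelihood of all feature frames under a (mean, variance) model."""
    mean, var = model
    return sum(log_gauss(x, mean, var) for x in features)


def verify(features, speaker_model, impostor_model, theta):
    """Return (decision, statistic) for the log-likelihood-ratio test."""
    # Log of a ratio is the difference of the two log-likelihoods.
    statistic = (log_likelihood(features, speaker_model)
                 - log_likelihood(features, impostor_model))
    decision = "accept" if statistic > theta else "reject"  # ties rejected (assumption)
    return decision, statistic


# Hypothetical models: (mean, variance).
bob = (1.0, 1.0)
impostors = (0.0, 1.0)

print(verify([1.0, 1.2, 0.8], bob, impostors, theta=0.0))   # frames near Bob's model
print(verify([0.1, -0.2, 0.0], bob, impostors, theta=0.0))  # frames near the impostor model
```

Raising `theta` makes the system stricter, trading more false rejections for fewer false acceptances.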

Outline
  Introduction and applications
  General theory
  Performance

Verification Performance: Evaluating Speaker Verification Systems
There are many factors to consider in the design of an evaluation of a speaker verification system
Speech quality
  Channel and microphone characteristics
  Noise level and type
  Variability between enrollment and verification speech
Speech modality
  Fixed/prompted/user-selected phrases
  Free text
Speech duration
  Duration and number of sessions of enrollment and verification speech
Speaker population
  Size and composition
Most importantly: the evaluation data and design should match the target application domain of interest
Describe some of the dimensions of core speaker verification technology evaluation. Application-specific evaluation, however, will depend on other practical considerations such as throughput, ease of enrollment, etc.

Verification Performance: Evaluating Speaker Verification Systems
Example performance curve (axes: probability of false accept in % vs. probability of false reject in %)
Application operating point depends on the relative costs of the two error types
  High security, e.g., wire transfer: false acceptance is very costly; users may tolerate rejections for security
  Balance: equal error rate (EER) = 1%
  High convenience, e.g., toll fraud: false rejections alienate customers; any fraud rejection is beneficial
Example of a DET (ROC) curve and where different applications may operate.
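
The two error types and the equal error rate can be computed from a set of trial scores as sketched below. The score lists are invented toy data, not results from the talk, and the function names are hypothetical. The sketch assumes that a trial is accepted when its score is strictly above the threshold, and it estimates the EER by scanning the observed scores for the threshold where the two error rates are closest.

```python
def error_rates(target_scores, impostor_scores, threshold):
    """Return (false_accept_rate, false_reject_rate) at one threshold."""
    false_rejects = sum(s <= threshold for s in target_scores)
    false_accepts = sum(s > threshold for s in impostor_scores)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(target_scores))


def equal_error_rate(target_scores, impostor_scores):
    """Return (eer_estimate, threshold) where the two error rates are closest."""
    best = None
    for t in sorted(set(target_scores) | set(impostor_scores)):
        fa, fr = error_rates(target_scores, impostor_scores, t)
        gap = abs(fa - fr)
        if best is None or gap < best[0]:
            best = (gap, (fa + fr) / 2, t)
    return best[1], best[2]


# Toy scores: true-speaker trials and impostor trials (illustrative only).
targets = [2.0, 1.5, 1.0, 0.2]
impostors = [-1.0, -0.5, 0.0, 0.5]

eer, eer_threshold = equal_error_rate(targets, impostors)
print(eer, eer_threshold)

# A high-security operating point: stricter threshold, no false accepts here,
# at the price of more false rejects.
print(error_rates(targets, impostors, threshold=1.2))
```

Sweeping the threshold over its whole range and plotting the resulting pairs of rates gives the kind of curve the slide shows.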

Verification Performance: NIST Speaker Verification Evaluations
Annual NIST evaluations of speaker verification technology (since 1995)
Aim: provide a common paradigm for comparing technologies
Focus: conversational telephone speech (text-independent)
Diagram: data provider (Linguistic Data Consortium), evaluation coordinator, and technology developers in a cycle: evaluate, compare technologies on a common task, improve
Describe the NIST text-independent speaker recognition evaluations. This is an example of a core research evaluation.

Verification Performance: Range of Performance
Performance curves (axes: probability of false accept in % vs. probability of false reject in %) for four conditions, in order of increasing constraints:
  Text-independent (read sentences): military radio data; multiple radios and microphones; moderate amount of training data
  Text-independent (conversational): telephone data; multiple microphones; moderate amount of training data
  Text-dependent (digit strings): telephone data; multiple microphones; small amount of training data
  Text-dependent (combinations): clean data; single microphone; large amount of train/test speech
Summary of the performance variability of the core verification system with respect to constraints (vocabulary and environment). This leads into application performance, where various constraints and auxiliary knowledge are used to produce the performance required.

Verification Performance: Human vs. Machine
Motivation for comparing human to machine
  Evaluating speech coders and potential forensic applications
Schmidt-Nielsen and Crystal used the NIST evaluation (DSP Journal, January 2000)
  Same amount of training data
  Used 3-sec conversational utterances from telephone speech
Chart: error rates for matched handset-type tests and mismatched handset-type tests; chart labels: “Humans 44% better”, “Humans 15% worse”

Verification Performance: Application Deployments
Application
  Voice authentication based on spoken phone number
  Provides secure access to customer record and credit card information
Benefits
  Security
  Personalization
Volume
  250k customers enrolled, calls/day
  5 million customers will enroll by Q2, calls/day
Implementation
  Edify telephony platform
  EER
Application-specific performance, where constraints and other knowledge sources are applied to produce the required operating levels.

Verification Performance: Speaker + Knowledge Verification
Example dialog:
  System: Please enter your account number
  User: “ ”
  System: Say your date of birth
  User: “October 13, 1964”
  System: You’re accepted by the system
Diagram: voice over telephone -> authenticate voice (biometric, against voice prints) and authenticate knowledge (against data) -> accept or reject
Application-specific performance, where constraints and other knowledge sources are applied to produce the required operating levels.
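
One simple way to combine the voice check with the knowledge check is to require both to pass, as sketched below. The talk does not state how the two results are combined, so the AND rule, the function name `authenticate`, the text normalization, and the score and threshold values are all assumptions made for illustration.

```python
def authenticate(spoken_answer, stored_answer, voice_score, voice_threshold):
    """Accept only if the spoken answer is correct AND the voice matches."""
    # Knowledge check: compare the recognized words with the stored answer,
    # ignoring case and surrounding whitespace (assumed normalization).
    knowledge_ok = spoken_answer.strip().lower() == stored_answer.strip().lower()
    # Biometric check: voice score against a threshold, as in verification.
    voice_ok = voice_score > voice_threshold
    return "accept" if (knowledge_ok and voice_ok) else "reject"


stored_birth_date = "October 13, 1964"

print(authenticate("October 13, 1964", stored_birth_date, voice_score=1.4, voice_threshold=0.5))
print(authenticate("October 13, 1964", stored_birth_date, voice_score=0.1, voice_threshold=0.5))
print(authenticate("October 31, 1964", stored_birth_date, voice_score=1.4, voice_threshold=0.5))
```

With this rule an impostor has to know the answer and sound like the customer, which is the sense in which the knowledge source adds security to the biometric.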

Outline
  Introduction
  General theory
  Performance

Conclusions
Speaker verification is one of the few recognition areas where machines can outperform humans
Speaker verification technology is a viable technique currently available for applications
Speaker verification can be augmented with other authentication techniques to add further security
Provide a basic wrap-up of the current state of the area.

Future Directions
Research will focus on using speaker verification techniques for more unconstrained, uncontrolled situations
  Audio search and retrieval
  Increasing robustness to channel variabilities
  Incorporating higher levels of knowledge into decisions
Speaker recognition technology will become an integral part of speech interfaces
  Personalization of services and devices
  Unobtrusive protection of transactions and information
Provide a basic wrap-up of the current state of the area.
