Scott Settembre CSE 734 : Cyber Physical Spaces


1 Speaker Recognition
Scott Settembre (ss424@cse.buffalo.edu)
CSE 734 : Cyber Physical Spaces

2 Overview
Speaker Identification
Speaker Validation
Two types of Recognition methods
  Text dependent vs. Text independent
Speaker Recognition steps
Conclusion / References
March 16, 2009 Scott Settembre

3 Speaker Identification
Determines the speaker from a set of registered speakers
  This is called a “closed” set identification
  Result is the best speaker matched
What if the speaker is not in the database?
  This is called an “open” set identification
  Result can be a speaker or a no-match result
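The closed-set and open-set cases above can be sketched as follows. This is a minimal illustration, not the method used in the talk: cosine similarity, the three-element feature vectors, the speaker names, and the `open_set_threshold` parameter are all assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Similarity between two feature vectors (an assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identify(features, templates, open_set_threshold=None):
    """Return (speaker, score) for the best-matching template.

    With no threshold this is closed-set identification: the best
    match always wins. With a threshold it is open-set: a weak best
    match is reported as None (no match).
    """
    scores = {name: cosine_similarity(features, t)
              for name, t in templates.items()}
    best = max(scores, key=scores.get)
    if open_set_threshold is not None and scores[best] < open_set_threshold:
        return None, scores[best]
    return best, scores[best]

# Hypothetical enrolled templates.
templates = {"alice": [1.0, 0.0, 0.2], "bob": [0.0, 1.0, 0.3]}

closed_result = identify([0.9, 0.1, 0.2], templates)
open_result = identify([0.5, 0.5, -0.9], templates, open_set_threshold=0.8)
```

In the closed-set call the input is always assigned to some enrolled speaker, even a poorly matching one; the open-set call adds the option of rejecting everyone.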

4 Speaker Identification Diagram
[Diagram. Box labels: Speaker Database; Actual Speaker Input; Enrollment; Calculate similarity to each speaker template or model; Identification of Speaker; Normalization; Feature Extraction; Select best match]

6 Speaker Validation
Also called “Verification” or “Authentication”
Determines if the voice matches a particular registered speaker
Result is the probability of a match or a similarity measure
Similarity must exceed a particular threshold
  Higher threshold produces more false negatives
  Lower threshold produces more false positives
Voice variability and security issues make this a difficult threshold value to determine (more later)
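The threshold trade-off above can be made concrete with a small sketch. The similarity scores below are invented for illustration, and the names `verify` and `error_rates` are hypothetical; only the accept-if-above-threshold rule comes from the slide.

```python
def verify(similarity, threshold):
    """Accept the claimed identity only if similarity exceeds the threshold."""
    return similarity > threshold

def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false_negative_rate, false_positive_rate) at a threshold.

    A false negative is a genuine speaker who is rejected; a false
    positive is an impostor who is accepted.
    """
    false_negatives = sum(1 for s in genuine_scores if not verify(s, threshold))
    false_positives = sum(1 for s in impostor_scores if verify(s, threshold))
    return (false_negatives / len(genuine_scores),
            false_positives / len(impostor_scores))

# Made-up similarity scores for trials against one enrolled speaker.
genuine = [0.91, 0.84, 0.78, 0.66]
impostor = [0.72, 0.55, 0.40, 0.31]

for threshold in (0.5, 0.7, 0.9):
    fnr, fpr = error_rates(genuine, impostor, threshold)
    print(f"threshold={threshold}: false negatives={fnr}, false positives={fpr}")
```

With these scores, raising the threshold from 0.5 to 0.9 removes every false positive but rejects three of the four genuine attempts, which is the tension the slide describes.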

7 Speaker Validation Diagram
[Diagram. Box labels: Speaker Database; Speaker template or model; Speaker ID; Actual Speaker Input; Enrollment; Calculate similarity to given template or model; Verification (Accept/Reject); Normalization; Feature Extraction; Does similarity exceed threshold?]

9 Recognition Methods
Text Dependent
  Requires user to speak text spoken at enrollment
  Usually a name, password, or phrase
Text Prompting is used to combat deception
  The system requires the user to repeat back a random phrase or list of numbers
Video example from “CSAIL” - Spoken Language Systems group at MIT.

11 Recognition Methods, cont.
Text Independent
  Non-invasive, does not require user to actively answer prompts
  Longer enrollment phase required, more training data needed
  Focuses on a subset of audio/phonetic features
Video example from Nathan Harrington at IBM developerWorks.

14 Speaker Recognition Steps
1. Input speech
2. Normalize captured speech
3. Feature extraction
4. Similarity matching
5. Decision/Threshold

15 Step 1. Input Speech
Various fidelity from inputs
  Telephone, computer microphone, noise cancelling headset, dedicated capture microphone, room microphones
Noise
  Background noise, room echoes
Variability in voice
  Speaking manner (rate and volume), sickness, aging, emotions, morning vs. evening voice

16 Step 2. Normalize Captured Speech
Intersession variability and variability over time cause speech features to fluctuate
Use of “filter bank” is common
Normalization helps remove these variations, but at a price
  Parameter-Domain normalization
  Distance/Similarity-Domain normalization

17 Step 2.a. Normalization Techniques
Parameter-Domain normalization
  Spectral equalization (i.e. signal processing)
  Dampens large variations in features by averaging over time, useful for long utterances
  Removes some speaker specific features
Distance/Similarity-Domain normalization
  Various techniques that use probabilities of known speakers that have already been enrolled
  Useful if you are doing validation
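One common form of parameter-domain normalization is to subtract the long-term average of each feature from every frame. The sketch below assumes that form; the slides do not say which equalization was actually used, and the function name and the tiny two-dimensional frames are hypothetical.

```python
def mean_normalize(frames):
    """Subtract the per-dimension average over time from every frame.

    `frames` is a list of equal-length feature vectors, one per time
    step. Averaging works best over long utterances, where the mean
    mostly reflects the channel rather than the words spoken.
    """
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]

# The same (made-up) utterance captured twice; the second capture has a
# constant offset on every feature, standing in for a different microphone.
clean = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
offset = [[x + 5.0 for x in f] for f in clean]

normalized_clean = mean_normalize(clean)
normalized_offset = mean_normalize(offset)
```

After normalization the two captures are identical, so the constant offset no longer affects matching. The cost is the one named on the slide: a speaker's own long-term average is subtracted too, so some speaker-specific information is lost.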

18 Step 3. Feature Extraction
The input utterance is converted to a set of feature vectors
Time alignment may need to be done
Calculate similarity between each captured vector and the registered speaker template or model
[Figure: “Hello” split into the overlapping units h, he, e, el, l, lo, o (the sequence appears three times). Example scores: h vs. h .90 similarity, he vs. he .60 similarity, .75 overall]
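The overlapping units and the overall score in this example can be sketched as below. Two assumptions are made here: the signal is cut into fixed-length frames with a fixed hop, and the overall score is the plain average of the per-frame similarities, which is consistent with .90 and .60 giving .75 but is not stated on the slide. The function names are hypothetical, and letters stand in for audio samples.

```python
def frame_signal(samples, frame_len, hop):
    """Cut a sequence into overlapping frames of frame_len, advancing by hop."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]

def overall_similarity(frame_scores):
    """Combine per-frame similarity scores by simple averaging."""
    return sum(frame_scores) / len(frame_scores)

# Each frame shares one element with its neighbour, like the
# overlapping units in the figure.
frames = frame_signal(list("hello"), frame_len=2, hop=1)

# Per-frame scores taken from the slide's example.
score = overall_similarity([0.90, 0.60])
```

In a real system each frame would be turned into a feature vector before comparison, and time alignment would decide which captured frame is compared with which registered frame.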

19 Side note : Analyzing speech “ah”
Waveform (Raw acoustic data)
Spectrograph (Frequency vs. Amplitude)
Formant (Continuous peak that crosses frequencies)
Image attributed to Dr. Douglas Roland from lecture notes describing speech recognition.

20 Step 4. Similarity Matching
Other pattern classification techniques can be used on the normalized input
  Each speaker gets his/her own HMM, neural network, VQ codebook, etc.
Another approach is to target specific phonemes or features
  Example showing the targeting of vowel sounds, in particular the syllable “ah”
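As one illustration of a per-speaker model, the sketch below scores an utterance against a VQ codebook for each speaker: every frame is matched to its nearest codeword, and the speaker whose codebook gives the lowest average distortion is the best match. The codewords are hand-picked for the example (in practice they would be trained, for instance by clustering enrollment frames), and all names and values are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def average_distortion(frames, codebook):
    """Mean distance from each frame to its nearest codeword."""
    total = sum(min(squared_distance(f, c) for c in codebook) for f in frames)
    return total / len(frames)

# Hypothetical per-speaker codebooks (two codewords each).
codebooks = {
    "alice": [[0.0, 0.0], [1.0, 1.0]],
    "bob": [[5.0, 5.0], [6.0, 4.0]],
}

# Made-up feature vectors for the utterance being tested.
utterance = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8]]

distortions = {name: average_distortion(utterance, cb)
               for name, cb in codebooks.items()}
best_speaker = min(distortions, key=distortions.get)
```

Here a lower distortion plays the role of a higher similarity, so the decision step picks the minimum rather than the maximum.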

21 Example of Vowel Comparisons
Charts attributed to Pasich, C. Speaker Identification MATLAB files, Connexions Web site. Feb 16, 2007.

22 Step 5. Decision/Threshold
For speaker identification, simply take the registered speaker template with the highest similarity score
For speaker verification, there needs to be a minimum acceptable similarity score

24 Conclusion : Why care?
Speaker recognition will become ubiquitous
  Cell phone applications – banking, security, logins
  Forensic analysis (voiceprints)
  Home automation (know thy user)
  Google “speaker” search? (You know it’s going to happen!)

25 References
Video links
  MIT, CSAIL.
  IBM, developerWorks.
Cole, Ronald A., Editor (1996). Survey of the State of the Art in Human Language Technology.
Iyer, Manjunath Ramachandra (2007). “Differentially Fed Artificial Neural Networks for Speech Signal Prediction.” In Hector Perez-Meana, Editor. Advances in audio and speech signal processing : technologies and applications (pp ). Hershey, PA : Idea Group Pub., c2007.
Lung, Shung-Yung (2007). “Speaker Recognition.” In Hector Perez-Meana, Editor. Advances in audio and speech signal processing : technologies and applications (pp ). Hershey, PA : Idea Group Pub., c2007.

