Presentation is loading. Please wait.

Presentation is loading. Please wait.

Voice Biometrics. General Description Each individual has individual voice components called phonemes. Each individual has individual voice components.

Similar presentations


Presentation on theme: "Voice Biometrics. General Description Each individual has individual voice components called phonemes. Each individual has individual voice components."— Presentation transcript:

1 Voice Biometrics

2 General Description Each individual has individual voice components called phonemes. Each individual has individual voice components called phonemes. Each phoneme has a pitch, cadence, and inflection Each phoneme has a pitch, cadence, and inflection These three give each one of us a unique voice sound. These three give each one of us a unique voice sound. The similarity in voice comes from cultural and regional influences in the form of accents. The similarity in voice comes from cultural and regional influences in the form of accents.

3 General Description According to the National Center of Voice and Speech, as one phonate, the vocal folds and produces a complex sound spectrum made up of a range of frequencies and overtones. As the spectrum travels through the various-sized areas in the vocal track, some of the frequencies resonate more than others. According to the National Center of Voice and Speech, as one phonate, the vocal folds and produces a complex sound spectrum made up of a range of frequencies and overtones. As the spectrum travels through the various-sized areas in the vocal track, some of the frequencies resonate more than others. Larger spaces resonate at a lower frequencies Larger spaces resonate at a lower frequencies Smaller at higher frequencies Smaller at higher frequencies The two largest spaces in the vocal track and, the throat, and the mouth, produce the two lowest resonant frequencies or formants. The two largest spaces in the vocal track and, the throat, and the mouth, produce the two lowest resonant frequencies or formants. Certain inflections and pitches we learn from family members. Certain inflections and pitches we learn from family members. Voice physiological and behavior biometric are influenced by our body, environment, and age. Voice physiological and behavior biometric are influenced by our body, environment, and age. It is possible that our voice does not always sound the same. It is possible that our voice does not always sound the same. So is voice a good biometric? So is voice a good biometric?

4 Formants are the resonant frequencies of the vocal tract when vowels are pronounced. While vowels are attributed to this periodic resonance, consonants are not periodic. They are produced by restriction of air flow with the mouth, tongue, and jaw. Formants are the resonant frequencies of the vocal tract when vowels are pronounced. While vowels are attributed to this periodic resonance, consonants are not periodic. They are produced by restriction of air flow with the mouth, tongue, and jaw. Linguists classify each type of speech sound (called phenomes) into different categories. In order to identify each phenome, it is oftentimes useful to look at its spectrogram or frequency response where one can find the characteristic formants Linguists classify each type of speech sound (called phenomes) into different categories. In order to identify each phenome, it is oftentimes useful to look at its spectrogram or frequency response where one can find the characteristic formants General Description

5 Although all phenomes have their own formants, vowel sound formants are usually the easiest to identify Although all phenomes have their own formants, vowel sound formants are usually the easiest to identify All formants have the trait of waxing and waning in energy in all frequencies, which is caused by the repeated closing and opening of the human vocal tract. On average, this repeated closing and opening occurs at a rate of 125 times per second in an adult male and 250 times per second in an adult female. All formants have the trait of waxing and waning in energy in all frequencies, which is caused by the repeated closing and opening of the human vocal tract. On average, this repeated closing and opening occurs at a rate of 125 times per second in an adult male and 250 times per second in an adult female. This rate gives the sensation of pitch (higher frequencies result in higher pitches). This rate gives the sensation of pitch (higher frequencies result in higher pitches). Formant values can vary widely from person to person, but the spectrogram reader learns to recognize patterns which are independent of particular frequencies and which identify the various phonemes with a high degree of reliability. Formant values can vary widely from person to person, but the spectrogram reader learns to recognize patterns which are independent of particular frequencies and which identify the various phonemes with a high degree of reliability.

6 Vowel “I” Vowel “A”

7 Formants can be seen very clearly in a wideband spectrogram, where they are displayed as dark bands. The darker a formant is reproduced in the spectrogram, the stronger it is (the more energy there is there, or the more audible it is): Formants can be seen very clearly in a wideband spectrogram, where they are displayed as dark bands. The darker a formant is reproduced in the spectrogram, the stronger it is (the more energy there is there, or the more audible it is):

8 But there is a difference between oral vowels on one hand, and consonants and nasal vowels on the other. But there is a difference between oral vowels on one hand, and consonants and nasal vowels on the other. Nasal consonants and nasal vowels can exhibit additional formants, nasal formants, arising from resonance within the nasal branch. Nasal consonants and nasal vowels can exhibit additional formants, nasal formants, arising from resonance within the nasal branch. Consequently, nasal vowels may show one or more additional formants due to nasal resonance, while one or more oral formants may be weakened or missing due to nasal antiresonance. Consequently, nasal vowels may show one or more additional formants due to nasal resonance, while one or more oral formants may be weakened or missing due to nasal antiresonance.

9

10 Oral formants are numbered consecutively upwards from the lowest frequency. In the example, fragment from the previous wideband spectrogram shows the sequence [ins] from the beginning. Five formants are visible in this [i], labeled F1-F5. Four are visible in this [n] (F1-F4) and there is a hint of the fifth. There are four more formants between 5000Hz and 8000Hz in [i] and [n] but they are too weak to show up on the spectrogram, and mostly they are also too weak to be heard. Oral formants are numbered consecutively upwards from the lowest frequency. In the example, fragment from the previous wideband spectrogram shows the sequence [ins] from the beginning. Five formants are visible in this [i], labeled F1-F5. Four are visible in this [n] (F1-F4) and there is a hint of the fifth. There are four more formants between 5000Hz and 8000Hz in [i] and [n] but they are too weak to show up on the spectrogram, and mostly they are also too weak to be heard. The situation is reversed in this [s], where F4-F9 show very strongly, but there is little to be seen below F4. The situation is reversed in this [s], where F4-F9 show very strongly, but there is little to be seen below F4.

11 Individual Differences in Vowel Production There are differences in individual formant frequencies attributable to: size, age, gender, environment, and speech. There are differences in individual formant frequencies attributable to: size, age, gender, environment, and speech. The acoustic differences that allow us to differentiate between various vowel productions are usually explained by a source-filter theory. The acoustic differences that allow us to differentiate between various vowel productions are usually explained by a source-filter theory. The source is the sound spectrum created by airflow through the glottis which varies as vocal folds vibrate. The filter is the vocal track itself- its shape is controlled by the speaker. The source is the sound spectrum created by airflow through the glottis which varies as vocal folds vibrate. The filter is the vocal track itself- its shape is controlled by the speaker.

12 The three figures below (taken from Miller) illustrate how different configurations of the vocal tract selective pass certain frequencies and not others. The first shows the configuration of the vocal tract while articulating the phoneme [i] as in the word "beet," the second the phoneme [a], as in "father," and the third [u] as in "boot." Note how each configuration uniquely affects the acoustic spectrum--i.e., the frequencies that are passed The three figures below (taken from Miller) illustrate how different configurations of the vocal tract selective pass certain frequencies and not others. The first shows the configuration of the vocal tract while articulating the phoneme [i] as in the word "beet," the second the phoneme [a], as in "father," and the third [u] as in "boot." Note how each configuration uniquely affects the acoustic spectrum--i.e., the frequencies that are passed

13

14 Voice Capture Voice can be captured in two ways: Voice can be captured in two ways: Dedicated resource like a microphone Dedicated resource like a microphone Existing infrastructure like a telephone Existing infrastructure like a telephone Captured voice is influenced by two factors: Captured voice is influenced by two factors: Quality of the recording device Quality of the recording device The recording environment The recording environment In wireless communication, voice travels through open air and then through terrestrial lines, it therefore, suffers from great interference. In wireless communication, voice travels through open air and then through terrestrial lines, it therefore, suffers from great interference.

15 Algorithms for Voice Interpretation Algorithms used to capture, enroll and match voice fall into the following categories: Algorithms used to capture, enroll and match voice fall into the following categories: Fixed phase verification Fixed phase verification Fixed vocabulary verification Fixed vocabulary verification Flexible vocabulary verification Flexible vocabulary verification Text-independent verification. Text-independent verification.

16 Voice Verification Voice biometrics works by digitizing a profile of a person's speech to produce a stored model voice print, or template. Voice biometrics works by digitizing a profile of a person's speech to produce a stored model voice print, or template. Biometric technology reduces each spoken word to segments composed of several dominant frequencies called formants. Biometric technology reduces each spoken word to segments composed of several dominant frequencies called formants. Each segment has several tones that can be captured in a digital format. Each segment has several tones that can be captured in a digital format. The tones collectively identify the speaker's unique voice print. The tones collectively identify the speaker's unique voice print. Voice prints are stored in databases in a manner similar to the storing of fingerprints or other biometric data. Voice prints are stored in databases in a manner similar to the storing of fingerprints or other biometric data.

17 Application of Voice Technology Voice technology is applicable in a variety of areas but for us, those used in biometric technology include: Voice technology is applicable in a variety of areas but for us, those used in biometric technology include: Voice Verification Voice Verification Internet/intranet security: Internet/intranet security: on-line banking on-line banking on-line security trading on-line security trading access to corporate databases access to corporate databases on-line information services on-line information services PC access restriction software PC access restriction software Parental control Parental control Business software as a DSP solution at check points where smart cards or PIN used entrance / exit control points Business software as a DSP solution at check points where smart cards or PIN used entrance / exit control points

18 Voice Recognition Voice Recognition hands free devices, for example car mobile hands free sets hands free devices, for example car mobile hands free sets electronic devices, for example telephone, PC, or ATM cash dispenser electronic devices, for example telephone, PC, or ATM cash dispenser software applications, for example games, educational or office software software applications, for example games, educational or office software industrial areas, warehouses, etc. industrial areas, warehouses, etc. spoken multiple choice in interactive voice response systems, for example in telephony spoken multiple choice in interactive voice response systems, for example in telephony applications for people with disabilities applications for people with disabilities

19 Voice verification systems are different from voice recognition systems although the two are often confused. Voice verification systems are different from voice recognition systems although the two are often confused. Voice recognition is used to translate the spoken word into a specific response. The goal of voice recognition systems is simply to understand the spoken word, not to establish the identity of the speaker. A good familiar example of voice recognition systems is that of an automated call center asking a user to “press the number one on his phone keypad or say the word ‘one’.” In this case, the system is not verifying the identity of the person who says the word “one”; it is merely checking that the word “one” was said instead of another option. Voice recognition is used to translate the spoken word into a specific response. The goal of voice recognition systems is simply to understand the spoken word, not to establish the identity of the speaker. A good familiar example of voice recognition systems is that of an automated call center asking a user to “press the number one on his phone keypad or say the word ‘one’.” In this case, the system is not verifying the identity of the person who says the word “one”; it is merely checking that the word “one” was said instead of another option. Voice verification verifies the vocal characteristics against those associated with the enrolled user. Voice verification verifies the vocal characteristics against those associated with the enrolled user. The US PORTPASS Program, deployed at remote locations along the U.S.–Canadian border, recognizes voices of enrolled local residents speaking into a handset. This system enables enrollees to cross the border when the port is unstaffed. The US PORTPASS Program, deployed at remote locations along the U.S.–Canadian border, recognizes voices of enrolled local residents speaking into a handset. This system enables enrollees to cross the border when the port is unstaffed.

20 How is voice recognition performed? Voice recognition can be divided into two classes: Voice recognition can be divided into two classes: template matching - template matching is the simplest technique and has the highest accuracy when used properly, but it also suffers from the most limitations. template matching - template matching is the simplest technique and has the highest accuracy when used properly, but it also suffers from the most limitations. feature analysis feature analysis The first step is for the user to speak a word or phrase into a microphone. The first step is for the user to speak a word or phrase into a microphone. The electrical signal from the microphone is digitized by an "analog-to-digital (A/D) converter", and is stored in memory. The electrical signal from the microphone is digitized by an "analog-to-digital (A/D) converter", and is stored in memory. To determine the "meaning" of this voice input, the computer attempts to match the input with a digitized voice sample, or template, that has a known meaning. To determine the "meaning" of this voice input, the computer attempts to match the input with a digitized voice sample, or template, that has a known meaning. This technique is a close analogy to the traditional command inputs from a keyboard. The program contains the input template, and attempts to match this template with the actual input using a simple conditional statement. This technique is a close analogy to the traditional command inputs from a keyboard. The program contains the input template, and attempts to match this template with the actual input using a simple conditional statement.

21 The two stages of a biometric system

22 Software Open Source Speech Software from Carnegie Mellon University Open Source Speech Software from Carnegie Mellon University Hephaestus: Open Source activities at Carnegie Mellon Hephaestus: Open Source activities at Carnegie Mellon Hephaestus CMU Sphinx recognition engines -- Sphinx 2, Sphinx 3, Sphinx 4, and SphinxTrain. CMU Sphinx recognition engines -- Sphinx 2, Sphinx 3, Sphinx 4, and SphinxTrain. CMU Sphinx CMU Sphinx PocketSphinx Sphinx for embedded platforms. PocketSphinx Sphinx for embedded platforms. PocketSphinx Festvox Project speech synthesis engines, voices and tools Festvox Project speech synthesis engines, voices and tools Festvox Project Festvox Project CMU Statistical Language Modeling Toolkit (CMU SLM) CMU Statistical Language Modeling Toolkit (CMU SLM) CMU Statistical Language Modeling Toolkit CMU Statistical Language Modeling Toolkit CMUdict -- pronunciation dictionary CMUdict -- pronunciation dictionary CMUdict OpenVXI -- VoiceXML browser OpenVXI -- VoiceXML browser OpenVXI SALT browser - finally online! SALT browser - finally online! SALT browser SALT browser Audio Databases -- AN4, Microphone array, etc Audio Databases -- AN4, Microphone array, etc Audio Databases Audio Databases RavenClaw-Olympus Dialog system development toolkit. RavenClaw-Olympus Dialog system development toolkit. RavenClaw-Olympus We will try CMU Sphinx Group Open Source Speech Recognition We will try CMU Sphinx Group Open Source Speech Recognition


Download ppt "Voice Biometrics. General Description Each individual has individual voice components called phonemes. Each individual has individual voice components."

Similar presentations


Ads by Google