Speaker Recognition G. CHOLLET, G. GRAVIER,


1 Speaker Recognition G. CHOLLET, G. GRAVIER,
J. KHARROUBI, D. PETROVSKA-DELACRETAZ (chollet, ENST/CNRS-LTCI 46 rue Barrault PARIS cedex 13

2 Our affiliations ENST: Ecole Nationale Supérieure des Télécommunications CNRS: Centre National de la Recherche Scientifique LTCI: Laboratoire de Traitement et Communication de l’Information

3 What is ENST? Ecole Nationale Supérieure des Télécommunications
Classed among the ‘Grandes Ecoles d'Ingénieurs’. 250 state-certified engineers graduate each year. Part of the ‘Groupement des Ecoles de Télécommunications’.

4 Modalities for Identity Verification
[Figure labels: Bla-bla, SECURED SPACE, PIN]

5 Modalities for Identity Verification
A device you own (key, smart card, …) or a code you remember (password, …): could be lost or stolen.
Physiological characteristics (face, iris, fingerprint, hand shape, …): need special equipment.
Behavioral characteristics: speech, signature, keystroke, …
Speech is the preferred modality over the telephone (but a ‘voice print’ is much more variable than a fingerprint).

6 Outline
Where is the information about speaker identity in the speech signal?
How well can humans recognize a speaker?
Applications of Speaker Recognition
Prior knowledge of what the speaker said
Combining Speech Recognition and Speaker Verification
Some research activities at ENST:
  Speaker verification:
    The CAVE-PICASSO projects (text dependent)
    The ELISA consortium, NIST evaluations (text independent)
    The EUREKA !2340 MAJORDOME project
  Multimodal Identity Verification: the M2VTS and BIOMET projects
Perspectives

7 Speaker Identity in Speech
Differences in vocal tract shapes and muscular control. Fundamental frequency (typical values): 100 Hz (male), 200 Hz (female), 300 Hz (child). Glottal waveform. Phonotactics. Lexical usage. The difference between the voices of twins is a limiting case. Voices can also be imitated or disguised.

8 Speaker Identity
[Figure: spectral envelope of /i:/ (amplitude A against frequency f) for Speaker A and Speaker B]
Segmental factors (~30 ms): glottal excitation (fundamental frequency, amplitude, voice quality, e.g. breathiness); vocal tract (formant frequencies and bandwidths).
Suprasegmental factors: speaking speed (timing and rhythm of speech units); intonation patterns; dialect, accent, pronunciation habits.

9 Inter-speaker Variability
We were away a year ago.

10 Intra-speaker Variability
We were away a year ago.

11 Vocal Apparatus

12 Speech production

13 Glottal Waveform Modeling
Fitting a glottal pulse model to the excitation waveform allows perceptually relevant modifications to voice quality.
[Figure: amplitude A against time t; original residual in blue, synthetic residual in red]

14 Applications of Speaker Recognition
Identification from an open set (unrealistic). Identification from a closed set (who is speaking in a videoconference?). Verification of a claimed identity (risk of deliberate imposture). Human performance in speaker recognition is far from perfect (it depends strongly on familiarity with the subject).

15 Speaker Verification Typology of approaches (EAGLES Handbook)
Text dependent: public password, private password, customized password. Text prompted. Text independent. Incremental enrolment. Evaluation.

16 What are the sources of difficulty ?
Intra-speaker variability of the speech signal (due to stress, pathologies, environmental conditions, …). Recording conditions (filtering, noise, …). Temporal drift. Intentional imposture. Voice disguise.

17 Text-dependent Speaker Verification
Uses Automatic Speech Recognition techniques (DTW, HMM, …). The client model is adapted from a speaker-independent HMM (‘World’ model). Client and world models are aligned synchronously to compute a score.

18 Dynamic Time Warping (DTW)
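The slide body is not preserved in this transcript. As a minimal sketch of the technique named in the title, the function below computes the cumulative DTW cost between two sequences of feature vectors. The function name, the Euclidean local distance and the unconstrained step pattern are assumptions made for illustration, not details taken from the slides.

```python
import math

def dtw_distance(a, b):
    """Cumulative DTW cost between two sequences of feature vectors.

    Assumptions (not from the slides): Euclidean local distance and the
    basic step pattern (insertion, deletion, match) with no slope limits.
    """
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances alone
                                     cost[i][j - 1],      # b advances alone
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# A slower repetition of the same pattern aligns at zero cost.
template = [(0.0,), (1.0,), (2.0,)]
slow = [(0.0,), (0.0,), (1.0,), (1.0,), (2.0,), (2.0,)]
other = [(0.0,), (3.0,), (0.0,)]
print(dtw_distance(template, slow), dtw_distance(template, other))
```

In a text-dependent system, a cost of this kind between an access utterance and the client's stored template could serve as a (dis)similarity score.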

19 HMM structure depends on the application

20 Signal detection theory
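Only the title of this slide survives. As a hedged illustration of how signal detection theory is usually applied to verification, the sketch below derives the Bayes decision threshold on a log-likelihood ratio from a target prior and two error costs. The function names and the example prior and costs are hypothetical.

```python
import math

def bayes_threshold(p_target, cost_miss, cost_fa):
    """Log-likelihood-ratio threshold that minimises expected cost.

    Accept when LLR >= log( cost_fa * P(impostor) / (cost_miss * P(target)) ).
    """
    return math.log((cost_fa * (1.0 - p_target)) / (cost_miss * p_target))

def accept(llr, threshold):
    """Accept the claimed identity when the score reaches the threshold."""
    return llr >= threshold

# Equal priors and equal costs give a threshold of zero.
print(bayes_threshold(0.5, 1.0, 1.0))
# Rare targets (hypothetical prior of 1%) push the threshold up.
print(bayes_threshold(0.01, 1.0, 1.0))
```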

21 Score normalisation
World model. Cohort normalisation. Discriminant techniques.
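The first two normalisations listed can be sketched as follows, assuming the raw scores are log-likelihoods of the test utterance under each model. The function names are hypothetical, and the cohort variant shown (subtracting the mean cohort log-likelihood) is only one of several possible choices.

```python
import statistics

def world_normalised(client_loglik, world_loglik):
    """Log-likelihood ratio of the client model against the world model."""
    return client_loglik - world_loglik

def cohort_normalised(client_loglik, cohort_logliks):
    """Client log-likelihood minus the mean log-likelihood of a cohort
    of other speakers' models (one possible cohort normalisation)."""
    return client_loglik - statistics.mean(cohort_logliks)

print(world_normalised(-40.0, -45.0))
print(cohort_normalised(-40.0, [-44.0, -46.0, -48.0]))
```

Both variants express the score relative to how well alternative models explain the same utterance, which makes a single decision threshold more stable across utterances.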

22 Detection Error Tradeoff (DET) Curve
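The curve itself is not preserved here. A DET curve plots the false-rejection (miss) rate against the false-acceptance rate as the decision threshold is swept, usually on normal-deviate axes. The sketch below computes those operating points and an approximate Equal Error Rate (EER) from two lists of scores; the function names and toy scores are hypothetical.

```python
def error_rates(target_scores, impostor_scores, threshold):
    """False-acceptance and false-rejection rates at one threshold
    (a trial is accepted when its score is >= threshold)."""
    fr = sum(s < threshold for s in target_scores) / len(target_scores)
    fa = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return fa, fr

def det_points(target_scores, impostor_scores):
    """(threshold, FA rate, FR rate) at every observed score value."""
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    return [(t, *error_rates(target_scores, impostor_scores, t))
            for t in thresholds]

def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: the operating point where FA and FR are closest."""
    _, fa, fr = min(det_points(target_scores, impostor_scores),
                    key=lambda p: abs(p[1] - p[2]))
    return (fa + fr) / 2.0

targets = [2.0, 3.0, 4.0, 5.0]
impostors = [0.0, 1.0, 2.5, 3.5]
print(equal_error_rate(targets, impostors))
```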

23 CAVE – PICASSO

24 Incremental enrolment of customised password
The client chooses his password using some feedback from the system. The system attempts a phonetic transcription of the password. Incremental enrolment is achieved on further repetitions of that password. Speaker-independent phone HMMs are adapted with the client enrolment data. Synchronous-alignment likelihood-ratio scoring is performed on access trials.

25 Deliberate imposture
The impostor has some recordings of the target client's voice. He can record the same sentences and align these speech signals with the recordings of the client. A transformation (Multiple Linear Regression) is computed from these aligned data. The impostor has heard the target client's password. He records that password and applies the transformation to this recording. The PICASSO reference system, with less than 1% EER, is defeated by this procedure (more than 30% EER).
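The regression step can be illustrated as an ordinary least-squares fit of an affine map between time-aligned feature frames. This is a sketch under assumptions, not the PICASSO procedure itself: the function names, the affine form and the synthetic data are hypothetical, and frame alignment is taken as already done.

```python
import numpy as np

def fit_linear_transform(source_frames, target_frames):
    """Least-squares affine map from aligned source frames to target frames.

    Returns W of shape (d_in + 1, d_out); the last row is the bias.
    """
    X = np.asarray(source_frames, dtype=float)
    Y = np.asarray(target_frames, dtype=float)
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias column
    W, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return W

def apply_linear_transform(W, frames):
    """Map new source frames with a previously fitted transform."""
    X = np.asarray(frames, dtype=float)
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    return X1 @ W

# Synthetic aligned data generated by a known affine map.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))
A = np.array([[1.0, 0.2, 0.0], [0.0, 0.9, 0.1], [0.3, 0.0, 1.1]])
b = np.array([0.5, -1.0, 2.0])
tgt = src @ A + b
W = fit_linear_transform(src, tgt)
```

The reported rise from under 1% to over 30% EER is the reason the deck later recommends confirming one verification mode with another.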

26 Speaker Verification (text independent)
The ELISA consortium: ENST, LIA, IRISA, ... NIST evaluations. Ergodic HMM. Gaussian Mixture Model.

27 Gaussian Mixture Model
Parametric representation of the probability distribution of observations:
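The formula that followed on the slide is not preserved in this transcript. The standard form of a Gaussian mixture density with M components is given below; the symbols (weights w_i, means \mu_i, covariances \Sigma_i, model \lambda) are the usual notation and are an assumption, since the slide's own notation is unknown.

```latex
p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x;\, \mu_i, \Sigma_i),
\qquad \sum_{i=1}^{M} w_i = 1, \quad w_i \ge 0
```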

28 Gaussian Mixture Models
8 Gaussians per mixture

29 National Institute of Standards & Technology (NIST) Speaker Verification Evaluations
Annual evaluation since 1995. Common paradigm for comparing technologies.

30 GMM speaker modeling
[Diagram: world data → front-end → GMM modeling → world GMM model; target speaker data → front-end → GMM model adaptation (starting from the world GMM model) → target GMM model]

31 Baseline GMM method
[Diagram: test speech → front-end → scored against the hypothesised target GMM model and the world GMM model → LLR score]
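The scoring path in this diagram can be sketched as a frame-averaged log-likelihood ratio between the hypothesised target GMM and the world GMM. Diagonal covariances, the tuple layout of a model and all names below are assumptions chosen for a compact example.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def gmm_loglik(x, gmm):
    """Log-likelihood of one frame; gmm is a list of (weight, mean, var)."""
    terms = [math.log(w) + log_gauss_diag(x, mu, var) for w, mu, var in gmm]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

def llr_score(frames, target_gmm, world_gmm):
    """Average per-frame log-likelihood ratio, target against world."""
    total = sum(gmm_loglik(x, target_gmm) - gmm_loglik(x, world_gmm)
                for x in frames)
    return total / len(frames)

# Toy 1-D models: the world covers two regions, the target only one.
world = [(0.5, [0.0], [1.0]), (0.5, [4.0], [1.0])]
target = [(1.0, [4.0], [1.0])]
print(llr_score([[3.9], [4.1]], target, world))  # positive: fits the target
print(llr_score([[0.0], [0.2]], target, world))  # negative: fits the world
```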

32 Support Vector Machines and Speaker Verification
A hybrid GMM-SVM system is proposed. An SVM scoring model is trained on development data to separate true-target accesses from impostor accesses, using a new feature representation based on GMMs. Modeling: GMM. Scoring: SVM.

33 SVM principles
[Figure: an input X is mapped from the input space to y(X) in the feature space, where a hyperplane determines Class(X)]
Separating hyperplanes H, with the optimal hyperplane Ho.
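What distinguishes the optimal hyperplane from other separating hyperplanes is its margin: the distance to the closest training point, which it maximises. The sketch below only measures that margin for given hyperplanes and does not train an SVM; the names and toy points are hypothetical.

```python
import math

def decision_value(w, b, x):
    """Signed value of w.x + b for one point."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Class label (+1 or -1) from the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from the hyperplane to any labelled point.

    Positive only if the hyperplane separates the two classes.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(w, b, x) / norm
               for x, y in zip(points, labels))

points = [(0.0, 0.0), (0.0, 1.0), (2.0, 0.0), (2.0, 1.0)]
labels = [-1, -1, 1, 1]
h_margin = geometric_margin((1.0, 0.0), -0.5, points, labels)   # some H
ho_margin = geometric_margin((1.0, 0.0), -1.0, points, labels)  # midway
print(h_margin, ho_margin)
```

Both hyperplanes classify the four points correctly, but the one placed midway between the classes has twice the margin.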

34 Results

35 Combining Speech Recognition and Speaker Verification.
Speaker-independent phone HMMs. Selection of segments or segment classes which are speaker specific. Preliminary evaluations are performed on the NIST extended data set (one hour of training data per speaker).

36 Selection of nasals in words in -ing
being, everything, getting, anything, thing, something, things, going

37 «MAJORDOME»
Unified Messaging System, Eureka Project no. 2340.
Vecsys, EDF, Software602, KTH, Mensatec, UPC, Airtel.
D. Bahu-Leyser, G. Chollet, K. Hallouli, J. Kharroubi, L. Likforman, D. Mostefa, D. Petrovska, M. Sigelle, P. Vaillant

38 Majordome’s Functionalities
Speaker verification Dialogue Routing Updating the agenda Automatic summary Voice Fax MAJORDOME (

39 Voice technology in Majordome
Server-side background tasks: continuous speech recognition applied to voice messages upon reception; detection of sender’s name and subject.
User interaction: speaker identification and verification; speech recognition (receiving user commands through voice interaction); text-to-speech synthesis (reading text summaries, e-mails or faxes).

40 BIOMET
[Figure labels: Bla-bla, SECURED SPACE, PIN]

41 BIOMET
An extension of the M2VTS and DAVID projects to include such modalities as signature, fingerprint and hand shape. Initial support (two years) is provided by GET (Groupement des Ecoles de Télécommunications). Emphasis will be on fusion of scores obtained from two or more modalities.
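One simple form of score fusion is a weighted average of per-modality scores after each has been brought to a common scale. The sketch below assumes that form only for illustration; the slides do not say which fusion rule BIOMET uses, and the function names, weights and development scores are hypothetical.

```python
import statistics

def z_normalise(score, dev_scores):
    """Rescale one modality's score using development-set statistics."""
    mu = statistics.mean(dev_scores)
    sigma = statistics.pstdev(dev_scores)
    return (score - mu) / sigma

def fuse(scores, weights):
    """Weighted average of already-normalised scores, one per modality."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Hypothetical speech and signature scores on different raw scales.
speech = z_normalise(3.0, [1.0, 3.0, 5.0])
signature = z_normalise(80.0, [40.0, 60.0, 80.0])
print(fuse([speech, signature], [1.0, 1.0]))
```

The weights could be tuned on development data so that the more reliable modality counts for more.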

42 Conclusions and Perspectives
Evaluation trials (as conducted by NIST) help improve technology. A strategy combining speech recognition and segmental scoring seems to be a promising approach for speaker verification. Whenever possible, text-independent speaker verification should be confirmed by text-dependent verification. Whenever possible, fusion of multiple experts (preferably multimodal) should be performed.

