Speaker Verification System Part B Final Presentation

Presentation transcript:

Speaker Verification System Part B Final Presentation Performed by: Barak Benita & Daniel Adler Instructor: Erez Sabbag

The Project Goal Implementation of a speaker verification algorithm on a DSP. The verification module will perform real time authentication of the user based on sampled voice data. The idea is to integrate the speaker verification module with other security and management modules, allowing them to grant access to resources based on the speaker's voice verification.

Introduction Speaker verification is the process of automatically authenticating the speaker on the basis of individual information included in speech waves. [Diagram: Speaker's Voice Segment + Speaker's Identity (Reference) -> Speaker Verification System -> Result [0:1]]

System Overview [Diagram: a user ("My name is Bob!") speaks to a BT base station; the BT base stations connect over the LAN to the Speaker Verification Unit and a server; an unauthorized speaker gets "Access Denied".]

System Description The system is built around TI's C6701 floating point DSP running the speaker verification algorithm. A user with a hand device (e.g. Bluetooth on a PDA) will receive access to different resources (door opening, file access, etc.) based on a voice verification process. The project implements only the speaker verification algorithm on the DSP, with input and output interfaces to interact with other devices (e.g. Bluetooth). The DSP is loaded with the user's voice signature. Each time user verification is needed, the algorithm compares the speaker's voice with the signature.

System Block Diagram [Diagram: a Bluetooth unit with a codec ("My name is Bob") communicates over the Bluetooth radio interface with a Bluetooth base station; the base station connects over the LAN to the DSP (with its own codec), to an Enrollment Server (training phase – building a signature) that supplies the signature parameters, and to an Authorization Server; a verification channel and optional voice channels run between the units.]

Project Description: Part One: Literature review, Algorithm selection, MATLAB implementation, Result analysis. Part Two: Implementation of the chosen algorithm on a DSP.

Speaker Verification Process [Diagram: Analog Speech -> Pre-Processing -> Feature Extraction -> Pattern Matching (against the Reference Model) -> Decision -> Result [0:1]]

Implemented Algorithms: Feature Extraction Module – MFCC MFCC (Mel Frequency Cepstral Coefficients) is the most common technique for feature extraction. MFCC tries to mimic the way our ears work by analyzing the speech waves linearly at low frequencies and logarithmically at high frequencies. The processing chain is as follows: Windowed PDS Frame -> FFT -> Spectrum -> Mel-frequency Wrapping -> Mel Spectrum -> Cepstrum -> Mel Cepstrum

Implemented Algorithms: Pattern Matching Modeling Module – Vector Quantization (VQ) In the enrolment part we build a codebook of the speaker according to the LBG (Linde, Buzo, Gray) algorithm, which creates an N-size codebook from a set of L feature vectors. In the verification stage, we measure the distortion of the given sequence of feature vectors against the reference codebook. Pattern Matching = Distortion measure. Reference Model = Codebook. [Diagram: Feature Vector -> Codebook -> Distortion Rate]
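A minimal LBG training sketch in C may clarify the enrolment step. This is not the project's code; it assumes Euclidean distance, 18-dimensional feature vectors, and a power-of-two codebook size, and uses the 25 refinement iterations quoted on the Tested System slide.

/* Minimal LBG sketch: build an n_code-entry codebook from n_vec
 * feature vectors by repeated splitting plus k-means refinement.
 * Assumes n_code is a power of two, at most MAX_CODE. */
#include <math.h>
#include <string.h>

#define D        18      /* feature vector size (as in the tested system) */
#define MAX_CODE 128     /* codebook size (as in the tested system)       */
#define EPS      0.01f   /* splitting perturbation                        */

static float dist2(const float *a, const float *b)
{
    float d = 0.0f;
    for (int i = 0; i < D; i++) {
        float t = a[i] - b[i];
        d += t * t;
    }
    return d;
}

void lbg_train(const float *vec, int n_vec, float *code, int n_code)
{
    /* Start from the centroid of the whole training set. */
    memset(code, 0, (size_t)n_code * D * sizeof(float));
    for (int v = 0; v < n_vec; v++)
        for (int i = 0; i < D; i++)
            code[i] += vec[v * D + i] / n_vec;

    for (int n_cur = 1; n_cur < n_code; n_cur *= 2) {
        /* Split every codeword into a (1+EPS) / (1-EPS) pair. */
        for (int c = n_cur - 1; c >= 0; c--)
            for (int i = 0; i < D; i++) {
                float x = code[c * D + i];
                code[(2 * c) * D + i]     = x * (1.0f + EPS);
                code[(2 * c + 1) * D + i] = x * (1.0f - EPS);
            }

        /* Refine the doubled codebook with k-means iterations
         * (25 iterations for codebook creation, as tested). */
        for (int it = 0; it < 25; it++) {
            static float sum[MAX_CODE * D];
            static int   cnt[MAX_CODE];
            memset(sum, 0, sizeof(sum));
            memset(cnt, 0, sizeof(cnt));
            for (int v = 0; v < n_vec; v++) {
                int best = 0;
                float bd = dist2(&vec[v * D], &code[0]);
                for (int c = 1; c < 2 * n_cur; c++) {
                    float d = dist2(&vec[v * D], &code[c * D]);
                    if (d < bd) { bd = d; best = c; }
                }
                cnt[best]++;
                for (int i = 0; i < D; i++)
                    sum[best * D + i] += vec[v * D + i];
            }
            for (int c = 0; c < 2 * n_cur; c++)
                if (cnt[c] > 0)
                    for (int i = 0; i < D; i++)
                        code[c * D + i] = sum[c * D + i] / cnt[c];
        }
    }
}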

Implemented Algorithms: Decision In VQ the decision is based on checking the distortion rate against a preset threshold: acceptance if distortion rate > t, else rejection. In this project no decision module is built; the output of the system is a score (values between 0 and 1) that indicates how well the speaker fits the reference model: Score = exp (-mean distance)
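A matching verification sketch (an illustration reusing dist2() and D from the LBG sketch above, not the project's code): the mean nearest-codeword distance of the utterance is mapped to the slide's score. This is consistent with the results reported later, e.g. a mean distance of 0.4011 gives exp(-0.4011) = 0.6695, i.e. 66.95%.

/* Mean distortion of the utterance's feature vectors against the
 * reference codebook, mapped to the score on the slide. */
#include <math.h>

float verify_score(const float *vec, int n_vec,
                   const float *code, int n_code)
{
    float total = 0.0f;
    for (int v = 0; v < n_vec; v++) {
        float best = dist2(&vec[v * D], &code[0]);
        for (int c = 1; c < n_code; c++) {
            float d = dist2(&vec[v * D], &code[c * D]);
            if (d < best) best = d;
        }
        total += best;            /* nearest-codeword distortion  */
    }
    return expf(-total / n_vec);  /* Score = exp(-mean distance)  */
}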

Implementation Environment Hardware tools: TI DSP 6701 EVM board PC host station Software development tools: TI Code Composer Matlab 6.1 Programming Languages: C Assembler Matlab

Working Environment

TI DSP 6701 EVM Why? Floating Point Designed Especially for Voice Applications Large Bank of On Chip Memory High level development (C) PCI Interface Why Not? Price Size Consumption

Program Workflow [Diagram: the DSP program runs Analog Speech (input) -> Pre-Processing -> Feature Extraction -> Pattern Matching -> Decision -> Result [0:1] (output); a MATLAB program supplies the Reference Model used by Pattern Matching.]

Step By Step Implementation
Pre-processing a 'ones' vector on the DSP and comparing it to the Matlab results
Pre-processing an audio file and comparing to the Matlab results
Feature extraction of the audio file (after pre-processing) and comparing to the Matlab results
Pattern matching the feature vectors against a 'ones' codebook matrix and comparing to the Matlab results (running with the same codebook)
Creating a real codebook from a reference speaker, importing it to the DSP, and comparing the running results of the DSP and the Matlab
Verifying that the distances of the speakers from the codebook in the DSP program and in the Matlab program are the same

Creating the Assembler Lookup Files
Creating the output data through Matlab functions (e.g. hamming(n))
Saving the output in an assembler lookup table format
Referencing the lookup table with a name that will be called from the C source code in the DSP project (as a function)

h = hamming(n);
hamming = fopen('hamming.asm', 'wt', 'l');
fprintf(hamming, '; hamming.asm - single precision floating point table generated from MATLAB\n');
fprintf(hamming, '\t.def\t_hamming\n');
fprintf(hamming, '\t.sym\t_hamming, _hamming, 54, 2, %d,, %d\n', size, n);
fprintf(hamming, '\t.data\n');
fprintf(hamming, '_hamming:\n');
fprintf(hamming, '\t.word\t%tXh, %tXh, %tXh, %tXh\n', h);
fprintf(hamming, '\n');
fclose(hamming);

Importing the file as an asm file (adding a file to the project) to the DSP project:

; hamming.asm - single precision floating point table generated from MATLAB
.def _hamming
.sym _hamming, _hamming, 54, 2, 8192,, 256
.data
_hamming:
.word 3DA3D70Ah, 3DA4203Fh, 3DA4FBD3h, 3DA669A4h
.word 3DA86978h, 3DAAFB01h, 3DAE1DD8h, 3DB1D180h
.word 3DB61567h, 3DBAE8E1h, 3DC04B30h, 3DC63B7Dh
.word 3DCCB8DCh, 3DD3C24Bh, 3DDB56B1h, 3DE374E1h
.word 3DEC1B99h, 3DF5497Fh, 3DFEFD27h, 3E049A87h
.word 3E09F7D0h, 3E0F9597h, 3E1572FFh, 3E1B8F1Ch

Using the lookup table in the C source code:

// ----- Windowing the filtered frame with Hamming -----
for (k = 0; k < N; k++) {
    for (j = 0; j < N; j++) {
        if (k - j < 0) break;
        frame[k] += hamming[j] * filtered_frame[k - j];
    }
}

Binding All The Pieces [Diagram: the DSP program (C code) runs Analog Speech (input) -> Pre-Processing -> Feature Extraction -> Pattern Matching -> Decision -> Result [0:1] (output). Assembly files generated through Matlab feed each stage: Hamming.asm and the voice data file Sari5fix.asm (generated from a *.wav format file through the Matlab wavread function) for pre-processing, Melbank.asm and Rdct.asm for feature extraction, and Codebook.asm for pattern matching.]

Software Modules [Call graph: main calls init O(1), extract_frame O(n^2), digitrev_index with bitrev O(n), hamming O(1), bitrev O(n), melbank O(n), cfftr2_dit O(n log n), and calc_dist O(1).]

Project Structure
speakerverification.pjt
Include Files: board.h, codec.h, dma.h, intr.h, mcbsp.h, link.cmd, pci.h, regs.h
Libraries: verification.h, rts6700.lib
Source: bitrevf.asm, cfftr2.asm, codebook.asm, digitrev_index.c, hamming.asm, melbank.asm, rdct.asm, verification.c

Tested System
The tested algorithms and methods were MFCC and VQ with the following parameters:
Sampling Frequency: 11025 Hz
Feature Vector Size: 18
Window Size: 256
Offset Size: 128
Codebook Size: 128
Number of iterations for codebook creation: 25
We compared the Matlab and DSP results based on a codebook created from Daniel's 60 seconds of random speech and a random selection of five-second samples from different speakers.

Verifications The DSP results were compared to the Matlab simulation. We chose random speakers from the speakers DB with one reference codebook. For example:

Person   MATLAB            DSP
Daniel   66.95% (0.4011)   66.95% (0.4011)
Barak    44.01% (0.8206)   44.01% (0.8206)
Ayelet   43.61% (0.8299)   43.61% (0.8299)
Diego    53.97% (0.6166)   53.97% (0.6166)
Adi      42.07% (0.8656)   42.07% (0.8656)

Conclusions
The TI DSP 6701 EVM is capable of performing speaker verification analysis and achieving high resolution results (matching those achieved in Matlab)
Speaker verification algorithms are not mature enough to become a good biometric detection solution
Code Composer is not stable and polished enough to become an "easy to use" development environment
A second phase project, which will implement a complete verification system, should be built

Time Table – First Semester
14.11.01 – Project description presentation
15.12.01 – Completion of phase A: literature review and algorithm selection
25.12.01 – Handing out the mid-term report
25.12.01 – Beginning of phase B: algorithm implementation in MATLAB
10.04.02 – Publishing the MATLAB results and selecting the algorithm that will be implemented on the DSP

Time Table – Second Semester
10.04.02 – Presenting the progress and planning of the project to the supervisor
17.04.02 – Finishing MATLAB testing
17.04.02 – Beginning of the implementation on the DSP
07.11.02 – Project presentation and handing in the final project report

Thanks

Backup Slides

Pre-Processing (step 1) [Diagram: Analog Speech -> Pre-Processing -> Windowed PDS Frames [1, 2, … , N]]

Pre-Processing module
LPF: anti-aliasing filter to avoid aliasing during sampling, passing [0, Fs/2]. Analog Speech -> Band Limited Analog Speech
A/D: analog to digital converter with a sampling frequency (Fs) of 10–16 kHz. -> Digital Speech
First Order FIR: low order digital system to spectrally flatten the signal (in favor of vocal tract parameters) and make it less susceptible to later finite precision effects. -> Pre-emphasized Digital Speech (PDS)
Frame Blocking: frame blocking of the sampled signal. Each frame is of N samples, overlapped with N-M samples of the previous frame. Frame rate ~ 100 frames/sec. N values: [200,300], M values: [100,200]. -> PDS Frames
Frame Windowing: using Hamming (or Hanning or Blackman) windowing in order to minimize the signal discontinuities at the beginning and end of each frame. -> Windowed PDS Frames
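A short C sketch of the last three stages. The frame constants are the ones from the Tested System slide; the pre-emphasis coefficient 0.95 is a typical textbook value, not taken from the slides.

#define N_FRAME 256   /* samples per frame (tested system)        */
#define M_SHIFT 128   /* frame offset; overlap = N_FRAME - M_SHIFT */

/* First Order FIR pre-emphasis: y[n] = x[n] - a * x[n-1]. */
void pre_emphasis(const short *x, float *y, int n, float a)
{
    y[0] = (float)x[0];
    for (int i = 1; i < n; i++)
        y[i] = (float)x[i] - a * (float)x[i - 1];
}

/* Frame blocking + Hamming windowing: frame f starts at f * M_SHIFT,
 * so consecutive frames share N_FRAME - M_SHIFT samples. */
void window_frame(const float *pds, int f, const float *hamming, float *out)
{
    const float *frame = pds + f * M_SHIFT;
    for (int i = 0; i < N_FRAME; i++)
        out[i] = frame[i] * hamming[i];
}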

Feature Extraction (step 2) [Diagram: Windowed PDS Frames [1, 2, … , N] -> Feature Extraction -> Set of Feature Vectors [1, 2, … , K]] Extracting the features of speech from each frame and representing them in a vector (feature vector).

MFCC – Mel-frequency Wrapping Psychophysical studies have shown that human perception of the frequency contents of sounds for speech signals does not follow a linear scale. Thus for each tone with an actual frequency, f, measured in Hz, a subjective pitch is measured on a scale called the 'mel' scale. The mel-frequency scale is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. Therefore we can use the following approximate formula to compute the mels for a given frequency f in Hz: mel(f) = 2595 * log10(1 + f/700)

MFCC – Filter Bank One way of simulating the subjective spectrum is to use a filter bank, spaced uniformly on the mel scale. The filter bank has a triangular bandpass frequency response, and the spacing as well as the bandwidth is determined by a constant mel frequency interval.
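A sketch of how such mel-spaced triangular filters can be placed. hz2mel uses the standard approximation quoted on the previous slide; the project's actual melbank.asm table may be derived differently.

#include <math.h>

static float hz2mel(float f) { return 2595.0f * log10f(1.0f + f / 700.0f); }
static float mel2hz(float m) { return 700.0f * (powf(10.0f, m / 2595.0f) - 1.0f); }

/* Lower edge, center and upper edge (in Hz) of triangular filter k,
 * k = 0 .. n_filt-1, spaced uniformly on the mel scale over [0, fs/2]
 * (fs = 11025 Hz in the tested system). */
void mel_filter_edges(float fs, int n_filt, int k,
                      float *lo, float *center, float *hi)
{
    float step = hz2mel(fs / 2.0f) / (n_filt + 1);
    *lo     = mel2hz(k * step);
    *center = mel2hz((k + 1) * step);
    *hi     = mel2hz((k + 2) * step);
}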

MFCC – Cepstrum Here we convert the log mel spectrum back to time. The result is called the mel frequency cepstrum coefficients (MFCC). Because the mel spectrum coefficients are real numbers, we can convert them to the time domain using the Discrete Cosine Transform (DCT) and get a feature vector.
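A plain-C sketch of this step as an unnormalized DCT-II; on the DSP the project presumably performs it with its rdct.asm routine instead.

#include <math.h>

/* DCT-II of the log mel energies -> MFCC feature vector. */
void log_mel_to_mfcc(const float *logmel, int n_filt,
                     float *mfcc, int n_coef)
{
    const float pi = 3.14159265f;
    for (int k = 0; k < n_coef; k++) {   /* n_coef = 18 in the tested system */
        float s = 0.0f;
        for (int j = 0; j < n_filt; j++)
            s += logmel[j] * cosf(pi * k * (j + 0.5f) / n_filt);
        mfcc[k] = s;
    }
}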

Pattern Matching Modeling (step 3) The pattern matching modeling technique is divided into two sections: the enrolment part, in which we build the reference model of the speaker, and the verification (matching) part, where users are compared to this model.

Enrollment part – Modeling [Diagram: Set of Feature Vectors [1, 2, … , K] -> Modeling -> Speaker Model] This part is done outside the DSP; the DSP receives only the speaker model (calculated offline in a host).

Pattern Matching [Diagram: Set of Feature Vectors [1, 2, … , K] + Speaker Model -> Pattern Matching -> Matching Rate]

Decision Module (Optional) In VQ the decision is based on checking if the distortion rate is higher than a preset threshold: if distortion rate > t, Output = Yes, else Output = No. In HMM the decision is based on checking if the probability score is higher than a preset threshold: if probability score > t, Output = Yes, else Output = No.

The Voice Database
Two reference models were generated (one male and one female); each model was trained in 3 different ways:
repeating the same sentence for 15 seconds
repeating the same sentence for 40 seconds
reading random text for one minute
The voice database is composed of 10 different speakers (5 males and 5 females); each speaker was recorded in 3 ways:
repeating the reference sentence once (5 seconds)
repeating the reference sentence 3 times (15 seconds)
speaking a random sentence for 5 seconds

Experiment Description Cont. Conclusions: A window size of 330 with an offset of 110 samples performs better than a window size of 256 with an offset of 128 samples.

Experiment Description Cont. Conclusions: A feature vector of 18 coefficients performs better than a feature vector of 12 coefficients.

Experiment Description Cont. Conclusions:
Worst combinations: 5 seconds of a fixed sentence for testing with an enrolment of 15 seconds of the same sentence, or with an enrolment of 40 seconds of the same sentence.
Best combinations: 15 seconds of a fixed sentence for testing with an enrolment of 40 seconds of the same sentence; 5 seconds of a random sentence with an enrolment of 60 seconds of random sentences.

Experiment Description Cont. The Best Results: [results figure not included in the transcript]

Additional verification results The DSP results were compared to the Matlab simulation. We chose random speakers from the speakers DB with one reference codebook. For example:

Person   MATLAB            DSP
Alex     69.58% (0.3627)   69.58% (0.3627)
Sari     61.66% (0.4835)   61.66% (0.4835)
Roee     49.97% (0.6938)   49.97% (0.6938)
Eran     54.75% (0.6023)   54.75% (0.6023)
Hila     55.72% (0.5849)   55.72% (0.5849)