2001/03/29, Chin-Kai Wu, CS, NTHU
Slide 1: Speech and Language Technologies for Audio Indexing and Retrieval
John Makhoul, Fellow, IEEE, Francis Kubala, Timothy Leek, Daben Liu, Long Nguyen, Richard Schwartz, and Amit Srivastava, Member, IEEE
Proceedings of the IEEE, vol. 88, no. 8, August 2000

Slide 2: Outline
- Introduction
- Indexing and Browsing with Rough’n’Ready
  - Rough’n’Ready System
  - Indexing and Browsing
- Statistical Modeling Paradigm
- Speech Recognition
- Speaker Recognition
  - Segmentation
  - Clustering
  - Identification

Slide 3: Introduction
- Much information will be in the form of speech from various sources.
- It is now possible to start building automatic content-based indexing and retrieval tools.
- The Rough’n’Ready system provides a rough transcription of the speech that is ready for browsing.
- The technologies incorporated in the system include speech and speaker recognition, name spotting, topic classification, story segmentation, and information retrieval.

Slide 4: Rough’n’Ready System
[Figure: system diagram. Recoverable labels: ActiveX controls; MP3; Dual P733-MHz; Collect/Manage Archive; Interact with browser.]

Slide 5: Indexing and Browsing

Slide 6: Indexing and Browsing (Cont’d)
[Figure labels: Speaker, People, Place, Organization, Topic Labels]

Slide 7: Indexing and Browsing (Cont’d)
- Selected from over 5500 topic labels

Slide 8: Statistical Modeling Paradigm
- Maximize P(output | input, model) to obtain the desired recognized sequence of the data
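The maximization on this slide can be written out as a formula. The symbols below (O for the output sequence, I for the input data, M for the model, W for a word sequence, X for the acoustic feature vectors) are notation assumed here for illustration; they do not appear on the slide. The second line is the standard Bayes-rule decomposition used in speech recognition, which is what separates the acoustic model P(X | W) from the language model P(W).

```latex
\hat{O} = \arg\max_{O} \; P(O \mid I, M)

\hat{W} = \arg\max_{W} \; P(W \mid X) = \arg\max_{W} \; P(X \mid W)\, P(W)
```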

Slide 9: Speech Recognition
- Statistical models: acoustic models and language models
- Acoustic model
  - Describes the time-varying evolution of feature vectors for each sound or phoneme
  - Employs hidden Markov models (HMMs)
  - Gaussian mixtures model the feature vectors for each HMM state
  - Special acoustic models for nonspeech events: music, silence/noise, laughter, breath, and lip-smack
- Language model: N-gram language model
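As a rough illustration of what an N-gram language model estimates, here is a minimal bigram (N = 2) sketch using unsmoothed relative frequencies. The function names, the `<s>` and `</s>` boundary markers, and the toy sentences are assumptions made for this example; the slides do not say which N or which smoothing the system uses, and a practical recognizer would need smoothing for unseen word pairs.

```python
from collections import Counter


def train_bigram(sentences):
    """Count histories and bigrams over tokenized sentences (lists of words)."""
    history_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]  # hypothetical boundary markers
        history_counts.update(padded[:-1])
        bigram_counts.update(zip(padded[:-1], padded[1:]))
    return history_counts, bigram_counts


def bigram_prob(history_counts, bigram_counts, prev, word):
    """Unsmoothed estimate of P(word | prev); 0.0 if prev was never seen."""
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / history_counts[prev]


if __name__ == "__main__":
    hist, bi = train_bigram([["the", "news"], ["the", "weather"]])
    print(bigram_prob(hist, bi, "the", "news"))
```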

Slide 10: Speech Recognition (Cont’d)
- Multipass recognition search strategy
  - Fast-match pass
    - Narrows the search space
    - Followed by other passes that use more accurate models and operate on the smaller search space
  - Backward pass
    - Generates the top-scoring N-best word sequences (100 <= N <= 300)
  - N-best rescoring pass: Tree Rescoring algorithm
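The idea behind rescoring an N-best list can be sketched as follows: each hypothesis carries scores from the available models, and a later pass re-ranks the short list with a combined score. This is a plain linear re-ranking, not the Tree Rescoring algorithm named on the slide, and the tuple layout and the `lm_weight` parameter are assumptions made for illustration.

```python
def rescore_nbest(hypotheses, lm_weight=1.0):
    """Return hypotheses sorted best-first by combined log score.

    hypotheses: list of (text, acoustic_logprob, lm_logprob) tuples.
    lm_weight: hypothetical scaling factor applied to the language model score.
    """
    return sorted(
        hypotheses,
        key=lambda h: h[1] + lm_weight * h[2],
        reverse=True,
    )


if __name__ == "__main__":
    nbest = [
        ("recognize speech", -10.0, -2.0),
        ("wreck a nice beach", -9.5, -6.0),
    ]
    print(rescore_nbest(nbest)[0][0])
```

Because the list holds only a few hundred hypotheses, the rescoring pass can afford models that would be too expensive to apply over the full search space.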

Slide 11: Speech Recognition (Cont’d)
- Speedup algorithms
  - Fast Gaussian Computation (FGC)
  - Grammar Spreading
  - N-Best Tree Rescoring
- Word error rate (PII 450-MHz processor, 60000-word vocabulary)

  Speed (x real time)   Word error rate
  3 x RT                21.4%
  10 x RT               17.5%
  230 x RT              14.8%
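Word error rate is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below follows that conventional definition; the function name and the whitespace tokenization are assumptions, and the slides do not describe the scoring tool actually used.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("a b c d", "a x c"))
```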

Slide 12: Speaker Recognition
- Speaker segmentation
  - Segregates audio streams based on the speaker
- Speaker clustering
  - Groups together audio segments that are from the same speaker
- Speaker identification
  - Recognizes those speakers of interest whose voices are known to the system

Slide 13: Speaker Segmentation
- Two-stage approach to speaker change detection
  - First: detects speech/nonspeech boundaries
  - Second: performs the actual speaker segmentation within the speech segments
- First stage
  - Collapses the phonemes into three broad classes (vowels, fricatives, and obstruents)
  - Includes five nonspeech models (music, silence/noise, laughter, breath, and lip-smack)
  - 5-state HMMs
  - Detection is reliable over 90% of the time

Slide 14: Speaker Segmentation (Cont’d)
- Second stage
  - Hypothesizes a speaker change boundary at every phone boundary located in the first stage
  - The speaker change decision takes the form of a likelihood ratio (λ) test

  Region             Test
  Nonspeech region   λ <= t      vs.  λ > t
  Speech region      λ <= t + α  vs.  λ > t + α

  Outcome labels on the slide: "Same speaker", "otherwise"
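The region-dependent threshold can be sketched as a small decision function. Two things here are assumptions, not facts taken from the slide: the transcript does not preserve which side of the threshold maps to "same speaker", so the sketch assumes that λ is large when the two adjacent segments are well explained by one shared model (larger λ meaning same speaker); and the names `threshold` and `bias` stand in for t and α.

```python
def same_speaker(likelihood_ratio, threshold, in_speech_region, bias=0.0):
    """Decide whether two adjacent segments share a speaker.

    ASSUMPTION: a ratio above the effective threshold means "same speaker".
    The effective threshold is t in a nonspeech region and t + alpha
    (threshold + bias) in a speech region.
    """
    effective = threshold + bias if in_speech_region else threshold
    return likelihood_ratio > effective


if __name__ == "__main__":
    # Same ratio, different regions: the bias changes the outcome.
    print(same_speaker(0.5, 0.4, in_speech_region=False, bias=0.2))
    print(same_speaker(0.5, 0.4, in_speech_region=True, bias=0.2))
```

If the intended direction is the opposite one, only the comparison operator changes; the region-dependent threshold stays the same.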

Slide 15: Speaker Clustering
- The likelihood ratio test is used repeatedly to group the cluster pairs that are deemed most similar, until all segments are grouped into one cluster and a complete cluster tree is generated
- The cut of the tree that is optimal under a criterion is then found
  [Criterion formula not preserved in the transcript. Recoverable annotations:]
  - K: number of clusters for any particular cut of the tree
  - N_j: number of feature vectors in cluster j
  - Log of the determinant of the within-cluster dispersion matrix
  - Compensation for the previous term

Slide 16: Speaker Clustering (Cont’d)
- The algorithm performs well regardless of the true number of speakers, producing clusters of high purity
- Purity is defined as the percentage of frames that are correctly clustered, and was measured at 95.8%
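One common way to make "correctly clustered" concrete is to treat the most frequent true speaker in each cluster as that cluster's label and count the frames that match it. The sketch below uses that reading, which is an assumption; the slide does not spell out how a frame is judged correct, and the function name and inputs are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, speaker_ids):
    """Fraction of frames whose true speaker is the majority speaker of their cluster."""
    if len(cluster_ids) != len(speaker_ids):
        raise ValueError("inputs must have one entry per frame")
    if not cluster_ids:
        raise ValueError("at least one frame is required")

    # Tally true speakers within each cluster.
    speakers_per_cluster = defaultdict(Counter)
    for cluster, speaker in zip(cluster_ids, speaker_ids):
        speakers_per_cluster[cluster][speaker] += 1

    correct = sum(max(c.values()) for c in speakers_per_cluster.values())
    return correct / len(cluster_ids)


if __name__ == "__main__":
    print(cluster_purity([0, 0, 0, 1, 1], ["a", "a", "b", "b", "b"]))
```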

Slide 17: Speaker Identification
- Every speaker cluster created in the speaker clustering stage is identified by gender
- The gender of a speaker segment is determined by computing the log likelihood ratio between the male and female models
- This approach has resulted in a 2.3% error in gender detection
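A log likelihood ratio decision between two models can be illustrated with a toy example. Everything specific below is assumed for illustration: each gender is modeled by a single one-dimensional Gaussian over a scalar feature, the parameter values are made up, and the decision threshold is zero. The slide does not describe the form of the system's male and female models or its features.

```python
import math


def gaussian_loglik(x, mean, var):
    """Log density of a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def gender_llr(features, male, female):
    """Sum over frames of log p(x | male) - log p(x | female).

    male and female are hypothetical (mean, variance) pairs.
    """
    return sum(
        gaussian_loglik(x, *male) - gaussian_loglik(x, *female)
        for x in features
    )


def classify_gender(features, male, female):
    """Pick the model with the higher total log likelihood."""
    return "male" if gender_llr(features, male, female) > 0 else "female"


if __name__ == "__main__":
    male_model = (120.0, 400.0)    # made-up parameters
    female_model = (210.0, 400.0)  # made-up parameters
    print(classify_gender([115.0, 130.0, 125.0], male_model, female_model))
```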

Slide 18: Speaker Identification (Cont’d)
- In the DARPA Broadcast News corpus, 20% of the speaker segments are from 20 known speakers
- This is an open-set problem: the data contains both known and unknown speakers, and the system has to determine the identity of the known-speaker segments and reject the unknown-speaker segments

Slide 19: Speaker Identification (Cont’d)
- The system produced three types of errors:
  - False identification rate of 0.1%, where a known-speaker segment was mistaken for another known speaker
  - False rejection rate of 3.0%, where a known-speaker segment was classified as unknown
  - False acceptance rate of 0.8%, where an unknown-speaker segment was classified as coming from one of the known speakers
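The three error types can be tallied from per-segment decisions as in the sketch below. The denominators are an assumption: false identification and false rejection are divided by the number of known-speaker segments, and false acceptance by the number of unknown-speaker segments. The slide does not state how the reported rates were normalized, and the function name and the use of `None` for "unknown" are hypothetical.

```python
def open_set_error_rates(results):
    """Compute open-set speaker identification error rates.

    results: list of (true_speaker, predicted_speaker) pairs, where None
    means "unknown speaker" (true) or "rejected as unknown" (predicted).
    """
    known = [(t, p) for t, p in results if t is not None]
    unknown = [(t, p) for t, p in results if t is None]

    false_id = sum(1 for t, p in known if p is not None and p != t)
    false_rej = sum(1 for t, p in known if p is None)
    false_acc = sum(1 for t, p in unknown if p is not None)

    def rate(count, total):
        return count / total if total else 0.0

    return {
        "false_identification": rate(false_id, len(known)),
        "false_rejection": rate(false_rej, len(known)),
        "false_acceptance": rate(false_acc, len(unknown)),
    }


if __name__ == "__main__":
    decisions = [
        ("A", "A"), ("A", "B"), ("B", None), ("B", "B"),
        (None, None), (None, "A"),
    ]
    print(open_set_error_rates(decisions))
```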
