Presentation transcript:

Slide 1: Speech and Language Technologies for Audio Indexing and Retrieval
John Makhoul, Fellow, IEEE; Francis Kubala; Timothy Leek; Daben Liu; Long Nguyen; Richard Schwartz; and Amit Srivastava, Member, IEEE
Proceedings of the IEEE, Vol. 88, No. 8, August 2000
Presented by Chin-Kai Wu, CS, NTHU, 2001/03/29

Slide 2: Outline
- Introduction
- Indexing and Browsing with Rough'n'Ready
  - Rough'n'Ready System
  - Indexing and Browsing
- Statistical Modeling Paradigm
- Speech Recognition
- Speaker Recognition
  - Segmentation
  - Clustering
  - Identification

Slide 3: Introduction
- Much of the available information will be in the form of speech from various sources.
- It is now possible to start building automatic content-based indexing and retrieval tools.
- The Rough'n'Ready system provides a rough transcription of the speech that is ready for browsing.
- The technologies incorporated in the system include speech and speaker recognition, name spotting, topic classification, story segmentation, and information retrieval.

Slide 4: Rough'n'Ready System
[System architecture diagram; labels include: ActiveX controls, MP3, dual P733-MHz server, collect/manage archive, interact with browser]

Slide 5: Indexing and Browsing

Slide 6: Indexing and Browsing (cont'd)
[Browser screenshot; index tracks: Speaker, People, Place, Organization, Topic Labels]

Slide 7: Indexing and Browsing (cont'd)
Topic labels are selected from a set of over 5500 topic labels.

Slide 8: Statistical Modeling Paradigm
Maximize P(output | input, model), where the output is the desired recognized sequence for the input data.
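The paradigm reduces to a simple decision rule: among candidate outputs, choose the one with the highest probability given the input and the model. A minimal Python sketch (the hypotheses and log-probabilities below are invented for illustration, not outputs of any real model):

```python
# Minimal sketch of the statistical modeling paradigm: among candidate
# outputs, pick the one that maximizes P(output | input, model).
# The candidates and log-probabilities here are illustrative only.
def best_hypothesis(log_scores):
    """Return the candidate with the highest log P(output | input, model)."""
    return max(log_scores, key=log_scores.get)

candidates = {"hello world": -12.3, "hollow world": -15.1, "hello word": -13.8}
print(best_hypothesis(candidates))  # -> hello world
```

In practice the probability factors into model components (acoustic, language, etc.), as the next slides describe.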

Slide 9: Speech Recognition
- Statistical models: acoustic models and language models
- Acoustic model:
  - Describes the time-varying evolution of feature vectors for each sound (phoneme)
  - Employs hidden Markov models (HMMs)
  - Gaussian mixtures model the feature vectors for each HMM state
  - Special acoustic models for nonspeech events: music, silence/noise, laughter, breath, and lip-smack
- Language model: N-gram language model
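As a concrete illustration of the Gaussian-mixture observation model, here is a minimal Python sketch evaluating the log-likelihood of a one-dimensional feature value under a mixture of Gaussians; real acoustic models use multivariate mixtures per HMM state, and all parameters here are arbitrary:

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian N(mean, var) at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def gmm_loglik(x, weights, means, variances):
    """log sum_k w_k * N(x; mu_k, var_k), computed via log-sum-exp
    for numerical stability."""
    terms = [math.log(w) + gaussian_logpdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

In an HMM acoustic model, one such mixture is attached to each state, and the per-frame log-likelihoods feed the decoder.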

Slide 10: Speech Recognition (cont'd)
Multipass recognition search strategy:
- Fast-match pass: narrows the search space; subsequent passes with more accurate models operate on the smaller search space
- Backward pass: generates the top-scoring N-best word sequences (100 <= N <= 300)
- N-best rescoring pass: Tree Rescoring algorithm
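The idea behind N-best rescoring can be sketched as follows. This is a hypothetical illustration, not the paper's Tree Rescoring algorithm: each hypothesis carries an acoustic and a language-model log score, and a weighted combination reranks the list (the weight values are arbitrary):

```python
def rescore_nbest(hypotheses, lm_weight=8.0, word_penalty=0.0):
    """Rerank N-best hypotheses by a weighted sum of scores.

    Each hypothesis is a tuple (words, acoustic_logprob, lm_logprob).
    lm_weight and word_penalty are tunable; the defaults are arbitrary.
    """
    def total(hyp):
        words, acoustic, lm = hyp
        return acoustic + lm_weight * lm + word_penalty * len(words)
    # Best (highest combined score) hypothesis first.
    return sorted(hypotheses, key=total, reverse=True)
```

Rescoring lets cheap models propose candidates and expensive models pick among them, which is the point of the multipass strategy.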

Slide 11: Speech Recognition (cont'd)
- Speedup algorithms: Fast Gaussian Computation (FGC), Grammar Spreading, N-Best Tree Rescoring
- Word error rate (PII 450-MHz processor, word vocabulary):
  - 3 x RT -> 21.4%
  - 10 x RT -> 17.5%
  - 230 x RT -> 14.8%

Slide 12: Speaker Recognition
- Speaker segmentation: segregates the audio stream based on the speaker
- Speaker clustering: groups together audio segments that are from the same speaker
- Speaker identification: recognizes those speakers of interest whose voices are known to the system

Slide 13: Speaker Segmentation
Two-stage approach to speaker change detection:
- First stage: detects speech/nonspeech boundaries
  - Collapses the phonemes into three broad classes (vowels, fricatives, and obstruents)
  - Includes five nonspeech models (music, silence/noise, laughter, breath, and lip-smack)
  - Uses 5-state HMMs
  - Boundaries are detected reliably over 90% of the time
- Second stage: performs the actual speaker segmentation within the speech segments
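Decoding a broad-class HMM such as the first stage's comes down to Viterbi search. Below is a generic, self-contained Viterbi sketch in Python; the real system's classes, models, and features are far richer than this illustration:

```python
import math

def viterbi(obs_loglik, log_trans, log_init):
    """Find the most likely HMM state path.

    obs_loglik: T x S list of log P(o_t | state s)
    log_trans:  S x S list of log transition probabilities
    log_init:   length-S list of log initial probabilities
    """
    T, S = len(obs_loglik), len(obs_loglik[0])
    dp = [[0.0] * S for _ in range(T)]     # best log score ending in state s at t
    back = [[0] * S for _ in range(T)]     # backpointers
    for s in range(S):
        dp[0][s] = log_init[s] + obs_loglik[0][s]
    for t in range(1, T):
        for s in range(S):
            prev = max(range(S), key=lambda p: dp[t - 1][p] + log_trans[p][s])
            back[t][s] = prev
            dp[t][s] = dp[t - 1][prev] + log_trans[prev][s] + obs_loglik[t][s]
    # Trace back from the best final state.
    path = [max(range(S), key=lambda s: dp[T - 1][s])]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

Given per-frame log-likelihoods for the broad phone classes and the five nonspeech models, the returned state path yields the speech/nonspeech segmentation.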

Slide 14: Speaker Segmentation (cont'd)
Second stage:
- Hypothesizes a speaker change boundary at every phone boundary located in the first stage
- The speaker change decision takes the form of a likelihood ratio (λ) test, with the threshold offset by α in one region type:
  - Nonspeech region: same speaker if λ <= t, otherwise a speaker change
  - Speech region: same speaker if λ <= t + α, otherwise a speaker change
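The likelihood-ratio decision can be sketched with single-Gaussian segment models, a simplification for illustration only (the paper's statistic, features, and thresholds differ): fit one Gaussian to the pooled data and one to each side of the hypothesized boundary, and compare the fits.

```python
import math

def gaussian_nll(samples):
    """Negative log-likelihood of 1-D samples under their own ML Gaussian fit."""
    n = len(samples)
    mean = sum(samples) / n
    var = max(sum((x - mean) ** 2 for x in samples) / n, 1e-6)  # variance floor
    return 0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def change_statistic(left, right):
    """How much better two separate models fit than one pooled model.

    Larger values suggest a speaker change at the boundary.
    """
    return gaussian_nll(left + right) - (gaussian_nll(left) + gaussian_nll(right))

def same_speaker(left, right, t, alpha=0.0):
    # alpha offsets the threshold for one of the two region types.
    return change_statistic(left, right) <= t + alpha
```

The statistic is near zero when both sides look alike and grows when they come from different distributions, which is what the threshold test exploits.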

Slide 15: Speaker Clustering
- The likelihood ratio test is applied repeatedly to merge the cluster pairs that are deemed most similar, until all segments are grouped into one cluster and a complete cluster tree is generated
- The optimal cut of the tree is found by minimizing a criterion involving:
  - K: the number of clusters for a particular cut of the tree
  - N_j: the number of feature vectors in cluster j
  - the log of the determinant of the within-cluster dispersion matrix
  - a compensation (penalty) term for the previous term
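The bottom-up tree construction can be sketched as plain agglomerative clustering. This is an illustration only: one-dimensional data, single-Gaussian cluster models, and a pooling-cost merge criterion standing in for the paper's likelihood ratio.

```python
import math

def fit_nll(xs):
    """Negative log-likelihood of 1-D samples under their ML Gaussian fit."""
    n = len(xs)
    mean = sum(xs) / n
    var = max(sum((x - mean) ** 2 for x in xs) / n, 1e-6)  # variance floor
    return 0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def build_cluster_tree(segments):
    """Greedily merge the most similar pair of clusters until one remains.

    Similarity is the increase in total NLL caused by pooling the pair
    (a stand-in for the likelihood ratio). Returns one clustering per
    level of the tree, from all-separate down to a single cluster.
    """
    clusters = [list(seg) for seg in segments]
    levels = [[c[:] for c in clusters]]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = (fit_nll(clusters[i] + clusters[j])
                        - fit_nll(clusters[i]) - fit_nll(clusters[j]))
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
        levels.append([c[:] for c in clusters])
    return levels
```

The optimal cut would then be chosen by scanning the stored levels and minimizing the slide's criterion over K.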

Slide 16: Speaker Clustering (cont'd)
- The algorithm performs well regardless of the true number of speakers, producing clusters of high purity
- Purity is defined as the percentage of frames that are correctly clustered; it was measured as 95.8%
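Frame purity as defined above can be computed directly. A small sketch, using a toy labeling rather than the paper's data:

```python
from collections import Counter

def purity(frames):
    """frames: list of (cluster_id, true_speaker) pairs, one per frame.

    Purity = fraction of frames whose true speaker matches the majority
    speaker of the cluster the frame was assigned to.
    """
    by_cluster = {}
    for cluster, speaker in frames:
        by_cluster.setdefault(cluster, []).append(speaker)
    # Count, per cluster, the frames belonging to its majority speaker.
    correct = sum(Counter(members).most_common(1)[0][1]
                  for members in by_cluster.values())
    return correct / len(frames)
```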

Slide 17: Speaker Identification
- Every speaker cluster created in the speaker clustering stage is identified by gender
- The gender of a speaker segment is determined by computing the log likelihood ratio between the male and female models
- This approach resulted in a 2.3% error in gender detection
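The gender decision is a log likelihood ratio between a male and a female model. A minimal sketch with 1-D Gaussians over a pitch-like feature; the means and variances are invented for illustration, and the real system's models are trained on acoustic features:

```python
import math

def gaussian_loglik(xs, mean, var):
    """Total log-likelihood of samples under a 1-D Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
               for x in xs)

def classify_gender(features, male=(120.0, 400.0), female=(210.0, 400.0)):
    """Decide by the sign of the male-vs-female log likelihood ratio.

    The (mean, variance) pairs model a pitch-like feature in Hz and are
    hypothetical values chosen for this sketch.
    """
    llr = gaussian_loglik(features, *male) - gaussian_loglik(features, *female)
    return "male" if llr > 0 else "female"
```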

Slide 18: Speaker Identification (cont'd)
- In the DARPA Broadcast News corpus, 20% of the speaker segments come from 20 known speakers
- This is an open-set problem: the data contains both known and unknown speakers, so the system must determine the identity of the known-speaker segments and reject the unknown-speaker segments

Slide 19: Speaker Identification (cont'd)
The system's errors fall into three types:
- False identification rate of 0.1%: a known-speaker segment was mistaken for another known speaker
- False rejection rate of 3.0%: a known-speaker segment was classified as unknown
- False acceptance rate of 0.8%: an unknown-speaker segment was classified as coming from one of the known speakers
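The open-set decision behind these three error types can be sketched as: pick the best-scoring known speaker, but accept the answer only if the score clears a rejection threshold. The names, scores, and threshold here are hypothetical:

```python
def identify_speaker(scores, threshold):
    """Open-set speaker identification decision.

    scores maps each known speaker's name to that speaker's model score
    for the segment. The best-scoring known speaker is accepted only if
    the score clears the rejection threshold; otherwise the segment is
    rejected as coming from an unknown speaker.
    """
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"
```

Raising the threshold trades false acceptances for false rejections, which is exactly the balance the three reported rates reflect.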