A Phonotactic-Semantic Paradigm for Automatic Spoken Document Classification Bin MA and Haizhou LI Institute for Infocomm Research Singapore.

Presentation transcript:

A Phonotactic-Semantic Paradigm for Automatic Spoken Document Classification Bin MA and Haizhou LI Institute for Infocomm Research Singapore

2 ACM SIGIR August 15-19, 2005 Bin MA Agenda Spoken Document Classification & Related Works Phonotactic-semantic Approach Voice Tokenization with Acoustic Words Bag-of-Sounds Representation Language Identification Classifiers with SVM and LSA Conclusion

Spoken Document Classification & Related Works
Spoken Document Retrieval (SDR) is the task of retrieving excerpts from a large collection of spoken documents based on a user’s request.
– Automatic spoken document classification (SDC) is an important topic in SDR;
– Conventionally approached by integrating automatic speech recognition (ASR) technologies and text information retrieval (IR).
Most SDC efforts so far have been devoted to two paradigms:
– lexical-semantic
– n-gram phonotactic

Spoken Document Classification & Related Works
lexical-semantic
– Convert the spoken documents into text transcripts of lexical words;
– The transcripts are typically generated from a large vocabulary continuous speech recognizer (LVCSR);
– Text categorization (TC) techniques are then applied to the automatic transcripts to derive semantic classes.
Homophone; Out-of-Vocabulary (OOV); Multilinguality
The major limitation is its lexical choice.

Spoken Document Classification & Related Works
n-gram phonotactic
– Use n-gram phonotactics, i.e. the rules governing the sequences of allowable phonemes, instead of lexical words to represent the lexical constraints that are imposed by semantic domains;
– Enhance robustness against speech recognition errors.
Semantic Abstraction; Multilinguality
Its major shortcoming is that it does not exploit the global phonotactics in the larger context of a spoken document.

Phonotactic-semantic Approach
Spoken document classification (SDC) is more complex than text categorization (TC).
– In TC, we usually derive the lexical vocabulary from the running text.
– For spoken documents, an additional tokenization step is needed to convert the sound wave into a sequence of phonetic units, such as words or phonemes.
Two issues:
– the definition of the tokenization unit, and
– the choice of vocabulary.

Phonotactic-semantic Approach
Definition of tokenization unit
– Traditionally, the lexical words or phonemes of a specific language are used.
– We propose to use a set of universal acoustic words (AWs): language-independent, self-organized, phoneme-like units.
– We treat the documents in all languages equally with the same set of AWs.
– AWs can be learned from a multilingual training corpus using a data-driven approach.

Phonotactic-semantic Approach
Choice of vocabulary
– Use the bag-of-sounds statistics over AWs, instead of bag-of-words over lexical words, to derive high-level semantic characteristics from a spoken document.
– The bag-of-sounds concept is analogous to the bag-of-words paradigm originally formulated in the context of information retrieval (IR) and text categorization (TC).
– A spoken document is then represented by a high-dimensional vector derived from the statistics of term frequency.

Phonotactic-semantic Approach
Three fundamental components for SDC
– A voice tokenizer, i.e. a speech recognizer front-end which segments a spoken document into acoustic tokens;
– A statistical language model which captures statistics of semantic domain information;
– A classifier which categorizes a spoken document using the statistical language model.

Voice Tokenization with Acoustic Words
[Figure labels: word, phoneme, frame]

Voice Tokenization – Acoustic segment modeling (ASM)
Segment an utterance into Q consecutive segments in a maximum likelihood manner
– minimizing an overall distortion with dynamic programming;
Cluster all segments into T classes with k-means algorithm
– speech segments in the same class are acoustically similar;
Train one HMM for each class
– establish T acoustic segment models to represent the overall acoustic space of all languages.
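The segmentation step of ASM can be illustrated with a minimal sketch. It assumes one scalar feature per frame and squared Euclidean distortion around each segment mean; the slides do not state the feature type or the distortion measure, so both are assumptions, and the function names are hypothetical. The k-means and HMM training steps are not shown.

```python
from typing import List, Sequence


def segment_cost(prefix: List[float], prefix_sq: List[float], i: int, j: int) -> float:
    """Squared distortion of frames i..j-1 around their mean (assumed measure)."""
    n = j - i
    s = prefix[j] - prefix[i]
    sq = prefix_sq[j] - prefix_sq[i]
    return sq - s * s / n


def dp_segment(frames: Sequence[float], q: int) -> List[int]:
    """Return the q-1 interior boundaries that minimize the total distortion."""
    n = len(frames)
    if not 1 <= q <= n:
        raise ValueError("q must be between 1 and the number of frames")
    # Prefix sums make every segment cost an O(1) lookup.
    prefix, prefix_sq = [0.0], [0.0]
    for x in frames:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)
    inf = float("inf")
    # best[k][j]: minimal distortion when frames[:j] is split into k segments.
    best = [[inf] * (n + 1) for _ in range(q + 1)]
    back = [[0] * (n + 1) for _ in range(q + 1)]
    best[0][0] = 0.0
    for k in range(1, q + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                cost = best[k - 1][i] + segment_cost(prefix, prefix_sq, i, j)
                if cost < best[k][j]:
                    best[k][j] = cost
                    back[k][j] = i
    # Trace back the start index of each segment, last segment first.
    starts, j = [], n
    for k in range(q, 0, -1):
        j = back[k][j]
        starts.append(j)
    return sorted(starts)[1:]  # drop the leading 0


boundaries = dp_segment([0, 0, 0, 5, 5, 5, 9, 9], q=3)
```

In a fuller pipeline, each resulting segment would be summarized (for example by its mean vector) before clustering into T classes.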

Voice Tokenization – Phonetically-bootstrapped ASM
Add phonetic constraints in segmentation
– use a large amount of labeled speech data from a few well-studied languages;
– train language-specific phone models;
– choose some models to form a set of T models for bootstrapping;
Phonetically label the multilingual training utterances
– use the T models to decode all training utterances;
– keep the recognized sequences as “true” labels;
Re-train models
– force-align and segment all utterances based on the “true” labels;
– group all speech segments of a specific label into a class;
– use these segments to re-train an HMM.

Bag-of-Sounds Representation
Bag-of-sounds is analogous to the bag-of-words;
AWs in the vocabulary with T acoustic tokens;
A spoken document is described as a count vector of AWs: each element represents the count of one AW, and the dimension is the AW vocabulary size W.
Capture local phonotactics with lexical constraints;
Capture global phonotactics with co-occurrences of AWs;
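A count vector of this kind can be sketched as follows. The sketch assumes, purely for illustration, that the AW vocabulary consists of all unigrams and bigrams of the T acoustic tokens (so W = T + T²); the transcript does not give the vocabulary definition, and the function names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

AcousticWord = Tuple[str, ...]


def build_vocabulary(tokens: Sequence[str]) -> Dict[AcousticWord, int]:
    """Index every token unigram and bigram (assumed AW inventory)."""
    unigrams = [(t,) for t in tokens]
    bigrams = list(product(tokens, repeat=2))
    return {aw: i for i, aw in enumerate(unigrams + bigrams)}


def bag_of_sounds(sequence: Sequence[str], vocab: Dict[AcousticWord, int]) -> List[int]:
    """Count how often each AW occurs in a tokenized spoken document."""
    vector = [0] * len(vocab)
    for token in sequence:
        vector[vocab[(token,)]] += 1
    for left, right in zip(sequence, sequence[1:]):
        vector[vocab[(left, right)]] += 1
    return vector


vocab = build_vocabulary(["a", "b", "c"])          # T = 3, W = 12
vector = bag_of_sounds(["a", "b", "a", "b"], vocab)
```

Under this assumption, the bigram entries carry the local token ordering, while the vector as a whole records which AWs co-occur in the document.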

Language Identification
National Institute of Standards and Technology (NIST) 1996 Language Recognition Evaluation (LRE) database.
12 languages: Arabic, English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, and Vietnamese.
Linguistic Data Consortium (LDC) Callfriend corpus as the training data.
– 40 30-minute conversations;
– 12, second training sessions for each language
second speech sessions from 1996 NIST LRE database as the test data.

Language Identification
[Figure: spoken utterance; Universal VT; LM-1: English, LM-2: Chinese, ..., LM-L: French; Language Classifier; hypothesized language]
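One way to read this diagram is that a single tokenizer feeds every language model and the classifier compares their scores. The transcript keeps only the box labels, so the sketch below is an assumption: it uses add-one-smoothed bigram models over token IDs and picks the language with the highest log-likelihood. The class and function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Sequence


class BigramLM:
    """Add-one-smoothed bigram model over acoustic token IDs (illustrative)."""

    def __init__(self, sequences: Iterable[Sequence[int]], vocab_size: int) -> None:
        self.vocab_size = vocab_size
        self.bigrams: Counter = Counter()
        self.contexts: Counter = Counter()
        for seq in sequences:
            for left, right in zip(seq, seq[1:]):
                self.bigrams[(left, right)] += 1
                self.contexts[left] += 1

    def log_prob(self, seq: Sequence[int]) -> float:
        """Sum of smoothed bigram log-probabilities over the sequence."""
        total = 0.0
        for left, right in zip(seq, seq[1:]):
            numerator = self.bigrams[(left, right)] + 1
            denominator = self.contexts[left] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def identify(seq: Sequence[int], models: Dict[str, BigramLM]) -> str:
    """Return the language whose model scores the token sequence highest."""
    return max(models, key=lambda lang: models[lang].log_prob(seq))


models = {
    "English": BigramLM([[1, 2, 1, 2, 1, 2]], vocab_size=3),
    "French": BigramLM([[1, 3, 1, 3, 1, 3]], vocab_size=3),
}
hypothesis = identify([1, 2, 1, 2], models)
```

The training sequences here are toy data, not drawn from the corpora named in the slides.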

SVM Classifier with Feature Extraction
SVM light V6.01 from
Work with a linear kernel SVM;
Feature dimension
L*(L-1)/2 pair-wise binary SVMs
The class that gains most of the winning votes takes all.
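The pair-wise voting scheme can be sketched independently of any SVM package. In the sketch below, `pairwise_winner` is a hypothetical callable standing in for a trained binary classifier's decision on one document; the tie-breaking rule (earliest class in the list) is also an assumption, since the slide does not specify one.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Sequence


def one_vs_one_vote(
    classes: Sequence[str],
    pairwise_winner: Callable[[str, str], str],
) -> str:
    """Run all L*(L-1)/2 pair-wise decisions and return the most-voted class."""
    votes: Counter = Counter()
    for first, second in combinations(classes, 2):
        votes[pairwise_winner(first, second)] += 1
    # max() keeps the earliest class on ties (assumed tie-breaking rule).
    return max(classes, key=lambda c: votes[c])


# Toy stand-in for trained binary SVMs: the class with the larger score wins.
scores = {"English": 0.2, "French": 0.9, "Hindi": 0.5, "Tamil": 0.1}
languages = list(scores)
winner = one_vs_one_vote(languages, lambda a, b: a if scores[a] >= scores[b] else b)
num_pairs = len(list(combinations(languages, 2)))  # L*(L-1)/2 with L = 4
```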

SVM Classifier with Feature Extraction
Count-trimming (CT)
– AWs that have very low frequency;
– AWs that occur in too few documents.
Mutual Information (MI)
– Class membership
– Particular AW’s presence
– MI indicates the contribution to semantic classification from an AW’s presence.
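Both selection criteria can be sketched on toy data. The thresholds `min_count` and `min_docs` are hypothetical parameters, and the MI shown is the standard mutual information between class membership and the binary presence of one AW, which may differ in detail from the estimator the authors used.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set


def count_trim(docs: Sequence[Dict[str, int]], min_count: int, min_docs: int) -> Set[str]:
    """Keep AWs that are frequent enough overall and occur in enough documents."""
    total: Counter = Counter()
    doc_freq: Counter = Counter()
    for doc in docs:
        total.update(doc)
        doc_freq.update(aw for aw, count in doc.items() if count > 0)
    return {aw for aw in total if total[aw] >= min_count and doc_freq[aw] >= min_docs}


def mutual_information(docs: Sequence[Dict[str, int]], labels: Sequence[str], aw: str) -> float:
    """MI in bits between class membership and the presence of one AW."""
    n = len(docs)
    present: List[bool] = [doc.get(aw, 0) > 0 for doc in docs]
    joint = Counter(zip(labels, present))
    class_counts = Counter(labels)
    presence_counts = Counter(present)
    mi = 0.0
    for (label, flag), count in joint.items():
        ratio = count * n / (class_counts[label] * presence_counts[flag])
        mi += (count / n) * math.log2(ratio)
    return mi


docs = [{"x": 3, "z": 1}, {"x": 1}, {"y": 2, "z": 1}, {"y": 4}]
labels = ["en", "en", "fr", "fr"]
kept = count_trim(docs, min_count=3, min_docs=2)
mi_x = mutual_information(docs, labels, "x")  # "x" separates the two classes
mi_z = mutual_information(docs, labels, "z")  # "z" is spread evenly over classes
```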

SVM Classifier with Feature Extraction
Separation Margin (SM)
– SVM with a linear kernel
– Margin is inversely proportional to $\|a\|$
– Features with higher $|a_j|$ are more influential in determining the width of the separation margin.
Feature Weighting
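For reference, the textbook linear-SVM relations behind the separation-margin criterion are given below. Only $a_j$ appears in the transcript; the symbols $b$, $\alpha_i$, $y_i$ and $x_i$ follow the usual convention and are assumptions, not necessarily the slide's notation.

```latex
\begin{aligned}
f(x) &= a^{\top} x + b, \qquad a = \sum_{i} \alpha_i \, y_i \, x_i, \\
\text{margin} &= \frac{2}{\lVert a \rVert}, \qquad \lVert a \rVert^{2} = \sum_{j} a_j^{2}.
\end{aligned}
```

Because each $a_j^{2}$ adds to $\lVert a \rVert^{2}$, a feature with a large $|a_j|$ has a large effect on the margin width, which is why $|a_j|$ can serve as a ranking score for feature $j$.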

SVM Classifier with Feature Extraction
[Figure: SLID error rate comparison among three feature selection techniques]

SVM Classifier with Feature Extraction
[Figure: Effect of training corpus size]

LSA Classifier with SVD
Latent Semantic Analysis (LSA)
Singular Value Decomposition (SVD)
– Build the term-document matrix;
– Apply SVD to the term-document matrix;
– Retain the top Q singular values in matrix S
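The truncation step can be sketched with NumPy. The layout (rows are AWs, columns are documents), the variable names, and the folding-in helper are assumptions for illustration; the slide's own matrix notation was not preserved apart from the matrix S.

```python
import numpy as np


def lsa_project(term_doc: np.ndarray, q: int):
    """Truncated SVD of a term-document matrix, keeping the top q singular values."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    u_q, s_q, vt_q = u[:, :q], s[:q], vt[:q, :]
    # One row per document: its coordinates in the q-dimensional latent space.
    doc_coords = (s_q[:, None] * vt_q).T
    return u_q, s_q, doc_coords


def fold_in(u_q: np.ndarray, doc_vector: np.ndarray) -> np.ndarray:
    """Project an unseen count vector into the same latent space."""
    return doc_vector @ u_q


# Toy matrix: 4 AWs (rows) by 3 documents (columns).
term_doc = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 3.0, 1.0],
    [0.0, 1.0, 2.0],
])
u_q, s_q, doc_coords = lsa_project(term_doc, q=2)
projected = fold_in(u_q, term_doc[:, 1])
```

Folding a training document back in reproduces its row of `doc_coords`, which is a quick consistency check on the projection.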

LSA Classifier with SVD
LSA Classifier I – k-nearest neighbor
LSA Classifier II – mixture modeling
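A k-nearest-neighbor classifier over latent-space vectors can be sketched as follows. Cosine similarity and majority voting are assumptions; the transcript names only the k-nearest-neighbor approach, and the mixture-modeling variant is not shown.

```python
import math
from collections import Counter
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(
    query: Sequence[float],
    train_vectors: List[Sequence[float]],
    train_labels: List[str],
    k: int = 3,
) -> str:
    """Label the query by majority vote among its k most similar training vectors."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


train_vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
train_labels = ["en", "en", "fr", "fr"]
predicted = knn_classify((0.8, 0.2), train_vectors, train_labels, k=3)
```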

LSA Classifier with SVD
[Figure: Effect of Mixture Number M (LSAC-II)]

LSA Classifier with SVD
[Table: Effect of training data size in LSAC-I & SVMC. Columns: #M = 1,000; 2,000; 6,000; 12,000. Rows: LSAC-I Error (%); SVMC Error (%). Values not preserved.]
[Table: Benchmark of different models. Columns: P-PRLM; P-PRLM & Score Fusion; LSAC-II; SVMC. Row: Error (%). Values not preserved.]

Conclusion
Non-lexical approach to spoken document tokenization
– Universal acoustic words (AWs): language-independent, self-organized, phoneme-like units;
– Data-driven approach to learn from a multilingual training corpus.
Phonotactic-semantic paradigm to model
– Local phonotactics in an acoustic word (AW);
– Global phonotactics in a bag-of-sounds vector.

Thank you!