A Phonotactic-Semantic Paradigm for Automatic Spoken Document Classification
Bin MA and Haizhou LI
Institute for Infocomm Research, Singapore

ACM SIGIR, August 15-19, 2005, Bin MA

Agenda
–Spoken Document Classification & Related Works
–Phonotactic-semantic Approach
–Voice Tokenization with Acoustic Words
–Bag-of-Sounds Representation
–Language Identification
–Classifiers with SVM and LSA
–Conclusion

Spoken Document Classification & Related Works
Spoken Document Retrieval (SDR) is the task of retrieving excerpts from a large collection of spoken documents based on a user's request.
–Automatic spoken document classification (SDC) is an important topic in SDR;
–It is conventionally approached by integrating automatic speech recognition (ASR) technologies and text information retrieval (IR).
Most SDC efforts so far have been devoted to two paradigms:
–lexical-semantic
–n-gram phonotactic

Spoken Document Classification & Related Works
lexical-semantic
–Convert the spoken documents into text transcripts of lexical words;
–The transcripts are typically generated by a large vocabulary continuous speech recognizer (LVCSR);
–Text categorization (TC) techniques are then applied to the automatic transcripts to derive semantic classes.
Issues:
–Homophone
–Out-of-Vocabulary (OOV)
–Multilinguality
The major limitation is its lexical choice.

Spoken Document Classification & Related Works
n-gram phonotactic
–Use n-gram phonotactics, i.e. the rules governing the sequences of allowable phonemes, instead of lexical words to represent the lexical constraints that are imposed by semantic domains;
–Enhance robustness against speech recognition errors.
Issues:
–Semantic Abstraction
–Multilinguality
Its major shortcoming is that it does not exploit the global phonotactics in the larger context of a spoken document.

Phonotactic-semantic Approach
Spoken document classification (SDC) is more complex than text categorization (TC).
–In TC, we usually derive the lexical vocabulary from the running text.
–For spoken documents, an additional tokenization step is needed to convert the sound wave into a sequence of phonetic units, such as words or phonemes.
Two issues:
–the definition of the tokenization unit, and
–the choice of vocabulary.

Phonotactic-semantic Approach
Definition of tokenization unit
–Traditionally, the lexical words or phonemes of a specific language are used.
–We propose to use a set of universal acoustic words (AWs): language-independent, self-organized, phoneme-like units.
–We treat the documents in all languages equally, with the same set of AWs.
–AWs can be learned from a multilingual training corpus using a data-driven approach.

Phonotactic-semantic Approach
Choice of vocabulary
–Use the bag-of-sounds statistics over AWs, instead of bag-of-words over lexical words, to derive high-level semantic characteristics from a spoken document.
–The bag-of-sounds concept is analogous to the bag-of-words paradigm originally formulated in the context of information retrieval (IR) and text categorization (TC).
–A spoken document is then represented by a high-dimensional vector derived from the statistics of term frequency.

Phonotactic-semantic Approach
Three fundamental components for SDC
–A voice tokenizer, i.e. a speech recognizer front-end which segments a spoken document into acoustic tokens;
–A statistical language model which captures statistics of semantic domain information;
–A classifier which categorizes a spoken document using the statistical language model.

Voice Tokenization with Acoustic Words
[Figure: levels of speech units: word, phoneme, frame]

Voice Tokenization – Acoustic segment modeling (ASM)
Segment an utterance into Q consecutive segments in a maximum likelihood manner
–minimizing an overall distortion with dynamic programming;
Cluster all segments into T classes with the k-means algorithm
–speech segments in the same class are acoustically similar;
Train one HMM for each class
–establish T acoustic segment models to represent the overall acoustic space of all languages.
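The first ASM step can be sketched as follows. This is only an illustration, not the authors' implementation: it assumes the distortion of a segment is the sum of squared Euclidean distances between its frames and the segment mean, and the names `segment_cost` and `segment_utterance` are hypothetical.

```python
from typing import List, Sequence, Tuple


def segment_cost(frames: Sequence[Sequence[float]], start: int, end: int) -> float:
    """Distortion of frames[start:end]: squared distance of each frame to the segment mean.

    The squared-Euclidean distortion is an assumption made for this sketch.
    """
    seg = frames[start:end]
    dim = len(seg[0])
    mean = [sum(f[d] for f in seg) / len(seg) for d in range(dim)]
    return sum((f[d] - mean[d]) ** 2 for f in seg for d in range(dim))


def segment_utterance(frames: Sequence[Sequence[float]], q: int) -> List[Tuple[int, int]]:
    """Split frames into q consecutive segments with minimum total distortion.

    Returns (start, end) index pairs with end exclusive.
    """
    n = len(frames)
    if not 1 <= q <= n:
        raise ValueError("q must be between 1 and the number of frames")
    inf = float("inf")
    # best[k][t]: minimum distortion when frames[:t] are covered by k segments
    best = [[inf] * (n + 1) for _ in range(q + 1)]
    # back[k][t]: start index of the last segment in that optimal cover
    back = [[0] * (n + 1) for _ in range(q + 1)]
    best[0][0] = 0.0
    for k in range(1, q + 1):
        for t in range(k, n + 1):
            for s in range(k - 1, t):
                if best[k - 1][s] == inf:
                    continue
                cost = best[k - 1][s] + segment_cost(frames, s, t)
                if cost < best[k][t]:
                    best[k][t] = cost
                    back[k][t] = s
    # Walk the back-pointers from the end of the utterance to recover boundaries.
    bounds: List[Tuple[int, int]] = []
    t = n
    for k in range(q, 0, -1):
        s = back[k][t]
        bounds.append((s, t))
        t = s
    bounds.reverse()
    return bounds
```

The resulting segments would then be pooled over all training utterances and clustered into T classes before HMM training. The sketch favors clarity over speed; caching segment costs would be the first optimization.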

Voice Tokenization – Phonetically-bootstrapped ASM
Add phonetic constraints in segmentation
–use a large amount of labeled speech data from a few well-studied languages;
–train language-specific phone models;
–choose some models to form a set of T models for bootstrapping;
Phonetically label the multilingual training utterances
–use the T models to decode all training utterances;
–keep the recognized sequences as "true" labels;
Re-train models
–force-align and segment all utterances based on the "true" labels;
–group all speech segments of a specific label into a class;
–use these segments to re-train an HMM.

Bag-of-Sounds Representation
Bag-of-sounds is analogous to bag-of-words;
The AW vocabulary is built from T acoustic tokens;
A spoken document is described as a count vector of AWs: each element is the count of one AW, and the dimension equals the AW vocabulary size W.
Capture local phonotactics with lexical constraints;
Capture global phonotactics with co-occurrences of AWs.
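A minimal sketch of the count vector described above. It assumes, for illustration only, that AWs are n-grams of acoustic tokens up to order 2, so that W = T + T^2; the slides do not state the n-gram order, and the function names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

Gram = Tuple[str, ...]


def build_vocabulary(tokens: Sequence[str], max_order: int = 2) -> Dict[Gram, int]:
    """Map every token n-gram up to max_order to a vector index.

    Treating AWs as token n-grams is an assumption of this sketch.
    """
    vocab: Dict[Gram, int] = {}
    for order in range(1, max_order + 1):
        for gram in product(tokens, repeat=order):
            vocab[gram] = len(vocab)
    return vocab


def bag_of_sounds(sequence: Sequence[str], vocab: Dict[Gram, int],
                  max_order: int = 2) -> List[int]:
    """Count how often each AW in the vocabulary occurs in a tokenized document."""
    counts = [0] * len(vocab)
    for order in range(1, max_order + 1):
        for i in range(len(sequence) - order + 1):
            idx = vocab.get(tuple(sequence[i:i + order]))
            if idx is not None:  # n-grams outside the vocabulary are ignored
                counts[idx] += 1
    return counts
```

With two tokens `a` and `b`, the vocabulary has 2 + 4 = 6 entries, and the sequence `a b a a` yields three counts of `a`, one of `b`, and one each of the bigrams `a a`, `a b` and `b a`.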

Language Identification
National Institute of Standards and Technology (NIST) 1996 Language Recognition Evaluation (LRE) database.
12 languages: Arabic, English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, and Vietnamese.
Linguistic Data Consortium (LDC) CallFriend corpus as the training data.
–40 30-minute conversations;
–12,000 30-second training sessions for each language.
1,492 30-second speech sessions from the 1996 NIST LRE database as the test data.

Language Identification
[Figure: a spoken utterance is passed through the universal voice tokenizer (VT), then through the language models LM-1: English, LM-2: Chinese, ..., LM-L: French, and a language classifier outputs the hypothesized language]

SVM Classifier with Feature Extraction
SVMlight V6.01 from http://svmlight.joachims.org/
Work with a linear kernel SVM;
Feature dimension
L*(L-1)/2 pair-wise binary SVMs
The class that wins the most pair-wise votes takes all.
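The pair-wise voting scheme can be sketched as below. The binary decision functions stand in for trained linear SVMs; the helper `linear`, the function `one_vs_one_predict`, and the tie-breaking rule (first class in list order) are assumptions of this sketch, not details taken from the slides or from SVMlight.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple

Scorer = Callable[[Sequence[float]], float]


def linear(weights: Sequence[float], bias: float) -> Scorer:
    """Return a linear decision function f(x) = w . x + b (stand-in for a trained SVM)."""
    return lambda x: sum(w * v for w, v in zip(weights, x)) + bias


def one_vs_one_predict(x: Sequence[float], classes: Sequence[str],
                       pairwise: Dict[Tuple[str, str], Scorer]) -> str:
    """Classify x by majority vote over the L*(L-1)/2 pair-wise classifiers.

    pairwise[(a, b)] returns a positive score when x looks like class a,
    otherwise the vote goes to class b.
    """
    votes = Counter({c: 0 for c in classes})
    for a, b in combinations(classes, 2):
        winner = a if pairwise[(a, b)](x) > 0 else b
        votes[winner] += 1
    # Ties are broken by the order of `classes` (an arbitrary choice for this sketch).
    return max(classes, key=lambda c: votes[c])
```

For L = 12 languages this scheme needs 12 * 11 / 2 = 66 binary classifiers, and every test vector is scored by all of them.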

SVM Classifier with Feature Extraction
Count-trimming (CT) removes:
–AWs that have very low frequency;
–AWs that occur in too few documents.
Mutual Information (MI) is measured between:
–class membership;
–a particular AW's presence.
MI indicates the contribution of an AW's presence to semantic classification.
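Both selection criteria can be sketched in a few lines. The thresholds `min_count` and `min_docs`, the binary presence/absence formulation of MI, and the function names are assumptions made for illustration; the slides do not give the exact definitions.

```python
import math
from typing import Dict, Sequence, Set


def count_trim(doc_counts: Sequence[Dict[str, int]], min_count: int, min_docs: int) -> Set[str]:
    """Keep AWs whose total count and document frequency both reach the thresholds."""
    total: Dict[str, int] = {}
    doc_freq: Dict[str, int] = {}
    for counts in doc_counts:
        for aw, c in counts.items():
            total[aw] = total.get(aw, 0) + c
            doc_freq[aw] = doc_freq.get(aw, 0) + 1
    return {aw for aw in total if total[aw] >= min_count and doc_freq[aw] >= min_docs}


def mutual_information(docs: Sequence[Set[str]], labels: Sequence[str],
                       aw: str, target: str) -> float:
    """MI in bits between 'aw is present in the document' and 'document is in target class'."""
    n = len(docs)
    # Joint counts over (AW present?, document in target class?)
    joint = {(p, c): 0 for p in (False, True) for c in (False, True)}
    for doc, label in zip(docs, labels):
        joint[(aw in doc, label == target)] += 1
    mi = 0.0
    for (p, c), count in joint.items():
        if count == 0:
            continue  # a zero-probability cell contributes nothing
        p_joint = count / n
        p_present = sum(v for (pp, _), v in joint.items() if pp == p) / n
        p_class = sum(v for (_, cc), v in joint.items() if cc == c) / n
        mi += p_joint * math.log2(p_joint / (p_present * p_class))
    return mi
```

An AW that appears in every document of one class and in no other reaches the maximum MI for that class, while an AW spread evenly across classes scores zero and is a candidate for removal.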

SVM Classifier with Feature Extraction
Feature Weighting
Separation Margin (SM)
–SVM with a linear kernel
–Decision function $f(\mathbf{x}) = \mathbf{a}^{T}\mathbf{x} + b$, with weight vector $\mathbf{a} = (a_1, \ldots, a_W)$
–Margin is inversely proportional to $\|\mathbf{a}\|$
–Features with higher $|a_j|$ are more influential in determining the width of the separation margin.

SVM Classifier with Feature Extraction
[Figure: SLID error rate comparison among three feature selection techniques]

SVM Classifier with Feature Extraction
[Figure: effect of training corpus size]

LSA Classifier with SVD
Latent Semantic Analysis (LSA)
Singular Value Decomposition (SVD)
–Term-document matrix: $H$
–SVD: $H = U S V^{T}$
–Retain the top Q singular values in matrix S

LSA Classifier with SVD
LSA Classifier I (LSAC-I): k-nearest neighbor
LSA Classifier II (LSAC-II): mixture modeling
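A k-nearest-neighbor classifier over reduced document vectors might look like the sketch below. The use of cosine similarity, the default k = 3, and the function names are assumptions; the slides do not specify the similarity measure or the value of k used in LSAC-I.

```python
import math
from collections import Counter
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_classify(query: Sequence[float], train_vectors: Sequence[Sequence[float]],
                 train_labels: Sequence[str], k: int = 3) -> str:
    """Label the query with the majority class among its k most similar training vectors."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Here the query and training vectors are assumed to be documents already projected into the Q-dimensional space obtained from the SVD.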

LSA Classifier with SVD
[Figure: effect of mixture number M (LSAC-II)]

LSA Classifier with SVD

Effect of training data size in LSAC-I & SVMC
#M                  1,000   2,000   6,000   12,000
LSAC-I Error (%)    19.8    16.5    15.2    14.8
SVMC Error (%)      18.2    16.2    14.4    13.9

Benchmark of different models
Model        P-PRLM   P-PRLM & Score Fusion   LSAC-II   SVMC
Error (%)    22.0     17.0                    14.9      13.9

Conclusion
Non-lexical approach to spoken document tokenization
–Universal acoustic words (AWs): language-independent, self-organized, phoneme-like units;
–Data-driven approach to learn them from a multilingual training corpus.
Phonotactic-semantic paradigm to model
–Local phonotactics in an acoustic word (AW);
–Global phonotactics in a bag-of-sounds vector.

Thank you!
