1
Chinese Spoken Document Retrieval and Organization
Berlin Chen (陳柏琳), Associate Professor, Department of Computer Science & Information Engineering, National Taiwan Normal University. 2006/08/03
2
Outline
Introduction
Information Retrieval Models
Spoken Document Summarization
Spoken Document Organization
Language Model Adaptation for Spoken Document Transcription
Conclusions and Future Work
3
Introduction (1/2)
Multimedia associated with speech is continuously growing and filling our computers and lives, such as broadcast news, lectures, historical archives, voice mails, meeting records, and spoken dialogues
Among these media, speech is the most semantic- (or information-) bearing
On the other hand, speech is the primary and most convenient means of communication between people
Speech provides a better (more natural) user interface in wireless environments and on small hand-held devices
Multimedia information access using speech will therefore be very promising in the near future
4
Introduction (2/2)
Scenario of Multimedia Access using Speech
[Diagram: users access multimedia network content through multimodal interaction; the processing components include information extraction and retrieval (IE & IR), spoken dialogues, spoken document understanding and organization, and multimedia content processing]
5
Related Research Work
Spoken document transcription, organization, and retrieval have been the focus of much recent research [R3, R4]
Informedia System at Carnegie Mellon Univ.
Multimedia Document Retrieval Project at Cambridge Univ.
Rough'n'Ready System at BBN Technologies
SpeechBot Audio/Video Search System at HP Labs
Spontaneous Speech Corpus and Processing Technology Project in Japan, etc.
6
Related Research Issues (1/2)
Transcription of Spoken Documents
Automatically convert speech signals into sequences of words or other suitable units for further processing
Audio Indexing and Information Retrieval
Robust representation of the spoken documents
Matching between (spoken) queries and spoken documents
Named Entity Extraction from Spoken Documents
Personal names, organization names, location names, event names
These are very often out-of-vocabulary (OOV) words, which are difficult to recognize
Spoken Document Segmentation
Automatically segment spoken documents into short paragraphs, each with a central topic
7
Related Research Issues (2/2)
Information Extraction for Spoken Documents
Extraction of key information, such as who, when, where, what, and how, from the information described by spoken documents
Summarization for Spoken Documents
Automatically generate a summary (in text/speech form) for each short paragraph
Title Generation for Multimedia/Spoken Documents
Automatically generate a title (in text/speech form) for each short paragraph, i.e., a very concise summary indicating the topic area
Topic Analysis and Organization for Spoken Documents
Analyze the subject topics of the short paragraphs, or organize them into graphic structures for efficient browsing
8
An Example System for Chinese Broadcast News (1/2)
For example, a prototype system developed at NTU for efficient spoken document retrieval and browsing [R4]
9
An Example System for Chinese Broadcast News (2/2)
Users can browse spoken documents in top-down and bottom-up manners
10
Information Retrieval Models
Information retrieval (IR) models can be characterized by two different matching strategies:
Literal term matching: match queries and documents in an index term space
Concept matching: match queries and documents in a latent semantic space
Example spoken document (automatic Chinese transcript of a broadcast news story about mainland China's new-generation air force, containing recognition errors): 香港星島日報篇報導引述軍事觀察家的話表示，到二零零五年台灣將完全喪失空中優勢，原因是中國大陸戰機不論是數量或是性能上都將超越台灣，報導指出中國在大量引進俄羅斯先進武器的同時也得加快研發自製武器系統，目前西安飛機製造廠任職的改進型飛豹戰機即將部署尚未與蘇愷三十通道地對地攻擊住宅飛機，以督促遇到挫折的監控其戰機目前也已經取得了重大階段性的認知成果。根據日本媒體報導在台海戰爭隨時可能爆發情況之下北京方面的基本方針，使用高科技答應局部戰爭。因此，解放軍打算在二零零四年前又有包括蘇愷三十二期在內的兩百架蘇霍伊戰鬥機。
Example query: 中共新一代空軍戰力 ("Communist China's new-generation air force strength"). Is the document relevant?
11
IR Models: Literal Term Matching (1/2)
Vector Space Model (VSM)
Vector representations are used for queries and documents
Each dimension is associated with an index term and is usually weighted by TF-IDF
The cosine measure is used for query-document relevance
VSM can be implemented with an inverted-file structure for efficient document search
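The TF-IDF weighting and cosine measure above can be sketched in a few lines of Python. This is an illustrative toy, not the system's actual indexer; tokenization and the inverted-file structure are omitted:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector (term -> weight) for each document."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

A query is weighted the same way and compared against every document vector (in practice via the inverted file, so only documents sharing at least one term are scored).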
12
IR Models: Literal Term Matching (2/2)
Hidden Markov Models (HMM) [R1]
Also known as Language Modeling (LM) approaches
Each document is a probabilistic generative model consisting of a set of N-gram distributions for predicting the query
Given a training set of query and relevant-document pairs, the models can be optimized by the expectation-maximization (EM) or minimum classification error (MCE) training algorithms
Such approaches provide a potentially effective and theoretically attractive probabilistic framework for studying IR problems
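A minimal sketch of the generative idea: rank documents by the likelihood that each document model generates the query, using a unigram document model smoothed with a background collection model. The interpolation weight `lam` is a placeholder; the approach described in [R1] also uses bigram distributions and trains the weights with EM/MCE rather than fixing them:

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.6):
    """log P(Q|D): unigram document model linearly interpolated
    with a background collection model (Jelinek-Mercer smoothing)."""
    d_tf, d_len = Counter(doc), len(doc)
    c_tf, c_len = Counter(collection), len(collection)
    score = 0.0
    for w in query:
        p = lam * (d_tf[w] / d_len) + (1 - lam) * (c_tf[w] / c_len)
        if p == 0.0:
            return float("-inf")  # term unseen everywhere
        score += math.log(p)
    return score
```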
13
IR Models: Concept Matching (1/3)
Latent Semantic Analysis (LSA) [R2]
Start with a matrix describing the intra- and inter-document statistics between all terms and all documents
Singular value decomposition (SVD) is then performed on the matrix to project all term and document vectors onto a reduced latent topical space:
A (m x n) ≈ U (m x k) Σ_k (k x k) V^T (k x n)
Matching between queries and documents can then be carried out in this topical space
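The truncated SVD projection can be sketched with NumPy. `lsa_project` and `fold_in` are illustrative names, and entry weighting of the raw term-document counts (e.g., TF-IDF or entropy weighting, as used in practice) is omitted:

```python
import numpy as np

def lsa_project(A, k):
    """Truncated SVD of a term-document matrix A (m x n):
    A ~ U_k @ diag(s_k) @ Vt_k.  Returns the documents' coordinates
    in the k-dim topical space and a fold-in function for queries."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    doc_coords = (np.diag(s_k) @ Vt_k).T   # one row per document
    def fold_in(q_term_vec):
        """Map a length-m query term vector into the topical space."""
        return q_term_vec @ U_k
    return doc_coords, fold_in
```

With the full rank kept, inner products between documents are preserved exactly; truncating k smooths away term-usage noise, which is the point of the latent space.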
14
IR Models: Concept Matching (2/3)
Probabilistic Latent Semantic Analysis (PLSA) [R5, R6]
A probabilistic framework for the above topical approach
The relevance measure is not obtained directly from the frequency of a query term occurring in a document, but from the frequencies of the term and the document in the latent topics:
P (m x n) ≈ U (m x k) Σ_k (k x k) V^T (k x n), where U holds P(w|z), Σ_k = diag(P(z)), and V holds P(d|z)
A query and a document may thus have a high relevance score even if they share no terms in common
15
IR Models: Concept Matching (3/3)
PLSA can also be viewed as an HMM or a topical mixture model (TMM)
Explicitly interpret the document as a mixture model used to predict the query, which can easily be related to the conventional HMM modeling approaches widely studied in the speech processing community (topical distributions are tied among documents)
Thus quite a few theoretically attractive model training algorithms can be applied, in supervised or unsupervised manners
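An unsupervised EM training sketch for plain PLSA on a term-document count matrix. This is a toy dense implementation for small matrices; the tied-topic TMM variant and the supervised (query/relevant-document) training used in the retrieval experiments are not shown:

```python
import numpy as np

def plsa(N, K, iters=30, seed=0):
    """EM for PLSA on a count matrix N (m terms x n docs).
    Learns P(w|z) and P(z|d) by maximizing the collection likelihood."""
    rng = np.random.default_rng(seed)
    m, n = N.shape
    p_w_z = rng.random((m, K)); p_w_z /= p_w_z.sum(axis=0)
    p_z_d = rng.random((K, n)); p_z_d /= p_z_d.sum(axis=0)
    for _ in range(iters):
        # E-step: posterior P(z|d,w), shape (m, K, n)
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]
        post = joint / joint.sum(axis=1, keepdims=True).clip(1e-12)
        # M-step: re-estimate both distributions from expected counts
        exp_counts = N[:, None, :] * post
        p_w_z = exp_counts.sum(axis=2)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True)
        p_z_d = exp_counts.sum(axis=0)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True)
    return p_w_z, p_z_d
```

Each document's mixture P(w|d) = Σ_z P(w|z) P(z|d) is then a proper distribution over terms, which is exactly the "document as generative model" view used for retrieval.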
16
Preliminary IR Evaluations
Experiments were conducted on the TDT2/TDT3 spoken document collections [R6]
TDT2 was used for parameter tuning/training, TDT3 for evaluation
E.g., mean average precision (mAP) tested on TDT3; HMM/PLSA/TMM were trained in a supervised manner
The language modeling approaches (TMM/PLSA/HMM) achieve significantly better results than the conventional statistical approaches (VSM/LSA) on this spoken document retrieval (SDR) task

     TMM     HMM     PLSA    VSM     LSA
TD   0.7870  0.7174  0.6882  0.6505  0.6440
SD   0.7852  0.7156  0.6688  0.6216  0.6390
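For reference, the mAP figure reported above is the mean over queries of the per-query average precision, which can be computed as follows (a standard definition, not code from the evaluation itself):

```python
def average_precision(ranked, relevant):
    """AP for one query: average of precision@k over the ranks k at
    which a relevant document is retrieved.  mAP is the mean of AP
    over all queries."""
    hits, ap = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            ap += hits / k
    return ap / len(relevant) if relevant else 0.0
```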
17
Spoken Document Summarization (1/3)
Extractive Summarization
Extract a number of indicative sentences from the original spoken document
A lightweight process compared with abstractive summarization
Use TMM to explore the relationships between the handcrafted titles and documents of a contemporary newswire text collection for selecting salient sentences from spoken documents
Build a probabilistic latent topical space in which each sentence is represented by a model used to predict the document
18
Spoken Document Summarization (2/3)
An illustration of the TMM-based extractive summarization
The sentence model can be obtained by probabilistic "fold-in"
[Diagram: latent topics T1 ... TK link the text documents D1 ... DM and their titles H1 ... HM to the sentences Sg,1 ... Sg,L of a spoken document Dg]
19
Spoken Document Summarization (3/3)
Preliminary tests on 200 radio broadcast news stories collected in Taiwan (automatic transcripts with a 14.17% character error rate)
The ROUGE-2 measure was used to evaluate the performance of the different models
TMM yields considerable performance gains at the lower summarization ratios

Ratio  VSM LSA-1 LSA-2 SigSen TMM Random
10%    0.2603 0.1872 0.1976 0.2808 0.1466
20%    0.2739 0.2089 0.2332 0.2837 0.1696
30%    0.3092 0.2288 0.2724 0.3100 0.1956
50%    0.4107 0.3772 0.3883 0.4266 0.3436
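The ROUGE-2 measure used above can be sketched as the clipped bigram overlap between a candidate summary and a reference summary. This is a simplification of the full ROUGE toolkit (no stemming, single reference, recall only):

```python
from collections import Counter

def rouge_2(candidate, reference):
    """ROUGE-2 recall: fraction of the reference's bigrams that also
    appear in the candidate, with clipped (min) counts."""
    def bigrams(tokens):
        return Counter(zip(tokens, tokens[1:]))
    cand, ref = bigrams(candidate), bigrams(reference)
    if not ref:
        return 0.0
    overlap = sum(min(c, ref[b]) for b, c in cand.items() if b in ref)
    return overlap / sum(ref.values())
```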
20
Spoken Document Organization (1/3)
Each document is viewed as a TMM model used to generate itself
Additional transitions between topical mixtures reflect the topological relationships between topical classes on a 2-D map
[Diagram: a document model over a two-dimensional tree structure of organized topics]
21
Spoken Document Organization (2/3)
Document models can be trained in an unsupervised way by maximizing the total log-likelihood of the document collection
Initial evaluation results (conducted on the TDT2 collection) show that the TMM-based approach is competitive with the conventional Self-Organizing Map (SOM) approach

Model  Iterations  distBet/distWithin
TMM    10          1.9165
TMM    20          2.0650
TMM    30          1.9477
TMM    40          1.9175
SOM    100         2.0604
22
Spoken Document Organization (3/3)
Each topical class can be labeled by representative words selected with a term-scoring criterion
An example map for international political news
23
Dynamic Language Model Adaptation (1/3)
TMM can also be applied to dynamic language model adaptation for spoken document recognition
In word graph rescoring, the history of each newly occurring word can be treated as a document used to predict that word
24
Automatic Speech Recognition (ASR)
A search process to uncover the word sequence W = w1 w2 ... wn with the maximum posterior probability:
W* = arg max_W P(W|X) = arg max_W P(X|W) P(W)
where P(X|W) is the acoustic model probability and P(W) is the language model probability
N-gram language modeling approximates P(W) as:
Unigram: P(W) ≈ Π_i P(wi)
Bigram: P(W) ≈ Π_i P(wi | wi-1)
Trigram: P(W) ≈ Π_i P(wi | wi-2, wi-1)
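The N-gram factorization above can be illustrated with a maximum-likelihood estimator over a training token stream. This toy has no smoothing, so unseen histories and words get probability 0; real LVCSR language models use smoothed counts:

```python
from collections import Counter

def train_ngram(tokens, n):
    """MLE n-gram model: P(w | history) = count(history + w) / count(history)."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hists = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 1))
    def prob(w, *history):
        h = tuple(history[-(n - 1):]) if n > 1 else ()
        return ngrams[h + (w,)] / hists[h] if hists[h] else 0.0
    return prob
```

For example, a bigram model trained on "a b a b a" assigns P(b | a) = 1, since every occurrence of "a" used as a history is followed by "b".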
25
Dynamic Language Model Adaptation (2/3)
For each history model, the topical distributions are trained beforehand and kept fixed during rescoring, while the history's topic mixture weights can be dynamically estimated (folded in)
The TMM LM can be incorporated with the background N-gram LM via probability interpolation or probability scaling
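The two combination schemes can be sketched as follows. The exact scaling formula is not given on the slide, so the sketch follows the common LSA-style formulation (unnormalized); `p_marginal` standing for the word's unigram probability is an assumption:

```python
import math

def interpolate(p_ngram, p_tmm, alpha=0.5):
    """Probability interpolation: convex mix of the background n-gram
    probability and the TMM probability."""
    return alpha * p_ngram + (1 - alpha) * p_tmm

def scale(p_ngram, p_tmm, p_marginal, beta=0.5):
    """Probability scaling (assumed LSA-style form): boost or damp the
    n-gram probability by how much more likely the TMM makes the word
    than its marginal (p_marginal).  Left unnormalized here; a real
    rescorer renormalizes over the vocabulary."""
    return p_ngram * (p_tmm / p_marginal) ** beta
```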
26
Dynamic Language Model Adaptation (3/3)
Preliminary tests on 500 radio broadcast news stories
TMM is competitive with conventional MAP-based LM adaptation, and is superior to LSA
Combining the TMM (topical information) and MAP (contextual information) approaches yields additional performance gains

Model                                     CER (%)  PP
Baseline (background trigram LM)          15.22    752.49
+ TMM (interpolation)                     14.47    502.50
+ TMM (scaling)                           13.87    493.26
+ LSA (scaling)                           15.19    676.67
+ MAP in-domain trigram                   13.74    430.59
+ TMM (scaling) + MAP in-domain trigram   13.27    306.75
27
Prototype Systems Developed at NTNU (1/3)
Spoken Document Retrieval
28
Prototype Systems Developed at NTNU (2/3)
Speech Retrieval and Browsing of Digital Historical Archives
29
Prototype Systems Developed at NTNU (3/3)
Systems Currently Implemented with a Client-Server Architecture
[Diagram: a PDA client communicates with a Mandarin LVCSR server, an information retrieval server, and an audio streaming server; a multi-scale indexer builds inverted files from word-level and syllable-level indexing features extracted from the automatically transcribed broadcast news corpus]
30
Conclusions and Future Work
Multimedia information access using speech will be very promising in the near future
Speech is the key for multimedia understanding and organization
Several task domains still remain challenging
PLSA/TMM-based modeling approaches using probabilistic latent topical information are promising for language modeling, information retrieval, document summarization, segmentation, organization, etc.
A unified framework for spoken document recognition, organization, and retrieval was initially investigated
31
References
[R1] B. Chen et al., "A Discriminative HMM/N-Gram-Based Retrieval Approach for Mandarin Spoken Documents," ACM Transactions on Asian Language Information Processing, Vol. 3, No. 2, June 2004
[R2] J.R. Bellegarda, "Latent Semantic Mapping," IEEE Signal Processing Magazine, Vol. 22, No. 5, Sept. 2005
[R3] K. Koumpis and S. Renals, "Content-Based Access to Spoken Audio," IEEE Signal Processing Magazine, Vol. 22, No. 5, Sept. 2005
[R4] L.S. Lee and B. Chen, "Spoken Document Understanding and Organization," IEEE Signal Processing Magazine, Vol. 22, No. 5, Sept. 2005
[R5] T. Hofmann, "Unsupervised Learning by Probabilistic Latent Semantic Analysis," Machine Learning, Vol. 42, 2001
[R6] B. Chen, "Exploring the Use of Latent Topical Information for Statistical Chinese Spoken Document Retrieval," Pattern Recognition Letters, Vol. 27, No. 1, Jan. 2006
32
Thank You!