
Survey of Approaches to Information Retrieval of Speech Messages Kenney Ng Spoken Language Systems Group Laboratory for Computer Science Massachusetts Institute of Technology Presenter: Chia-Hao Lee

Outline Introduction Information Retrieval Text Retrieval Differences between text and speech media Information Retrieval of Speech Messages

Introduction Process, organize, and analyze the data. Present the data in human-usable form. Find the “interesting” pieces of information efficiently. An increasingly large portion of information is in spoken language form: –recorded speech messages –radio and television broadcasts This motivates the development of automatic methods.

Information Retrieval 2.1 Definition “Connected with the representation, storage, organization, and accessing of information items.” Return the best matches to a “request” expressing the user’s information need. There is no restriction on the type of document: Text Retrieval, Document Retrieval Image Retrieval, Speech Retrieval Multi-media Retrieval

Database Retrieval vs. Information Retrieval
Similarities: both assume the existence of an organized collection of information items, and both use a request formulated by a user to access the items.
Differences:
–Goal: return specific facts (answers that exactly match the request) vs. return items relevant to the user’s request.
–Structure: well defined vs. not well defined.
–The request: a complete specification of the user’s information need vs. an incomplete specification.
–Type of answer: a specific fact or piece of information vs. a general topic or subject area the user wants to find out more about.

Information Retrieval Component Processes –Creating document representations (indexing) –Creating request representations (query formation) –Comparing representations (retrieval) –Evaluating retrieved documents (relevance feedback)

Information Retrieval Performance Recall: the fraction of all the relevant documents in the entire collection that are retrieved in response to a query. Precision: the fraction of the retrieved documents that are relevant. Average precision: the precision values obtained at each new relevant document in the ranked output for an individual query are averaged.
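As a concrete illustration, here is a minimal Python sketch of the three measures, assuming ranked is the ranked list of retrieved document ids and relevant is the set of relevant ids (both names are illustrative):

def recall(ranked, relevant):
    return len(set(ranked) & relevant) / len(relevant)

def precision(ranked, relevant):
    return len(set(ranked) & relevant) / len(ranked)

def average_precision(ranked, relevant):
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank   # precision at each new relevant document
    return total / len(relevant) if relevant else 0.0

# relevant documents found at ranks 2 and 4: (1/2 + 2/4) / 3 = 0.33
print(average_precision(["d3", "d1", "d7", "d2"], {"d1", "d2", "d9"}))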

Information Retrieval Related Information Process Information Filtering vs. Retrieval
–Filtering: static user request, dynamic data collection, passive user, training data available.
–Retrieval (ad hoc): dynamic user request, static data collection, active user, no training data.

Information Retrieval Related Information Process Categorization vs. Clustering
–Categorization: the goal is to classify or assign labels to documents; labeled data and training data are available.
–Clustering: the goal is to discover structure in a collection of unlabelled data; no labels or training data.

Text Retrieval Indexing and Document Representation Query Formation Matching Query and Document Representation

Text Retrieval Terms and Keywords –A list of words extracted from the full-text document. –Construct a stop list to remove words that are too common to be useful. –Handle synonyms: construct a dictionary structure and replace each word by a representative of its class. –A tradeoff exists between normalization and discrimination in the indexing process.
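A minimal Python sketch of this indexing step, with toy stop-word and synonym lists of my own choosing:

import re

STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}
SYNONYMS = {"automobile": "car", "vehicle": "car"}   # dictionary structure

def index_terms(text):
    words = re.findall(r"[a-z]+", text.lower())        # extract words
    words = [w for w in words if w not in STOP_WORDS]  # apply stop list
    return [SYNONYMS.get(w, w) for w in words]         # map synonym classes

print(index_terms("The automobile is in the garage"))  # ['car', 'garage']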

Text Retrieval Index Term Weighting Term frequency –The frequency of occurrence of each term in the document. –For term t_k in document d_i: tf_ik = the number of times t_k occurs in d_i.

Text Retrieval Index Term Weighting Inverse document frequency –Weight each term inversely proportionally to the number of documents in which the term occurs. –For term t_k: idf_k = log(N / n_{t_k}), where N is the total number of documents and n_{t_k} is the number of documents with term t_k.

Text Retrieval Index Term Weighting Weights of terms –Terms that occur frequently in particular documents but rarely in the overall collection should receive a large weight. –Combining the two measures gives the tf*idf weight w_ik = tf_ik · log(N / n_{t_k}).
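A small Python sketch of this combined tf*idf weighting over a toy tokenized collection:

import math
from collections import Counter

docs = [["speech", "retrieval", "speech"],           # toy tokenized documents
        ["text", "retrieval"],
        ["speech", "recognition"]]

N = len(docs)
df = Counter(term for d in docs for term in set(d))  # n_k: docs containing t_k

def tfidf(doc):
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf(docs[0]))  # 'speech' is locally frequent but fairly common overall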

Text Retrieval Query Formation Relevance Feedback The IR system automatically modifies a query based on user feedback about documents retrieved in an initial run. Advantages: –add new terms to the query –re-weight existing query terms.

Text Retrieval Query Formation Another approach to relevance feedback Compute a “relevance weight” for each term t_i: rw_i = log [ p_i (1 − q_i) / ( q_i (1 − p_i) ) ], where p_i and q_i are the probabilities that term t_i occurs in relevant and in non-relevant documents, estimated from the user’s feedback judgments. The weight can be used to re-weight the terms in the initial query.
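A sketch of one common way to estimate this weight from feedback counts; the 0.5 smoothing and the exact estimator are assumptions, not given on the slide:

import math

def relevance_weight(r_i, R, n_i, N):
    # r_i: judged-relevant docs containing t_i, R: judged-relevant docs,
    # n_i: all docs containing t_i, N: collection size
    p_i = (r_i + 0.5) / (R + 1.0)            # P(t_i present | relevant)
    q_i = (n_i - r_i + 0.5) / (N - R + 1.0)  # P(t_i present | non-relevant)
    return math.log((p_i * (1 - q_i)) / (q_i * (1 - p_i)))

print(relevance_weight(r_i=8, R=10, n_i=40, N=1000))  # strongly positive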

Text Retrieval Matching Query and Document Representations –Boolean Model, Extended Boolean Model –Vector Space Model –Probabilistic Models
Exact-match: divides the document collection into matched and unmatched sets (Boolean model).
Best-match: gives each document a score and ranks the collection (vector space or probabilistic models).

Text Retrieval Boolean Model Document representation –Binary-valued variables: True: the term is present in the document False: the term is absent from the document –The document can be represented as a binary vector. Query –Boolean operators: AND, OR, and NOT Matching function –Standard rules of Boolean logic –If the document representation satisfies the query expression, then that document matches the query.
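A minimal Python sketch of exact-match Boolean retrieval; the nested-tuple query representation is my own choice:

def matches(query, doc_terms):
    if isinstance(query, str):                 # a bare term
        return query in doc_terms
    op, *args = query
    if op == "AND":
        return all(matches(a, doc_terms) for a in args)
    if op == "OR":
        return any(matches(a, doc_terms) for a in args)
    if op == "NOT":
        return not matches(args[0], doc_terms)
    raise ValueError(op)

doc = {"speech", "retrieval"}
print(matches(("AND", "speech", ("NOT", "video")), doc))  # True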

Text Retrieval Extended Boolean Model Because the retrieval decision of the Boolean model is too harsh, the extended Boolean model scores an AND query as sim(q_and, d) = 1 − ( Σ_k (1 − d_k)^p / K )^(1/p). This is maximal for a document containing all the terms and decreases as the number of matching terms decreases.

Text Retrieval Extended Boolean Model For the OR query, sim(q_or, d) = ( Σ_k d_k^p / K )^(1/p). This is minimal for a document that contains none of the terms and increases as the number of matching terms increases. The variable p is a constant in the range 1≤p≤∞ that is determined empirically; it is typically in the range 2≤p≤5. The model gives a “soft” Boolean matching function.
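A sketch of the two p-norm matching functions, where d is the vector of query-term weights in [0, 1] for one document:

def sim_and(d, p=2):
    return 1.0 - (sum((1.0 - dk) ** p for dk in d) / len(d)) ** (1.0 / p)

def sim_or(d, p=2):
    return (sum(dk ** p for dk in d) / len(d)) ** (1.0 / p)

print(sim_and([1.0, 1.0]), sim_and([1.0, 0.0]))  # 1.0 vs. about 0.29
print(sim_or([0.0, 0.0]), sim_or([1.0, 0.0]))    # 0.0 vs. about 0.71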

Text Retrieval Vector Space Model Documents and queries are represented as vectors in a K-dimensional space, where K is the number of indexing terms. When the indexing terms are taken to form an orthogonal basis for the vector space, it is implicitly assumed that the indexing terms are independent. Query–document similarity is typically measured by the cosine of the angle between the two vectors.
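A sketch of cosine matching over sparse term-weight vectors (dictionaries mapping terms to weights):

import math

def cosine(q, d):
    dot = sum(q[t] * d[t] for t in q.keys() & d.keys())
    nq = math.sqrt(sum(v * v for v in q.values()))
    nd = math.sqrt(sum(v * v for v in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

query = {"speech": 1.0, "retrieval": 0.8}
doc = {"speech": 0.5, "retrieval": 1.2, "text": 0.3}
print(cosine(query, doc))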

Text Retrieval Probabilistic Model (1/6) Bayes’ Decision Rule –The probability that document d is relevant to query q is denoted P(R | d, q). –The probability that document d is non-relevant to query q is denoted P(R̄ | d, q). –C_r is the cost of retrieving a non-relevant document. –C_n is the cost of not retrieving a relevant document. –The expected cost of retrieving an extraneous document is C_r · P(R̄ | d, q), so d should be retrieved when C_r · P(R̄ | d, q) ≤ C_n · P(R | d, q).

Text Retrieval Probabilistic Model (2/6) How do we compute the posterior probabilities P(R | d, q) and P(R̄ | d, q)? Based on Bayes’ rule: P(R | d, q) = P(d | R, q) P(R | q) / P(d | q). P(R | q) and P(R̄ | q) are the prior probabilities of relevance and non-relevance of a document. P(d | R, q) and P(d | R̄, q) are the likelihoods or class-conditional probabilities.

Text Retrieval Probabilistic Model (3/6) Now we have to estimate the likelihoods P(d | R, q) and P(d | R̄, q).

Text Retrieval Probabilistic Model (4/6) To simplify the function, we make the following assumptions: –The document vectors are binary, indicating the presence or absence of each indexing term. –Each term has a binomial distribution. –There are no interactions between the terms.

Text Retrieval Probabilistic Model (5/6) Under these assumptions, ranking by the log-likelihood ratio gives log [ P(d | R, q) / P(d | R̄, q) ] = Σ_k d_k w_k + constant, with w_k = log [ p_k (1 − q_k) / ( q_k (1 − p_k) ) ], where d_k is the binary indicator for indexing term t_k, p_k = P(t_k present | R), and q_k = P(t_k present | R̄).

Text Retrieval Probabilistic Model (6/6) –w_k is the same as the relevance weight of the k-th index term. –Assume p_k is a constant value: 0.5. –Estimate q_k by the overall frequency: n_k / N. –With these estimates, w_k reduces to the idf-like weight log((N − n_k) / n_k).
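A sketch of scoring documents with these simplified weights:

import math

def term_weight(n_k, N, p_k=0.5):
    q_k = n_k / N
    return math.log((p_k * (1 - q_k)) / (q_k * (1 - p_k)))

def score(doc_terms, query_terms, df, N):
    # sum w_k over query terms present in the document (d_k = 1)
    return sum(term_weight(df[t], N) for t in query_terms if t in doc_terms)

df = {"speech": 20, "retrieval": 50}   # toy document frequencies, N = 1000
print(score({"speech", "retrieval"}, ["speech", "retrieval"], df, 1000))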

Poisson Model Unlike the above model with binary document vectors, in this model each document vector contains the number of occurrences of each indexing term in the document. The probability that a document d in class R contains n_k occurrences of indexing term t_k is P(n_k | R) = e^(−λ_{k,R}) λ_{k,R}^{n_k} / n_k!, where λ_{k,R} is the mean parameter for the indexing term in class R documents.

Text Retrieval Poisson Model Similarly, for a document in class R̄ we have P(n_k | R̄) = e^(−λ_{k,R̄}) λ_{k,R̄}^{n_k} / n_k!. So we can get the log-likelihood-ratio scoring function: log [ P(d | R) / P(d | R̄) ] = Σ_k [ n_k log( λ_{k,R} / λ_{k,R̄} ) − ( λ_{k,R} − λ_{k,R̄} ) ].

Text Retrieval Poisson Model The most useful indexing terms are those with a large separation between the Poisson mean parameters λ_{k,R} and λ_{k,R̄}.
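A sketch of the Poisson log-likelihood-ratio score, with toy mean parameters:

import math

def poisson_llr(counts, rates):
    # counts: term -> n_k; rates: term -> (lam_r, lam_n) mean parameters
    s = 0.0
    for term, n_k in counts.items():
        lam_r, lam_n = rates[term]
        s += n_k * math.log(lam_r / lam_n) - (lam_r - lam_n)
    return s

rates = {"speech": (3.0, 0.5), "budget": (0.2, 1.0)}   # toy parameters
print(poisson_llr({"speech": 4, "budget": 0}, rates))  # high: looks relevant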

Text Retrieval Dependent Model In the above models, we have assumed that the indexing terms are independent of each other. Removing this assumption requires modeling the joint distribution of the terms, but that is computationally impractical and there is not enough data to estimate it, so only “partial” dependence between the indexing terms is used.

Differences between text and speech media Speech is a richer and more expressive medium than text (mood, tone). Retrieval models must be robust to noise and errors in the transcription. It is difficult to accurately extract and represent the contents of a speech message in a form that can be efficiently stored and searched, because of multiple microphones, multiple speakers, and so on.

Information Retrieval of Speech Messages Speech Message Retrieval –Large Vocabulary Word Recognition Approach –Sub-Word Unit Approaches –Word Spotting Approaches Speech Message Classification and Sorting –Topic Identification –Topic Spotting –Topic Clustering

Large Vocabulary Word Recognition Approach Suggested by CMU in the Informedia digital video library project. A user can interact with the text retrieval system to obtain video clips stored in the library that are relevant to his request. Processing pipeline: sound track of the video → large vocabulary speech recognizer (Sphinx-II) → textual transcript → natural language understanding → full-text information retrieval system.

Sub-Word Unit Approaches Syllabic Units Phonetic Units

Syllabic Units VCV (vowel-consonant-vowel) features –Sub-word units consisting of a maximal sequence of consonants enclosed between two maximal sequences of vowels. –e.g., INFORMATION has the VCV-features INFO, ORMA, ATIO. –Take a subset of these features as the indexing terms.
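A sketch of VCV-feature extraction, treating only A, E, I, O, U as vowels (a simplification):

import re

# maximal vowel run, maximal consonant run, maximal vowel run; the lookahead
# permits overlapping features, the lookbehind keeps the left vowel run maximal
VCV = re.compile(r"(?<![AEIOU])(?=([AEIOU]+[BCDFGHJKLMNPQRSTVWXYZ]+[AEIOU]+))")

def vcv_features(word):
    return [m.group(1) for m in VCV.finditer(word.upper())]

print(vcv_features("INFORMATION"))  # ['INFO', 'ORMA', 'ATIO']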

Syllabic Units Criteria for selecting features: –A feature should occur frequently enough for a reliable acoustic model to be trained for it. –It should not occur so frequently that its ability to discriminate between different messages is poor. Process: query → VCV-features → tf*idf weights → compared against the document representations with the cosine similarity function → the document with the highest score is returned.

Major problem –The acoustic confusability of the VCV-features is not taken into account during the selection of indexing features; they are selected based only on the text transcription. –As a result, the approach may have a high false alarm rate.

Phonetic Units Using variable-length phone sequences as indexing features. –These features can be viewed as “pseudo-words” and were shown to be useful for detecting or spotting topics in recorded military radio broadcasts. –An automatic procedure based on “digital trees” is used to search the possible subsequences. –A Hidden Markov Model (HMM) phone recognizer with 52 monophone models is used to process the speech. More domain-independent than a word-based system.

Word Spotting Approaches In complexity, word spotting lies between simple phonetic recognition and full large-vocabulary recognition. Word spotting has been used in two different ways: 1. A small, fixed number of keywords is selected a priori for both recognition and indexing. 2. The speech messages in the collection are processed and stored in a form (e.g., a phone lattice) that allows arbitrary keywords to be searched for after they are specified by the user.

Speech Message Classification and Sorting Topic Identification –K keywords. –n_k is the binary value indicating the presence or absence of keyword w_k. –Find the topic T_i that maximizes the score S_i.

Speech Message Classification and Sorting Topic Identification –With 6 topics and the top-scoring 40 words for each, there are 240 keywords in total. –Using these keywords on the text transcriptions of the speech messages, 82.4% classification accuracy is achieved. –A genetic algorithm used to reduce the number of keywords down to 126 gives a small drop in classification performance, to 78.2%.

Topic Identification The topic-dependent unigram language models give the score S_i = Σ_{k=1..K} n_k log p(w_k | T_i) –K is the number of keywords in the indexing vocabulary –n_k is the number of times keyword w_k occurs in the speech message –p(w_k | T_i) is the unigram or occurrence probability of keyword w_k in the set of class T_i messages.
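A sketch of classifying a message with this score, using toy keyword probabilities:

import math
from collections import Counter

topic_models = {                               # p(w_k | T_i), toy values
    "weather": {"rain": 0.05, "wind": 0.04, "game": 0.001},
    "sports":  {"rain": 0.002, "wind": 0.003, "game": 0.06},
}

def topic_score(message_words, model):
    counts = Counter(w for w in message_words if w in model)  # keywords only
    return sum(n * math.log(model[w]) for w, n in counts.items())

msg = ["rain", "and", "wind", "rain"]
best = max(topic_models, key=lambda t: topic_score(msg, topic_models[t]))
print(best)  # 'weather'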

Topic Identification Number of words vs. topic classification accuracy:
–All 8431 words in the recognition vocabulary: 72.5%
–A subset of 4600 words, selected by performing a χ² hypothesis test based on contingency tables to pick the “important” keywords: 74%
–A genetic algorithm search was then used to reduce the vocabulary further: …%

Topic Identification The length-normalized topic score is S_i = (1/N) Σ_{k=1..K} n_k log p(w_k | T_i) –N is the total number of words in the speech message –K is the number of keywords in the indexing vocabulary –n_k is the number of times keyword w_k occurs in the speech message –p(w_k | T_i) is the unigram or occurrence probability of keyword w_k in the set of class T_i messages.

Topic Identification With 750 keywords, the classification accuracy is 74.6%.

Topic Identification The topic model is extended to a mixture of multinomials: P(d | T_i) = Σ_{m=1..M} π_m Π_{k=1..K} p_m(w_k | T_i)^{n_k} –M is the number of multinomial model components –π_m is the weight of the m-th multinomial component –K is the number of keywords in the indexing vocabulary –n_k is the number of times keyword w_k occurs in the speech message –p_m(w_k | T_i) is the unigram or occurrence probability of keyword w_k in the set of class T_i messages under component m.

Topic Identification Experiments indicate that the more complex mixture models do not perform as well as the simple single-multinomial model.

Topic Spotting A “usefulness” measure captures how discriminating a word is for the topic: U_k = P(w_k | T) log [ P(w_k | T) / P(w_k | T̄) ], where P(w_k | T) and P(w_k | T̄) are the probabilities of detecting the keyword in the topic and in the unwanted messages. This measure selects words that occur often in the topic and have high discriminability.

Topic Spotting Spotting is performed by accumulating, over a window of speech (typically 60 seconds), the log likelihood ratios of the detected keywords to produce a topic score for that region of the speech message.
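A sketch of this windowed accumulation over time-stamped keyword detections; the detection format and the rates are toy assumptions:

import math

def spot(detections, llr, window=60.0):
    # detections: list of (time_sec, keyword); returns (window_start, score)
    scores = []
    for t0, _ in detections:
        s = sum(llr[w] for t, w in detections if t0 <= t < t0 + window)
        scores.append((t0, s))
    return scores

llr = {"storm": math.log(0.05 / 0.002), "lunch": math.log(0.01 / 0.012)}
hits = [(3.0, "storm"), (20.0, "lunch"), (41.0, "storm")]
print(max(spot(hits, llr), key=lambda x: x[1]))  # best-scoring window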

Topic Spotting Log-linear models that try to capture dependencies between the keywords are also examined, where w represents the vector of keywords and the model’s coefficients weight combinations of keywords. Their experiments show that using a carefully chosen log-linear model can give topic spotting performance that is better than using the basic model that assumes keyword independence.

Topic Clustering Tries to discover structure or relationships between messages in a collection. The clustering process: –Tokenization –Similarity computation –Clustering

Topic Clustering Tokenization: come up with a suitable representation of the speech message that can be used in the next two steps. Similarity: every pair of messages must be compared; an N-gram model is used. Clustering: hierarchical tree clustering or nearest-neighbor classification. The approach works well on true transcription texts, with figure-of-merit (FOM) rates of about 90%. Using speech input is worse: FOM drops to 70% using recognition output, unigram language models, and tree-based clustering.
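The slides do not spell out the similarity computation; as a stand-in, here is a simple unigram (N = 1) cosine over word counts (real systems use smoothed, possibly higher-order N-gram models):

import math
from collections import Counter

def unigram_cosine(a, b):
    # a, b: tokenized messages; similarity of their unigram count vectors
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

m1 = "the storm moved north over the coast".split()
m2 = "a storm near the coast moved slowly".split()
print(unigram_cosine(m1, m2))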