Minimum Rank Error Training for Language Modeling
Meng-Sung Wu
Department of Computer Science and Information Engineering
National Cheng Kung University, Tainan, TAIWAN

Contents
- Introduction
- Language Model for Information Retrieval
- Discriminative Language Model
- Average Precision versus Classification Accuracy
- Evaluation of IR Systems
- Minimum Rank Error Training
- Experiments
- Summary and Discussion

Introduction
- Language modeling provides linguistic constraints on a text sequence W and is based on statistical N-gram language models.
- Speech recognition systems are conventionally evaluated by the word error rate.
- Discriminative learning methods: maximum mutual information (MMI) and minimum classification error (MCE).
- Classification error rate, however, is not a suitable metric for measuring the rank of an input document.

Language Model for Information Retrieval

Standard Probabilistic IR (diagram): an information need is expressed as a query, which is matched against documents d1, d2, …, dn in the document collection.

IR based on LM (diagram): each document d1, d2, …, dn in the collection is treated as generating the query that expresses the information need.

Language Models
- Mathematical model of text generation; particularly important for speech recognition, information retrieval, and machine translation.
- N-gram models (unigram, bigram, trigram) are commonly used to estimate word probabilities.
- An N-gram model is equivalent to an (N-1)th-order Markov model.
- Estimates must be smoothed, e.g., by interpolating combinations of N-gram estimates.
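As an illustration of interpolation smoothing, the sketch below combines maximum-likelihood bigram and unigram estimates with a fixed weight. The weight lam and the helper names are assumptions made for the example, not something the slides specify.

```python
from collections import Counter

def train_interpolated_bigram(tokens, lam=0.7):
    """Minimal sketch: interpolate ML bigram and unigram estimates."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def prob(w_prev, w):
        p_uni = unigrams[w] / total
        p_bi = bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
        return lam * p_bi + (1 - lam) * p_uni   # interpolated estimate

    return prob
```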

Using Language Models in IR
- Treat each document as the basis for a model (e.g., unigram sufficient statistics).
- Rank document d based on P(d | q), where P(d | q) = P(q | d) x P(d) / P(q).
- P(q) is the same for all documents, so it can be ignored.
- P(d), the prior, is often treated as the same for all d, but criteria such as authority, length, or genre could be used.
- P(q | d) is the probability of q given d's model.
- A very general formal approach.
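A minimal sketch of query-likelihood scoring, assuming Dirichlet smoothing against a background collection model; the smoothing choice and the parameter mu are assumptions made for illustration, not something the slides commit to.

```python
import math
from collections import Counter

def query_likelihood(query, doc_tokens, coll_counts, coll_len, mu=2000):
    """Score log P(q | d) with a Dirichlet-smoothed document model (assumed here)."""
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for w in query:
        p_coll = coll_counts[w] / coll_len                       # background model
        p_doc = (doc_counts[w] + mu * p_coll) / (doc_len + mu)   # smoothed P(w | d)
        score += math.log(max(p_doc, 1e-12))                     # guard against log(0)
    return score
```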

Using Language Models in IR
- Principle 1: Document D defines a language model P(w | M_D); query Q is a sequence of words q_1, q_2, …, q_n (unigrams); matching score is P(Q | M_D).
- Principle 2: Document D defines a language model P(w | M_D); query Q defines a language model P(w | M_Q); matching is a comparison between P(. | M_D) and P(. | M_Q).
- Principle 3: Translate D to Q.

Problems
- Limitation to unigrams: no dependence between words.
- Problems with bigrams: all adjacent word pairs are considered (noise), more distant dependencies cannot be captured, and word order is not always important for IR.
- Entirely data-driven, no external knowledge (e.g., programming -> computer).
- Direct comparison between D and Q: despite smoothing, D and Q must contain identical words (except in the translation model), so synonymy and polysemy cannot be handled.

Discriminative Language Model

Minimum Classification Error
- The advent of powerful computing devices and the success of statistical approaches led to a renewed pursuit of more powerful methods to reduce the recognition error rate.
- Although MCE-based discriminative training is rooted in classical Bayes decision theory, instead of reducing the classification task to a distribution-estimation problem it takes a discriminant-function-based statistical pattern-classification approach.
- For a given family of discriminant functions, optimal classifier/recognizer design involves finding a set of parameters that minimizes the empirical pattern recognition error rate.

Minimum Classification Error LM
- Discriminant function (formula not preserved in the transcript).
- MCE classifier design is based on three steps:
  - Misclassification measure: compares the score of the target hypothesis with the scores of the competing hypotheses.
  - Loss function.
  - Expected loss.
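The equations on this slide were lost in the transcript. As a reference point, the standard MCE formulation of Juang and Katagiri, which these three steps typically follow, can be sketched as below; the notation is assumed rather than copied from the slide.

```latex
\begin{align*}
&\text{Discriminant function: } g_i(X;\Lambda) \\[4pt]
&\text{Misclassification measure: }
  d_i(X;\Lambda) = -g_i(X;\Lambda)
  + \log\Bigg[\frac{1}{M-1}\sum_{j \neq i}\exp\big(\eta\, g_j(X;\Lambda)\big)\Bigg]^{1/\eta} \\[4pt]
&\text{Loss function: }
  \ell_i(X;\Lambda) = \frac{1}{1+\exp\big(-\gamma\, d_i(X;\Lambda)\big)} \\[4pt]
&\text{Expected loss: }
  L(\Lambda) = E_X\Big[\textstyle\sum_i \ell_i(X;\Lambda)\,\mathbf{1}(X \in C_i)\Big]
\end{align*}
```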

The MCE approach has several advantages in classifier design:
- It is meaningful in the sense of minimizing the empirical recognition error rate of the classifier.
- If the true class posterior distributions are used as discriminant functions, the asymptotic behavior of the classifier approximates the minimum Bayes risk.

Average Precision versus Classification Accuracy

Example: two rankings with the same classification accuracy (50.0%) but different average precision (62.2% vs. 52.0%); the number of relevant documents is 10. (The recall-precision table on this slide is not preserved in the transcript.)

Evaluation of IR Systems

Measures of Retrieval Effectiveness
- Precision and recall
- Single-valued P/R measures
- Significance tests

Precision and Recall
- Precision: the proportion of the retrieved set that is relevant.
  Precision = |relevant ∩ retrieved| / |retrieved| = P(relevant | retrieved)
- Recall: the proportion of all relevant documents in the collection that are included in the retrieved set.
  Recall = |relevant ∩ retrieved| / |relevant| = P(retrieved | relevant)
- Precision and recall are well defined for sets.
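A small set-based helper makes the definitions concrete; the function name and inputs are purely illustrative.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall
```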

Average Precision
- A single-number effectiveness measure is often wanted, e.g., for a machine-learning algorithm to detect improvement.
- Average precision is widely used in IR: the average of the precision values at the ranks of the relevant documents, calculated by averaging precision each time recall increases.
- Example: with 5 relevant documents, two rankings give AvgPrec = 62.2% and AvgPrec = 52.0%. (The recall-precision table on this slide is not preserved in the transcript.)
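A compact sketch of the computation; the input convention (a 0/1 relevance flag per rank) is an assumption made for the example.

```python
def average_precision(ranked_relevance, num_relevant):
    """Average of precision at the ranks where relevant documents occur."""
    hits, ap = 0, 0.0
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            ap += hits / k          # precision at rank k
    return ap / num_relevant if num_relevant else 0.0

# e.g. average_precision([1, 0, 1, 1, 0, 1, 0, 0, 1, 0], num_relevant=5)
```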

trec_eval demo (sample output; most numeric values were lost in the transcript)
Queryid (Num): 225
Total number of documents over all queries: Retrieved, Relevant: 1838, Rel_ret: 1110
Interpolated Recall-Precision Averages at recall levels 0.0 through 1.0
Average precision (non-interpolated) for all relevant documents, averaged over queries
Precision at 5, 10, 15, 20, 30, 100, 200, 500, and 1000 documents
R-Precision (precision after R (= num_rel for a query) documents retrieved): Exact

Significance tests
- System A beats system B on one query: is it just a lucky query for system A? Maybe system B does better on some other query, so as many queries as possible are needed.
- Empirical research suggests 25 queries is the minimum needed; TREC tracks generally aim for at least 50 queries.
- If systems A and B are identical on all but one query and A beats B by enough on that one query, the average will make A look better than B.

Sign Test Example
- For methods A and B, compare the average precision for each pair of results generated by the queries in the test collection.
- If the difference is large enough, count it as + or -; otherwise ignore it.
- Use the number of +'s and the number of significant differences to determine the significance level.
- E.g., over 40 queries, method A produced a better result than B 12 times, B was better than A 3 times, and 25 were the same: p < 0.05, so method A is significantly better than B.
- If A > B 18 times and B > A 9 times, p > 0.05 and A is not significantly better than B at the 5% level.
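One way to compute the two-sided sign-test p-value for counts like these (ties dropped beforehand); the helper is illustrative.

```python
from math import comb

def sign_test_p(wins_a, wins_b):
    """Two-sided sign-test p-value under the binomial null (ties already dropped)."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(sign_test_p(12, 3))   # about 0.035 -> significant at the 5% level
print(sign_test_p(18, 9))   # about 0.12  -> not significant at the 5% level
```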

Wilcoxon Test
- Compute the differences.
- Rank the differences by absolute value.
- Sum the + ranks and the - ranks separately.
- Two-tailed test: T = min(sum of + ranks, sum of - ranks).
- Reject the null hypothesis if T < T_0, where T_0 is found in a table.

Wilcoxon Test Example
Sum of + ranks = 44, sum of - ranks = 11, so T = 11; T_0 = 8 (from table). Since T > T_0, the conclusion is: not significant. (The per-query table with columns A, B, diff, rank, and signed rank is not preserved in the transcript.)
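In practice the test can be run with SciPy; the per-query average-precision values below are made up purely to show the call.

```python
from scipy.stats import wilcoxon

# Hypothetical per-query average-precision scores for two systems.
ap_a = [0.42, 0.55, 0.31, 0.62, 0.48, 0.39, 0.57, 0.44, 0.50, 0.36]
ap_b = [0.40, 0.51, 0.35, 0.60, 0.47, 0.41, 0.52, 0.43, 0.49, 0.38]

stat, p_value = wilcoxon(ap_a, ap_b)   # two-sided paired signed-rank test
print(f"T = {stat}, p = {p_value:.3f}")
```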

Minimum Rank Error Training

Document ranking principle
A ranking algorithm aims at estimating a scoring function. The problem can be described as follows:
- Two disjoint sets S_R (relevant) and S_I (irrelevant) are given.
- A ranking function f(x) assigns a score value to each document d of the document collection.
- d_i ≻ d_j denotes that d_i is ranked higher than d_j.
- The objective function (formula not preserved in the transcript).

Document ranking principle
There are different ways to measure the ranking error of a scoring function f. A natural criterion is the proportion of misordered pairs over the total number of pairs; this criterion is an estimate of the probability of misordering a pair.
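A direct way to compute that proportion for one query; the two score lists are assumed inputs for the illustration.

```python
def pairwise_ranking_error(relevant_scores, irrelevant_scores):
    """Fraction of (relevant, irrelevant) pairs that the scoring function misorders."""
    pairs = [(r, i) for r in relevant_scores for i in irrelevant_scores]
    misordered = sum(1 for r, i in pairs if r <= i)   # relevant not scored higher
    return misordered / len(pairs) if pairs else 0.0
```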

Document ranking principle
The total distance measure is defined as: (formula not preserved in the transcript).

Illustration of the metric of average precision

Intuition and Theory
- Precision is the ratio of relevant documents retrieved to documents retrieved at a given rank.
- Average precision is the average of precision at the ranks of relevant documents, where r is the number of returned documents and s_k is the relevance of the document at rank k.
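With these symbols, average precision is conventionally written as below; this is a standard form, since the slide's own equation is not preserved.

```latex
\mathrm{AP} \;=\; \frac{1}{\sum_{k=1}^{r} s_k}\,
  \sum_{k=1}^{r} s_k \cdot \frac{1}{k}\sum_{j=1}^{k} s_j ,
\qquad s_k \in \{0,1\}
```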

Discriminative ranking algorithms
Maximizing the average precision is tightly related to minimizing the following ranking error loss (formula not preserved in the transcript).

Discriminative ranking algorithms
Similar to the MCE algorithm, the ranking loss function L_AP is expressed as a differentiable objective: the error count n_ir is approximated by a differentiable loss function.
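The slide's formula is missing; a common differentiable (sigmoid) approximation of such a pairwise misordering count, written in the notation above, looks like the following sketch rather than the paper's exact form.

```latex
n_{ir} \;\approx\; \frac{1}{1+\exp\!\big(-\gamma\,[\,f(d_i)-f(d_r)\,]\big)},
\qquad d_r \in S_R,\; d_i \in S_I
```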

Discriminative ranking algorithms
The differentiation of the ranking loss function turns out to be: (formula not preserved in the transcript).

Discriminative ranking algorithms
We use a bigram language model as an example. Using the steepest-descent algorithm, the parameters of the language model are adjusted iteratively (update formula not preserved in the transcript).
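Schematically, the iterative update follows the usual steepest-descent pattern; everything in this sketch (names, the form of grad_fn, the learning rate) is an assumption for illustration rather than the paper's exact procedure.

```python
def mre_training(theta, grad_fn, train_queries, lr=0.1, epochs=10):
    """Generic steepest-descent loop: theta <- theta - lr * dL/dtheta.
    theta: language-model parameters (e.g., bigram log-probabilities) as a list;
    grad_fn(theta, q): gradient of the smoothed ranking loss for query q."""
    for _ in range(epochs):
        for q in train_queries:
            grad = grad_fn(theta, q)
            theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta
```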

Experiments

Experimental Setup
We evaluated our model with two different TREC collections:
- Wall Street Journal 1987 (WSJ87)
- Associated Press Newswire 1988 (AP88)

Language Modeling
We used the WSJ87 dataset as training data for language-model estimation and the AP88 dataset as test data. (The MRE parameter settings and the perplexity table comparing ML and MRE for unigram and bigram models are not preserved in the transcript.)

Experiments on Information Retrieval
- Two query sets and the corresponding relevant documents are used with this collection: one set of TREC topics as training queries and another set as test queries (topic numbers not preserved in the transcript).
- Queries were sampled from the 'title' and 'description' fields of the topics.
- The ML language model is used as the baseline system.
- To test the significance of improvements, the Wilcoxon test was employed in the evaluation.

Comparison of Average Precision (columns: Collection, ML, MRE, Improvement, Wilcoxon)
- WSJ87: ML, MRE, and improvement values not preserved; Wilcoxon p = 0.0163*
- AP88: ML, MRE, and improvement values not preserved; Wilcoxon p = 0*
(* denotes a statistically significant improvement)

Comparison of Precision at Document Level (columns: Documents Retrieved, ML (I), MCE (II), MRE (III), Wilcoxon III vs. I, Wilcoxon III vs. II; rows: 5, 10, 15, 20, 30, 100, 200, 500, 1000 docs, and R-Precision). Most numeric values are not preserved in the transcript; the surviving Wilcoxon p-values (e.g., 0.0449, 0.0447, 0.0330, 0.0413, 0.0096, all marked *) indicate significant improvements of MRE at several cutoffs.

Summary

- Ranking learning requires considering non-relevance information.
- We will extend this method to spoken document retrieval.
- Future work will focus on the area under the ROC curve (AUC).

References
M. Collins, "Discriminative reranking for natural language parsing", in Proc. 17th International Conference on Machine Learning (ICML), 2000.
J. Gao, H. Qi, X. Xia and J.-Y. Nie, "Linear discriminant model for information retrieval", in Proc. ACM SIGIR, 2005.
D. Hull, "Using statistical testing in the evaluation of retrieval experiments", in Proc. ACM SIGIR, 1993.
B.-H. Juang, W. Chou and C.-H. Lee, "Minimum classification error rate methods for speech recognition", IEEE Trans. Speech and Audio Processing, 1997.
B.-H. Juang and S. Katagiri, "Discriminative learning for minimum error classification", IEEE Trans. Signal Processing, vol. 40, no. 12, 1992.
H.-K. J. Kuo, E. Fosler-Lussier, H. Jiang and C.-H. Lee, "Discriminative training of language models for speech recognition", in Proc. ICASSP, 2002.
R. Nallapati, "Discriminative models for information retrieval", in Proc. ACM SIGIR, 2004.
J. M. Ponte and W. B. Croft, "A language modeling approach to information retrieval", in Proc. ACM SIGIR, 1998.
J.-N. Vittaut and P. Gallinari, "Machine learning ranking for structured information retrieval", in Proc. 28th European Conference on IR Research (ECIR), 2006.

Thank You for Your Attention