Combining Statistical Language Models via the Latent Maximum Entropy Principle — Shaojun Wang, Dale Schuurmans, Fuchun Peng, Yunxin Zhao

Abstract
Simultaneously incorporate various aspects of natural language:
– Local word interaction, syntactic structure, semantic document information
Latent maximum entropy (LME) principle:
– Allows relationships over hidden features to be effectively captured in a unified model
– Local lexical models (N-gram models)
– Global document-level semantic models (PLSA)

Introduction
There are various kinds of language models that can be used to capture different aspects of natural language regularity:
– Markov chain (N-gram) models effectively capture local lexical regularities in text
– Smoothed N-grams: better estimation of rare events
– Increasing the order of an N-gram to capture longer-range dependencies in natural language leads to the curse of dimensionality (Rosenfeld)

Introduction
– Structural language models effectively exploit relevant syntactic regularities to improve the perplexity of N-gram models
– Semantic language models exploit document-level semantic regularities to achieve similar improvements
Although each of these language models outperforms simple N-grams, each only captures specific linguistic phenomena.

Introduction
Several techniques exist for combining language models:
– Linear interpolation: each individual model is trained separately and then combined by a weighted linear combination (see the sketch after this slide)
– Maximum entropy (ME): models distributions over explicitly observed features
However, natural language contains much hidden semantic and syntactic information.
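A minimal sketch of the linear-interpolation scheme mentioned above; the component models, weights, and function names are illustrative assumptions, not the configuration evaluated in the paper.

```python
# Minimal sketch of linear interpolation of component language models.
# The component models and weights below are illustrative assumptions.

def interpolate(prob_fns, weights, word, history):
    """Weighted linear combination of component LM probabilities.

    prob_fns : list of functions p_k(word, history) -> probability
    weights  : non-negative interpolation weights summing to 1
    """
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * p(word, history) for w, p in zip(weights, prob_fns))

# Example usage with two toy component models (hypothetical stand-ins):
trigram = lambda w, h: 0.2    # stand-in for a trained trigram model
plsa_lm = lambda w, h: 0.05   # stand-in for a document-level semantic model
p = interpolate([trigram, plsa_lm], [0.7, 0.3], "bank", ("the",))
```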

Introduction
The latent maximum entropy (LME) principle extends ME to incorporate latent variables.
Let X ∈ 𝒳 denote the complete data, Y ∈ 𝒴 the observed incomplete data, and Z ∈ 𝒵 the missing data; thus X = (Y, Z).
The goal of ME is to find a probability model that matches certain constraints in the observed data while otherwise maximizing entropy.

Maximum Entropy
Rather than constructing separate models:
– Build a single, combined model which attempts to capture all the information provided by the various knowledge sources
The intersection of all the constraints, if not empty, contains a (typically infinite) set of probability functions, all consistent with the knowledge sources.
The second step in the ME approach is to choose, from among the functions in that set, the one with the highest entropy (i.e., the "flattest" function).

An example
Assume we wish to estimate P("BANK" | h). One estimate may be provided by a conventional bigram.
Features: [table not reproduced]

An example
Consider one such equivalence class, say, the one where the history ends in "THE" (example histories: "… where is the", "… does the", "… … the").
The bigram assigns the same probability estimate to all events in that class:
P(BANK | h) = K_{THE,BANK} for every history h ending in "THE"   [1]

An example
Another estimate may be provided by a particular trigger pair, say (LOAN, BANK):
– Assume we want to capture the dependency of "BANK" on whether or not "LOAN" occurred before it in the same document. Thus a different partition of the event space will be added, as in Table IV.
– Similarly to the bigram case, consider now one such equivalence class, say, the one where "LOAN" did occur in the history. The trigger component assigns the same probability estimate to all events in that class.

Example history containing "LOAN": "… loan … the … loan … of"
Features: [table not reproduced]

An example
Consider the bigram. Under ME, we no longer insist that P(BANK | h) always have the same value K_{THE,BANK} whenever the history ends in "THE".
Rather, we only require that P_COMBINED(BANK | h) equal K_{THE,BANK} on average in the training data.
Equation [1] is therefore replaced by
E[ P_COMBINED(BANK | h) ] = K_{THE,BANK}, averaged over training histories ending in "THE"   [2]
where E stands for an expectation.

An example
Similarly, we require that P_COMBINED(BANK | h) equal K_{BANK, LOAN ∈ h} on average over those histories that contain occurrences of "LOAN":
E[ P_COMBINED(BANK | h) ] = K_{BANK, LOAN ∈ h}, averaged over training histories containing "LOAN"   [3]

Information sources as constraint functions
We can view each information source as defining a subset (or many subsets) of the event space (h, w).
For any subset S of the event space and any desired expectation K, we can impose the constraint
Σ_{(h,w) ∈ S} P(h, w) = K   [4]
The subset S can be specified by an index function, also called a selector function, f_S:
f_S(h, w) = 1 if (h, w) ∈ S, and 0 otherwise.

Information sources as constraint functions
So Equation [4] becomes
Σ_{(h,w)} P(h, w) f_S(h, w) = K   [5]
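To make the selector-function idea concrete, here is a small sketch computing indicator features and their empirical target expectations; the corpus format, feature choices, and function names are illustrative assumptions.

```python
# Sketch of selector (indicator) features and their empirical expectations,
# in the spirit of Equations [4]/[5]. Corpus format is an assumption.

def f_bigram_the_bank(history, word):
    """1 if the history ends in 'the' and the next word is 'bank'."""
    return 1.0 if history and history[-1] == "the" and word == "bank" else 0.0

def f_trigger_loan_bank(history, word):
    """1 if 'loan' occurred anywhere in the history and the next word is 'bank'."""
    return 1.0 if "loan" in history and word == "bank" else 0.0

def empirical_expectation(feature, corpus):
    """Average feature value over (history, word) events in the corpus."""
    events = list(corpus)
    return sum(feature(h, w) for h, w in events) / len(events)

# Example usage on a toy corpus (hypothetical data):
corpus = [(("i", "went", "to", "the"), "bank"),
          (("the", "loan", "from", "the"), "bank"),
          (("she", "sat", "by", "the"), "river")]
K_the_bank = empirical_expectation(f_bigram_the_bank, corpus)  # a target K for [5]
```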

Maximum Entropy
The ME principle (Jaynes, 1957; Kullback, 1959) can be stated as follows:
1. Reformulate the different information sources as constraints to be satisfied by the target (combined) estimate
2. Among all probability distributions that satisfy these constraints, choose the one that has the highest entropy

Maximum Entropy
Given a general event space {x}, to derive a combined probability function P(x), each constraint i is associated with a constraint function f_i(x) and a desired expectation K_i.
The constraint is then written as
Σ_x P(x) f_i(x) = K_i   [6]
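Putting the pieces together, the full ME program implied by these slides is the standard constrained entropy maximization below; the normalization and non-negativity conditions are added for completeness, and H(P) is my notation for entropy.

```latex
% Maximum entropy: among all distributions satisfying the constraints [6],
% pick the one of highest entropy.
\begin{aligned}
P^{*} \;=\; \arg\max_{P}\; & H(P) \;=\; -\sum_{x} P(x)\,\log P(x) \\
\text{subject to}\quad & \sum_{x} P(x)\, f_i(x) \;=\; K_i, \qquad i = 1,\dots,N, \\
& \sum_{x} P(x) \;=\; 1, \qquad P(x) \ge 0 .
\end{aligned}
```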

LME
Given features f_1, …, f_N specifying the properties we would like to match in the data, select a joint model p* from the set of possible probability distributions that maximizes the entropy.

LME
Intuitively, the constraints specify that we require the expectations of f_i(X) in the joint model to match their empirical expectations on the incomplete data Y.
– f_i(X) = f_i(Y, Z)
When the features depend only on the observable data Y, LME is equivalent to ME.
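The formula itself is not reproduced in the transcript; based on the published LME formulation, the program described here has roughly the following form, where p̃(y) denotes the empirical distribution of the observed incomplete data (notation assumed).

```latex
% Latent maximum entropy (sketch): maximize joint entropy subject to
% constraints whose right-hand sides are computed on the incomplete data.
\begin{aligned}
p^{*} \;=\; \arg\max_{p}\; & H(p) \;=\; -\sum_{x} p(x)\,\log p(x) \\
\text{subject to}\quad & \sum_{x} p(x)\, f_i(x)
  \;=\; \sum_{y} \tilde{p}(y) \sum_{z} p(z \mid y)\, f_i(y, z),
  \qquad i = 1,\dots,N .
\end{aligned}
```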

Regularized LME (RLME)
The ME principle is subject to errors due to the empirical data, especially in a very sparse domain:
– Add a penalty to the entropy of the joint model
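The regularized program is likewise not shown on the slide; one way to write it, following the regularized-LME idea of relaxing each constraint with an error term a_i penalized by a convex function U (an assumption about the exact form used), is:

```latex
% Regularized LME (sketch): constraints hold only up to errors a_i,
% and the errors are penalized by a convex function U.
\begin{aligned}
\max_{p,\,a}\;\; & H(p) \;-\; U(a) \\
\text{subject to}\quad
& \sum_{x} p(x)\, f_i(x)
  \;-\; \sum_{y} \tilde{p}(y) \sum_{z} p(z \mid y)\, f_i(y, z)
  \;=\; a_i, \qquad i = 1,\dots,N .
\end{aligned}
```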

A Training Algorithm
Assume we have already selected the features f_1, …, f_N from the training data.
Restrict p(x) to be an exponential model.
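The exponential form is not reproduced on the slide; the standard log-linear parameterization, with one weight λ_i per feature (this λ notation is assumed here and reused below), is:

```latex
% Log-linear (exponential-family) model with one weight per feature.
p_{\lambda}(x) \;=\; \frac{1}{Z_{\lambda}}\,
  \exp\!\Big( \sum_{i=1}^{N} \lambda_i\, f_i(x) \Big),
\qquad
Z_{\lambda} \;=\; \sum_{x} \exp\!\Big( \sum_{i=1}^{N} \lambda_i\, f_i(x) \Big).
```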

A Training Algorithm
This is intimately related to finding locally maximum a posteriori (MAP) solutions:
– Given a penalty function U over the errors a, an associated prior U* on the parameters λ can be obtained by setting U* to the convex conjugate of U
– e.g., given a quadratic penalty, the convex conjugate is also quadratic, which specifies a Gaussian prior on λ
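As a worked instance of this conjugacy, assuming a per-feature quadratic penalty with scale parameters σ_i (my notation):

```latex
% Quadratic penalty on the constraint errors and its convex conjugate.
\begin{aligned}
U(a) &= \sum_{i=1}^{N} \frac{a_i^{2}}{2\sigma_i^{2}}, \\[2pt]
U^{*}(\lambda) &= \sup_{a}\big(\lambda^{\top} a - U(a)\big)
  \;=\; \sum_{i=1}^{N} \frac{\sigma_i^{2}\lambda_i^{2}}{2}, \\[2pt]
p(\lambda) &\;\propto\; \exp\!\big(-U^{*}(\lambda)\big),
  \quad\text{i.e.\ an independent Gaussian prior on each } \lambda_i .
\end{aligned}
```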

A Training Algorithm
– Then, given a prior U*, the standard MAP estimate maximizes the penalized log-likelihood R(λ)
Our key result is that locally maximizing R(λ) is equivalent to satisfying the feasibility constraints (2) of the RLME principle.
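Under the log-linear parameterization above, the penalized log-likelihood presumably takes the form (my notation):

```latex
% Penalized (MAP) log-likelihood of the incomplete data.
R(\lambda) \;=\; L(\lambda) \;-\; U^{*}(\lambda),
\qquad
L(\lambda) \;=\; \sum_{y} \tilde{p}(y)\,
  \log \sum_{z} p_{\lambda}(y, z).
```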

A Training Algorithm
THEOREM 1.
– Under the log-linear assumption, locally maximizing the posterior probability of log-linear models on incomplete data is equivalent to satisfying the feasibility constraints of the RLME principle.
That is, the only distinction between MAP and RLME in log-linear models is that, among local maxima (feasible solutions), RLME selects the model with the maximum entropy, whereas MAP selects the model with the maximum posterior probability.

A Training Algorithm
R-EM-IS employs an EM algorithm as an outer loop, but uses a nested GIS/IIS algorithm to perform the internal M step.
Decompose the penalized log-likelihood function R(λ); this is the standard decomposition used for deriving EM.
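The decomposition is not reproduced in the transcript; the standard EM decomposition of R(λ) around a current iterate λ′ (my notation) is:

```latex
% Standard EM decomposition of the penalized log-likelihood.
\begin{aligned}
R(\lambda) &= Q(\lambda, \lambda') \;+\; H(\lambda, \lambda') \;-\; U^{*}(\lambda), \\[2pt]
Q(\lambda, \lambda') &= \sum_{y} \tilde{p}(y) \sum_{z} p_{\lambda'}(z \mid y)\,
  \log p_{\lambda}(y, z), \\[2pt]
H(\lambda, \lambda') &= -\sum_{y} \tilde{p}(y) \sum_{z} p_{\lambda'}(z \mid y)\,
  \log p_{\lambda}(z \mid y).
\end{aligned}
% Since H(\lambda, \lambda') \ge H(\lambda', \lambda') (Jensen's inequality),
% increasing Q(\lambda, \lambda') - U^{*}(\lambda) increases R(\lambda).
```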

A Training Algorithm For log-linear models:
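(The formula is not reproduced on the slide.) Substituting the log-linear form from above, the Q term presumably reduces to a standard ME objective whose target expectations come from the E step (my derivation):

```latex
% Q under the log-linear model: linear in lambda, minus the log-partition term.
Q(\lambda, \lambda') \;=\;
  \sum_{i=1}^{N} \lambda_i \Big( \sum_{y} \tilde{p}(y) \sum_{z}
    p_{\lambda'}(z \mid y)\, f_i(y, z) \Big)
  \;-\; \log Z_{\lambda}.
```

Maximizing this in λ is exactly a maximum-entropy fitting problem with "pseudo-empirical" feature expectations, which is why a nested GIS/IIS loop can serve as the M step.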

A Training Algorithm LEMMA 1.

R-EM-IS algorithm
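The algorithm box itself is not reproduced here; the sketch below shows only the loop structure described in the text (outer EM iterations with a nested GIS/IIS M step). The function names, initial weights, and default iteration counts (taken from the experimental-design slide: 5 EM iterations, 20 IIS iterations) are illustrative assumptions, not the authors' code.

```python
# Sketch of the R-EM-IS loop structure: an outer EM loop whose M step is
# performed approximately by a nested iterative-scaling (GIS/IIS) loop.
# e_step and iis_update are assumed callables supplied by the caller.

def r_em_is(lambdas, e_step, iis_update, em_iters=5, iis_iters=20):
    """Return updated feature weights after em_iters outer iterations.

    e_step(lambdas)              -> target feature expectations computed by
                                    completing the missing data under p(z|y)
    iis_update(lambdas, targets) -> one iterative-scaling update of the
                                    weights toward the target expectations
    """
    for _ in range(em_iters):
        targets = e_step(lambdas)       # E step: complete the missing data
        for _ in range(iis_iters):      # M step: nested GIS/IIS iterations
            lambdas = iis_update(lambdas, targets)
    return lambdas
```

RLME would then run such a loop from several random starting points and keep the feasible solution with the highest entropy, whereas MAP would keep the one with the highest penalized likelihood (cf. Theorem 1).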

THEOREM 2.

R-ME-EM-IS algorithm

Combining N-gram and PLSA Models

Tri-gram portion

PLSA portion

Efficient Feature Expectation and Inference
Computed via the sum-product algorithm (formulas not reproduced):
– Normalization constant
– Feature expectations

Semantic Smoothing
Add a node C as a word cluster between each topic node and word node:
– |C| = 1: maximum smoothing
– |C| = |V|: no smoothing (|V| is the vocabulary size)
Add a node S as a document cluster between the topic node and document node:
– |S| = 1: over-smoothed
– |S| = |D|: no smoothing (|D| is the number of documents)
[Figure not reproduced: chains W – C – T and T – S – D showing where the cluster nodes are inserted]

Computation in Testing

Experimental Evaluation
Training data:
– NAB: documents, 1987~1989, 38M words
– Vocabulary size |V| is 20000, the most frequent words of the training data
Testing data:
– words, 1989
Evaluation: perplexity
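The evaluation formula is not shown in the transcript; the standard per-word perplexity, which is presumably what the results slides report, is:

```latex
% Per-word perplexity of a model P on a test sequence w_1 ... w_M.
\mathrm{PPL} \;=\; \exp\!\Big( -\frac{1}{M} \sum_{m=1}^{M}
  \log P(w_m \mid w_1, \dots, w_{m-1}) \Big).
```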

Experimental design
– Baseline: tri-gram with Good-Turing (GT) smoothing; perplexity is 105
– R-EM-IS procedure: 5 EM iterations, 20 IIS iterations per loop
– Feasible solution: initialize the parameters to zero and execute a single run of R-EM-IS
– RLME and MAP: use 20 random starting points for the parameters

Simple tri-gram
Since there are no hidden variables, MAP, RLME, and a single run of R-EM-IS all reduce to the same standard ME principle.
The perplexity score is 107.

Tri-gram + PLSA
[Results table not reproduced: perplexity as a function of the number of topics |T|]

Add word cluster

Add topic cluster
[Results table not reproduced; reported perplexities include 90 and 87]

Add word and topic clusters (1/2)
[Results table not reproduced; reported perplexities include 82]

Add word and topic clusters (2/2)

Experiment summary

Extensions LSA: Perplexity = 97

Extensions
Raising the LSA portion's contribution to some power and renormalizing: perplexity = 82, equal to the best result obtained using RLME.
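As an illustration only (the exact combination is not given in the transcript), the "raise to a power and renormalize" idea can be written as follows, with γ the assumed exponent on the LSA/semantic component:

```latex
% Exponent-weighted combination of an N-gram model and an LSA/semantic
% component, renormalized over the vocabulary (illustrative form only).
P(w \mid h, d) \;=\;
  \frac{P_{\text{ngram}}(w \mid h)\,\big[P_{\text{lsa}}(w \mid d)\big]^{\gamma}}
       {\sum_{w'} P_{\text{ngram}}(w' \mid h)\,\big[P_{\text{lsa}}(w' \mid d)\big]^{\gamma}} .
```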