Formal Multinomial and Multiple-Bernoulli Language Models
Don Metzler

Overview
Two formal estimation techniques:
– MAP estimates [Zaragoza, Hiemstra, Tipping, SIGIR'03]
– Posterior expectations
Language models considered:
– Multinomial
– Multiple-Bernoulli (two models)

Bayesian Framework (MAP Estimation)
– Assume textual data X (a document, query, etc.) is generated by sampling from some distribution P(X | θ) parameterized by θ
– Assume some prior P(θ) over θ
– For each X, we want the maximum a posteriori (MAP) estimate, shown below; θ_X is our (language) model for the data X
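The MAP objective itself appeared as an equation image on the original slide; a standard reconstruction from the surrounding text is:

$$ \hat{\theta}_X = \arg\max_{\theta} P(\theta \mid X) = \arg\max_{\theta} \frac{P(X \mid \theta)\, P(\theta)}{P(X)} = \arg\max_{\theta} P(X \mid \theta)\, P(\theta) $$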

Multinomial
Modeling assumptions:
– X is a sequence of terms, each drawn independently from a multinomial distribution parameterized by θ
– Place a Dirichlet prior over θ
Why Dirichlet?
– Conjugate prior to the multinomial
– Easy to work with

Multinomial
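The estimate on this slide was an equation image. A standard reconstruction under the assumptions above (terms drawn from a multinomial θ with a Dirichlet(α) prior): the posterior is again Dirichlet, and its mode gives the MAP estimate

$$ P(\theta \mid X) \propto \prod_{w \in V} \theta_w^{\,c(w,X) + \alpha_w - 1}, \qquad \hat{\theta}_w = \frac{c(w,X) + \alpha_w - 1}{|X| + \sum_{w' \in V} (\alpha_{w'} - 1)}, $$

where c(w, X) is the number of times term w occurs in X and |X| is the length of X.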

How do we set α?
– α = 1 => uniform prior => ML estimate
– α = 2 => Laplace smoothing
– Dirichlet-like smoothing: α_w = μ P(w | C) (a reconstruction of the resulting estimate follows below)
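The "Dirichlet-like smoothing" formula was lost with the slide image. One hedged reconstruction: with α_w = μ P(w | C) + 1, the MAP estimate above becomes the familiar Dirichlet-smoothed estimate

$$ \hat{\theta}_w = \frac{c(w,X) + \mu P(w \mid C)}{|X| + \mu}, $$

while the plots on the next slide use α_w = μ P(w | C) directly, which gives a closely related form.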

[Plots of the estimated multinomial for X = A B B B, with P(A | C) = 0.45 and P(B | C) = 0.55]
– left: ML estimate (α = 1)
– center: Laplace (α = 2)
– right: α = μ P(w | C), μ = 10
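A short Python sketch, not from the original slides, that reproduces these three estimates for the toy example; it assumes the vocabulary is just {A, B}:

```python
from collections import Counter

# Toy example from the slide: X = A B B B; assumes the vocabulary is just {A, B}.
X = ["A", "B", "B", "B"]
V = ["A", "B"]
p_C = {"A": 0.45, "B": 0.55}   # collection (background) probabilities
mu = 10

counts = Counter(X)
n = len(X)

def map_estimate(alpha):
    """MAP estimate for a multinomial with a Dirichlet(alpha) prior."""
    denom = n + sum(alpha[w] - 1 for w in V)
    return {w: (counts[w] + alpha[w] - 1) / denom for w in V}

ml = map_estimate({w: 1.0 for w in V})                      # left: ML estimate (alpha = 1)
laplace = map_estimate({w: 2.0 for w in V})                 # center: Laplace (alpha = 2)
dirichlet_like = map_estimate({w: mu * p_C[w] for w in V})  # right: alpha_w = mu * P(w | C)

print(ml)              # {'A': 0.25, 'B': 0.75}
print(laplace)         # {'A': 0.333..., 'B': 0.666...}
print(dirichlet_like)  # {'A': 0.375, 'B': 0.625}
```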

Multiple-Bernoulli
Assume vocabulary V = A B C D. How do we model the text X = D B B D?
– In the multinomial model, we represent X as the sequence D B B D
– In the multiple-Bernoulli model, we represent X as the binary vector [0 1 0 1], denoting that terms B and D occur in X
– Each X is represented by a single binary vector

Multiple-Bernoulli (Model A)
Modeling assumptions:
– Each X is a single sample from a multiple-Bernoulli distribution parameterized by θ
– Use the conjugate prior (multiple-Beta)

Multiple-Bernoulli (Model A)
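The Model A estimate was an equation image. A standard reconstruction under a per-term Beta(α_w, β_w) prior: with x_w ∈ {0, 1} indicating whether term w occurs in X, the posterior for θ_w is Beta(α_w + x_w, β_w + 1 − x_w), and its mode gives the MAP estimate

$$ \hat{\theta}_w = \frac{x_w + \alpha_w - 1}{\alpha_w + \beta_w - 1}. $$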

Problems with Model A
– Ignores document length (this may be desirable in some applications)
– Ignores term frequencies
How to solve this?
– Model X as a collection of samples (one per word occurrence) from an underlying multiple-Bernoulli distribution
– Example: V = A B C D, X = B D D B
– Representation: { [0 1 0 0], [0 0 0 1], [0 0 0 1], [0 1 0 0] }

Multiple-Bernoulli (Model B)
Modeling assumptions:
– Each X is a collection (multiset) of indicator vectors sampled from a multiple-Bernoulli distribution parameterized by θ
– Use the conjugate prior (multiple-Beta)

Multiple-Bernoulli (Model B)
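The Model B estimate was likewise an equation image. A standard reconstruction: with c(w, X) the number of indicator vectors in which w occurs (its term frequency) and |X| the total number of samples (the document length), the posterior for θ_w is Beta(α_w + c(w,X), β_w + |X| − c(w,X)), giving the MAP estimate

$$ \hat{\theta}_w = \frac{c(w,X) + \alpha_w - 1}{|X| + \alpha_w + \beta_w - 2}. $$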

How do we set α, β?
– α = β = 1 => uniform prior => ML estimate
– But we want smoothed probabilities…
– One possibility: set α and β from the collection model P(w | C) and a smoothing parameter μ (a sketch of this choice follows below)
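The slide's "one possibility" was an equation image; one choice consistent with the smoothed plots on the next slide (an assumption here, not the slide's exact formula) is

$$ \alpha_w = \mu P(w \mid C) + 1, \qquad \beta_w = \mu\,(1 - P(w \mid C)) + 1, $$

which turns the Model B MAP estimate into

$$ \hat{\theta}_w = \frac{c(w,X) + \mu P(w \mid C)}{|X| + \mu}. $$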

Multiple-Bernoulli Model B
[Plots of the estimated model for X = A B B B, with P(A | C) = 0.45 and P(B | C) = 0.55]
– left: ML estimate (α = β = 1)
– center: smoothed (μ = 1)
– right: smoothed (μ = 10)

Another approach…
– Another way to formally estimate a language model is to take the expectation over the posterior, shown below
– This takes more uncertainty into account than the MAP estimate
– Because we chose conjugate priors, the integral can be evaluated analytically
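The expectation itself appeared as an equation image; in general form, with the closed-form result for the multinomial/Dirichlet case,

$$ \hat{\theta}_w = \mathbb{E}\,[\theta_w \mid X] = \int \theta_w\, P(\theta \mid X)\, d\theta = \frac{c(w,X) + \alpha_w}{|X| + \sum_{w' \in V} \alpha_{w'}}. $$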

Multinomial / Multiple-Bernoulli Connection
[Side-by-side equations lost with the slide images; panel labels: Multinomial, Multiple-Bernoulli, Dirichlet smoothing]
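Judging from the estimates derived above, the intended connection is presumably that under the smoothing choices discussed earlier both models reduce to the same Dirichlet-smoothed form

$$ \hat{\theta}_w = \frac{c(w,X) + \mu P(w \mid C)}{|X| + \mu}, $$

with the multinomial constraining the θ_w to sum to one and the multiple-Bernoulli treating each θ_w as an independent Bernoulli parameter.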

Bayesian Framework (Ranking)
Query likelihood
– estimate a model θ_D for each document D
– score document D by P(Q | θ_D)
– measures the likelihood of observing query Q given the model θ_D
KL-divergence
– estimate models for both the query and the document
– score document D by KL(θ_Q || θ_D)
– measures the "distance" between the two models
Predictive density
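A minimal Python sketch, not from the slides, of query-likelihood ranking with a Dirichlet-smoothed multinomial document model; the toy documents, query, and μ value are hypothetical:

```python
import math
from collections import Counter

def dirichlet_lm(doc_tokens, collection_probs, mu):
    """Dirichlet-smoothed multinomial language model for one document."""
    counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    def p(term):
        return (counts[term] + mu * collection_probs.get(term, 0.0)) / (doc_len + mu)
    return p

def query_likelihood(query_tokens, doc_tokens, collection_probs, mu):
    """Score a document by log P(Q | theta_D) under its smoothed model."""
    p = dirichlet_lm(doc_tokens, collection_probs, mu)
    # Skip query terms unseen in the whole collection to avoid log(0).
    return sum(math.log(p(t)) for t in query_tokens if collection_probs.get(t, 0.0) > 0)

# Hypothetical toy collection and query, purely for illustration.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
}
all_tokens = [t for d in docs.values() for t in d]
collection_probs = {t: c / len(all_tokens) for t, c in Counter(all_tokens).items()}

query = "cat mat".split()
ranked = sorted(docs, key=lambda d: query_likelihood(query, docs[d], collection_probs, mu=10),
                reverse=True)
print(ranked)  # documents ordered by descending query likelihood, e.g. ['d1', 'd2']
```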

Results

Conclusions
– Both estimation and smoothing can be achieved using Bayesian estimation techniques
– Little difference between the MAP and posterior-expectation estimates; the result mostly depends on μ
– Not much difference between the multinomial and multiple-Bernoulli language models
– Scoring the multinomial is cheaper
– No good reason to choose multiple-Bernoulli over multinomial in general