Finding Scientific Topics, August 26 - 31, 2011

Topic Modeling
1. A document is a probabilistic mixture of topics.
2. A topic is a probability distribution over words.
3. The words are assumed known and the number of distinct words is fixed.
If there are T topics, the probability of a word w in a document is

  P(w) = Σ_{j=1}^{T} P(w | z = j) P(z = j).

Here {w} denotes words and {z} denotes topics. The conditional probability P(w | z) indicates which words are important to a topic. For a particular document, P(z), the distribution over topics, determines how these topics are mixed together to form the document.
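As a minimal illustration of this mixture (a sketch only; the topic and word distributions below are invented toy numbers, not taken from the slides):

```python
import numpy as np

# Toy example: T = 2 topics over a 3-word vocabulary {"gene", "brain", "data"}.
# All numbers here are invented for illustration.
vocab = ["gene", "brain", "data"]
p_z = np.array([0.7, 0.3])                     # P(z) for one document, sums to 1
p_w_given_z = np.array([[0.6, 0.1, 0.3],       # P(w | z = 1)
                        [0.1, 0.7, 0.2]])      # P(w | z = 2)

# P(w) = sum_j P(w | z = j) P(z = j): the document's word distribution
p_w = p_z @ p_w_given_z
for word, p in zip(vocab, p_w):
    print(f"P({word}) = {p:.3f}")
```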

An example of “soft classification”: a document is not assigned to only one topic (a single class). For each document, P(z) indicates which topics should be assigned to it, and with what weight. How should we think of, or visualize, P(z) and P(w | z)? What do we want to know, that is, what do we want to compute from the input data? Inputs: (a) a document, or a collection of documents, together with the words appearing in them, {w_1, …, w_n} (repetition allowed, perhaps after deleting unimportant words such as the articles ‘the’, ‘an’ and prepositions ‘on’, ‘of’); (b) the number of topics, T.

We want to know (compute) P(z) and P(w | z) for each topic z. There is one P(z) per document (so D of them, where D is the number of documents), and there are T distributions P(w | z = j). What form should P(z) and P(w | z) take? Multinomial distributions. What does that mean? Each P(z) is a non-negative vector with T components summing to 1, and each P(w | z) is a non-negative vector with W components summing to 1, where W is the vocabulary size. One possible solution is to estimate these parameters directly, for example by maximum likelihood with the EM algorithm (see the sketch below). Question: how many variables are there? Problem with this approach: local maxima and slow convergence.
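A hedged sketch of the bookkeeping (the sizes D, T, W are arbitrary toy values, not from the slides), which also answers the variable-count question:

```python
import numpy as np

D, T, W = 100, 20, 5000   # documents, topics, vocabulary size (toy values)

theta = np.full((D, T), 1.0 / T)   # theta[d] = P(z) for document d; each row sums to 1
phi = np.full((T, W), 1.0 / W)     # phi[j]   = P(w | z = j); each row sums to 1

# Free variables: each probability vector loses one degree of freedom to the sum-to-1 constraint.
n_free = D * (T - 1) + T * (W - 1)
print(f"free parameters: {n_free}")   # 100*19 + 20*4999 = 101,880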

Bayesian Approach
Estimate phi and theta indirectly via the following generative model (a Dirichlet-multinomial model):

  theta ~ Dirichlet(alpha)                  (topic proportions for a document)
  phi(j) ~ Dirichlet(beta)                  (word distribution for each topic j)
  z_i | theta ~ Multinomial(theta)          (a topic for each word position i)
  w_i | z_i, phi ~ Multinomial(phi(z_i))    (the observed word)

What does this generative model say? It describes the way the observed data are thought to have been generated. Where is the prior? The Dirichlet distributions on theta and phi are the priors, and alpha, beta are their hyperparameters. Idea: use the generative model to explain the input data.
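A minimal simulation of this generative story (a sketch assuming symmetric hyperparameters and toy sizes; none of the numbers come from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

T, V, n_words = 3, 5, 50          # topics, vocabulary size, words in one document (toy values)
alpha, beta = 1.0, 1.0            # symmetric Dirichlet hyperparameters

phi = rng.dirichlet(np.full(V, beta), size=T)   # phi[j] = P(w | z = j), one row per topic
theta = rng.dirichlet(np.full(T, alpha))        # theta  = P(z) for this document

z = rng.choice(T, size=n_words, p=theta)        # a topic for each word position
w = np.array([rng.choice(V, p=phi[j]) for j in z])   # the observed words

print("topic assignments:", z)
print("words:", w)
```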

The goal is to evaluate the posterior distribution

  P(z | w) = P(w, z) / Σ_z P(w, z),

which is difficult because the denominator cannot be computed: the sum runs over all T^W possible topic assignments. (Be clear about what the notations z and w stand for: the vector of topic assignments and the vector of observed words.) However, we do have, for the numerator,

  P(w, z) = P(w | z) P(z).

P(z | theta) = P(z_1, …, z_W | theta) = P(z_1 | theta) ⋯ P(z_W | theta)   (assuming conditional independence of the z_i given theta). P(z) itself is then obtained by integrating theta out against its Dirichlet prior.

This gives Equation 3 of the paper (with D = 1, a single document); Equation 2 can be obtained similarly.
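For reference, a reconstruction of that integral for a single document (assuming a symmetric Dirichlet(alpha) prior; n_j denotes the number of words assigned to topic j; this is what Equation 3 reduces to when D = 1):

```latex
P(z) = \int P(z \mid \theta)\, p(\theta \mid \alpha)\, d\theta
     = \int \prod_{i=1}^{W} \theta_{z_i} \cdot
       \frac{\Gamma(T\alpha)}{\Gamma(\alpha)^T} \prod_{j=1}^{T} \theta_j^{\alpha-1}\, d\theta
     = \frac{\Gamma(T\alpha)}{\Gamma(\alpha)^T}\,
       \frac{\prod_{j=1}^{T} \Gamma(n_j + \alpha)}{\Gamma(W + T\alpha)}
```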

The goal is to evaluate the posterior distribution P(z | w), which is difficult because the denominator cannot be computed. But what can we do with P(z | w)? Recall that our goal is to estimate theta (the topic proportions) and phi (the topics). Suppose we knew the true topic assignments (z_1, …, z_W); then theta could be estimated as

  theta_i = (number of words assigned to topic i) / (total number of words, W).

How about phi? (See the counting sketch below.)
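A hedged counting sketch: if the assignments were known, theta and phi could be estimated by relative frequencies. The word list is the same toy list used in the worked example later in the deck; the "true" assignment here is hypothetical.

```python
import numpy as np

T, V = 3, 3                                        # topics; vocabulary {0: A, 1: B, 2: C}
words = np.array([0, 0, 2, 1, 2, 0, 2, 0, 1, 1])   # toy word list: A, A, C, B, C, A, C, A, B, B
z = np.array([0, 0, 1, 1, 2, 0, 1, 2, 0, 0])       # hypothetical "true" assignment (topics 1..3 -> 0..2)

W = len(words)
theta_hat = np.bincount(z, minlength=T) / W        # theta_i = (# words with topic i) / W

phi_hat = np.zeros((T, V))
for w, j in zip(words, z):
    phi_hat[j, w] += 1
phi_hat /= phi_hat.sum(axis=1, keepdims=True)      # phi_j(w) = (# times w in topic j) / (# words in topic j)

print("theta_hat:", theta_hat)
print("phi_hat:\n", phi_hat)
```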

More About the Dirichlet Distribution
The Dirichlet distribution is defined on the standard simplex of T-component probability vectors, {theta : theta_i ≥ 0, Σ_i theta_i = 1}. The density function with parameters (alpha_1, …, alpha_T) is

  p(theta) = [Γ(alpha_1 + … + alpha_T) / (Γ(alpha_1) ⋯ Γ(alpha_T))] · theta_1^(alpha_1 - 1) ⋯ theta_T^(alpha_T - 1).

When alpha_i is close to zero, probability concentrates near theta_i = 0; when alpha_i is large (away from zero), probability moves away from theta_i = 0. (The slide showed an example with T = 3.)

More About the Dirichlet Distribution
The expected value and variance of each component theta_i are given by

  E[theta_i] = alpha_i / alpha_0   and   Var[theta_i] = alpha_i (alpha_0 - alpha_i) / (alpha_0^2 (alpha_0 + 1)),

where alpha_0 = alpha_1 + … + alpha_T.
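To see the effect of the alpha_i concretely, here is a small sketch (parameter values chosen arbitrarily) comparing samples from symmetric Dirichlet distributions:

```python
import numpy as np

rng = np.random.default_rng(1)

for a in (0.1, 1.0, 10.0):                       # small alpha: mass piles up near the simplex corners
    samples = rng.dirichlet(np.full(3, a), size=5)
    print(f"alpha = {a:>4}:")
    print(np.round(samples, 3))
```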

Suppose we knew the true topic assignments (z_1, …, z_W); then theta could be estimated as theta_i = (number of words assigned to topic i) / (total number of words, W), and similarly for phi. Of course, we do not know the ground truth, only the distribution P(z | w). We therefore need P(theta | z, w). By Bayes' rule,

  P(theta | z, w) ∝ P(z | theta) P(theta) ∝ ∏_j theta_j^(n_j + alpha - 1),

where n_j is the number of words assigned to topic j. Therefore P(theta | z, w) is itself another Dirichlet distribution, with parameters (n_1 + alpha, …, n_T + alpha). For a given z, what should the estimated theta be? Taking the posterior mean gives theta_hat_j = (n_j + alpha) / (W + T alpha). This gives Equation 6 (and Equation 7 similarly).

The point, of course, is that we do not know the exact topic assignment (z_1, …, z_W), only its distribution P(z | w). For a probability distribution P(x), the expectation of a function f can be estimated as

  E[f(X)] ≈ (1/N) Σ_{i=1}^{N} f(y_i),

where y_1, …, y_N are samples drawn from P(x). For example, we can use this formula to estimate the mean and variance of a distribution from its samples; more samples give a more accurate estimate on the right-hand side.
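A minimal Monte Carlo sketch of this idea (the target distribution and sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Estimate E[X] and Var[X] for a Gamma(2, 1) distribution from samples of increasing size.
for n in (100, 10_000, 1_000_000):
    y = rng.gamma(shape=2.0, scale=1.0, size=n)
    print(f"n = {n:>9}: mean ~ {y.mean():.4f} (true 2), var ~ {y.var():.4f} (true 2)")
```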

The goal is to evaluate the posterior distribution P(z | w), which is difficult because the denominator cannot be computed. Instead, we use Markov chain Monte Carlo (MCMC) to simulate it. What this means is that we draw samples from the distribution P(z | w); for each sample we generate, we obtain one estimate of theta and phi according to

  theta_hat_j = (n_j + alpha) / (W + T alpha)   and   phi_hat_j(w) = (n_j(w) + beta) / (n_j + V beta),

where n_j is the number of words assigned to topic j, n_j(w) is the number of times word w is assigned to topic j, and V is the vocabulary size.

Simulating P(z | w) using MCMC (Markov chain Monte Carlo). Much more on this later. The steps are:
1. Initialize the topic assignments (z_1, …, z_W), s = 0.
2. Repeat (say, three thousand iterations): for each i = 1, …, W, resample the current assignment z_i according to the probability

     P(z_i = j | z_{-i}, w) ∝ [(n_{-i,j}(w_i) + beta) / (n_{-i,j} + V beta)] · [(n_{-i,j} + alpha) / (W - 1 + T alpha)],

   where n_{-i,j}(w_i) is the number of times word w_i is assigned to topic j (excluding position i), n_{-i,j} is the total number of words assigned to topic j (excluding position i), and V is the vocabulary size. One cycle through all i gives a new topic assignment (z_1, …, z_W), s = s + 1.
3. Generate samples: keep the assignments from later iterations.
What does the formula say? The first factor favors topics under which the word w_i is already common, and the second favors topics that are already common in the document. (A runnable sketch follows below.)
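A minimal single-document collapsed Gibbs sampler matching the steps above (a sketch under the symmetric-prior assumption, not the authors' code; the function name and signature are my own):

```python
import numpy as np

def gibbs_lda_single_doc(words, T, V, alpha, beta, n_iters=3000, seed=0):
    """Collapsed Gibbs sampling for one document; `words` are integer word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    words = np.asarray(words)
    W = len(words)
    z = rng.integers(0, T, size=W)                    # step 1: random initial assignments
    n_jw = np.zeros((T, V))                           # n_jw[j, w]: times word w is assigned to topic j
    n_j = np.zeros(T)                                 # n_j[j]: total words assigned to topic j
    for w, j in zip(words, z):
        n_jw[j, w] += 1
        n_j[j] += 1

    for _ in range(n_iters):                          # step 2: repeated sweeps over all positions
        for i, w in enumerate(words):
            j_old = z[i]
            n_jw[j_old, w] -= 1                       # remove word i from the counts
            n_j[j_old] -= 1
            # full conditional P(z_i = j | z_-i, w), up to normalization
            p = (n_jw[:, w] + beta) / (n_j + V * beta) * (n_j + alpha) / (W - 1 + T * alpha)
            j_new = rng.choice(T, p=p / p.sum())
            z[i] = j_new                              # record the new assignment
            n_jw[j_new, w] += 1
            n_j[j_new] += 1

    # step 3: point estimates in the style of Equations 6 and 7
    phi_hat = (n_jw + beta) / (n_j[:, None] + V * beta)
    theta_hat = (n_j + alpha) / (W + T * alpha)
    return z, theta_hat, phi_hat
```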

Example. Suppose there are T = 3 topics, there are 3 words (A, B, C) in the dictionary, and the word list has 10 words (W = 10). Take alpha = beta = 1. The word list is {A, A, C, B, C, A, C, A, B, B}, with the initial topic assignment {1, 1, 2, 2, 3, 1, 2, 3, 1, 1}. How do we apply the formula? The first word is A; in the word list, A has been assigned to topics 1, 1, 1, and 3. Evaluating the full conditional gives P(z_1 | z_{-1}, w) = (0.5159, …, …); sampling from it gives a new value, say z_1 = 3, so the assignment becomes {3, 1, 2, 2, 3, 1, 2, 3, 1, 1}. Next, compute P(z_2 | z_{-2}, w) and sample a new z_2 value, and so on. What are the effects of alpha and beta? (They act as a prior, whose influence fades when W is large.)
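Running the sketch above on this toy data (using the hypothetical `gibbs_lda_single_doc` defined earlier; the output depends on the random seed):

```python
# Toy data from the slide: dictionary {A, B, C} mapped to word ids {0, 1, 2}
words = [0, 0, 2, 1, 2, 0, 2, 0, 1, 1]     # A, A, C, B, C, A, C, A, B, B
z, theta_hat, phi_hat = gibbs_lda_single_doc(words, T=3, V=3, alpha=1.0, beta=1.0)
print("final assignment:", z + 1)          # +1 so topics read 1..3, as on the slide
print("theta_hat:", theta_hat)
print("phi_hat:\n", phi_hat)
```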

Summary. The goal is to infer theta from the data, where theta is itself a distribution. The Dirichlet distribution is a prior distribution on theta (a Bayesian approach); it is therefore a distribution on a space of distributions. No particular parametric form is assumed for theta (a nonparametric flavor). The base probability space here is finite and discrete, X = {1, …, T}. Things become much more complicated when X is no longer discrete, for example when X is the set of real numbers; that requires more sophisticated mathematical language, which will be the goal of the next two to three weeks.