1 David Mimno Andrew McCallum
Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression David Mimno Andrew McCallum

2

3 Topic Models
Latent Dirichlet Allocation (Blei, Ng, Jordan, 2003): a Bayesian mixture model with multinomial components. For each document, a Dirichlet prior generates a per-document multinomial over topics; each word's topic is drawn from that multinomial, and the word itself is drawn from the per-topic multinomial over words.
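A minimal Python sketch of that generative story follows; the topic count, vocabulary size, document length, and hyperparameter values are arbitrary illustrative choices, not settings from the paper.

```python
import numpy as np

# Minimal sketch of the LDA generative process described above.
def generate_document(T=10, V=1000, alpha=0.1, beta=0.01, N=100, seed=0):
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet([beta] * V, size=T)        # per-topic multinomial over words
    theta = rng.dirichlet([alpha] * T)             # per-document multinomial over topics
    z = rng.choice(T, size=N, p=theta)             # a topic for each token
    w = [int(rng.choice(V, p=phi[t])) for t in z]  # each word drawn from its topic
    return w, z

words, topics = generate_document()
```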

4 Example Topics from CS Research Papers
models model hidden markov mixture
vector support machines kernel svm
fields extraction random conditional sequence
models conditional discriminative maximum entropy
speech recognition acoustic automatic features
carlo monte sampling chain markov
bias variance error cross estimator
reinforcement control agent rl search
language word words english statistical
expression gene data genes binding
software development computer design research
Here are some sample topics trained on abstracts written by people in the NIPS community (we'll see why that's significant later). The results are really interesting: we've identified both general methodological words and some very specific "tool" words. Also note that the frequency of these topics is not uniform: the general words are much more prevalent than the specific words.

5 Uses of topic models
Summarization & pattern discovery; a browsable interface to a document collection; identifying trends over time; discovering a low-dimensional representation for better generalization; expanding queries for information retrieval; semi-supervised learning; disambiguating words (“LDA”: Latent Dirichlet Allocation? Linear Discriminant Analysis?). There is a wide range of uses for topic models. Many previously published models have focused on modeling text together with some non-textual side information. Let's look at some examples.

6 Structured Topic Models
In addition to words, documents carry meta-data. Example: Author: Blei, Ng, Jordan; Year: 2003; Venue: JMLR; Cites: de Finetti 1975, Harman 1992.

7 Words and Images CorrLDA (Blei and Jordan, 2003)
Model can predict image captions and support image retrieval by ad hoc queries

8 Words and References Mixed Membership Model
(Erosheva, Fienberg, Lafferty, 2004) Model identifies influential papers by Nobel-winning authors that are not the most cited.

9 Words and Dates Topics over Time (ToT) (Wang and McCallum, 2006)
Unlike LDA, the model distinguishes between historical events that share similar vocabulary.

10 Words and Votes Group-Topic (Wang, Mohanty and McCallum, 2005)
Model discovers a group of “maverick Republicans,” including John McCain, specific to domestic issues.

11 Words and Named Entities
CorrLDA2 (Newman, Chemudugunta, Smyth, Steyvers, 2006) Model can predict names of entities given text of news articles.

12 Words and Metadata Supervised LDA (Blei and McAuliffe, 2007)
The side data are specified by a choice of exponential family distribution (e.g., Gaussian for ratings).

13 What do all these models have in common?
“Downstream” models: words and other data are generated conditioned on the topic. Other data = authors, dates, citations, entities, …

14 Problems with Downstream Models
Balancing influence of modalities requires careful tuning Strong independence assumptions among modalities may not be valid It can be difficult to force modalities to “share” topics

15 Another way to introduce “other data”
“Upstream” models: condition on the other data to select a topic, then generate each word conditioned on that topic.

16 Words and Authors Author-Topic
(Rosen-Zvi, Griffiths, Steyvers and Smyth, 2004) Model discovers topics along with prominent authors.

17 Words and Email Headers
Author-Recipient-Topic (McCallum, Corrada-Emmanuel and Wang, 2005) Model analyzes relationships using text as well as links, unlike traditional SNA.

18 Words and Experts Author-Persona-Topic (Mimno and McCallum, 2007)
Model distinguishes between topical “personas” of authors, helping to find reviewers for conference papers (“expert finding”).

19 Words and Citations Citation Influence
(Dietz, Bickel and Scheffer, 2007) Model can estimate the relative impact of different references on particular papers.

20 How do you create a new topic model?
Create a valid generative storyline
Write down a graphical model
Figure out an inference procedure
Write code
(Fix errors in math and code)

21 What type of model accounts for arbitrary, correlated inputs?
Naïve Bayes :: MaxEnt
HMM :: Linear-chain CRF
Downstream topic models :: ??
Conditional models take real-valued features, which can be used to encode discrete, categorical, and continuous inputs.

22 What type of model accounts for arbitrary, correlated inputs?
Naïve Bayes :: MaxEnt
HMM :: Linear-chain CRF
Downstream topic models :: THIS PAPER
Conditional models take real-valued features, which can be used to encode discrete, categorical, and continuous inputs.

23 THIS PAPER: Dirichlet-multinomial Regression (DMR)

24 THIS PAPER: Dirichlet-multinomial Regression (DMR)
This part is the same as LDA.

25 THIS PAPER: Dirichlet-multinomial Regression (DMR)
Features for each document. Real-valued vector of length F.

26 Encoding Features
Binary indicators: Does N. Wang appear as an author? Does the paper cite Li and Smith, 2006?
Log position within an interval (i.e., Beta sufficient statistics): year within a range.
Survey responses: 2 strongly for, 0 indifferent, -2 strongly against.
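For concreteness, here is an illustrative Python sketch of building a document's real-valued feature vector x. The helper name, the dictionary keys, and the year range are assumptions for the example, not details from the paper.

```python
import numpy as np

# Sketch: binary author and citation indicators plus log-position-in-interval
# (Beta sufficient statistics) for the publication year.
def encode_features(doc, author_index, citation_index, year_min=1987, year_max=2006):
    x = np.zeros(len(author_index) + len(citation_index) + 2)
    for a in doc["authors"]:                      # binary author indicators
        x[author_index[a]] = 1.0
    offset = len(author_index)
    for c in doc["citations"]:                    # binary citation indicators
        x[offset + citation_index[c]] = 1.0
    # Log position of the year within the interval (avoids log(0) at the edges).
    p = (doc["year"] - year_min + 0.5) / (year_max - year_min + 1)
    x[-2], x[-1] = np.log(p), np.log(1.0 - p)
    return x

doc = {"authors": ["N. Wang"], "citations": ["Li and Smith, 2006"], "year": 1999}
x = encode_features(doc, {"N. Wang": 0}, {"Li and Smith, 2006": 0})
```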

27 THIS PAPER: Dirichlet-multinomial Regression (DMR)
Dirichlet parameters from which θ is sampled. Document-specific: they depend on the document features x.

28 THIS PAPER: Dirichlet-multinomial Regression (DMR)
A vector of parameters for each topic; together they form a matrix of size T × F.

29 Topic parameters for feature “published in JMLR”
2.27 kernel, kernels, rational kernels, string kernels, fisher kernel
1.74 bounds, vc dimension, bound, upper bound, lower bounds
1.41 reinforcement learning, learning, reinforcement
1.40 blind source separation, source separation, separation, channel
1.37 nearest neighbor, boosting, nearest neighbors, adaboost
-1.12 agent, agents, multi agent, autonomous agents
-1.21 strategies, strategy, adaptation, adaptive, driven
-1.23 retrieval, information retrieval, query, query expansion
-1.36 web, web pages, web page, world wide web, web sites
-1.44 user, users, user interface, interactive, interface

30 Topic parameters for feature “published in UAI”
2.88 bayesian networks, bayesian network, belief networks
2.26 qualitative, reasoning, qualitative reasoning, qualitative simulation
2.25 probability, probabilities, probability distributions, uncertainty, symbolic, sketch, primal sketch, uncertain, connectionist
2.11 reasoning, logic, default reasoning, nonmonotonic reasoning
-1.29 shape, deformable, shapes, contour, active contour
-1.36 digital libraries, digital library, digital, library
-1.37 workshop report, invited talk, international conference, report
-1.50 descriptions, description, top, bottom, top bottom
nearest neighbor, boosting, nearest neighbors, adaboost

31 Topic parameters for feature “Loopy Belief Propagation (MWJ, 1999)”
2.04 bayesian networks, bayesian network, belief networks
1.42 temporal, relations, relation, spatio temporal, temporal relations
1.26 local, global, local minima, simulated annealing
1.20 probabilistic, inference, probabilistic inference
back propagation, propagation, belief propagation, message passing, loopy belief propagation
-0.62 input, output, input output, outputs, inputs
-0.65 responses, signal, response, signal processing
-0.67 neurons, spiking neurons, neural, neuron
-0.68 analysis, statistical, sensitivity analysis, statistical analysis
-0.78 number, size, small, large, sample, small number

32 Topic parameters for feature “published in 2005”
2.04 semi supervised, learning, data, active learning
1.94 ontology, semantic, ontologies, ontological, daml oil
1.50 web, web pages, web page, world wide web
1.48 search, search engines, search engine, ranking
1.27 collaborative filtering, filtering, preferences
-1.11 neural networks, neural network, network, networks
-1.16 knowledge, knowledge representation, knowledge base
-1.17 expert systems, expert, systems, expert system
-1.25 system, architecture, describes, implementation
-1.31 database, databases, relational, xml, queries

33 Topic parameters for feature “written by G.E. Hinton”
1.96 uncertainty, symbolic, sketch, primal sketch
1.70 expert systems, expert, systems, expert system
1.35 recognition, pattern recognition, character recognition
1.34 representation, representations, represent, representing, represented, compact representation
1.22 high dimensional, dimensional, dimensionality reduction
-0.67 systems, system, dynamical systems, hybrid
-0.77 theory, computational, computational complexity
-0.80 general, case, cases, show, special case
neurons, spiking neurons, neural, neuron
-0.99 visual, visual servoing, vision, active

34 Dirichlet-multinomial Regression (DMR)
The α parameter for each topic in a document is the exponentiated inner product of the features (x) and the parameters (λ) for that topic.
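In symbols, with $\mathbf{x}_d$ the feature vector of document $d$ and $\boldsymbol{\lambda}_t$ the parameter vector for topic $t$, this reads:

$$\alpha_{d,t} = \exp\!\left(\mathbf{x}_d^{\top} \boldsymbol{\lambda}_t\right)$$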

35 Dirichlet-multinomial Regression (DMR)
Given the α parameters, the rest of the model is a standard Bayesian mixture model (LDA).

36 DMR understood as an extension of Dirichlet Hyperparameter Optimization
Commonly, the Dirichlet prior on θ is symmetric and flat, so all topics are equally prominent a priori. Learning the Dirichlet parameters from data yields an asymmetric prior shared by all documents. In DMR, the features induce a different prior for each document, conditioned on those features.
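Roughly, the progression this slide describes can be summarized as (my notation, consistent with the formula above):

$$\text{symmetric: } \alpha_{d,t} = \alpha \qquad \text{learned from data: } \alpha_{d,t} = \alpha_t \qquad \text{DMR: } \alpha_{d,t} = \exp\!\left(\mathbf{x}_d^{\top} \boldsymbol{\lambda}_t\right)$$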

37 Training the model Given topic assignments (z), we can numerically optimize the parameters (λ). Given the parameters, we can sample topic assignments. … alternate between these two as needed. Note that this approach combines two off-the-shelf components: L-BFGS and a Gibbs sampler for LDA!
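For concreteness, here is a rough Python sketch of the λ-optimization half of that alternation, assuming n_dt holds the per-document topic counts from the current Gibbs state. The helper names, the σ² value, and the use of SciPy's finite-difference L-BFGS (rather than the analytic gradient) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize

# X: (D, F) document-feature matrix; n_dt: (D, T) topic counts from Gibbs sampling.
def neg_log_posterior(lam_flat, X, n_dt, sigma2=1.0):
    T = n_dt.shape[1]
    lam = lam_flat.reshape(T, X.shape[1])
    alpha = np.exp(X @ lam.T)                    # document-specific Dirichlet params, (D, T)
    alpha_sum = alpha.sum(axis=1)
    N_d = n_dt.sum(axis=1)                       # document lengths
    ll = (gammaln(alpha_sum) - gammaln(alpha_sum + N_d)
          + (gammaln(alpha + n_dt) - gammaln(alpha)).sum(axis=1)).sum()
    ll -= (lam ** 2).sum() / (2.0 * sigma2)      # Gaussian regularization on lambda
    return -ll

def optimize_lambda(lam, X, n_dt):
    res = minimize(neg_log_posterior, lam.ravel(), args=(X, n_dt), method="L-BFGS-B")
    return res.x.reshape(lam.shape)

# Toy usage with random data: 5 documents, 3 topics, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
n_dt = rng.integers(0, 20, size=(5, 3)).astype(float)
lam = optimize_lambda(np.zeros((3, 4)), X, n_dt)
```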

38 Log likelihood for topics and parameters
Dirichlet-multinomial distribution (aka Polya, DCM) with exp(…) instead of α
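The slide's equation is not in the transcript; presumably it is the Dirichlet-multinomial (Polya/DCM) log-likelihood of the topic assignments with $\exp(\mathbf{x}_d^{\top}\boldsymbol{\lambda}_t)$ in place of a fixed $\alpha$, which would read:

$$\log P(\mathbf{z} \mid \boldsymbol{\lambda}, \mathbf{x}) = \sum_{d} \left[ \log\Gamma\!\Big(\sum_t \alpha_{d,t}\Big) - \log\Gamma\!\Big(\sum_t \alpha_{d,t} + N_d\Big) + \sum_t \Big( \log\Gamma(\alpha_{d,t} + n_{d,t}) - \log\Gamma(\alpha_{d,t}) \Big) \right]$$

where $\alpha_{d,t} = \exp(\mathbf{x}_d^{\top}\boldsymbol{\lambda}_t)$, $N_d$ is the length of document $d$, and $n_{d,t}$ is the number of its tokens assigned to topic $t$.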

39 Log likelihood for topics and parameters
Gaussian regularization on parameters
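With the Gaussian prior on the parameters, the objective presumably gains a quadratic penalty (again a reconstruction, with $\sigma^2$ the prior variance):

$$\mathcal{L}(\boldsymbol{\lambda}) = \log P(\mathbf{z} \mid \boldsymbol{\lambda}, \mathbf{x}) - \sum_{t,f} \frac{\lambda_{t,f}^2}{2\sigma^2} + \text{const.}$$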

40 Gradient equation for a single λ parameter
The gradient is quite simple and can be plugged into the standard optimizer we already use for a MaxEnt classifier.
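As a sketch, differentiating the objective above with respect to a single $\lambda_{t,f}$ gives ($\Psi$ is the digamma function, $x_{d,f}$ the value of feature $f$ in document $d$):

$$\frac{\partial \mathcal{L}}{\partial \lambda_{t,f}} = \sum_{d} x_{d,f}\, \alpha_{d,t} \left[ \Psi\!\Big(\sum_{t'} \alpha_{d,t'}\Big) - \Psi\!\Big(\sum_{t'} \alpha_{d,t'} + N_d\Big) + \Psi(\alpha_{d,t} + n_{d,t}) - \Psi(\alpha_{d,t}) \right] - \frac{\lambda_{t,f}}{\sigma^2}$$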

41 Training the model Model likelihood (top) and held-out likelihood jump at the first round of parameter optimization and improve with subsequent rounds. We use 250 burn-in iterations, then maximize the parameters after every 50 iterations. Empirical likelihood (bottom) initially decreases as topics become more specific (more uniform weight across topics generally leads to higher EL); as the prior improves, the model starts predicting the right specific topics, so EL rises again.

42 Effect of different features on document-topic Dirichlet prior
Document features: 2003, JMLR, D. Blei, A. Ng, M. I. Jordan
2.10 Models model gaussian mixture generative
0.93 Bayesian inference networks network probabilistic
0.69 Classifier classifiers bayes classification naive
0.64 Probabilistic random conditional probabilities fields
0.61 Sampling sample monte carlo chain samples
Document features: 2004, JMLR, M. I. Jordan, F. Bach, K. Fukumizu
4.05 Kernel density kernels data parametric
2.06 Space dimensionality high reduction spaces
1.78 Learning machine learn learned reinforcement
1.50 Prediction regression bayes predictions naive
0.88 Problem problems solving solution solutions
The journal and one of the authors are constant across both papers. The model is more confident about the topic distribution of the second paper: Bach and Fukumizu combined have larger parameters in these topics than Ng and Jordan. 2004 might also have more weight on “kernels” than 2003, but probably not much.

43 Evaluation
Evaluation is difficult with topic models: there is an exponential number of possible topic assignments, over which we cannot marginalize. We use two methods on held-out documents.

44 Evaluation: Half-Held-Out Perplexity
Estimate a topic distribution θ from half of each test document (given the Dirichlet prior and the topic-word distributions), then estimate P(doc | model) on the rest of the document. Measures the quality of the topic-word distributions and the ability of the model to adapt quickly to local context.
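A minimal Python sketch of the final perplexity computation, assuming theta has already been estimated from the first half of a test document, phi is the learned topic-word matrix, and w_rest holds the word ids of the second half; all names are illustrative, not from the paper.

```python
import numpy as np

def half_heldout_perplexity(theta, phi, w_rest):
    p_w = theta @ phi[:, w_rest]                 # predictive prob. of each held-out token
    return float(np.exp(-np.mean(np.log(p_w))))

# Toy usage: 5 topics over a 50-word vocabulary.
rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(50), size=5)
theta = rng.dirichlet(np.ones(5))
print(half_heldout_perplexity(theta, phi, [3, 17, 42]))
```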

45 Evaluation: Empirical Likelihood
Empirical likelihood [Diggle & Gratton, 1984]: sample topic distributions θ from the Dirichlet prior, combine them with the topic-word distributions, and estimate P(doc | model) on the full test document. Measures the quality of the topic-word distributions and of the prior distribution.
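A minimal Python sketch of that estimate, assuming alpha_d is the document's Dirichlet prior (exp(x_d · λ_t) under DMR), phi the topic-word matrix, and w_doc the test document's word ids; the sample count is an arbitrary choice for the example.

```python
import numpy as np

def empirical_log_likelihood(alpha_d, phi, w_doc, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    thetas = rng.dirichlet(alpha_d, size=n_samples)      # topic dist'ns drawn from the prior
    log_p = np.log(thetas @ phi[:, w_doc]).sum(axis=1)   # log P(doc | theta_s) per sample
    m = log_p.max()                                       # log-mean-exp over samples
    return float(m + np.log(np.exp(log_p - m).mean()))
```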

46 Results: Emulating Author-Topic
(Chart comparing DMR, Author-Topic, and LDA.) By using author indicator features, we can emulate the Author-Topic model (Rosen-Zvi, Griffiths, Steyvers and Smyth, 2004).

47 Results: Emulating Author-Topic
(Chart comparing DMR, Author-Topic, and LDA.) DMR shows better perplexity than Author-Topic and similar empirical likelihood; LDA is worse on both metrics.

48 Results: Emulating Citation model
(Chart comparing DMR, Cite-Topic, and LDA.) Citation “authors” give much better perplexity results than author features. Empirical likelihood is similar, with DMR slightly worse; LDA is worse on both metrics.

49 Results: Emulating Topics over Time
(Chart comparing DMR, ToT, and LDA.) DMR shows better perplexity and empirical likelihood; LDA is worse on both metrics.

50 Arbitrary Queries
The previous evaluations measured the ability to predict words given features; we can also predict features given words.

51 Results: Predict Author Given Document
(Chart comparing DMR and Author-Topic.) DMR predicts the author better than Author-Topic.

52 Efficiency
Even with tricks to reduce sampling complexity from O(AT) to O(A + T), DMR is faster. As upstream models include more features, sampling grows more complicated because more variables must be sampled (e.g., an author and a topic for every token); DMR sampling is no more complicated than LDA. Training time: Author-Topic 40 mins, DMR 32 mins. At 100 topics, 38% of DMR training wall-clock time is sampling; the rest is parameter optimization. (A = average number of authors per doc, T = number of topics.) Optimization time could be improved by using a stochastic gradient method in place of L-BFGS.

53 Summary
DMR benefits: simple to implement (LDA + L-BFGS); simple to use (just add features! Toss in everything but the kitchen sink!); fast (faster sampling than Author-Topic); and expressive.

