
Statistical Methods for Mining Big Text Data. ChengXiang Zhai, Department of Computer Science, Graduate School of Library & Information Science, Institute for Genomic Biology, and Department of Statistics, University of Illinois at Urbana-Champaign. ADC PhD School in Big Data, The University of Queensland, Brisbane, Australia, July 14, 2014.

Rapid Growth of Text Information: WWW, blogs/tweets, literature, desktop, intranet, … How to help people manage and exploit all the information?

Text Information Systems Applications: Access (select information), Mining (create knowledge), and Organization (add structure/annotations). How to connect users with the right information at the right time? How to discover patterns in text and turn text data into actionable knowledge? The latter question (mining) is the focus of this tutorial.

Goal of the Tutorial: a brief introduction to the emerging area of applying statistical topic models to text mining (TM). Target audience: –Practitioners working on developing intelligent text information systems who are interested in learning about cutting-edge text mining techniques –Researchers who are looking for new research problems in text data mining, information retrieval, and natural language processing. The emphasis is on basic concepts, principles, and major application ideas; the material is accessible to anyone with basic knowledge of probability and statistics. Check out David Blei's tutorials on this topic for more complete coverage of advanced topic models.

5 Outline 1.Background -Text Mining (TM) -Statistical Language Models 2.Basic Topic Models -Probabilistic Latent Semantic Analysis (PLSA) -Latent Dirichlet Allocation (LDA) -Applications of Basic Topic Models to Text Mining 3.Advanced Topic Models -Capturing Topic Structures -Contextualized Topic Models -Supervised Topic Models 4.Summary We are here

What is Text Mining? Data Mining View: Explore patterns in textual data –Find latent topics –Find topical trends –Find outliers and other hidden patterns Natural Language Processing View: Make inferences based on partial understanding of natural language text –Information extraction –Question answering 6

Applications of Text Mining. Direct applications: –Discovery-driven (bioinformatics, business intelligence, etc.): We have specific questions; how can we exploit data mining to answer them? –Data-driven (WWW, literature, customer reviews, etc.): We have a lot of data; what can we do with it? Indirect applications: –Assist information access (e.g., discover major latent topics to better summarize search results) –Assist information organization (e.g., discover hidden structures to link scattered information)

Text Mining Methods. Data Mining Style: View text as high-dimensional data –Frequent pattern finding –Association analysis –Outlier detection. Information Retrieval Style: Fine-granularity topical analysis –Topic extraction –Exploit term weighting and text similarity measures. Natural Language Processing Style: Information extraction –Entity extraction –Relation extraction –Sentiment analysis. Machine Learning Style: Unsupervised or semi-supervised learning (the focus of this tutorial) –Mixture models –Dimension reduction

9 Outline 1.Background -Text Mining (TM) -Statistical Language Models 2.Basic Topic Models -Probabilistic Latent Semantic Analysis (PLSA) -Latent Dirichlet Allocation (LDA) -Applications of Basic Topic Models to Text Mining 3.Advanced Topic Models -Capturing Topic Structures -Contextualized Topic Models -Supervised Topic Models 4.Summary We are here

What is a Statistical Language Model? A probability distribution over word sequences: p("Today is Wednesday"), p("Today Wednesday is"), and p("The eigenvalue is positive") all receive different probabilities, and the distribution is context-dependent! A language model can also be regarded as a probabilistic mechanism for "generating" text, and is thus also called a "generative" model.

Why is a LM Useful? Provides a principled way to quantify the uncertainties associated with natural language Allows us to answer questions like: –Given that we see “ John ” and “ feels ”, how likely will we see “ happy ” as opposed to “ habit ” as the next word? (speech recognition) –Given that we observe “baseball” three times and “game” once in a news article, how likely is it about “sports”? (text categorization, information retrieval) –Given that a user is interested in sports news, how likely would the user use “baseball” in a query? (information retrieval) 11

Source-Channel Framework for "Traditional" Applications of SLMs. Source → Transmitter (encoder) → Noisy Channel → Receiver (decoder) → Destination: the source emits X with probability P(X), the channel transforms it into Y with probability P(Y|X), and the receiver recovers X' = argmax_X P(X|Y) = argmax_X P(Y|X)P(X) (Bayes Rule). When X is text, p(X) is a language model. Many examples: Speech recognition: X = word sequence, Y = speech signal; Machine translation: X = English sentence, Y = Chinese sentence; OCR error correction: X = correct word, Y = erroneous word; Information retrieval: X = document, Y = query; Summarization: X = summary, Y = document. This tutorial is about another type of application of SLMs (i.e., topic mining).

The Simplest Language Model (Unigram Model). Generate a piece of text by generating each word INDEPENDENTLY. Thus, p(w_1 w_2 … w_n) = p(w_1) p(w_2) … p(w_n). Parameters: {p(w_i)}, with p(w_1) + … + p(w_N) = 1 (N is the vocabulary size). A piece of text can be regarded as a sample drawn according to this word distribution, e.g., p("today is Wed") = p("today") p("is") p("Wed").
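As a concrete illustration, a minimal Python sketch of scoring a sequence under a unigram LM (the toy vocabulary and probabilities below are assumed values, not from the tutorial):

```python
import math

# Hypothetical unigram LM: p(w) for a toy vocabulary (values are made up).
unigram = {"today": 0.0005, "is": 0.05, "wed": 0.0001, "the": 0.06}

def unigram_log_prob(words, model):
    """log p(w_1 ... w_n) = sum_i log p(w_i) under a unigram LM."""
    return sum(math.log(model[w]) for w in words)

print(unigram_log_prob(["today", "is", "wed"], unigram))
```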

Text Generation with a Unigram LM. A (unigram) language model θ specifies p(w|θ). For example, Topic 1 ("Text mining"): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food (very small probability), …; Topic 2 ("Health"): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …. Sampling from θ generates a document d (e.g., a text mining paper or a food nutrition paper). Given θ, p(d|θ) varies according to d; given d, p(d|θ) varies according to θ.

Estimation of a Unigram LM. Language model p(w|θ) = ? Document d (total #words = 100): text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, …. The Maximum Likelihood (ML) estimator (maximizing the probability of observing document d) gives normalized counts: p(text|θ) = 10/100, p(mining|θ) = 5/100, p(association|θ) = p(database|θ) = 3/100, …, p(query|θ) = 1/100, ….
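A minimal sketch of this estimator in Python (function and variable names are illustrative):

```python
from collections import Counter

def ml_unigram(tokens):
    """Maximum-likelihood unigram estimate: p(w|theta) = c(w, d) / |d|."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Toy document with 19 tokens: p(text|theta) = 10/19, p(mining|theta) = 5/19, ...
doc = ["text"] * 10 + ["mining"] * 5 + ["association"] * 3 + ["query"]
print(ml_unigram(doc))
```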

Maximum Likelihood vs. Bayesian. Maximum likelihood estimation: –"Best" means "data likelihood reaches maximum" –Problem: small sample. Bayesian estimation: –"Best" means being consistent with our "prior" knowledge and explaining the data well –Problem: how to define the prior? In general, we consider the distribution of θ, so a point estimate can be obtained in potentially multiple ways (e.g., posterior mean vs. posterior mode).

Illustration of Bayesian Estimation. Prior: p(θ). Likelihood: p(X|θ), with X = (x_1, …, x_N). Posterior: p(θ|X) ∝ p(X|θ) p(θ). θ_0: prior mode; θ_ml: ML estimate; θ: posterior mode.

Computation of the Maximum Likelihood Estimate. Data: a document d with counts c(w_1, d), …, c(w_N, d) and length |d|. Model: a unigram LM with parameters θ = {θ_i}, θ_i = p(w_i|θ). Maximize the log-likelihood subject to Σ_i θ_i = 1 using the Lagrange multiplier approach; setting the partial derivatives to zero gives the ML estimate = normalized counts, θ_i = c(w_i, d)/|d|.
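Spelled out, the derivation the slide refers to is the standard constrained maximization:

```latex
\[
\log p(d\mid\theta) = \sum_{i=1}^{N} c(w_i,d)\,\log\theta_i ,
\qquad \text{s.t.}\;\; \sum_{i=1}^{N}\theta_i = 1 .
\]
\[
\mathcal{L}(\theta,\lambda) = \sum_{i=1}^{N} c(w_i,d)\,\log\theta_i
  + \lambda\Big(\sum_{i=1}^{N}\theta_i - 1\Big),
\qquad
\frac{\partial\mathcal{L}}{\partial\theta_i}
  = \frac{c(w_i,d)}{\theta_i} + \lambda = 0
\;\Rightarrow\;
\hat\theta_i = \frac{c(w_i,d)}{\sum_{j=1}^{N} c(w_j,d)} = \frac{c(w_i,d)}{|d|}.
\]
```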

Computation of the Bayesian Estimate. ML estimator: θ̂ = argmax_θ p(d|θ). Bayesian estimator: –First consider the posterior p(θ|d) ∝ p(d|θ) p(θ) –Then consider the mean or mode of the posterior distribution. Here p(d|θ) is the sampling distribution (of the data) and p(θ) = p(θ_1, …, θ_N) is our prior on the model parameters. A conjugate prior can be interpreted as "extra"/"pseudo" data; the Dirichlet distribution is the conjugate prior for the multinomial sampling distribution, and its parameters act as "extra"/"pseudo" word counts.

Computation of the Bayesian Estimate (cont.). The posterior distribution of the parameters is again a Dirichlet, and the posterior mean estimate adds the pseudo counts to the observed counts. Compare this with the ML estimate: each word gets an unequal number of extra "pseudo counts" based on the prior, and the denominator grows by the total "pseudo counts" for all words.
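The formulas this slide refers to (standard Dirichlet-multinomial conjugacy):

```latex
\[
p(\theta) = \mathrm{Dir}(\theta\mid\alpha_1,\dots,\alpha_N)
  \propto \prod_{i=1}^{N}\theta_i^{\alpha_i-1},
\qquad
p(\theta\mid d) \propto \prod_{i=1}^{N}\theta_i^{\,c(w_i,d)+\alpha_i-1}
  = \mathrm{Dir}\big(\theta\mid c(w_1,d)+\alpha_1,\dots,c(w_N,d)+\alpha_N\big)
\]
\[
\hat\theta_i^{\text{(posterior mean)}} = \frac{c(w_i,d)+\alpha_i}{|d|+\sum_{j=1}^{N}\alpha_j}
\qquad\text{vs.}\qquad
\hat\theta_i^{\text{(ML)}} = \frac{c(w_i,d)}{|d|}
\]
```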

Unigram LMs for Topic Analysis. Different text samples induce different unigram LMs: a background LM p(w|θ_B) estimated from general background English text is dominated by common words ("the", "a", "is", "we", …); a collection LM p(w|θ_C) estimated from, e.g., computer science papers gives higher probability to words such as "computer" and "software"; and a document LM p(w|θ_d) estimated from a single text mining paper gives high probability to "text" (0.04), "mining", "association" (0.03), "clustering", ….

Unigram LMs for Association Analysis. What words are semantically related to "computer"? Estimate a topic LM p(w|"computer") from all the documents containing the word "computer", and a background LM p(w|θ_B) from general background English text; both are dominated by common words ("the" 0.03, "a" 0.02, "is", "we", …). The normalized topic LM p(w|"computer")/p(w|θ_B) ranks related words at the top (computer 400, software 150, program 104, …, text 3.0, …) and pushes common words to the bottom (the 1.1, a 0.99, is 0.9, we 0.8).
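A minimal Python sketch of this normalization (the two toy distributions are assumed values chosen so the ratios match the example above):

```python
# Rank words by how much more likely they are under the topic LM than under
# the background LM; content words rise to the top, function words drop.
topic_lm      = {"computer": 0.03,     "software": 0.012,   "text": 0.003, "the": 0.033}
background_lm = {"computer": 0.000075, "software": 0.00008, "text": 0.001, "the": 0.03}

normalized = {w: topic_lm[w] / background_lm[w] for w in topic_lm}
for w, score in sorted(normalized.items(), key=lambda kv: -kv[1]):
    print(f"{w:10s} {score:8.1f}")
```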

More Sophisticated LMs. Mixture of unigram language models (the focus of this tutorial): –Assume multiple unigram LMs are involved in generating the text data –Estimating the multiple unigram LMs "discovers" (recovers) latent topics in the text. Other sophisticated LMs (see [Jelinek 98, Manning & Schutze 99, Rosenfeld 00]): –N-gram language models: p(w_1 w_2 … w_n) = p(w_1) p(w_2|w_1) … p(w_n|w_1 … w_{n-1}) –Remote-dependence language models (e.g., maximum entropy models) –Structured language models (e.g., probabilistic context-free grammars)

Evaluation of SLMs. Direct evaluation criterion: How well does the model fit the data to be modeled? –Example measures: data likelihood, perplexity, cross entropy, Kullback-Leibler divergence (mostly equivalent). Indirect evaluation criterion: Does the model help improve the performance of the task? –The specific measure is task dependent –For this tutorial's purposes, we look at whether a model is effective for a text mining task –We hope an "improvement" of an LM would lead to better task performance.

25 Outline 1.Background -Text Mining (TM) -Statistical Language Models 2.Basic Topic Models -Probabilistic Latent Semantic Analysis (PLSA) -Latent Dirichlet Allocation (LDA) -Applications of Basic Topic Models to Text Mining 3.Advanced Topic Models -Capturing Topic Structures -Contextualized Topic Models -Supervised Topic Models 4.Summary We are here

Document as a Sample of Mixed Topics. Example word distributions: Topic θ_1 (government 0.3, response, …), Topic θ_2 (donate 0.1, relief 0.05, help, …), …, Topic θ_k (city 0.2, new 0.1, orleans, …), Background θ_B (is 0.05, the 0.04, a, …). Example document: "[Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] … [Over seventy countries pledged monetary donations or other assistance]. …" How can we discover these topic word distributions? Many applications would be enabled by discovering such topics: –Summarize themes/aspects –Facilitate navigation/browsing –Retrieve documents –Segment documents –Many other text mining tasks

Simplest Case: 1 topic + 1 "background". The document LM p(w|θ_d) estimated from a text mining paper mixes content words (text 0.04, mining, association 0.03, clustering, …) with common words ("the", "a", …), while the background LM p(w|θ_B) estimated from general background English text is dominated by common words (the 0.03, a 0.02, is, we, …). How can we "get rid of" the common words from the topic to make it more discriminative? Assume the words in d are drawn from two distributions: one topic θ_d plus one background θ_B (rather than just one).

The Simplest Case: One Topic + One Background Model. Each word w in document d is generated by first making a topic choice: with probability λ the word is a background word drawn from p(w|θ_B), and with probability 1 − λ it is a topic word drawn from p(w|θ). Assume p(w|θ_B) and λ are known, where λ = the assumed percentage of background words in d; p(w|θ) is then estimated by maximum likelihood.

Understanding a Mixture Model. Known background p(w|θ_B): the 0.2, a 0.1, we 0.01, to 0.02, …, text, mining, …. Unknown query topic p(w|θ) = ? ("Text mining"): text = ?, mining = ?, association = ?, word = ?, …. Suppose each model is selected with equal probability, λ = 0.5. The probability of observing the word "text" is λ p("text"|θ_B) + (1 − λ) p("text"|θ) = 0.5 · p("text"|θ_B) + 0.5 · p("text"|θ), and similarly for "the". The probability of observing "the" & "text" (the likelihood) is [0.5 · p("text"|θ_B) + 0.5 · p("text"|θ)] × [0.5 · p("the"|θ_B) + 0.5 · p("the"|θ)]. How should we set p("the"|θ) and p("text"|θ) so as to maximize this likelihood? Assuming p("the"|θ) + p("text"|θ) = constant, we should give p("text"|θ) a higher probability than p("the"|θ) (why? because the background already explains "the" well). θ_B and θ are competing to explain the words in document d!

Simplest Case Continued: How to Estimate θ? Known background p(w|θ_B): the 0.2, a 0.1, we 0.01, to 0.02, …; unknown query topic p(w|θ) = ? ("Text mining"): text = ?, mining = ?, association = ?, word = ?, …; the two components are mixed with weights 0.7 and 0.3. Suppose we knew the identity/label (background or topic) of each observed word: the ML estimator would then simply normalize the counts of the words labeled as topic words.

Can We Guess the Identity? Introduce an identity ("hidden") variable for each word occurrence: z_i ∈ {1 (background), 0 (topic)}, e.g., for "the paper presents a text mining algorithm …". Suppose the parameters are all known; what's a reasonable guess of z_i? It depends on λ (why?) and on p(w|θ_B) and p(w|θ) (how?). Initially, set p(w|θ) to some random values, then iterate between the E-step (guessing the z_i) and the M-step (re-estimating p(w|θ) from the guesses).

An Example of EM Computation (assume λ = 0.5). Expectation-step: augment the data by guessing the hidden variables, i.e., compute the probability that each word came from the background vs. the topic. Maximization-step: with the "augmented data", estimate the parameters using maximum likelihood. Iterate until convergence.
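A minimal runnable sketch of these two steps for the one-topic + one-background mixture (the toy counts and background probabilities below are assumed values):

```python
import numpy as np

vocab  = ["the", "text", "mining"]
counts = np.array([4.0, 2.0, 2.0])        # c(w, d): toy document counts (assumed)
p_bg   = np.array([0.5, 0.05, 0.05])      # p(w | theta_B): known background (assumed)
lam    = 0.5                              # probability of choosing the background

theta = np.ones(len(vocab)) / len(vocab)  # uniform init of p(w | theta)
for _ in range(50):
    # E-step: for each word, probability that it was generated by the background
    p_z_bg = lam * p_bg / (lam * p_bg + (1 - lam) * theta)
    # M-step: re-estimate the topic LM from the counts attributed to the topic
    topic_counts = counts * (1 - p_z_bg)
    theta = topic_counts / topic_counts.sum()

print(dict(zip(vocab, theta.round(3))))   # "text"/"mining" end up with most of the mass
```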

Discover Multiple Topics in a Collection. "Generating" a word w in document d in the collection: with probability λ_B choose the background model θ_B (is 0.05, the 0.04, a, …); otherwise choose one of the k topics, e.g., θ_1 (warning 0.3, system, …), θ_2 (aid 0.1, donation 0.05, support, …), …, θ_k (statistics 0.2, loss 0.1, dead, …), according to the topic coverage π_{d,1}, π_{d,2}, …, π_{d,k} of document d, and then draw w from the chosen word distribution. Parameters: Λ = (λ_B, {π_{d,j}}, {θ_j}), where λ_B is the percentage of background words, π_{d,j} is the coverage of topic θ_j in doc d, and p(w|θ_j) is the probability of word w in topic θ_j; all can be estimated using the ML estimator.
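Written out in formulas, the word-level mixture and the collection log-likelihood implied by this generative process are:

```latex
\[
p_d(w) = \lambda_B\, p(w\mid\theta_B)
  + (1-\lambda_B)\sum_{j=1}^{k}\pi_{d,j}\, p(w\mid\theta_j),
\qquad
\log p(C\mid\Lambda) = \sum_{d\in C}\sum_{w\in V} c(w,d)\,\log p_d(w).
\]
```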

Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI) [Hofmann 99a, 99b] Mix k multinomial distributions to generate a document Each document has a potentially different set of mixing weights which captures the topic coverage When generating words in a document, each word may be generated using a DIFFERENT multinomial distribution (this is in contrast with the document clustering model where, once a multinomial distribution is chosen, all the words in a document would be generated using the same multinomial distribution) By fitting the model to text data, we can estimate (1) the topic coverage in each document, and (2) word distribution for each topic, thus achieving “topic mining” 34

How to Estimate Multiple Topics? (Expectation-Maximization). Known background p(w|θ_B): the 0.2, a 0.1, we 0.01, to 0.02, …. Unknown topic model p(w|θ_1) = ? ("Text mining"): text = ?, mining = ?, association = ?, word = ?, …. Unknown topic model p(w|θ_2) = ? ("Information retrieval"): information = ?, retrieval = ?, query = ?, document = ?, …. E-step: predict the topic label of each observed word using Bayes' rule. M-step: maximum likelihood estimation based on the resulting "fractional counts".

Parameter Estimation. E-step (an application of Bayes' rule): for each word w in doc d, compute the probability that it was generated from cluster (topic) j and the probability that it was generated from the background. M-step: re-estimate the mixing weights π_{d,j} and the topic LMs p(w|θ_j) from the fractional counts contributed to "using cluster j in generating d" and to "generating w from cluster j", summing over all docs in the collection.
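A compact Python sketch of these updates (the function signature and variable names are illustrative; a fixed background model can be mixed in via lam_b, or disabled with lam_b = 0):

```python
import numpy as np

def plsa_em(X, k, n_iter=100, lam_b=0.0, p_bg=None, seed=0):
    """PLSA EM on a (D, V) word-count matrix X with k topics.
    Returns pi (D, k) document-topic coverage and phi (k, V) topic word dists."""
    rng = np.random.default_rng(seed)
    D, V = X.shape
    pi  = rng.random((D, k)); pi  /= pi.sum(axis=1, keepdims=True)   # pi_{d,j}
    phi = rng.random((k, V)); phi /= phi.sum(axis=1, keepdims=True)  # p(w|theta_j)
    if p_bg is None:
        p_bg = np.ones(V) / V
    for _ in range(n_iter):
        # E-step: responsibilities p(z_{d,w}=j) and background prob p(z_{d,w}=B)
        mix  = pi @ phi                                              # (D, V)
        p_b  = lam_b * p_bg / (lam_b * p_bg + (1.0 - lam_b) * mix)
        resp = pi[:, :, None] * phi[None, :, :] / np.maximum(mix, 1e-300)[:, None, :]
        # M-step: fractional counts c(w,d) * (1 - p(z=B)) * p(z=j)
        frac = (X * (1.0 - p_b))[:, None, :] * resp                  # (D, k, V)
        pi   = frac.sum(axis=2); pi  /= pi.sum(axis=1, keepdims=True)
        phi  = frac.sum(axis=0); phi /= phi.sum(axis=1, keepdims=True)
    return pi, phi

# Toy usage: 4 documents over a 5-word vocabulary, 2 topics.
X = np.array([[5, 3, 0, 0, 1],
              [4, 4, 1, 0, 0],
              [0, 1, 4, 5, 2],
              [0, 0, 3, 6, 3]], dtype=float)
pi, phi = plsa_em(X, k=2)
print(np.round(pi, 2)); print(np.round(phi, 2))
```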

How the Algorithm Works. Consider two documents d_1, d_2 with counts c(w, d) for the words "aid", "price", "oil", and two topics with word distributions P(w|θ_1), P(w|θ_2) and coverages π_{d,j} (= P(θ_j|d)). Initialize π_{d,j} and P(w|θ_j) with random values. Iteration 1, E-step: split the word counts among the topics (by computing the z's), giving fractional counts c(w,d) p(z_{d,w} = B) for the background and c(w,d)(1 − p(z_{d,w} = B)) p(z_{d,w} = j) for topic j. Iteration 1, M-step: re-estimate π_{d,j} and P(w|θ_j) by adding and normalizing the split word counts. Iterations 2, 3, 4, 5, …: repeat until convergence.

PLSA with Prior Knowledge Users have some domain knowledge in mind, e.g., –We expect to see “retrieval models” as a topic in IR literature –We want to see aspects such as “battery” and “memory” for opinions about a laptop –One topic should be fixed to model background words (infinitely strong prior!) We can easily incorporate such knowledge as priors of PLSA model 38

Adding a Prior: Maximum a Posteriori (MAP) Estimation. The generative process for word w in doc d in the collection is the same as before: background θ_B (is 0.05, the 0.04, a, …) chosen with probability λ_B, and topics θ_1 (warning 0.3, system, …), θ_2 (aid 0.1, donation 0.05, support, …), …, θ_k (statistics 0.2, loss 0.1, dead, …) chosen according to the topic coverage π_{d,j} in document d. Parameters: λ_B = noise level (manually set); the θ's and π's are estimated with Maximum a Posteriori (MAP) estimation, i.e., as the most likely parameters given the prior. A prior can be placed on π as well (more about this later).

Adding a Prior as Pseudo Counts. As before, we have a known background p(w|θ_B) (the 0.2, a 0.1, we 0.01, to 0.02, …) and unknown topic models p(w|θ_1) = ? ("Text mining": text, mining, association, word, …) and p(w|θ_2) = ? ("Information retrieval": information, retrieval, query, document, …). The prior is encoded as a pseudo document of size μ containing, e.g., the words "text" and "mining"; if we knew the identity of each word, the MAP estimator would simply pool the observed doc(s) with this pseudo doc.

Maximum a Posteriori (MAP) Estimation. The MAP update for p(w|θ_j) adds μ p(w|θ'_j) (the pseudo counts of w from the prior θ'_j) to the numerator of the ML update and μ (the sum of all pseudo counts) to its denominator: p(w|θ_j) = [Σ_d c(w,d)(1 − p(z_{d,w} = B)) p(z_{d,w} = j) + μ p(w|θ'_j)] / [Σ_{w'} Σ_d c(w',d)(1 − p(z_{d,w'} = B)) p(z_{d,w'} = j) + μ]. What if μ = 0? (We recover the ML estimate.) What if μ = +∞? (The estimate equals the prior.) A consequence of using a conjugate prior is that the prior can be converted into "pseudo data", which can then be "merged" with the actual data for parameter estimation.

A General Introduction to EM. Data: X (observed) + H (hidden); parameter: θ. "Incomplete" likelihood: L(θ) = log p(X|θ). "Complete" likelihood: L_c(θ) = log p(X, H|θ). EM tries to iteratively maximize the incomplete likelihood: starting with an initial guess θ^(0), 1. E-step: compute the expectation of the complete likelihood (the Q-function); 2. M-step: compute θ^(n) by maximizing the Q-function.
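In formulas, the two steps are:

```latex
\[
\text{E-step:}\quad
Q\big(\theta;\theta^{(n-1)}\big)
  = \mathbb{E}_{p(H\mid X,\,\theta^{(n-1)})}\big[\log p(X,H\mid\theta)\big],
\qquad
\text{M-step:}\quad
\theta^{(n)} = \arg\max_{\theta}\, Q\big(\theta;\theta^{(n-1)}\big).
\]
```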

Convergence Guarantee. Goal: maximize the "incomplete" likelihood L(θ) = log p(X|θ) (which doesn't contain H), i.e., choose θ^(n) so that L(θ^(n)) − L(θ^(n−1)) ≥ 0. Note that, since p(X, H|θ) = p(H|X, θ) p(X|θ), we have L(θ) = L_c(θ) − log p(H|X, θ), so L(θ^(n)) − L(θ^(n−1)) = L_c(θ^(n)) − L_c(θ^(n−1)) + log [p(H|X, θ^(n−1)) / p(H|X, θ^(n))]. Taking the expectation w.r.t. p(H|X, θ^(n−1)) gives L(θ^(n)) − L(θ^(n−1)) = Q(θ^(n); θ^(n−1)) − Q(θ^(n−1); θ^(n−1)) + D(p(H|X, θ^(n−1)) || p(H|X, θ^(n))). The KL divergence is always non-negative, and EM chooses θ^(n) to maximize Q, so the Q-difference is also non-negative; therefore L(θ^(n)) ≥ L(θ^(n−1))!

EM as Hill-Climbing: converging to a local maximum of the likelihood p(X|θ). E-step = computing a lower bound of L(θ) at the current guess; M-step = maximizing the lower bound to obtain the next guess. Concretely, L(θ) = L(θ^(n−1)) + Q(θ; θ^(n−1)) − Q(θ^(n−1); θ^(n−1)) + D(p(H|X, θ^(n−1)) || p(H|X, θ)) ≥ L(θ^(n−1)) + Q(θ; θ^(n−1)) − Q(θ^(n−1); θ^(n−1)), and the right-hand side is the lower bound (the Q-function up to constants) that touches L(θ) at θ = θ^(n−1).

45 Outline 1.Background -Text Mining (TM) -Statistical Language Models 2.Basic Topic Models -Probabilistic Latent Semantic Analysis (PLSA) -Latent Dirichlet Allocation (LDA) -Applications of Basic Topic Models to Text Mining 3.Advanced Topic Models -Capturing Topic Structures -Contextualized Topic Models -Supervised Topic Models 4.Summary We are here

Deficiency of PLSA. Not a generative model: –Can't compute the probability of a new document –A heuristic workaround is possible, though. Many parameters, hence high model complexity: –Many local maxima –Prone to overfitting. Not necessarily a problem for text mining (where we are only interested in fitting the "training" documents).

Latent Dirichlet Allocation (LDA) [Blei et al. 02]. Make PLSA a generative model by imposing a Dirichlet prior on the model parameters: –LDA = Bayesian version of PLSA –Parameters are regularized. LDA can achieve the same goal as PLSA for text mining purposes: –Topic coverage and topic word distributions can be inferred using Bayesian inference.

LDA = Imposing a Prior on PLSA. Both models "generate" word w in doc d from topics θ_1, …, θ_k mixed by the topic coverage π_{d,1}, …, π_{d,k}. PLSA: the topic coverage π_{d,j} is specific to the "training" documents and thus can't be used to generate a new document; the {π_{d,j}} are free parameters for tuning. LDA: the topic coverage distribution {π_{d,j}} for any document is sampled from a Dirichlet distribution, allowing generation of a new doc; the {π_{d,j}} are regularized. In addition, the topic word distributions {θ_j} are also drawn from another Dirichlet prior. The magnitudes of α and β determine the variances of the priors and thus also the strength of the priors (larger α and β mean a stronger prior).

Equations for PLSA vs. LDA. The core assumption shared by all topic models is the document-specific mixture of topic word distributions (the PLSA component); what LDA adds are Dirichlet priors on the topic coverage and the topic word distributions.
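A standard way to write the two likelihoods in the notation used above (the background component is omitted for brevity):

```latex
% PLSA: \pi_{d,j} and \theta_j are free parameters (core mixture assumption)
\[
\log p(C\mid\Lambda)
  = \sum_{d\in C}\sum_{w\in V} c(w,d)\,
    \log\Big[\sum_{j=1}^{k}\pi_{d,j}\,p(w\mid\theta_j)\Big]
\]
% LDA: the same mixture, with \pi_d \sim \mathrm{Dir}(\alpha) and
% \theta_j \sim \mathrm{Dir}(\beta) added and integrated out
\[
p(C\mid\alpha,\beta)
  = \int \prod_{j=1}^{k}\mathrm{Dir}(\theta_j\mid\beta)
    \prod_{d\in C}\int
    \prod_{w\in V}\Big[\sum_{j=1}^{k}\pi_{d,j}\,p(w\mid\theta_j)\Big]^{c(w,d)}
    \mathrm{Dir}(\pi_d\mid\alpha)\,d\pi_d\; d\theta_{1:k}
\]
```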

Parameter Estimation & Inference in LDA. Parameter estimation can be done in the same way as in PLSA, with a maximum likelihood estimator; however, the quantities involved must now be computed using posterior inference, which is computationally intractable, so we must resort to approximate inference!

LDA as a graphical model [Blei et al. 03a]. For each of the N_d words in each of D documents: θ^(d) ~ Dirichlet(α) is the distribution over topics for each document (the same as π_d on the previous slides); z_i ~ Discrete(θ^(d)) is the topic assignment for each word; φ^(j) ~ Dirichlet(β) is the distribution over words for each of the T topics (the same as θ_j on the previous slides); w_i ~ Discrete(φ^(z_i)) is the word generated from its assigned topic; α and β are the Dirichlet priors. Most approximate inference algorithms aim to infer the posterior over the topic assignments z, from which the other interesting variables can be easily computed.

Approximate Inference for LDA. Many different ways, each with its pros & cons. Deterministic approximation: –variational EM [Blei et al. 03a] –expectation propagation [Minka & Lafferty 02]. Markov chain Monte Carlo: –full Gibbs sampler [Pritchard et al. 00] –collapsed Gibbs sampler [Griffiths & Steyvers 04], the most efficient and quite popular, but it can only work with conjugate priors.

The collapsed Gibbs sampler [Griffiths & Steyvers 04] Using conjugacy of Dirichlet and multinomial distributions, integrate out continuous parameters Defines a distribution on discrete ensembles z 53

The collapsed Gibbs sampler [Griffiths & Steyvers 04]. Sample each z_i conditioned on z_{-i}. This is nicer than your average Gibbs sampler: –memory: counts can be cached in two sparse matrices –optimization: no special functions, simple arithmetic –the distributions on θ and φ are analytic given z and w, and can later be recovered for each sample.

Gibbs sampling in LDA: a worked example (iterations 1, 2, 3, …). In each iteration, every word token's topic assignment z_i is resampled conditioned on all the other assignments. The relevant counts are: the count of instances where word w_i is assigned to topic j, the count of all words assigned to topic j, the number of words in document d_i assigned to topic j, and the number of words in d_i assigned to any topic. Intuitively, the resampling combines "how likely would topic j generate word w_i?" with "how likely would d_i choose topic j?", answering "what's the most likely topic for w_i in d_i?". Repeating this over iterations 1, 2, 3, … until the chain converges yields samples of z.
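For reference, the conditional used in each resampling step, as given in [Griffiths & Steyvers 04] (V = vocabulary size, T = number of topics, and the counts n exclude the current token i):

```latex
\[
P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + V\beta}
\cdot
\frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}
\]
```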

64 Outline 1.Background -Text Mining (TM) -Statistical Language Models 2.Basic Topic Models -Probabilistic Latent Semantic Analysis (PLSA) -Latent Dirichlet Allocation (LDA) -Applications of Basic Topic Models to Text Mining 3.Advanced Topic Models -Capturing Topic Structures -Contextualized Topic Models -Supervised Topic Models 4.Summary We are here

Applications of Topic Models for Text Mining: Illustration with 2 Topics. The likelihood of a document is a two-component mixture of p(w|θ_1) and p(w|θ_2) with coverage π. Application scenarios: –p(w|θ_1) & p(w|θ_2) are known; estimate π ("The doc is about text mining and food nutrition; what percentage is about text mining?") –p(w|θ_1) & π are known; estimate p(w|θ_2) ("30% of the doc is about text mining; what's the rest about?") –p(w|θ_1) is known; estimate π & p(w|θ_2) ("The doc is about text mining; is it also about some other topic, and if so to what extent?") –π is known; estimate p(w|θ_1) & p(w|θ_2) ("30% of the doc is about one topic and 70% about another; what are these two topics?") –Estimate π, p(w|θ_1), and p(w|θ_2) ("The doc is about two subtopics; find out what they are and to what extent the doc covers each.")

Use PLSA/LDA for Text Mining Both PLSA and LDA would be able to generate –Topic coverage in each document: p(  d = j) –Word distribution for each topic: p(w|  j ) –Topic assignment at the word level for each document –The number of topics must be given in advance These probabilities can be used in many different ways –  j naturally serves as a word cluster –  d,j can be used for document clustering –Contextual text mining: Make these parameters conditioned on context, e.g., p(  j |time), from which we can compute/plot p(time|  j ) p(  j |location), from which we can compute/plot p(loc|  j ) 66

Sample Topics from TDT Corpus [Hofmann 99b] 67

How to Help Users Interpret a Topic Model? [Mei et al. 07b]. Use top words: automatic, but hard to make sense of. Human-generated labels: make sense, but cannot scale up. Example: a topic with top words term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, frequent 0.02, probabilistic 0.02, document 0.02, … could be labeled "Retrieval Models" (or just shown as "term, relevance, weight, feedback"); what label should a topic with top words insulin, foraging, foragers, collected, grains, loads, collection, nectar, … get? Question: can we automatically generate understandable labels for topics?

What is a Good Label? Semantically close (relevance); understandable (phrases?); high coverage inside the topic; discriminative across topics. Example (a topic from [Mei & Zhai 06b] with top words term, relevance, weight, feedback, independence, model, frequent, probabilistic, document, …): candidate labels include "iPod Nano", "じょうほうけんさく" ("information retrieval" written in Japanese), "Pseudo-feedback", "Information Retrieval", and "Retrieval models".

Automatic Labeling of Topics [Mei et al. 07b]. Pipeline: (1) from the collection (the context), build a candidate label pool using an NLP chunker and n-gram statistics (e.g., "database system", "clustering algorithm", "r tree", "functional dependency", "iceberg cube", "concurrency control", "index structure", …); (2) for each multinomial topic model produced by a statistical topic model (e.g., a topic with top words term, relevance, weight, feedback, independence, model, frequent, probabilistic, document, …), score the candidates with a relevance score and re-rank them by coverage and discrimination, producing a ranked list of labels (e.g., "clustering algorithm"; "distance measure"; …).

Relevance: the Zero-Order Score. Intuition: prefer phrases that cover the topic's top words well. For a latent topic θ with top words clustering, dimensional, algorithm, birch, shape, … and, e.g., p("clustering"|θ) = 0.4, p("dimensional"|θ) = 0.3, while p("body"|θ) and p("shape"|θ) (0.01) are small, the good label l_1 = "clustering algorithm" scores higher than the bad label l_2 = "body shape".
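A minimal sketch of one instantiation of this intuition (the scoring form, a sum of log p(w|θ) over the label's words, and the toy probabilities are assumptions for illustration):

```python
import math

def zero_order_score(label_words, topic_lm, eps=1e-12):
    """Score a candidate label by how well the topic covers its words:
    sum of log p(w | theta) over the words in the label (assumed form)."""
    return sum(math.log(topic_lm.get(w, eps)) for w in label_words)

topic = {"clustering": 0.4, "dimensional": 0.3, "algorithm": 0.1,
         "birch": 0.05, "shape": 0.01}                       # toy p(w | theta)
print(zero_order_score(["clustering", "algorithm"], topic))  # good label, higher score
print(zero_order_score(["body", "shape"], topic))            # bad label, lower score
```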

Relevance: the First-Order Score. Intuition: prefer phrases whose context distribution is similar to the topic's. From a reference collection (e.g., the SIGMOD proceedings), estimate context distributions p(w | "clustering algorithm") and p(w | "hash join") over words such as clustering, hash, dimension, algorithm, partition, key, …, and compare each to the topic's P(w|θ). Since D(θ || "clustering algorithm") < D(θ || "hash join"), the good label l_1 = "clustering algorithm" receives a higher Score(l, θ) than the bad label l_2 = "hash join".

Results: Sample Topic Labels. A topic with top words sampling 0.06, estimation 0.04, approximate 0.04, histograms 0.03, selectivity 0.03, histogram 0.02, answers 0.02, accurate 0.02 is labeled "selectivity estimation"; a topic with tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01 is labeled "r tree / b tree / indexing methods"; a topic with north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh, reagan, charges is labeled "iran contra"; a topic with clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality (plus stop words the, of, a, and, to, data, each with probability > 0.02) is labeled "clustering algorithm / clustering structure" (other candidates: large data, data quality, high data, data application, …).

Results: Context-Sensitive Labeling. The same topic receives different labels under different contexts. A topic with top words sampling, estimation, approximation, histogram, selectivity, histograms, … is labeled "selectivity estimation; random sampling; approximate answers" in the Database context (SIGMOD proceedings) but "distributed retrieval; parameter estimation; mixture models" in the IR context (SIGIR proceedings). A topic with top words dependencies, functional, cube, multivalued, iceberg, buc, … is labeled "multivalue dependency; functional dependency; iceberg cube" in the Database context but "term dependency; independence assumption" in the IR context.

Using PLSA to Discover Temporal Topic Trends [Mei & Zhai 05]: the strength of themes such as {gene, expressions, probability, microarray, …}, {marketing, customer, model, business, …}, and {rules, association, support, …} is plotted over time.

Construct Theme Evolution Graph [Mei & Zhai 05]: themes discovered over time (from 1999 onward), e.g., {SVM, criteria, classification, linear, …}, {decision, tree, classifier, class, Bayes, …}, {classification, text, unlabeled, document, labeled, learning, …}, {information, web, social, retrieval, distance, networks, …}, {web, classification, features 0.006, topic, …}, {mixture, random, cluster, clustering, variables, …}, and {topic, mixture, LDA, semantic, …}, are linked by evolutionary transitions to form a graph.

Use PLSA to Integrate Opinions [Lu & Zhai 08]. Input: an expert review with aspects (e.g., for the topic "iPod": Design, Battery, Price) plus a text collection of ordinary opinions (e.g., weblog snippets such as "cute… tiny… thicker…", "last many hrs / die out soon", "could afford it / still expensive", "iTunes … easy to use…", "warranty … better to extend…"). Output: an integrated summary organized by the review aspects (Design, Battery, Price), showing similar and supplementary opinions for each, plus extra aspects not covered in the expert review.

Methods Semi-Supervised Probabilistic Latent Semantic Analysis (PLSA) –The aspects extracted from expert reviews serve as clues to define a conjugate prior on topics –Maximum a Posteriori (MAP) estimation –Repeated applications of PLSA to integrate and align opinions in blog articles to expert review 78

Results: Product (iPhone) Opinion Integration with Review Aspects (columns: review article | similar opinions | supplementary opinions). Activation: review — "You can make emergency calls, but you can't use any other functions…"; similar — N/A; supplementary (Unlock/hack iPhone) — "… methods for unlocking the iPhone have emerged on the Internet in the past few weeks, although they involve tinkering with the iPhone hardware…". Battery: review — "rated battery life of 8 hours talk time, 24 hours of music playback, 7 hours of video playback, and 6 hours on Internet use"; similar (confirms the opinions from the review) — "iPhone will Feature Up to 8 Hours of Talk Time, 6 Hours of Internet Use, 7 Hours of Video Playback or 24 Hours of Audio Playback"; supplementary (additional info under real usage) — "Playing relatively high bitrate VGA H.264 videos, our iPhone lasted almost exactly 9 freaking hours of continuous playback with cell and WiFi on (but Bluetooth off)."

Results: Product (iPhone) Opinions on Extra Aspects (columns: support | supplementary opinions on extra aspects). Support 15: "You may have heard of iASign … an iPhone Dev Wiki tool that allows you to activate your phone without going through the iTunes rigamarole." (another way to activate the iPhone). Support 13: "Cisco has owned the trademark on the name 'iPhone' since 2000, when it acquired InfoGear Technology Corp., which originally registered the name." (iPhone trademark originally owned by Cisco). Support 13: "With the imminent availability of Apple's uber cool iPhone, a look at 10 things current smartphones like the Nokia N95 have been able to do for a while and that the iPhone can't currently match..." (a better choice for smart phones?).

Results: Product (iPhone) support statistics for review aspects. Takeaways: people care about price; people comment a lot about the unique Wi-Fi feature; controversy: activation requires a contract with AT&T.

Comparison of Task Performance of PLSA and LDA [Lu et al. 11] Three text mining tasks considered –Topic model for text clustering –Topic model for text categorization (topic model is used to obtain low-dimensional representation) –Topic model for smoothing language model for retrieval Conclusions –PLSA and LDA generally have similar task performance for clustering and retrieval –LDA works better than PLSA when used to generate low- dimensional representation (PLSA suffers from overfitting) –Task performance of LDA is very sensitive to setting of hyperparameters –Multiple local maxima problem of PLSA didn’t seem to affect task performance much 82

Outline 1.Background -Text Mining (TM) -Statistical Language Models 2.Basic Topic Models -Probabilistic Latent Semantic Analysis (PLSA) -Latent Dirichlet Allocation (LDA) -Applications of Basic Topic Models to Text Mining 3.Advanced Topic Models -Capturing Topic Structures -Contextualized Topic Models -Supervised Topic Models 4.Summary We are here 83

Overview of Advanced Topic Models There are MANY variants and extensions of the basic PLSA/LDA topic models! Selected major lines to cover in this tutorial –Capturing Topic Structures –Contextualized Topic Models –Supervised Topic Models 84

Capturing Topic Structure: Learning Topic Hierarchies. Fixed hierarchies: [Hofmann 99c]; learned hierarchies: [Blei et al. 03b]. Example hierarchy: Topic 0 at the root; Topic 1.1 and Topic 1.2 at the next level; Topic 2.1, Topic 2.2, and Topic 2.3 at the leaves. The topics in each document form a path from the root to a leaf.

Twelve Years of NIPS [Blei et al. 03b] 87

Capturing Topic Structures: Correlated Topic Model (CTM) [Blei & Lafferty 05] 88

Sample Result of CTM 89

90 Outline 1.Background -Text Mining (TM) -Statistical Language Models 2.Basic Topic Models -Probabilistic Latent Semantic Analysis (PLSA) -Latent Dirichlet Allocation (LDA) -Applications of Basic Topic Models to Text Mining 3.Advanced Topic Models -Capturing Topic Structures -Contextualized Topic Models -Supervised Topic Models 4.Summary We are here

Contextual Topic Mining. Documents are often associated with context (metadata): –Direct context: time, location, source, authors, … –Indirect context: events, policies, …. Many applications require "contextual text analysis": –Discovering topics from text in a context-sensitive way –Analyzing variations of topics over different contexts –Revealing interesting patterns (e.g., topic evolution, topic variations, topic communities)

Example: Comparing News Articles on the Vietnam War, the Afghan War, and the Iraq War, across sources (CNN, Fox, blogs), time periods (before 9/11 vs. during the Iraq war), and blog communities (current US blogs vs. European blogs vs. others). The goal is a table of common themes (e.g., United Nations, death of people, …) together with the "Vietnam"-specific, "Afghan"-specific, and "Iraq"-specific versions of each theme. What's in common? What's unique?

More Contextual Analysis Questions What positive/negative aspects did people say about X (e.g., a person, an event)? Trends? How does an opinion/topic evolve over time? What are emerging research topics in computer science? What topics are fading away? How can we mine topics from literature to characterize the expertise of a researcher? How can we characterize the content exchanges on a social network? … 93

Contextual Probabilistic Latent Semantic Analysis (CPLSA) [Mei & Zhai 06b]. Document context: Time = July 2005, Location = Texas, Author = xxx, Occupation = sociologist, Age group = 45+, …. To generate a word: choose a view (e.g., View 1 = Texas, View 2 = July 2005, View 3 = sociologist) over themes such as "government" (government 0.3, response, …), "donation" (donate 0.1, relief 0.05, help, …), and "New Orleans" (city 0.2, new 0.1, orleans, …); choose a coverage (e.g., the Texas coverage, the July 2005 coverage, or the document-specific coverage); choose a theme according to that coverage; and draw a word from the chosen theme θ_i (e.g., government, response; donate, aid, help; new, Orleans). Example text: "Criticism of government response to the hurricane primarily consisted of criticism of its response to …", "The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production …", "Over seventy countries pledged monetary donations or other assistance." …

Comparing News Articles [Zhai et al. 04]: Iraq War (30 articles) vs. Afghan War (26 articles). Cluster 1: common theme (united, nations 0.04, …), Iraq-specific theme (n 0.03, weapons, inspections, …), Afghan-specific theme (northern 0.04, alliance 0.04, kabul 0.03, taleban, aid 0.02, …). Cluster 2: common theme (killed, month, deaths, …), Iraq-specific theme (troops, hoon, sanches, …), Afghan-specific theme (taleban, rumsfeld 0.02, hotel, front, …). The common theme indicates that "United Nations" is involved in both wars; the collection-specific themes indicate the different roles of "United Nations" in the two wars.

96 Spatiotemporal Patterns in Blog Articles [Mei et al. 06a] Query= “Hurricane Katrina” Topics in the results: Spatiotemporal patterns

Theme Life Cycles ("Hurricane Katrina"): e.g., the "New Orleans" theme (city, orleans, new, louisiana, flood, evacuate, storm, …) and the "Oil Price" theme (price, oil, gas, increase, product, fuel, company, …), each with its strength plotted over time.

Theme Snapshots ("Hurricane Katrina"). Week 1: the theme is strongest along the Gulf of Mexico. Week 2: the discussion moves towards the north and west. Week 3: the theme distributes more uniformly over the states. Week 4: the theme is again strong along the east coast and the Gulf of Mexico. Week 5: the theme fades out in most states.

Multi-Faceted Sentiment Summary [Mei et al. 07a] (query = "Da Vinci Code"): opinion sentences are organized by facet and by sentiment (neutral | positive | negative). Facet 1 (Movie): "... Ron Howards selection of Tom Hanks to play Robert Langdon." "Tom Hanks stars in the movie, who can be mad at that?" "But the movie might get delayed, and even killed off if he loses." "Directed by: Ron Howard Writing credits: Akiva Goldsman..." "Tom Hanks, who is my favorite movie star act the leading role." "protesting... will lose your faith by... watching the movie." "After watching the movie I went online and some research on..." "Anybody is interested in it?" "... so sick of people making such a big deal about a FICTION book and movie." Facet 2 (Book): "I remembered when i first read the book, I finished the book in two days." "Awesome book." "I'm reading 'Da Vinci Code' now. … So still a good book to past time." "This controversy book cause lots conflict in west society."

Separate Theme Sentiment Dynamics “book” “religious beliefs” 100

Event Impact Analysis: IR Research [Mei & Zhai 06b]. Theme tracked: retrieval models in SIGIR papers (term, relevance, weight, feedback, independence, model, frequent, probabilistic, document, …). Two events are marked: the start of the TREC conferences (1992) and the publication of the paper "A language modeling approach to information retrieval" (1998). Around these events the theme splits into variants such as {vector, concept, extend, model, space, boolean, function, feedback, …}, {xml, model, collect, judgment, rank, subtopic, …}, {probabilist, model, logic, ir, boolean, algebra, estimate, weight, …}, and {model, language, estimate, parameter, distribution, probable, smooth, markov, likelihood, …}.

The Author-Topic Model [Rosen-Zvi et al. 04]. For each of the N_d words in each of D documents: x_i ~ Uniform(A^(d)), the author of each word is chosen uniformly at random from the document's author set; θ^(a) ~ Dirichlet(α), each author has a distribution over topics; z_i ~ Discrete(θ^(x_i)), the topic is drawn from the chosen author's topic distribution; φ^(j) ~ Dirichlet(β) for each of the T topics; w_i ~ Discrete(φ^(z_i)), the word is generated from the assigned topic.

Four example topics from NIPS 103

Dirichlet-multinomial Regression (DMR) [Mimno & McCallum 08] 104 Allows arbitrary features to be used to influence choice of topics

Outline 1.Background -Text Mining (TM) -Statistical Language Models 2.Basic Topic Models -Probabilistic Latent Semantic Analysis (PLSA) -Latent Dirichlet Allocation (LDA) -Applications of Basic Topic Models to Text Mining 3.Advanced Topic Models -Capturing Topic Structures -Contextualized Topic Models -Supervised Topic Models 4.Summary We are here 105

Supervised LDA [Blei & McAuliffe 07] 106

Sample Results of Supervised LDA 107

Latent Aspect Rating Analysis [Wang et al. 11] Given a set of review articles about a topic with overall ratings (ratings as “supervision signals”) Output –Major aspects commented on in the reviews –Ratings on each aspect –Relative weights placed on different aspects by reviewers Many applications –Opinion-based entity ranking –Aspect-level opinion summarization –Reviewer preference analysis –Personalized recommendation of products –… 108

An Example of LARA: given a review with an overall rating, how do we infer the aspect ratings (Value, Location, Service, …), and how do we infer the aspect weights (Value, Location, Service, …)?

A Unified Generative Model for LARA. An entity (e.g., a hotel) has aspects, each with its own word distribution: Location (location, amazing, walk, anywhere), Room (room, dirty, appointed, smelly), Service (terrible, front-desk, smile, unhelpful); a review is generated from these aspects together with a latent aspect rating and an aspect weight for each aspect. Example review: "Excellent location in walking distance to Tiananmen Square and shopping streets. That's the best part of this hotel! The rooms are getting really old. Bathroom was nasty. The fixtures were falling off, lots of cracks and everything looked dirty. I don't think it worth the price. Service was the most disappointing part, especially the door men. this is not how you treat guests, this is not hospitality."

Latent Aspect Rating Analysis Model [Wang et al. 11]: a unified framework combining an aspect modeling module and a rating prediction module, applied to review text such as the hotel review above.

Aspect Identification on Amazon reviews (no keyword guidance): discovered aspects include battery life, accessory, service, file format, volume, and video.

Network Supervised Topic Modeling [Mei et al. 08]. Probabilistic topic modeling is cast as an optimization problem (e.g., PLSA/LDA: maximum likelihood); a regularized objective function adds network constraints: –Topic distributions are smoothed over adjacent vertices –Flexibility in selecting topic models and regularizers

Instantiation: NetPLSA. Basic assumption: neighbors in the network have similar topic distributions. The objective combines the PLSA likelihood with a graph harmonic regularizer (a generalization of [Zhu '03]) that penalizes, for each edge, the difference between the topic distributions of the two documents it connects, weighted by the importance (weight) of the edge, with a tradeoff parameter balancing the two terms.
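A sketch of an objective with this structure (following the description above; see [Mei et al. 08] for the exact form used in the paper):

```latex
\[
O(C,G) = (1-\lambda)\,\log p(C\mid\Lambda)
  \;-\; \lambda\,\frac{1}{2}\sum_{(u,v)\in E} w(u,v)
  \sum_{j=1}^{k}\big(\pi_{u,j}-\pi_{v,j}\big)^{2}
\]
```

Here log p(C|Λ) is the PLSA log-likelihood, w(u, v) is the weight of edge (u, v), π_{u,j} is the topic distribution of document u, and λ is the tradeoff parameter.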

Topical Communities with PLSA (each topic's label is unclear, "?", indicating a noisy community assignment). Topic 1: term 0.02, question 0.02, protein 0.01, training 0.01, weighting 0.01, multiple 0.01, recognition 0.01, relations 0.01, library 0.01. Topic 2: peer 0.02, patterns 0.01, mining 0.01, clusters 0.01, stream 0.01, frequent 0.01, e 0.01, page 0.01, gene 0.01. Topic 3: visual 0.02, analog 0.02, neurons 0.02, vlsi 0.01, motion 0.01, chip 0.01, natural 0.01, cortex 0.01, spike 0.01. Topic 4: interface 0.02, towards 0.02, browsing 0.02, xml 0.01, generation 0.01, design 0.01, engine 0.01, service 0.01, social 0.01.

Topical Communities with NetPLSA (coherent community assignment). Topic 1 (Information Retrieval): retrieval 0.13, information 0.05, document 0.03, query 0.03, text 0.03, search 0.03, evaluation 0.02, user 0.02, relevance 0.02. Topic 2 (Data mining): mining 0.11, data 0.06, discovery 0.03, databases 0.02, rules 0.02, association 0.02, patterns 0.02, frequent 0.01, streams 0.01. Topic 3 (Machine learning): neural 0.06, learning 0.02, networks 0.02, recognition 0.02, analog 0.01, vlsi 0.01, neurons 0.01, gaussian 0.01, network 0.01. Topic 4 (Web): web 0.05, services 0.03, semantic 0.03, services 0.03, peer 0.02, ontologies 0.02, rdf 0.02, management 0.01, ontology 0.01.

Outline 1.Background -Text Mining (TM) -Statistical Language Models 2.Basic Topic Models -Probabilistic Latent Semantic Analysis (PLSA) -Latent Dirichlet Allocation (LDA) -Applications of Basic Topic Models to Text Mining 3.Advanced Topic Models -Capturing Topic Structures -Contextualized Topic Models -Supervised Topic Models 4.Summary We are here 117

Summary Statistical Topic Models (STMs) are a new family of language models, especially useful for –Discovering latent topics in text –Analyzing latent structures and patterns of topics –Extensible for joint modeling and analysis of text and associated non-textual data PLSA & LDA are two basic topic models that tend to function similarly, with LDA better as a generative model Many different models have been proposed with probably many more to come Many demonstrated applications in multiple domains and many more to come 118

Summary (cont.) However, all topic models suffer from the problem of multiple local maxima –Make it hard/impossible to reproduce research results –Make it hard/impossible to interpret results in real applications Complex models can’t scale up to handle large amounts of text data –Collapsed Gibbs sampling is efficient, but only working for conjugate priors –Variational EM needs to be derived in a model-specific way –Parallel algorithms are promising Many challenges remain…. 119

120 Challenges and Future Directions Challenge 1: How can we quantitatively evaluate the benefit of topic models for text mining? –Currently, most quantitative evaluation is based on perplexity which doesn’t reflect the actual utility of a topic model for text mining –Need to separately evaluate the quality of both topic word distributions and topic coverage –Need to consider multiple aspects of a topic (e.g., coherent?, meaningful?) and define appropriate measures –Need to compare topic models with alternative approaches to solving the same text mining problem (e.g., traditional IR methods, non-negative matrix factorization) –Need to create standard test collections

Challenge 2: How can we help users interpret a topic? –Most of the time, a topic is manually labeled in a research paper; this is insufficient for real applications –Automatic labeling can help, but its utility still needs to be evaluated –Need to generate a summary for a topic to enable a user to navigate into the text documents and better understand the topic –Need to facilitate post-processing of discovered topics (e.g., ranking, comparison)

122 Challenges and Future Directions (cont.) Challenge 3: How can we address the problem of multiple local maxima? –All topic models have the problem of multiple local maxima, causing problems with reproducing results –Need to compute the variance of a discovered topic –Need to define and report the confidence interval for a topic Challenge 4: How can we develop efficient estimation/inference algorithms for sophisticated models? –How can we leverage a user’s knowledge to speed up inferences for topic models? –Need to develop parallel estimation/inference algorithms

123 Challenges and Future Directions (cont.) Challenge 5: How can we incorporate linguistic knowledge into topic models? –Most current topic models are purely statistical – Some progress has been made to incorporate linguistic knowledge (e.g., [Griffiths et al. 04, Wallach 08]) –More needs to be done Challenge 6: How can we incorporate domain knowledge and preferences from an analyst into a topic model to support complex text mining tasks? –Current models are mostly pre-specified with little flexibility for an analyst to “steer” the analysis process –Need to develop a general analysis framework to enable an analyst to use multiple topic models together to perform complex text mining tasks

References (incomplete)
[Blei et al. 02] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, MIT Press, 2002.
[Blei et al. 03a] David M. Blei, Andrew Y. Ng, Michael I. Jordan: Latent Dirichlet Allocation. Journal of Machine Learning Research 3 (2003).
[Griffiths et al. 04] Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum: Integrating Topics and Syntax. NIPS 2004.
[Blei et al. 03b] David M. Blei, Thomas L. Griffiths, Michael I. Jordan, Joshua B. Tenenbaum: Hierarchical Topic Models and the Nested Chinese Restaurant Process. NIPS 2003.
[Teh et al. 04] Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, David M. Blei: Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes. NIPS 2004.
[Blei & Lafferty 05] David M. Blei, John D. Lafferty: Correlated Topic Models. NIPS 2005.
[Blei & McAuliffe 07] David M. Blei, Jon D. McAuliffe: Supervised Topic Models. NIPS 2007.
[Hofmann 99a] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference, 1999.
[Hofmann 99b] Thomas Hofmann: Probabilistic Latent Semantic Analysis. UAI 1999.
[Hofmann 99c] Thomas Hofmann: The Cluster-Abstraction Model: Unsupervised Learning of Topic Hierarchies from Text Data. IJCAI 1999.
[Jelinek 98] F. Jelinek, Statistical Methods for Speech Recognition, Cambridge: MIT Press, 1998.
[Lu & Zhai 08] Yue Lu, ChengXiang Zhai: Opinion integration through semi-supervised topic modeling. WWW 2008.
[Lu et al. 11] Yue Lu, Qiaozhu Mei, ChengXiang Zhai: Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA. Inf. Retr. 14(2), 2011.
[Mei & Zhai 05] Qiaozhu Mei, ChengXiang Zhai: Discovering evolutionary theme patterns from text: an exploration of temporal text mining. KDD 2005.
[Mei et al. 06a] Qiaozhu Mei, Chao Liu, Hang Su, ChengXiang Zhai: A probabilistic approach to spatiotemporal theme pattern mining on weblogs. WWW 2006.

References (incomplete, cont.)
[Mei & Zhai 06b] Qiaozhu Mei, ChengXiang Zhai: A mixture model for contextual text mining. KDD 2006.
[Mei et al. 07a] Qiaozhu Mei, Xu Ling, Matthew Wondra, Hang Su, ChengXiang Zhai: Topic sentiment mixture: modeling facets and opinions in weblogs. WWW 2007.
[Mei et al. 07b] Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai: Automatic labeling of multinomial topic models. KDD 2007.
[Mei et al. 08] Qiaozhu Mei, Deng Cai, Duo Zhang, ChengXiang Zhai: Topic modeling with network regularization. WWW 2008.
[Mimno & McCallum 08] David M. Mimno, Andrew McCallum: Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression. UAI 2008.
[Minka & Lafferty 02] T. Minka and J. Lafferty: Expectation-propagation for the generative aspect model. In Proceedings of UAI 2002.
[Pritchard et al. 00] J. K. Pritchard, M. Stephens, P. Donnelly: Inference of population structure using multilocus genotype data. Genetics, 155(2), June 2000.
[Rosen-Zvi et al. 04] Michal Rosen-Zvi, Thomas L. Griffiths, Mark Steyvers, Padhraic Smyth: The Author-Topic Model for Authors and Documents. UAI 2004.
[Wang et al. 10] Hongning Wang, Yue Lu, ChengXiang Zhai: Latent aspect rating analysis on review text data: a rating regression approach. KDD 2010.
[Wang et al. 11] Hongning Wang, Yue Lu, ChengXiang Zhai: Latent aspect rating analysis without aspect keyword supervision. KDD 2011.
[Zhai et al. 04] ChengXiang Zhai, Atulya Velivelli, Bei Yu: A cross-collection mixture model for comparative text mining. KDD 2004.