Probabilistic Topic Models for Text Mining

Probabilistic Topic Models for Text Mining ChengXiang Zhai (翟成祥) Department of Computer Science Graduate School of Library & Information Science Institute for Genomic Biology, Statistics University of Illinois, Urbana-Champaign http://www-faculty.cs.uiuc.edu/~czhai, czhai@cs.uiuc.edu 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

What Is Text Mining? “The objective of Text Mining is to exploit information contained in textual documents in various ways, including …discovery of patterns and trends in data, associations among entities, predictive rules, etc.” (Grobelnik et al., 2001) “Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” (Hearst, 1999) (Slide from Rebecca Hwa’s “Intro to Text Mining”) 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Two Different Views of Text Mining Data Mining View: explore patterns in textual data (find latent topics, find topical trends, find outliers and other hidden patterns); this is shallow mining. Natural Language Processing View: make inferences based on partial understanding of natural language text (information extraction, question answering); this is deep mining. 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Applications of Text Mining Direct applications: Go beyond search to find knowledge Question-driven (Bioinformatics, Business Intelligence, etc): We have specific questions; how can we exploit data mining to answer the questions? Data-driven (WWW, literature, email, customer reviews, etc): We have a lot of data; what can we do with it? Indirect applications Assist information access (e.g., discover latent topics to better summarize search results) Assist information organization (e.g., discover hidden structures) 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Text Mining Methods
- Data Mining Style: view text as high-dimensional data (frequent pattern finding, association analysis, outlier detection)
- Information Retrieval Style: fine-granularity topical analysis (topic extraction; exploit term weighting and text similarity measures)
- Natural Language Processing Style: information extraction (entity extraction, relation extraction, sentiment analysis, question answering)
- Machine Learning Style: unsupervised or semi-supervised learning (mixture models, dimension reduction)
The topic of this lecture is topic extraction with mixture models, i.e., the information retrieval and machine learning styles.
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Outline
The Basic Topic Models:
- Probabilistic Latent Semantic Analysis (PLSA) [Hofmann 99]
- Latent Dirichlet Allocation (LDA) [Blei et al. 02]
Extensions:
- Contextual Probabilistic Latent Semantic Analysis (CPLSA) [Mei & Zhai 06]
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Basic Topic Model: PLSA 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

PLSA: Motivation What did people say in their blog articles about “Hurricane Katrina”? Query = “Hurricane Katrina” Results: 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI) [Hofmann 99] Mix k multinomial distributions to generate a document Each document has a potentially different set of mixing weights which captures the topic coverage When generating words in a document, each word may be generated using a DIFFERENT multinomial distribution (this is in contrast with the document clustering model where, once a multinomial distribution is chosen, all the words in a document would be generated using the same model) We may add a background distribution to “attract” background words 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

PLSA as a Mixture Model
“Generating” word w in doc d in the collection: each word is drawn either from a background distribution θ_B (with probability λ_B) or, with probability 1 − λ_B, from one of k topic distributions θ_1, …, θ_k, where topic θ_j is chosen with document-specific weight π_{d,j}.
Example word distributions: Topic 1 (warning 0.3, system 0.2, …), Topic 2 (aid 0.1, donation 0.05, support 0.02, …), Topic k (statistics 0.2, loss 0.1, dead 0.05, …), Background (is 0.05, the 0.04, a 0.03, …).
Parameters: λ_B = noise level (manually set); the π's and θ's are estimated with Maximum Likelihood.
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008
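The mixture just described corresponds to the following word and document likelihood (a reconstruction of the formula shown graphically on the original slide, using λ_B for the background weight and π_{d,j} for the document-specific topic weights):

```latex
p(w \mid d) = \lambda_B\, p(w \mid \theta_B) + (1-\lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j),
\qquad
\log p(d) = \sum_{w \in V} c(w,d)\, \log p(w \mid d)
```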

How to Estimate the θ_j's: EM Algorithm
Setup shown on the slide: a known background model p(w | θ_B) (the 0.2, a 0.1, to 0.02, we 0.01, …), the observed doc(s), and unknown topic models to be estimated, e.g., p(w|θ_1) for “Text mining” (text = ?, mining = ?, association = ?, word = ?, …) and p(w|θ_2) for “information retrieval” (information = ?, retrieval = ?, query = ?, document = ?, …).
Suppose we knew the identity (topic) of each word: then the ML estimator would reduce to simple counting. Since we do not, the EM algorithm iteratively infers these hidden identities and re-estimates the distributions.
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

How the Algorithm Works
Example: two documents with word counts c(w,d), d1: aid 7, price 5, oil 6 and d2: aid 8, price 7, oil 5, and two topics θ_1, θ_2 with weights π_{d,1} = P(θ1|d), π_{d,2} = P(θ2|d).
- Initialize π_{d,j} and P(w|θ_j) with random values.
- Iteration 1, E-Step: split the word counts among the topics by computing the hidden variables z: c(w,d) p(z_{d,w} = B) goes to the background, and c(w,d)(1 − p(z_{d,w} = B)) p(z_{d,w} = j) goes to topic j.
- Iteration 1, M-Step: re-estimate π_{d,j} and P(w|θ_j) by adding and normalizing the split word counts.
- Iteration 2: E-Step and M-Step again; iteration 3, 4, 5, … until convergence.
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Parameter Estimation
E-Step: apply Bayes' rule to decide whether word w in doc d was generated from cluster (topic) j or from the background:
p(z_{d,w} = j) = π_{d,j} p(w|θ_j) / Σ_{j'} π_{d,j'} p(w|θ_{j'}),   p(z_{d,w} = B) = λ_B p(w|θ_B) / [λ_B p(w|θ_B) + (1 − λ_B) Σ_j π_{d,j} p(w|θ_j)]
M-Step: re-estimate the mixing weights and the cluster LMs from the fractional counts contributing to using cluster j in generating d and to generating w from cluster j:
π_{d,j} ∝ Σ_w c(w,d) (1 − p(z_{d,w} = B)) p(z_{d,w} = j),   p(w|θ_j) ∝ Σ_d c(w,d) (1 − p(z_{d,w} = B)) p(z_{d,w} = j)
The sums run over all docs; with multiple collections the same updates are summed over the m collections (m = 1 if there is one collection).
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008
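A minimal sketch of these E/M updates in Python with NumPy, assuming a dense document-term count matrix `counts`, a fixed background distribution `p_bg`, and background weight `lambda_b`; the variable names are illustrative and not from the original lecture:

```python
import numpy as np

def plsa_em(counts, k, p_bg, lambda_b=0.9, n_iter=50, seed=0):
    """PLSA with a fixed background model, estimated by EM.

    counts : (D, V) array of word counts c(w, d)
    k      : number of topics
    p_bg   : (V,) background word distribution p(w | theta_B)
    """
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    # Random initialization of pi_{d,j} and p(w | theta_j)
    pi = rng.random((D, k)); pi /= pi.sum(axis=1, keepdims=True)
    phi = rng.random((k, V)); phi /= phi.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: posterior of topic j for each (d, w), and of the background
        mix = pi @ phi                                   # (D, V): sum_j pi_dj p(w|theta_j)
        p_z = pi[:, :, None] * phi[None, :, :]           # (D, k, V)
        p_z /= np.maximum(mix[:, None, :], 1e-100)       # p(z_{d,w} = j)
        p_b = (lambda_b * p_bg) / (lambda_b * p_bg + (1 - lambda_b) * mix)  # (D, V)

        # M-step: re-estimate from the fractional (split) counts
        frac = counts * (1.0 - p_b)                      # counts not explained by background
        weighted = frac[:, None, :] * p_z                # (D, k, V)
        pi = weighted.sum(axis=2)
        pi /= np.maximum(pi.sum(axis=1, keepdims=True), 1e-100)
        phi = weighted.sum(axis=0)
        phi /= np.maximum(phi.sum(axis=1, keepdims=True), 1e-100)
    return pi, phi
```

For the Hurricane Katrina example, one would set k to the number of expected themes and read off the top words of each row of `phi`.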

PLSA with Prior Knowledge There are different ways of choosing aspects (topics): Google = Google News + Google Maps + Google Scholar + …, or Google = Google US + Google France + Google China + … Users have some domain knowledge in mind, e.g., we expect to see “retrieval models” as a topic in IR, or we want to show the aspects of “history” and “statistics” for YouTube. A flexible way to incorporate such knowledge is as priors of the PLSA model; in Bayesian terms, it is your “belief” about the topic distributions. 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Adding Prior
The same mixture model as before (“generating” word w in doc d in the collection from topics θ_1, …, θ_k with document weights π_{d,j}, plus background θ_B with weight λ_B), but now a prior is placed on the topic word distributions that encodes the most likely θ, i.e., what we expect each topic to look like.
Parameters: λ_B = noise level (manually set); with the prior, the π's and θ's are estimated with Maximum A Posteriori estimation rather than plain Maximum Likelihood (next slides).
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Adding Prior as Pseudo Counts
Same setup as before (known background p(w | θ_B), observed doc(s), unknown topic models p(w|θ_1) for “Text mining” and p(w|θ_2) for “information retrieval”), but the estimator is now a MAP estimator: the prior on a topic is encoded as a pseudo document of size μ containing the expected words (e.g., text, mining), and these pseudo counts are added to the observed counts during estimation.
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Maximum A Posteriori (MAP) Estimation
The M-step update for p(w|θ_j) adds the pseudo counts of w from the prior, μ p(w|θ'_j), to the numerator, and the sum of all pseudo counts, μ, to the normalizer.
What if μ = 0? We recover the Maximum Likelihood estimate. What if μ = +∞? The estimate is fixed to the prior p(w|θ'_j).
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008
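Written out, the MAP update is a smoothed version of the ML update; this reconstruction follows the labels that survive on the slide (pseudo counts from the prior in the numerator, their total μ in the denominator):

```latex
p(w \mid \theta_j) =
\frac{\sum_{d} c(w,d)\,\big(1 - p(z_{d,w}=B)\big)\,p(z_{d,w}=j) + \mu\, p(w \mid \theta'_j)}
     {\sum_{w'} \sum_{d} c(w',d)\,\big(1 - p(z_{d,w'}=B)\big)\,p(z_{d,w'}=j) + \mu}
```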

Basic Topic Model: LDA The following slides about LDA are taken from Michael C. Mozer’s course lecture http://www.cs.colorado.edu/~mozer/courses/ProbabilisticModels/ 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

LDA: Motivation
The shortcomings of pLSI that LDA addresses: “documents have no generative probabilistic semantics” (the document index is just a symbol); the model has many parameters, linear in the number of documents, so heuristic methods are needed to prevent overfitting; and it cannot generalize to new documents.
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Unigram Model 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Mixture of Unigrams 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Topic Model / Probabilistic LSI d is a localist representation of (trained) documents LDA provides a distributed representation 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

LDA Vocabulary of |V| words. A document is a collection of words from the vocabulary: N words in document, w = (w1, ..., wN). Latent topics: random variable z, with values 1, ..., k. As in the topic model (pLSI), a document is generated by sampling a topic from a mixture and then sampling a word from that topic. But the topic model assumes a fixed mixture of topics (multinomial distribution) for each document, whereas LDA assumes a random mixture of topics (drawn from a Dirichlet distribution) for each document. 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Generative Model “Plates” indicate looping structure Outer plate replicated for each document Inner plate replicated for each word Same conditional distributions apply for each replicate Document probability 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008
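A minimal sketch of this generative process (outer plate: one topic mixture per document; inner plate: one topic and word per token) in Python with NumPy; `alpha` and `beta` are the Dirichlet and per-topic word parameters, and all names here are illustrative:

```python
import numpy as np

def generate_corpus(alpha, beta, doc_lengths, seed=0):
    """Sample documents from the LDA generative model.

    alpha : (k,) Dirichlet parameter over topics
    beta  : (k, V) per-topic word distributions
    doc_lengths : list of document lengths N_d
    """
    rng = np.random.default_rng(seed)
    k, V = beta.shape
    corpus = []
    for N in doc_lengths:            # outer plate: one theta per document
        theta = rng.dirichlet(alpha)
        doc = []
        for _ in range(N):           # inner plate: one (z, w) per word
            z = rng.choice(k, p=theta)       # sample a topic
            w = rng.choice(V, p=beta[z])     # sample a word from that topic
            doc.append(w)
        corpus.append(doc)
    return corpus
```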

Fancier Version 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Inference 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Inference
In general, this formula, the marginal probability p(w | α, β) of a document, is intractable. In the expanded version, each word factor is written with an indicator w_n^j that is 1 if w_n is the j'th vocabulary word.
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008
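For reference, the intractable marginal likelihood being discussed has the standard form from the LDA paper:

```latex
p(\mathbf{w} \mid \alpha, \beta)
 = \int p(\theta \mid \alpha)
   \left( \prod_{n=1}^{N} \sum_{z_n=1}^{k} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta
```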

Variational Approximation Computing the log likelihood and introducing Jensen's inequality: log(E[x]) >= E[log(x)]. Find a variational distribution q such that the resulting bound is computable; q is parameterized by γ and φ_n. Maximize the bound with respect to γ and φ_n to obtain the best approximation to p(w | α, β). This leads to a variational EM algorithm. Sampling algorithms (e.g., Gibbs sampling) are also common. 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008
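As noted, sampling is a common alternative to variational inference. Below is a minimal collapsed Gibbs sampler for LDA in Python with symmetric priors `alpha` and `eta`; it is a sketch under those assumptions, not the exact algorithm from the lecture's references:

```python
import numpy as np

def lda_gibbs(docs, V, k, alpha=0.1, eta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs : list of documents, each a list of word indices in [0, V)
    Returns topic-word (phi) and doc-topic (theta) estimates.
    """
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), k))        # topic counts per document
    n_kw = np.zeros((k, V))                # word counts per topic
    n_k = np.zeros(k)                      # total words per topic
    z = []                                 # current topic assignment of each token
    for d, doc in enumerate(docs):
        zd = rng.integers(k, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            n_dk[d, t] += 1; n_kw[t, w] += 1; n_k[t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # remove the token's current assignment
                n_dk[d, t] -= 1; n_kw[t, w] -= 1; n_k[t] -= 1
                # full conditional p(z_i = t | everything else)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + eta) / (n_k + V * eta)
                t = rng.choice(k, p=p / p.sum())
                z[d][i] = t
                n_dk[d, t] += 1; n_kw[t, w] += 1; n_k[t] += 1

    phi = (n_kw + eta) / (n_k[:, None] + V * eta)
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + k * alpha)
    return phi, theta
```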

Data Sets
- C. Elegans community abstracts: 28,414 unique terms
- TREC AP corpus (subset): 16,333 newswire articles, 23,075 unique terms
- Held-out data: 10%
- Removed terms: 50 stop words, and words appearing once
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

C. Elegans (held-out evaluation on the C. Elegans corpus) Note: a “fold-in” hack is needed for pLSI to allow it to handle novel documents; it involves refitting the p(z|dnew) parameters, which is sort of a cheat. 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

AP (held-out evaluation on the TREC AP corpus) 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Summary: PLSA vs. LDA LDA adds a Dirichlet distribution on top of PLSA to regularize the model Estimation of LDA is more complicated than PLSA LDA is a generative model, while PLSA isn’t PLSA is more likely to over-fit the data than LDA Which one to use? If you need generalization capacity, LDA If you want to mine topics from a collection, PLSA may be better (we want overfitting!) 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008
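When generalization to new documents is the goal, an off-the-shelf LDA implementation is usually sufficient. A small example using the gensim library (an assumption here: gensim is not part of the lecture, and `texts` is just a toy list of tokenized documents):

```python
from gensim import corpora, models

texts = [["hurricane", "flood", "new", "orleans"],
         ["oil", "price", "gas", "increase"]]          # toy tokenized documents

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.show_topics(num_topics=2, formatted=False):
    print(topic_id, [w for w, p in words])

# Unlike PLSA, the fitted model can assign topic proportions to unseen documents:
new_doc = dictionary.doc2bow(["gas", "price"])
print(lda.get_document_topics(new_doc))
```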

Extension of PLSA: Contextual Probabilistic Latent Semantic Analysis (CPLSA) 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

A General Introduction to EM
Data: X (observed) + H (hidden); parameter: θ.
“Incomplete” likelihood: L(θ) = log p(X|θ). “Complete” likelihood: Lc(θ) = log p(X,H|θ).
EM tries to iteratively maximize the incomplete likelihood. Starting with an initial guess θ^(0):
1. E-step: compute the expectation of the complete likelihood (the Q-function).
2. M-step: compute θ^(n) by maximizing the Q-function.
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008
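Spelled out with the symbols used on these slides, the two steps are:

```latex
\text{E-step: } Q(\theta;\, \theta^{(n-1)})
 = \mathbb{E}_{H \sim p(H \mid X, \theta^{(n-1)})}\!\left[ L_c(\theta) \right]
 = \sum_{H} p(H \mid X, \theta^{(n-1)})\, \log p(X, H \mid \theta)
\qquad
\text{M-step: } \theta^{(n)} = \arg\max_{\theta}\, Q(\theta;\, \theta^{(n-1)})
```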

Convergence Guarantee
Goal: maximize the “incomplete” likelihood L(θ) = log p(X|θ), i.e., choose θ^(n) so that L(θ^(n)) − L(θ^(n−1)) ≥ 0.
Note that, since p(X,H|θ) = p(H|X,θ) p(X|θ), we have L(θ) = Lc(θ) − log p(H|X,θ), so
L(θ^(n)) − L(θ^(n−1)) = Lc(θ^(n)) − Lc(θ^(n−1)) + log [ p(H|X,θ^(n−1)) / p(H|X,θ^(n)) ].
Taking the expectation w.r.t. p(H|X,θ^(n−1)) (the result no longer contains H):
L(θ^(n)) − L(θ^(n−1)) = Q(θ^(n); θ^(n−1)) − Q(θ^(n−1); θ^(n−1)) + D( p(H|X,θ^(n−1)) || p(H|X,θ^(n)) ).
EM chooses θ^(n) to maximize Q, and the KL-divergence term is always non-negative; therefore L(θ^(n)) ≥ L(θ^(n−1))!
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Another Way of Looking at EM
The likelihood satisfies
L(θ) = L(θ^(n−1)) + Q(θ; θ^(n−1)) − Q(θ^(n−1); θ^(n−1)) + D( p(H|X,θ^(n−1)) || p(H|X,θ) )
≥ L(θ^(n−1)) + Q(θ; θ^(n−1)) − Q(θ^(n−1); θ^(n−1)).
The right-hand side is a lower bound (the Q-function, up to constants) of the likelihood p(X|θ) around the current guess θ^(n−1); the next guess is obtained by maximizing it.
E-step = computing the lower bound; M-step = maximizing the lower bound.
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Why Contextual PLSA? 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Motivating Example: Comparing Product Reviews
Unsupervised discovery of common topics and their variations across IBM, APPLE, and DELL laptop reviews:
Common Themes | “IBM” specific | “APPLE” specific | “DELL” specific
Battery Life | Long, 4-3 hrs | Medium, 3-2 hrs | Short, 2-1 hrs
Hard disk | Large, 80-100 GB | Small, 5-10 GB | Medium, 20-50 GB
Speed | Slow, 100-200 MHz | Very Fast, 3-4 GHz | Moderate, 1-2 GHz
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Motivating Example: Comparing News about Similar Topics
Unsupervised discovery of common topics and their variations across the Vietnam War, Afghan War, and Iraq War: common themes such as “United nations” and “Death of people”, each with “Vietnam”-specific, “Afghan”-specific, and “Iraq”-specific variations.
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Motivating Example: Discovering Topical Trends in Literature
Unsupervised discovery of topics and their temporal variations. (The figure, a sample theme evolutionary graph, plots theme strength over time, 1980-2003, for themes labeled TF-IDF Retrieval, Language Model, IR Applications, and Text Categorization.)
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Motivating Example: Analyzing Spatial Topic Patterns How do blog writers in different states respond to topics such as “oil price increase during Hurricane Karina”? Unsupervised discovery of topics and their variations in different locations 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Motivating Example: Sentiment Summary Unsupervised/Semi-supervised discovery of topics and different sentiments of the topics 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Research Questions Can we model all these problems generally? Can we solve these problems with a unified approach? How can we bring human into the loop? 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Contextual Text Mining Given collections of text with contextual information (meta-data) Discover themes/subtopics/topics (interesting word clusters) Compute variations of themes over contexts Applications: Summarizing search results Federation of text information Opinion analysis Social network analysis Business intelligence .. 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Context Features of Text (Meta-data)
A weblog article comes with context features such as communities, author, source, location, time, and the author's occupation. Compared with other kinds of data, weblogs have some interesting special characteristics, which make them interesting to exploit for text mining.
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Context = Partitioning of Text
Examples of context-defined partitions: papers about the Web, papers written in 1998, papers written by authors in the US. (The figure shows a corpus partitioned by year, 1998-2006, and by venue: WWW, SIGIR, ACL, KDD, SIGMOD.)
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Themes/Topics
Example theme word distributions: Theme 1 (government 0.3, response 0.2, …), Theme 2 (donate 0.1, relief 0.05, help 0.02, …), Theme k (city 0.2, new 0.1, orleans 0.05, …), Background (is 0.05, the 0.04, a 0.03, …). The themes annotate a passage such as: “[Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] … [Over seventy countries pledged monetary donations or other assistance].”
Uses of themes: summarize topics/subtopics, navigate in a document space, retrieve documents, segment documents, …
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

View of Themes: Context-Specific Version of Views
Two themes, Theme 1 “Retrieval Model” and Theme 2 “Feedback”, each appear in a context-specific version for the context “After 1998 (language models)” and for the context “Before 1998 (traditional models)”. Word groups recoverable from the figure include: (a) feedback, language, model, smoothing, query, generation, mixture, estimate, EM, pseudo; (b) vector, space, TF-IDF, Okapi, LSI, Rocchio, weighting, feedback, term, retrieval; (c) retrieve, model, relevance, document, query, feedback, judge, expansion, pseudo, query.
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Coverage of Themes: Distribution over Themes
Theme coverage can depend on context: for the same kind of text (“Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …”), the distribution over the themes Oil Price, Government Response, Aid and donation, and Background differs between Context: Texas and Context: Louisiana.
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

General Tasks of Contextual Text Mining Theme Extraction: Extract the global salient themes Common information shared over all contexts View Comparison: Compare a theme from different views Analyze the content variation of themes over contexts Coverage Comparison: Compare the theme coverage of different contexts Reveal how closely a theme is associated to a context Others: Causal analysis Correlation analysis 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

A General Solution: CPLSA
CPLSA = Contextual Probabilistic Latent Semantic Analysis, an extension of the PLSA model [Hofmann 99] obtained by introducing context variables, modeling views of topics, and modeling coverage variations of topics. Process of contextual text mining: instantiate CPLSA (context, views, coverage); fit the model to text data (EM algorithm); compute probabilistic topic patterns.
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

“Generation” Process of CPLSA
The figure illustrates generating a document with context (Time = July 2005, Location = Texas, Author = xxx, Occupation = Sociologist, Age Group = 45+, …): choose a view (View1, View2, View3, e.g., Texas, July 2005, sociologist); choose a coverage from the theme coverages (e.g., Texas, July 2005, document-specific); choose a theme from the themes (government: government 0.3, response 0.2, …; donation: donate 0.1, relief 0.05, help 0.02, …; New Orleans: city 0.2, new 0.1, orleans 0.05, …); and draw a word from θ_i (e.g., government, response, donate, help, aid, Orleans, new). Example text: “Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …”
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Probabilistic Model
To generate a document D with context feature set C: choose a view v_i according to the view distribution; choose a coverage κ_j according to the coverage distribution; choose a theme according to the coverage κ_j; generate a word using the chosen theme's word distribution under view v_i. The likelihood of the document collection is given below.
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008
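The slide's likelihood formula did not survive extraction; the following is a schematic reconstruction consistent with the generation process just described (the exact notation in the CPLSA paper may differ):

```latex
\log p(\mathcal{D}) = \sum_{(D,\,C) \in \mathcal{D}} \sum_{w \in V} c(w, D)\,
\log \sum_{i} p(v_i \mid D, C) \sum_{j} p(\kappa_j \mid D, C)
\sum_{l=1}^{k} p(\theta_l \mid \kappa_j)\; p(w \mid \theta_l, v_i)
```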

Parameter Estimation: EM Algorithm Interesting patterns: Theme content variation for each view: Theme strength variation for each context Prior from a user can be incorporated using MAP estimation 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Regularization of the Model Why? Generality brings high complexity (inefficiency, multiple local maxima), and real applications have domain constraints/knowledge. Two useful simplifications: Fixed-Coverage: only analyze the content variation of themes (e.g., author-topic analysis, cross-collection comparative analysis); Fixed-View: only analyze the coverage variation of themes (e.g., spatiotemporal theme analysis). In general: impose priors on model parameters; support the whole spectrum from unsupervised to supervised learning. 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Interpretation of Topics
Statistical (multinomial) topic models output word distributions, e.g., term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, … To label them, a candidate label pool (e.g., database system, clustering algorithm, r tree, functional dependency, iceberg cube, concurrency control, index structure, …) is extracted from the collection (context) using an NLP chunker and n-gram statistics; candidates are scored by relevance to the topic, re-ranked for coverage and discrimination, and returned as a ranked list of labels (e.g., clustering algorithm; distance measure; …).
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Relevance: the Zero-Order Score
Intuition: prefer phrases covering high-probability words of the topic. For a latent topic θ whose word distribution p(w|θ) puts high probability on clustering, dimensional, algorithm, birch and low probability on shape, body, a good label (l1) is “clustering algorithm”, which covers high-probability words, while a bad label (l2) such as “body shape” covers only low-probability words.
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Relevance: the First-Order Score
Intuition: prefer phrases with a context (distribution) similar to the topic's. Using a reference collection C (e.g., SIGMOD Proceedings), compare the topic θ with the context distribution of each candidate label: for the good label (l1) “clustering algorithm”, words such as clustering, dimension, partition, algorithm dominate its contexts, so D(θ || l1) is small; for the bad label (l2) “hash join”, the context is dominated by hash, join, so D(θ || l1) < D(θ || l2) and Score(l, θ) favors l1.
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008
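One way to write the two scores down (a hedged reconstruction based on the intuitions above, not a verbatim copy of the lecture's formulas):

```latex
\text{Zero-order: } s_0(l, \theta) = \sum_{w \in l} \log p(w \mid \theta)
\qquad
\text{First-order: } s_1(l, \theta) = \sum_{w \in V} p(w \mid \theta)\,
\log \frac{p(w \mid l, C)}{p(w \mid C)}
 = -D\big(p(\cdot \mid \theta)\,\|\,p(\cdot \mid l, C)\big) + \text{const}(\theta)
```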

Sample Results Comparative text mining Spatiotemporal pattern mining Sentiment summary Event impact analysis Temporal author-topic analysis 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Comparing News Articles: Iraq War (30 articles) vs. Afghan War (26 articles)
The common theme indicates that “United Nations” is involved in both wars; collection-specific themes indicate different roles of “United Nations” in the two wars.
Cluster 1: Common Theme (united 0.042, nations 0.04, …), Iraq-specific (n 0.03, Weapons 0.024, Inspections 0.023, …), Afghan-specific (Northern 0.04, alliance 0.04, kabul 0.03, taleban 0.025, aid 0.02, …)
Cluster 2: Common Theme (killed 0.035, month 0.032, deaths 0.023, …), Iraq-specific (troops 0.016, hoon 0.015, sanches 0.012, …), Afghan-specific (taleban 0.026, rumsfeld 0.02, hotel 0.012, front 0.011, …)
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Comparing Laptop Reviews Top words serve as “labels” for common themes (e.g., [sound, speakers], [battery, hours], [cd,drive]) These word distributions can be used to segment text and add hyperlinks between documents 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Spatiotemporal Patterns in Blog Articles Query= “Hurricane Katrina” Topics in the results: Spatiotemporal patterns 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Theme Life Cycles for Hurricane Katrina Oil Price price 0.0772 oil 0.0643 gas 0.0454 increase 0.0210 product 0.0203 fuel 0.0188 company 0.0182 … New Orleans city 0.0634 orleans 0.0541 new 0.0342 louisiana 0.0235 flood 0.0227 evacuate 0.0211 storm 0.0177 … The upper figure is the life cycles for different themes in Texas. The red line refers to a theme with the top probability words such as price, oil, gas, increase, etc, from which we know that it is talking about “oil price”. The blue one, on the other hand, talks about events that happened in the city “new orleans”. In the upper figure, we can see that both themes were getting hot during the first two weeks, and became weaker around the mid September. The theme New Orleans got strong again around the last week of September while the other theme dropped monotonically. In the bottom figure, which is the life cycles for the same theme “New Orleans” in different states. We observe that this theme reaches the highest probability first in Florida and Louisiana, followed by Washington and Texas, consecutively. During early September, this theme drops significantly in Louisiana while still strong in other states. We suppose this is because of the evacuation in Louisiana. Surprisingly, around late September, a re-arising pattern can be observed in most states, which is most significant in Louisiana. Since this is the time period in which Hurricane Rita arrived, we guess that Hurricane Rita has an impact on the discussion of Hurricane Katrina. This is reasonable since people are likely to mention the two hurricanes together or make comparisons. We can find more clues to this hypothesis from Hurricane Rita data set. 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Theme Snapshots for Hurricane Katrina Week4: The theme is again strong along the east coast and the Gulf of Mexico Week3: The theme distributes more uniformly over the states Week2: The discussion moves towards the north and west Week5: The theme fades out in most states Week1: The theme is the strongest along the Gulf of Mexico This slide shows the snapshot for theme ``Government Response'' over the first five weeks of Hurricane Katrina. The darker the color is, the hotter the discussion about this theme is. we observe that at the first week of Hurricane Katrina, the theme ``Government Response'‘ is the strongest in the southeast states, especially those along the Gulf of Mexico. In week 2, we can see the pattern that the theme is spreading towards the north and western states because the northern states are getting darker. In week 3, the theme is distributed even more uniformly, which means that it is spreading all over the states. However, in week 4, we observe that the theme converges to east states and southeast coast again. Interestingly, this week happens to overlap with the first week of Hurricane Rita, which may raise the public concern about government response again in those areas. In week 5, the theme becomes weak in most inland states and most of the remaining discussions are along the coasts. Another interesting observation is that this theme is originally very strong in Louisiana (the one to the right of Texas, ), but dramatically weakened in Louisiana during week 2 and 3, and becomes strong again from the fourth week. Interestingly, Week 2 and 3 are consistent with the time of evacuation in Louisiana. 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Theme Life Cycles: KDD gene 0.0173 expressions 0.0096 probability 0.0081 microarray 0.0038 … marketing 0.0087 customer 0.0086 model 0.0079 business 0.0048 … rules 0.0142 association 0.0064 support 0.0053 … Global Themes life cycles of KDD Abstracts 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Theme Evolution Graph: KDD (1999-2004)
Example theme word distributions along the evolution graph: {web 0.009, classification 0.007, features 0.006, topic 0.005, …}; {SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …}; {mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …}; {topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …}; {decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …}; {classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …}; {information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …}
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Blog Sentiment Summary (query = “Da Vinci Code”)
Facet 1: Movie. Neutral: “... Ron Howards selection of Tom Hanks to play Robert Langdon.”; “Directed by: Ron Howard Writing credits: Akiva Goldsman ...”; “After watching the movie I went online and some research on ...”. Positive: “Tom Hanks stars in the movie, who can be mad at that?”; “Tom Hanks, who is my favorite movie star act the leading role.”; “Anybody is interested in it? ...”. Negative: “But the movie might get delayed, and even killed off if he loses.”; “protesting ... will lose your faith by ... watching the movie.”; “... so sick of people making such a big deal about a FICTION book and movie.”
Facet 2: Book. Example sentences: “I remembered when i first read the book, I finished the book in two days.”; “Awesome book.”; “I’m reading ‘Da Vinci Code’ now. …”; “So still a good book to past time.”; “This controversy book cause lots conflict in west society.”
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Results: Sentiment Dynamics Facet: the book “ the da vinci code”. ( Bursts during the movie, Pos > Neg ) Facet: religious beliefs ( Bursts during the movie, Neg > Pos ) 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Event Impact Analysis: IR Research xml 0.0678 email 0.0197 model 0.0191 collect 0.0187 judgment 0.0102 rank 0.0097 subtopic 0.0079 … vector 0.0514 concept 0.0298 extend 0.0297 model 0.0291 space 0.0236 boolean 0.0151 function 0.0123 feedback 0.0077 … Publication of the paper “A language modeling approach to information retrieval” Starting of the TREC conferences year 1992 term 0.1599 relevance 0.0752 weight 0.0660 feedback 0.0372 independence 0.0311 model 0.0310 frequent 0.0233 probabilistic 0.0188 document 0.0173 … Theme: retrieval models SIGIR papers 1998 model 0.1687 language 0.0753 estimate 0.0520 parameter 0.0281 distribution 0.0268 probable 0.0205 smooth 0.0198 markov 0.0137 likelihood 0.0059 … probabilist 0.0778 model 0.0432 logic 0.0404 ir 0.0338 boolean 0.0281 algebra 0.0200 estimate 0.0119 weight 0.0111 … 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Temporal-Author-Topic Analysis close 0.0805 pattern 0.0720 sequential 0.0462 min_support 0.0353 threshold 0.0207 top-k 0.0176 fp-tree 0.0102 … index 0.0440 graph 0.0343 web 0.0307 gspan 0.0273 substructure 0.0201 gindex 0.0164 bide 0.0115 xml 0.0109 … project 0.0444 itemset 0.0433 intertransaction 0.0397 support 0.0264 associate 0.0258 frequent 0.0181 closet 0.0176 prefixspan 0.0170 … Author Jiawei Han Rakesh Agrawal Author A Global theme: frequent patterns time 2000 Author B pattern 0.1107 frequent 0.0406 frequent-pattern 0.039 sequential 0.0360 pattern-growth 0.0203 constraint 0.0184 push 0.0138 … research 0.0551 next 0.0308 transaction 0.0308 panel 0.0275 technical 0.0275 article 0.0258 revolution 0.0154 innovate 0.0154 … 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Modeling Topical Communities (Mei et al. 08) Community 1: Information Retrieval Community 2: Data Mining Community 3: Machine Learning 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Other Extensions (LDA Extensions) Many extensions of LDA, mostly done by David Blei, Andrew McCallum, and their co-authors. Some examples: Hierarchical topic models [Blei et al. 03], Modeling annotated data [Blei & Jordan 03], Dynamic topic models [Blei & Lafferty 06], Pachinko allocation [Li & McCallum 06]. Also, some specific context extensions of PLSA, e.g., the author-topic model [Steyvers et al. 04]. 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Future Research Directions Topic models for text mining: evaluation of topic models; improve the efficiency of estimation and inference; incorporate linguistic knowledge; applications in new domains and for new tasks. Text mining in general: combination of NLP-style and DM-style mining algorithms; integrated mining of text (unstructured) and structured data (e.g., Text OLAP); interactive mining (incorporate user constraints and support iterative mining); design and implement mining languages. 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Lecture 5: Key Points Topic models coupled with topic labeling are quite useful for extracting and modeling subtopics in text Adding context variables significantly increases a topic model’s capacity of performing text mining Enable interpretation of topics in context Accommodate variation analysis and correlation analysis of topics over context User’s preferences and domain knowledge can be added as prior or soft constraint 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Readings
PLSA: http://www.cs.brown.edu/~th/papers/Hofmann-UAI99.pdf
LDA: http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf (many recent extensions, mostly done by David Blei and Andrew McCallum)
CPLSA: http://sifaka.cs.uiuc.edu/czhai/pub/kdd06-mix.pdf and http://sifaka.cs.uiuc.edu/czhai/pub/www08-net.pdf
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Discussion Topic models for mining multimedia data Simultaneous modeling of text and images Cross-media analysis Text provides context to analyze images and vice versa 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Course Summary
Scope of the course: Information Retrieval and Text Data (retrieval models/framework, evaluation, feedback, contextual topic models, user modeling, ranking, learning with little supervision), connecting to Multimedia Data, Computer Vision, Natural Language Processing, Machine Learning, and Statistics.
Integrated Multimedia Data Analysis: mutual reinforcement (e.g., text and images), simultaneous mining of text + images + video…
Looking forward to collaborations…
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008

Thank You! 2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008