
1 Sparse Word Graphs: A Scalable Algorithm for Capturing Word Correlations in Topic Models
Ramesh Nallapati
Joint work with John Lafferty, Amr Ahmed, William Cohen and Eric Xing
Machine Learning Department, Carnegie Mellon University
ICDM'07 HPDM Workshop, 8/28/2007

2 Introduction
Statistical topic modeling: an attractive framework for topic discovery
– Completely unsupervised
– Models text very well: lower perplexity compared to unigram models
– Reveals meaningful semantic patterns
– Can help summarize and visualize document collections
– e.g.: PLSA, LDA, DPM, DTM, CTM, PA

3 Introduction
A common assumption in all the variants:
– Exchangeability: the "bag of words" assumption
– Topics represented as a ranked list of words
Consequences:
– Word correlation information is lost
  – e.g.: "white-house" vs. "white" and "house"
  – Long-distance correlations

4 Introduction
Objective:
– To capture correlations between words within topics
Motivation:
– A more interpretable representation of topics as a network of words rather than a list
– Helps better visualize and summarize document collections
– May reveal unexpected relationships and patterns within topics

5 Past Work: Topic Models
Bigram topic models [Wallach, ICML 2006]
– Require KV(V-1) parameters
– Only capture local dependencies
– Do not model sparsity of correlations
– Do not capture "within-topic" correlations
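For reference, the word distribution in the bigram topic model conditions on both the topic and the previous word, which is where the quadratic-in-vocabulary parameter count above comes from (standard formulation, written in my notation rather than taken from the slides):

p(w_t \mid w_{t-1} = w, \; z_t = k) = \phi_{k, w, w_t}, \qquad \sum_{v=1}^{V} \phi_{k, w, v} = 1

so each of the K topics carries V distributions over the V-word vocabulary, i.e. on the order of K V (V - 1) free parameters.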

6 Past work: Other approaches
Hyperspace Analogue to Language (HAL) [Lund and Burgess, Cog. Sci., '96]
– Word-pair correlation measured as a weighted count of the number of times the pair occurs within a fixed-length window
– Weight of an occurrence ∝ 1/(mutual distance)
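A minimal Python sketch of this weighting scheme (the window size, tokenization, and function name are illustrative assumptions, not taken from the HAL paper or the slides):

from collections import defaultdict

def hal_cooccurrence(tokens, window=10):
    # Each co-occurrence of a word pair within the window contributes a
    # weight proportional to 1 / (distance between the two positions).
    counts = defaultdict(float)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            counts[(w, tokens[j])] += 1.0 / (j - i)
    return counts

print(hal_cooccurrence("the white house issued a statement".split()))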

7 Past work: Other approaches
Hyperspace Analogue to Language (HAL) [Lund and Burgess, Cog. Sci., '96]
– Plusses: sparse solutions, scalability
– Minuses: only unearths global correlations, not semantic correlations
  – e.g.: "river – bank", "bank – check"
  – Only local dependencies

8 Past work: Other approaches
Query expansion in IR
– Similar in spirit: finds words that highly co-occur with the query words
– However, not a corpus visualization tool: requires a context to operate on
WordNet
– Semantic networks
– Human-labeled: not directly related to our goal

9 Our approach
L1-norm regularization
– Known to enforce sparse solutions
  – Sparsity permits scalability
– Convex optimization problem
  – Globally optimal solutions
– Recent advances in learning the structure of graphical models: the L1 regularization framework asymptotically leads to the true structure

10 Background: LASSO
Example: linear regression
Regularization used to improve generalizability
– E.g. 1: Ridge regression: L2-norm regularization
– E.g. 2: Lasso: L1-norm regularization
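The two penalized objectives being contrasted, written out for completeness (standard formulations, not transcribed from the slides):

\hat{\beta}_{\mathrm{ridge}} = \arg\min_{\beta} \; \| y - X\beta \|_2^2 + \lambda \| \beta \|_2^2
\hat{\beta}_{\mathrm{lasso}} = \arg\min_{\beta} \; \| y - X\beta \|_2^2 + \lambda \| \beta \|_1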

11 Background: LASSO
Lasso encourages sparse solutions

12 Background: Gaussian Random Fields
Multivariate Gaussian distribution
Random field structure: G = (V, E)
– V: set of all variables {X_1, ..., X_p}
– (s, t) ∈ E ⟺ (Σ⁻¹)_st ≠ 0
– X_s ⊥ X_u | X_N(s) where u ∉ N(s)

13 Background: Gaussian Random Fields
Estimating the graph structure of a GRF from data [Meinshausen and Buhlmann, Ann. Stat., 2006]
– Regress each variable onto the others, imposing an L1 penalty to encourage sparsity
– Estimated neighborhood: the set of variables with nonzero regression coefficients (written out below)
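The neighborhood estimate referred to here, in standard Meinshausen–Buhlmann notation (the exact formula on the slide did not survive transcription, so this is a reconstruction):

\hat{\beta}^{s,\lambda} = \arg\min_{\beta : \, \beta_s = 0} \; \tfrac{1}{n} \| X_s - X\beta \|_2^2 + \lambda \| \beta \|_1
\hat{N}(s) = \{ t \in V : \hat{\beta}^{s,\lambda}_t \neq 0 \}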

14 Background: Gaussian Random Fields
[Figure: true graph vs. estimated graph. Courtesy: Meinshausen and Buhlmann, Ann. Stat., 2006]

15 Background: Gaussian Random Fields
Application to topic models: CTM [Blei and Lafferty, NIPS 2006]

16 Background: Gaussian Random Fields
Application to CTM: [Blei & Lafferty, Ann. Appl. Stat., '07]

17 Structure learning of an MRF
Ising model
L1-regularized conditional likelihood learns the true structure asymptotically [Wainwright, Ravikumar and Lafferty, NIPS '06]
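In that framework, each node's conditional distribution in the Ising model is a logistic function of its neighbors, so structure learning reduces to one L1-regularized logistic regression per node (standard formulation, included here for orientation rather than copied from the slide):

p_\theta(x_s \mid x_{\setminus s}) = \frac{\exp\big( x_s \sum_{t \neq s} \theta_{st} x_t \big)}{2 \cosh\big( \sum_{t \neq s} \theta_{st} x_t \big)}, \qquad x \in \{-1, +1\}^p
\hat{\theta}^{s} = \arg\min_{\theta} \; -\tfrac{1}{n} \sum_{i=1}^{n} \log p_\theta\big( x^{(i)}_s \mid x^{(i)}_{\setminus s} \big) + \lambda \| \theta \|_1

An edge (s, t) is declared present when \hat{\theta}^{s}_t is nonzero.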

18 Structure learning of an MRF
[Figure courtesy: Wainwright, Ravikumar and Lafferty, NIPS '06]

19 Sparse Word Graphs
Algorithm (a code sketch follows this list):
– Run LDA on the document collection and obtain topic assignments
– Convert the topic assignments for each document into K binary vectors X
– Assume an MRF for each topic, with X as the underlying data
– Apply structure learning for the MRF using L1-regularized conditional likelihood
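A minimal sketch of the first two steps, building K binary document-by-word indicator matrices from per-token LDA topic assignments (the data layout and names are my illustrative assumptions, not the authors' implementation):

import numpy as np

def topic_indicator_matrices(doc_assignments, K, V):
    # doc_assignments: one list per document of (word_id, topic_id) pairs
    # taken from an LDA topic-assignment sample.
    # Returns K binary matrices X[k] of shape (n_docs, V), where
    # X[k][d, v] = 1 iff word v is assigned to topic k somewhere in doc d.
    n_docs = len(doc_assignments)
    X = [np.zeros((n_docs, V), dtype=np.int8) for _ in range(K)]
    for d, assignments in enumerate(doc_assignments):
        for word_id, topic_id in assignments:
            X[topic_id][d, word_id] = 1
    return X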

20 Sparse Word Graphs

21 Sparse Word Graphs: Scalability
We still run V logistic regression problems, each of size V, for each topic: O(KV²)!
– However, each example is very sparse
– The L1 penalty results in sparse solutions
– Each topic can be run in parallel
– Efficient interior-point-based L1-regularized logistic regression [Koh, Kim & Boyd, JMLR '07] (a per-word regression loop is sketched below)
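A rough per-topic sketch of this regression step, with scikit-learn's L1-penalized logistic regression standing in for the interior-point solver cited above (the solver choice, the C = 1/λ mapping, and the data layout are illustrative assumptions):

import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression

def word_graph_for_topic(Xk, lam=0.1):
    # Xk: (n_docs, V) binary indicator matrix for one topic.
    # Regress each word's indicator column on all other words with an L1
    # penalty; nonzero coefficients become weighted edges of the word graph.
    Xk = sparse.csc_matrix(Xk)
    n_docs, V = Xk.shape
    edges = {}
    for v in range(V):
        y = np.asarray(Xk[:, v].todense()).ravel()
        if y.min() == y.max():  # skip words that are always absent/present
            continue
        others = [u for u in range(V) if u != v]
        clf = LogisticRegression(penalty="l1", C=1.0 / lam, solver="liblinear")
        clf.fit(Xk[:, others], y)
        for u, coef in zip(others, clf.coef_.ravel()):
            if coef != 0.0:
                edges[(v, u)] = coef
    return edges

Each call to word_graph_for_topic is independent of the others, so the K topics can be dispatched to separate processes or machines.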

22 Experiments
Small AP corpus
– 2.2K docs, 10.5K unique words
Ran a 10-topic LDA model
Used λ = 0.1 in the L1 logistic regression
Took just 45 min. per topic
Very sparse solutions
– Computes only under 0.1% of the total number of possible edges

23 Topic "Business": neighborhood of top LDA terms

24 Topic "Business": neighborhood of top edges

25 Topic "War": neighborhood of top LDA terms

26 Topic "War": neighborhood of top edges

27 Concluding remarks
Pros
– A highly scalable algorithm for capturing within-topic word correlations
– Captures both short-distance and long-distance correlations
– Makes topics more interpretable
Cons
– Not a complete probabilistic model
  – A significant modeling challenge, since the correlations are latent

28 Concluding remarks
Applications of Sparse Word Graphs
– A better document summarization and visualization tool
– Word sense disambiguation
– Semantic query expansion
Future Work
– Evaluation on a "real task"
– Build a unified statistical model

