Predictively Modeling Social Text William W. Cohen Machine Learning Dept. and Language Technologies Institute School of Computer Science Carnegie Mellon University Joint work with: Amr Ahmed, Andrew Arnold, Ramnath Balasubramanyan, Frank Lin, Matt Hurst (MSFT), Ramesh Nallapati, Noah Smith, Eric Xing, Tae Yano

Document modeling with Latent Dirichlet Allocation (LDA). Generative process: for each document d = 1, …, M, generate θ_d ~ Dir(·|α); for each position n = 1, …, N_d, generate z_n ~ Mult(·|θ_d) and generate w_n ~ Mult(·|β_{z_n}).
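A minimal sketch of this generative process in Python (toy dimensions, symmetric hyperparameters, and the Poisson document length are illustrative assumptions, not part of the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, M = 5, 1000, 10            # topics, vocabulary size, documents (toy values)
    alpha, eta = 0.1, 0.01           # assumed symmetric Dirichlet hyperparameters

    beta = rng.dirichlet([eta] * V, size=K)            # one word distribution per topic
    corpus = []
    for d in range(M):
        theta_d = rng.dirichlet([alpha] * K)           # theta_d ~ Dir(alpha)
        N_d = rng.poisson(100)                         # document length (illustrative)
        z = rng.choice(K, size=N_d, p=theta_d)         # z_n ~ Mult(theta_d)
        w = [rng.choice(V, p=beta[k]) for k in z]      # w_n ~ Mult(beta_{z_n})
        corpus.append(w)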

Hyperlink modeling using LinkLDA [Erosheva, Fienberg, Lafferty, PNAS, 2004]. Generative process: for each document d = 1, …, M, generate θ_d ~ Dir(·|α); for each position n = 1, …, N_d, generate z_n ~ Mult(·|θ_d) and w_n ~ Mult(·|β_{z_n}); for each citation j = 1, …, L_d, generate z_j ~ Mult(·|θ_d) and c_j ~ Mult(·|γ_{z_j}). Learning is done with variational EM.
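A matching toy sketch of LinkLDA's generative story, where words and citations share the same document-level topic mixture (dimensions and the name gamma for the per-topic citation distributions are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, C, M = 5, 1000, 500, 10               # topics, vocab size, citable docs, documents (toy)
    beta  = rng.dirichlet([0.01] * V, size=K)   # per-topic word distributions
    gamma = rng.dirichlet([0.01] * C, size=K)   # per-topic citation distributions

    for d in range(M):
        theta_d = rng.dirichlet([0.1] * K)              # shared document-level topic mixture
        z_w = rng.choice(K, size=100, p=theta_d)        # topics for word positions
        words = [rng.choice(V, p=beta[k]) for k in z_w]
        z_c = rng.choice(K, size=20, p=theta_d)         # topics for citation positions
        cites = [rng.choice(C, p=gamma[k]) for k in z_c]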

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]. Generative process: for each author a = 1, …, A, generate θ_a ~ Dir(·|α); for each topic k = 1, …, K, generate φ_k ~ Dir(·|β); for each document d = 1, …, M and each position n = 1, …, N_d, generate an author x ~ Unif(·|a_d), generate z_n ~ Mult(·|θ_x), and generate w_n ~ Mult(·|φ_{z_n}).
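A toy sketch of the Author-Topic generative story (sizes and hyperparameters are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    A, K, V = 50, 10, 1000                       # authors, topics, vocabulary (toy values)
    theta = rng.dirichlet([0.1] * K, size=A)     # per-author topic distributions
    phi   = rng.dirichlet([0.01] * V, size=K)    # per-topic word distributions

    def generate_doc(author_ids, length):
        """Generate one document given its list of authors."""
        words = []
        for _ in range(length):
            x = rng.choice(author_ids)               # x ~ Unif(authors of the document)
            z = rng.choice(K, p=theta[x])            # z ~ Mult(theta_x)
            words.append(rng.choice(V, p=phi[z]))    # w ~ Mult(phi_z)
        return words

    doc = generate_doc(author_ids=[3, 7], length=120)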

Labeled LDA [Ramage, Hall, Nallapati, Manning, EMNLP 2009]

Labeled LDA: Del.icio.us tags as labels for documents

Labeled LDA

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI 2005]

“SNA” = Jensen-Shannon divergence for recipients of messages

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]. Copycat model of citation influence: c is a cited document; s is a coin toss to mix γ and ψ ("plagiarism" vs. "innovation").

s is a coin toss to mix γ and ψ

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007] Citation influence graph for LDA paper

Modeling Citation Influences

User study: self-reported citation influence on a Likert scale. LDA-post is Prob(cited doc | paper); LDA-js is Jensen-Shannon distance in topic space.

Models of hypertext for blogs and scientific literature [ICWSM 2008, KDD 2008]. Ramesh Nallapati, William Cohen, Amr Ahmed, Eric Xing.

Link-PLSA-LDA: a LinkLDA model for citing documents combined with a variant of the PLSA model for cited documents. Topics are shared between citing and cited documents; links depend on the topics in the two documents.

Stochastic Block models: assume (1) nodes within a block z and (2) edges between blocks z_p, z_q are exchangeable. Gibbs sampling: randomly initialize z_p for each node p; for t = 1, …, for each node p, compute the distribution of z_p given the other z's and sample z_p. See: Snijders & Nowicki, 1997, Estimation and Prediction for Stochastic Blockmodels for Groups with Latent Graph Structure.
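A simplified Gibbs sweep along these lines, assuming the between-block edge probabilities eta are known (in a full sampler they would be sampled or re-estimated as well); this is a sketch, not the implementation behind the slides:

    import numpy as np

    def gibbs_sbm(A, K, eta, n_iters=50, seed=0):
        """A : (N, N) binary adjacency matrix; K : number of blocks;
        eta : (K, K) between-block edge probabilities (assumed known here)."""
        rng = np.random.default_rng(seed)
        N = A.shape[0]
        z = rng.integers(K, size=N)                      # randomly initialize z_p
        for _ in range(n_iters):
            for p in range(N):
                log_post = np.zeros(K)
                for k in range(K):
                    pr = np.delete(eta[k, z], p)         # edge prob. from block k to each other node
                    a = np.delete(A[p], p)
                    log_post[k] = np.sum(a * np.log(pr) + (1 - a) * np.log(1 - pr))
                post = np.exp(log_post - log_post.max())
                z[p] = rng.choice(K, p=post / post.sum())   # sample z_p given the other z's
        return z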

Mixed Membership Stochastic Block models [Airoldi et al., JMLR 2008]: each node p has its own block-membership distribution θ_p; for each potential edge (p, q), block indicators z_{p·} and z_{·q} are drawn from θ_p and θ_q, and the edge a_{pq} is drawn from a Bernoulli associated with that pair of blocks.

Pairwise Link-LDA. [Plate diagram: two LDA-style document plates whose topic assignments z jointly generate the citation indicator c for each document pair.]

Pairwise Link-LDA supports new inferences… but doesn't perform better on link prediction.

Want to predict linkage based on similarity of topic distributions, using z's rather than θ's: in Gibbs sampling, the z's are more accessible than the θ's. Only observed links are modeled, but higher link probabilities are penalized. The component-wise product of the per-document expectations over topics is used as the feature vector for a logistic regression function.
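A sketch of how such a link-prediction feature could be built (names like z_counts are hypothetical; the regularization that penalizes high link probabilities for unobserved pairs is omitted):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def zbar(counts):
        """Empirical topic proportions of one document from its sampled z counts."""
        return counts / counts.sum()

    def pair_features(z_counts, pairs):
        """Component-wise product of the two documents' topic proportions, one row per pair."""
        return np.array([zbar(z_counts[i]) * zbar(z_counts[j]) for i, j in pairs])

    # hypothetical usage:
    # X = pair_features(z_counts, train_pairs)           # z_counts: (D, K) topic-assignment counts
    # clf = LogisticRegression().fit(X, train_labels)    # train_labels: 1 = linked, 0 = not linked
    # scores = clf.predict_proba(pair_features(z_counts, test_pairs))[:, 1]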

Experiments. Three hypertext corpora: WebKB, PNAS, Cora; each about k words, 1–3k documents, 1.5–5k links. Measure perplexity in predicting links from words and words from links.
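For reference, the standard held-out perplexity used in such evaluations (this is the usual definition, not a formula copied from the slides; for link perplexity, the words w are replaced by links):

    \mathrm{perplexity}(D_{\mathrm{test}}) \;=\;
    \exp\!\left(-\,\frac{\sum_{d=1}^{M}\log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right)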

Link prediction

Word prediction

Link prediction / Word prediction

Predicting Response to Political Blog Posts with Topic Models [NAACL '09]. Tae Yano, Noah Smith.

Political blogs and comments. Comment style is casual, creative, and less carefully edited. Posts are often coupled with comment sections.

Political blogs and comments. Most of the text associated with large "A-list" community blogs is comments: 5–20x as many words in comments as in post text for the 5 sites considered in Yano et al. A large part of socially-created commentary in the blogosphere is comments, not blog→blog hyperlinks. Comments do not just echo the post.

Modeling political blogs. Our political blog model: CommentLDA. D = # of documents; N = # of words in the post; M = # of words in the comments; z, z′ = topic; w = word (in post); w′ = word (in comments); u = user.

Modeling political blogs. Our proposed political blog model: CommentLDA. The left-hand side is vanilla LDA. D = # of documents; N = # of words in the post; M = # of words in the comments.

Modeling political blogs. Our proposed political blog model: CommentLDA. The right-hand side captures the generation of the reaction separately from the post body, with two separate sets of word distributions; the two chambers (post and comments) share the same topic mixture. D = # of documents; N = # of words in the post; M = # of words in the comments.

Modeling political blogs. Our proposed political blog model: CommentLDA. User IDs of the commenters are generated as part of the comment section, along with the words. D = # of documents; N = # of words in the post; M = # of words in the comments.

Modeling political blogs. Another model we tried: a variant of CommentLDA that is agnostic to the words in the comment section – the comment words are removed, leaving only the commenter IDs. The model is structurally equivalent to LinkLDA (Erosheva et al., 2004). D = # of documents; N = # of words in the post; M = # of words in the comments.

Topic discovery – Matthew Yglesias (MY) site

Topic discovery – Matthew Yglesias (MY) site

Topic discovery – Matthew Yglesias (MY) site

Joint Modeling of Entity-Entity Links and Entity-Annotated Text. Ramnath Balasubramanyan, William W. Cohen (ICML WS 2010, SDM 2011). Language Technologies Institute and Machine Learning Department, School of Computer Science, Carnegie Mellon University.

Motivation: Toward Re-usable "Topic Models". LDA inspired many similar "topic models" – generative models of selected properties of data (e.g., LDA: word co-occurrence in a corpus; sLDA: word co-occurrence and document labels; RelLDA and Pairwise LinkLDA: words and links in hypertext; …). LDA-like models are surprisingly hard to build: conceptually modular, but nontrivial to implement, and high-level toolkits like HBC, BLOG, … have had limited success. An alternative: general-purpose families of models that can be reconfigured and re-tasked for different purposes – somewhere between a modeling language (like HBC) and a task-specific LDA-like topic model.

Motivation: Toward Re-usable "Topic" Models. Examples of re-use of LDA-like topic models: the LinkLDA model, proposed to model text and citations in publications (Erosheva et al., 2004). [Plate diagram: a shared per-document topic mixture θ generates topic assignments z for the words and for the citations.]

Motivation: Toward Re-usable "Topic" Models. Examples of re-use of LDA-like topic models: the LinkLDA model, proposed to model text and citations in publications; re-used to model commenting behavior on blogs (Yano et al., NAACL 2009). [Plate diagram: the citation plate now generates commenter user IDs instead of citations.]

Motivation: Toward Re-usable "Topic" Models. Examples of re-use of LDA-like topic models: the LinkLDA model, proposed to model text and citations in publications; re-used to model commenting behavior on blogs; re-used to model selectional restrictions for information extraction (Ritter et al., ACL 2010). [Plate diagram: the two plates now generate subject and object arguments.]

Motivation: Toward Re-usable "Topic" Models. Examples of re-use of LDA-like topic models: the LinkLDA model, proposed to model text and citations in publications; re-used to model commenting behavior on blogs; re-used to model selectional restrictions for IE; extended and re-used to model multiple types of annotations (e.g., authors, algorithms) and numeric annotations (e.g., timestamps, as in TOT) [our current work].

Motivation: Toward Re-usable "Topic" Models. Examples of re-use of LDA-like topic models: the LinkLDA model, proposed to model text and citations in publications; re-used to model commenting behavior on blogs; re-used to model selectional restrictions for IE. What kinds of models are easy to re-use?

Motivation: Toward Re-usable "Topic" Models. What kinds of models are easy to reuse? What makes re-use possible? What syntactic shape does information often take? (Annotated) text: collections of documents, each containing a bag of words and (one or more) bags of typed entities – simplest case: one entity type → entity-annotated text; complex case: many entity types, time-stamps, … Relations: k-tuples of typed entities – simplest case: k = 2 → entity-entity links; complex case: a relational DB. Combinations of relations and annotated text are also common. Research goal: jointly model the information in annotated text + a set of relations. This talk: one binary relation and one corpus of text annotated with one entity type, and a joint model of both.

Test problem: Protein-protein interactions in yeast. Using known interactions between 844 proteins, curated by the Munich Info Center for Protein Sequences (MIPS). Studied by Airoldi et al. in a 2008 JMLR paper (on mixed membership stochastic block models). [Figure: interaction matrix, index of protein 1 vs. index of protein 2, with a dot where p1, p2 interact; sorted after clustering.]

Test problem: Protein-protein interactions in yeast. Using known interactions between 844 proteins from MIPS, … and 16k paper abstracts from SGD, annotated with the proteins that the papers refer to (all papers about these 844 proteins). Example – English text: "Vac1p coordinates Rab and phosphatidylinositol 3-kinase signaling in Vps45p-dependent vesicle docking/fusion at the endosome. The vacuolar protein sorting (VPS) pathway of Saccharomyces cerevisiae mediates transport of vacuolar protein precursors from the late Golgi to the lysosome-like vacuole. Sorting of some vacuolar proteins occurs via a prevacuolar endosomal compartment and mutations in a subset of VPS genes (the class D VPS genes) interfere with the Golgi-to-endosome transport step. Several of the encoded proteins, including Pep12p/Vps6p (an endosomal target (t) SNARE) and Vps45p (a Sec1p homologue), bind each other directly [1]. Another of these proteins, Vac1p/Pep7p/Vps19p, associates with Pep12p and binds phosphatidylinositol 3-phosphate (PI(3)P), the product of the Vps34 phosphatidylinositol 3-kinase (PI 3-kinase)…" Protein annotations: EP7, VPS45, VPS34, PEP12, VPS21, …

Aside: Is there information about protein interactions in the text? [Figures: MIPS interactions vs. thresholded text co-occurrence counts.]

Question: How to model this? [Same example as above: English abstract text plus protein annotations.] A generic, configurable version of LinkLDA.

Question: How to model this? [Same example as above: English abstract text plus protein annotations.] Instantiation: LinkLDA with a shared per-document topic mixture generating both the words and the protein annotations.

Question: How to model this? [Figure: interaction matrix.] MMSBM of Airoldi et al.: (1) draw K² Bernoulli distributions; (2) draw a θ_i for each protein; (3) for each entry i, j in the matrix: (a) draw z_{i·} from θ_i, (b) draw z_{·j} from θ_j, (c) draw m_{ij} from the Bernoulli associated with the pair of z's.
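A toy sketch of this MMSBM generative process (priors and the Beta(1,1) draw for the block-block edge probabilities are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    K, P = 5, 844
    B = rng.beta(1, 1, size=(K, K))                  # step 1: K^2 Bernoulli (edge) probabilities
    theta = rng.dirichlet([0.1] * K, size=P)         # step 2: a membership vector per protein

    M = np.zeros((P, P), dtype=int)
    for i in range(P):                               # step 3: each entry of the matrix
        for j in range(P):
            zi = rng.choice(K, p=theta[i])           # z_{i.} from theta_i
            zj = rng.choice(K, p=theta[j])           # z_{.j} from theta_j
            M[i, j] = int(rng.random() < B[zi, zj])  # m_ij ~ Bernoulli(B[zi, zj])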

Question: How to model this? [Figure: interaction matrix – these define the "blocks" we prefer.] Sparse block model of Parkinnen et al., 2007: (1) draw K² multinomial distributions β; (2) for each row in the link relation: (a) draw a class pair (z_L, z_R) from π, (b) draw a protein i from the left multinomial associated with the pair, (c) draw a protein j from the right multinomial associated with the pair, (d) add (i, j) to the link relation.
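A toy sketch of this generative story under a simplified reading (the left/right entity multinomials are indexed here by the block on the corresponding side, and π as the name of the distribution over class pairs is an assumption, since the original symbol is not legible in the transcript):

    import numpy as np

    rng = np.random.default_rng(0)
    K, N_ent, N_links = 5, 844, 5000                       # blocks, entities, links (toy values)

    pi    = rng.dirichlet([1.0] * (K * K))                 # distribution over block pairs
    left  = rng.dirichlet([0.1] * N_ent, size=K)           # per-block "left entity" multinomials
    right = rng.dirichlet([0.1] * N_ent, size=K)           # per-block "right entity" multinomials

    links = []
    for _ in range(N_links):
        pair = rng.choice(K * K, p=pi)                     # draw a class pair (z_L, z_R)
        zL, zR = divmod(pair, K)
        i = rng.choice(N_ent, p=left[zL])                  # left entity from its block's multinomial
        j = rng.choice(N_ent, p=right[zR])                 # right entity from its block's multinomial
        links.append((i, j))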

Gibbs sampler for sparse block model. Sampling the class pair for a link combines two factors: the probability of the class pair in the link corpus, and the probability of the two entities in their respective classes.
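A plausible collapsed-Gibbs form of that update under symmetric Dirichlet priors – a reconstruction with my own notation (the counts n exclude the current link ℓ, and E is the number of entities), not an equation copied from the slides:

    P\big(z_\ell=(p,q)\mid i,j,\mathbf{z}_{-\ell}\big)\;\propto\;
    \big(n^{-\ell}_{pq}+\alpha\big)\;\cdot\;
    \frac{n^{-\ell}_{p,i}+\gamma}{n^{-\ell}_{p,\cdot}+E\gamma}\;\cdot\;
    \frac{n^{-\ell}_{q,j}+\gamma}{n^{-\ell}_{q,\cdot}+E\gamma}

The first factor corresponds to the probability of the class pair in the link corpus; the remaining factors correspond to the probabilities of entities i and j in classes p and q.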

BlockLDA: jointly modeling blocks and text Entity distributions shared between “blocks” and “topics”

Recovering the interaction matrix. [Figure panels: MIPS interactions, Sparse Block model, Block-LDA.]

Varying The Amount of Training Data

1/3 of links + all text for training, 2/3 of links for testing. 1/3 of text + all links for training, 2/3 of docs for testing.

Another Performance Test. Goal: predict "functional categories" of proteins – 15 categories at the top level (e.g., metabolism, cellular communication, cell fate, …); proteins have 2.1 categories on average. Method for predicting categories: run with 15 topics; using held-out labeled data, associate each topic with its closest category; if a category has n true members, pick the top n proteins by probability of membership in the associated topic. Metrics: F1, precision, recall.
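A small sketch of this prediction procedure (protein_topic, topic_to_cat, and cat_sizes are hypothetical names for the fitted topic matrix and the held-out mappings):

    import numpy as np

    def predict_categories(protein_topic, topic_to_cat, cat_sizes):
        """protein_topic : (P, 15) probabilities of topic membership per protein;
        topic_to_cat : topic -> closest category (found on held-out labeled data);
        cat_sizes    : category -> number of true members n."""
        preds = {cat: set() for cat in cat_sizes}
        for topic, cat in topic_to_cat.items():
            n = cat_sizes[cat]
            top_n = np.argsort(-protein_topic[:, topic])[:n]   # top-n proteins for this topic
            preds[cat].update(top_n.tolist())
        return preds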

Performance

Enron Corpus. 96,103 emails in "sent folders" – entities in the header are "annotations"; 200,404 links (sender → recipient).

Other Related Work. Link-PLSA-LDA (Nallapati et al.): models linked documents. Nubbi (Chang et al., 2009): discovers relations between entities in text. Topic-Link LDA (Liu et al.): discovers communities of authors from text corpora.

Other related work

Conclusions. Hypothesis: relations + annotated text are a common syntactic representation of data, so joint models for this kind of data should be useful, and BlockLDA is an effective model for it. Results, for yeast protein-protein interaction data: improvements in block modeling when entity-annotated text about the entities involved is added, and improvements in entity perplexity given text when relational data about the entities involved is added.