One Theme in All Views: Modeling Consensus Topics in Multiple Contexts
Jian Tang 1, Ming Zhang 1, Qiaozhu Mei 2
1 School of EECS, Peking University
2 School of Information, University of Michigan
User-Generated Content (UGC)
- A huge amount of user-generated content is produced every day.
- Profit from user-generated content: $1.8 billion for Facebook, $0.9 billion for YouTube.
- Applications: online advertising, recommendation, policy making.
Topic Modeling for Data Exploration
- Infer the hidden themes (topics) within the data collection.
- Annotate the data with the discovered themes.
- Explore and search the entire collection with the annotations.
Key idea: document-level word co-occurrences - words appearing in the same document tend to take on the same topics.
Challenges of Topic Modeling on User-Generated Content
Traditional media: benign document length, controlled vocabulary size, refined language.
Social media: short document length, large vocabulary size, noisy language.
As a result, document-level word co-occurrences in UGC are sparse and noisy!
Rich Context Information
Why Does Context Help?
- Document-level word co-occurrences: words appearing in the same document tend to take on the same topic; sparse and noisy.
- Context-level word co-occurrences: much richer. E.g., words written by the same user tend to take on the same topics; words surrounding the same hashtag tend to take on the same topic.
- Note that this may not hold for all contexts!
Existing Ways to Utilize Contexts
- Concatenate the documents in a particular context into a longer pseudo-document.
- Introduce particular context variables into the generative process, e.g., Rosen-Zvi et al. (author context), Wang et al. (time context), Yin et al. (location context).
- A coin-flipping process to select among multiple contexts, e.g., Ahmed et al. (ideology context, document context).
Cons:
- Complicated graphical structure and inference procedure
- Cannot generalize to arbitrary contexts
- The coin-flipping approach makes the data even sparser
Coin-Flipping: Competition among Contexts
Under coin-flipping, each word token is attributed to a single context, so the contexts compete for tokens. Competition makes the data even sparser!
Type of Context, Context, View
- Context: a subset of the corpus, or a pseudo-document, defined by a value of a type of context (e.g., the tweets by a user).
- Type of context: a metadata variable, e.g., user, time, hashtag, tweet.
- View: a partition of the corpus according to a type of context.
(Illustration: views defined by user, time, and hashtag, e.g., #kdd2013, #jobs.)
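The view construction above can be sketched as grouping tokens into pseudo-documents keyed by a context value. A minimal illustration, where the record fields and context types are hypothetical:

```python
from collections import defaultdict

def build_view(corpus, context_type):
    """Partition a corpus into pseudo-documents, one per context value.

    corpus: list of records, each a dict with a "words" list and
            metadata fields (e.g., "user", "hashtag").
    context_type: the metadata field that defines the view.
    """
    view = defaultdict(list)
    for record in corpus:
        # All tokens sharing a context value land in one pseudo-document.
        view[record[context_type]].extend(record["words"])
    return dict(view)

corpus = [
    {"user": "u1", "hashtag": "#kdd2013", "words": ["topic", "model"]},
    {"user": "u1", "hashtag": "#jobs",    "words": ["hiring", "intern"]},
    {"user": "u2", "hashtag": "#kdd2013", "words": ["poster", "session"]},
]

user_view = build_view(corpus, "user")        # partition by user
hashtag_view = build_view(corpus, "hashtag")  # partition by hashtag
```

The same corpus yields a different partition under each type of context, which is exactly what makes the views complementary.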
Competition → Collaboration
- Collaboration utilizes different views of the data.
- Let different types of contexts vote for topics in common (topics that stand out from multiple views are more robust).
- Allow each type (view) to keep its own version of (view-specific) topics.
How? A Co-regularization Framework
(View: a partition of the corpus into pseudo-documents.)
Each view maintains its own view-specific topics; all views share a set of consensus topics.
Objective: minimize the disagreement between the individual opinions (view-specific topics) and the consensus topics.
The General Co-regularization Framework
Each view's view-specific topics are tied to the shared consensus topics through a KL-divergence penalty.
Objective: minimize the disagreement between the individual opinions (view-specific topics) and the consensus topics.
Learning Procedure: Variational EM
- Variational E-step (mean-field): update the topic assignments of each token in each view.
- M-step: update the view-specific topics; update the consensus topics via a geometric mean combining the topic-word counts from each view c with the topic-word probabilities of the consensus topics.
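The consensus update above can be sketched as a normalized geometric mean over the views' topic-word probabilities for each topic. This is a simplified reading of the M-step; the paper's update also involves counts and smoothing that are omitted here:

```python
import math

def consensus_update(view_probs):
    """Consensus topic-word distribution for one topic, computed as the
    normalized geometric mean of view-specific probability vectors.

    view_probs: list of per-view probability vectors (same vocabulary order).
    """
    n_views = len(view_probs)
    n_words = len(view_probs[0])
    # Geometric mean per word: exp of the average log-probability.
    geo = [
        math.exp(sum(math.log(view[w]) for view in view_probs) / n_views)
        for w in range(n_words)
    ]
    # Renormalize so the result is again a distribution.
    z = sum(geo)
    return [g / z for g in geo]
```

The geometric mean rewards words that every view agrees on: a word that any single view assigns near-zero probability is pushed toward zero in the consensus.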
Experiments
Datasets:
- Twitter: user, hashtag, tweet
- DBLP: author, conference, title
Metric: topic semantic coherence - the average pointwise mutual information of word pairs among the top-ranked words (D. Newman et al., 2010).
External task: user/author clustering
- Partition users/authors by assigning each user/author to the most probable topic.
- Evaluate the partition on the social network with modularity (M. Newman, 2006).
- Intuition: better topics should correspond to better communities on the social network.
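The coherence metric can be sketched as average pairwise PMI over a topic's top words, estimated from document co-occurrence counts. This is a simplified version of the Newman et al. measure; the additive smoothing constant is an assumption:

```python
import math
from itertools import combinations

def topic_coherence(top_words, documents, eps=1.0):
    """Average pointwise mutual information over pairs of top words.

    top_words: top-ranked words of one topic.
    documents: list of word sets used to estimate (co-)occurrence.
    eps: smoothing count to avoid log(0) for unseen pairs.
    """
    n_docs = len(documents)
    def p(*words):
        # Smoothed probability that all given words occur in a document.
        hits = sum(1 for d in documents if all(w in d for w in words))
        return (hits + eps) / (n_docs + eps)
    scores = [
        math.log(p(w1, w2) / (p(w1) * p(w2)))
        for w1, w2 in combinations(top_words, 2)
    ]
    return sum(scores) / len(scores)
```

Words that frequently co-occur score above zero; words that never appear together score below it, so higher averages indicate more semantically coherent topics.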
Topic Coherence (Twitter)
Single type of context: LDA(Hashtag) > LDA(User) >> LDA(Tweet)
Multiple types of contexts: CR(User+Hashtag) > ATM > Coin-Flipping; CR(User+Hashtag) > CR(User+Hashtag+Tweet)

Algorithm        Topic coherence
LDA (User)       1.94
LDA (Hashtag)    2.54
LDA (Tweet)      n/a

Algorithm                       Hashtag   Consensus
ATM (User+Hashtag)              -2.15     n/a
Coin-Flipping (User+Hashtag)    n/a       n/a
CR (User+Tweet)                 -1.67     n/a
CR (User+Hashtag)               n/a       n/a
CR (Hashtag+Tweet)              n/a       n/a
CR (User+Hashtag+Tweet)         n/a       n/a
User Clustering (Twitter)
CR(User+Hashtag) > LDA(User); CR(User+Hashtag) > CR(User+Hashtag+Tweet)

Type               Algorithm                  Modularity
Single context     LDA (User)                 0.445
Multiple contexts  CR (User+Hashtag)          0.491
                   CR (User+Tweet)            0.457
                   CR (User+Hashtag+Tweet)    0.480
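The modularity score used for evaluation can be sketched directly from Newman's definition, Q = sum over communities of (e_cc - a_c^2), for an undirected graph. The edge list and community labels below are hypothetical:

```python
def modularity(edges, community):
    """Newman modularity Q = sum_c (e_cc - a_c^2) for an undirected graph.

    edges: list of (u, v) pairs.
    community: {node: community_id}, the partition to evaluate.
    """
    m = len(edges)
    e = {}   # fraction of edges with both endpoints inside community c
    a = {}   # fraction of edge endpoints attached to community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        a[cu] = a.get(cu, 0.0) + 0.5 / m
        a[cv] = a.get(cv, 0.0) + 0.5 / m
        if cu == cv:
            e[cu] = e.get(cu, 0.0) + 1.0 / m
    return sum(e.get(c, 0.0) - a[c] ** 2 for c in a)
```

For two triangles joined by a single bridge edge, splitting at the bridge yields a clearly positive Q, while lumping everything into one community yields Q = 0.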
Topic Coherence (DBLP)
Single type of context: LDA(Author) > LDA(Conference) >> LDA(Title)
Multiple types of contexts: CR(Author+Conference) > ATM > Coin-Flipping; CR(Author+Conference+Title) > CR(Author+Conference)

Algorithm          Topic coherence
LDA (Author)       0.613
LDA (Conference)   0.569
LDA (Title)        n/a

Algorithm                          Author   Consensus
ATM (Author+Conference)            n/a      n/a
Coin-Flipping (Author+Conference)  n/a      n/a
CR (Author+Conference)             n/a      n/a
CR (Conference+Title)              n/a      n/a
CR (Author+Conference+Title)       n/a      n/a
Author Clustering (DBLP)
CR(Author+Conference) > LDA(Author); CR(Author+Conference) > CR(Author+Conference+Title)

Type               Algorithm                     Modularity
Single context     LDA (Author)                  0.289
Multiple contexts  CR (Author+Title)             0.288
                   CR (Author+Conference)        0.298
                   CR (Author+Conference+Title)  0.295
Summary
- Utilizing multiple types of contexts enhances topic modeling on user-generated content.
- Each type of context defines a partition (view) of the whole corpus.
- A co-regularization framework lets multiple views collaborate with each other.
Future work:
- How to select contexts
- How to weight the contexts differently
Thanks!
- Acknowledgements: NSF IIS , IIS , CCF ;
- NSFC , China Scholarship Council (CSC );
- Twitter.com
Multi-contextual LDA
Notation:
- c: context type; x: context value; z: topic assignment
- the context type proportion; the context values of each type; the topic proportions of contexts; the word distributions of topics
To sample a word:
(1) Sample a context type c according to the context type proportion.
(2) Uniformly sample a context value x of type c.
(3) Sample a topic assignment z from the distribution over topics associated with x.
(4) Sample a word w from the distribution over words associated with z.
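The four-step generative process above can be sketched as follows; the dictionary-based parameterization and all distribution values are illustrative, not the paper's estimated parameters:

```python
import random

def sample_word(context_type_proportion, context_values, theta, phi, rng):
    """One draw from the multi-contextual LDA generative process.

    context_type_proportion: {context_type: probability}.
    context_values: {context_type: [values attached to this document]}.
    theta: {context_value: {topic: probability}}  (topic proportions).
    phi:   {topic: {word: probability}}           (word distributions).
    rng:   a random.Random instance.
    """
    # (1) Sample a context type c according to the context type proportion.
    c = rng.choices(list(context_type_proportion),
                    weights=context_type_proportion.values())[0]
    # (2) Uniformly sample a context value x of type c.
    x = rng.choice(context_values[c])
    # (3) Sample a topic z from the topic proportions associated with x.
    z = rng.choices(list(theta[x]), weights=theta[x].values())[0]
    # (4) Sample a word w from topic z's word distribution.
    return rng.choices(list(phi[z]), weights=phi[z].values())[0]
```

Because step (1) picks one context type per token, this formulation exhibits the coin-flipping competition among contexts that the co-regularization framework is designed to avoid.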
Parameter Sensitivity