1 A Biterm Topic Model for Short Texts Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng Institute of Computing Technology, Chinese Academy of Sciences.

Presentation transcript:

1 A Biterm Topic Model for Short Texts
Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng
Institute of Computing Technology, Chinese Academy of Sciences

2 Short Texts Are Prevalent on Today's Web

3 Background
Understanding the topics of short texts is important for many tasks:
- content recommendation
- user interest profiling
- content characterization
- emerging topic detection
- semantic analysis
- ...
This work originated from a browsing recommendation project.

4 Topic Models
Model the generation of documents with a latent topic structure:
- a topic ~ a distribution over words
- a document ~ a mixture of topics
- a word ~ a sample drawn from one topic
Previous studies mainly focus on normal (long) texts. (Illustration from Blei.)
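As background, a minimal sketch of this per-document generative story in Python (assuming symmetric Dirichlet priors; the function and variable names are illustrative, not the authors' code):

    import numpy as np

    def generate_document(n_words, alpha, phi, rng=np.random.default_rng()):
        """Illustrative LDA-style generation of one document.
        phi: (K, V) array holding one word distribution per topic."""
        K, V = phi.shape
        theta = rng.dirichlet([alpha] * K)         # this document's topic mixture
        words = []
        for _ in range(n_words):
            z = rng.choice(K, p=theta)             # pick a topic for this word position
            words.append(rng.choice(V, p=phi[z]))  # draw the word from that topic
        return words

Each document gets its own topic mixture theta, and it is exactly this per-document estimate that becomes unreliable when a document contains only a handful of words.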

5 Problem on Short Texts: Data Sparsity
Word counts are not discriminative:
- normal document: topical words occur frequently
- short message: most words occur only once
Not enough context to identify the senses of ambiguous words:
- normal document: rich context, many relevant words
- short message: limited context, few relevant words
The severe data sparsity makes conventional topic models less effective on short texts.

6 Previous Approaches on Short Texts
- Document aggregation: e.g., aggregating the tweets published by the same user; heuristic, not general
- Mixture of unigrams: each document has only one topic; too strict an assumption, yields peaked posteriors P(z|d)
- Sparse topic models: add sparsity constraints on a document's distribution over topics, e.g., the Focused Topic Model; too complex, easy to overfit

7 Key Idea
A topic is basically a group of correlated words, and the correlation is revealed by word co-occurrence patterns in documents:
- so why not directly model word co-occurrences for topic learning?
Conventional topic models suffer from the severe sparsity of co-occurrence patterns within individual short documents:
- so why not instead exploit the rich global word co-occurrence patterns across the whole corpus to better reveal topics?

8 Biterm Topic Model (BTM)
A biterm is an unordered pair of words co-occurring in the same short text (extraction sketch below). Model the generation of biterms with a latent topic structure:
- a topic ~ a distribution over words
- the corpus ~ a mixture of topics
- a biterm ~ two words drawn from one topic
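A minimal sketch of biterm extraction from one short message in Python (all unordered word pairs within the message; names are illustrative):

    from itertools import combinations

    def extract_biterms(doc_tokens):
        """Return all unordered word pairs (biterms) in a short document."""
        # Sort each pair so that (a, b) and (b, a) count as the same biterm.
        return [tuple(sorted(pair)) for pair in combinations(doc_tokens, 2)]

    # A three-word message yields three biterms:
    # extract_biterms(["visit", "apple", "store"])
    # -> [('apple', 'visit'), ('store', 'visit'), ('apple', 'store')]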

9 Generation Procedure of Biterms
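In outline, the generative story is corpus-level rather than document-level; a minimal sketch (alpha and beta are the Dirichlet hyperparameters, K topics, V vocabulary words; names are illustrative):

    import numpy as np

    def generate_biterms(n_biterms, alpha, beta, K, V, rng=np.random.default_rng()):
        """Illustrative BTM generative story: one topic mixture for the whole
        corpus, and both words of each biterm drawn from the same topic."""
        theta = rng.dirichlet([alpha] * K)              # corpus-level topic mixture
        phi = rng.dirichlet([beta] * V, size=K)         # per-topic word distributions
        biterms = []
        for _ in range(n_biterms):
            z = rng.choice(K, p=theta)                  # topic of this biterm
            w_i, w_j = rng.choice(V, p=phi[z], size=2)  # both words come from topic z
            biterms.append((w_i, w_j))
        return biterms

Compared with the LDA-style sketch above, the per-document theta disappears: the topic mixture is shared by the whole corpus, so document-level sparsity no longer hurts the topic estimates.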

10 Inferring Topics in a Document
Assumption: the topic proportions of a document equal the expectation of the topic proportions of the biterms it contains:
    P(z|d) = Σ_b P(z|b) P(b|d)
where P(z|b) is proportional to theta_z * phi(w_i|z) * phi(w_j|z) for biterm b = (w_i, w_j), and P(b|d) is the relative frequency of biterm b among the biterms of document d.
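A minimal sketch of this inference step in Python (theta and phi are the learned corpus-level estimates; iterating over biterm occurrences realizes the empirical P(b|d); names are illustrative):

    import numpy as np

    def doc_topic_proportions(doc_biterms, theta, phi):
        """Infer P(z|d) for one document from its biterms.
        theta: (K,) corpus topic mixture; phi: (K, V) per-topic word distributions."""
        p_zd = np.zeros_like(theta)
        for (w_i, w_j) in doc_biterms:
            p_zb = theta * phi[:, w_i] * phi[:, w_j]   # unnormalized P(z|b)
            p_zd += p_zb / p_zb.sum()
        # Summing over biterm occurrences (with repeats) weights each distinct
        # biterm by its frequency in the document.
        return p_zd / len(doc_biterms)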

11 Parameter Inference
Gibbs sampling:
- sample a topic assignment for each biterm (see the sketch below)
- estimate the parameters theta and phi from the resulting topic-assignment counts
BTM is more memory-efficient than LDA: during sampling it maintains only corpus-level counts, with no per-document topic distributions to store.
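A minimal sketch of the collapsed Gibbs update in Python (symmetric priors alpha and beta assumed; the conditional multiplies a topic's biterm count by the probability of drawing both words from that topic; the normalization detail is noted in the comments; all names are illustrative):

    import numpy as np

    def gibbs_sweep(biterms, z, n_z, n_wz, alpha, beta, rng=np.random.default_rng()):
        """One sweep of collapsed Gibbs sampling over all biterms (sketch).
        biterms: list of (w_i, w_j) word-id pairs
        z:       current topic assignment per biterm
        n_z:     (K,) number of biterms assigned to each topic
        n_wz:    (K, V) word counts per topic (each biterm adds two counts)"""
        K, V = n_wz.shape
        for b, (w_i, w_j) in enumerate(biterms):
            k = z[b]
            # Remove the biterm's current assignment from the counts.
            n_z[k] -= 1
            n_wz[k, w_i] -= 1
            n_wz[k, w_j] -= 1
            # Conditional: topic popularity times the probability of drawing
            # both words from that topic (sequential-draw normalization; some
            # implementations approximate the denominator with a square).
            n_k = n_wz.sum(axis=1)                 # recomputed here for clarity only
            p = (n_z + alpha) * (n_wz[:, w_i] + beta) * (n_wz[:, w_j] + beta) \
                / ((n_k + V * beta) * (n_k + V * beta + 1))
            k = rng.choice(K, p=p / p.sum())
            # Add the biterm back with its newly sampled topic.
            z[b] = k
            n_z[k] += 1
            n_wz[k, w_i] += 1
            n_wz[k, w_j] += 1
        return z, n_z, n_wz

    # After burn-in, point estimates (symmetric priors assumed):
    #   phi[k, w] = (n_wz[k, w] + beta) / (n_wz[k].sum() + V * beta)
    #   theta[k]  = (n_z[k] + alpha)    / (n_z.sum() + K * alpha)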

12 Experiments: Datasets

                      Tweets2011      Question        20Newsgroup
                      (short text)    (short text)    (normal text)
    #documents        4,230,578       189,080         18,828
    #words            98,857          26,565          42,697
    #users            2,039,877       -               -
    #categories
    avg doc length (after pre-processing)

13 Experiments: Tweets2011 Collection
Topic quality
- Metric: average coherence score (Mimno '11) over the top T words of each topic
- A larger coherence score means the topics are more coherent
D. Mimno, H. Wallach, E. Talley, M. Leenders, and A. McCallum. Optimizing semantic coherence in topic models. EMNLP 2011.
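For reference, a minimal sketch of that coherence score for a single topic (doc_freq[w] counts documents containing w, co_doc_freq counts documents containing both words of a pair; function and argument names are illustrative):

    import math

    def topic_coherence(top_words, doc_freq, co_doc_freq):
        """Coherence of one topic over its top-T words (Mimno et al., 2011).
        doc_freq[w]: #documents containing w.
        co_doc_freq[(w1, w2)]: #documents containing both w1 and w2
        (pairs assumed stored in both orders or canonicalized by the caller)."""
        score = 0.0
        for m in range(1, len(top_words)):
            for l in range(m):
                w_m, w_l = top_words[m], top_words[l]
                score += math.log((co_doc_freq.get((w_m, w_l), 0) + 1) / doc_freq[w_l])
        return score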

14 Experiments: Tweets2011 Collection
Quality of the topic proportions of documents (i.e., P(z|d)):
- select 50 frequent and meaningful hashtags as class labels
- organize documents with the same hashtag into a cluster
- measure: H score; a smaller value indicates better agreement with the human-labeled classes
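The H score contrasts how close documents of the same class are with how close documents of different classes are in topic space; a rough sketch, assuming Jensen-Shannon distance between P(z|d) vectors (the exact distance and averaging used in the experiments may differ):

    import numpy as np
    from scipy.spatial.distance import jensenshannon

    def h_score(doc_topics, labels):
        """Sketch of an H score: mean intra-class distance divided by
        mean inter-class distance over documents' P(z|d) vectors."""
        intra, inter = [], []
        n = len(labels)
        for i in range(n):
            for j in range(i + 1, n):
                d = jensenshannon(doc_topics[i], doc_topics[j])
                (intra if labels[i] == labels[j] else inter).append(d)
        return np.mean(intra) / np.mean(inter)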

15 Experiments: Question Collection
Evaluated by document classification (linear SVM)
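In such a setup the inferred topic proportions P(z|d) serve as the document features; a minimal illustrative sketch with scikit-learn (placeholder data with the right shapes, not the authors' pipeline):

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    doc_topics = rng.dirichlet(np.ones(50), size=1000)  # stand-in for P(z|d): 1000 docs x 50 topics
    labels = rng.integers(0, 10, size=1000)             # stand-in category labels

    accuracy = cross_val_score(LinearSVC(), doc_topics, labels, cv=5).mean()
    print(accuracy)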

16 Experiments: 20Newsgroup Collection (Normal Texts)
Biterm extraction: any two words co-occurring closely (with distance no larger than a threshold r); a sketch follows
Clustering results
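A minimal sketch of this windowed extraction for normal-length texts (word positions at most r apart; names are illustrative):

    def extract_biterms_windowed(tokens, r):
        """Biterms from a longer document: unordered pairs of words whose
        positions are at most r apart."""
        biterms = []
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + r + 1, len(tokens))):
                biterms.append(tuple(sorted((tokens[i], tokens[j]))))
        return biterms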

17 Summary
A practical but not well-studied problem: topic modeling on short texts
- conventional topic models suffer from severe data sparsity when modeling the generation of short text messages
A generative model: the Biterm Topic Model (BTM)
- models word co-occurrences (biterms) to uncover topics
- fully exploits the rich global word co-occurrence patterns
- general and effective
Future work
- better ways to infer topic proportions for short text messages
- explore BTM in real-world applications

18 Thank You! More Information: