Topic Modeling for Short Texts with Auxiliary Word Embeddings


Topic Modeling for Short Texts with Auxiliary Word Embeddings
Ruiyuan Jiang

Outline
- Problem Statement
- Previous Methods
- GPU-DMM Model
- Experiment & Evaluation

Problem Statement
Short texts have become a prevalent form of information on the Internet, e.g., web page snippets, news headlines, tweets, and Facebook statuses.
Conventional topic modeling techniques (PLSA and LDA) are designed to implicitly capture word co-occurrence patterns at the document level in order to reveal topic structures. They are not sufficient for short-text corpora: because each document is so short, the limited word co-occurrence information leads to a large performance degradation.
Discovering latent topics from short texts effectively and efficiently is therefore important for applications such as user interest profiling, topic detection, comment summarization, content characterization, and classification.
Several strategies have been proposed, but every method has its restrictions: metadata is not always available, and word co-occurrence information remains insufficient.

Merging short texts into long pseudo-documents
- Aggregate tweets from the same user into a pseudo-document before running the standard LDA model.
- Other metadata used for short-text aggregation includes hashtags, timestamps, and named entities.
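As a rough illustration of this aggregation strategy (the (user_id, text) input layout and the function name are illustrative assumptions, not taken from the paper), tweets can be merged per user like so:

```python
# Merge short texts (e.g., tweets) by author into long pseudo-documents
# before running a standard topic model such as LDA.
from collections import defaultdict

def build_pseudo_documents(tweets):
    """tweets: iterable of (user_id, text) pairs."""
    by_user = defaultdict(list)
    for user_id, text in tweets:
        by_user[user_id].append(text)
    # One long pseudo-document per user, formed by concatenation.
    return {user: " ".join(texts) for user, texts in by_user.items()}
```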

New topic models for short texts
1. The Biterm Topic Model explicitly models the generation of word co-occurrence patterns (biterms) instead of single words, as is done in most topic models.
2. The Self-Aggregation based Topic Model (SATM) assumes that each short text is a segment of a long pseudo-document and shares the topic proportions of that pseudo-document, so aggregation is driven by the topical similarity of the texts.

GPU-DMM Model
1. Dirichlet Mixture Model (DMM)
2. Auxiliary word embeddings
3. Incorporating word embeddings via the GPU: the generalized Polya urn model, word filtering, and model inference.
The proposed GPU-DMM extends the Dirichlet Multinomial Mixture (DMM) model by incorporating word relatedness learned from a large external text corpus through the generalized Polya urn (GPU) model during topic inference.

Dirichlet Mixture Model
The Dirichlet Mixture Model (DMM) is a generative probabilistic model built on the assumption that a document is generated from a single topic; that is, all the words within a document are generated using the same topic distribution.
Goal: acquire p(w | z = k), where w represents a word and k a topic.
Generative process:
1. Draw topic proportions θ ~ Dirichlet(α) and, for each topic k, a word distribution φ_k ~ Dirichlet(β).
2. For each document d, draw a single topic z_d ~ Multinomial(θ), then draw each of its N_d words from Multinomial(φ_{z_d}).
Notation:
- D: number of documents
- V: vocabulary size
- K: number of pre-defined latent topics
- N_d: number of words in document d
- α, β: Dirichlet priors
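A minimal sketch of this generative story, assuming illustrative corpus sizes and hyperparameter values (D, V, K, alpha, beta below are not the paper's settings):

```python
import numpy as np

def generate_dmm_corpus(D=1000, V=5000, K=20, doc_len=10,
                        alpha=0.1, beta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(alpha * np.ones(K))          # corpus-level topic proportions
    phi = rng.dirichlet(beta * np.ones(V), size=K)     # one word distribution per topic
    docs, topics = [], []
    for _ in range(D):
        z = rng.choice(K, p=theta)                     # a single topic per document
        words = rng.choice(V, size=doc_len, p=phi[z])  # every word drawn from that topic
        docs.append(words)
        topics.append(z)
    return docs, topics, phi
```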

Dirichlet Mixture Model
The hidden variables in the generative process can be approximated by collapsed Gibbs sampling (following the sampling scheme of an earlier DMM paper). A topic z is sampled for each document d according to the conditional distribution

p(z_d = k | Z_¬d, D) ∝ (m_{k,¬d} + α) / (D − 1 + Kα) × [ ∏_{w∈d} ∏_{j=1}^{N_{d,w}} (n_{k,¬d}^w + β + j − 1) ] / [ ∏_{i=1}^{N_d} (n_{k,¬d} + Vβ + i − 1) ]

Finally, the posterior topic-word distribution is obtained by point estimation:

φ_{k,w} = p(w | z = k) = (n_k^w + β) / (n_k + Vβ)

where
- N_{d,w}: term frequency of word w in document d
- m_k: number of documents assigned to topic k
- n_k^w: number of times word w is assigned to topic k
- n_k = Σ_w n_k^w
and the subscript ¬d means the counts exclude document d.
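A sketch of the per-document Gibbs update implied by the conditional distribution above (variable names follow the slide's notation; passing counts with document d already removed is an assumption of this sketch, and the constant denominator is dropped):

```python
import numpy as np

def sample_doc_topic(doc_counts, m, n_kw, n_k, V, K, alpha, beta, rng):
    """doc_counts: {word_id: frequency in document d}; the counts m (documents
    per topic), n_kw (word-topic counts), n_k (words per topic) must already
    exclude document d (the '¬d' convention)."""
    log_p = np.log(m + alpha)            # (m_{k,¬d} + alpha); constant denominator dropped
    N_d = sum(doc_counts.values())
    for k in range(K):
        for w, freq in doc_counts.items():
            for j in range(freq):
                log_p[k] += np.log(n_kw[k, w] + beta + j)
        for i in range(N_d):
            log_p[k] -= np.log(n_k[k] + V * beta + i)
    p = np.exp(log_p - log_p.max())      # exponentiate in a numerically stable way
    return rng.choice(K, p=p / p.sum())
```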

Auxiliary Word Embeddings
Word relatedness is learned from an external corpus: for a given word, semantically related words are found from pre-trained word embeddings. Such words may not co-occur frequently in the short-text collection itself.
The word embeddings used here are learned from the Google News corpus (about 100 billion words), with a vocabulary of 3 million words.
Table 1 lists 5 example words in the left-hand column; the number following each word is its document frequency in the dataset. In the right-hand column, for each example word, two semantically related words obtained from the pre-trained embeddings are shown. The two numbers following each related word are (i) its document frequency and (ii) the number of its co-occurrences with the example word in the dataset.
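A small sketch of how such related words can be retrieved from pre-trained embeddings by cosine similarity (the vocabulary/embedding layout is an assumption; a library such as gensim could be used instead):

```python
import numpy as np

def top_related_words(word, vocab, emb, topn=2):
    """vocab: list of words; emb: (len(vocab), dim) array of word vectors."""
    idx = vocab.index(word)
    norms = np.linalg.norm(emb, axis=1)
    sims = emb @ emb[idx] / (norms * norms[idx] + 1e-12)  # cosine similarities
    sims[idx] = -np.inf                                   # exclude the word itself
    best = np.argsort(-sims)[:topn]
    return [(vocab[i], float(sims[i])) for i in best]
```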

Generalized Polya Urn Model
Intuition: think of colored balls in an urn, where the probability of seeing a ball of a given color is linearly proportional to the number of balls of that color in the urn. When a ball of a particular color is sampled, it is put back into the urn together with a certain number of new balls of similar colors, so the set of similar colors is promoted over the course of the process. Here, given a ball for word w, the "balls of similar colors" are the words semantically related to w. As a result, sampling a word w under topic t not only increases the probability of w itself under topic t, but also strengthens the association between topic t and w's semantically related words.
Formally:
- Measure the semantic relatedness between two words by the cosine similarity of their embeddings.
- Build a semantic relatedness matrix from the obtained similarities.
- Build a promotion matrix based on the semantic relatedness matrix.
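A minimal sketch of the relatedness and promotion matrices, assuming a cosine-similarity threshold (epsilon) and a promotion weight (mu) as illustrative hyperparameter names:

```python
import numpy as np

def build_promotion_matrix(emb, epsilon=0.5, mu=0.3):
    """emb: (V, dim) matrix of embeddings for the corpus vocabulary."""
    normed = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12)
    sim = normed @ normed.T                 # semantic relatedness matrix (cosine)
    A = np.where(sim > epsilon, mu, 0.0)    # promote only sufficiently related words
    np.fill_diagonal(A, 1.0)                # a sampled word always promotes itself fully
    return A
```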

Word Filtering
Idea: prevent irrelevant semantically related words from being promoted into the document. For example: "info website web cern consortium server project world copy first".
A probabilistic sampling strategy is proposed. The intuition is that if a word is highly relevant to a topic, then when the sampled topic of the document is actually that topic, the word has a higher chance of being picked and going through the word-embedding promotion process; otherwise the promotion is skipped.
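A sketch of one way to realize this filter, assuming (as a reconstruction of the intuition above, not quoted from the slides) that promotion for word w in a document assigned to topic z_d is applied with probability p(z_d | w) / max_k p(k | w):

```python
import numpy as np

def promotion_probability(w, z_d, phi, topic_prior):
    """phi: (K, V) current topic-word estimates; topic_prior: (K,) estimate of p(z=k)."""
    p_k_given_w = phi[:, w] * topic_prior
    p_k_given_w = p_k_given_w / p_k_given_w.sum()
    return p_k_given_w[z_d] / p_k_given_w.max()   # 1.0 when z_d is w's most likely topic

def apply_promotion(w, z_d, phi, topic_prior, rng):
    # Promote w's related words only with the probability computed above.
    return rng.random() < promotion_probability(w, z_d, phi, topic_prior)
```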

Gibbs sampling process for GPU-DMM
1. DMM initialization.
2. Calculate the conditional topic distribution.
3. Resample the topic for each document.
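Putting the pieces together, a compact sketch of one GPU-DMM Gibbs iteration that composes the helpers sketched above (count removal and other bookkeeping are deliberately simplified):

```python
def gibbs_iteration(docs, z, m, n_kw, n_k, A, phi, topic_prior,
                    V, K, alpha, beta, rng):
    """docs: list of {word_id: freq}; z: current topic per document.
    A: promotion matrix (V x V); n_kw and n_k are float arrays so that
    fractional promotion counts can be accumulated."""
    for d, doc in enumerate(docs):
        # Step 1 (omitted for brevity): subtract document d's previously
        # added (possibly promoted) counts from m, n_kw and n_k.
        # Step 2: conditional topic distribution, exactly as in plain DMM.
        k_new = sample_doc_topic(doc, m, n_kw, n_k, V, K, alpha, beta, rng)
        z[d] = k_new
        m[k_new] += 1
        # Step 3: add the document back; for words that pass the filter,
        # promote their semantically related words via the promotion matrix.
        for w, freq in doc.items():
            if apply_promotion(w, k_new, phi, topic_prior, rng):
                n_kw[k_new] += freq * A[w]       # w plus its related words
                n_k[k_new] += freq * A[w].sum()
            else:
                n_kw[k_new, w] += freq
                n_k[k_new] += freq
    return z
```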

Experiment
Datasets:
- BaiduQA: a collection of 648,514 questions crawled from a popular Chinese Q&A website; each question is annotated with a category label by its asker.
- Web Snippet: 12,340 web search snippets; each snippet belongs to one of 8 categories.

Experiment Setup and Evaluation Metrics
1. Word embeddings: for the Snippet dataset, the pre-trained 300-dimensional word embeddings from the Google News corpus are used; for the BaiduQA dataset, 100-dimensional word embeddings are trained on 7 million Chinese articles crawled from the Baike website.
2. Methods and parameter settings: five models are compared, namely the Biterm Topic Model, the Self-Aggregation based Topic Model, the Dirichlet Multinomial Mixture model, the Latent Feature model with DMM, and GPU-DMM.
3. Evaluation metrics:
- Topic coherence: measures the extent to which the most probable words of a topic tend to co-occur within the same documents.
- Classification accuracy: an indirect evaluation, computed with an SVM classifier using default parameters.
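As an illustration of the topic coherence metric described above (a document co-occurrence score in the style of Mimno et al.; the exact variant used in the evaluation is an assumption of this sketch), one topic can be scored as follows:

```python
import numpy as np
from itertools import combinations

def topic_coherence(top_words, docs):
    """top_words: the T most probable words of one topic, ordered by probability;
    docs: list of sets of words (one set per short text)."""
    def doc_freq(*words):
        return sum(1 for doc in docs if all(w in doc for w in words))
    score = 0.0
    for w_l, w_t in combinations(top_words, 2):   # w_l ranks above w_t
        score += np.log((doc_freq(w_l, w_t) + 1) / max(doc_freq(w_l), 1))
    return score
```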

Results on Topic Coherence
On the BaiduQA dataset, GPU-DMM is not the best performer; the authors suggest that Chinese word segmentation may have affected the final performance.

Results on Classification accuracy

Time Efficiency

Q&A