Phrase Mining and Topic Modeling for Structure Discovery from Text


Phrase Mining and Topic Modeling for Structure Discovery from Text

Strategy 1: Simultaneously Inferring Phrases and Topics
Strategy 2: Post Topic Modeling Phrase Construction: TurboTopics and KERT (analysis of frequent patterns)
Strategy 3: First Phrase Mining, then Topic Modeling (ToPMine)

Frequent Pattern Mining for Text Data: Phrase Mining and Topic Modeling

Motivation: unigrams (single words) can be difficult to interpret.
Example: the topic that represents the area of Machine Learning, as phrases versus as unigrams:
  Phrases: learning, support vector machines, reinforcement learning, feature selection, conditional random fields, classification, decision trees, ...
  Unigrams: learning, reinforcement, support, machine, vector, selection, feature, random, ...

Various Strategies: Phrase-Based Topic Modeling

Strategy 1: Generate bag-of-words -> generate sequence of tokens
  Bigram topic model [Wallach'06], topical n-gram model [Wang et al.'07], phrase-discovering topic model [Lindsey et al.'12]
Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
  Topic labeling [Mei et al.'07], TurboTopics [Blei & Lafferty'09], KERT [Danilevsky et al.'14]
Strategy 3: Prior to bag-of-words model inference, mine phrases and impose them on the bag-of-words model
  ToPMine [El-Kishky et al.'15]

Strategy 1: Simultaneously Inferring Phrases and Topics

Bigram Topic Model [Wallach'06]
  Probabilistic generative model that conditions on the previous word and topic when drawing the next word
Topical N-Grams (TNG) [Wang et al.'07]
  Probabilistic model that generates words in textual order
  Creates n-grams by concatenating successive bigrams (a generalization of the Bigram Topic Model)
Phrase-Discovering LDA (PDLDA) [Lindsey et al.'12]
  Viewing each sentence as a time series of words, PDLDA posits that the generative parameter (topic) changes periodically
  Each word is drawn based on the previous m words (context) and the current phrase topic
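To make the first of these concrete, here is a minimal sketch of the Bigram Topic Model's generative process: each word is drawn conditioned on both its topic and the previous word. All sizes, names, and parameters below are illustrative placeholders, not the paper's notation.

```python
import numpy as np

# Illustrative sketch of the Bigram Topic Model [Wallach'06]: each word is
# drawn from a distribution conditioned on BOTH the current topic and the
# previous word.

rng = np.random.default_rng(0)
V, K, doc_len = 200, 10, 50           # vocabulary size, topics, words per doc

alpha, beta = 0.1, 0.01
theta = rng.dirichlet([alpha] * K)    # per-document topic proportions
# phi[k][w] is a distribution over the NEXT word given topic k and previous
# word w -- a (K x V x V) table in the full model.
phi = rng.dirichlet([beta] * V, size=(K, V))

doc, prev_word = [], 0                # start from an arbitrary boundary word
for _ in range(doc_len):
    z = rng.choice(K, p=theta)              # draw a topic for this position
    w = rng.choice(V, p=phi[z, prev_word])  # word depends on (topic, prev word)
    doc.append(w)
    prev_word = w
```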

Strategy 1: Topical N-Grams Model (TNG) [Wang et al.'07, Lindsey et al.'12]

(Figure: plate diagram of TNG over documents d1 ... dD, in which each word has a topic variable z and a switch variable x deciding whether it continues the previous phrase, e.g., segmenting "[white house] [reports]" versus "[white] [black] ... color".)

TNG: Experiments on Research Papers

(Figure: example topics mined by TNG from research papers, an interesting comparison against a state-of-the-art phrase-discovering topic model.)

TNG: Experiments on Research Papers (2)

(Figure: additional TNG topics, continuing the comparison with a state-of-the-art phrase-discovering topic model.)

Strategy 2: Post Topic Modeling Phrase Construction

TurboTopics [Blei & Lafferty'09]: phrase construction as a post-processing step to Latent Dirichlet Allocation
  Perform Latent Dirichlet Allocation on the corpus to assign each token a topic label
  Merge adjacent unigrams with the same topic label, using a distribution-free permutation test on an arbitrary-length back-off model
  End the recursive merging when all significant adjacent unigrams have been merged
KERT [Danilevsky et al.'14]: phrase construction as a post-processing step to Latent Dirichlet Allocation
  Perform frequent pattern mining on each topic
  Perform phrase ranking based on four criteria
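The following sketch illustrates the TurboTopics-style control flow only: it merges adjacent same-topic tokens whose bigram count clears a simple threshold, recursing until nothing more merges. The count threshold is a stand-in for the paper's distribution-free permutation test, and all names are hypothetical.

```python
from collections import Counter

def merge_pass(tokens, topics, min_count=3):
    """One merging pass over parallel (token, topic-label) sequences."""
    bigrams = Counter(
        (doc[i], doc[i + 1])
        for doc, lab in zip(tokens, topics)
        for i in range(len(doc) - 1)
        if lab[i] == lab[i + 1]            # only same-topic neighbors merge
    )
    merged_tokens, merged_topics, changed = [], [], False
    for doc, lab in zip(tokens, topics):
        out_t, out_z, i = [], [], 0
        while i < len(doc):
            if (i + 1 < len(doc) and lab[i] == lab[i + 1]
                    and bigrams[(doc[i], doc[i + 1])] >= min_count):
                out_t.append(doc[i] + " " + doc[i + 1])  # merge into a phrase
                out_z.append(lab[i])
                i, changed = i + 2, True
            else:
                out_t.append(doc[i]); out_z.append(lab[i]); i += 1
        merged_tokens.append(out_t); merged_topics.append(out_z)
    return merged_tokens, merged_topics, changed

def turbo_merge(tokens, topics):
    """Recurse until no significant adjacent pair remains."""
    changed = True
    while changed:
        tokens, topics, changed = merge_pass(tokens, topics)
    return tokens

docs = [["support", "vector", "machine", "wins"]] * 3
zs   = [[1, 1, 1, 0]] * 3
print(turbo_merge(docs, zs))   # [['support vector machine', 'wins'], ...]
```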

Example of TurboTopics

KERT: Topical Keyphrase Extraction & Ranking [Danilevsky et al.'14]

Input documents (paper titles), after unigram topic assignment (Topic 1 & Topic 2):
  knowledge discovery using least squares support vector machine classifiers
  support vectors for reinforcement learning
  a hybrid approach to feature selection
  pseudo conditional random fields
  automatic web page classification in a dynamic and hierarchical way
  inverse time dependency in convex regularized learning
  postprocessing decision trees to extract actionable knowledge
  variance minimization least squares support vector machines
  ...

Topical keyphrase extraction & ranking yields, e.g., for the machine-learning topic:
  learning, support vector machines, reinforcement learning, feature selection, conditional random fields, classification, decision trees, ...

Framework of KERT

Run bag-of-words model inference and assign a topic label to each token
Extract candidate keyphrases within each topic (frequent pattern mining)
Rank the keyphrases in each topic by four criteria:
  Popularity: 'information retrieval' vs. 'cross-language information retrieval'
  Discriminativeness: only frequent in documents about topic t
  Concordance: 'active learning' vs. 'learning classification'
  Completeness: 'vector machine' vs. 'support vector machine'
Comparability property: directly compare phrases of mixed lengths
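As a toy illustration of ranking along these criteria, the sketch below scores candidates by popularity, discriminativeness, and a concordance-style observed-vs-expected term (completeness is left out for brevity). The scoring formulas are illustrative stand-ins, not KERT's actual ranking function; `topic_docs`, `all_docs`, and the candidate list are hypothetical.

```python
import math
from collections import Counter

def doc_freq(phrase, docs):
    return sum(phrase in doc for doc in docs)

def score(phrase, topic_docs, all_docs, unigram_counts, total_tokens):
    words = phrase.split()
    in_topic = doc_freq(phrase, topic_docs)
    overall  = doc_freq(phrase, all_docs)
    if in_topic == 0:
        return 0.0
    popularity = in_topic / len(topic_docs)        # frequent in the topic
    discrimination = in_topic / max(overall, 1)    # concentrated in the topic
    # concordance: observed frequency vs. what independent words would give
    expected = len(all_docs) * math.prod(
        unigram_counts[w] / total_tokens for w in words)
    concordance = math.log((overall + 1) / (expected + 1e-9))
    return popularity * discrimination * max(concordance, 0.1)

topic_docs = ["support vector machine classifiers",
              "least squares support vector machine"]
all_docs = topic_docs + ["the house reports", "machine shop tools"]
unigrams = Counter(w for d in all_docs for w in d.split())
total = sum(unigrams.values())

cands = ["support vector machine", "vector machine", "machine"]
for p in sorted(cands, key=lambda p: -score(p, topic_docs, all_docs, unigrams, total)):
    print(p, round(score(p, topic_docs, all_docs, unigrams, total), 3))
```

Note how the longer, complete phrase 'support vector machine' outscores its partial sub-phrases, the behavior the completeness and concordance criteria are meant to enforce.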

KERT: Topical Phrases on Machine Learning

Comparison of phrase ranking methods: top-ranked phrases, mined from paper titles in DBLP, for the topic that represents the area of Machine Learning:

kpRel [Zhao et al.'11] | KERT (-popularity) | KERT (-discriminativeness) | KERT (-concordance)      | KERT [Danilevsky et al.'14]
learning               | effective          | support vector machines    | learning                 | learning
classification         | text               | feature selection          | classification           | support vector machines
selection              | probabilistic      | reinforcement learning     | selection                | reinforcement learning
models                 | identification     | conditional random fields  | feature                  | feature selection
algorithm              | mapping            | constraint satisfaction    | decision                 | conditional random fields
features               | task               | decision trees             | bayesian                 | classification
decision               | planning           | dimensionality reduction   | trees                    | decision trees
:                      | :                  | :                          | :                        | :

Strategy 3: First Phrase Mining, then Topic Modeling (ToPMine)

ToPMine [El-Kishky et al. VLDB'15]: first phrase construction, then topic mining
  Contrast with KERT: topic modeling, then phrase mining
The ToPMine framework:
  Perform frequent contiguous pattern mining to extract candidate phrases and their counts
  Perform agglomerative merging of adjacent unigrams as guided by a significance score; this segments each document into a "bag-of-phrases"
  Pass the newly formed bags-of-phrases as input to PhraseLDA, an extension of LDA that constrains all words in a phrase to share the same latent topic
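A minimal sketch of the first step, as described above: count frequent contiguous word sequences by growing n-grams level by level, with an Apriori-style prune. The support threshold and function names are illustrative, not the paper's.

```python
from collections import Counter

def frequent_contiguous_patterns(docs, min_support=2, max_len=5):
    counts, n = Counter(), 1
    for doc in docs:                       # level 1: frequent unigrams
        counts.update((w,) for w in doc)
    frequent = {p: c for p, c in counts.items() if c >= min_support}
    while n < max_len:
        n += 1
        level = Counter()
        for doc in docs:
            for i in range(len(doc) - n + 1):
                cand = tuple(doc[i:i + n])
                # prune: both (n-1)-gram borders must already be frequent
                if cand[:-1] in frequent and cand[1:] in frequent:
                    level[cand] += 1
        new = {p: c for p, c in level.items() if c >= min_support}
        if not new:
            break
        frequent.update(new)
    return frequent

docs = [["support", "vector", "machine"],
        ["support", "vector", "machine", "classifiers"],
        ["support", "vector", "data"]]
print(frequent_contiguous_patterns(docs))
# {('support',): 3, ('vector',): 3, ('machine',): 2,
#  ('support', 'vector'): 3, ('vector', 'machine'): 2,
#  ('support', 'vector', 'machine'): 2}
```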

Why First Phrase Mining, then Topic Modeling?

With Strategy 2, tokens in the same phrase may be assigned to different topics:
  knowledge discovery using least squares support vector machine classifiers ...
  "knowledge discovery" and "support vector machine" should each receive a coherent topic label
Solution: switch the order of phrase mining and topic model inference:
  [knowledge discovery] using [least squares] [support vector machine] [classifiers] ...
Techniques:
  Phrase mining and document segmentation
  Topic model inference with phrase constraints

Phrase Mining: Frequent Pattern Mining + Statistical Analysis

Significance score [Church et al.'91] for merging adjacent units A and B:
  α(A, B) = (f(AB) − μ₀(A, B)) / √f(AB)
where f(AB) is the observed frequency of A followed by B, and μ₀(A, B) is its expected frequency if A and B were independent.

Quality phrases after segmentation:
  [Markov blanket] [feature selection] for [support vector machines]
  [knowledge discovery] using [least squares] [support vector machine] [classifiers]
  ... [support vector] for [machine learning] ...

Phrase                   | Raw freq. | True freq. (after segmentation)
[support vector machine] | 90        | 80
[vector machine]         | 95        | 0
[support vector]         | 100       | 20
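The sketch below shows how such a score can drive agglomerative segmentation: repeatedly merge the adjacent pair with the highest α until no pair clears a threshold. The counts, corpus size, and threshold are hypothetical, and this simplifies the full algorithm to its core loop.

```python
import math
from collections import Counter

def alpha(f_ab, f_a, f_b, total):
    """Significance of merging A and B, per the score above."""
    mu0 = f_a * f_b / total          # expected count if A and B independent
    return (f_ab - mu0) / math.sqrt(f_ab) if f_ab > 0 else float("-inf")

def segment(doc, counts, total, threshold=1.0):
    units = [(w,) for w in doc]      # start with every word as its own unit
    while True:
        best, best_i = threshold, None
        for i in range(len(units) - 1):
            merged = units[i] + units[i + 1]
            a = alpha(counts.get(merged, 0),
                      counts.get(units[i], 0),
                      counts.get(units[i + 1], 0), total)
            if a > best:
                best, best_i = a, i
        if best_i is None:
            return units             # no significant pair left
        units[best_i:best_i + 2] = [units[best_i] + units[best_i + 1]]

# Hypothetical phrase counts over a corpus of `total` words:
counts = Counter({("support",): 100, ("vector",): 105, ("machine",): 110,
                  ("for",): 1000,
                  ("support", "vector"): 100, ("vector", "machine"): 90,
                  ("support", "vector", "machine"): 90})
print(segment(["support", "vector", "machine", "for"], counts, total=100000))
# -> [('support', 'vector', 'machine'), ('for',)]
```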

Collocation Mining

Collocation: a sequence of words that co-occur more frequently than expected
  Often "interesting"; due to their non-compositionality, they often convey information not carried by their constituent terms (e.g., "made an exception", "strong tea")
Many different measures are used to extract collocations from a corpus [Dunning'93, Pedersen'96]
  E.g., mutual information, t-test, z-test, chi-squared test, likelihood ratio
Many of these measures can be used to guide the agglomerative phrase-segmentation algorithm
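For concreteness, here are two of the classic measures, pointwise mutual information and the t-score, computed from bigram counts using their standard textbook definitions; the corpus is a toy placeholder.

```python
import math
from collections import Counter

def collocation_scores(words):
    N = len(words)
    uni = Counter(words)
    bi = Counter(zip(words, words[1:]))
    scores = {}
    for (a, b), f_ab in bi.items():
        p_ab = f_ab / N
        p_a, p_b = uni[a] / N, uni[b] / N
        pmi = math.log2(p_ab / (p_a * p_b))          # pointwise mutual info
        # t-test: (observed - expected under independence) / std error
        t = (p_ab - p_a * p_b) / math.sqrt(p_ab / N)
        scores[(a, b)] = (pmi, t)
    return scores

corpus = ("strong tea tastes fine . she made an exception . "
          "strong tea again . an exception was made .").split()
for pair, (pmi, t) in sorted(collocation_scores(corpus).items(),
                             key=lambda kv: -kv[1][0])[:3]:
    print(pair, f"PMI={pmi:.2f}  t={t:.2f}")
```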

ToPMine: PhraseLDA (Constrained Topic Modeling)

The generative model for PhraseLDA is the same as for LDA
Difference: the model incorporates constraints obtained from the "bag-of-phrases" input
A chain graph shows that all words in a phrase are constrained to take on the same topic value:
  [knowledge discovery] using [least squares] [support vector machine] [classifiers] ...
Topic model inference with phrase constraints
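A compact Gibbs-sampling sketch of this constraint: one topic is sampled per phrase, so all of the phrase's words share it. This simplifies the paper's chain-graph inference to make the constraint visible; the corpus, hyperparameters, and iteration count are arbitrary.

```python
import numpy as np

docs = [[["support", "vector", "machine"], ["learning"]],
        [["support", "vector", "machine"], ["classifiers"]],
        [["white", "house"], ["reports"]],
        [["white", "house"], ["press"]]]

K, alpha, beta = 2, 0.5, 0.1
vocab = sorted({w for d in docs for p in d for w in p})
wid = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

rng = np.random.default_rng(1)
ndk = np.zeros((len(docs), K))            # document-topic counts (per word)
nkw = np.zeros((K, V))                    # topic-word counts
nk = np.zeros(K)
z = [[rng.integers(K) for _ in d] for d in docs]
for d, doc in enumerate(docs):            # initialize counts
    for p, phrase in enumerate(doc):
        for w in phrase:
            k = z[d][p]
            ndk[d, k] += 1; nkw[k, wid[w]] += 1; nk[k] += 1

for _ in range(200):                      # Gibbs sweeps
    for d, doc in enumerate(docs):
        for p, phrase in enumerate(doc):
            k = z[d][p]                   # remove the phrase's current counts
            for w in phrase:
                ndk[d, k] -= 1; nkw[k, wid[w]] -= 1; nk[k] -= 1
            # unnormalized probability of each topic for the WHOLE phrase
            probs = np.ones(K)
            for t in range(K):
                probs[t] = ndk[d, t] + alpha
                for j, w in enumerate(phrase):
                    probs[t] *= (nkw[t, wid[w]] + beta) / (nk[t] + V * beta + j)
            k = rng.choice(K, p=probs / probs.sum())
            z[d][p] = k                   # add counts back under the new topic
            for w in phrase:
                ndk[d, k] += 1; nkw[k, wid[w]] += 1; nk[k] += 1

for t in range(K):
    top = np.argsort(-nkw[t])[:3]
    print(f"topic {t}:", [vocab[i] for i in top])
```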

Example Topical Phrases: A Comparison

ToPMine [El-Kishky et al.'15], Strategy 3 (67 seconds):
  Topic 1: social networks, web search, time series, search engine, management system, real time, decision trees, ...
  Topic 2: information retrieval, text classification, machine learning, support vector machines, information extraction, neural networks, text categorization, ...

PDLDA [Lindsey et al.'12], a state-of-the-art phrase-discovering topic model, Strategy 1 (3.72 hours):
  Topic 1: information retrieval, social networks, web search, search engine, information extraction, question answering, web pages, ...
  Topic 2: feature selection, machine learning, semi supervised, large scale, support vector machines, active learning, face recognition, ...

ToPMine: Experiments on DBLP Abstracts

(Figure: topical phrases mined by ToPMine from DBLP abstracts, an interesting comparison against a state-of-the-art phrase-discovering topic model.)

ToPMine: Topics on Associated Press News (1989)

(Figure: topics mined by ToPMine from 1989 Associated Press news, compared with a state-of-the-art phrase-discovering topic model.)

ToPMine: Experiments on Yelp Reviews

(Figure: topical phrases mined by ToPMine from Yelp reviews, compared with a state-of-the-art phrase-discovering topic model.)

Efficiency: Running Time of Different Strategies

(Figure: running-time comparison of the three strategies, including a state-of-the-art phrase-discovering topic model.)
Running time: Strategy 3 < Strategy 2 < Strategy 1 (Strategy 3 is the fastest)

Coherence of Topics: Comparison of Strategies

(Figure: topical coherence of the three strategies, including a state-of-the-art phrase-discovering topic model.)
Coherence measured by z-score: Strategy 3 > Strategy 2 > Strategy 1

Phrase Intrusion: Comparison of Strategies

(Figure: phrase-intrusion results for the three strategies, including a state-of-the-art phrase-discovering topic model.)
Phrase intrusion measured by average number of correct answers: Strategy 3 > Strategy 2 > Strategy 1

Phrase Quality: Comparison of Strategies

(Figure: phrase quality for the three strategies, including a state-of-the-art phrase-discovering topic model.)
Phrase quality measured by z-score: Strategy 3 > Strategy 2 > Strategy 1

Summary: Strategies on Topical Phrase Mining

Strategy 1: Generate bag-of-words -> generate sequence of tokens
  Integrated complex model; phrase quality and topic inference rely on each other
  Slow and prone to overfitting
Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
  Phrase quality relies on the topic labels of unigrams
  Can be fast; generally high-quality topics and phrases
Strategy 3: Prior to bag-of-words model inference, mine phrases and impose them on the bag-of-words model
  Topic inference relies on correct segmentation of documents, but is not sensitive to it
  Can be fast; generally high-quality topics and phrases

Recommended Readings

M. Danilevsky, C. Wang, N. Desai, X. Ren, J. Guo, J. Han. Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents. SDM'14
X. Wang, A. McCallum, X. Wei. Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval. ICDM'07
R. V. Lindsey, W. P. Headden III, M. J. Stipicevic. A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes. EMNLP-CoNLL'12
Q. Mei, X. Shen, C. Zhai. Automatic Labeling of Multinomial Topic Models. KDD'07
D. M. Blei, J. D. Lafferty. Visualizing Topics with Multi-Word Expressions. arXiv:0907.1013, 2009
A. El-Kishky, Y. Song, C. Wang, C. R. Voss, J. Han. Scalable Topical Phrase Mining from Text Corpora. VLDB'15
K. Church, W. Gale, P. Hanks, D. Hindle. Using Statistics in Lexical Analysis (Chap. 6). 1991