
Modeling Documents by Combining Semantic Concepts with Unsupervised Statistical Learning
Authors: Chaitanya Chemudugunta, America Holloway, Padhraic Smyth, Mark Steyvers
Source: ISWC 2008
Reporter: Yong-Xiang Chen

2 Abstract
Problem: mapping an entire document or Web page to concepts in a given ontology.
Proposes a probabilistic modeling framework
– Based on statistical topic models (also known as LDA models)
– The methodology combines human-defined concepts with data-driven topics
The methodology can be used to automatically tag Web pages with concepts
– From a known set of concepts
– Without any need for labeled documents
– Maps all words in a document, not just entities

3 Concepts from ontologies
An ontology includes
– Ontological concepts, each with an associated vocabulary
– The hierarchical relations between concepts
Topics from statistical models and concepts from ontologies both represent “focused” sets of words that relate to some abstract notion.

4 Words that relate to “FAMILY”

5 Statistical Topic Models
LDA is a state-of-the-art unsupervised learning technique for extracting thematic information.
Topic model
– Words in a document arise via a two-stage process:
  Word-topic distributions: p(w|z)
  Topic-document distributions: p(z|d)
– Both are learned in a completely unsupervised manner
– Gibbs sampling assigns topics to each word in the corpus
– The topic variable z plays the role of a low-dimensional representation of the semantic content of a document
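As a concrete illustration of the two-stage process, here is a minimal generative sketch in Python (not from the paper or the slides); the number of topics K, the vocabulary size V, the document length, and the Dirichlet hyperparameters alpha and beta are all hypothetical:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical sizes and hyperparameters (not given on the slides).
    K, V, doc_len = 5, 1000, 50
    alpha, beta = 0.1, 0.1

    # Word-topic distributions p(w|z): one multinomial over the vocabulary per topic.
    phi = rng.dirichlet(np.full(V, beta), size=K)

    # Topic-document distribution p(z|d) for a single document d.
    theta_d = rng.dirichlet(np.full(K, alpha))

    # Stage 1: draw a topic for each token; Stage 2: draw a word from that topic.
    topics = rng.choice(K, size=doc_len, p=theta_d)     # z ~ p(z|d)
    words = [rng.choice(V, p=phi[z]) for z in topics]   # w ~ p(w|z)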

6 Topic
A topic takes the form of a multinomial distribution over a vocabulary of words.
A topic can, in a loose sense, be viewed as a probabilistic representation of a semantic concept.

7 How can statistical topic modeling techniques “overlay” probabilities on the concepts within an ontology?
Two sources of information:
1. C human-defined concepts
– Each concept cj consists of a finite set of Nj unique words
2. A corpus of documents, such as Web pages
How can these two sources of information be merged within a topic model?
– “Tagging” documents with concepts from the ontology, but with little or no supervised labeled data available
The concept model: replace the topic z with a concept c
– Unknown parameters: p(wi|cj) and p(cj|d)
– Goal: estimate these from an appropriate corpus
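Written out, the per-word likelihood implied by these parameters has the familiar mixture form (a reconstruction from the parameter names above, not copied from the slides):

    p(w_i \mid d) = \sum_{j=1}^{C} p(w_i \mid c_j)\, p(c_j \mid d), \qquad \text{with } p(w_i \mid c_j) = 0 \text{ whenever } w_i \text{ is not in the vocabulary of } c_j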

8 Concept model vs. Topic model
In the concept model
– The words that belong to a concept are defined by a human a priori
– Limited to a small subset of the overall vocabulary
In the topic model
– All words in the vocabulary can be associated with any particular topic, but with different probabilities

9 Treat concepts as “topics with constraints”
The constraints consist of setting words that are not a priori mentioned in a concept to have probability 0.
Use Gibbs sampling to assign concepts to words in documents
– The additional constraint is that a word can only be assigned to a concept that it is associated with in the ontology
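To make the constraint concrete, here is a minimal sketch of one constrained Gibbs update, written in the usual collapsed-sampling form for LDA-style models; it is an illustration, not the authors' implementation, and the allowed_concepts mapping, count matrices, and hyperparameters alpha and beta are hypothetical:

    import numpy as np

    def sample_concept(word, doc, allowed_concepts, n_wc, n_cd, concept_sizes, alpha=0.1, beta=0.01):
        """Resample the concept for a single token (counts exclude this token).

        allowed_concepts[word]: list of concept ids whose vocabulary contains this word.
        n_wc[w, c]: number of times word w is currently assigned to concept c.
        n_cd[c, d]: number of tokens in document d currently assigned to concept c.
        concept_sizes[c]: Nj, the number of unique words defined for concept c.
        alpha, beta: hypothetical smoothing hyperparameters (not given on the slides).
        """
        candidates = allowed_concepts[word]   # the constraint: only concepts containing this word
        weights = np.array([
            (n_wc[word, c] + beta) / (n_wc[:, c].sum() + concept_sizes[c] * beta)
            * (n_cd[c, doc] + alpha)
            for c in candidates
        ])
        return candidates[np.random.choice(len(candidates), p=weights / weights.sum())]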

10 Estimating the unknown parameters
To estimate p(wi|cj)
– Count how many times each word in the corpus was assigned by the sampling algorithm to concept cj, and normalize these counts to arrive at the probability distribution p(wi|cj)
To estimate p(cj|d)
– Count how many times each concept is assigned to a word in document d, and again normalize and smooth the counts to obtain p(cj|d)
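A small sketch of this count-and-normalize step, assuming the count matrices produced by a sampler like the one above; the smoothing constant alpha is hypothetical (the slides say the document-concept counts are smoothed but do not give the scheme):

    import numpy as np

    def estimate_parameters(n_wc, n_cd, alpha=0.1):
        """Turn the final Gibbs-sampling count matrices into probability estimates.

        n_wc[w, c]: number of times word w was assigned to concept c.
        n_cd[c, d]: number of times concept c was assigned to a word in document d.
        """
        # p(w|c): normalize each concept's word counts; words outside a concept's
        # vocabulary were never assigned to it, so they keep probability 0.
        p_w_given_c = n_wc / np.maximum(n_wc.sum(axis=0, keepdims=True), 1)
        # p(c|d): normalize and smooth each document's concept counts.
        p_c_given_d = (n_cd + alpha) / (n_cd.sum(axis=0, keepdims=True) + alpha * n_cd.shape[0])
        return p_w_given_c, p_c_given_d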

11 Example
Learned probabilities for words (ranked highest by probability) for two different concepts from the CIDE ontology, after training on the TASA corpus.

12 Uses of the concept model
– Use the learned probabilistic representation of concepts to map new documents into concepts within an ontology
– Use the semantic concepts to improve the quality of data-driven topic models

13 Variations of the concept model framework
ConceptL (concept-learned)
– The word-concept probabilities p(wi|cj) are learned from the corpus, as described above
ConceptU (concept-uniform)
– The word-concept probabilities p(wi|cj) are defined to be uniform for all words within a concept
– Baseline model
ConceptF (concept-fixed)
– The word-concept probabilities are available a priori as part of the concept definition
For ConceptU and ConceptF, Gibbs sampling is used as before, but the p(w|c) probabilities are held fixed and not learned.
ConceptLH, ConceptFH
– Incorporate hierarchical information such as a concept tree
– An internal concept node is associated with its own words and all the words associated with its children
ConceptL+Topics, ConceptLH+Topics
– Incorporate unconstrained data-driven topics alongside the concepts
– The Gibbs sampling procedure can either assign a word to a constrained concept or to one of the unconstrained topics (see the sketch below)
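A schematic of how a token's candidate set changes in the +Topics variants (a sketch only; the function name and the indexing convention are hypothetical, not the authors'):

    def candidate_assignments(word, allowed_concepts, num_concepts, num_topics):
        """Candidate assignments for one token in a Concept+Topics model.

        Concepts are constrained: only those whose vocabulary contains the word qualify.
        Topics are unconstrained: every data-driven topic is always a candidate.
        Topics are indexed after the concepts: ids num_concepts .. num_concepts + num_topics - 1.
        """
        concept_ids = list(allowed_concepts[word])
        topic_ids = list(range(num_concepts, num_concepts + num_topics))
        return concept_ids + topic_ids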

14 Experiment setting (1/2)
Text corpus: the Touchstone Applied Science Associates (TASA) dataset
– D = 37,651 documents with passages excerpted from educational texts
– Divided into 9 different educational topics
– The experiments focus only on the documents classified as SCIENCE (D = 5,356; 1.7M word tokens) and SOCIAL STUDIES (D = 10,501; 3.4M word tokens)

15 Experiment setting (2/2)
Concepts
– The Open Directory Project (ODP), a human-edited hierarchical directory of the web
  Contains descriptions and URLs for a large number of hierarchically organized topics
  All the topics in the SCIENCE subtree were extracted, giving C = 10,817 nodes
– The Cambridge International Dictionary of English (CIDE)
  Consists of C = 1,923 hierarchically organized semantic categories
  The concepts vary in the number of words they contain, with a median of 54 words and a maximum of
  Each word can be a member of multiple concepts

16 Tagging Documents with Concepts
Assigning likely concepts to each word in a document, depending on the context of the document.
The concept models assign concepts at the word level, so a document can be summarized at multiple levels of granularity
– Snippets, sections, whole Web pages, …
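Because the tags live at the word level, rolling them up to any span is just counting and normalizing; a minimal sketch (the helper function and the concept ids are hypothetical):

    from collections import Counter

    def summarize_span(word_concept_assignments):
        """Aggregate word-level concept assignments (one concept id per word in a
        snippet, section, or whole page) into a ranked concept summary."""
        counts = Counter(word_concept_assignments)
        total = sum(counts.values())
        return [(concept, count / total) for concept, count in counts.most_common()]

    # The same word-level tags summarized at two granularities.
    page_tags = [3, 3, 7, 3, 12, 7, 3]
    snippet_tags = page_tags[:4]
    print(summarize_span(page_tags))      # whole page
    print(summarize_span(snippet_tags))   # one snippet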

17 Using the ConceptU model to automatically tag a Web page with CIDE concepts

18 Using the ConceptU model to automatically tag a Web page with ODP concepts

19 Tagging at the word level using the ConceptL model

20 Language Modeling Experiments
Perplexity
– Lower scores are better, since they indicate that the model’s distribution is closer to that of the actual text
– 90% of the documents are used for training
– 10% are used for computing test perplexity
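The slides do not spell out the metric, but the standard test-set perplexity for this kind of model is (a reconstruction):

    \mathrm{Perplexity}(D_{\text{test}}) = \exp\!\left(-\,\frac{\sum_{d \in D_{\text{test}}} \log p(\mathbf{w}_d)}{\sum_{d \in D_{\text{test}}} N_d}\right)

where N_d is the number of word tokens in test document d and p(w_d) is the probability the trained model assigns to those tokens; exponentiating the negative average log-likelihood is what makes lower values better.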

21 Perplexity for various models

22 Topic model vs. ConceptLH+Topics

23 Varying the training data

24 Conclusion
The model can automatically place the words and documents of a text corpus into a set of human-defined concepts.
The talk also illustrated how concepts can be “tuned” to a corpus, yielding a probabilistic language model and improved language-modeling performance.