
Hierarchical Dirichlet Trees for Information Retrieval Gholamreza Haffari Simon Fraser University Yee Whye Teh University College London NAACL talk, Boulder, June 2009

2 Outline
Information Retrieval (IR) using Language Models
– Hierarchical Dirichlet Document (HDD) model
Hierarchical Dirichlet Trees
– The model, Inference
– Learning the Tree and Parameters
Experiments
– Experimental Results, Analysis
Conclusion

3 Outline
Information Retrieval (IR) using Language Models
– Hierarchical Dirichlet Document (HDD) model
Hierarchical Dirichlet Trees
– The model, Inference
– Learning the Tree and Parameters
Experiments
– Experimental Results, Analysis
Conclusion

4 Ad-hoc Information Retrieval
In ad-hoc information retrieval, we are given
– a collection of documents d_1, ..., d_n
– a query q
The task is to return a (sorted) list of documents based on their relevance to q

5 A Language Modeling Approach to IR
Build a language model (LM) for each document d_i
Sort documents by the probability of generating the query q under their language model, P(q | LM(d_i))
For simplicity, we make the bag-of-words assumption, i.e. the terms are generated independently and identically
– So our LM is a multinomial distribution whose dimension is the size of the dictionary
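To make the query-likelihood recipe concrete, here is a minimal sketch (not from the talk; the function names and toy data are illustrative) that ranks documents by the probability of the query under each document's maximum-likelihood unigram model, which also exposes the sparsity problem discussed on the next slide:

```python
from collections import Counter
from math import log

def score_ml(query_terms, doc_terms):
    """Log-probability of the query under the document's ML unigram model."""
    counts = Counter(doc_terms)
    total = len(doc_terms)
    score = 0.0
    for w in query_terms:
        p = counts[w] / total          # zero for unseen terms: the sparsity problem
        if p == 0.0:
            return float("-inf")       # a single unseen query term kills the document
        score += log(p)
    return score

docs = {"d1": "the cat sat on the mat".split(),
        "d2": "dogs and cats are pets".split()}
query = "cat mat".split()
ranking = sorted(docs, key=lambda d: score_ml(query, docs[d]), reverse=True)
print(ranking)
```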

6 The Collection Model
Because of data sparsity, training an LM on a single document gives poor performance
We should smooth the document LMs using a collection-wide model
– A Dirichlet distribution can be used to summarize collection-wide information
– This Dirichlet distribution is used as a prior for the documents' language models
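A small sketch of the practical effect of such a collection-wide prior, assuming standard Dirichlet smoothing with an illustrative pseudo-count mu (the HDD model's prior is hierarchical, so this only shows the basic smoothing idea):

```python
from collections import Counter
from math import log

def score_dirichlet(query_terms, doc_terms, collection_terms, mu=2000.0):
    """Query log-likelihood under a Dirichlet-smoothed document language model."""
    d = Counter(doc_terms)
    c = Counter(collection_terms)
    c_total = len(collection_terms)
    d_total = len(doc_terms)
    score = 0.0
    for w in query_terms:
        p_coll = (c[w] + 1) / (c_total + len(c))     # collection model (add-one for safety)
        p = (d[w] + mu * p_coll) / (d_total + mu)    # posterior predictive under the Dirichlet prior
        score += log(p)
    return score

collection = "the cat sat on the mat dogs and cats are pets".split()
print(score_dirichlet("cat mat".split(), "dogs and cats are pets".split(), collection, mu=10.0))
```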

7 The Hierarchical Dirichlet Distribution
(Figure taken from Phil Cowans, 2004: per-document language models drawn around a common mean, with a uniform distribution at the top of the hierarchy)

8 The Hierarchical Dirichlet Distribution
The HDD model is intuitively appealing
– It is reasonable to assume that the LMs of individual documents vary (to some extent) about a common model
By making the common mean a random variable, instead of fixing it beforehand, information is shared across documents
– It leads to an inverse document frequency effect (Cowans, 2004)
But the model has a deficiency
– How can we tell it that a pair of words should be positively or negatively correlated in the learned LM? (an effect similar to query expansion)

9 Outline
Information Retrieval (IR) using Language Models
– Hierarchical Dirichlet Document (HDD) model
Hierarchical Dirichlet Trees
– The model, Inference
– Learning the Tree and Parameters
Experiments
– Experimental Results, Analysis
Conclusion

10 Injecting Prior Knowledge
Represent the inter-dependencies among words with a binary tree
– Correlated words are placed nearby in the tree, at the leaf level
– We can use WordNet or a word clustering algorithm
The tree can represent a multinomial distribution over words (we will see this shortly), which we call a multinomial-tree distribution
– In the model, replace the flat multinomial distributions with these multinomial-tree distributions
Instead of the Dirichlet distribution, we use a prior called the Dirichlet-tree distribution

11 Multinomial-Tree Distribution
Given a tree and, at each node, a probability distribution over that node's children
– The probability of reaching a particular leaf is the product of the probabilities on the unique path from the root to that leaf
– We call the resulting multinomial over the leaves a multinomial-tree distribution
(In the slide's example the leaf probabilities are .14, .06 and .8)
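A toy sketch of this computation, using a hypothetical dictionary-based tree encoding (the node names n1, w1, ... are illustrative); it reproduces the leaf probabilities from the slide's example:

```python
# Hypothetical tree encoding: each internal node maps child -> probability of choosing it;
# leaves (vocabulary words) do not appear as keys.
tree = {
    "root": {"n1": 0.2, "w3": 0.8},
    "n1":   {"w1": 0.7, "w2": 0.3},
}

def leaf_probabilities(tree, root="root"):
    """Probability of each leaf = product of child-selection probabilities on its path."""
    probs = {}
    def walk(node, p):
        if node not in tree:           # a leaf (vocabulary word)
            probs[node] = p
            return
        for child, q in tree[node].items():
            walk(child, p * q)
    walk(root, 1.0)
    return probs

print({w: round(p, 2) for w, p in leaf_probabilities(tree).items()})
# {'w1': 0.14, 'w2': 0.06, 'w3': 0.8}
```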

12 Dirichlet-Tree Distribution
Put a Dirichlet distribution over each node's probability distribution on selecting its children
– (In the slide's example, Dirichlet(.2, .8) at one node and Dirichlet(.3, .7) at another, with leaf probabilities p_1, p_2, p_3)
The resulting prior over the multinomial distribution at the leaf level is called the Dirichlet-tree distribution
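A sketch of drawing one multinomial-tree from such a prior, assuming the same toy tree encoding as above and per-node Dirichlet parameters taken from the slide's example (the encoding and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Dirichlet parameters at each internal node, one value per child (illustrative).
tree_params = {
    "root": {"n1": 0.2, "w3": 0.8},
    "n1":   {"w1": 0.3, "w2": 0.7},
}

def sample_multinomial_tree(tree_params, root="root"):
    """Draw one multinomial-tree: sample each node's child distribution from its Dirichlet,
    then multiply the sampled probabilities along every root-to-leaf path."""
    probs = {}
    def walk(node, p):
        if node not in tree_params:    # a leaf
            probs[node] = p
            return
        children = list(tree_params[node])
        theta = rng.dirichlet([tree_params[node][c] for c in children])
        for c, q in zip(children, theta):
            walk(c, p * q)
    walk(root, 1.0)
    return probs

print(sample_multinomial_tree(tree_params))
```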

13 Hierarchical Dirichlet Tree Model
Consider the parent-child edges (k, l) on the path from the root of the tree to a word at the leaf level
For each node k in the tree, the document-specific distribution over its children, π_dk, is drawn from a Dirichlet centred on the common (global) distribution π_0k, with concentration hyper-parameter η_k

14 Inference in the HDT Model
We do approximate inference by making the minimum oracle assumption
– Each time a word is seen in a document, increment the counts of the nodes on its path to the root (in the local tree) by one
– The nodes on the path from the root to a word (in the global tree) are asked only the first time that the term is seen in each document, and never asked subsequently
(Figure: the global tree and a per-document local tree, each showing the path from the root to a word w)
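A small sketch of these count updates under the minimum oracle assumption; the tree encoding, the `path_to_root` helper and the counters are illustrative, not the authors' implementation:

```python
from collections import defaultdict

def update_counts(doc_terms, path_to_root, local_counts, global_counts):
    """Minimal-oracle count updates for one document.
    path_to_root(w) gives the nodes from word w's leaf up to the root."""
    seen = set()
    for w in doc_terms:
        for node in path_to_root(w):
            local_counts[node] += 1           # every occurrence updates the local tree
        if w not in seen:                     # the global tree is asked only the first
            seen.add(w)                       # time the term is seen in this document
            for node in path_to_root(w):
                global_counts[node] += 1

# Illustrative usage with a toy two-level tree.
paths = {"w1": ["w1", "n1", "root"], "w2": ["w2", "n1", "root"], "w3": ["w3", "root"]}
local, global_ = defaultdict(int), defaultdict(int)
update_counts("w1 w1 w3".split(), lambda w: paths[w], local, global_)
print(dict(local), dict(global_))
```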

15 The Document Score
Our document score can be thought of as fixing π_0k to a good value beforehand and then integrating out π_dk
Hence the score for a document d, i.e. the relevance score, multiplies, over the query terms and over the edges (k, l) on each term's path, the predictive probabilities (n_dl + η_k π_0kl) / (n_dk + η_k), where n_dl and n_dk are the local-tree counts of the child l and its parent k in document d
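A sketch of how such a score could be computed from the local-tree counts, assuming the predictive form written above; the function and argument names (`edges_on_path`, `n_d`, `eta`, `pi0`) are illustrative:

```python
from math import log

def log_relevance(query_terms, edges_on_path, n_d, eta, pi0):
    """Sketch of the per-document score: for each query term, multiply the predictive
    probabilities (n_dl + eta_k * pi0_kl) / (n_dk + eta_k) along the edges (k, l)
    on its root-to-leaf path. n_d[node] are the document's local-tree counts."""
    score = 0.0
    for w in query_terms:
        for k, l in edges_on_path(w):
            num = n_d.get(l, 0) + eta[k] * pi0[(k, l)]
            den = n_d.get(k, 0) + eta[k]
            score += log(num / den)
    return score

# Toy usage consistent with the earlier two-level tree.
eta = {"root": 1.0, "n1": 1.0}
pi0 = {("root", "n1"): 0.2, ("root", "w3"): 0.8, ("n1", "w1"): 0.7, ("n1", "w2"): 0.3}
edges = {"w1": [("root", "n1"), ("n1", "w1")], "w3": [("root", "w3")]}
n_d = {"root": 3, "n1": 2, "w1": 2, "w3": 1}
print(log_relevance(["w1", "w3"], lambda w: edges[w], n_d, eta, pi0))
```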

16 Learning the Tree Structure
We used three agglomerative clustering algorithms to build the tree structure over the vocabulary words
– Brown clustering (BCluster)
– Distributional clustering (DCluster)
– Probabilistic hierarchical clustering (PCluster)
Since inference involves visiting the nodes on the path from a word to the root, we would like the leaf nodes to have a low average depth
– We introduce a tree simplification operator which changes the structure of the tree without losing much information
– It contracts a subset of nodes in the tree which have a particular property
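For intuition only, here is a generic greedy agglomerative sketch that builds a binary tree bottom-up by repeatedly merging the two most similar clusters; it is a stand-in using cosine similarity of toy word vectors, not an implementation of BCluster, DCluster or PCluster, each of which uses its own merge criterion:

```python
import numpy as np

def agglomerative_tree(vectors, words):
    """Greedy bottom-up clustering: repeatedly merge the two most similar clusters.
    Returns nested tuples encoding the binary tree over the words."""
    clusters = {w: (np.asarray(v, float), w) for w, v in zip(words, vectors)}
    next_id = 0
    while len(clusters) > 1:
        ids = list(clusters)
        best, best_sim = None, -np.inf
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                a, b = clusters[ids[i]][0], clusters[ids[j]][0]
                sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
                if sim > best_sim:
                    best, best_sim = (ids[i], ids[j]), sim
        (ca, ta), (cb, tb) = clusters.pop(best[0]), clusters.pop(best[1])
        clusters[f"n{next_id}"] = ((ca + cb) / 2.0, (ta, tb))   # merged cluster and subtree
        next_id += 1
    return next(iter(clusters.values()))[1]

print(agglomerative_tree([[1, 0], [0.9, 0.1], [0, 1]], ["cat", "feline", "tax"]))
```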

17 Simplification of the Tree
Let J(k) be the length of the path from a node k to the closest leaf node
– J(k) = 1 denotes nodes just above a leaf node
– J(k) ≥ 2 denotes nodes further up the tree
If we have long branches in the tree, we can keep the root and make the leaves its immediate children
– This can be achieved algorithmically by contracting the nodes with J(k) ≥ 1 below the root
Suppose the structure near the surface of the tree is important, but the other internal nodes are less important
– Nodes with J(k) ≥ 2 can then be contracted
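A toy sketch of node contraction (removing a node and attaching its children directly to its parent); the dictionary-based tree encoding, the `should_contract` predicate and the J(k)-style distance helper are all illustrative:

```python
def dist_to_leaf(tree, node):
    """Length of the shortest path from `node` down to a leaf (J(k) in the slides)."""
    if node not in tree:
        return 0
    return 1 + min(dist_to_leaf(tree, c) for c in tree[node])

def contract(tree, should_contract, root="root"):
    """Contract internal nodes: a contracted node is removed and its children are
    attached to its parent. `tree` maps node -> list of children; leaves are absent."""
    def expand(node):
        kept = []
        for child in tree[node]:
            if child in tree and should_contract(child):
                kept.extend(expand(child))   # splice the contracted child's children in
            else:
                kept.append(child)
        return kept

    new_tree = {}
    def rebuild(node):
        if node not in tree:                 # leaf
            return
        children = expand(node)
        new_tree[node] = children
        for c in children:
            rebuild(c)
    rebuild(root)
    return new_tree

toy = {"root": ["a"], "a": ["b", "c"], "b": ["w1", "w2"], "c": ["w3", "w4"]}
flat = contract(toy, lambda k: k != "root" and dist_to_leaf(toy, k) >= 2)
print(flat)   # {'root': ['b', 'c'], 'b': ['w1', 'w2'], 'c': ['w3', 'w4']}
```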

18 Learning the Parameters
We constrain the HDT model to be centred over the flat HDD model
– We only allow learning the hyper-parameters η_k
– A Gamma prior is put over each η_k, and the MAP estimate is found by optimizing the objective function with L-BFGS
We set the global-tree values π_0k so that the tree induces the same distribution on the document LMs as the flat HDD model
(Figure of the global tree: each internal node's parameter is the sum of its children's, e.g. the node above the first two leaves carries the sum of their two parameters)
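A minimal sketch of MAP estimation of a single node's concentration parameter under a Gamma(b·η̄ + 1, b) prior using L-BFGS, with the standard Dirichlet-multinomial marginal likelihood standing in for the paper's full objective; the counts, the mean pi0 and the hyper-parameter values are illustrative:

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize

# Toy data: per-document counts over one node's children, a fixed global mean pi0,
# and the Gamma(b*eta_bar + 1, b) prior from the slides (values are illustrative).
counts = np.array([[3, 0, 1], [0, 2, 2], [5, 1, 0]], dtype=float)
pi0 = np.array([0.5, 0.25, 0.25])
eta_bar, b = 10.0, 1.0

def neg_log_posterior(log_eta):
    eta = np.exp(log_eta[0])                 # optimize log(eta) to keep eta positive
    # Dirichlet-multinomial marginal likelihood, one term per document
    ll = np.sum(gammaln(eta) - gammaln(eta + counts.sum(axis=1))
                + np.sum(gammaln(eta * pi0 + counts) - gammaln(eta * pi0), axis=1))
    # Gamma(b*eta_bar + 1, b) log-prior (constants dropped)
    lp = b * eta_bar * np.log(eta) - b * eta
    return -(ll + lp)

res = minimize(neg_log_posterior, x0=[np.log(eta_bar)], method="L-BFGS-B")
print("MAP eta:", np.exp(res.x[0]))
```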

19 Outline
Information Retrieval (IR) using Language Models
– Hierarchical Dirichlet Document (HDD) model
Hierarchical Dirichlet Trees
– The model, Inference
– Learning the Tree and Parameters
Experiments
– Experimental Results, Analysis
Conclusion

20 Experiments
We present results on two datasets, Medline and Cranfield
The baseline methods: (1) the flat HDD model (Cowans 2004), (2) Okapi BM25, (3) Okapi BM25 with query expansion
The comparison criteria
– Top-10 precision
– Average precision
(The slide's table lists, for each dataset, the number of documents, the number of queries and the dictionary size; the values are not included in the transcript)

21 Results

22 Precision-Recall graph for Cranfield (x-axis: recall, y-axis: precision; graph not included in the transcript)

23 Analysis
η_k > π_0k η_parent(k) indicates positive correlation in selecting the children of the node parent(k)
– If one child has been selected, it is likely that more children will be selected
This coincides with intuition
– BCluster and DCluster produce trees which put similar-meaning words nearby
– PCluster tends to put words with high co-occurrence nearby

24 Examples of the Learned Trees

25 Conclusion
We presented a hierarchical Dirichlet tree model for information retrieval
– It can inject (semantic or syntactic) word relationships, as domain knowledge, into a probabilistic model
The model uses a tree which captures the relationships among the words
– We investigated the effect of different tree-building algorithms and their simplification
Future research includes
– Scaling up the method to larger datasets
– Using WordNet or Wikipedia to build the tree

26 Merci Thank You

27 Inference in HDD Model
Let the common mean m be given by the (smoothed) proportion of documents each term appears in
– The oracle is asked the first time that each term is seen in each document, and never asked subsequently
After integrating out document j's LM parameters, the score is P(q | d_j) = ∏_{w in q} (n_jw + η m_w) / (n_j· + η)
– The term counts n_jw together with m_w give an effect similar to term frequency-inverse document frequency (TF-IDF)
– The denominator n_j· + η normalizes for the document length

28 Learning the Parameters
We put a Gamma(b η̄_k + 1, b) prior over the values of the hyper-parameters η_k, where η̄_k is the mode
– Setting η̄_k to the value implied by the flat HDD model reduces the model to the HDD at the mode of the Gamma distribution
We used L-BFGS to find the MAP values; the derivative of the objective function with respect to η_k is given on the slide (equation not included in the transcript)