1
Hierarchical Dirichlet Trees for Information Retrieval
Gholamreza Haffari, Simon Fraser University
Yee Whye Teh, University College London
NAACL talk, Boulder, June 2009

2
Outline
Information Retrieval (IR) using Language Models
– Hierarchical Dirichlet Document (HDD) model
Hierarchical Dirichlet Trees
– The model, Inference
– Learning the Tree and Parameters
Experiments
– Experimental Results, Analysis
Conclusion

4
Ad-hoc Information Retrieval
In ad-hoc information retrieval, we are given
– a collection of documents d_1, ..., d_n
– a query q
The task is to return a (sorted) list of documents, ordered by their relevance to q

5
A Language Modeling Approach to IR
Build a language model (LM) for each document d_i
Sort documents by the probability of generating the query q under their language model, P(q | LM(d_i))
For simplicity, we make the bag-of-words assumption, i.e. the terms are generated independently and identically
– So our LM is a multinomial distribution whose dimension is the size of the dictionary
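The ranking rule on this slide can be sketched as follows. This is a minimal illustration, not the talk's implementation: it uses an unsmoothed maximum-likelihood multinomial per document, and the function names and toy documents are assumptions made for the example.

```python
import math
from collections import Counter

def query_log_likelihood(query_terms, doc_terms):
    """Log P(q | LM(d)) under a maximum-likelihood multinomial LM of d."""
    counts = Counter(doc_terms)
    total = len(doc_terms)
    log_p = 0.0
    for term in query_terms:
        if counts[term] == 0:
            # An unseen term has probability zero under the unsmoothed model.
            return float("-inf")
        log_p += math.log(counts[term] / total)
    return log_p

def rank(query_terms, docs):
    """Return document ids sorted by decreasing query log-likelihood."""
    scores = {doc_id: query_log_likelihood(query_terms, terms)
              for doc_id, terms in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy collection.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
}
ranking = rank(["cat", "mat"], docs)
```

Because "mat" never occurs in d2, that document receives probability zero for the query, which is the sparsity problem that smoothing addresses.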

6
The Collection Model
Because of data sparsity, an LM trained on a single document performs poorly
We should smooth the document LMs using a collection-wide model
– A Dirichlet distribution can be used to summarize collection-wide information
– This Dirichlet distribution is used as a prior for the documents' language models
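As a rough illustration of smoothing with a Dirichlet prior, the sketch below computes the posterior-mean word probability when the prior mean is fixed to the collection-wide word frequencies. The concentration parameter `mu`, the function names, and the toy data are assumptions for the example; in the HDD model discussed in this talk the common mean is itself a random variable rather than a fixed estimate.

```python
from collections import Counter

def collection_model(docs):
    """Relative frequency of each term over the whole collection."""
    counts = Counter(term for terms in docs for term in terms)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}

def smoothed_prob(term, doc_terms, coll, mu=10.0):
    """Posterior-mean probability of `term` in a document whose LM has a
    Dirichlet prior with mean `coll` and concentration `mu` (assumed value)."""
    counts = Counter(doc_terms)
    return (counts[term] + mu * coll.get(term, 0.0)) / (len(doc_terms) + mu)

# Hypothetical toy collection.
docs = [["a", "b"], ["a", "c"]]
coll = collection_model(docs)
p_c_in_doc0 = smoothed_prob("c", docs[0], coll, mu=2.0)
```

The term "c" does not occur in the first document, yet it still receives non-zero probability because the prior carries mass from the rest of the collection.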

7
The Hierarchical Dirichlet Distribution
[Figure: the hierarchical Dirichlet distribution, with the top level labeled "Uniform distribution". Figure taken from Phil Cowans, 2004.]

8
The Hierarchical Dirichlet Distribution
The HDD model is intuitively appealing
– It is reasonable to assume that the LMs of individual documents vary (to some extent) about a common model
By making the common mean a random variable, instead of fixing it beforehand, information is shared across documents
– This leads to an inverse document frequency effect (Cowans, 2004)
But the model has a deficiency
– How can we tell it that a pair of words should have positive or negative correlation in the learned LM? (an effect similar to query expansion)

10
Injecting Prior Knowledge
Represent the inter-dependencies among words with a binary tree
– Words sit at the leaf level, with correlated words placed close to each other in the tree
– Can use WordNet or a word clustering algorithm
The tree can represent a multinomial distribution over words (we will see it shortly), which we call a multinomial-tree distribution
– In the model, replace the flat multinomial distributions with these multinomial-tree distributions
Instead of the Dirichlet distribution, we use a prior called the Dirichlet-tree distribution

11
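Multinomial-Tree Distribution
Given a tree and, for each node, the probability of choosing each of its children
– The probability of reaching a particular leaf is the product of the probabilities on the unique path from the root to that leaf
– We call the resulting multinomial at the leaf level the multinomial-tree distribution
[Figure: a tree whose root chooses its children with probabilities .2 and .8; the .2 child chooses its own children with probabilities .7 and .3. The induced leaf distribution is (.14, .06, .8).]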
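The slide's numeric example can be reproduced with a short sketch. The nested-list tree encoding and the leaf names w1, w2, w3 are assumptions made for illustration; only the branch probabilities come from the slide.

```python
def leaf_probabilities(tree, prob=1.0):
    """Return {leaf: probability of reaching it}.

    `tree` is either a leaf name (str) or a list of (branch_prob, subtree)
    pairs, one per child of the node.
    """
    if isinstance(tree, str):
        return {tree: prob}
    leaves = {}
    for branch_prob, subtree in tree:
        # Multiply probabilities along the path from the root.
        leaves.update(leaf_probabilities(subtree, prob * branch_prob))
    return leaves

# Root picks its children with .2 / .8; the .2 child picks with .7 / .3.
tree = [(0.2, [(0.7, "w1"), (0.3, "w2")]), (0.8, "w3")]
probs = leaf_probabilities(tree)
```

Up to floating-point rounding, `probs` matches the leaf distribution (.14, .06, .8) shown on the slide.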

12
Dirichlet-Tree Distribution
Put a Dirichlet distribution over each node's probability distribution for selecting its children
The resulting prior over the multinomial distribution at the leaf level is called the Dirichlet-tree distribution
[Figure: a tree with leaf probabilities p1, p2, p3; the internal nodes carry the priors Dirichlet(.2, .8) and Dirichlet(.3, .7).]
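One way to picture the Dirichlet-tree prior is to draw a sample from it: sample each node's branching distribution from that node's Dirichlet, then multiply along the paths to get a leaf-level multinomial. The sketch below does this with the standard library only; the tree encoding, the leaf names, and the assignment of Dirichlet(.3, .7) to the left internal node are assumptions, since the figure itself is not preserved.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from Dirichlet(alphas) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_dirichlet_tree(tree, rng, prob=1.0):
    """Return one leaf-level multinomial drawn from a Dirichlet-tree prior.

    `tree` is a leaf name (str) or a list of (alpha, subtree) pairs, where
    the alphas of a node's children parameterize that node's Dirichlet.
    """
    if isinstance(tree, str):
        return {tree: prob}
    branch_probs = sample_dirichlet([alpha for alpha, _ in tree], rng)
    leaves = {}
    for p, (_, subtree) in zip(branch_probs, tree):
        leaves.update(sample_dirichlet_tree(subtree, rng, prob * p))
    return leaves

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Root ~ Dirichlet(.2, .8); left child ~ Dirichlet(.3, .7) (assumed layout).
tree = [(0.2, [(0.3, "w1"), (0.7, "w2")]), (0.8, "w3")]
sample = sample_dirichlet_tree(tree, rng)
```

Each call returns a different distribution over the three leaves, and every sample sums to one by construction.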

13
Hierarchical Dirichlet Tree Model
[Model equations not preserved in the transcript. The surviving annotations read: "for each node k in the tree" and "the parent-child nodes (k, l) on the path from the root of the tree to a word at the leaf level".]

14
Inference in the HDT Model
We do approximate inference by making the minimum oracle assumption
– Each time a word is seen in a document, the counts of the nodes on its path to the root (in the local tree) are incremented by one
– The nodes on the path from the root to a word (in the global tree) are asked only the first time that the term is seen in each document, and never asked subsequently
[Figure: the global tree and the local (per-document) trees, each showing the path from the root to a word w.]
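The counting scheme described above can be sketched as follows. This is an illustration of the bookkeeping only, under stated assumptions: paths are given as lists of node ids that include the leaf and the root, and the names `min_oracle_counts` and `path_to_root` are hypothetical.

```python
from collections import Counter

def min_oracle_counts(docs, path_to_root):
    """Accumulate node counts under the minimum oracle assumption.

    docs: list of token lists.
    path_to_root: word -> list of node ids on its path to the root.
    Returns (local, global_counts): local[d] counts node visits in
    document d's tree; global_counts counts visits in the global tree.
    """
    local = []
    global_counts = Counter()
    for tokens in docs:
        doc_counts = Counter()
        seen = set()
        for word in tokens:
            # Local tree: every token increments its whole path.
            doc_counts.update(path_to_root[word])
            # Global tree: only the first occurrence in this document counts.
            if word not in seen:
                seen.add(word)
                global_counts.update(path_to_root[word])
        local.append(doc_counts)
    return local, global_counts

# Hypothetical three-word tree: w1 and w2 share the internal node n1.
paths = {
    "w1": ["w1", "n1", "root"],
    "w2": ["w2", "n1", "root"],
    "w3": ["w3", "root"],
}
docs = [["w1", "w1", "w2"], ["w3", "w1"]]
local, global_counts = min_oracle_counts(docs, paths)
```

In the first document w1 occurs twice, so it contributes two to the local counts on its path but only one to the global counts.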

15
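The Document Score
Our document score can be thought of as fixing the global-tree parameters (subscript 0k) to a good value beforehand and then integrating out the document-level parameters (subscript dk)
Hence the score for document d, i.e. the relevance score, is
[Equation not preserved in the transcript; it is expressed in terms of the node counts n_k and n_l.]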

16
Learning the Tree Structure
We used three agglomerative clustering algorithms to build the tree structure over the vocabulary words
– Brown clustering (BCluster)
– Distributional clustering (DCluster)
– Probabilistic hierarchical clustering (PCluster)
Since inference involves visiting the nodes on the path from a word to the root, we would like the leaf nodes to have low average depth
– We introduced a tree simplification operator which changes the structure of the tree without losing much information
– It contracts a subset of nodes in the tree which have a particular property

17
Simplification of the Tree
Let [symbol missing] be the length of the path from a node k to the closest leaf node
– [condition missing] denotes nodes just above a leaf node
– [condition missing] denotes nodes further up the tree
If we have long branches in the tree, we can keep the root and make the leaves the immediate children of this node
– This can be achieved algorithmically by contracting the nodes with [condition missing]
Suppose the structure near the surface of the tree is important, but the other internal nodes are less important
– Nodes with [condition missing] can be contracted
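A contraction operator of the kind described here might look like the sketch below. The exact conditions used in the talk are not preserved in the transcript, so the thresholds in the usage lines (distance at least 1, distance at least 2) are assumptions chosen only to illustrate the two cases; the nested-list tree encoding and function names are likewise hypothetical.

```python
def dist_to_leaf(node):
    """Length of the path from `node` to its closest leaf (0 for a leaf)."""
    if isinstance(node, str):
        return 0
    return 1 + min(dist_to_leaf(child) for child in node)

def contract(node, should_contract):
    """Return a copy of the tree in which every non-root internal node whose
    distance-to-leaf satisfies `should_contract` is removed, with its
    children attached to its parent instead."""
    if isinstance(node, str):
        return node
    children = []
    for child in node:
        new_child = contract(child, should_contract)
        if isinstance(child, str) or not should_contract(dist_to_leaf(child)):
            children.append(new_child)
        else:
            # Splice the contracted node's children into the current node.
            children.extend(new_child)
    return children

# Leaves are strings; internal nodes are lists of children.
tree = [[["a", "b"], ["c", "d"]], "e"]

# Assumed threshold: contract every internal node, leaving a flat tree.
flat = contract(tree, lambda d: d >= 1)
# Assumed threshold: keep nodes just above the leaves, contract the rest.
shallow = contract(tree, lambda d: d >= 2)
```

The root is never contracted because the predicate is only applied to children, which matches the slide's "keep the root" case.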

18
Learning the Parameters
We constrain the HDT model to be centered over the flat HDD model
– We only allow learning the hyper-parameters
– A Gamma prior is put over [symbol missing], and the MAP estimate is found by optimizing the objective function using L-BFGS
We set the 0,k values so that the tree induces the same distribution on the document LMs as the flat HDD model
[Figure: the global tree, with node labels 03, 02, 01 and 01 + 02; the Greek symbols were lost in extraction.]

20
Experiments
We present results on two datasets

Dataset     # docs   # queries   dictionary size
Medline     1400     225         4227
Cranfield   1033     30          8800

The baseline methods: (1) the flat HDD model (Cowans, 2004), (2) Okapi BM25, (3) Okapi BM25 with query expansion
The comparison criteria
– Top-10 precision
– Average precision
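The slide does not spell out the two criteria, so the sketch below uses their standard definitions as an assumption: precision at k divides the number of relevant documents in the top k by k, and average precision averages the precision at each rank where a relevant document appears, over all relevant documents. Function names and the toy ranking are hypothetical.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def average_precision(ranking, relevant):
    """Sum of precision@i over the ranks i holding a relevant document,
    divided by the total number of relevant documents."""
    hits = 0
    total = 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

# Hypothetical ranking and relevance judgments for one query.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
p_at_2 = precision_at_k(ranking, relevant, k=2)
ap = average_precision(ranking, relevant)
```

Averaging `average_precision` over all queries of a dataset gives a single figure per system.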

21
Results

22
Precision-Recall graph for Cranfield
[Figure: precision-recall curves; axes: recall, precision.]

23
Analysis
A condition on the learned hyper-parameters (garbled in the transcript as "k > 0k parent(k)") means positive correlation in selecting the children of the node parent(k)
– If a child has been selected, more children are likely to be selected
This coincides with intuition
– BCluster and DCluster produce trees which put words with similar meanings nearby
– PCluster tends to put words with high co-occurrence nearby
[Figure residue: the values 0.9044, 0.7977 and 0.3344 appear on the slide without labels.]

24
Examples of the Learned Trees

25
Conclusion
We presented a hierarchical Dirichlet tree model for information retrieval
– It can inject (semantic or syntactic) word relationships as domain knowledge into a probabilistic model
The model uses a tree which captures the relationships among the words
– We investigated the effect of different tree-building algorithms and of tree simplification
Future research includes
– Scaling up the method to larger datasets
– Using WordNet or Wikipedia to build the tree

26
Merci
Thank You

27
Inference in HDD Model
Let the common mean m be [expression missing]
– The oracle is asked the first time that each term is seen in each document, and never asked subsequently
After integrating out the LM parameters of document j, the score is
[Equation not preserved in the transcript.]
– The blue part gives an effect similar to term-frequency inverse-document-frequency (TF-IDF)
– The red part normalizes for the document length

28
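Learning the Parameters
We put a Gamma(b·[mode] + 1, b) prior over the value of each hyper-parameter (indexed by k), where [mode] is the mode of the prior for node k (the Greek symbols were lost in extraction)
– Setting the mode as follows reduces the model to HDD at the mode of the Gamma distribution
[Equation not preserved in the transcript.]
We used L-BFGS to find the MAP values, where the derivative of the objective function with respect to the hyper-parameter for node k is
[Equation not preserved in the transcript.]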
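The reason a Gamma prior of this form has the chosen value as its mode can be checked in one line. The sketch assumes the shape-rate parameterization and writes the shape as a, the rate as b, and the intended mode as m; these symbol names are placeholders, not the talk's notation.

```latex
p(x \mid a, b) \propto x^{a-1} e^{-b x}, \qquad
\frac{d}{dx} \log p(x \mid a, b) = \frac{a-1}{x} - b = 0
\;\Longrightarrow\; x^{*} = \frac{a-1}{b} \quad (a \ge 1).

\text{With } a = b\,m + 1: \qquad x^{*} = \frac{(b\,m + 1) - 1}{b} = m .
```

Under this reading, b controls how tightly the prior concentrates around m: a larger b gives a sharper peak at the same mode.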
