Collection Fusion in Carrot2

Mithun Sheshagiri

Acknowledgements
Prof. Scott Cost
Srikanth Kallurkar
Hemali Majithia

Overview
Collection Fusion Problem in IR
Possible solutions
  Equal Distribution Assumption
  Comparable similarities
  Modeling Relevant Document Distribution
  Query Clustering
Carrot2 System
Query Routing in Carrot2

Overview
Collection Fusion in Carrot2
Future Work
Conclusions
References

The Collection Fusion Problem
Centralized indexing and retrieval
Distributed IR systems
The collection fusion problem
  Determining the number of documents that need to be retrieved from each sub-collection
  Interleaving the documents returned by each sub-collection

Possible Solutions
Equal Distribution Assumption
  Assumes that relevant documents are distributed equally across all sub-collections.
Comparable similarities
  Documents in the final result are listed as though the similarities were normalized across sub-collections.
  Similarity values are dependent on the sub-collection.
  A rare but not very relevant document can receive a higher ranking.

Possible Solutions: Modeling Relevant Document Distribution
The document distribution model is built using training queries. The document distribution for a query q is obtained by averaging the number of relevant documents retrieved by the k nearest training queries. This is done for all sub-collections. These document distributions, along with the total number of documents to be retrieved, are passed to a maximization procedure.
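The averaging step can be sketched as follows. This is a minimal illustration and not the original implementation: it assumes queries are sparse term-weight vectors, that "nearest" means highest cosine similarity, and that each training query stores one relevant-document count per sub-collection. The names `cosine` and `estimate_distribution` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors (dicts of term -> weight)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def estimate_distribution(query, training, k):
    """Average the relevant-document counts of the k nearest training queries.

    `training` is a list of (query_vector, {sub_collection: relevant_count}).
    Returns {sub_collection: average_count} over the k nearest neighbours.
    """
    if k <= 0 or not training:
        raise ValueError("need k > 0 and at least one training query")
    # Assumption: nearness is measured by cosine similarity.
    nearest = sorted(training, key=lambda item: cosine(query, item[0]),
                     reverse=True)[:k]
    collections = sorted({c for _, counts in training for c in counts})
    return {c: sum(counts.get(c, 0) for _, counts in nearest) / len(nearest)
            for c in collections}
```

The resulting per-collection averages are what would be handed, together with the total number of documents wanted, to the maximization procedure.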

Possible Solutions: Modeling Relevant Document Distribution
This maximization procedure calculates a cut-off value for each sub-collection.

Possible Solutions: Query Clustering
Query clusters are formed by grouping training queries that return some of the same documents. A weight is assigned to each cluster. The weight is computed from the number of relevant documents returned by the queries belonging to the cluster. The centroid of a query cluster is calculated by averaging the query vectors belonging to that cluster.

Possible Solutions: Query Clustering
The cluster whose centroid is most similar to the user query is selected and its weight is returned. The set of weights returned by all the sub-collections is used to apportion the retrieved set. Sub-collection $i$ contributes
$\frac{w_i}{\sum_j w_j} \times N$
documents, where
$w_i$: weight returned by the cluster selected for sub-collection $i$
$N$: number of documents in the final result
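A small sketch of centroid computation, cluster selection, and apportionment may make these steps concrete. It rests on assumptions not stated in the slides: query vectors are sparse dictionaries, similarity is cosine, and fractional shares are rounded with a largest-remainder rule so that the shares add up to N. All function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors (dicts of term -> weight)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Average a non-empty list of sparse query vectors, term by term."""
    total = {}
    for vec in vectors:
        for term, w in vec.items():
            total[term] = total.get(term, 0.0) + w
    return {term: w / len(vectors) for term, w in total.items()}


def cluster_weight(query, clusters):
    """Return the weight of the cluster whose centroid is closest to `query`.

    `clusters` is a list of (centroid_vector, weight) for one sub-collection.
    """
    best = max(clusters, key=lambda c: cosine(query, c[0]))
    return best[1]


def apportion(weights, n):
    """Split n result slots across sub-collections as w_i / sum(w) * n.

    Assumption: fractions are resolved by largest-remainder rounding.
    """
    total = sum(weights.values())
    if total == 0:
        return {c: 0 for c in weights}
    exact = {c: w / total * n for c, w in weights.items()}
    shares = {c: int(x) for c, x in exact.items()}
    leftover = n - sum(shares.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - shares[c],
                          reverse=True)
    for c in by_remainder[:leftover]:
        shares[c] += 1
    return shares
```

For example, with weights 2.0 and 1.0 and N = 10, the exact shares are about 6.67 and 3.33, so the first sub-collection is asked for 7 documents and the second for 3.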

Carrot2 System
Carrot2 is an agent-based distributed IR system.
It uses the Jackal communication infrastructure.
Agents use KQML for communication.
Agents interface with the IR engine through a wrapper.
The wrapper provides functionality to index documents as well as metadata.

Carrot2 System
Metadata is a reduced representation of the sub-collection (8-10%).
Metadata is a vector consisting of N-grams (terms), each with the number of documents that contain it.
On start-up, an agent is allotted a sub-collection.
Every agent has an associated metadata object.
An agent also has access to a metadata pool.
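To illustrate the shape of a metadata object, the sketch below maps each N-gram to the number of documents that contain it. It is only an assumption-laden illustration: it uses whitespace-separated word N-grams, whereas the slides do not say how Carrot2 tokenizes text or how the vector is reduced in size. The function name `build_metadata` is hypothetical.

```python
from collections import Counter


def build_metadata(documents, n=1):
    """Map each word n-gram to the number of documents that contain it.

    Assumption: documents are plain strings split on whitespace; the real
    system may use a different kind of N-gram.
    """
    doc_freq = Counter()
    for text in documents:
        tokens = text.lower().split()
        # A set ensures each n-gram is counted at most once per document.
        grams = {" ".join(tokens[i:i + n])
                 for i in range(len(tokens) - n + 1)}
        doc_freq.update(grams)
    return dict(doc_freq)
```

Because the counts are document frequencies and not term frequencies, a term repeated many times inside one document still contributes only 1.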

Query Routing in Carrot2
The query is submitted to a Query Manager.
The Query Manager picks an agent from a list of agents returned by the Collection Manager.
Every agent queries its metadata pool and makes one of three decisions:
  Query its local collection.
  Forward the query.
  A combination of both.

Query Routing in Carrot2
The process ends when either:
  There are no more agents that have not already received the query.
  The number of times the query has been forwarded has reached a threshold value.
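The routing loop and its two stopping conditions can be sketched as a small simulation. This is not the Carrot2 code: the `decide` callback stands in for an agent consulting its metadata pool, and its signature, like the name `route_query`, is a hypothetical choice made for the illustration.

```python
def route_query(agents, start, decide, max_forwards):
    """Simulate query routing among agents.

    `decide(agent, candidates)` is a hypothetical stand-in for the agent's
    metadata-pool lookup. It returns (search_local, target), where `target`
    is the agent to forward to, or None to stop forwarding.
    Returns (agents that searched locally, agents that received the query).
    """
    received = [start]
    searched = []
    current = start
    forwards = 0
    while True:
        # Only agents that have not yet received the query are candidates.
        candidates = [a for a in agents if a not in received]
        search_local, target = decide(current, candidates)
        if search_local:
            searched.append(current)
        # Stop when nobody new can be reached or the threshold is hit.
        if target not in candidates or forwards >= max_forwards:
            break
        received.append(target)
        current = target
        forwards += 1
    return searched, received
```

An agent that both searches locally and names a target models the "combination of both" decision.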

Collection Fusion in Carrot2
An approach similar to query clustering.
A metadata object plays the role of a query cluster: both are representations of sub-collections.
Both have a weight or similarity that indicates the relevance of the documents in the sub-collection to the given query.
The similarity values of the metadata objects can be used to apportion the total number of documents that need to be returned.

Requirement for implementation
Access to the metadata objects of all participating sub-collections (C2 agents). Two options:
  Use the metadata pool of one agent when the metadata objects are distributed in broadcast mode (flooding strategy).
  Add a new agent that accesses the metadata objects of all participating agents.

Each agent appends its similarity value to the result it returns. The interleaving can be done by rolling a C-faced die, where C is the number of sub-collections, biased by the number of documents that are still to be picked from each sub-collection's result set.
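The biased-die interleaving can be sketched as below. It assumes each sub-collection's result list has already been cut to its apportioned length and is in rank order; the function name `interleave` and the optional `rng` parameter are illustrative choices, not part of Carrot2.

```python
import random


def interleave(results, rng=None):
    """Merge ranked result lists by repeatedly rolling a biased die.

    `results` maps each sub-collection to its ranked documents. Each face of
    the die is one sub-collection, weighted by how many of its documents are
    still unpicked. The next unpicked document of the chosen sub-collection
    is appended, so rank order within a sub-collection is preserved.
    """
    rng = rng or random.Random()
    remaining = {c: list(docs) for c, docs in results.items() if docs}
    merged = []
    while remaining:
        names = list(remaining)
        counts = [len(remaining[c]) for c in names]
        pick = rng.choices(names, weights=counts, k=1)[0]
        merged.append(remaining[pick].pop(0))
        if not remaining[pick]:
            del remaining[pick]  # this face of the die is exhausted
    return merged
```

Passing a seeded `random.Random` makes a run repeatable, which is useful when comparing fusion strategies on the same result sets.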

Future Work
The suitability of the proposed technique for the C2 system should be experimentally verified.
Because the technique makes use of existing entities and information, it can be implemented with minimal changes to the existing architecture.

Conclusion
A combination of a query-clustering-like approach with probabilistic interleaving is a good candidate for collection fusion in C2:
  Decentralized nature
  Use of existing entities
  Easy to implement
  Less prone to scalability issues

References
Ellen M. Voorhees, Narendra K. Gupta, and Ben Johnson-Laird. Learning collection fusion strategies.
James P. Callan, Zhihong Lu, and W. Bruce Croft. Searching distributed collections with inference networks.
Ellen M. Voorhees, Narendra K. Gupta, and Ben Johnson-Laird. The collection fusion problem.
