The Effect of Collection Organization and Query Locality on IR Performance 2003/07/28 Park,


Contents: Introduction to Distributed IR; Related Works; System Architecture; Query Locality; Experiments; Conclusions

Introduction(1): Distributed IR systems. Research objective: maintain and improve retrieval performance as content grows, that is, decrease query response time while maintaining effectiveness. Related work: caching (Martin and Russell, 1991; Markatos, 1999); collection selection (Voorhees, 1995; Callan, 1995; French, 1999; Xu and Croft, 1999); partial replication (Lu and McKinley, 1999).

Introduction(2): This paper builds on previous work in collection selection and partial replication. It uses collection organization to determine when and how to use collection selection and replication, and classifies collection organizations as by topic, by source, or random.

Related Works: Architecture (IR versus database systems: unstructured versus structured data; IR versus the web: static collections such as case law and journal articles); Caching; Collection selection.

Related Works: Architecture(1). Architectures for parallel and distributed IR. Harman et al., 1991: show the feasibility of a distributed IR system by developing a prototype architecture. Burkowski, 1990; Burkowski et al., 1995: a simulation study that measures the retrieval performance of a distributed IR system, with two strategies for distributing a fixed workload: (a) distribute the text collection equally; (b) split the servers into a query evaluation group and a document retrieval group.

Related Works: Architecture(2). Couvreur et al., 1994: analyze the performance and cost factors of three different hardware architectures. Hawking, 1997: design and implement a parallel IR system, PADRE97, on a collection of workstations; a central process checks user commands, broadcasts them to the IR engines, and merges the results.

Related Works: Architecture(3). Cahoon and McKinley, 1996; Cahoon, 1999: a distributed IR system based on INQUERY. Collections are uniformly distributed, with up to 128 GB of data under a variety of workloads. Performance is measured as a function of system parameters such as client command rate and number of document collections.

Related Works: Caching. Markatos, 1999: caches web queries and their results, but requires an exact match (a drawback). Locality can be increased by determining query similarity to replicas.
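To make the exact-match drawback concrete, here is a minimal sketch of a query-result cache keyed on the literal query string. The class name, method names, and sample queries are hypothetical and are not taken from Markatos (1999); the sketch only shows why a reordered query misses the cache.

```python
class ExactMatchQueryCache:
    """Hypothetical cache that maps the literal query string to its results."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, query):
        # Only a byte-for-byte identical query string counts as a hit.
        result = self._store.get(query)
        if result is None:
            self.misses += 1
        else:
            self.hits += 1
        return result

    def put(self, query, results):
        self._store[query] = list(results)


cache = ExactMatchQueryCache()
cache.put("distributed information retrieval", ["doc7", "doc2"])

hit = cache.get("distributed information retrieval")
# Same terms in a different order: treated as a different query, so it misses.
miss = cache.get("information retrieval distributed")
```

A replica that is chosen by query similarity can, in principle, also serve the second query, which is the gain in locality the slide refers to.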

Related Works: Collection selection(1). Work on how to select the most relevant collections for a given query. Danzig et al., 1991: use a hierarchy of brokers to maintain indices; support Boolean keyword matching. Voorhees et al., 1995: exploit similarity between a new query and relevance judgments for previous queries.

Related Works: Collection selection(2). Callan et al., 1995: adapt the document inference network to rank collections by replacing the document node with the collection node; store the collection ranking inference network with document frequencies and term frequencies. Xu and Croft, 1999: propose a cluster-based language model for collection selection; apply clustering algorithms to organize documents into collections based on topics, then apply the approach of Callan et al., 1995 to select the most relevant collections.

System Architecture(1): Architecture for a distributed information retrieval system based on INQUERY. [Figure: clients 1 to m; connection broker; INQUERY servers 1 to n; collections.]

System Architecture(2): Architecture using a collection selector. [Figure: clients 1 to m; connection broker; collection selector; INQUERY servers 1 to n; collections.]

System Architecture(3): Architecture using a replica selector and a collection selector. [Figure: clients 1 to m; connection broker; replica selector; collection selector; INQUERY servers 1 to k (original collections); INQUERY servers k+1 to p (replica 1); up to INQUERY server n (replica q).]

System Architecture(4). Collection: a set of documents, with no overlap between the documents in any two collections; organized by topic, by source (for example, newspapers or journals), or randomly. Connection broker: a process that keeps track of all registered clients and INQUERY servers.
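The three organizations can be sketched as three ways of assigning each document to exactly one collection. This is only an illustration: the `partition` function, the document fields `id`, `source`, and `topic`, and the sample data are all assumptions, not part of the system described in the slides.

```python
import random
from collections import defaultdict


def partition(docs, how, num_collections=2, seed=0):
    """Assign every document to exactly one collection (no overlaps).

    how: "topic" or "source" groups by that (hypothetical) document field;
         "random" spreads documents over num_collections buckets.
    """
    collections = defaultdict(list)
    rng = random.Random(seed)  # seeded so the random split is repeatable
    for doc in docs:
        if how == "random":
            key = rng.randrange(num_collections)
        else:
            key = doc[how]
        collections[key].append(doc["id"])
    return dict(collections)


# Hypothetical documents, for illustration only.
docs = [
    {"id": 1, "source": "newspaper", "topic": "law"},
    {"id": 2, "source": "journal", "topic": "law"},
    {"id": 3, "source": "newspaper", "topic": "medicine"},
    {"id": 4, "source": "journal", "topic": "medicine"},
]

by_source = partition(docs, "source")
by_topic = partition(docs, "topic")
by_random = partition(docs, "random")
```

Because each document is appended to one bucket only, every organization satisfies the no-overlap property stated above.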

System Architecture(5). Connection broker: a process that keeps track of all registered clients and INQUERY servers. It forwards each command to the appropriate servers, maintains intermediate results, merges each result with the other results, and sends the final result to the client.
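The broker's forward, merge, and reply steps might be sketched as follows. The function names and the server callables are hypothetical, and the sketch assumes that scores returned by different servers are directly comparable, which the slides do not state.

```python
import heapq


def merge_results(per_server_results, k):
    """Merge per-server lists of (score, doc_id) into one global top-k list."""
    all_hits = (hit for hits in per_server_results.values() for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])


def broker_search(query, servers, k=3):
    """Forward the query to each server, keep intermediate results, then merge."""
    # servers maps a server name to a callable standing in for a remote search.
    intermediate = {name: search(query) for name, search in servers.items()}
    return merge_results(intermediate, k)


# Hypothetical servers returning fixed (score, doc_id) hits.
servers = {
    "server1": lambda q: [(0.9, "a"), (0.4, "b")],
    "server2": lambda q: [(0.7, "c"), (0.2, "d")],
}

top = broker_search("collection selection", servers, k=3)
```

In a real deployment the forwarding would be concurrent network calls; the dictionary of intermediate results stands in for the state the broker keeps until every server has answered.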

System Architecture(6). Collection selector: chooses the most relevant collections from a set of collections on a query-by-query basis, and maintains a collection selection database with collection-level information for each collection.
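A rough sketch of query-by-query collection selection is shown below. It scores each collection by the fraction of its documents that contain each query term; this simple score is an assumption made for illustration and is not the inference-network ranking of Callan et al. The layout of `selection_db` and all names are hypothetical.

```python
def rank_collections(query_terms, selection_db):
    """Rank collection names by a simple document-frequency score."""
    scores = {}
    for name, stats in selection_db.items():
        n_docs = stats["num_docs"]
        doc_freq = stats["doc_freq"]
        # Sum, over query terms, the fraction of documents containing the term.
        scores[name] = sum(doc_freq.get(t, 0) / n_docs for t in query_terms)
    return sorted(scores, key=lambda name: scores[name], reverse=True)


def select_collections(query_terms, selection_db, top_n):
    """Keep only the top_n highest-ranked collections for this query."""
    return rank_collections(query_terms, selection_db)[:top_n]


# Hypothetical collection-level statistics.
selection_db = {
    "law": {"num_docs": 1000, "doc_freq": {"court": 600, "patent": 200}},
    "news": {"num_docs": 2000, "doc_freq": {"court": 200, "election": 900}},
    "science": {"num_docs": 1500, "doc_freq": {"patent": 300, "protein": 700}},
}

chosen = select_collections(["court", "patent"], selection_db, top_n=2)
```

Only the chosen collections would then receive the query, which is how selection reduces the work done per query.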

System Architecture(7). Replica selector: a portion of the original collection is replicated, which pays off if the same or related queries repeat. A partial replica is built for the whole collection; it is a subset of the original collection.
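One way to picture the routing decision is to compare a new query with the queries whose results were used to build the partial replica. The sketch below uses Jaccard term overlap and a fixed threshold; both are assumptions chosen for brevity, and the function names are hypothetical rather than the system's actual replica selection method.

```python
def jaccard(a, b):
    """Term-overlap similarity between two queries given as term lists."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def choose_target(query_terms, replicated_queries, threshold=0.5):
    """Route to the partial replica if the query resembles a replicated one."""
    best = max(
        (jaccard(query_terms, q) for q in replicated_queries),
        default=0.0,
    )
    return "replica" if best >= threshold else "original"


# Hypothetical queries whose results were copied into the partial replica.
replicated_queries = [["supreme", "court", "ruling"], ["patent", "law"]]

related = choose_target(["court", "ruling"], replicated_queries)
unrelated = choose_target(["protein", "folding"], replicated_queries)
```

Queries routed to the replica are served without touching the original collections, so repeated or related queries take load off the original servers.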

Collection Organization. Collection access skew: arises when queries are relevant to only a few collections and collection selection concentrates queries in these collections. It is modeled using a Zipf-like function: $Z(i) = c / i^{1-\theta}$, where $c = 1 / \sum_{j=1}^{C} \left(1 / j^{1-\theta}\right)$ and $1 \le i \le C$.
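The Zipf-like access model can be evaluated directly. In the sketch below the skew parameter is written as `theta` (the symbol did not survive in the transcript, so the name is an assumption), `num_collections` plays the role of C, and the function name is hypothetical.

```python
def zipf_like(num_collections, theta):
    """Return [Z(1), ..., Z(C)] with Z(i) = c / i**(1 - theta).

    c is chosen so that the access probabilities sum to 1.
    """
    weights = [1.0 / (i ** (1.0 - theta)) for i in range(1, num_collections + 1)]
    c = 1.0 / sum(weights)
    return [c * w for w in weights]


uniform = zipf_like(8, theta=1.0)  # exponent 0: every collection equally likely
skewed = zipf_like(8, theta=0.0)   # exponent 1: classic Zipf, Z(i) proportional to 1/i
```

Under this form, `theta = 1` gives uniform access across collections and `theta = 0` gives the most skewed case, in which the first collection is accessed twice as often as the second.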

Query Locality. If users repeatedly issue queries on the same topics, a set of documents receives more hits, which results in query locality. Partial replication offloads service from the original collections. Correlation with collection access skew: if query locality is low, collections are accessed uniformly; if query locality is high, collection access may range from uniform to highly skewed.

Experiments(1). Demonstrate the performance impact of collection organization and query locality. 256 GB of data on 9 servers: 8 servers store the original collections; the 9th server stores a 32 GB partial replica or is used to partition the data further. The collection selector and the replica selector are included in the connection broker.

Experiments(2) Random Organization(1) Randomly partition data over collections

Experiments(3) Random Organization(2) Randomly partition data over collections

Experiments(4) Source Organization Collections are organized by source

Experiments(5) Source Organization Collections are organized by source

Experiments(6) Topic Organization Collections are organized by topic

Experiments(7) Topic Organization Collections are organized by topic

Conclusions. The paper examines the effect of query locality and collection organization on the design and performance of IR systems. Collection selection improves performance significantly when either collection access is fairly uniform or collections are organized by topic. Query locality enables partial replication to improve performance over collection selection with partitioning.