The Effect of Collection Organization and Query Locality on IR Performance 2003/07/28 Park,


1 The Effect of Collection Organization and Query Locality on IR Performance. 2003/07/28. Park, Dae-Won (bluepepe@pusan.ac.kr)

2 Contents: Introduction of Distributed IR; Related Works; System Architecture; Query Locality; Experiments; Conclusions

3 Introduction(1) Distributed IR System. Research objective: maintain and improve retrieval performance as content grows (decrease query response time, maintain effectiveness). Related research: caching (Martin and Russell, 1991; Markatos, 1999); collection selection (Voorhees, 1995; Callan, 1995; French, 1999; Xu and Croft, 1999); partial replication (Lu and McKinley, 1999).

4 Introduction(2) In this paper: Use previous work (collection selection, partial replication). Use collection organization to determine when and how to use collection selection and replication. Classify collection organizations as by topic, by source, or random.

5 Related Works: Architecture. IR versus database systems: unstructured data versus structured data. IR versus the web: static collections (case law, journal articles, ...). Caching. Collection selection.

6 Related Works : Architecture(1) Architectures for parallel and distributed IR. Harman et al., 1991: show the feasibility of a distributed IR system by developing a prototype architecture. Burkowski, 1990; Burkowski et al., 1995: simulation study that measures the retrieval performance of a distributed IR system, with two strategies for distributing a fixed workload: (1) distribute the text collection equally; (2) split the servers into a query evaluation group and a document retrieval group.

7 Related Works : Architecture(2) Couvreur et al., 1994: analyze the performance and cost factors of three different hardware architectures. Hawking, 1997: design and implement a parallel IR system, PADRE97, on a collection of workstations; a central process checks user commands, broadcasts them to the IR engines, and merges the results.

8 Related Works : Architecture(3) Cahoon and McKinley, 1996; Cahoon, 1999: distributed IR system based on INQUERY. Collections are uniformly distributed, up to 128 GB, using a variety of workloads. Measure performance as a function of system parameters such as client command rate, number of document collections, ...

9 Related Works : Caching Markatos, 1999: caches web queries and their results. Requires an exact match (a drawback). Increase locality by determining query similarity to replicas.

10 Related Works : Collection selection(1) Work on how to select the most relevant collections for a given query. Danzig et al., 1991: use a hierarchy of brokers to maintain indices; support Boolean keyword matching. Voorhees et al., 1995: exploit similarity between a new query and relevance judgments for previous queries.

11 Related Works : Collection selection(2) Callan et al., 1995: adapt the document inference network to ranking collections by replacing the document node with the collection node; store the collection ranking inference network with document frequencies and term frequencies. Xu and Croft, 1999: propose a cluster-based language model for collection selection; apply clustering algorithms to organize documents into collections based on topics, and then apply the approach of Callan et al., 1995 to select the most relevant collections.

12 System Architecture(1) Architecture for a distributed information retrieval system based on INQUERY. [Diagram: Clients 1 to m; Connection Broker; INQUERY Servers 1 to n; Collections]

13 System Architecture(2) Using a collection selector. [Diagram: Clients 1 to m; Connection Broker; Collection Selector; INQUERY Servers 1 to n; Collections]

14 System Architecture(3) Using a replica selector and a collection selector. [Diagram: Clients 1 to m; Connection Broker; Replica Selector; Collection Selector; INQUERY Servers 1 to k (Original Collections); INQUERY Servers k+1 to p (Replica 1); ...; INQUERY Server n (Replica q)]

15 System Architecture(4) Collection: a set of documents; no overlap between documents in any two collections; organized either by topic, by source (for example, newspapers, journals, ...), or randomly. Connection broker: a process that keeps track of all registered clients and INQUERY servers.

16 System Architecture(5) Connection broker: a process that keeps track of all registered clients and INQUERY servers. Forwards commands to the appropriate servers. Maintains intermediate results, merges each result with the other results, and sends the final result to the client.
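
The merge step of the connection broker can be sketched as follows. This is a minimal illustration, not the INQUERY implementation: the function name `merge_results`, the `(score, doc_id)` tuple layout, and the assumption that scores from different servers are directly comparable are all hypothetical.

```python
import heapq
from itertools import islice


def merge_results(per_server_results, top_k):
    """Merge ranked lists from several servers into one ranked list.

    Each input list holds (score, doc_id) tuples already sorted by
    descending score. Assumption: scores are comparable across servers.
    """
    # heapq.merge lazily merges pre-sorted inputs; reverse=True tells it
    # that the inputs are sorted from largest to smallest key.
    merged = heapq.merge(*per_server_results,
                         key=lambda hit: hit[0],
                         reverse=True)
    return list(islice(merged, top_k))


# Hypothetical intermediate results from two servers.
server_a = [(0.9, "a1"), (0.4, "a2")]
server_b = [(0.7, "b1"), (0.5, "b2")]
final_result = merge_results([server_a, server_b], top_k=3)
```

Because each server returns its list already ranked, the broker only needs a k-way merge rather than a full re-sort of all hits.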

17 System Architecture(6) Collection selector: chooses the most relevant collections from a set of collections on a query-by-query basis; maintains a collection selection database with collection-level information for each collection.
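
A minimal sketch of query-by-query collection selection over a collection selection database. The scoring rule here (sum of per-term document-frequency ratios) is a deliberately simple stand-in chosen for illustration, not the inference-network ranking of Callan et al.; the names `select_collections`, `selection_db`, `num_docs`, and `doc_freqs` are assumptions.

```python
def select_collections(query_terms, selection_db, top_n):
    """Rank collections for one query using collection-level statistics.

    selection_db maps a collection name to
    {"num_docs": int, "doc_freqs": {term: documents containing term}}.
    """
    scores = {}
    for name, stats in selection_db.items():
        doc_freqs = stats["doc_freqs"]
        num_docs = stats["num_docs"]
        # Fraction of the collection's documents containing each query term.
        scores[name] = sum(doc_freqs.get(term, 0) / num_docs
                           for term in query_terms)
    # Highest score first; break ties by name for a deterministic order.
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:top_n]


# Hypothetical collection selection database with three collections.
selection_db = {
    "news": {"num_docs": 100, "doc_freqs": {"election": 40, "vote": 30}},
    "sports": {"num_docs": 100, "doc_freqs": {"goal": 50, "vote": 5}},
    "law": {"num_docs": 50, "doc_freqs": {"court": 25, "election": 5}},
}
chosen = select_collections(["election", "vote"], selection_db, top_n=2)
```

Only the selected collections would then receive the query, which is what reduces the work per query compared with searching every collection.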

18 System Architecture(7) Replica selector: replicate a portion of the original collection (if the same or related queries repeat). Build a partial replica for the whole collection: a subset of the original collection.

19 Collection Organization Collection access skew: occurs when queries are relevant to a few collections and collection selection concentrates queries in these collections. Modeled using a Zipf-like function: $Z(i) = c / i^{1-\theta}$, where $c = 1 / \sum_{j=1}^{C} (1 / j^{1-\theta})$, $1 \le i \le C$.
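
A minimal sketch of the Zipf-like access model, reading the formula as Z(i) = c / i^(1 - theta), with c chosen so that the C values sum to 1. The parameter name `theta` (the symbol is lost in the transcript) and the function name are assumptions made for illustration.

```python
def zipf_access_probabilities(num_collections, theta):
    """Return [Z(1), ..., Z(C)] for Z(i) = c / i**(1 - theta)."""
    if num_collections < 1:
        raise ValueError("num_collections must be at least 1")
    exponent = 1.0 - theta
    # Unnormalized weights 1 / i**(1 - theta) for i = 1..C.
    weights = [1.0 / (i ** exponent) for i in range(1, num_collections + 1)]
    # c = 1 / sum_j (1 / j**(1 - theta)) normalizes the weights.
    c = 1.0 / sum(weights)
    return [c * w for w in weights]


# theta = 1 gives exponent 0, so every collection is equally likely.
uniform = zipf_access_probabilities(8, theta=1.0)
# theta = 0 gives the classic Zipf shape: Z(i) proportional to 1 / i.
skewed = zipf_access_probabilities(8, theta=0.0)
```

Under this reading, moving `theta` from 1 toward 0 shifts collection access from uniform to increasingly skewed toward the first few collections.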

20 Query Locality If users repeatedly issue queries on the same topics, a set of documents will receive more hits, which results in query locality. Partial replication offloads service from the original collections. Correlation with collection access skew: if query locality is low, collections are accessed uniformly; if query locality is high, collection access may range from uniform to highly skewed.
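
One way to picture how query locality feeds partial replication: count how often each document appears in past result lists and keep the most frequently hit documents as the replica. This is a hypothetical sketch, not the method used in the presented work; the function name, the shape of `result_log`, and the `max_docs` budget are assumptions.

```python
from collections import Counter


def build_partial_replica(result_log, max_docs):
    """Pick the most frequently hit documents as a partial replica.

    result_log is a list of result lists, one per past query, each
    holding the document ids returned for that query.
    """
    hits = Counter(doc_id for results in result_log for doc_id in results)
    # most_common returns (doc_id, count) pairs, highest count first.
    return {doc_id for doc_id, _count in hits.most_common(max_docs)}


# Hypothetical log: three queries on overlapping topics.
result_log = [["d1", "d2", "d3"], ["d1", "d2"], ["d1", "d4"]]
replica = build_partial_replica(result_log, max_docs=2)
```

With high query locality a small replica covers a large share of hits; with low locality the hit counts flatten out and the replica absorbs little load.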

21 Experiments(1) Demonstrate the performance impact of collection organization and query locality. 256 GB of data using 9 servers: 8 servers store the original collections; the 9th server stores a 32 GB partial replica or is used to partition the data further. The collection selector and replica selector are included in the connection broker.

22 Experiments(2) Random Organization(1) Randomly partition data over collections

23 Experiments(3) Random Organization(2) Randomly partition data over collections

24 Experiments(4) Source Organization Collections are organized by source

25 Experiments(5) Source Organization Collections are organized by source

26 Experiments(6) Topic Organization Collections are organized by topic

27 Experiments(7) Topic Organization Collections are organized by topic

28 Conclusions Effect of query locality and collection organization on the design and performance of an IR system. Collection selection improves performance significantly when either collection access is fairly uniform or collections are organized by topic. Query locality enables partial replication to improve performance over collection selection with partitioning.

