


1 A Distributed Indexing Strategy for Efficient XML Retrieval
Efficiency Issues in Information Retrieval Workshop (EIIR) at the 30th European Conference on Information Retrieval (ECIR), Glasgow, GB, March/April 2008
Judith Winter
Institute for Informatics / Telematics Group
J. W. Goethe-University, Frankfurt am Main, Germany

2 Overview
1. Introduction
2. A search engine for XML IR in P2P
3. Indexing techniques
4. Outlook on current implementation
5. Questions and discussion

1. Introduction

3 XML Information Retrieval in Peer-to-Peer Systems
[Diagram combining three areas: XML-Retrieval, Information Retrieval, Peer-to-Peer]
Diagram labels:
- structured documents
- more precise search
- based on c/s architectures
- distributed autonomous peers
- growing amount of XML documents
- vague queries
- relevance ranking
Challenges:
- bandwidth consumption / communication overhead
- only selected information available

4 System characteristics
- Queries: content-and-structure (CAS)
- Indexing:
  - include structure
  - fixed limit for posting list sizes; pre-computing of posting lists for popular term combinations → highly discriminative keys (HDKs)
  - hybrid indexing: globally or locally (distributing summaries), depending on peer status
  - pruning posting lists by considering structural information
- Ranking: extended vector space model
- Results/retrieval units: document or passage retrieval

5 [Architecture diagram; recoverable labels only]
- Layers: Information Retrieval, Peer-to-Peer, Application
- Graphical User Interface: indexing; querying & result presentation
- Indexing component, Retrieval component, Ranking component
- Index storage component: local index, distributed index
- Indexes shown: Document index, Retrieval unit index, Frequent XTerm index, HDK index
- P2P component: variant of DHT algorithm (Kademlia/Chord)
- P2P network: Document index, HDK index, frequencies, Retrieval unit index
- File system: local documents
- Data flows: documents d_n; query q; results for q; term statistics for retrieval units(d)

6 HDK-based indexing
- Use of XTerms: (content, structure) tuples
- Rare tuple combinations: Highly Discriminative Keys (HDKs)
- Over 80% multi-term queries → precomputed key combinations
- If a key is frequent (frequency exceeds threshold): combine it with other frequent keys of the same window (e.g. same XML element)
Example HDK-based indexing:
  apple   \book\chapter   →   dok1(14.5), dok2(12.4)
          \magazine\p     →   dok2(5.3), dok3(2.7), dok4(0.7)
  chips   \book           →   dok4(18.4), dok1(2.3), dok2(2.1), dok3(1.5)
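The key-combination rule on this slide could be sketched roughly as follows. This is only an illustration, not the SPIRIX implementation: the function name `candidate_keys`, the use of window counts as the frequency measure, and the restriction to pairs are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def candidate_keys(windows, threshold):
    """Split XTerms into rare single keys and pair keys built from frequent XTerms.

    windows:   list of windows, each a list of (content, structure) XTerms,
               e.g. all XTerms found in one XML element.
    threshold: assumed cut-off; an XTerm occurring in more windows than this
               counts as frequent.
    """
    # Frequency proxy (an assumption): number of windows containing the XTerm.
    freq = Counter(x for w in windows for x in set(w))

    # Rare XTerms are discriminative enough to serve as keys on their own.
    single_keys = {x for x, n in freq.items() if n <= threshold}

    # Frequent XTerms are combined with other frequent XTerms of the same window.
    pair_keys = set()
    for w in windows:
        frequent = sorted(x for x in set(w) if freq[x] > threshold)
        pair_keys.update(combinations(frequent, 2))
    return single_keys, pair_keys


windows = [
    [("apple", "\\book"), ("chips", "\\book")],
    [("apple", "\\book"), ("chips", "\\book")],
    [("apple", "\\book"), ("pear", "\\book")],
]
single_keys, pair_keys = candidate_keys(windows, threshold=1)
```

A fuller scheme would presumably also check whether a combined key is itself still frequent and expand it further; that step is left out here.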

7 Pruning posting lists (FrequentXTermIndex)
- Entries sorted by score_t(d_i); choose the k best entries for XTerm t
- Score considers document d_i, best retrieval unit ru_best, and peer p_i
- Weighting function w: BM25f-based
- PeerScore: high for peers with good collections regarding t and with good performance metrics
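A minimal sketch of top-k pruning for one XTerm's posting list is given below. The slide does not state how the document weight, the best retrieval unit's weight, and the PeerScore are combined, so the formula in `score` is purely hypothetical, as are the class and field names.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    doc_id: str
    doc_weight: float       # weight of XTerm t in document d_i (e.g. BM25f-based)
    best_ru_weight: float   # weight of t in the best retrieval unit of d_i
    peer_score: float       # quality of the peer holding d_i


def score(p: Posting) -> float:
    # Hypothetical combination; the original formula is not in the transcript.
    return (p.doc_weight + p.best_ru_weight) * p.peer_score


def prune(postings, k):
    """Keep only the k best-scoring postings, sorted by descending score."""
    return heapq.nlargest(k, postings, key=score)


postings = [
    Posting("dok1", 10.0, 4.0, 1.0),
    Posting("dok2", 12.0, 5.0, 0.5),
    Posting("dok3", 3.0, 1.0, 1.0),
]
top2 = prune(postings, k=2)
```

With these sample values, `dok2` has the highest raw weights but drops behind `dok1` because of its lower peer score.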

8 Hybrid indexing
Indexing depends on the status of the peer:
- Exhaustive indexing: per document
- Quick indexing: per peer (summaries, e.g. tf per peer)
Peer status considers:
- Response times
- Available bandwidth
- Open IP address (vs. NAT-bound)
- Latency
- CPU/Memory
- …
- Online time (65% of the peers joined the system only once, >20% of all connections lasted <1 minute, 60% of the peers stayed active <10 min)
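The choice between exhaustive and quick indexing might look roughly like the sketch below. The `PeerStatus` fields mirror some of the criteria listed on the slide, but the thresholds in `is_reliable` and the shape of the index entries are invented for illustration only.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PeerStatus:
    avg_response_ms: float
    bandwidth_kbps: float
    has_open_ip: bool       # False if the peer is NAT-bound
    online_minutes: float


def is_reliable(s: PeerStatus) -> bool:
    # Illustrative thresholds; the slides give no concrete values.
    return (s.has_open_ip and s.avg_response_ms <= 500
            and s.bandwidth_kbps >= 256 and s.online_minutes >= 10)


def index_entries(peer_id, docs, status):
    """Return (term, target, tf) entries for a peer's local documents.

    docs maps a document id to its list of terms.
    """
    if is_reliable(status):
        # Exhaustive indexing: one entry per (term, document).
        return [(term, doc_id, tf)
                for doc_id, terms in docs.items()
                for term, tf in Counter(terms).items()]
    # Quick indexing: one summary entry per (term, peer).
    totals = Counter(t for terms in docs.values() for t in terms)
    return [(term, peer_id, tf) for term, tf in totals.items()]


docs = {"d1": ["apple", "apple", "chips"], "d2": ["apple"]}
good = PeerStatus(120, 1024, True, 90)
weak = PeerStatus(120, 1024, False, 3)
exhaustive = index_entries("p1", docs, good)
quick = index_entries("p1", docs, weak)
```

The quick variant publishes far fewer entries for short-lived or NAT-bound peers, which is the point of the hybrid scheme given the churn figures above.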

9 Outlook on current implementation
Implementation of SPIRIX: Search Engine for P2P Information Retrieval in XML-Documents
- Indexing: based on Terrier (centralized approach for text documents, University of Glasgow)
- P2P-complex:
  - based on Kademlia/Chord
  - collects peer characteristics
  - adapted to special requirements of XML IR
- Ranking:
  - extension of the vector space model
  - BM25f-based weighting

10 A Distributed Indexing Strategy for Efficient XML Retrieval
1. Introduction
2. Architecture for XML IR in P2P
3. Indexing techniques
4. Outlook on current implementation
5. Questions and discussion

