G. Skobeltsyn | Query-Driven Indexing for P2P Text Retrieval. The Future of Web Search, 19.07.2007, Bertinoro

Presentation transcript:

Query-Driven Indexing for P2P Text Retrieval
The Future of Web Search, Bertinoro, Italy
Gleb Skobeltsyn, EPFL, Switzerland
June 19, 2007
Joint work with: Toan Luu, Ivana Podnar Žarko, Martin Rajman, Karl Aberer
Alvis

Goal
Our goal is to achieve scalable full-text retrieval with structured P2P networks (DHTs).
Each peer:
– Provides resources (bandwidth, storage)
– Searches the whole network
– Publishes its own documents

Naïve (single-term) approach
The naïve approach is to distribute the global inverted index in a DHT using term partitioning.
[Figure: DHT peers storing h(“epfl”) → {d1, d2}, h(“gleb”) → {d2, d3} and h(t’) → {d4, d5}. For the query “epfl & gleb”, the posting list {d1, d2} is transferred and the result is {d2}.]
This slide was borrowed from the presentation by B. T. Loo, J. M. Hellerstein, R. Huebsch, S. Shenker, I. Stoica: Enhancing P2P File-Sharing with an Internet-Scale Query Processor.
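The term-partitioning idea can be illustrated with a minimal in-memory sketch. It is not the system's implementation: the `Network` class, `responsible_peer`, and the peer count are hypothetical names chosen for illustration, and a simple hash modulo the number of peers stands in for a real DHT lookup.

```python
import hashlib
from collections import defaultdict

NUM_PEERS = 4  # hypothetical network size, for illustration only


def responsible_peer(term: str, num_peers: int = NUM_PEERS) -> int:
    """Stand-in for the DHT lookup h(term): hash the term onto a peer id."""
    digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_peers


class Network:
    """Toy model: each peer stores the posting lists of the terms hashed to it."""

    def __init__(self, num_peers: int = NUM_PEERS):
        self.num_peers = num_peers
        self.peers = [defaultdict(set) for _ in range(num_peers)]

    def publish(self, doc_id: str, terms) -> None:
        # Every distinct term of the document is sent to its responsible peer.
        for term in set(terms):
            self.peers[responsible_peer(term, self.num_peers)][term].add(doc_id)

    def posting_list(self, term: str) -> set:
        peer = self.peers[responsible_peer(term, self.num_peers)]
        return set(peer.get(term, set()))  # copy, as if shipped over the network

    def query_and(self, terms) -> set:
        # Fetch each term's posting list and intersect, shortest list first.
        lists = sorted((self.posting_list(t) for t in terms), key=len)
        if not lists:
            return set()
        result = lists[0]
        for postings in lists[1:]:
            result &= postings
        return result


net = Network()
net.publish("d1", ["epfl"])
net.publish("d2", ["epfl", "gleb"])
net.publish("d3", ["gleb"])
print(net.query_and(["epfl", "gleb"]))  # {'d2'}
```

The cost that motivates the rest of the talk is visible in `query_and`: whole posting lists have to travel between peers before they can be intersected.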

Single-term vs. multi-term P2P indexing
How to choose keys to keep a satisfactory retrieval quality? The vocabulary size could grow exponentially!

Multi-term indexing: framework
– Each peer is responsible for a set of keys assigned by the underlying DHT using the standard hashing mechanism
– Each key corresponds to a term or a set of terms
– Each key is assigned a truncated posting list (TPL) that stores at most DF_max top-ranked document references
– The distributed index contains {key, TPL} pairs
– The indexing load is handled by an optimized DHT layer: F. Klemm, J.-Y. Le Boudec, D. Kostic, K. Aberer: Improving the Throughput of Distributed Hash Tables Using Congestion-Aware Routing, in IPTPS'07
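A truncated posting list can be sketched as a bounded min-heap that keeps only the DF_max best-scored references. This is an illustrative sketch, not the authors' data structure; the class name and the `truncated` flag are assumptions, and duplicate document ids are not handled.

```python
import heapq


class TruncatedPostingList:
    """Keeps at most df_max document references with the highest scores."""

    def __init__(self, df_max: int):
        if df_max < 1:
            raise ValueError("df_max must be positive")
        self.df_max = df_max
        self._heap = []         # min-heap of (score, doc_id); weakest entry on top
        self.truncated = False  # True once at least one reference was dropped

    def add(self, doc_id: str, score: float) -> None:
        if len(self._heap) < self.df_max:
            heapq.heappush(self._heap, (score, doc_id))
        else:
            # Push the new entry, then evict the lowest-scored one.
            self.truncated = True
            heapq.heappushpop(self._heap, (score, doc_id))

    def top(self) -> list:
        """Document ids ordered from best to worst score."""
        return [doc_id for _, doc_id in sorted(self._heap, reverse=True)]


tpl = TruncatedPostingList(df_max=2)
tpl.add("d1", 0.5)
tpl.add("d2", 0.9)
tpl.add("d3", 0.7)
print(tpl.top(), tpl.truncated)  # ['d2', 'd3'] True
```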


Multi-term indexing techniques
Indexing with Highly Discriminative Keys (HDKs), based on:
– Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys. I. Podnar, M. Rajman, T. Luu, F. Klemm, K. Aberer, in ICDE’07
– Beyond term indexing: A P2P framework for Web information retrieval. I. Podnar, M. Rajman, T. Luu, F. Klemm, K. Aberer, Informatica, vol. 30, no. 2
Query-Driven Indexing (QDI), based on:
– Web Text Retrieval with a P2P Query-Driven Index. G. Skobeltsyn, T. Luu, I. Podnar Žarko, M. Rajman, K. Aberer, in SIGIR’07
– Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval. G. Skobeltsyn, T. Luu, I. Podnar Žarko, M. Rajman, K. Aberer, in Infoscale’07

Indexing with HDK
Data-driven key generation:
– Each time a new document is indexed, the posting list of some key k can reach the maximum size DF_max
– This triggers the generation of new keys (k + other frequent keys)
Use a number of filters to reduce the number of keys, e.g.:
– Proximity filter: a document qualifies for a key t1&t2 if t1 is close to t2 (as specified by a window size w)
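The proximity filter can be sketched as follows. The slide only says that t1 must be "close" to t2 within a window w; the sketch assumes this means that some occurrence of t1 lies at most w token positions away from some occurrence of t2. The function name is hypothetical.

```python
def qualifies_for_key(tokens, t1: str, t2: str, w: int) -> bool:
    """Assumed proximity test: some t1 and some t2 are at most w positions apart."""
    pos1 = [i for i, tok in enumerate(tokens) if tok == t1]
    pos2 = [i for i, tok in enumerate(tokens) if tok == t2]
    return any(abs(i - j) <= w for i in pos1 for j in pos2)


doc = "query driven indexing for peer to peer text retrieval".split()
print(qualifies_for_key(doc, "query", "indexing", w=2))   # True
print(qualifies_for_key(doc, "query", "retrieval", w=3))  # False
```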

Indexing with HDK
Pros:
– The ICDE’07 paper proves that the number of keys grows linearly
– Elegant key generation mechanism
– Low bandwidth during query processing (posting lists of limited size)
Cons:
– In practice the number of keys is LARGE: 68M for 0.6M docs
– High bandwidth consumption at indexing
Problem:
– Too many keys are superfluous (almost never used)

Query-Driven Indexing
Let's index only what is queried!

Contents
Introduction
Single-term vs. multi-term indexing
HDK approach for indexing
Query-driven approach for indexing/retrieval
– Indexing structure
– Example
– Scalability
– Evaluation
Conclusion

Query-Driven Index (QDI)
The Query-Driven Indexing strategy solves the “Too-Many-Keys” problem:
– Avoids maintenance of superfluous keys
– Generates only those keys that are requested by users
– Uses the query log to discover such keys
Problems:
– Indexing a new key requires a bandwidth-efficient mechanism to obtain the top-k posting list associated with the key: Smart Broadcast (ONM), or a conventional intersection like TA, but performed less frequently
– An incomplete index degrades the quality of query results: we show that the degradation is low

Which keys to index?
Each single term found in the document collection has to be indexed.
– We call the set of all single-term keys the basic single-term index.
– The posting lists are truncated at DF_max.
A key k is non-superfluous and can be activated iff:
– k is popular: QF(k) ≥ QF_min, where QF(k) is the popularity of the key k derived from the available query log and QF_min is a parameter of our model (popularity filter).
– k contains from 2 to s_max terms: 2 ≤ |k| ≤ s_max, where s_max is a parameter of our model (size filter).
– all immediate sub-keys of k (of size |k| − 1) are indexed and their associated posting lists are truncated (redundancy filter).
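The three filters combine into a single activation test, sketched below. The representation is an assumption made for illustration: keys are frozensets of terms, `qf` maps a key to its query frequency, and `truncated` maps every indexed key to a flag saying whether its posting list was cut at DF_max.

```python
from itertools import combinations


def immediate_subkeys(key: frozenset) -> list:
    """All sub-keys obtained by dropping exactly one term (size |k| - 1)."""
    return [frozenset(c) for c in combinations(sorted(key), len(key) - 1)]


def can_activate(key, qf: dict, truncated: dict, qf_min: int, s_max: int) -> bool:
    key = frozenset(key)
    if not 2 <= len(key) <= s_max:   # size filter
        return False
    if qf.get(key, 0) < qf_min:      # popularity filter
        return False
    # Redundancy filter: every immediate sub-key is indexed AND truncated.
    return all(truncated.get(sub, False) for sub in immediate_subkeys(key))


fs = frozenset
# Single-term index: the posting list of "b" is short, so it is not truncated.
truncated = {fs(["a"]): True, fs(["b"]): False, fs(["c"]): True}
qf = {fs(["a", "c"]): 5, fs(["a", "b"]): 5, fs(["a", "b", "c"]): 5}

print(can_activate(["a", "c"], qf, truncated, qf_min=3, s_max=3))       # True
print(can_activate(["a", "b"], qf, truncated, qf_min=3, s_max=3))       # False
print(can_activate(["a", "b", "c"], qf, truncated, qf_min=3, s_max=3))  # False
```

With a non-truncated list for b, no key containing b can be activated, because the complete answer for such a key is already reachable through the list for b.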

QDI: Retrieval
[Figure: key lattice for the query abc: abc; ab, bc, ac; a, b, c.]
Only the single-term index is generated. Process the query abc:
1) Probe P_abc (nothing)
2) Probe P_ab, P_bc and P_ac (nothing)
3) Probe P_a, P_b and P_c
4) Obtain the top-DF_max results for a, b and c (ranked w.r.t. a, b and c respectively)
5) Contact the peers in the list and re-rank the obtained results w.r.t. abc
6) Output the top-10
Increment the QF for ab, bc and ac (+1).
Activate (index) ac once it is popular.

QDI: Retrieval 2
Assume the frequency of b is below DF_max.
Note how the redundancy filter simplifies the lattice in such a case (grayed nodes cannot be activated).
[Figure: key lattice for the query abc (abc; ab, bc, ac; a, b, c) with the DF_max threshold marked; the grayed nodes appear to be abc, ab and bc.]

QDI: Retrieval 3
[Figure: key lattice for the query abc: abc; ab, bc, ac; a, b, c.]
The single-term index is generated and ac is indexed. Process the query abc:
1) Probe P_abc (nothing)
2) Probe P_ab, P_bc and P_ac: obtain the result for ac
3) Probe P_b and obtain the result for b
4) Contact all peers in the list to re-rank the obtained results w.r.t. abc
5) Output the top-10
Increment the QF for ab, bc and ac.

Indexing on-demand
… used to activate a new multi-term key.
ONM is a “smart” broadcast with the following features:
– It is based on the shower multicast [2]: each peer within a specified range is contacted only once
– Notifications are small and low-priority => piggybacking
– The broadcast is split into several multicast sessions, each time pruning low-score documents
– It uses the high-performance DHT layer [3]
[2] A. Datta, M. Hauswirth, R. Schmidt, R. John, K. Aberer: Range Queries in Tree-Structured Overlays, in P2P’05
[3] F. Klemm, J.-Y. Le Boudec, D. Kostic, K. Aberer: Improving the Throughput of Distributed Hash Tables Using Congestion-Aware Routing, in IPTPS'07

Scalability
The retrieval traffic is bounded by a constant due to truncated posting lists (the constant depends on DF_max and the query size).
The indexing traffic depends on the number of keys to be activated:
– The number of keys in the HDK approach (an UPPER BOUND) is proven to grow linearly with the number of peers, if each peer provides a limited number of documents
– The number of keys does not depend on the document collection size but only on the size of the query log
– We can use the QF_min parameter to adjust the tradeoff between indexing traffic and retrieval quality
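To make the "bounded by a constant" claim concrete, here is a rough worst-case count. It assumes, for illustration only, that a query probes every term combination of size 1 to s_max and that every probe returns a full list of DF_max references; the actual traversal skips covered sub-keys, so real traffic is lower. The function names are hypothetical.

```python
from math import comb


def max_probes(query_size: int, s_max: int) -> int:
    """Number of term combinations of size 1..s_max in a query."""
    return sum(comb(query_size, i) for i in range(1, min(query_size, s_max) + 1))


def max_postings_transferred(query_size: int, s_max: int, df_max: int) -> int:
    """Loose upper bound: every probe returns a full truncated posting list."""
    return max_probes(query_size, s_max) * df_max


print(max_probes(3, 3))                     # 7
print(max_postings_transferred(3, 3, 100))  # 700
```

The bound depends only on the query size, s_max and DF_max, not on the number of documents in the collection.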


AOL logs
– 17M queries from March, April and May 2006 (92 days)
– 650K anonymous user sessions
– Extracted all unique queries from each user session (annotated on the slide as 0.7Gb)
Sample from the log (the leading part of each timestamp is missing from the transcript):
…
:50:30 wearthbow.com native.cheyenne origin
:50:30 l6 screensaver
:50:30 horses for sale in tn ky
:50:30 bank of america.com
:50:30 ask
:50:29 del rosa lanes
:50:28 airlines.com
:50:28 find holy women of the bible
:50:27 trains
:50:27 todaysmiricles
:50:27 constition
:50:26 german grocceries in las vegas nv
:50:25 porn
:50:25 northwest indiana
:50:24 united.eprize.net
:50:24 jessica laguna
…
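Deriving key popularity from such a log might look like the sketch below. The exact definition of QF(k) is not given in the slides, so the sketch assumes that QF(k) counts the unique (session, query) pairs whose term set contains k, with queries compared as lower-cased term sets. The record format and function name are hypothetical.

```python
from collections import Counter
from itertools import combinations


def query_frequencies(records, s_max: int = 3) -> Counter:
    """records: iterable of (session_id, query_string) pairs."""
    # Keep each query once per session; term order and case are ignored.
    unique = {(session, frozenset(query.lower().split())) for session, query in records}
    qf = Counter()
    for _, terms in unique:
        # Count every multi-term combination of size 2..s_max.
        for size in range(2, min(len(terms), s_max) + 1):
            for combo in combinations(sorted(terms), size):
                qf[frozenset(combo)] += 1
    return qf


log = [
    (1, "babe ruth"),
    (1, "babe ruth"),       # repeated within the same session: counted once
    (2, "babe ruth 1920"),
    (3, "trains"),          # single-term queries create no multi-term keys
]
qf = query_frequencies(log)
print(qf[frozenset(["babe", "ruth"])])  # 2
```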

[Figure: Distribution of combinations in the AOL logs]

TREC Experiment
– WT10G collection (~1.69M docs)
– 100 TREC queries (from TREC Web Track 9 & 10)
– Query statistics generated from 17M AOL queries
– Okapi BM25 weighting scheme used to compute the ranking score
– QF_min = 1, 3, 5, ∞
– DF_max = 100, 500
– s_max = 3
[Table: TREC precision at top-ranked pages. Columns: ST-BM25, then QF_min = ∞, 5, 3, 1 for DF_max = 100 and QF_min = ∞, 5, 3, 1 for DF_max = 500. The cell values are not preserved in this transcript.]
Precision is similar to centralized indexing.

Overlap experiment
– Use the query log to build the index (days 1..91)
– Randomly choose 2K test queries from day 92
– Answer each test query with Google and compare the result to the union of the top-DF_max Google results for each of its combinations that are indexed according to the logs
– This mimics our P2PIR system if Google’s ranking is used
[Figure: example showing an original query and its non-superfluous (indexed) combinations.]

Overlap example
Cut-and-paste from the simulation log (most qf values are missing from the transcript):
>id=481, q=“what did babe ruth do in the 1920”
“1920 babe ruth”, qf=0 ----> 100%
“1920 babe”, qf= ----> 9%
+++“1920 ruth”, qf= ----> 33%
+++“babe ruth”, qf= ----> 69%
---“1920”, qf= ----> 1%
---“babe”, qf= ----> 2%
---“ruth”, qf= ----> 7%
Size: 192, Keys used: 2, 94%
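The overlap measure used in this experiment can be sketched as the fraction of the reference results (the answer to the original query) that also appear in the union of the result lists of the indexed combinations. The function name, the input format, and the made-up document ids are assumptions; each per-key list is expected to be truncated to DF_max by the caller.

```python
def overlap(reference_top, indexed_results: dict) -> float:
    """Share of reference_top covered by the union of the per-key result lists."""
    if not reference_top:
        return 0.0
    union = set().union(*indexed_results.values())
    hits = sum(1 for doc in reference_top if doc in union)
    return hits / len(reference_top)


reference = ["u1", "u2", "u3", "u4"]  # top results for the original query
per_key = {
    "babe ruth": ["u1", "u2"],        # results for each indexed combination
    "1920 ruth": ["u2", "u5"],
}
print(overlap(reference, per_key))  # 0.5
```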

Google experiment: impact of s_max, DF_max
[Figure panels: impact of s_max for all possible combinations (QF_min = 0); impact of DF_max with QF_min = 1, s_max = 3]

Google experiment: impact of QF_min
[Figure panels: impact of QF_min (DF_max = 600); number of keys for different QF_min]
– The number of keys does not depend on the document collection size
– The HDK approach would require ~65M keys for 650K documents
– >30% of badly performing queries are misspellings => real quality is higher

Google experiment: impact of the log size
[Figure: impact of the log size (QF_min = 1, DF_max = 600)]

Conclusions
We presented the query-driven indexing strategy for scalable web text retrieval with structured P2P networks:
– Stores posting lists in a DHT for terms and term combinations
– Stores at most DF_max top document references in a posting list
– Efficiently collects the query statistics in a distributed fashion
– Based on these statistics, activates (indexes) only popular keys
– Computes the result of a multi-term query based only on the index entries available at the moment: no costly intersections
We also showed that:
– With real query logs our approach achieves good retrieval quality
– The QF_min parameter adjusts the traffic/quality tradeoff

Last slide
Thank you for your attention! Questions?
AlvisP2P - to appear in July at